# Google Play App's Review Data Scraper

> This application pulls data for one or more apps on the Google Play Store and saves it in CSV form.

The data collected includes:
- User reviews
- Star rating
- Day and date
- App information

## Setup

Install the packages required for data scraping and set up the imports:

In [1]:
!pip install -qq google-play-scraper
!pip install watermark

Defaulting to user installation because normal site-packages is not writeable


In [2]:
%reload_ext watermark
%watermark -v -p pandas,matplotlib,seaborn,google_play_scraper

Python implementation: CPython
Python version       : 3.10.0
IPython version      : 7.28.0

pandas             : 1.3.4
matplotlib         : 3.5.0
seaborn            : 0.11.2
google_play_scraper: 1.0.2



In [3]:
import json
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

from pygments import highlight
from pygments.lexers import JsonLexer
from pygments.formatters import TerminalFormatter

from google_play_scraper import Sort, reviews, app

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

List the package name of each app to scrape. More than one app can be added:

In [4]:
app_packages = [
  'id.co.bitcoin',
]
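
The package name is the value of the `id` query parameter in an app's Play Store URL. As a minimal sketch (the helper name `package_from_url` is our own, not part of any library), it can be extracted like this:

```python
from urllib.parse import urlparse, parse_qs


def package_from_url(url):
    """Return the `id` query parameter of a Play Store app URL (hypothetical helper)."""
    query = parse_qs(urlparse(url).query)
    ids = query.get("id")
    if not ids:
        raise ValueError(f"no 'id' parameter in URL: {url}")
    return ids[0]


print(package_from_url("https://play.google.com/store/apps/details?id=id.co.bitcoin&hl=id"))
# id.co.bitcoin
```

The returned string can be appended to `app_packages` as-is.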

## Scraping App Information

Scrape the information for each app:

In [5]:
app_infos = []

for ap in tqdm(app_packages):
  info = app(ap, lang='id', country='id')
  info.pop('comments', None)  # drop the bundled comments; no error if the key is absent
  app_infos.append(info)

100%|██████████| 1/1 [00:00<00:00,  2.08it/s]


Convert the scraped data to JSON for display:

In [6]:
def print_json(json_object):
  json_str = json.dumps(
    json_object, 
    indent=2, 
    sort_keys=True, 
    default=str
  )
  print(highlight(json_str, JsonLexer(), TerminalFormatter()))
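
The `default=str` argument matters because the scraped records can contain values, such as `datetime` objects, that `json.dumps` cannot serialize on its own. A small sketch with a made-up record (the values are illustrative, not scraped):

```python
import json
from datetime import datetime

# Illustrative record; real records come from the scraper
record = {"at": datetime(2021, 12, 19, 18, 43, 41), "score": 1}

try:
    json.dumps(record)
except TypeError as err:
    print("without default=str:", err)

# default=str is called for any object json cannot encode natively
print(json.dumps(record, default=str, sort_keys=True))
# {"at": "2021-12-19 18:43:41", "score": 1}
```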

The JSON data for the scraped app:

In [7]:
print_json(app_infos[0])

{
  "adSupported": null,
  "androidVersion": "5.0",
  "androidVersionText": "5.0 dan yang lebih tinggi",
  "appId": "id.co.bitcoin",
  "containsAds": false,
  "contentRating": "Rating 3+",
  "contentRatingDescription": null,
  "currency": "IDR",
  "description": "The biggest Indonesia's Crypto Assets marketplace within your hands!\r\n\r\nIndodax Official Mobile App!\r\n\r\nIndodax is the biggest Crypto Asset marketplace in Indonesia. We currently have more than 2 million members from Indonesia and all over the world. With the sophisticated technology, Indodax has improved its mobile application which eases all traders with a wide range of crypto assets.\r\n\r\nWith this we proudly present\r\n\r\n\u2605Buy and Sell

This contains lots of information including the number of ratings, number of reviews and number of ratings for each score (1 to 5). Let's ignore all of that and have a look at their beautiful icons:

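In [None]:
def format_title(title):
  # Keep the part before the first ':' (or '-' if there is no ':'), capped at 10 characters
  sep_index = title.find(':') if title.find(':') != -1 else title.find('-')
  if sep_index != -1:
    title = title[:sep_index]
  return title[:10]

# Grid of at most two rows; also works when only one app is listed
n_rows = 2 if len(app_infos) > 1 else 1
n_cols = -(-len(app_infos) // n_rows)  # ceiling division
fig, axs = plt.subplots(n_rows, n_cols, figsize=(14, 5), squeeze=False)

for ax in axs.flat:
  ax.axis('off')  # hide the axes of every cell, including unused ones

for ax, ai in zip(axs.flat, app_infos):
  # Reading from a URL is deprecated in Matplotlib 3.5; newer releases may need another loader
  img = plt.imread(ai['icon'])
  ax.imshow(img)
  ax.set_title(format_title(ai['title']))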

Store the app information by converting the JSON objects into a Pandas dataframe and saving the result to a CSV file:

In [10]:
app_infos_df = pd.DataFrame(app_infos)
app_infos_df.to_csv('apps.csv', index=False, header=True)

## Scraping App Reviews

In an ideal world, we would get all the reviews. But there are lots of them and we're scraping the data. That wouldn't be very polite. What should we do?

We want:

- Balanced dataset - roughly the same number of reviews for each score (1-5)
- A representative sample of the reviews for each app

We can satisfy the first requirement by using the scraping package's option to filter by review score. For the second, we'll sort the reviews by relevance (`Sort.MOST_RELEVANT`), which surfaces the reviews that Google Play considers most helpful. Just in case, we'll get a subset of the newest reviews, too:

In [11]:
app_reviews = []

for ap in tqdm(app_packages):
  for score in list(range(1, 6)):
    for sort_order in [Sort.MOST_RELEVANT, Sort.NEWEST]:
      rvs, _ = reviews(
        ap,
        lang='id',
        country='id',
        sort=sort_order,
        count=200 if score == 3 else 100,
        filter_score_with=score
      )
      for r in rvs:
        r['sortOrder'] = 'most_relevant' if sort_order == Sort.MOST_RELEVANT else 'newest'
        r['appId'] = ap
      app_reviews.extend(rvs)

100%|██████████| 1/1 [00:06<00:00,  6.18s/it]
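
To see how many reviews the loop above requests per app, we can redo the arithmetic without touching the network. This is only a sketch that mirrors the `count` expression and the two sort orders; the names `requested_count`, `per_score` and `total_per_app` are ours:

```python
SCORES = range(1, 6)
SORT_ORDERS = ("most_relevant", "newest")


def requested_count(score):
    # Same rule as the `count` argument in the scraping loop
    return 200 if score == 3 else 100


# Each score is requested once per sort order
per_score = {score: requested_count(score) * len(SORT_ORDERS) for score in SCORES}
total_per_app = sum(per_score.values())

print(per_score)      # {1: 200, 2: 200, 3: 400, 4: 200, 5: 200}
print(total_per_app)  # 1200
```

These are upper bounds: an app with few reviews for a given score will return fewer.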


Note that we're adding the app id and sort order to each review. Here's an example for one:

In [12]:
print_json(app_reviews[0])

{
  "appId": "id.co.bitcoin",
  "at": "2021-12-19 18:43:41",
  "content": "Harga eminer kemarin di beberapa situs terkenal sampai 1000%, di indodax malah ga bergerak sama sekali, bikin jadi bertanya2, apakah ada permainan dari pihak indodax, bisa di bayangkan berapa uang yg harusnya di perileh oleh trader,,,perlu dipertanyakan lagi tentang kejujuran situs ini...",
  "repliedAt": "2021-12-19 19:07:59",
  "replyContent": "Dear member Indodax. Sisi kami tidak dapat mempercepat transaksi tersebut dikarenakan naik turunnya harga aset kripto mutlak ditentukan oleh permintaan dan penawaran yang ada. Anda dapat menunggu hingga harga memasuki antrian pada order book. Jika Anda mengalami kendala dalam bertransaksi, mohon langsung hubungi layanan support kami. Terima kasih.",
  "reviewCreatedVersion": "4.2.4"

`repliedAt` and `replyContent` contain the developer response to the review. Of course, they can be missing.
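
Because the reply can be missing, it helps to handle both cases explicitly. A minimal sketch, assuming a missing reply shows up as `None` (or an absent key) in `replyContent`; the sample records and the `split_by_reply` helper are made up for illustration:

```python
def split_by_reply(review_list):
    """Split reviews into those with and without a developer reply (hypothetical helper)."""
    replied = [r for r in review_list if r.get("replyContent")]
    unreplied = [r for r in review_list if not r.get("replyContent")]
    return replied, unreplied


# Illustrative records, not scraped data
sample = [
    {"reviewId": "a", "replyContent": "Dear member ...", "repliedAt": "2021-12-19 19:07:59"},
    {"reviewId": "b", "replyContent": None, "repliedAt": None},
]

replied, unreplied = split_by_reply(sample)
print(len(replied), len(unreplied))  # 1 1
```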

How many app reviews did we get?



In [13]:
len(app_reviews)

1200
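
Since each score is fetched under two sort orders, the same review may appear twice in `app_reviews`. A hedged sketch of removing such duplicates, assuming every review dict carries a unique `reviewId` key (the sample records and the `dedupe_reviews` helper are illustrative):

```python
def dedupe_reviews(review_list):
    """Keep the first occurrence of each reviewId, preserving order (hypothetical helper)."""
    seen = {}
    for review in review_list:
        seen.setdefault(review["reviewId"], review)
    return list(seen.values())


# Illustrative records: review "a" was returned by both sort orders
sample = [
    {"reviewId": "a", "sortOrder": "most_relevant"},
    {"reviewId": "b", "sortOrder": "most_relevant"},
    {"reviewId": "a", "sortOrder": "newest"},
]

unique = dedupe_reviews(sample)
print(len(unique))  # 2
```

Keeping the first occurrence means a review found under both orders retains the `most_relevant` label, because that order is scraped first.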

Let's save the reviews to a CSV file:

In [19]:
app_reviews_df = pd.DataFrame(app_reviews)
app_reviews_df.to_csv('../output/Hasil Scraping.csv', index=None, header=True)
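
Writing to `../output/` fails if that directory does not exist yet. A small sketch of creating it first; `ensure_parent_dir` is a hypothetical helper, and the demo uses a temporary directory so it does not touch the real output path:

```python
import tempfile
from pathlib import Path


def ensure_parent_dir(path):
    """Create the parent directory of `path` if needed and return it as a Path."""
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path


# Demo inside a throwaway directory
with tempfile.TemporaryDirectory() as tmp:
    target = ensure_parent_dir(Path(tmp) / "output" / "Hasil Scraping.csv")
    print(target.parent.is_dir())  # True
```

In the notebook, the returned path could be passed straight to `to_csv`, for example `app_reviews_df.to_csv(ensure_parent_dir('../output/Hasil Scraping.csv'), index=False)`.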

## Summary

Well done! You now have a dataset of 1,200 user reviews for one app (`id.co.bitcoin`). Of course, you can add more package names to `app_packages` or raise `count` to get much more.

- [Read the tutorial](https://www.curiousily.com/posts/create-dataset-for-sentiment-analysis-by-scraping-google-play-app-reviews-using-python/)
- [Run the notebook in your browser (Google Colab)](https://colab.research.google.com/drive/1GDJIpz7BXw55jl9wTOMQDool9m8DIOyp)
- [Read the `Getting Things Done with Pytorch` book](https://github.com/curiousily/Getting-Things-Done-with-Pytorch)

You learned how to:

- Set goals and expectations for your dataset
- Scrape Google Play app information
- Scrape user reviews for Google Play apps
- Save the dataset to CSV files

Next, we're going to use the reviews for sentiment analysis with BERT. But first, we'll have to do some text preprocessing!


## References

- [Google Play Scraper for Python](https://github.com/JoMingyu/google-play-scraper)