# Create a Dataset for Sentiment Analysis

> TL;DR In this tutorial, you'll learn how to create a dataset for Sentiment Analysis by scraping user reviews for Android apps. You'll convert the app and review information into Pandas DataFrames and save them to CSV files.

- [Read the tutorial](https://www.curiousily.com/posts/create-dataset-for-sentiment-analysis-by-scraping-google-play-app-reviews-using-python/)
- [Run the notebook in your browser (Google Colab)](https://colab.research.google.com/drive/1GDJIpz7BXw55jl9wTOMQDool9m8DIOyp)
- [Read the `Getting Things Done with Pytorch` book](https://github.com/curiousily/Getting-Things-Done-with-Pytorch)

You'll learn how to:

- Set a goal and inclusion criteria for your dataset
- Get real-world user reviews by scraping Google Play
- Use Pandas to convert and save the dataset into CSV files

## Setup

Let's install the required packages and set up the imports:

In [1]:
!pip install -qq google-play-scraper

In [2]:
!pip install -qq -U watermark

In [3]:
%reload_ext watermark
%watermark -v -p pandas,matplotlib,seaborn,google_play_scraper

Python implementation: CPython
Python version       : 3.9.7
IPython version      : 7.29.0

pandas             : 1.3.4
matplotlib         : 3.4.3
seaborn            : 0.11.2
google_play_scraper: 1.0.2



In [4]:
import json
import pandas as pd
from tqdm import tqdm

import seaborn as sns
import matplotlib.pyplot as plt

from pygments import highlight
from pygments.lexers import JsonLexer
from pygments.formatters import TerminalFormatter

from google_play_scraper import Sort, reviews, app

%matplotlib inline
%config InlineBackend.figure_format='retina'

sns.set(style='whitegrid', palette='muted', font_scale=1.2)

## The Goal of the Dataset

You want to get feedback for your app. Both negative and positive feedback are useful, but negative reviews can reveal critical missing features or downtime of your service (when negative reviews become much more frequent).

Lucky for us, Google Play has plenty of apps, reviews, and scores. We can scrape app info and reviews using the [google-play-scraper](https://github.com/JoMingyu/google-play-scraper) package.

You can choose plenty of apps to analyze. But different app categories contain different audiences, domain-specific quirks, and more. We'll start simple.

We want apps that have been around some time, so opinion is collected organically. We want to mitigate advertising strategies as much as possible. Apps are constantly being updated, so the time of the review is an important factor.

Ideally, you would want to collect every possible review and work with that. However, in the real world data is often limited (too large, inaccessible, etc). So, we'll do the best we can.

Let's choose some apps that fit the criteria. A chart service such as [AppAnnie](https://www.appannie.com/apps/google-play/top-chart/?country=US&category=29&device=&date=2020-04-05&feed=All&rank_sorting_type=rank&page_number=1&page_size=100&table_selections=) lists the top apps per category and country. The list below contains three shopping apps (Shopee, Lazada, and Tokopedia), which are scraped from the Indonesian store (`country='id'`) in the cells that follow:

In [5]:
app_packages = [
  'com.shopee.id',
  'com.lazada.android',
  'com.tokopedia.tkpd',
]
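
The inclusion criteria described above can also be checked programmatically once the app info dictionaries (scraped in the next section) are available. The following is a minimal sketch, not part of the original notebook: the `ratings` and `released` keys, the `"Feb 23, 2015"` date format, and both thresholds are assumptions for illustration and should be verified against what `app()` actually returns.

```python
from datetime import datetime


def meets_criteria(info, min_ratings=10_000, released_before=datetime(2019, 1, 1)):
    """Return True when an app-info dict passes the (hypothetical) inclusion thresholds."""
    ratings = info.get('ratings') or 0    # assumed key: total number of ratings
    released = info.get('released')       # assumed key: release date as text
    if ratings < min_ratings or not released:
        return False
    try:
        # Assumed format, e.g. "Feb 23, 2015" (English month abbreviations)
        released_at = datetime.strptime(released, '%b %d, %Y')
    except ValueError:
        # Unknown date format: be conservative and exclude the app
        return False
    return released_at < released_before


sample = {'ratings': 50_000, 'released': 'Feb 23, 2015'}
print(meets_criteria(sample))
```

Apps that fail the check can then be dropped from `app_packages` before any reviews are scraped.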

## Scraping App Information

Let's scrape the info for each app:

In [6]:
app_infos = []

for ap in tqdm(app_packages):
  info = app(ap, lang='en', country='id')
  # del info['comments']
  app_infos.append(info)

100%|██████████| 3/3 [00:02<00:00,  1.32it/s]


We got the info for all 3 apps. Let's write a helper function that prints JSON objects a bit better:

In [7]:
def print_json(json_object):
  json_str = json.dumps(
    json_object, 
    indent=2, 
    sort_keys=True, 
    default=str
  )
  print(highlight(json_str, JsonLexer(), TerminalFormatter()))

Here is the information for a sample app from the list:

In [9]:
print_json(app_infos[0])

{
  "adSupported": null,
  "androidVersion": "4.1",
  "androidVersionText": "4.1 and up",
  "appId": "com.shopee.id",
  "comments": [],
  "containsAds": false,
  "contentRating": "Rated for 3+",
  "contentRatingDescription": null,
  "currency": "IDR",
  "description": "Let's celebrate 2022 with Shopee 1.1 New Year Sale! \r\n\r\n1. Cuci Gudang Potongan s/d 90%\r\n2. Potongan Ongkir XTRA\r\n3. Ekstra Potongan 10RB\r\n\r\nDownload Shopee app now & buy all your needs with lowest price!\r\n\r\nPay everything with ShopeePay!\r\nShopeePay is a digital wallet and e-money feature service to provide you an easy online payment on Shopee app & offline payment at ShopeePay Merchants. Activate & verify your Shop

This contains lots of information including the number of ratings, number of reviews and number of ratings for each score (1 to 5). Let's ignore all of that and have a look at their beautiful icons:

In [11]:
def format_title(title):
  sep_index = title.find(':') if title.find(':') != -1 else title.find('-')
  if sep_index != -1:
    title = title[:sep_index]
  return title[:10]

# One row with one column per app; squeeze=False keeps axs a 2D array even for a single app
fig, axs = plt.subplots(1, len(app_infos), figsize=(14, 5), squeeze=False)

for ax, ai in zip(axs.flat, app_infos):
  # Reading from a URL works with the matplotlib 3.4 used here;
  # newer releases may require downloading the image first
  img = plt.imread(ai['icon'])
  ax.imshow(img)
  ax.set_title(format_title(ai['title']))
  ax.axis('off')

We'll store the app information for later by converting the JSON objects into a Pandas dataframe and saving the result into a CSV file:

In [12]:
app_infos_df = pd.DataFrame(app_infos)
app_infos_df.to_csv('apps.csv', index=None, header=True)

## Scraping App Reviews

In an ideal world, we would get all the reviews. But there are lots of them and we're scraping the data. That wouldn't be very polite. What should we do?

We want:

- Balanced dataset - roughly the same number of reviews for each score (1-5)
- A representative sample of the reviews for each app

We can satisfy the first requirement by using the scraping package's option to filter by review score. For the second, we'll sort the reviews by relevance (`Sort.MOST_RELEVANT`), which surfaces the reviews that Google Play considers most important. Just in case, we'll get a subset of the newest reviews, too:

In [14]:
app_reviews = []

for ap in tqdm(app_packages):
  for score in list(range(1, 6)):
    for sort_order in [Sort.MOST_RELEVANT, Sort.NEWEST]:
      rvs, _ = reviews(
        ap,
        lang='en',
        country='id',
        sort=sort_order,
        count= 200 if score == 3 else 100,
        filter_score_with=score
      )
      for r in rvs:
        r['sortOrder'] = 'most_relevant' if sort_order == Sort.MOST_RELEVANT else 'newest'
        r['appId'] = ap
      app_reviews.extend(rvs)

100%|██████████| 3/3 [00:30<00:00, 10.07s/it]
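
Since politeness matters when scraping, larger collections are usually fetched in pages with a pause between requests. The sketch below is not from the original notebook: `fetch_pages`, `fetch_page`, `max_pages`, and `delay_seconds` are hypothetical names, and the page-fetching function is injected so the loop can be tried without network access.

```python
import time


def fetch_pages(fetch_page, max_pages=3, delay_seconds=1.0, sleep=time.sleep):
    """Collect items page by page, pausing between requests.

    fetch_page(token) must return (items, next_token);
    a token of None asks for the first page.
    """
    collected = []
    token = None
    for page in range(max_pages):
        items, token = fetch_page(token)
        collected.extend(items)
        # Stop when the source is exhausted
        if not items or token is None:
            break
        # Pause before the next request, but not after the last one
        if page < max_pages - 1:
            sleep(delay_seconds)
    return collected


# Offline demonstration with a fake three-page source
fake_pages = {None: (['r1', 'r2'], 'a'), 'a': (['r3'], 'b'), 'b': ([], None)}
print(fetch_pages(lambda token: fake_pages[token], delay_seconds=0))
```

With google-play-scraper, `fetch_page` could be a small wrapper that calls `reviews(..., continuation_token=token)` and returns its two results. Check the package documentation for how an exhausted token is represented before relying on this.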


Note that we're adding the app id and sort order to each review. Here's an example for one:

In [16]:
print_json(app_reviews[1])

{
  "appId": "com.shopee.id",
  "at": "2021-12-23 14:51:51",
  "content": "I'm disappointed. Shopee is heavy, HEAVY. It took me around 650MB of storage like what???The other thing is that Shopee took a long time to load and the UI is not user friendly. I'm on the latest version right now (and stable connection of course) and it seems nothing much happen after every update.",
  "repliedAt": "2021-12-23 18:14:56",
  "replyContent": "Hi, sorry for the inconvenience, I suggest making sure to update your shopee application, the internet network is stable, clear cache, log out and log back in, and try periodically 1x24 hours, Sis. Thank You ^EU",
  "reviewCreatedVersion": "2.81.08",
  "reviewId": "gp:AOqpTOHjvpMcZFfhdHHr5xEI6rFu9bildrNDwFnorkmKZ8-Zvy37JWgrvSH-Oa__CgRrxD0x4Jte_RsHT-k

`repliedAt` and `replyContent` contain the developer response to the review. Of course, they can be missing.
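
Because every app and score is fetched under two sort orders, the same review may show up in both the most-relevant and the newest batch. A minimal sketch for removing such repeats, assuming `reviewId` uniquely identifies a review (`dedupe_reviews` is a hypothetical helper, not part of the scraping package):

```python
def dedupe_reviews(reviews_list, key='reviewId'):
    """Keep the first occurrence of each review id, preserving the original order."""
    seen = set()
    unique = []
    for review in reviews_list:
        review_id = review[key]
        if review_id in seen:
            continue
        seen.add(review_id)
        unique.append(review)
    return unique


# Small made-up sample: review 'a' was returned under both sort orders
sample_reviews = [
    {'reviewId': 'a', 'sortOrder': 'most_relevant'},
    {'reviewId': 'b', 'sortOrder': 'most_relevant'},
    {'reviewId': 'a', 'sortOrder': 'newest'},
]
print(len(dedupe_reviews(sample_reviews)))
```

The first occurrence wins, so a review found under both orders keeps the `sortOrder` of the batch that was scraped first.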

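How many app reviews did we get? `len(app_reviews)` returns the total number of reviews. The cell below calls `len` on the first review instead, so it reports the number of fields in a single review:

In [18]:
len(app_reviews[0])

12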
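
To check the balance requirement, it helps to count the reviews per app and score. A small sketch using only the standard library (`count_by` is a hypothetical helper, and the sample data is made up for illustration):

```python
from collections import Counter


def count_by(reviews_list, *keys):
    """Count reviews grouped by the values of the given dictionary keys."""
    return Counter(tuple(review[k] for k in keys) for review in reviews_list)


# Made-up sample with the same keys as the scraped reviews
sample_reviews = [
    {'appId': 'com.shopee.id', 'score': 1},
    {'appId': 'com.shopee.id', 'score': 1},
    {'appId': 'com.shopee.id', 'score': 5},
    {'appId': 'com.lazada.android', 'score': 3},
]

for group, n in sorted(count_by(sample_reviews, 'appId', 'score').items()):
    print(group, n)
```

Calling `count_by(app_reviews, 'appId', 'score')` on the scraped list shows at a glance whether any app or score is underrepresented.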

In [20]:
print_json(app_reviews)

[
  {
    "appId": "com.shopee.id",
    "at": "2021-12-28 16:35:50",
    "content": "This is too much, I spend lots of time shopping on shopee, but nowadays its a worst experiences. The app took too long to open, and 90% of the time it closed automatically after a few seconds of opening this app. I tried to clear my cache, and uninstall-install it several times, but this issues keep happening again and again. So sick of this problems.",
    "repliedAt": null,
    "replyContent": null,
    "reviewCreatedVersion": "2.81.21",
    "reviewId": "gp:AOqpTOHHLVUa5b_rxsJPsABSr5ut9L00gN3luh-Wkjceh1BFOGQDPB1pqFG-2ANWGWuBav70p-0oQem9_iTzF5M",
    "score": 1,
    "sortOrder": "most_relevant",
    "thumbsUpCount

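Let's save the reviews to a CSV file. On Google Colab, you can optionally mount Google Drive first (note that the save cell below writes to the current working directory, not to Drive):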

In [2]:
from google.colab import drive
drive.mount('/content/drive',force_remount=True)
#repo_path = '/content/gdrive/My Drive/reviewsentimen2021/'

Mounted at /content/drive


In [21]:
app_reviews_df = pd.DataFrame(app_reviews)
app_reviews_df.to_csv('review_shopping_2021.csv', index=None, header=True)

## Summary

Well done! You now have a dataset of user reviews from 3 shopping apps: at most 3,600 reviews, given the counts requested per score and sort order. Of course, you can go crazy and get much, much more.

You learned how to:

- Set goals and expectations for your dataset
- Scrape Google Play app information
- Scrape user reviews for Google Play apps
- Save the dataset to CSV files

Next, we're going to use the reviews for sentiment analysis with BERT. But first, we'll have to do some text preprocessing!


## References

- [Google Play Scraper for Python](https://github.com/JoMingyu/google-play-scraper)
