# Overview
This notebook documents the process of collecting user reviews from the Tokopedia Android application using the `google-play-scraper` Python library. The resulting dataset will be used for downstream tasks such as text cleaning, sentiment analysis, and topic modeling.

# Objective
To build a structured and comprehensive dataset of user-generated reviews from the Google Play Store, including ratings, review text, timestamps, and metadata.

In [1]:
!pip install --quiet google-play-scraper

# Data Source
- **Platform:** Google Play Store  
- **Library:** `google-play-scraper`  
- **App Package Name:** `com.tokopedia.tkpd`  
- **Data Retrieved:** Review text, score, review date, user metadata, and review IDs.

In [None]:
from warnings import filterwarnings
filterwarnings('ignore')

# Core library
import pandas as pd

# Web scraping
from google_play_scraper import (
    Sort,
    reviews
)

pd.set_option('display.max_colwidth', None)

print('Ready!')

Ready!


# Scraping Method
We use the `reviews()` function from `google-play-scraper`, a high-level interface that retrieves app reviews without any manual HTML parsing. Pagination across calls is handled through the continuation token that the function returns alongside each batch.

# Key Parameters
- `lang`: Language of the reviews (`"id"` for Indonesian-language reviews)
- `country`: Country store (`"id"` for Indonesian users)
- `sort`: Sorting method (`Sort.NEWEST` to get recent reviews)
- `count`: Number of reviews to fetch per request

# Fetching Reviews in Batches
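The following cell fetches reviews in batches of 20,000, for up to 25 batches (at most 500,000 reviews). Each call passes the continuation token returned by the previous `reviews()` call, so the next batch resumes where the last one ended. All retrieved entries are stored in a list, which is later converted into a DataFrame.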

In [8]:
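all_reviews = []
continuation_token = None  # None on the first call, then updated after every batch

for i in range(25):
    batch, continuation_token = reviews(
        'com.tokopedia.tkpd',
        lang='id',
        country='id',
        sort=Sort.NEWEST,
        count=20000,
        # Reuse the token from the previous call so this batch continues from there
        continuation_token=continuation_token
    )

    # An empty batch means there is nothing left to fetch
    if not batch:
        break

    all_reviews.extend(batch)

    print(f'Successfully scraped {(i+1)*20000} data')

print('Data scraped successfully!')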

Successfully scraped 20000 data
Successfully scraped 40000 data
Successfully scraped 60000 data
Successfully scraped 80000 data
Successfully scraped 100000 data
Successfully scraped 120000 data
Successfully scraped 140000 data
Successfully scraped 160000 data
Successfully scraped 180000 data
Successfully scraped 200000 data
Successfully scraped 220000 data
Successfully scraped 240000 data
Successfully scraped 260000 data
Successfully scraped 280000 data
Successfully scraped 300000 data
Successfully scraped 320000 data
Successfully scraped 340000 data
Successfully scraped 360000 data
Successfully scraped 380000 data
Successfully scraped 400000 data
Successfully scraped 420000 data
Successfully scraped 440000 data
Successfully scraped 460000 data
Successfully scraped 480000 data
Successfully scraped 500000 data
Data scraped successfully!
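
A loop of this kind has to feed the token returned by one call into the next call; otherwise every request starts again from the first page. The sketch below shows that pattern in isolation. `fetch_all` and `fake_fetch_page` are hypothetical names introduced for illustration, and the fake fetcher stands in for a real paginated source so the example runs offline.

```python
def fetch_all(fetch_page, max_batches):
    """Collect items page by page until a page comes back empty
    or max_batches pages have been requested."""
    items = []
    token = None  # no token on the first request
    for _ in range(max_batches):
        batch, token = fetch_page(token)  # token from the previous call
        if not batch:
            break
        items.extend(batch)
    return items


# Hypothetical stand-in for a paginated API: 5 fake records, 2 per page.
FAKE_DATA = [{'reviewId': f'r{i}'} for i in range(5)]


def fake_fetch_page(token, page_size=2):
    """Return (batch, next_token); here the token is simply the next offset."""
    start = 0 if token is None else token
    batch = FAKE_DATA[start:start + page_size]
    return batch, start + len(batch)


collected = fetch_all(fake_fetch_page, max_batches=10)
print(len(collected))  # 5

# With google-play-scraper, fetch_page could be a small wrapper such as:
#   lambda token: reviews('com.tokopedia.tkpd', lang='id', country='id',
#                         sort=Sort.NEWEST, count=20000,
#                         continuation_token=token)
```

Stopping on an empty batch rather than on the token value is a deliberate choice in this sketch: it does not depend on how the library represents an exhausted token.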


In [14]:
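df_scrape = pd.DataFrame(all_reviews)

# Keep only the necessary columns and give them descriptive names.
# rename() returns a new DataFrame, which avoids pandas' SettingWithCopyWarning.
df_raw = df_scrape[['content', 'score', 'at']].rename(
    columns={'content': 'raw_text', 'score': 'rating', 'at': 'date'}
)
df_raw.head(7)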

| | raw_text | rating | date |
|---|---|---|---|
| 0 | keluar masuk mulu | 5 | 2025-11-27 08:23:08 |
| 1 | good | 5 | 2025-11-27 08:21:14 |
| 2 | Penarikan Saldo refund saya kenapa masih di tahan pengembaliannya??? | 1 | 2025-11-27 07:51:54 |
| 3 | update mulu heran | 5 | 2025-11-27 07:18:06 |
| 4 | sekarang aplikasi tambah ancur, sudah boros batre dipakai nggak nyaman | 1 | 2025-11-27 06:03:40 |
| 5 | minusnya satu kenapa customer service nya bisa lama bangett, ini perusahaan gede lohhh, please lah , aku nggk bisa narik dana refund lebih dari 2hari dan csnya terus dialihkan, ditanyakan nggk di bales2 🥲 coba diperbaiki lagi dong biar semuanya juga puas dengan pelayanan nan, dan masa iya penipu ada di tokped kamu gimana nyeleksinya heran penjual ada yang nipu😔bukan duit sedikit lohhh ini yang aku tarik saldo refund nya | 3 | 2025-11-27 06:03:36 |
| 6 | Sekarang kenapa susah ya menginfokan ke penjual utk lampirin orderan kita via chat, biasa begitu tanya penjual itu otomatis ke kirim orderan kita tapi sekarang ga bisa... | 4 | 2025-11-27 05:41:14 |
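
Paginated scraping may return the same review more than once, so it can be worth checking the scraped list for repeated entries before analysis. The sketch below removes repeats by review ID using only the standard library. `dedupe_reviews` is a hypothetical helper, the sample records are made up, and the key name `reviewId` is an assumption about the scraper's output that should be checked against an actual record.

```python
def dedupe_reviews(records, key='reviewId'):
    """Return records with repeated IDs removed, keeping the first occurrence.
    Records without the key are kept as-is, since they cannot be compared."""
    seen = set()
    unique = []
    for record in records:
        record_id = record.get(key)
        if record_id is not None:
            if record_id in seen:
                continue  # already collected this review
            seen.add(record_id)
        unique.append(record)
    return unique


# Made-up sample shaped like scraper output (only the fields used here).
# 'reviewId' is an assumed field name.
sample = [
    {'reviewId': 'a1', 'content': 'good', 'score': 5},
    {'reviewId': 'b2', 'content': 'update mulu heran', 'score': 5},
    {'reviewId': 'a1', 'content': 'good', 'score': 5},  # repeated entry
]

unique_reviews = dedupe_reviews(sample)
print(f'{len(sample) - len(unique_reviews)} duplicate(s) removed')
```

Applied to the notebook's data, the same check would run on `all_reviews` before the DataFrame is built.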


In [16]:
df_raw.to_csv('../data/raw/review.csv', index=False)
print('Scraped data successfully saved to "../data/raw/review.csv"')

Scraped data successfully saved to "../data/raw/review.csv"


# Limitations
- Google Play may throttle or block large-scale scraping, and the library can only return publicly accessible data.
- Pagination relies on continuation tokens; if Google changes the underlying response structure, the library may stop working.
- The scraper is not guaranteed to retrieve **all** historical reviews due to API restrictions.
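
One common way to soften the first limitation is to retry a failed request after a growing delay. The notebook itself does not do this; the sketch below is a generic illustration, and `with_retries`, `flaky`, and the delay values are hypothetical. The `sleep` argument is injectable so the demo runs instantly.

```python
import time


def with_retries(fn, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**(attempt - 1) seconds
    and try again, re-raising after max_attempts failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))


# Simulated request that fails twice before succeeding.
calls = {'n': 0}


def flaky():
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError('simulated failure')
    return ['review']


delays = []  # records the waits instead of actually sleeping
result = with_retries(flaky, max_attempts=5, base_delay=0.5, sleep=delays.append)
print(result, delays)  # ['review'] [0.5, 1.0]
```

In practice `fn` would wrap a single scraping call, and catching a narrower exception type than `Exception` would be preferable once the actual failure modes are known.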
  
# Next Steps
- Clean and normalize the review text.
- Remove noise, emojis, and repeating characters.
- Perform exploratory analysis on ratings, review length, and temporal patterns.
- Apply sentiment classification, topic extraction, or clustering.
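
As a starting point for the cleaning steps listed above, the sketch below lowercases a review, strips emojis, and collapses stretched letters. It is a rough illustration under stated assumptions, not the pipeline used later: `clean_review` is a hypothetical helper, emojis are approximated as any character outside the Basic Multilingual Plane, and a run of three or more identical letters is reduced to one.

```python
import re

# Assumption: treat every character above U+FFFF as an emoji.
# This misses emojis inside the BMP (for example U+2764).
EMOJI_PATTERN = re.compile(r'[\U00010000-\U0010FFFF]')
# Three or more repeats of the same letter (digits and punctuation are untouched).
REPEAT_PATTERN = re.compile(r'([^\W\d_])\1{2,}')
WHITESPACE_PATTERN = re.compile(r'\s+')


def clean_review(text):
    """Lowercase, drop emojis, collapse stretched letters, tidy whitespace."""
    text = text.lower()
    text = EMOJI_PATTERN.sub(' ', text)
    text = REPEAT_PATTERN.sub(r'\1', text)
    text = WHITESPACE_PATTERN.sub(' ', text)
    return text.strip()


example = clean_review('Bagusss bangettt \U0001F614  lohhh')
print(example)  # bagus banget loh
```

Collapsing to a single letter is a blunt rule: words that legitimately contain doubled letters are safe, but a stretched double such as `bangett` is left unchanged, so a dictionary-based normalizer may be needed for informal Indonesian text.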