# **Overview**
This notebook documents the process of collecting user reviews from the Tokopedia Android application using the `google-play-scraper` Python library. The resulting dataset will be used for downstream tasks such as text cleaning, sentiment analysis, and topic modeling.

# **Objective**
To build a structured and comprehensive dataset of user-generated reviews from the Google Play Store, including ratings, review text, timestamps, and metadata.

In [1]:
!pip install --quiet google-play-scraper

# **Data Source**
- **Platform:** Google Play Store  
- **Library:** `google-play-scraper`  
- **App Package Name:** `com.tokopedia.tkpd`  
- **Data Retrieved:** Review text, score, review date, user metadata, and review IDs.

In [2]:
from warnings import filterwarnings
filterwarnings('ignore')

# Core library
import pandas as pd

# Web scraping
from google_play_scraper import (
    Sort,
    reviews
)

pd.set_option('display.max_colwidth', None)

print('Ready!')

Ready!


# **Scraping Method**
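We use the `reviews()` function from `google-play-scraper`, which provides a high-level interface for retrieving app reviews without the need to parse HTML or issue each page request manually. Each call also returns a continuation token that can be passed back in to fetch the next batch.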

# **Key Parameters**
- `lang`: Language of the reviews (`"id"` for Indonesian-language reviews)
- `country`: Country of the Play Store (`"id"` for Indonesia)
- `sort`: Sort order (`Sort.NEWEST` returns the most recent reviews first)
- `count`: Number of reviews to fetch per `reviews()` call

# **Fetching Reviews in Batches**
The following cell fetches reviews in batches of 100,000, for up to 10 batches, using the continuation token returned by `reviews()`. All retrieved entries are stored in a list, which is later converted into a DataFrame.

In [26]:
all_reviews = []
continuation_token = None

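# Fetch up to 10 batches of 100,000 reviews each
for i in range(10):
    batch, continuation_token = reviews(
        'com.tokopedia.tkpd',
        lang='id',
        country='id',
        sort=Sort.NEWEST,
        count=100000,
        # None on the first call; afterwards, the token from the previous batch
        continuation_token=continuation_token
    )

    all_reviews.extend(batch)
    # Stop early if the batch is empty or no token is returned
    if not batch or continuation_token is None:
        break

    # Note: this count assumes every batch was full
    print(f'Successfully scraped {(i+1)*100000} data')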

print('\nData scraped successfully!')

Successfully scraped 100000 data
Successfully scraped 200000 data
Successfully scraped 300000 data
Successfully scraped 400000 data
Successfully scraped 500000 data
Successfully scraped 600000 data
Successfully scraped 700000 data
Successfully scraped 800000 data
Successfully scraped 900000 data
Successfully scraped 1000000 data

Data scraped successfully!
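
The batching loop can also be factored into a small helper that reports how many reviews were actually collected and stops as soon as a batch comes back empty. The sketch below is illustrative rather than the notebook's own method: `fetch_all` and `fake_fetch` are hypothetical names, and the fetcher is passed in as a parameter so the loop can be exercised without network access. It assumes only that the fetcher returns a `(batch, token)` pair, as `reviews()` does.

```python
def fetch_all(fetch_page, max_batches=10):
    """Collect batches until one is empty, the token is None, or the cap is hit.

    fetch_page is any callable that takes the previous continuation token
    (None for the first page) and returns (batch, next_token).
    """
    collected = []
    token = None
    for _ in range(max_batches):
        batch, token = fetch_page(token)
        if not batch:
            break  # nothing more to read
        collected.extend(batch)
        # Report the real running total rather than an assumed batch size
        print(f"Collected {len(collected)} reviews so far")
        if token is None:
            break
    return collected


# Hypothetical stand-in for the real API: serves 5 made-up reviews, 2 per page.
def fake_fetch(token):
    data = [{"reviewId": str(i), "content": f"review {i}"} for i in range(5)]
    start = token or 0
    end = start + 2
    next_token = end if end < len(data) else None
    return data[start:end], next_token


sample = fetch_all(fake_fetch)
```

With the real library, the fetcher could be a thin wrapper such as `lambda tok: reviews('com.tokopedia.tkpd', lang='id', country='id', sort=Sort.NEWEST, count=100000, continuation_token=tok)`.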


In [30]:
df_scrape = pd.DataFrame(all_reviews)

# Keep only the necessary columns and give them clearer names
df_raw = df_scrape[['content', 'score', 'at']].rename(
    columns={'content': 'text', 'score': 'rating', 'at': 'date'}
)
df_raw.head()

| | text | rating | date |
|---|---|---|---|
| 0 | belanja di tokopedia sangat mudah cuma sayang nya estimasi pengiriman yang tidak sesuai | 5 | 2025-12-03 11:01:10 |
| 1 | Memuaskan kan produk original | 5 | 2025-12-03 10:35:41 |
| 2 | mau nyari apa aja di mesin pencariannya TOKOPEDIA hasil timeout melulu padahal sinyal bagus maen game online aja lancar jaya | 1 | 2025-12-03 10:11:21 |
| 3 | jos mantap | 5 | 2025-12-03 10:04:00 |
| 4 | Tidak punya CS hanya ada bot yg tidak bisa memberikan solusi BURUK | 1 | 2025-12-03 09:54:44 |


In [32]:
df_raw.to_csv('../data/raw/review.csv', index=False)
print('Scraped data successfully saved to "../data/raw/review.csv"')

Scraped data successfully saved to "../data/raw/review.csv"


# **Save Review Text Only**

In [33]:
with open("../data/raw/all_reviews.txt", "w", encoding="utf-8") as f:
    for line in df_raw.text.astype(str):
        f.write(line.replace("\n", " ") + "\n")
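
To check that such a file really holds one review per line, the same kind of write can be run on a few sample strings and read back. This is a sketch under assumptions: `write_one_per_line` is a hypothetical helper, the sample reviews are made up, and it collapses every whitespace run (including `\r` and tabs) rather than only `\n` as the cell above does.

```python
import tempfile
from pathlib import Path


def write_one_per_line(texts, path):
    """Write each text on its own line, collapsing internal whitespace runs."""
    with open(path, "w", encoding="utf-8") as f:
        for text in texts:
            # str.split() with no argument splits on any whitespace, so
            # embedded \n, \r and \t all become single spaces.
            f.write(" ".join(str(text).split()) + "\n")


# Made-up sample reviews; the second one contains line breaks.
samples = ["jos mantap", "baris satu\nbaris dua\r\nbaris tiga", "  ok  "]

with tempfile.TemporaryDirectory() as tmp:
    out_path = Path(tmp) / "all_reviews.txt"
    write_one_per_line(samples, out_path)
    lines = out_path.read_text(encoding="utf-8").splitlines()
```

If the number of lines read back equals the number of reviews written, each review has stayed on its own line.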

# **Limitations**
- Google Play may throttle or block large-scale scraping, and the library can only return publicly accessible data.
- Pagination relies on continuation tokens; if Google changes the underlying API structure, the scraper may break.
- Retrieving **all** historical reviews is not guaranteed because of API restrictions.
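
Because pagination depends on continuation tokens, it may be worth guarding against the same review appearing in more than one batch. The sketch below is an illustration under assumptions, not something this notebook does: `dedupe_reviews` is a hypothetical helper, the sample records are invented, and it relies on each record carrying the `reviewId` key found in the dictionaries returned by `google-play-scraper`.

```python
def dedupe_reviews(records, key="reviewId"):
    """Keep the first record seen for each key value, preserving order."""
    seen = set()
    unique = []
    for record in records:
        value = record.get(key)
        if value is not None and value in seen:
            continue  # duplicate of an earlier record
        if value is not None:
            seen.add(value)
        # Records without the key are kept, since they cannot be compared
        unique.append(record)
    return unique


# Invented sample records; "a" appears twice.
records = [
    {"reviewId": "a", "content": "jos mantap"},
    {"reviewId": "b", "content": "produk original"},
    {"reviewId": "a", "content": "jos mantap"},
]
unique_records = dedupe_reviews(records)
```

Applied to the scraped list, this would be called as `dedupe_reviews(all_reviews)` before building the DataFrame.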
  
# **Next Steps**
- Clean and normalize the review text.
- Remove noise, emojis, and repeated characters.
- Perform exploratory analysis on ratings, review length, and temporal patterns.
- Apply sentiment classification, topic extraction, or clustering.
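
As a starting point for the cleaning steps, the following sketch lowercases a review, strips emojis and punctuation, and collapses letters repeated three or more times. It is a rough heuristic under stated assumptions: `clean_review` is a hypothetical helper, the emoji ranges are an approximation rather than a complete Unicode emoji list, and collapsing repeats to a single letter may not suit every word.

```python
import re

# Approximate emoji and pictograph ranges (an assumption, not exhaustive),
# plus the variation selector and zero-width joiner used in emoji sequences.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\u2600-\u27BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]+"
)
# The same letter three or more times in a row; digits are left alone
# so that numbers such as 100000 are not mangled.
REPEAT_PATTERN = re.compile(r"([^\W\d_])\1{2,}")
# Anything that is neither a word character nor whitespace.
NON_TEXT_PATTERN = re.compile(r"[^\w\s]")


def clean_review(text):
    """Lowercase, drop emojis and punctuation, and collapse stretched letters."""
    text = text.lower()
    text = EMOJI_PATTERN.sub(" ", text)
    text = REPEAT_PATTERN.sub(r"\1", text)
    text = NON_TEXT_PATTERN.sub(" ", text)
    # Collapse any leftover whitespace runs into single spaces
    return " ".join(text.split())


cleaned = clean_review("Mantaaaap!!! 😍😍 Barang BAGUS bangeeet")
```

Legitimate double letters (as in `saat`) are preserved because only runs of three or more are collapsed.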