# 👓 **Overview**
This notebook documents the process of collecting user reviews from the Tokopedia Android application using the `google-play-scraper` Python library. The resulting dataset will be used for downstream tasks such as text cleaning, sentiment analysis, and topic modeling.

In [1]:
!pip install --quiet google-play-scraper

# 🛻 **Data Source**
The dataset is collected from the Google Play Store using the `google-play-scraper` library, targeting the Tokopedia application (`com.tokopedia.tkpd`). Reviews are requested with the Indonesian language and country settings (`lang='id'`, `country='id'`), so the text reflects language patterns specific to the local market. Each entry includes the review text, rating score, and review date. Since Google Play exposes only one review per user for a given app, we do not keep a separate review ID: duplication at the user level is prevented by the platform.
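The uniqueness assumption above can be checked on the raw records instead of being taken on trust. The following is a minimal sketch, assuming each record returned by `reviews()` is a dict that carries a `reviewId` key; the helper name `find_duplicate_ids` and the sample records are hypothetical and only illustrate the check.

```python
from collections import Counter


def find_duplicate_ids(records, key="reviewId"):
    """Return the sorted IDs that occur more than once in a list of review dicts."""
    counts = Counter(record[key] for record in records)
    return sorted(rid for rid, n in counts.items() if n > 1)


# Invented sample records (not real scraped data); "a1" appears twice on purpose.
sample = [
    {"reviewId": "a1", "content": "jos mantap", "score": 5},
    {"reviewId": "b2", "content": "lambat", "score": 2},
    {"reviewId": "a1", "content": "jos mantap", "score": 5},
]

duplicates = find_duplicate_ids(sample)
print(f"{len(duplicates)} duplicated review ID(s): {duplicates}")
```

An empty result on the scraped list would support dropping the ID column; a non-empty one would suggest deduplicating before saving.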

In [3]:
from warnings import filterwarnings
filterwarnings('ignore')

# Core library
import pandas as pd

# Web scraping
from google_play_scraper import (
    Sort,
    reviews
)

pd.set_option('display.max_colwidth', None)
print('Ready!')

Ready!


# ⛏️ **Scraping Method**
We use the `reviews()` function from `google-play-scraper`, which provides a high-level interface for retrieving app reviews without manually parsing HTML or building pagination requests. We fetch reviews in batches of up to 100,000, passing the continuation token returned by each `reviews()` call into the next one, for at most 10 batches. All retrieved entries are stored in a list, which is later converted into a DataFrame.

In [26]:
all_reviews = []
continuation_token = None

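# Request up to 10 batches of 100,000 reviews each, newest first
for i in range(10):
    batch, continuation_token = reviews(
        'com.tokopedia.tkpd',
        lang='id',
        country='id',
        sort=Sort.NEWEST,
        count=100000,
        continuation_token=continuation_token  # resume where the previous batch ended
    )

    all_reviews.extend(batch)
    if continuation_token is None:
        break

    # Reports the number of reviews requested so far, not len(all_reviews)
    print(f'Successfully scraped {(i+1)*100000} data')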

print('\nData scraped successfully!')

Successfully scraped 100000 data
Successfully scraped 200000 data
Successfully scraped 300000 data
Successfully scraped 400000 data
Successfully scraped 500000 data
Successfully scraped 600000 data
Successfully scraped 700000 data
Successfully scraped 800000 data
Successfully scraped 900000 data
Successfully scraped 1000000 data

Data scraped successfully!


In [None]:
df_scrape = pd.DataFrame(all_reviews)

# Copy the selected columns so later assignments do not write to a view of df_scrape
df_raw = df_scrape[['content', 'score', 'at']].copy()
df_raw.columns = ['text', 'rating', 'date']

# Additional features for further exploration
df_raw["char_len"] = df_raw["text"].str.len()
df_raw["token_len"] = (
    df_raw["text"]
    .fillna("")
    .astype(str)
    .str.split()
    .apply(len)
)

print(df_raw.shape)
df_raw.head()

(709000, 5)


| | text | rating | date | char_len | token_len |
|---|---|---|---|---|---|
| 0 | belanja di tokopedia sangat mudah cuma sayang nya estimasi pengiriman yang tidak sesuai | 5 | 2025-12-03 11:01:10 | 87.0 | 127 |
| 1 | Memuaskan kan produk original | 5 | 2025-12-03 10:35:41 | 29.0 | 42 |
| 2 | mau nyari apa aja di mesin pencariannya TOKOPEDIA hasil timeout melulu padahal sinyal bagus maen game online aja lancar jaya | 1 | 2025-12-03 10:11:21 | 124.0 | 185 |
| 3 | jos mantap | 5 | 2025-12-03 10:04:00 | 10.0 | 17 |
| 4 | Tidak punya CS hanya ada bot yg tidak bisa memberikan solusi BURUK | 1 | 2025-12-03 09:54:44 | 66.0 | 103 |


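Although we requested up to 1,000,000 reviews, `google-play-scraper` returned only 709,000, with dates ranging from 2020 to 2025. The progress messages above report the number of reviews requested, not the number actually received, which is why they count up to 1,000,000. The shortfall is not a problem, since 709,000 reviews is already a large dataset for this project.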
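The gap between requested and received reviews is easier to spot when the loop counts what actually arrives and stops once a batch comes back empty. The following is a minimal sketch of that idea, not the notebook's method: `collect_reviews` and `fake_fetch` are hypothetical names, and `fake_fetch` is a stand-in for the real scraper so the example runs offline.

```python
def collect_reviews(fetch_page, max_batches=10):
    """Collect batches until fetch_page returns an empty batch or max_batches is hit.

    fetch_page is a hypothetical callable: token -> (batch, next_token).
    """
    collected = []
    token = None
    for batch_no in range(1, max_batches + 1):
        batch, token = fetch_page(token)
        if not batch:
            # Nothing new came back, so further requests would be wasted
            break
        collected.extend(batch)
        print(f"Batch {batch_no}: {len(batch)} new, {len(collected)} total")
    return collected


def fake_fetch(token, total=25, page_size=10):
    """Stand-in for the scraper: serves `total` fake reviews in pages of `page_size`."""
    start = token or 0
    batch = list(range(start, min(start + page_size, total)))
    return batch, start + len(batch)


result = collect_reviews(fake_fetch)
```

In the notebook, `fetch_page` could wrap the existing `reviews(...)` call and pass its argument through as `continuation_token`. Stopping on an empty batch does not depend on how the library signals that no more pages are available.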

In [16]:
df_raw.to_csv('../data/raw/review.csv', index=False)

print('Scraped data successfully saved to "../data/raw/review.csv"')

Scraped data successfully saved to "../data/raw/review.csv"


In [17]:
with open("../data/raw/all_reviews.txt", "w", encoding="utf-8") as f:
    for line in df_raw.text.astype(str):
        f.write(line.replace("\n", " ") + "\n")

print('Scraped review only successfully saved to "../data/raw/all_reviews.txt"')

Scraped review only successfully saved to "../data/raw/all_reviews.txt"


# 🚧 **Limitations**
Although the scraper is designed for large-scale data collection, Google Play imposes restrictions that prevent retrieving the full set of available reviews. As a result, the scraping process may stop earlier than expected and yield fewer reviews than the theoretical total, as happened here with 709,000 of the 1,000,000 requested. These limits likely stem from API rate controls, pagination boundaries, and Google's internal filtering, any of which can cap the number of accessible reviews regardless of how many exist on the platform.
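One way to see where such a cap took effect is to count reviews per year and look at the oldest timestamp retrieved. The following is a minimal sketch, assuming each scraped record stores its timestamp as a `datetime` under the `at` key; the helper `reviews_per_year` is hypothetical, and the 2020 timestamp in the sample is invented for illustration.

```python
from collections import Counter
from datetime import datetime


def reviews_per_year(records, key="at"):
    """Count review records per calendar year, sorted by year."""
    counts = Counter(record[key].year for record in records)
    return dict(sorted(counts.items()))


# Two timestamps taken from the preview table, plus one invented 2020 entry.
sample = [
    {"at": datetime(2025, 12, 3, 11, 1, 10)},
    {"at": datetime(2025, 12, 3, 10, 35, 41)},
    {"at": datetime(2020, 5, 17, 8, 0, 0)},  # hypothetical value
]

per_year = reviews_per_year(sample)
oldest = min(record["at"] for record in sample)
print(per_year)
print("Oldest review retrieved:", oldest)
```

A sharp drop in the earliest year, or an oldest timestamp well after the app's launch, would suggest the cutoff comes from the platform, not from a lack of reviews.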
  
# 🪨 **Next Steps**
The next step is to clean and normalize the raw review text, as the dataset is still highly noisy and contains various artifacts such as emojis, repeated characters, inconsistent slang, typos, and formatting irregularities. Preparing the text through structured cleaning will ensure that downstream analysis and modeling operate on stable, standardized input rather than raw unprocessed data.
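As a rough illustration of what that cleaning could look like, the following is a minimal sketch using only the standard `re` module. The function name `clean_review`, the emoji ranges, and the rule of capping repeated characters at two are assumptions made for this example, not decisions taken in this notebook. Slang and typo normalization are out of scope here.

```python
import re

# Broad emoji and pictograph ranges; an approximation, not an exhaustive list.
EMOJI_PATTERN = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0000FE0F]"
)
# Any character repeated three or more times in a row
REPEAT_PATTERN = re.compile(r"(.)\1{2,}")
WHITESPACE_PATTERN = re.compile(r"\s+")


def clean_review(text):
    """Lowercase, drop emojis, cap character repeats at two, collapse whitespace."""
    text = str(text).lower()
    text = EMOJI_PATTERN.sub(" ", text)
    text = REPEAT_PATTERN.sub(r"\1\1", text)  # "mantaaaap" -> "mantaap"
    text = WHITESPACE_PATTERN.sub(" ", text)
    return text.strip()


print(clean_review("Mantaaaap  bangettt \U0001F44D\U0001F44D"))
```

Capping repeats at two, not one, keeps words that legitimately contain double letters intact, at the cost of leaving some elongated spellings only partly normalized.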