## Step 1: Data Ingestion – Swiggy Google Play Store Reviews

This notebook extracts Swiggy reviews from the Google Play Store,
keeps only those posted on or after June 1, 2024, and stores them as
the source-of-truth dataset.


In [2]:
%pip install google-play-scraper


Note: you may need to restart the kernel to use updated packages.


In [4]:
from google_play_scraper import reviews, Sort
import pandas as pd
import time
import os


In [6]:
APP_ID = "in.swiggy.android"
os.makedirs("../data", exist_ok=True)


In [8]:
all_reviews = []
continuation_token = None
MAX_PAGES = 40  # hard cap: 40 pages x 200 reviews = at most 8,000 reviews

for page in range(MAX_PAGES):
    print(f"Fetching page {page+1}")

    result, continuation_token = reviews(
        APP_ID,
        lang="en",
        country="in",
        sort=Sort.NEWEST,
        count=200,
        continuation_token=continuation_token
    )

    if not result:
        break

    all_reviews.extend(result)

    if continuation_token is None:
        break

    time.sleep(1)

print("Total reviews fetched:", len(all_reviews))


Fetching page 1
Fetching page 2
Fetching page 3
Fetching page 4
Fetching page 5
Fetching page 6
Fetching page 7
Fetching page 8
Fetching page 9
Fetching page 10
Fetching page 11
Fetching page 12
Fetching page 13
Fetching page 14
Fetching page 15
Fetching page 16
Fetching page 17
Fetching page 18
Fetching page 19
Fetching page 20
Fetching page 21
Fetching page 22
Fetching page 23
Fetching page 24
Fetching page 25
Fetching page 26
Fetching page 27
Fetching page 28
Fetching page 29
Fetching page 30
Fetching page 31
Fetching page 32
Fetching page 33
Fetching page 34
Fetching page 35
Fetching page 36
Fetching page 37
Fetching page 38
Fetching page 39
Fetching page 40
Total reviews fetched: 8000
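
The loop above stops on a fixed page count, not on the review date, so it cannot tell whether it has paged back as far as June 1, 2024. A minimal sketch of date-aware pagination follows. It assumes that `Sort.NEWEST` returns reviews newest-first and that each review dict carries its timestamp under `"at"`, as in the cells above. `fetch_until` and the `fetch_page` callable are hypothetical names introduced for illustration, not part of google-play-scraper.

```python
from datetime import datetime
from typing import Callable, Optional

# Hypothetical callable: takes a continuation token (None for the first page)
# and returns (list_of_review_dicts, next_token).
FetchPage = Callable[[Optional[object]], tuple[list[dict], Optional[object]]]


def fetch_until(cutoff: datetime, fetch_page: FetchPage, max_pages: int = 40) -> list[dict]:
    """Collect newest-first pages until one reaches back past `cutoff`."""
    collected: list[dict] = []
    token = None
    for _ in range(max_pages):
        batch, token = fetch_page(token)
        if not batch:
            break
        # Keep only reviews inside the window.
        collected.extend(r for r in batch if r["at"] >= cutoff)
        # Assumption: pages arrive newest-first, so once a page contains a
        # review older than the cutoff, later pages hold nothing we need.
        if min(r["at"] for r in batch) < cutoff or token is None:
            break
    return collected
```

With the notebook's own names, `fetch_page` could be a small wrapper such as `lambda token: reviews(APP_ID, lang="en", country="in", sort=Sort.NEWEST, count=200, continuation_token=token)`, with the `time.sleep(1)` pause kept inside the wrapper. Raising `max_pages` then controls how far back the run is allowed to go.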


In [10]:
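df = pd.DataFrame(all_reviews)

# Keep only the review text and timestamp, under clearer column names.
# rename() returns a new DataFrame, so the assignments below are safe.
df = df[["content", "at"]].rename(
    columns={"content": "review_text", "at": "review_date"}
)

# Drop anything older than the June 1, 2024 cutoff.
df["review_date"] = pd.to_datetime(df["review_date"])
df = df[df["review_date"] >= "2024-06-01"]

# Oldest first, with a clean 0..n-1 index.
df = df.sort_values("review_date").reset_index(drop=True)
df.head()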


,review_text,review_date
0,Excellent service 💯,2025-12-23 19:12:27
1,super,2025-12-23 19:16:37
2,worst,2025-12-23 19:17:26
3,❤️,2025-12-23 19:18:12
4,good,2025-12-23 19:19:21
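
Because the frame is sorted oldest-first, the rows above are the earliest reviews that survived the filter, and they are dated 2025-12-23. That suggests the 8,000 fetched reviews span only a short recent period and do not reach back to the June 1, 2024 cutoff. A small sketch of a coverage check that could run on `all_reviews` before the DataFrame is built is shown here. `coverage_report` and its `reaches_cutoff` field are hypothetical names; the only assumption about the data is the `"at"` timestamp key used above.

```python
from datetime import datetime


def coverage_report(rows: list[dict], cutoff: datetime) -> dict:
    """Summarise the date range of fetched reviews relative to `cutoff`."""
    dates = [r["at"] for r in rows]
    if not dates:
        return {"count": 0, "earliest": None, "latest": None, "reaches_cutoff": False}
    earliest, latest = min(dates), max(dates)
    return {
        "count": len(dates),
        "earliest": earliest,
        "latest": latest,
        # True only if some fetched review is on or before the cutoff,
        # meaning pagination went far enough back to cover the whole window.
        "reaches_cutoff": earliest <= cutoff,
    }
```

Calling `coverage_report(all_reviews, datetime(2024, 6, 1))` and finding `reaches_cutoff` false would indicate that the page cap, not the date filter, is what bounds the dataset.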


In [12]:
df.to_csv("../data/swiggy_reviews_raw.csv", index=False)
print("Saved to data/swiggy_reviews_raw.csv")


Saved to data/swiggy_reviews_raw.csv
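
Since this file serves as the source of truth for later steps, it can be worth reading it back once to confirm that the dates parse and that emoji and commas in the review text survive the round trip. The sketch below uses only the standard library. `parse_reviews` is a hypothetical helper, and it assumes the two column names written above and the default `YYYY-MM-DD HH:MM:SS` timestamp format.

```python
import csv
from datetime import datetime
from typing import Iterable


def parse_reviews(lines: Iterable[str]) -> list[dict]:
    """Parse CSV lines with review_text and review_date columns."""
    rows = list(csv.DictReader(lines))
    for row in rows:
        # Convert the timestamp string back into a datetime object.
        row["review_date"] = datetime.fromisoformat(row["review_date"])
    return rows
```

A typical call would pass an open file handle, for example `with open("../data/swiggy_reviews_raw.csv", newline="", encoding="utf-8") as fh: rows = parse_reviews(fh)`, and then compare `len(rows)` with `len(df)`.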
