## User Reviews Scraping
We fetch user reviews from the Play Store (capped at 2,000 per app in the run below) and store them in `OtherData/UserReviewsData`, as a separate CSV file for each app identifier.

### The Scraping Logic

The Play Store backend paginates reviews, so a single request can return at most 200 reviews. We therefore fetch reviews in batches of 200, passing the continuation token from each response into the next request.
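
To make the batching arithmetic concrete, here is a minimal sketch of how a requested total splits into per-request counts. The helper name `plan_batches` is hypothetical and not part of the notebook or of `google_play_scraper`; it only mirrors the `min(batch_size, remaining)` step used by the scraper.

```python
def plan_batches(review_count, batch_size=200):
    """Yield the count to request on each call so the counts sum to review_count.

    Hypothetical helper for illustration only.
    """
    remaining = review_count
    while remaining > 0:
        count = min(batch_size, remaining)  # never ask for more than one page holds
        yield count
        remaining -= count


print(list(plan_batches(450)))  # [200, 200, 50]
```

In a real run the last batch may come back smaller than requested when an app has fewer reviews than `review_count`.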

In [1]:
from google_play_scraper import Sort, reviews
import pandas as pd
from tqdm import tqdm

def scrape_reviews(app_id, review_count=0):
    """
    Scrape reviews for a given app in batches of up to 200 reviews per HTTP request.

    :param app_id: the identifier of the app (e.g. com.foobar.app)
    :param review_count: number of reviews to fetch; 0 (the default) scrapes everything

    :return: a list of review dictionaries
    """
    # continuation_token holds the metadata that tracks how far we have paged through the reviews
    results = []
    continuation_token = None
    batch_size = 200
    total_to_fetch = review_count if review_count > 0 else float('inf')

    with tqdm(total=total_to_fetch, desc=f"Scraping reviews for {app_id}") as pbar:
        while len(results) < total_to_fetch:
            count = min(batch_size, total_to_fetch - len(results))

            result, continuation_token = reviews(
                app_id,
                lang='en',
                country='us',
                sort=Sort.NEWEST,
                count=count,
                continuation_token=continuation_token,
            )

            if not result:
                break

            results.extend(result)
            pbar.update(len(result))

            if continuation_token is None:
                break

    return results


Loaded 490 SOCKS5 proxies from Mullvad.
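
The stopping conditions of the batching loop can be exercised offline. The sketch below replaces the network call with a hypothetical in-memory stand-in (`fake_fetch`, which is not part of `google_play_scraper`) and keeps the same three exits: the limit is reached, a page comes back empty, or no continuation token is returned.

```python
def fake_fetch(store, count, token=None):
    """Hypothetical stand-in for a paginated API: returns (items, next_token)."""
    start = token or 0
    page = store[start:start + count]
    end = start + len(page)
    next_token = end if end < len(store) else None  # None once the source is exhausted
    return page, next_token


def fetch_up_to(store, limit, batch_size=200):
    """Collect at most `limit` items, stopping early when the source runs dry."""
    results, token = [], None
    while len(results) < limit:
        count = min(batch_size, limit - len(results))
        page, token = fake_fetch(store, count, token)
        if not page:  # empty page: nothing left to read
            break
        results.extend(page)
        if token is None:  # no further pages available
            break
    return results


store = list(range(450))  # pretend the app has 450 reviews
print(len(fetch_up_to(store, 2000)))  # 450
print(len(fetch_up_to(store, 300)))   # 300
```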


### Scraping Every App

We load the list of app identifiers from `OtherData/AppInfoData/unique_app_ids.csv` and scrape up to 2,000 reviews per app in parallel, using a thread pool.

In [None]:
import pandas as pd
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm
import os

# Assuming your CSV has one column: 'app_id'
app_ids_df = pd.read_csv('OtherData/AppInfoData/unique_app_ids.csv')
app_ids = app_ids_df['app_id'].dropna().unique().tolist()  # remove NaNs if any

# Create output dir if not exists
output_dir = 'OtherData/UserReviewsData'
os.makedirs(output_dir, exist_ok=True)

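def scrape_and_save_reviews(app_id):
    try:
        # Named app_reviews to avoid confusion with the imported reviews() function
        app_reviews = scrape_reviews(app_id, 2000)
        reviews_df = pd.DataFrame(app_reviews)
        # Drop user-identifying columns. An app with no reviews typically yields an
        # empty DataFrame without these columns, so drop() raises a KeyError that is
        # caught below and reported as a failure.
        reviews_df = reviews_df.drop(columns=['userName', 'userImage'])
        reviews_df.to_csv(os.path.join(output_dir, f"{app_id}.csv"), index=False)
        return app_id, True
    except Exception as e:
        print(f"Failed for {app_id}: {e}")
        return app_id, False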

# Multithreaded execution
def run_multithreaded_review_scraper(app_ids, max_workers=8):
    results = []
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        futures = {executor.submit(scrape_and_save_reviews, app_id): app_id for app_id in app_ids}
        for future in tqdm(as_completed(futures), total=len(futures), desc="Scraping reviews"):
            result = future.result()
            results.append(result)
    return results

# Run it
results = run_multithreaded_review_scraper(app_ids)
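
Since each worker returns an `(app_id, succeeded)` tuple, the collected `results` list can be split into successes and failures, for example to retry the failed apps later. The following is a minimal sketch; `summarize_results` and the example identifiers are hypothetical and not part of the notebook.

```python
def summarize_results(results):
    """Split (app_id, succeeded) tuples into lists of succeeded and failed app IDs."""
    succeeded = [app_id for app_id, ok in results if ok]
    failed = [app_id for app_id, ok in results if not ok]
    return succeeded, failed


# Made-up example data in the same shape as the scraper's return values
example = [("com.foobar.app", True), ("com.example.empty", False)]
succeeded, failed = summarize_results(example)
print(f"{len(succeeded)} succeeded, {len(failed)} failed: {failed}")
```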


Scraping reviews for kemco.hitpoint.chronus:   0%|          | 0/2000 [00:00<?, ?it/s]
Scraping reviews for com.reyrey.serviceflex:   0%|          | 0/2000 [00:00<?, ?it/s]
Scraping reviews for com.sbitsoft.dn94percent2:   1%|          | 16/2000 [00:00<00:54, 36.20it/s]
Failed for com.reyrey.serviceflex: "['userName', 'userImage'] not found in axis"
Scraping reviews for kemco.hitpoint.chronus:   9%|▉         | 176/2000 [00:01<00:10, 173.54it/s]
Scraping reviews for com.zynga.farmville3:   0%|          | 0/2000 [00:00<?, ?it/s]
Scraping reviews for app.airmusic.pro:  17%|█▋        | 347/2000 [00:05<00:24, 68.56it/s]
Scraping reviews for com.digitalsmoke.tenpinshuffle: 100%|██████████| 2000/2000 [00:06<00:00, 292.76it/s]
Failed for com.hydra.noods: "['userName', 'userImage'] not found in axis"
Scraping reviews for net.supertreat.solitaire:  20%|██        | 400/2000 [00:00<00:03, 464.83it/s]
Scraping reviews for com.king.candycrushjellysaga: 100%|██████████| 2000/2000 [00:04<00:00, 438.12it/s]
Scraping reviews for net.supertreat.solitaire:  40%|████      | 800/2000 [00:01<00:02, 495.00it/s]
Scraping reviews for com.island.card: 100%|██████████| 2000/2000 [00:07<00:00, 260.27it/s]
Scraping reviews for com.peppapigthemepark.florida:   2%|▏         | 34/2000 [00:00<00:11, 170.39it/s]
Scraping reviews for net.supertreat.solitaire:  50%|█████     | 1000/2000 [00:02<00:03, 308.93it/s]
Scraping reviews for com.gameicreate.mycafeshopcookinggame: 100%|██████████| 2000/2000 [00:21<00:00, 94.61it/s]

Scraping 