# Scraping Play and App Store

In this notebook, we are going to perform scraping of the Google Play and iOS App Store.

In [1]:
import requests, time, os
import pandas as pd
import numpy as np

## Perform Scraping to File

We will first scrape the reviews and write them to CSV files.

Prior to performing the operation, we have developed an App Scraper, which basically uses  [facundoolano](https://github.com/facundoolano "facundoolano")'s [Google Play Scraper](https://github.com/facundoolano/google-play-scraper "Google Play Scraper") and [App Store Scraper](https://github.com/facundoolano/app-store-scraper "App Store Scraper"). Because they have been developed on NodeJS, we use ExpressJS to convert them into REST APIs that can be taken advantage of by other applications such as this notebook.

All codes in app scraping are developed on Docker containers.

In [20]:
# Constants

APP_NAME = ['ryde', 'tada', 'grab']

APP_IDS = [
    ['io.mvlchain.tada', 'com.rydesharing.ryde', 'com.grabtaxi.passenger'],
    ['979806982', '1412329684', '647268330']
]

FS_PREFIX = ['play', 'appstore']

URL_BASE = 'http://scraper:3000/api/'
URL_REV = '/reviews?id='
URL_PG = '&page='

In [27]:
def scrape_to_file(store, limit=150):
    '''Writes the output from the scraping to individual CSV files
    
    Parameters
    ----------
    store : string
        Define one of: 'play' or 'appstore'
    limit : int
        Restrict the number of times the URL to the app scraper is being called
    '''
    sid = FS_PREFIX.index(store)
    start = 0 if sid == 0 else 1
    
    for app in APP_IDS[sid]:
        p_dir = 'gp/' + app
        print(p_dir)

        # Make directory
        if not os.path.exists(p_dir):
            os.makedirs(p_dir)
        
        for i in range(start, limit):
            # Form url
            url = URL_BASE + store + URL_REV + app + URL_PG + str(i)
            print(url)

            # Read the output
            df = pd.read_json(url)

            # Put to file if more than 1 row
            if len(df.index) != 0:
                df.to_csv(p_dir + '/out' + str(i) + '.csv', index=False)
                time.sleep(2)
            else:
                break;

In [29]:
# Srape iOS App Store
scrape_to_file(FS_PREFIX[1])

gp/979806982
http://scraper:3000/api/appstore/reviews?id=979806982&page=1
http://scraper:3000/api/appstore/reviews?id=979806982&page=2
gp/1412329684
http://scraper:3000/api/appstore/reviews?id=1412329684&page=1
http://scraper:3000/api/appstore/reviews?id=1412329684&page=2
gp/647268330
http://scraper:3000/api/appstore/reviews?id=647268330&page=1
http://scraper:3000/api/appstore/reviews?id=647268330&page=2
http://scraper:3000/api/appstore/reviews?id=647268330&page=3
http://scraper:3000/api/appstore/reviews?id=647268330&page=4
http://scraper:3000/api/appstore/reviews?id=647268330&page=5
http://scraper:3000/api/appstore/reviews?id=647268330&page=6
http://scraper:3000/api/appstore/reviews?id=647268330&page=7
http://scraper:3000/api/appstore/reviews?id=647268330&page=8
http://scraper:3000/api/appstore/reviews?id=647268330&page=9
http://scraper:3000/api/appstore/reviews?id=647268330&page=10
http://scraper:3000/api/appstore/reviews?id=647268330&page=11


In [None]:
# Srape Google Play Store
scrape_to_file(FS_PREFIX[0])

## Consolidation

The output

In [59]:
def consolidate_files(store):
    sid = FS_PREFIX.index(store)
    
    for index, app in enumerate(APP_IDS[sid]):
        p_dir = 'gp/' + app + '/'
        out_dir = 'out/'
        
        # Make directory
        if not os.path.exists(out_dir):
            os.makedirs(out_dir)
        
        print(p_dir, APP_NAME[index])
        print(os.listdir(p_dir))
        
        out_file = out_dir + store + '_' + APP_NAME[index] + '.csv'
        
        # Combine the CSV files
        combined_csv = pd.concat([pd.read_csv(p_dir + f) for f in os.listdir(p_dir)], sort=True)
        combined_csv.to_csv(out_file, index=False)

In [60]:
for s in FS_PREFIX:
    consolidate_files(s)

gp/io.mvlchain.tada/ ryde
['out3.csv', 'out0.csv', 'out1.csv', 'out2.csv']
gp/com.rydesharing.ryde/ tada
['out3.csv', 'out11.csv', 'out16.csv', 'out13.csv', 'out0.csv', 'out10.csv', 'out9.csv', 'out14.csv', 'out12.csv', 'out5.csv', 'out6.csv', 'out1.csv', 'out17.csv', 'out2.csv', 'out8.csv', 'out15.csv', 'out4.csv', 'out7.csv', 'out18.csv']
gp/com.grabtaxi.passenger/ grab
['out3.csv', 'out107.csv', 'out20.csv', 'out110.csv', 'out11.csv', 'out28.csv', 'out25.csv', 'out109.csv', 'out93.csv', 'out90.csv', 'out104.csv', 'out81.csv', 'out79.csv', 'out50.csv', 'out16.csv', 'out33.csv', 'out108.csv', 'out61.csv', 'out40.csv', 'out71.csv', 'out30.csv', 'out35.csv', 'out13.csv', 'out73.csv', 'out106.csv', 'out49.csv', 'out21.csv', 'out39.csv', 'out89.csv', 'out82.csv', 'out0.csv', 'out26.csv', 'out37.csv', 'out34.csv', 'out92.csv', 'out70.csv', 'out98.csv', 'out51.csv', 'out19.csv', 'out60.csv', 'out69.csv', 'out10.csv', 'out47.csv', 'out9.csv', 'out64.csv', 'out36.csv', 'out97.csv', 'out85.csv

In [None]:
if os.path.exists('gp'):
    os.removedirs('gp')