# Scraping app store reviews

In the Washington Post's project, reporters found a "secret API" that let them download every App Store review for their target "random chat apps." We're going to download reviews from the marketing platform Sensor Tower instead. The original walkthrough targeted Chat with Strangers, Yubo, Holla, and Skout; the links in this notebook point at Babylon Health, WebMD, and Symptomate.

Sensor Tower's reviews section doesn't have a download button, so we use a Selenium web scraper to collect the reviews page by page instead.


In [2]:
from bs4 import BeautifulSoup
import pandas as pd
import time
import numpy as np

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

## Select your options and scrape

After you log in, select the following options so that you only scrape US-based reviews. This is mostly to keep everything in English, since we won't be able to manually spot racism and similar content in non-English reviews.

* **Date:** All time
* **Country:** US

In [3]:
links = [
    'https://sensortower.com/ios/US/babylon-health/app/babylon-health/858558101/review-history?selected_tab=reviews',
]

In [4]:
# These standalone helpers expect a global `driver` to exist already;
# get_reviews further down wraps the same logic around its own driver.
def get_page():
    # Parse the table of reviews currently rendered in the browser
    doc = BeautifulSoup(driver.page_source, 'html.parser')
    rows = doc.select("tbody tr")

    datapoints = []
    for row in rows:
        cells = row.select("td")
        data = {
            'Country': cells[0].text.strip(),
            'Date': cells[1].text.strip(),
            # The rating is only exposed as the CSS width of the gold star bar
            'Rating': cells[2].select_one('.gold')['style'],
            'Review': cells[3].select_one('.break-wrap-review').text.strip(),
            'Version': cells[4].text.strip()
        }
        datapoints.append(data)
    return datapoints

def save_data():
    all_data = []
    wait = WebDriverWait(driver, 5, poll_frequency=0.05)
    while True:
        # Wait for the loading overlay to disappear before reading the table
        wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))

        results = get_page()
        all_data.extend(results)

        # The "next page" control is picked by position among the flat buttons
        next_button = driver.find_elements(By.CSS_SELECTOR, ".universal-flat-button")[6]
        if next_button.get_attribute('disabled'):
            break
        next_button.click()
        time.sleep(0.5)
        # Doesn't trigger fast enough!
        # wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))

    df = pd.DataFrame(all_data)
    driver.quit()
    return df
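
To see what `get_page` extracts without opening a browser, here is a minimal sketch that runs the same selectors against a hand-written table row. The HTML below is made up for illustration: it only mimics the class names the scraper relies on (`.gold` and `.break-wrap-review`), and the real Sensor Tower markup may differ. `parse_rows` is a hypothetical name.

```python
from bs4 import BeautifulSoup

# Hypothetical markup that mimics one row of the reviews table
SAMPLE_HTML = """
<table><tbody>
  <tr>
    <td> US </td>
    <td> 06/08/2020 </td>
    <td><div class="stars"><div class="gold" style="width: 99%;"></div></div></td>
    <td><div class="break-wrap-review"> Works well </div></td>
    <td> 4.22.0 </td>
  </tr>
</tbody></table>
"""

def parse_rows(html):
    """Turn each table row into a dict, mirroring get_page."""
    doc = BeautifulSoup(html, "html.parser")
    datapoints = []
    for row in doc.select("tbody tr"):
        cells = row.select("td")
        datapoints.append({
            "Country": cells[0].text.strip(),
            "Date": cells[1].text.strip(),
            # The rating is only available as the CSS width of the gold bar
            "Rating": cells[2].select_one(".gold")["style"],
            "Review": cells[3].select_one(".break-wrap-review").text.strip(),
            "Version": cells[4].text.strip(),
        })
    return datapoints

sample_rows = parse_rows(SAMPLE_HTML)
print(sample_rows)
```

Note that the rating comes back as a raw style string, which is why it needs cleaning up later.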

In [9]:
def get_reviews(link):
    # Selenium 3 style call; Selenium 4.10+ expects
    # webdriver.Chrome(service=Service('../chromedriver')) instead
    driver = webdriver.Chrome('../chromedriver')
    driver.get(link)

    def get_page():
        # Parse the table of reviews currently rendered in the browser
        doc = BeautifulSoup(driver.page_source, 'html.parser')
        rows = doc.select("tbody tr")

        datapoints = []
        for row in rows:
            cells = row.select("td")
            data = {
                'Country': cells[0].text.strip(),
                'Date': cells[1].text.strip(),
                # The rating is only exposed as the CSS width of the gold star bar
                'Rating': cells[2].select_one('.gold')['style'],
                'Review': cells[3].select_one('.break-wrap-review').text.strip(),
                'Version': cells[4].text.strip()
            }
            datapoints.append(data)
        return datapoints

    def save_data():
        all_data = []
        wait = WebDriverWait(driver, 5, poll_frequency=0.05)
        while True:
            # Wait for the loading overlay to disappear before reading the table
            wait.until(EC.invisibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))

            results = get_page()
            all_data.extend(results)

            # The "next page" control is picked by position among the flat buttons
            next_button = driver.find_elements(By.CSS_SELECTOR, ".universal-flat-button")[5]
            if next_button.get_attribute('disabled'):
                break
            next_button.click()
            time.sleep(0.5)
            # Doesn't trigger fast enough!
            # wait.until(EC.visibility_of_element_located((By.CSS_SELECTOR, '.ajax-loading-cover')))

        return pd.DataFrame(all_data)

    try:
        return save_data()
    finally:
        # Close the browser even when scraping fails, so retries don't pile up windows
        driver.quit()


In [7]:
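def scrapetocsv(name, link):
    result = None
    # Keep retrying until the scrape succeeds; interrupt the kernel to stop
    while result is None:
        try:
            result = get_reviews(link)
        except Exception:
            # Pause before retrying (for example after a wait timeout or an intercepted click)
            time.sleep(5)
    result.to_csv(name + ".csv", index=False)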
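
The loop above retries forever, which can hang the notebook if a link is wrong or the login has expired. As one possible alternative, the sketch below caps the number of attempts. `retry` and `flaky_scrape` are hypothetical names introduced for this illustration, and the flaky function only stands in for `get_reviews`.

```python
import time

def retry(func, attempts=3, delay=0.0):
    """Call func() until it succeeds or `attempts` tries have failed."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as error:  # KeyboardInterrupt is not caught here
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"gave up after {attempts} attempts") from last_error

# Stand-in for get_reviews: fails twice, then succeeds
calls = {"count": 0}

def flaky_scrape():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("page not ready")
    return ["review 1", "review 2"]

result = retry(flaky_scrape, attempts=5)
print(result, calls["count"])
```

With the real scraper this could be called as `retry(lambda: get_reviews(link), attempts=5, delay=5)`.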

In [53]:
scrapetocsv('babylon_itunes','https://sensortower.com/ios/US/babylon-health/app/babylon-health/858558101/review-history?selected_tab=reviews')

In [12]:
scrapetocsv('babylong_gp','https://sensortower.com/android/US/babylon-health/app/babylon-health-services-speak-to-a-doctor-24-7/com.babylon/review-history?selected_tab=reviews')

In [54]:
scrapetocsv('webmd_itunes','https://sensortower.com/ios/US/webmd/app/webmd-symptoms-doctors-rx/295076329/review-history?selected_tab=reviews')

In [10]:
scrapetocsv('webmd_gp','https://sensortower.com/android/US/webmd-llc/app/webmd-check-symptoms-find-doctors-rx-savings/com.webmd.android/review-history?selected_tab=reviews')

In [11]:
scrapetocsv('symptomate_gp','https://sensortower.com/android/US/infermedica/app/symptomate-symptom-checker/com.symptomate.mobile/review-history?selected_tab=reviews')

## Combine and add columns

Once we've saved reviews for several different apps, we're ready to go. We'll combine them all into a single file and add a `source` column recording which app and store each review came from.

In [13]:
babylon_itunes = pd.read_csv('babylon_itunes.csv')
babylon_itunes['source'] = 'babylon_itunes'

babylong_gp = pd.read_csv('babylong_gp.csv')
babylong_gp['source'] = 'babylong_gp'

webmd_itunes = pd.read_csv('webmd_itunes.csv')
webmd_itunes['source'] = 'webmd_itunes'

webmd_gp = pd.read_csv('webmd_gp.csv')
webmd_gp['source'] = 'webmd_gp'

symptomate_gp = pd.read_csv('symptomate_gp.csv')
symptomate_gp['source'] = 'symptomate_gp'

In [20]:
df = pd.concat([babylon_itunes, babylong_gp, webmd_itunes, webmd_gp, symptomate_gp], ignore_index=True)
df.shape

(452, 6)

In [21]:
df.source.value_counts()

babylon_itunes    183
babylong_gp       165
webmd_itunes       55
webmd_gp           44
symptomate_gp       5
Name: source, dtype: int64

We'll also add columns for racism, bullying, and unwanted sexual behavior. While we don't know which reviews contain this content yet, we'll use these columns to mark it in Excel or Google Sheets later.
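
The notebook does not show that step, so here is a minimal sketch of how the empty labelling columns could be added. The column names `racism`, `bullying`, and `unwanted_sexual_behavior` are assumptions chosen for illustration, and `toy` is a made-up two-row frame rather than the scraped data.

```python
import pandas as pd

# Made-up stand-in for the combined reviews frame
toy = pd.DataFrame({
    "Review": ["Great app", "Don't waste your time"],
    "source": ["example_itunes", "example_gp"],
})

# Hypothetical label columns, left blank for manual coding in a spreadsheet
label_columns = ["racism", "bullying", "unwanted_sexual_behavior"]
for column in label_columns:
    toy[column] = ""

print(toy.columns.tolist())
```

Blank strings keep the cells empty when the file is opened in Excel or Google Sheets.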

### Clean up the rating

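We don't have numeric ratings yet! The `Rating` column holds the CSS width of the star bar (for example `width: 99%;`) instead of a number, so let's convert each of those widths to a star count from 1 to 5.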
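
The cells below use a fixed lookup table, which works because only five widths occur. As a sketch of a more general alternative, the percentage can be parsed out of the style string and rounded up to the nearest star. This assumes each star corresponds to 20% of the bar's width, which is inferred from the five values seen here rather than documented by Sensor Tower. `width_to_stars` is a hypothetical helper.

```python
import math
import re

def width_to_stars(style):
    """Convert a CSS style such as 'width: 99%;' to a 1-5 star rating."""
    match = re.search(r"width:\s*(\d+(?:\.\d+)?)%", style)
    if match is None:
        raise ValueError(f"no width percentage in {style!r}")
    percent = float(match.group(1))
    # Assumes 20% of the bar per star, so 99% -> 5 and 19% -> 1
    return math.ceil(percent / 20)

widths = ["width: 99%;", "width: 79%;", "width: 59%;", "width: 39%;", "width: 19%;"]
stars = [width_to_stars(w) for w in widths]
print(stars)
```

In pandas this could be applied with `df.Rating.map(width_to_stars)`.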

In [22]:
df.Rating.value_counts()

width: 99%;    276
width: 19%;    112
width: 39%;     27
width: 79%;     20
width: 59%;     17
Name: Rating, dtype: int64

In [23]:
df.Rating = df.Rating.replace({
    'width: 99%;': 5,
    'width: 79%;': 4,
    'width: 59%;': 3,
    'width: 39%;': 2,
    'width: 19%;': 1
})
df.head()

| | Country | Date | Rating | Review | Version | source |
|---|---|---|---|---|---|---|
| 0 | Great Britain | 06/08/2020 | 2 | It’s not great, after verifying ID it keeps as... | 4.22.0 | babylon_itunes |
| 1 | Great Britain | 06/08/2020 | 1 | Don’t waste your time | - | babylon_itunes |
| 2 | Great Britain | 06/06/2020 | 5 | Best thing I ever did was swap to Babylon. In ... | 4.22.0 | babylon_itunes |
| 3 | Great Britain | 06/06/2020 | 5 | Clearly much investment has gone into this app... | 4.22.0 | babylon_itunes |
| 4 | Great Britain | 06/05/2020 | 1 | Staffed by brand new baby GPs with no ability ... | 4.22.0 | babylon_itunes |


In [52]:
df.shape

(452, 6)

In [24]:
df.Rating.value_counts()

5    276
1    112
2     27
4     20
3     17
Name: Rating, dtype: int64

In [25]:
df.to_csv("allreviews.csv", index=False)

In [27]:
reviews = pd.read_csv("allreviews.csv")

In [28]:
reviews['Review']

0      It’s not great, after verifying ID it keeps as...
1                                  Don’t waste your time
2      Best thing I ever did was swap to Babylon. In ...
3      Clearly much investment has gone into this app...
4      Staffed by brand new baby GPs with no ability ...
                             ...                        
447    Didn't even get to use the app the age for thi...
448    The map given for region not showing names so ...
449                            That was totally worth ❤️
450                                      Brillaint app++
451                  Nada ve confunde coisa com coisa ok
Name: Review, Length: 452, dtype: object

In [50]:
# na=False treats missing reviews as non-matches instead of raising an error
r = reviews[reviews['Review'].str.contains("box|search|suggestion|text|survey", na=False)]
r

| | Country | Date | Rating | Review | Version | source |
|---|---|---|---|---|---|---|
| 4 | Great Britain | 06/05/2020 | 1 | Staffed by brand new baby GPs with no ability ... | 4.22.0 | babylon_itunes |
| 114 | Great Britain | 04/22/2020 | 5 | After hurting myself I used the Babylon app to... | 4.20.3 | babylon_itunes |
| 130 | Great Britain | 04/11/2020 | 1 | Rubbish dangerous app that will try and remove... | 4.20.1 | babylon_itunes |
| 169 | Great Britain | 03/19/2020 | 1 | Rubbish dangerous app that will try and remove... | 4.18.0 | babylon_itunes |
| 317 | English | 03/25/2020 | 4 | The app is pretty easy to use and has a sleek ... | 4.18.1.36032000 | babylong_gp |
| 420 | English | 04/28/2020 | 5 | Very accurate suggestions if you can give mult... | 7.8.4 | webmd_gp |
| 425 | English | 04/20/2020 | 5 | I love having this app at my disposal. I dont ... | 7.8.4 | webmd_gp |


In [51]:
r.to_csv("keywords_review.csv")
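
One caveat about the keyword filter: `str.contains` is case-sensitive by default, so a review containing "Search" or "TEXT" would be missed. The sketch below shows the difference using Python's `re` module on a few made-up reviews. In pandas, passing `case=False` to `str.contains` has the same effect.

```python
import re

KEYWORDS = re.compile(r"box|search|suggestion|text|survey")
KEYWORDS_ANY_CASE = re.compile(KEYWORDS.pattern, re.IGNORECASE)

# Made-up reviews for illustration only
sample_reviews = [
    "The search never finds my symptoms",
    "Search is broken",
    "Love it",
    "TEXT is too small",
]

# Keep the reviews that mention at least one keyword
case_sensitive = [r for r in sample_reviews if KEYWORDS.search(r)]
any_case = [r for r in sample_reviews if KEYWORDS_ANY_CASE.search(r)]
print(len(case_sensitive), len(any_case))
```

Both versions match substrings, so "text" also matches words such as "context"; whether that matters depends on what the keyword list is meant to catch.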