# Scraping Glassdoor Reviews

> *Advanced Customer Analytics*  
> *MSc in Data Science, Department of Informatics*  
> *Athens University of Economics and Business*

---

<p style='text-align: justify;'>Select an English-language website that hosts customer reviews of products (or services, businesses, movies, events, etc.). Make sure that the website includes a free-text search box that users can use to search for products. Create a first Python notebook with a function called <code>scrape()</code>. The function should accept a query (a word or short phrase) as a parameter. The function should then use <b><i>selenium</i></b> to (1) submit the query to the website's search box and retrieve the list of matching products, and (2) access the first product on the list and download all its reviews into a CSV file. For each review, the function should get the text, the rating and the date. Write one line per review, with three fields per line.</p>

---

##### *Libraries*

In [4]:
import csv, time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager
from selenium.webdriver.common.by import By

##### *Define a function to scrape Glassdoor reviews*

In [10]:
def scrape(query:str, delay_time:int=5):
    """
    Parameters
    ----------
    query: str
        The job title used to filter the reviews on Amazon's Glassdoor page.
    delay_time: int
        Delay in seconds after each page load.
        
    Returns
    -------
    None.
    """
    
    # open a new csv writer
    # to store the information that will be scraped
    fw = open('glassdoor_reviews.csv', 'w', encoding='utf8')
    writer = csv.writer(fw, lineterminator='\n')
    writer.writerow(['text','rating','date'])
    
    # initialize the url in which we will search for reviews
    url = 'https://www.glassdoor.com/Reviews/Amazon-Reviews-E6036.htm'
    
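    # override the default user agent
    # (the string must be passed through the --user-agent flag, otherwise Chrome ignores it)
    opts = Options()
    opts.add_argument('--user-agent=Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36')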
    
    # create a webdriver instance
    driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=opts)
    
    # visit the reviews url
    driver.get(url)
    
    # wait delay_time seconds for the page to load
    time.sleep(delay_time)
    
    # accept cookies
    cookies_button = driver.find_element(by=By.CSS_SELECTOR, value='[id="onetrust-accept-btn-handler"]')
    cookies_button.click()
    
    # insert the query
    search_job_titles_field = driver.find_element(by=By.CSS_SELECTOR, value='[data-test="ContentFiltersJobTitleAC"]')
    search_job_titles_field.send_keys(query)
    
    # click the find reviews button
    find_reviews_button = driver.find_element(by=By.CSS_SELECTOR, value='[data-test="ContentFiltersFindBtn"]')
    find_reviews_button.click()
    
    input() # pause until we have signed in manually in the browser, then press Enter
    
    page_counter = 1 # to keep track of page count
    
    while True:
        
        print(f'Page {page_counter}')
        
        # get all the reviews in the page
        reviews = driver.find_elements(by=By.CSS_SELECTOR, value='[class="noBorder empReview cf pb-0 mb-0"]')
        
        # loop through the reviews
        for review in reviews:
            
            # initialize every field, so that a missing element
            # never leaves a name undefined or stale from the previous review
            pros, cons, rating, date = 'NA', 'NA', 'NA', 'NA'

            # Pros
            try: # to find the review pros box
                pros_box = review.find_element(by=By.CSS_SELECTOR, value='[data-test="pros"]')
            except Exception:
                pros_box = None
            # if box found, extract pros
            if pros_box: pros = pros_box.text.strip()

            # Cons
            try: # to find the review cons box
                cons_box = review.find_element(by=By.CSS_SELECTOR, value='[data-test="cons"]')
            except Exception:
                cons_box = None
            # if box found, extract cons
            if cons_box: cons = cons_box.text.strip()

            # Rating
            try: # to find the review rating
                rating_box = review.find_element(by=By.CSS_SELECTOR, value='[class="ratingNumber mr-xsm"]')
            except Exception:
                rating_box = None
            # if box found, keep the integer part of the rating
            if rating_box: rating = rating_box.text.split('.')[0].strip()

            # Date
            try: # to find the review date
                date_box = review.find_element(by=By.CSS_SELECTOR, value='[class="middle common__EiReviewDetailsStyle__newGrey"]')
            except Exception:
                date_box = None
            # if box found, keep the part before the first dash
            if date_box: date = date_box.text.split('-')[0].strip()

            # concat pros and cons into a single text
            text = pros + '.' + ' *separator* ' + cons + '.'
            
            # write a new row in the csv
            writer.writerow([text, rating, date])
            
        # find the next page button
        next_page_button = driver.find_element(by=By.CSS_SELECTOR, value='[data-test="pagination-next"]')
        
        # check if it's the last page
        if not next_page_button.is_enabled(): break
        
        # move to the next page
        next_page_button.click()
        
        # increment counter
        page_counter += 1
        
        # wait delay_time seconds for the next page to load
        time.sleep(delay_time)
        
    # close the csv file and the browser
    fw.close()
    driver.quit()
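
The string handling inside the loop can also be pulled out into a small pure function, which makes it possible to check the row format without opening a browser. The sketch below is only illustrative: `build_row` is a hypothetical helper that is not part of the notebook, and the sample strings are made-up examples of what the pros, cons, rating and date elements might contain.

```python
def build_row(pros, cons, rating_text, date_text):
    """Hypothetical helper: turn raw element texts into one csv row.

    Any argument may be None when the element was not found on the page.
    """
    # fall back to 'NA' for missing fields, as scrape() does
    pros = pros.strip() if pros else 'NA'
    cons = cons.strip() if cons else 'NA'
    # keep only the integer part of a rating such as '4.0'
    rating = rating_text.split('.')[0].strip() if rating_text else 'NA'
    # keep only the part before the first dash
    date = date_text.split('-')[0].strip() if date_text else 'NA'
    # join pros and cons with the same separator used in scrape()
    text = pros + '.' + ' *separator* ' + cons + '.'
    return [text, rating, date]


# made-up sample values, not real scraped data
print(build_row('Good pay', 'Long hours', '4.0', 'Jan 5, 2022 - Data Scientist'))
```

With a helper like this, the loop body would only need to collect the element texts and pass them on, and a missing element would always produce `'NA'` in the corresponding field.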

##### *Execute function*

In [11]:
query = 'data scientist'
scrape(query)


Page 1
Page 2
Page 3
Page 4
Page 5
Page 6
Page 7
Page 8
Page 9
Page 10
Page 11
Page 12
Page 13
Page 14
Page 15
Page 16
Page 17
Page 18
Page 19
Page 20
Page 21
Page 22
Page 23
Page 24
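
After a run, a quick sanity check is to read the file back and confirm that every line holds the three expected fields. This is a minimal sketch, assuming the file starts with the `text,rating,date` header written by `scrape()`; `check_reviews_csv` is a hypothetical helper name. To stay self-contained, the example writes a tiny made-up file into a temporary directory instead of using real scraped data.

```python
import csv
import os
import tempfile


def check_reviews_csv(path):
    """Hypothetical helper: return the number of review rows, raising on malformed rows."""
    with open(path, encoding='utf8', newline='') as fr:
        reader = csv.reader(fr)
        header = next(reader)
        if header != ['text', 'rating', 'date']:
            raise ValueError(f'unexpected header: {header}')
        count = 0
        for row_number, row in enumerate(reader, start=1):
            # every review must have exactly text, rating and date
            if len(row) != 3:
                raise ValueError(f'review {row_number} has {len(row)} fields')
            count += 1
    return count


# demo on a tiny made-up file (not real Glassdoor data)
demo_path = os.path.join(tempfile.mkdtemp(), 'demo_reviews.csv')
with open(demo_path, 'w', encoding='utf8', newline='') as fw:
    writer = csv.writer(fw, lineterminator='\n')
    writer.writerow(['text', 'rating', 'date'])
    writer.writerow(['Good pay. *separator* Long hours.', '4', 'Jan 5, 2022'])
    writer.writerow(['NA. *separator* NA.', 'NA', 'NA'])

print(check_reviews_csv(demo_path))
```

Pointing the same function at `glassdoor_reviews.csv` would report how many reviews were written across all scraped pages.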


---

*Thank you!*

---