# Selenium Scraper Using a Headless Browser

**Sjifra de Leeuw**   
*Amsterdam School of Communication Research, University of Amsterdam*   
*Social Media and Political Participation Lab, New York University* 

## What is web scraping?

Web scraping, or "web harvesting", means extracting data from websites. In this example, we scrape the publication history of the New York Times. We first generate a list of URLs, each of which opens the page listing all articles published on one particular day. For each of these days, we then collect the HTML elements we are interested in for every article, i.e. the short description as well as the URL of the full article.


## What is a headless browser and why do I need it?

When you scrape a website, a browser opens that is controlled programmatically by the code you have written. Unless specified otherwise, its graphical user interface appears on your screen, which puts an unnecessary load on your CPU and memory. This can be avoided by using a headless browser, i.e. a browser without a graphical user interface. Google Chrome has announced that it will enable a headless browser in one of its upcoming updates. In the meantime, a headless browser is already available in the Chrome Canary channel. Instructions on how to install a headless driver can be found here: https://duo.com/decipher/driving-headless-chrome-with-python

## What do I need? 

- **Chrome Canary:** https://www.google.com/chrome/canary/
- **Chrome Driver:** https://chromedriver.chromium.org/
- **Selenium:** https://pypi.org/project/selenium/

## Code 

### Preparation

Import the following **libraries**:

In [None]:
# Selenium libraries
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.chrome.options import Options 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException 
from selenium.common.exceptions import NoSuchElementException 
from selenium import webdriver 

# Other libraries
import os  
import time 
import datetime 
import pandas as pd 
import numpy as np

To create the **search queries**, create a vector of dates. Then use these dates to create a vector of URLs, each of which opens a search query for a different date. Pay special attention to the date format in the URL. In our case, dates are formatted as YYYYMMDD, without any hyphens. Dates may be formatted differently on other websites.

In [None]:
start = datetime.datetime.strptime("01-01-1992", "%d-%m-%Y") # start date query
end = datetime.datetime.strptime("01-01-2019", "%d-%m-%Y") # end date query
date_vector = [start + datetime.timedelta(days=x) for x in range(0, (end-start).days)] # dates as datetime objects (end date itself is excluded)
date_vector_chr = np.array([date_obj.strftime('%Y%m%d') for date_obj in date_vector]) # dates as character vector
df = pd.DataFrame(data = {'date': date_vector, 'date_chr': date_vector_chr}) # df containing both vectors
df["url"] = "https://www.nytimes.com/search?dropmab=false&endDate=" + df.date_chr + "&query=&sections=U.S.%7Cnyt%3A%2F%2Fsection%2Fa34d3d6c-c77f-5931-b951-241b4e28681c&sort=best&startDate=" + df.date_chr + "&types=article"
urls = np.array(df.url) # isolated vector of urls
dates = np.array(df.date_chr) # isolated vector of dates as characters
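
The same URLs can also be assembled with the standard library alone, which makes the query parameters and their percent-encoding explicit. The following is a minimal sketch, not part of the original scraper: the helper name `build_search_url` and the names `BASE_URL`, `US_SECTION`, `first_day` and `last_day` are assumptions introduced for illustration, while the parameter values are taken from the url string above.

```python
import datetime
from urllib.parse import urlencode

BASE_URL = "https://www.nytimes.com/search"
# Section filter copied from the url above, shown here before percent-encoding
US_SECTION = "U.S.|nyt://section/a34d3d6c-c77f-5931-b951-241b4e28681c"

def build_search_url(day):
    """Return the search url for a single day (hypothetical helper)."""
    day_chr = day.strftime("%Y%m%d")  # YYYYMMDD, without hyphens
    params = {
        "dropmab": "false",
        "endDate": day_chr,
        "query": "",
        "sections": US_SECTION,
        "sort": "best",
        "startDate": day_chr,
        "types": "article",
    }
    return BASE_URL + "?" + urlencode(params)

first_day = datetime.date(1992, 1, 1)
last_day = datetime.date(2019, 1, 1)  # excluded, as in the range above
days = [first_day + datetime.timedelta(days=x) for x in range((last_day - first_day).days)]
example_urls = [build_search_url(day) for day in days]
print(example_urls[0])
```

Building the query from a dictionary keeps each parameter on its own line, so a different date format or section filter for another website only requires changing one entry.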

Create an **empty dataframe**. The new data collected in the scraping process will be appended to this dataframe.

In [None]:
links = pd.DataFrame(np.empty((0, 3), dtype=str)) # `np.str` was removed from recent NumPy versions; the built-in `str` is equivalent
links = links.rename(columns={0: "url", 1: "text", 2: "date"})

### Functions

`scroll_page(url)` is a function that unfolds the page all the way to the bottom by repeatedly pressing the "show more" button. It:

- checks whether the button is present (using a try-except statement)
- if present, clicks the button
- then does nothing for a randomized number of seconds (to avoid being blocked by the server)
- stops unfolding the page when the button is no longer present (using a try-except statement)

In [None]:
def scroll_page(url):
    # `url` is not used inside the function; it acts on the global `browser`
    # XPath of the "show more" button at the bottom of the result list
    button_xpath = '//*[@id="site-content"]/div/div[2]/div[2]/div/button'
    try:
        button_element = browser.find_element(By.XPATH, button_xpath)
    except NoSuchElementException:
        button_element = None
    while button_element: # is present
        button_element.click() # click button
        time.sleep(np.random.uniform(2, 6)) # wait randomized number of seconds
        try:
            button_element = browser.find_element(By.XPATH, button_xpath)
        except NoSuchElementException:
            button_element = None
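
The loop in `scroll_page()` keeps running for as long as a button is found. The sketch below isolates that click-and-wait logic from Selenium and adds an upper bound on the number of clicks, so that a button which never disappears cannot stall the scraper. It is an illustration under assumptions rather than part of the original code: `click_until_gone`, `find_button`, `max_clicks` and `sleep` are hypothetical names, and `find_button` stands for any function that returns the button element or `None`.

```python
import random
import time

def click_until_gone(find_button, min_wait=2.0, max_wait=6.0, max_clicks=500, sleep=time.sleep):
    """Click the button returned by find_button() until it is gone.

    Returns the number of clicks performed. All names are hypothetical.
    """
    clicks = 0
    while clicks < max_clicks:
        button = find_button()  # returns an element, or None when absent
        if button is None:
            break
        button.click()
        clicks += 1
        sleep(random.uniform(min_wait, max_wait))  # randomized pause between clicks
    return clicks
```

With Selenium, `find_button` could wrap the `try`/`except` lookup used in `scroll_page()`. Passing `sleep` as an argument makes it possible to try the function out without actually waiting.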

`collect(old_dataframe)` is a function that identifies the separate articles, collects their urls and text, and appends them to the dataframe produced in a prior iteration. It does so by:

- looping over a vector of elements containing links (called `link_element` in the code below)
- for each of these elements, reading the `href` attribute, which holds the url
- also saving the text of the element as `link_text`
- if `link_text` indicates that the article is part of a `PRINT EDITION`, it 
  - stores the url in an array called `url_array`
  - stores the text in an array called `text_array` 
  - binds the two arrays as columns of a dataframe called `newdata`
  - adds a new column containing the date of publication via `newdata["date"]`
  - now that `newdata` has the same columns as the old data, appends `newdata` to `old_dataframe`
- after the loop, it `return`s the old dataframe with the new data included.

In [None]:
def collect(old_dataframe):
    # relies on the globals `link_element`, `dates` and `i` set in the scraper loop
    for link in link_element: # for each element in the vector of link elements
        url = link.get_attribute("href") # get the url connected to the element
        link_text = link.text # get the text connected to the element
        if "PRINT EDITION" in link_text: # if text in element contains "PRINT EDITION"
            url_array = np.array(url) # collect url in array
            text_array = np.array(link_text) # collect text in array 
            newdata = pd.DataFrame(data = {'url': url_array, 'text': text_array}, index=[0]) # bind arrays in df
            newdata["date"] = dates[i] # add new column containing the date
            # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
            old_dataframe = pd.concat([old_dataframe, newdata], ignore_index=True)
    return old_dataframe # return the existing df including the new data
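
`collect()` reads `link_element`, `dates` and `i` from the surrounding notebook. As a sketch of the same filtering step with explicit arguments, the function below gathers the matching links as a list of dictionaries, which can be turned into a dataframe in one go. The names `collect_rows`, `FakeLink` and `example_links` are assumptions for illustration; `FakeLink` only mimics the two members of a Selenium element used here (`get_attribute` and `text`), and the example urls are made up.

```python
def collect_rows(link_elements, date, marker="PRINT EDITION"):
    """Return one dict per link whose text contains `marker` (hypothetical helper)."""
    rows = []
    for link in link_elements:
        link_text = link.text
        if marker in link_text:
            rows.append({"url": link.get_attribute("href"), "text": link_text, "date": date})
    return rows

class FakeLink:
    """Stand-in for a Selenium element, used only to make the sketch runnable."""
    def __init__(self, href, text):
        self._href = href
        self.text = text

    def get_attribute(self, name):
        return self._href if name == "href" else None

example_links = [
    FakeLink("https://example.com/a", "PRINT EDITION Some headline"),
    FakeLink("https://example.com/b", "Some other headline"),
]
rows = collect_rows(example_links, "19920101")
print(rows)
```

Calling `pd.DataFrame(rows)` on the result would give a dataframe with the columns `url`, `text` and `date`. Collecting rows in a list first avoids growing a dataframe one row at a time.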

### URL Scraper

The scraper combines these pieces. For each url `i` in the vector of urls, it:

- opens a headless Chrome browser (change the paths to Chrome Canary and the Chrome Driver prior to running) 
- selects the relevant url via `url = urls[i]`
- opens the url via `browser.get(url)`
- tests whether the "show more" button is present via a `try-except` statement
- if present, unfolds the page using the `scroll_page()` function
- creates a vector of link elements 
- for each of these link elements, collects the url, text and date via the `collect()` function
- quits the browser via `browser.quit()` and moves to the next iteration `i`

In [None]:
print("Starting Time: " + str(datetime.datetime.now()))
for i in range(0, len(urls)):
    browser = None
    try: 
        chrome_options = Options()  
        chrome_options.add_argument("--headless")  
        chrome_options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'  
        browser = webdriver.Chrome(executable_path='path/chromedriver', options=chrome_options) 
        url = urls[i]
        browser.get(url) 
        scroll_page(url) # tests for the "show more" button and unfolds the page while it is present
        link_element = browser.find_elements(By.XPATH, '//*[@id="site-content"]/div/div[2]/div[1]/ol/li[.]/div/div/a')
        links = collect(links)
        links.to_csv("nyt_urls_1991_2018.csv", index=False)
        print(str(i) + " out of " + str(len(urls)), end="\r")
    except Exception: # ignore the error and continue with the next url
        pass 
    finally:
        if browser is not None:
            browser.quit() # always close the browser, also after an error
print("End Time: " + str(datetime.datetime.now()))
print("I am done collecting urls") # os.system() would try to run this text as a shell command
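
Errors are swallowed silently in the loop above, so the days that failed have to be reconstructed afterwards. A possible alternative, sketched here under assumptions, is to record the failing items while the loop runs. `run_all`, `scrape_one` and `demo_task` are hypothetical names; `scrape_one` stands for whatever work is done for a single url.

```python
def run_all(items, scrape_one):
    """Apply scrape_one to every item; return (results, failed_items)."""
    results, failed = [], []
    for item in items:
        try:
            results.append(scrape_one(item))
        except Exception as error:  # keep going, but remember what went wrong
            failed.append((item, repr(error)))
    return results, failed

def demo_task(day):
    """Toy task that fails for one specific day."""
    if day == "19920102":
        raise ValueError("page did not load")
    return day + " ok"

results, failed = run_all(["19920101", "19920102", "19920103"], demo_task)
print(results)
print(failed)
```

The `failed` list could then be used directly as the input of a second scraping round.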

Because the `except` clause only contains `pass`, the scraper has ignored all errors and kept collecting even when a page failed. We can now identify the days for which it returned an error, and collect them separately if necessary, by matching the days in our date vector with the days in our dataset. 

In [None]:
# Import new dataset
df = pd.read_csv('path/nyt_urls_1991_2018.csv') # file written by the scraper above; edit path 
df['date'] = pd.to_numeric(df.date, downcast='signed')
# Convert date-vector to numeric
date_vector_numeric = pd.to_numeric(date_vector_chr, downcast='signed')
# Create an empty missing dataframe
missing = pd.DataFrame(np.empty((0, 1)))
missing = missing.rename(columns={0: "date_miss"})
# Check if value in vector is present in data frame  
for i in range(0, len(date_vector_numeric)): 
    date = date_vector_numeric[i]
    if not (df.date == date).any(): # if not, append to missing dataframe
        print(str(i + 1) + " out of " + str(len(date_vector_numeric)), end="\r")
        date_mis = pd.DataFrame({'date_miss': date}, index=[0]) 
        # DataFrame.append was removed in pandas 2.0; pd.concat does the same job
        missing = pd.concat([missing, date_mis], ignore_index=True)
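
The same comparison can be expressed as a set difference, which avoids scanning the whole dataframe once per date. A minimal sketch with made-up values: `expected_dates` and `collected_dates` are hypothetical stand-ins for `date_vector_chr` and the `date` column of the dataset.

```python
expected_dates = ["19920101", "19920102", "19920103", "19920104"]
collected_dates = ["19920101", "19920101", "19920103"]  # dates may repeat, one row per article

# Dates that were requested but never appear in the collected data
missing_dates = sorted(set(expected_dates) - set(collected_dates))
print(missing_dates)  # ['19920102', '19920104']
```

Both sides must have the same type (for instance both strings or both integers) for the comparison to work.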

Then we create the relevant urls for the missing dates and scrape the remaining pages. 

In [None]:
missing.date_miss = pd.to_numeric(missing.date_miss, downcast='signed') # convert float to integer
missing.date_miss = missing.date_miss.astype(str)
missing['url'] = "https://www.nytimes.com/search?dropmab=false&endDate=" + missing.date_miss + "&query=&sections=U.S.%7Cnyt%3A%2F%2Fsection%2Fa34d3d6c-c77f-5931-b951-241b4e28681c&sort=best&startDate=" + missing.date_miss + "&types=article"
urls = np.array(missing.url) # isolated vector of urls
dates = np.array(missing.date_miss) # isolated vector of dates as characters

# Create empty dataframe
links = pd.DataFrame(np.empty((0, 3), dtype=str)) # `np.str` was removed from recent NumPy versions
links = links.rename(columns={0: "url", 1: "text", 2: "date"})

# Scraper
print("Starting Time: " + str(datetime.datetime.now()))
for i in range(0, len(urls)):
    browser = None
    try: 
        chrome_options = Options()  
        chrome_options.add_argument("--headless")  
        chrome_options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'  
        browser = webdriver.Chrome(executable_path='path/chromedriver', options=chrome_options) 
        url = urls[i]
        browser.get(url) 
        scroll_page(url) # tests for the "show more" button and unfolds the page while it is present
        link_element = browser.find_elements(By.XPATH, '//*[@id="site-content"]/div/div[2]/div[1]/ol/li[.]/div/div/a')
        links = collect(links)
        links.to_csv("url_collection_missing.csv", index=False)
        print(str(i) + " out of " + str(len(urls)), end="\r")
    except Exception: # ignore the error and continue with the next url
        pass 
    finally:
        if browser is not None:
            browser.quit() # always close the browser, also after an error
print("End Time: " + str(datetime.datetime.now()))
print("I am done collecting urls") # os.system() would try to run this text as a shell command

Append the newly collected urls to the already collected urls and save as .csv

In [None]:
df1 = pd.read_csv('path/nyt_urls_1991_2018.csv')
df2 = pd.read_csv('path/url_collection_missing.csv')
df = pd.concat([df1, df2], ignore_index=True) # DataFrame.append was removed in pandas 2.0
df.to_csv("urls_complete.csv", index=False)

### Article Scraper

Now that we have obtained a dataset of URLs, we can start scraping the content of each webpage. 

In [1]:
# Selenium libraries
from selenium.webdriver.common.keys import Keys  
from selenium.webdriver.chrome.options import Options 
from selenium.webdriver.common.by import By 
from selenium.webdriver.support.ui import WebDriverWait 
from selenium.webdriver.support import expected_conditions as EC 
from selenium.common.exceptions import TimeoutException 
from selenium.common.exceptions import NoSuchElementException 
from selenium import webdriver 

# Other libraries
import os  
import time 
import datetime 
import pandas as pd 
import numpy as np

In [4]:
# Import dataframe
df = pd.read_csv('path/urls_complete.csv')

# Separate columns into arrays
urls = np.array(df.url) 
dates = np.array(df.date) 
overview = np.array(df.text) 

# Create empty dataframe
content_df = pd.DataFrame(np.empty((0, 4), dtype=str)) # `np.str` was removed from recent NumPy versions
content_df = content_df.rename(columns={0: "url", 1: "text", 2: "date", 3: "content"})    

In [None]:
# Loop over urls and combine all info in dataframe
for i in range(0, len(urls)):
    # obtain values from arrays
    url = urls[i]
    date = dates[i]
    text = overview[i]
    # open url 
    chrome_options = Options()  
    chrome_options.add_argument("--headless")  
    chrome_options.binary_location = '/Applications/Google Chrome Canary.app/Contents/MacOS/Google Chrome Canary'  
    browser = webdriver.Chrome(executable_path='path/chromedriver', options=chrome_options) 
    browser.get(url) 
    # collect text 
    try:
        content_element = browser.find_element(By.XPATH, '//*[@id="story"]')
        content = content_element.text # get the text connected to the element
    except NoSuchElementException: # the page has no story element
        content = "NA"
    browser.quit() # close the browser before moving to the next url
    newdata = pd.DataFrame(data = {'url': url, 'text': text, 'date': date, 'content': content}, index=[0]) # bind values in df  
    content_df = pd.concat([content_df, newdata], ignore_index=True) # append to already existing df
    content_df.to_csv("nyt-urls-with-content-1992-2018.csv", index=False)
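
The loop above rewrites the complete .csv file after every article, which becomes slow as the dataset grows. One possible alternative, sketched here with the standard library only, is to append a single row per article. The function name `append_row`, the constant `FIELDS` and the file name `example_content.csv` are assumptions for illustration and not part of the original code.

```python
import csv
import os

FIELDS = ["url", "text", "date", "content"]

def append_row(path, row):
    """Append one row to a csv file, writing the header only for a new or empty file."""
    write_header = not os.path.exists(path) or os.path.getsize(path) == 0
    with open(path, "a", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=FIELDS)
        if write_header:
            writer.writeheader()
        writer.writerow(row)
```

Inside the loop, a call such as `append_row("example_content.csv", {"url": url, "text": text, "date": date, "content": content})` would then replace the append-and-rewrite steps, and the rows written so far survive if the scraper is interrupted.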