# Step 1: Scraping Presidential Speeches

The Miller Center maintains an archive of presidential speeches at https://millercenter.org/the-presidency/presidential-speeches. The archive contains speeches by presidents from George Washington to Donald Trump. Each speech page lists the president's name, the title, the date, a brief abstract of the speech, and a transcript. Some speeches also have a video.

The archive is a great resource for anyone interested in the history of the United States.

In [None]:
import pandas as pd
import time

from selenium import webdriver
from selenium.webdriver.firefox.service import Service
from selenium.webdriver.common.by import By

from webdriver_manager.firefox import GeckoDriverManager 

from bs4 import BeautifulSoup as bs

import re
import dateparser

## Scrape Links to Speeches from the Miller Center

The first thing we need to do is scrape links to all the presidential speeches. The main page uses JavaScript and only reveals more links as you scroll down. A plain HTTP request would see only the links that are loaded initially, so we use Selenium, which drives a real browser and can scroll the page for us.
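
The scroll-until-nothing-new idea can be sketched as a small helper. This is only an illustration: the function name `scroll_to_bottom` and the `max_rounds` safety cap are assumptions added here, not part of the notebook. It accepts any object that offers Selenium's `execute_script` method.

```python
import time


def scroll_to_bottom(driver, pause=3.0, max_rounds=200):
    """Scroll until the page height stops growing or max_rounds is reached.

    `driver` is any object with Selenium's execute_script method.
    Returns the number of scrolls performed.
    `max_rounds` is a hypothetical safety cap, not something the site requires.
    """
    height = driver.execute_script("return document.body.scrollHeight")
    for rounds in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)  # give the page time to load more links
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == height:
            return rounds  # nothing new was loaded, so we reached the end
        height = new_height
    return max_rounds
```

The cap guards against a page that keeps growing forever; a fixed pause is still a guess, so a slow connection may need a larger `pause`.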


In [None]:
# Start a driver session.

# If you have Selenium 3 installed, use one of these:
#driver = webdriver.Firefox(executable_path=GeckoDriverManager().install()) # this will work on Windows and Mac, and should work on Linux when run the first time
#driver = webdriver.Firefox(executable_path=<insert path to manually downloaded geckodriver>)
#driver = webdriver.Firefox() # use if geckodriver is in your PATH environment variable (which includes the same folder as your notebook)

# If you have Selenium 4 installed, use one of these:
#driver = webdriver.Firefox(service=Service(GeckoDriverManager().install())) # this will work on Windows and Mac, and should work on Linux when run the first time
driver = webdriver.Firefox() # use if geckodriver is in your PATH environment variable (which includes the same folder as your notebook)

# load page with Selenium
driver.get("https://millercenter.org/the-presidency/presidential-speeches")
# implicitly_wait sets a sticky timeout to implicitly wait for an element to be found,
# or a command to complete. It only needs to be called once per session.
driver.implicitly_wait(10)

pause_scroll = 3 # we need to pause after each time we scroll down
previous_page_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(pause_scroll)
    new_page_height = driver.execute_script("return document.body.scrollHeight")
    if new_page_height == previous_page_height:
        break
    previous_page_height = new_page_height
    

page_source = driver.page_source
driver.quit() # end the session and shut down the browser

In [None]:
#retrieve urls to all speeches
bsobject_linkpage = bs(page_source,'lxml')
bs_links = bsobject_linkpage.find_all("a", href = re.compile('presidential-speeches/'))
bs_links[:5] # display the first 5

In [None]:
speech_link_list = []
for link in bs_links:   
    speech_link_list.append(link['href'])

speech_link_list[:5] # display first 5
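
A scraped list of `href` values can contain repeats or relative paths. Whether that happens on this page is an assumption, not something checked here. A minimal sketch of a clean-up step follows; the helper name `normalize_links` is hypothetical, and `BASE_URL` is taken from the archive address above.

```python
from urllib.parse import urljoin

# Host of the archive URL used earlier in this notebook.
BASE_URL = "https://millercenter.org"


def normalize_links(hrefs, base_url=BASE_URL):
    """Make every href absolute and drop repeats, keeping first-seen order."""
    seen = set()
    links = []
    for href in hrefs:
        url = urljoin(base_url, href)  # absolute hrefs pass through unchanged
        if url not in seen:
            seen.add(url)
            links.append(url)
    return links
```

Calling `normalize_links(speech_link_list)` before the scraping loop would avoid loading the same speech twice.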

## For each URL, scrape the speech

For each of the links that we extracted from the main page, we will load the page and use Beautiful Soup to extract the speech text, title, date, about text, and president name.
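
Each of these fields follows the same pattern: look up an element by tag and class, take its text, and fall back to a placeholder if the element is missing. A minimal sketch of that pattern is below. The helper name `text_or_default` is hypothetical, and it only assumes that `soup` offers Beautiful Soup's `find(tag, class_=...)` method.

```python
def text_or_default(soup, tag, css_class, default):
    """Return the stripped text of the first matching element, or `default`.

    `soup` is expected to behave like a BeautifulSoup object: find() returns
    an element with a .text attribute, or None when nothing matches.
    """
    element = soup.find(tag, class_=css_class)
    if element is None:
        return default
    return element.text.strip()
```

With a parsed page, a call such as `text_or_default(bsobject_speechpage, 'p', 'president-name', 'No name available')` would do the same job as one `try`/`except` block, without hiding unrelated errors.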

NOTE: This will take a long time to run, possibly an hour or more (the Miller Center has a lot of speeches!). Also, you may notice that we pause between pages; this keeps us from hitting the server too hard. Some servers will notice and reject your requests if you request pages too quickly from one IP address.
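
The pause between pages can be written as a small loop helper. This is a sketch only: `fetch_politely` and its `jitter` parameter (a small random extra delay) are illustrative additions, and `fetch` stands for whatever function loads and parses one page.

```python
import random
import time


def fetch_politely(links, fetch, pause=2.0, jitter=0.5):
    """Call fetch(link) for each link, sleeping between consecutive requests.

    `jitter` adds up to that many extra seconds at random; it is an optional
    idea introduced for this sketch.
    """
    results = []
    for i, link in enumerate(links):
        if i > 0:
            # wait before every request except the first one
            time.sleep(pause + random.uniform(0, jitter))
        results.append(fetch(link))
    return results
```

Keeping the delay in one place makes it easy to slow the scrape down if the server starts rejecting requests.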

In [None]:
# Looking at the HTML content, the relevant classes are president-name, episode-date,
# speed-loc, about-sidebar--intro, presidential-speeches--title, and view-transcript.
#
# view-transcript content may have multiple "Transcript" headers (h3);
# it will also include a title ending in a colon.
#
# Scrape the speeches.

driver = webdriver.Firefox() # start a new session

pause_between_pages = 2

# create empty lists to store data from each page
title, speech, name, date, about = ([] for i in range(5))

for link in speech_link_list:
    #access speech page with Selenium and find div class "transcript-inner"
    driver.get(link)

    # use beautiful soup to parse the html
    bsobject_speechpage = bs(driver.page_source, 'lxml')

    # scrape the speech text, title, president's name, date of the speech, and the text about the speech
    try:
        title.append(bsobject_speechpage.find('h2', class_="presidential-speeches--title").text.strip())
    except:
        title.append("No title available")
        
    try:
        speech_raw = bsobject_speechpage.find('div', class_="transcript-inner").text.strip().replace('\xa0', '')
        speech.append(re.sub(r"Transcript|\n","",speech_raw)) 
    except:
        try: # older links use the class view-transcript instead of transcript-inner; if transcript-inner doesn't work, try view-transcript
            speech_raw = bsobject_speechpage.find('div', class_="view-transcript").text.strip().replace('\xa0', '')
            speech.append(re.sub(r"Transcript|\n"," ",speech_raw)) 
        except:
            speech.append("No speech available")
    
    try:
        name.append(bsobject_speechpage.find('p', class_="president-name").text.strip())
    except:
        name.append("No name available")
    
    try:
        date.append(dateparser.parse(bsobject_speechpage.find('p', class_="episode-date").text.strip()))
    except:
        date.append("No date available")
        
    try:
        about.append(bsobject_speechpage.find('div', class_="about-sidebar--intro").text.strip())
    except:
        about.append("No info available")
    
    # pause before getting next page
    time.sleep(pause_between_pages)

driver.quit() # end the session and shut down the browser

In [None]:
speech[:5]

## Create dataframe and save to csv

Finally, we assemble all our data into a dataframe and save it to a csv file. We will use this csv file in our next notebook.
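
Two things can go wrong at this step: the scraped lists may differ in length, and `to_csv` does not create a missing output folder. The sketch below covers both checks. The helper names `column_lengths` and `ensure_output_dir` are hypothetical.

```python
from pathlib import Path


def column_lengths(**columns):
    """Map each column name to the length of its list."""
    return {name: len(values) for name, values in columns.items()}


def ensure_output_dir(csv_path):
    """Create the parent folder of csv_path if it does not exist yet."""
    path = Path(csv_path)
    path.parent.mkdir(parents=True, exist_ok=True)
    return path
```

For example, `column_lengths(title=title, speech=speech, name=name, date=date, about=about)` shows at a glance which list is short, and `ensure_output_dir("./data/presidential_speeches.csv")` can be called once before saving.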

In [None]:
#save this to a dataframe and save to a csv file#
if len(title) == len(speech) == len(name) == len(date) == len(about):
    speeches_presidents = pd.DataFrame({'name':name,'title':title,'date':date,'about':about,'speech':speech})
    # add a space after every period so sentences joined during cleaning are separated
    speeches_presidents['speech'] = speeches_presidents['speech'].str.replace(".", ". ", regex=False)
    speeches_presidents.to_csv("./data/presidential_speeches.csv",index=False)
else:
    print("Something went wrong with scraping the speeches. Please check the code.")
    
    # dump the data to csv files for debugging
    df_names = pd.DataFrame({'name':name}) 
    df_names.to_csv("./data/names.csv",index=False)
    
    df_titles = pd.DataFrame({'title':title})
    df_titles.to_csv("./data/titles.csv",index=False)
    
    df_dates = pd.DataFrame({'date':date})
    df_dates.to_csv("./data/dates.csv",index=False)
    
    df_infos = pd.DataFrame({'about':about})
    df_infos.to_csv("./data/about.csv",index=False)
    
    df_speeches = pd.DataFrame({'speech':speech})
    df_speeches.to_csv("./data/speeches.csv",index=False)