#### Code to scrape data from the Washington Post's Fact Check database

Using data from the WP Fact Check database at https://www.washingtonpost.com/graphics/politics/trump-claims-database/ and Tableau to create a data visualization.   Script uses Selenium for interactive features on the page and BeautifulSoup for scraping the necessary data.


In [39]:
import pandas as pd 
from random import randint
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from bs4 import BeautifulSoup
import re

option = webdriver.ChromeOptions()
option.add_argument('-incognito')
chromedriver = '/Users/vchau76/Desktop/Graduate School/FSB/STAT5006/Final Project/chromedriver' 
driver = webdriver.Chrome(executable_path = chromedriver, options=option)

In [40]:
# Open URL of page to scrape

url = 'https://www.washingtonpost.com/graphics/politics/trump-claims-database/'
driver.get(url)

In [41]:
from IPython.core.display import clear_output
from time import sleep,time
timestart_time = time()

# Loop through and click 'Load more claims' button using selenium - loads 50 new claims each time (total of 13435 claims)

requests = 0
while True:
    try:
        driver.find_element_by_css_selector("button.pg-button").click()
        
        requests += 1
        # Set random time to wait before clicking button again
        sleep(randint(5,15))
        current_time = time()
        elapsed_time = current_time - start_time
        
        # print out each request and frequency
        print('Request: {}; Frequency: {} requests/s'.format(requests, requests/elapsed_time))
        clear_output(wait = True)
        
        button = driver.find_element_by_css_selector("button.pg-button").text
        if 'Load more claims' not in button:
            print("There are no more claims.")
            break
    except NoSuchElementException as error:
        print(error)
        break

There are no more claims.


In [50]:
# Use combination of BeautifulSoup and Selenium for webscraping data

html_soup = BeautifulSoup(driver.page_source, 'lxml')

# List to append all data values
lies = []

    # Container for each lie with all associated data values we are trying to scrape
claims_container = html_soup.find_all('div', class_ ='claim-row')
    # Loop through each container to grab data values for each lie

for container in claims_container:

    dates_elem = container.find('span',class_='label').text # date of lie
    
    analysis_elem = container.find('div',class_='analysis').find('p',class_='pg-bodyCopy').text # Washington Post analysis
    fc_rating_count = container.find_all('span',{'class': 'pinocchio'}) # count number of pinocchios
    fc_rating_elem = len(fc_rating_count)

    # flags for IF statements
    repeated_elem_flag = container.find('span', class_='repeated-total') # flag to determine if/when lie was repeated
    repeated_dates_flag = container.find('div', class_='repeats') # flag to determine if lies are repeated
    no_repeat_flag = container.find('div',class_="details not-expanded") # flag if no repeated dates 
    lies_elem_flag = container.find('p', class_='pg-bodyCopy has-apos') # lie
    
    # using selenium to expand 'Show details' link
    # driver.find_element_by_css_selector("div.expand").find_element_by_css_selector("button.pg-highlight").find_element_by_xpath("//span[@class='franklin-light']").click() 
    if lies_elem_flag:
        lies_elem = container.find('p', class_='pg-bodyCopy has-apos').text.strip('“”') # lie
        
    #checks for repeated instances of lie
    if repeated_elem_flag: 
        repeated_elem = container.find('span', class_='underline--green').text.rstrip('times').strip() # number of times lies repeated
    else:
        repeated_elem = 0
        
    if repeated_dates_flag:
        rp_dates = container.find_all('span','repeat pg-highlight')
        repeated_dates = [dates.text for dates in rp_dates]
        repeated_dates = ', '.join(repeated_dates)
    if no_repeat_flag:
        topic_elem = no_repeat_flag.select_one('p:nth-of-type(1)').text.lstrip('Topic:').strip()
        source_elem = no_repeat_flag.select_one('p:nth-of-type(2)').text.lstrip('Source:').strip()
    else:
        lies_elm = None
        repeated_dates = 0
         
    new = ((dates_elem,int(repeated_elem),repeated_dates,topic_elem,source_elem,lies_elem,analysis_elem,fc_rating_elem))
    lies.append(new)
    
             
df = pd.DataFrame(lies, columns=['date','times repeated','dates repeated','topic','source','lies','analysis','fact check rating'])

In [51]:
df

Unnamed: 0,date,times repeated,dates repeated,topic,source,lies,analysis,fact check rating
0,Oct 09 2019,26,"Oct 09 2019, Oct 08 2019, Oct 07 2019, Oct 07 ...",Ukraine probe,Twitter,A Total Scam by the Do Nothing Democrats. For ...,The issues raised by Trump's phone call with t...,0
1,Oct 09 2019,29,"Oct 09 2019, Oct 09 2019, Oct 09 2019, Oct 07 ...",Ukraine probe,Twitter,The Whistleblower’s facts have been so incorre...,The whistleblower report is correct on key det...,4
2,Oct 09 2019,0,"Oct 09 2019, Oct 09 2019, Oct 09 2019, Oct 07 ...",Ukraine probe,Twitter,The Whistleblower’s lawyer is a big Democrat. ...,The Inspector General for the Intelligence Com...,0
3,Oct 09 2019,22,"Oct 09 2019, Oct 09 2019, Oct 09 2019, Oct 08 ...",Ukraine probe,Twitter,He should be Impeached for Fraud!,Rep. Adam Schiff (D-Calif.) summarized the con...,0
4,Oct 09 2019,0,"Oct 09 2019, Oct 09 2019, Oct 09 2019, Oct 08 ...",Foreign policy,Twitter,"So make all Member Countries pay, not just the...",Trump tweeted this over a news report about ho...,0
5,Oct 09 2019,0,"Oct 09 2019, Oct 09 2019, Oct 09 2019, Oct 08 ...",Foreign policy,Twitter,Fighting between various groups that has been ...,Trump tweeted this to justify his decision to ...,0
6,Oct 09 2019,51,"Oct 09 2019, Oct 09 2019, Sep 18 2019, Sep 11 ...",Foreign policy,Twitter,The United States has spent EIGHT TRILLION DOL...,Trump started making a version of this claim s...,4
7,Oct 09 2019,9,"Oct 09 2019, Oct 09 2019, Sep 11 2019, Aug 29 ...",Foreign policy,Twitter,GOING INTO THE MIDDLE EAST IS THE WORST DECISI...,False. There is zero evidence of his supposed ...,4
8,Oct 09 2019,0,"Oct 09 2019, Oct 09 2019, Sep 11 2019, Aug 29 ...",Miscellaneous,Twitter,"The Do Nothing Democrats are Con Artists, only...",The House of Representatives has passed dozens...,0
9,Oct 09 2019,0,"Oct 09 2019, Oct 09 2019, Sep 11 2019, Aug 29 ...",Ukraine probe,Twitter,"The so-called Whistleblower, before knowing I ...",Trump is citing a memo released to the media t...,0


In [52]:
df['topic'].value_counts()

Immigration            2267
Foreign policy         1534
Trade                  1476
Miscellaneous          1357
Economy                1270
Russia                 1214
Jobs                   1165
Health care             800
Taxes                   522
Biographical record     464
Environment             382
Election                374
Ukraine probe           254
Crime                   191
                         75
Guns                     45
Terrorism                37
Education                 8
Name: topic, dtype: int64

In [53]:
df['source'].value_counts()

Remarks              4067
Campaign rally       2633
Twitter              2573
Interview            1993
Prepared speech      1423
News conference       605
Statement              56
Leaked transcript      45
Vlog                   33
Facebook                7
Name: source, dtype: int64

In [54]:
# Convert date to datetime format
df['date'] = pd.to_datetime(df['date'], format='%b %d %Y')

In [56]:
# Export to CSV file
df_export = df.to_csv('DT_lies.csv',index=False)

In [57]:
# End the Selenium browser session
driver.close()