# Scraping Statements from Politifact
This notebook gathers statements and related information from Politifact. My aim is to build a dataset large enough to train a classifier that recognizes whether a sentence is worth fact-checking.

## Imports

In [13]:
import numpy as np
from bs4 import BeautifulSoup
import requests
import time
import json
import pandas as pd
import datetime

## Crawling Politifact
First, I will retrieve web pages from Politifact containing statements from various politicians.

In [2]:
page_start = 1
last_page = 2000
seconds_pause = 2

for page_number in range(page_start, last_page):
    try:
        #Request one page of fact-checks from the Politifact API
        page_url = 'https://www.politifact.com/api/factchecks/?page={}'.format(page_number)
        response = requests.get(page_url)
        #Treat HTTP error responses as failures instead of saving them
        response.raise_for_status()

        #Save the raw response text in a file (the html_pages folder must already exist)
        with open("html_pages/{}.html".format(page_number), "w", encoding="utf-8") as f:
            f.write(response.text)

        print('scraped page {}'.format(page_number))
        #Pause between requests to avoid overloading the server
        time.sleep(seconds_pause)

    except Exception as e:
        print('\nfailed to scrape page {}\n{}\n'.format(page_number, e))

scraped page 1691
scraped page 1692
scraped page 1693
scraped page 1694
scraped page 1695
scraped page 1696
scraped page 1697
scraped page 1698
scraped page 1699
scraped page 1700
scraped page 1701
scraped page 1702
scraped page 1703
scraped page 1704
scraped page 1705
scraped page 1706
scraped page 1707
scraped page 1708
scraped page 1709
scraped page 1710
scraped page 1711
scraped page 1712
scraped page 1713
scraped page 1714
scraped page 1715
scraped page 1716
scraped page 1717
scraped page 1718
scraped page 1719
scraped page 1720
scraped page 1721
scraped page 1722
scraped page 1723
scraped page 1724
scraped page 1725
scraped page 1726
scraped page 1727
scraped page 1728
scraped page 1729
scraped page 1730
scraped page 1731
scraped page 1732
scraped page 1733
scraped page 1734
scraped page 1735
scraped page 1736
scraped page 1737
scraped page 1738
scraped page 1739
scraped page 1740
scraped page 1741
scraped page 1742
scraped page 1743
scraped page 1744
scraped page 1745
scraped pa
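
The loop above prints failed pages but does not remember them. A minimal sketch of how failures could be collected for a later retry, assuming the download is wrapped in a function passed in as `fetch` (both `crawl_pages` and `fake_fetch` are hypothetical names used only for illustration; a real fetcher would wrap `requests.get`):

```python
import time

def crawl_pages(fetch, page_numbers, pause_seconds=2.0):
    """Call fetch(page_number) for each page; return (pages, failed)."""
    pages = {}
    failed = []
    for page_number in page_numbers:
        try:
            pages[page_number] = fetch(page_number)
        except Exception as error:
            print("failed to scrape page {}: {}".format(page_number, error))
            failed.append(page_number)
        # Pause after every attempt, successful or not
        time.sleep(pause_seconds)
    return pages, failed

# Stand-in fetcher that simulates one failing page (no network access)
def fake_fetch(page_number):
    if page_number == 2:
        raise RuntimeError("simulated network error")
    return "content of page {}".format(page_number)

pages, failed = crawl_pages(fake_fetch, range(1, 4), pause_seconds=0)
print(failed)
```

The `failed` list can then be passed back to `crawl_pages` for a second attempt.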

## Retrieve statements
The data is currently in a JSON-like form, but it contains numerous non-UTF-8 symbols that cause the json and pandas parsers to crash. I therefore wrote my own functions to work around the problem and build an array containing all the statements from Politifact, together with their author, targets, and truthfulness rating.

In [2]:
#Utility functions for handling Politifact pages

#Removes "Says " from the beginning of the sentence and replaces curly quotes with straight ones
def clean_sentence(sentence):

    sentence = sentence.replace("”", '"')
    sentence = sentence.replace("’", "'")
    sentence = sentence.replace("“", '"')
    if sentence.startswith("Says "):
        sentence = sentence[5:]

    return sentence

#Retrieves the full name of the author
def clean_author(author):
    return author.split('"full_name":"')[1].split('","first_name"')[0]

#Retrieves the name(s) of the targets
def clean_target(target):
    
    num_targets = target.count("full_name")
    
    #There might be no target
    if num_targets == 0:
        return []
    
    #Return targets inside an array
    targets = target.split('"full_name":"')[1:]
    result = []
    for t in targets:
        result.append(t.split('","first_name"')[0])
    return result
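
The parsing in this section works by cutting the raw text between two known markers. A minimal sketch of that idea on a made-up fragment (the `between` helper and the `row` contents are hypothetical and only mimic the field names used in this notebook):

```python
def between(text, start, end):
    """Return the part of text after the first `start` and before the next `end`."""
    return text.split(start, 1)[1].split(end, 1)[0]

# Made-up fragment for illustration only; real rows come from the saved pages
row = ('123,"slug":"some-claim","speaker":{"slug":"jane-doe",'
       '"full_name":"Jane Doe","first_name":"Jane"},"targets":[],'
       '"statement":"Says taxes went up.","ruling_slug":"half-true",'
       '"publication_date":"2020-08-19T10:00:00"')

sentence = between(row, '"statement":"', '","ruling_slug')
ruling = between(row, '"ruling_slug":"', '","publication_date":')
date = between(row, '"publication_date":"', 'T')
print(sentence, ruling, date)
```

This approach depends on the fields always appearing in the same order, so a change in the API's field order would break the markers.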

In [3]:
statements = []
for num_page in range(1,1830):

    # Load a saved page
    print("Analyzing page nr ", num_page)
    with open("html_pages/{}.html".format(num_page), "r", encoding="utf-8") as f:
        html = f.read()

    #Remove the header
    json_rows = html.split('{"id":')[1:]

    #Retrieve the interesting information
    for row in json_rows:
        statement = {}

        sentence = row.split('"statement":"')[1].split('","ruling_slug')[0]
        author = row.split('"speaker":')[1].split(',"targets"')[0]
        target = row.split('"targets":')[1].split(',"statement"')[0]
        ruling = row.split('"ruling_slug":"')[1].split('","publication_date":')[0]
        date = row.split('"publication_date":"')[1].split('T')[0]

        statement['sentence'] = clean_sentence(sentence)
        statement['author'] = clean_author(author)
        statement['target'] = clean_target(target)
        statement['ruling'] = ruling
        statement['date'] = date

        statements.append(statement)
        
print("We gathered: {0} statements".format(len(statements)))

Analyzing page nr  1
Analyzing page nr  2
Analyzing page nr  3
Analyzing page nr  4
Analyzing page nr  5
Analyzing page nr  6
Analyzing page nr  7
Analyzing page nr  8
Analyzing page nr  9
Analyzing page nr  10
Analyzing page nr  11
Analyzing page nr  12
Analyzing page nr  13
Analyzing page nr  14
Analyzing page nr  15
Analyzing page nr  16
Analyzing page nr  17
Analyzing page nr  18
Analyzing page nr  19
Analyzing page nr  20
Analyzing page nr  21
Analyzing page nr  22
Analyzing page nr  23
Analyzing page nr  24
Analyzing page nr  25
Analyzing page nr  26
Analyzing page nr  27
Analyzing page nr  28
Analyzing page nr  29
Analyzing page nr  30
Analyzing page nr  31
Analyzing page nr  32
Analyzing page nr  33
Analyzing page nr  34
Analyzing page nr  35
Analyzing page nr  36
Analyzing page nr  37
Analyzing page nr  38
Analyzing page nr  39
Analyzing page nr  40
Analyzing page nr  41
Analyzing page nr  42
Analyzing page nr  43
Analyzing page nr  44
Analyzing page nr  45
Analyzing page nr  

## Some analysis
Though unrelated to the purpose of this work, it could be interesting to check a few simple statistics on the data.

In [5]:
df = pd.DataFrame(statements)
df['date'] = pd.to_datetime(df['date'])

In [6]:
len(df)

17580

In [7]:
df.head()

| | sentence | author | target | ruling | date |
|---|---|---|---|---|---|
| 0 | Kamala Harris said, "I don't like Joe Biden an... | Facebook posts | [Joe Biden, Kamala Harris] | false | 2020-08-19 |
| 1 | Condoleezza Rice said, "If you are taught bitt... | Facebook posts | [] | half-true | 2020-08-19 |
| 2 | A coloring book that describes Joe Biden as "A... | Viral image | [Joe Biden] | false | 2020-08-19 |
| 3 | "In West Virginia alone, overdoses have increa... | Carol Miller | [] | mostly-true | 2020-08-19 |
| 4 | The United States is "the only major industria... | Bill Clinton | [] | mostly-true | 2020-08-18 |


In [8]:
df["author"].value_counts()

Donald Trump               828
Facebook posts             783
Barack Obama               589
Bloggers                   585
Viral image                464
                          ... 
Josh Pade                    1
Activist Mommy               1
Fellowship of the Minds      1
James Roosevelt              1
Senfronia Thompson           1
Name: author, Length: 4121, dtype: int64

In [9]:
df["ruling"].value_counts()

false          3826
half-true      3199
mostly-true    3035
barely-true    2913
true           2327
pants-fire     2038
full-flop       152
half-flip        66
no-flip          24
Name: ruling, dtype: int64

In [10]:
df["target"].value_counts()

TypeError: unhashable type: 'list'

Exception ignored in: 'pandas._libs.index.IndexEngine._call_map_locations'
Traceback (most recent call last):
  File "pandas\_libs\hashtable_class_helper.pxi", line 1653, in pandas._libs.hashtable.PyObjectHashTable.map_locations
TypeError: unhashable type: 'list'


[]                              13722
[Barack Obama]                    494
[Donald Trump]                    252
[Scott Walker]                    155
[Hillary Clinton]                 149
                                ...  
[Brian Robinson]                    1
[Scott Fitzgerald]                  1
[Charlie Crist, Marco Rubio]        1
[Wayne LaPierre]                    1
[Morgan Griffith]                   1
Name: target, Length: 849, dtype: int64
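
The `TypeError` in the output above is raised because Python lists are unhashable. A minimal sketch of one way around it, using only the standard library and a made-up sample (`sample_statements` is hypothetical; the real input would be the `statements` list built earlier):

```python
from collections import Counter

# Made-up sample in the same shape as the entries of `statements`
sample_statements = [
    {"sentence": "A", "target": ["Joe Biden", "Kamala Harris"]},
    {"sentence": "B", "target": []},
    {"sentence": "C", "target": ["Joe Biden"]},
]

# Count each individual target name across all statements
target_counts = Counter(
    name for statement in sample_statements for name in statement["target"]
)

# Count combinations of targets; tuples are hashable, lists are not
combination_counts = Counter(
    tuple(statement["target"]) for statement in sample_statements
)

print(target_counts.most_common())
```

Counting individual names answers "who is targeted most often", while counting tuples reproduces the per-combination view shown in the output above.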

In [12]:
df['date']

0       2020-08-19
1       2020-08-19
2       2020-08-19
3       2020-08-19
4       2020-08-18
           ...    
11015   2013-03-11
11016   2013-03-11
11017   2013-03-11
11018   2013-03-11
11019   2013-03-11
Name: date, Length: 11020, dtype: datetime64[ns]

In [12]:
df.to_csv("checkworthy_claims.csv", index=False)

## Retrieve articles url
The previous step was necessary to create a dataset of checkworthy claims. Now we want to build a dataset of fact-checking articles.

Most of the information could be gathered directly from the files obtained through the Politifact API, but since some fields (such as claimDate or imageUrl) are not available there, I prefer to build a dataset of URLs and then scrape the articles directly from the website.

Politifact URLs have this form:
 - https://www.politifact.com/factchecks/2020/aug/18/viral-image/photo-doesnt-show-kamala-harris-listed-caucasian-h/
 - "https://www.politifact.com/factchecks/" + year + "/" + month (3 letters) + "/" + day + "/" + speaker_slug + "/" + article_slug + "/"

We will put the URLs in a dataset that contains:
 - claim
 - claimant
 - url
 - reviewDate   
 - Rating     
 - languageCode (en)
 - publisherName (Politifact)
 - publisherSite (politifact.com)
 - UniformRating (to get from Rating)
 - claimLink (null)

We will then scrape the following info:
 - claimDate
 - reviewTitle
 - articleBody
 - imageUrl

For reference on what these fields mean, please take a look at [this notebook](https://colab.research.google.com/drive/15dYezybEwZXGbwvWfgQRfBWrImGxqisF?usp=sharing).

In [23]:
def build_url(article_slug, speaker_slug, date):
    url = "https://www.politifact.com/factchecks/"
    url += str(date.year)
    url += "/"
    url += date.strftime("%b").lower()
    url += "/"
    url += str(date.day)
    url += "/"
    url += speaker_slug
    url += "/"
    url += article_slug
    url += "/"
    return url
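
One caveat with `build_url` is that `strftime("%b")` follows the system locale, so on a non-English system the month abbreviation may not match the site's URLs. A minimal sketch of a locale-independent variant, checked against the example URL given above (`build_url_fixed_months` and `MONTH_ABBREVIATIONS` are hypothetical names; like `build_url`, it leaves the day unpadded):

```python
import datetime

# English month abbreviations, written out so the result does not depend on the locale
MONTH_ABBREVIATIONS = ("jan", "feb", "mar", "apr", "may", "jun",
                       "jul", "aug", "sep", "oct", "nov", "dec")

def build_url_fixed_months(article_slug, speaker_slug, date):
    """Hypothetical variant of build_url that avoids strftime("%b")."""
    return "https://www.politifact.com/factchecks/{}/{}/{}/{}/{}/".format(
        date.year,
        MONTH_ABBREVIATIONS[date.month - 1],
        date.day,
        speaker_slug,
        article_slug,
    )

example = build_url_fixed_months(
    "photo-doesnt-show-kamala-harris-listed-caucasian-h",
    "viral-image",
    datetime.datetime(2020, 8, 18),
)
print(example)
```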

In [44]:
statements = []
for num_page in range(1,1830):

    # Load a saved page
    print("Analyzing page nr ", num_page)
    with open("html_pages/{}.html".format(num_page), "r", encoding="utf-8") as f:
        html = f.read()

    #Remove the header
    json_rows = html.split('{"id":')[1:]

    for row in json_rows:
        statement = {}
        
        #Retrieve the slugs needed for the url
        article_slug = row.split('"slug":"')[1].split('","speaker"')[0]
        speaker_slug = row.split('"speaker":{"slug":"')[1].split('","full_name"')[0]
        date = row.split('"publication_date":"')[1].split('T')[0]
        date = datetime.datetime.strptime(date, '%Y-%m-%d')
        
        #Retrieve the other necessary info
        claim = row.split('"statement":"')[1].split('","ruling_slug')[0]
        claimant = row.split('"speaker":')[1].split(',"targets"')[0]
        rating = row.split('"ruling_slug":"')[1].split('","publication_date":')[0]
        
        #Create the dataset entry
        statement['claim'] = clean_sentence(claim)
        statement['claimant'] = clean_author(claimant)
        statement['url'] = build_url(article_slug, speaker_slug, date)
        statement['reviewDate'] = date
        statement['Rating'] = rating
        statement['languageCode'] = 'en'
        statement['publisherName'] = 'Politifact'
        statement['publisherSite'] = 'politifact.com'
        #No claim link is available: store a real missing value rather than the string 'NaN'
        statement['claimLink'] = np.nan
        
        statements.append(statement)
        
print("We gathered {0} statements".format(len(statements)))

Analyzing page nr  1
Analyzing page nr  2
Analyzing page nr  3
Analyzing page nr  4
Analyzing page nr  5
Analyzing page nr  6
Analyzing page nr  7
Analyzing page nr  8
Analyzing page nr  9
Analyzing page nr  10
Analyzing page nr  11
Analyzing page nr  12
Analyzing page nr  13
Analyzing page nr  14
Analyzing page nr  15
Analyzing page nr  16
Analyzing page nr  17
Analyzing page nr  18
Analyzing page nr  19
Analyzing page nr  20
Analyzing page nr  21
Analyzing page nr  22
Analyzing page nr  23
Analyzing page nr  24
Analyzing page nr  25
Analyzing page nr  26
Analyzing page nr  27
Analyzing page nr  28
Analyzing page nr  29
Analyzing page nr  30
Analyzing page nr  31
Analyzing page nr  32
Analyzing page nr  33
Analyzing page nr  34
Analyzing page nr  35
Analyzing page nr  36
Analyzing page nr  37
Analyzing page nr  38
Analyzing page nr  39
Analyzing page nr  40
Analyzing page nr  41
Analyzing page nr  42
Analyzing page nr  43
Analyzing page nr  44
Analyzing page nr  45
Analyzing page nr  

In [45]:
df_statement = pd.DataFrame(statements)

In [46]:
df_statement.head()

| | claim | claimant | url | reviewDate | Rating | languageCode | publisherName | publisherSite | claimLink |
|---|---|---|---|---|---|---|---|---|---|
| 0 | Kamala Harris said, "I don't like Joe Biden an... | Facebook posts | https://www.politifact.com/factchecks/2020/aug... | 2020-08-19 | false | en | Politifact | politifact.com | |
| 1 | Condoleezza Rice said, "If you are taught bitt... | Facebook posts | https://www.politifact.com/factchecks/2020/aug... | 2020-08-19 | half-true | en | Politifact | politifact.com | |
| 2 | A coloring book that describes Joe Biden as "A... | Viral image | https://www.politifact.com/factchecks/2020/aug... | 2020-08-19 | false | en | Politifact | politifact.com | |
| 3 | "In West Virginia alone, overdoses have increa... | Carol Miller | https://www.politifact.com/factchecks/2020/aug... | 2020-08-19 | mostly-true | en | Politifact | politifact.com | |
| 4 | The United States is "the only major industria... | Bill Clinton | https://www.politifact.com/factchecks/2020/aug... | 2020-08-18 | mostly-true | en | Politifact | politifact.com | |


Now we need to add a uniform rating.

In [47]:
print("Unique ratings: ", df_statement['Rating'].unique().tolist())

Unique ratings:  ['false', 'half-true', 'mostly-true', 'pants-fire', 'true', 'barely-true', 'full-flop', 'half-flip', 'no-flip']


We drop claims rated "full-flop", "half-flip", or "no-flip". These ratings refer to politicians changing their stance, so the fact-checking article does not support or refute the claim but instead addresses the change of stance.

In [48]:
print("Original length: ", len(df_statement))
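#Copy the filtered rows so that adding a column later does not write to a view of the original frame
df_statement = df_statement[~df_statement['Rating'].isin(['full-flop', 'half-flip', 'no-flip'])].copy()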
print("New length: ", len(df_statement))

Original length:  17580
New length:  17338


In [49]:
def uniform_rating(rating):
    simplified_ratings = ["False", "Mostly False", "Exaggerated", "Half True or Out of Context", "Mostly True", "True"]
    if rating in ['false', 'pants-fire']:
        return simplified_ratings[0]
    if rating in ['barely-true']:
        return simplified_ratings[1]
    if rating in ['half-true']:
        return simplified_ratings[3]
    if rating in ['mostly-true']:
        return simplified_ratings[4]
    if rating in ['true']:
        return simplified_ratings[5]

df_statement['UniformRating'] = df_statement['Rating'].apply(uniform_rating)
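
The same mapping can also be written as a lookup table, which makes it easier to see which Politifact rating maps to which uniform label. A minimal sketch, where `UNIFORM_RATINGS` and `uniform_rating_from_table` are hypothetical names and the pairs are copied from `uniform_rating` above (which never returns "Exaggerated"):

```python
# Same pairs as in uniform_rating, expressed as a dictionary
UNIFORM_RATINGS = {
    "pants-fire": "False",
    "false": "False",
    "barely-true": "Mostly False",
    "half-true": "Half True or Out of Context",
    "mostly-true": "Mostly True",
    "true": "True",
}

def uniform_rating_from_table(rating):
    """Hypothetical alternative; returns None for ratings not in the table."""
    return UNIFORM_RATINGS.get(rating)

print(uniform_rating_from_table("pants-fire"))
```

With pandas, the dictionary could be passed directly to `df_statement['Rating'].map(UNIFORM_RATINGS)`, which yields missing values for ratings not in the table.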

In [50]:
df_statement.head()

| | claim | claimant | url | reviewDate | Rating | languageCode | publisherName | publisherSite | claimLink | UniformRating |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Kamala Harris said, "I don't like Joe Biden an... | Facebook posts | https://www.politifact.com/factchecks/2020/aug... | 2020-08-19 | false | en | Politifact | politifact.com | | False |
| 1 | Condoleezza Rice said, "If you are taught bitt... | Facebook posts | https://www.politifact.com/factchecks/2020/aug... | 2020-08-19 | half-true | en | Politifact | politifact.com | | Half True or Out of Context |
| 2 | A coloring book that describes Joe Biden as "A... | Viral image | https://www.politifact.com/factchecks/2020/aug... | 2020-08-19 | false | en | Politifact | politifact.com | | False |
| 3 | "In West Virginia alone, overdoses have increa... | Carol Miller | https://www.politifact.com/factchecks/2020/aug... | 2020-08-19 | mostly-true | en | Politifact | politifact.com | | Mostly True |
| 4 | The United States is "the only major industria... | Bill Clinton | https://www.politifact.com/factchecks/2020/aug... | 2020-08-18 | mostly-true | en | Politifact | politifact.com | | Mostly True |


The dataset is now ready.

In [51]:
df_statement.to_csv('factcheck_politifact.csv', index=False)