# VADER President Sentiment Analysis #
This notebook analyses the sentiment of American presidents' speeches.
We order the speeches by president and by date, use the VADER model to score the sentiment of each sentence, and aggregate those sentence scores into a single sentiment label per speech.

The end result is one graph per president showing positive and negative speeches over time.

In [1]:
from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
analyser = SentimentIntensityAnalyzer()

In [2]:
# Just experimenting with VADER, not related to the President Speeches

def score(sentence):
    print(f'Sentence:\n\t{sentence}\nScore:\n\t{analyser.polarity_scores(sentence)}')
def is_pos(sentence):
    return analyser.polarity_scores(sentence)['compound'] > 0.05
def is_neg(sentence):
    return analyser.polarity_scores(sentence)['compound'] < -0.05

good_sentence = 'This is the greatest and best song in the world'
score(good_sentence)
print(f'Is positive? {is_pos(good_sentence)}')
print(f'Is negative? {is_neg(good_sentence)}')


Sentence:
	This is the greatest and best song in the world
Score:
	{'neg': 0.0, 'neu': 0.488, 'pos': 0.512, 'compound': 0.8555}
Is positive? True
Is negative? False


## Loading in the data ##
We traverse the directory tree in search of speech files and add them to a dictionary with metadata.
So far, we do not open any files. This keeps the memory footprint small.

In [3]:
from os import walk
import time
president_dict = {}
path = 'presidents-speeches/'
current_president = None
for (root, dirs, files) in walk(path):
    # go through child folders, skip the first parent folder
    if len(dirs) == 0:
        president_name = root.split("/")[1]
        if president_name not in president_dict:
            president_dict[president_name] = []
        for file in files:
            file_pieces = file.split('__')
            date = time.strptime(file_pieces[0], '%B %d, %Y')
            entry = {
                'date': date,
                'title': file_pieces[1],
                'path': root+'/'+file
            }
            president_dict[president_name].append(entry)

# For every president, sort their speeches by date
for pres, speeches in president_dict.items():
    speeches.sort(key=lambda e: e['date'])
print(f'There are {len(president_dict)} presidents in the file system:\n{president_dict.keys()}')

There are 11 presidents in the file system:
dict_keys(['George Washington', 'George W. Bush', 'George H. W. Bush', 'Barack Obama', 'Andrew Jackson', 'Franklin D. Roosevelt', 'Donald Trump', 'Bill Clinton', 'Thomas Jefferson', 'Ronald Reagan', 'Richard M. Nixon'])
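The filename convention relied on above (`<date>__<title>.txt`, with the date written as `%B %d, %Y`) can be illustrated in isolation. This is only a minimal sketch: the helper name `parse_speech_filename` is hypothetical and not part of this notebook, and it returns a `datetime.date` instead of the `time.struct_time` used in the cell above.

```python
from datetime import datetime, date


def parse_speech_filename(filename):
    """Split '<Month D, YYYY>__<Title>.txt' into (date, title).

    Hypothetical helper for illustration; assumes the naming
    convention seen in the speech folders.
    """
    # partition() splits on the first '__' only
    date_part, _, title_part = filename.partition('__')
    speech_date = datetime.strptime(date_part, '%B %d, %Y').date()
    # Drop the extension if present; otherwise keep the title as-is
    if title_part.endswith('.txt'):
        title_part = title_part[:-len('.txt')]
    return speech_date, title_part


print(parse_speech_filename('September 19, 1796__FarewellAddress.txt'))
```

Like `struct_time`, `date` objects compare chronologically, so the same `sort(key=...)` call would work with either representation.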


We scrape the list of presidents and create a new, reordered, list of presidents with speech file entries.

In [4]:
import requests
from collections import OrderedDict
from bs4 import BeautifulSoup
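try:
    resp = requests.get('https://www.loc.gov/rr/print/list/057_chron.html')
    resp.raise_for_status()  # treat HTTP error codes as failures too
except requests.exceptions.RequestException as e:
    print(e)
    raise  # without a response there is nothing to parse below
soup = BeautifulSoup(resp.content)  # public attribute instead of the private _content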
ordered_president_dict = OrderedDict()
html_president_table = soup.find_all('table')[3]
for row in html_president_table.find_all('tr')[1:]:  # skipping header row
    cols = row.find_all('td')
    president = cols[1].text
    # Manually handling edge cases
    if president == 'Donald J. Trump':
        president = 'Donald Trump'
    elif president == 'George Bush':
        president = 'George H. W. Bush'
    if president in president_dict:
        ordered_president_dict[president] = president_dict[president]
print(f'There are {len(ordered_president_dict)} presidents, in order:\n{ordered_president_dict.keys()}')

There are 11 presidents, in order:
odict_keys(['George Washington', 'Thomas Jefferson', 'Andrew Jackson', 'Franklin D. Roosevelt', 'Richard M. Nixon', 'Ronald Reagan', 'George H. W. Bush', 'Bill Clinton', 'George W. Bush', 'Barack Obama', 'Donald Trump'])
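The reordering step itself does not depend on the scrape, so it can be sketched offline. In the example below, `reorder_by_reference` and the sample data are hypothetical; the alias table simply mirrors the two edge cases handled manually above.

```python
from collections import OrderedDict


def reorder_by_reference(unordered, reference_names, aliases=None):
    """Return an OrderedDict with the keys of `unordered` arranged in the
    order in which they appear in `reference_names`.

    Names in the reference list that have no entry are skipped.
    """
    aliases = aliases or {}
    ordered = OrderedDict()
    for name in reference_names:
        name = aliases.get(name, name)  # map the list's spelling to ours
        if name in unordered:
            ordered[name] = unordered[name]
    return ordered


# Illustrative sample, not the real speech entries
speeches_by_president = {
    'Donald Trump': ['speech C'],
    'George Washington': ['speech A'],
    'George H. W. Bush': ['speech B'],
}
chronological_names = ['George Washington', 'John Adams',
                       'George Bush', 'Donald J. Trump']
aliases = {'George Bush': 'George H. W. Bush',
           'Donald J. Trump': 'Donald Trump'}

ordered = reorder_by_reference(speeches_by_president, chronological_names, aliases)
print(list(ordered))
```

A president whose folder name matches neither the reference list nor an alias would silently drop out, so comparing `len(ordered)` with `len(speeches_by_president)`, as the printed count above does, is a useful check.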


### Verifying the ordering ###
To verify the ordering, we should see that the **last speech** of the **first president** is George Washington's Farewell Address (of Hamilton fame):

In [5]:
list(ordered_president_dict.items())[0][1][-1]

{'date': time.struct_time(tm_year=2001, tm_mon=1, tm_mday=20, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=5, tm_yday=20, tm_isdst=-1),
 'title': 'FirstInauguralAddress.txt',
 'path': 'presidents-speeches/George Washington/speeches/January 20, 2001__FirstInauguralAddress.txt'}

Searching the file system, we find that the fault lies not in this notebook but in the scraping notebook, which followed invalid links for George Washington. It seems, however, that this is the only faulty entry that ended up in the data, so we can simply remove it during this processing.

In [6]:
list(ordered_president_dict.items())[0][1].pop() ; print()




And now the same command:

In [7]:
list(ordered_president_dict.items())[0][1][-1]

{'date': time.struct_time(tm_year=1796, tm_mon=12, tm_mday=7, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=2, tm_yday=342, tm_isdst=-1),
 'title': 'EighthAnnualMessagetoCongress.txt',
 'path': 'presidents-speeches/George Washington/speeches/December 7, 1796__EighthAnnualMessagetoCongress.txt'}

Still not the farewell address! Manually scanning the sources shows that my assumption was incorrect: Washington gave another speech after he announced his retirement. Thus, we find the sought-after address in the next-to-last position:

In [8]:
list(ordered_president_dict.items())[0][1][-2]  # [-2] for next-to-last

{'date': time.struct_time(tm_year=1796, tm_mon=9, tm_mday=19, tm_hour=0, tm_min=0, tm_sec=0, tm_wday=0, tm_yday=263, tm_isdst=-1),
 'title': 'FarewellAddress.txt',
 'path': 'presidents-speeches/George Washington/speeches/September 19, 1796__FarewellAddress.txt'}

Anyway, this shows that the ordering is working.
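The stray 2001 entry was found by eye. An automated check could look something like the sketch below, which flags entries whose year lies far from the median year of a president's speeches. The function name `find_date_outliers` and the 20-year threshold are assumptions made for illustration, not something this notebook uses.

```python
import time
from statistics import median


def find_date_outliers(speeches, max_gap_years=20):
    """Return entries whose year is more than `max_gap_years` away
    from the median year of the list (hypothetical sanity check)."""
    if not speeches:
        return []
    mid_year = median(s['date'].tm_year for s in speeches)
    return [s for s in speeches
            if abs(s['date'].tm_year - mid_year) > max_gap_years]


# The three Washington entries shown above, in the notebook's entry format
sample = [
    {'date': time.strptime('September 19, 1796', '%B %d, %Y'),
     'title': 'FarewellAddress.txt'},
    {'date': time.strptime('December 7, 1796', '%B %d, %Y'),
     'title': 'EighthAnnualMessagetoCongress.txt'},
    {'date': time.strptime('January 20, 2001', '%B %d, %Y'),
     'title': 'FirstInauguralAddress.txt'},
]

for entry in find_date_outliers(sample):
    print(entry['title'], entry['date'].tm_year)
```

The median is used rather than the mean so that a single wildly wrong date does not shift the reference point.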

## Sentiment-classing speeches ##

For each speech, we load the file, split it into sentences, score each sentence with VADER, and label the speech according to whether it contains more positive or more negative sentences.

We assign each speech a new attribute, "sentiment": a string that is empty when positive and negative sentences are tied, which we treat as a neutral speech.

In [9]:
from nltk.tokenize import sent_tokenize
analyser = SentimentIntensityAnalyzer()

def process_speech_file(file):
    """
    Reads the text file and computes sentiment,
    returns string "positive"/"negative"/"" (neutral)
    """
    pos_count = 0
    neg_count = 0
    with open(file) as f:
        sentences = sent_tokenize(f.read())
        for sentence in sentences:
            if is_pos(sentence):
                pos_count += 1
            elif is_neg(sentence):
                neg_count += 1
    if pos_count == neg_count:
        return ''
    return 'positive' if pos_count > neg_count else 'negative'

president_count_dict = {}

for president, speeches in ordered_president_dict.items():
    positives = 0
    negatives = 0
    president_count_dict[president] = {'pos': [], 'neg': []}
    for speech in speeches:
        sentiment = process_speech_file(speech['path'])
        speech['sentiment'] = sentiment
        if sentiment == 'positive':
            president_count_dict[president]['pos'].append(
                time.strftime('%Y-%m-%d', speech['date'])
            )
            positives += 1
        elif sentiment == 'negative':
            president_count_dict[president]['neg'].append(
                time.strftime('%Y-%m-%d', speech['date'])
            )
            negatives += 1
    
    print(f'{president}: {positives} positive, {negatives} negative ({len(speeches)} total)')
        

George Washington: 19 positive, 0 negative (21 total)
Thomas Jefferson: 23 positive, 1 negative (24 total)
Andrew Jackson: 26 positive, 0 negative (26 total)
Franklin D. Roosevelt: 38 positive, 9 negative (49 total)
Richard M. Nixon: 21 positive, 2 negative (23 total)
Ronald Reagan: 56 positive, 3 negative (59 total)
George H. W. Bush: 23 positive, 0 negative (23 total)
Bill Clinton: 38 positive, 1 negative (39 total)
George W. Bush: 35 positive, 4 negative (39 total)
Barack Obama: 47 positive, 2 negative (50 total)
Donald Trump: 19 positive, 2 negative (22 total)
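The counts above come from a majority vote over sentences. For comparison, the sketch below contrasts that rule with averaging the sentence scores, working on plain lists of compound scores so that it runs without VADER. `majority_label` and `mean_label` are hypothetical helpers, the sample scores are made up, and the 0.05 cut-off is the one used by `is_pos` and `is_neg`.

```python
from statistics import mean

THRESHOLD = 0.05  # same cut-off as is_pos / is_neg in this notebook


def majority_label(compounds, threshold=THRESHOLD):
    """Label by whether more sentences are positive or negative."""
    pos = sum(1 for c in compounds if c > threshold)
    neg = sum(1 for c in compounds if c < -threshold)
    if pos == neg:
        return ''
    return 'positive' if pos > neg else 'negative'


def mean_label(compounds, threshold=THRESHOLD):
    """Label by the average compound score of all sentences."""
    if not compounds:
        return ''
    avg = mean(compounds)
    if avg > threshold:
        return 'positive'
    if avg < -threshold:
        return 'negative'
    return ''


# Made-up scores: three mildly positive sentences, two strongly negative ones
scores = [0.3, 0.2, 0.1, -0.9, -0.8]
print(repr(majority_label(scores)), repr(mean_label(scores)))
```

The two rules can disagree: the majority vote ignores how strongly each sentence is scored, while the mean lets a few intense sentences outweigh many mild ones.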


In [None]:
import plotly.graph_objects as go

for president, data in president_count_dict.items():
    
    fig = go.Figure()

    fig.add_trace(go.Histogram(
            x=data['pos'],
            name='Positive'))

    fig.add_trace(go.Histogram(
        x=data['neg'],
        name='Negative'
        ))

    # Stack the two histograms on top of one another
    fig.update_layout(
        barmode='stack',
        title=f'Sentiment over time for {president}',
        xaxis_title='Years of service',
        yaxis_title='Speech count',
        font=dict(
            family="Helvetica",
            size=18,
            color="#7f7f7f"
        )
    )
    fig.show()
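The histograms bin the `'%Y-%m-%d'` date strings along the time axis. To inspect the underlying numbers without a plot, the counts can be tabulated per calendar year, as in this sketch. `yearly_sentiment_table` is a hypothetical helper and the sample dates are invented; the input has the same shape as one value of `president_count_dict`.

```python
from collections import Counter


def yearly_sentiment_table(data):
    """Turn {'pos': [dates], 'neg': [dates]} into sorted
    (year, positive_count, negative_count) rows."""
    # The first four characters of a 'YYYY-MM-DD' string are the year
    pos = Counter(d[:4] for d in data['pos'])
    neg = Counter(d[:4] for d in data['neg'])
    years = sorted(set(pos) | set(neg))
    # A Counter returns 0 for missing keys, so no special-casing is needed
    return [(year, pos[year], neg[year]) for year in years]


# Invented sample data, for illustration only
sample_counts = {
    'pos': ['1933-03-04', '1933-05-07', '1941-01-06'],
    'neg': ['1941-12-08'],
}

for row in yearly_sentiment_table(sample_counts):
    print(row)
```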