# U.S.A. Presidential Vocabulary

## Introduction
Whenever a president of the United States of America is elected or re-elected, an inauguration ceremony marks the beginning of the president’s term. During the ceremony, the president delivers an inaugural address to the nation, setting the tone and focus for the next four years of leadership.

In this project we use word embeddings to analyze the inaugural addresses of the presidents of the United States of America, as collected in the `inaugural` corpus of the Natural Language Toolkit (NLTK).

By training one set of word embeddings on the full collection of inaugural addresses and others on subsets, such as the addresses of a single president, we can learn about the different ways in which presidents use language to convey their agendas.

## Import Required Libraries

In [104]:
import os
import gensim
import spacy

from nltk.tokenize import PunktSentenceTokenizer
from collections import Counter

from nltk.corpus import inaugural

## Load All Speeches

In [105]:
## Get a list of All Files in Inaugural
files = sorted(list(inaugural.fileids()))
#print(files)

# Read each address as one raw string
speeches = [inaugural.raw(filename) for filename in files]
#print(speeches)
#speeches[0]

## Preprocess Inaugural Speeches

In [106]:
def process_speeches(speeches):
  word_tokenized_speeches = list()
  for speech in speeches:
    sentence_tokenizer = PunktSentenceTokenizer()
    sentence_tokenized_speech = sentence_tokenizer.tokenize(speech)
    word_tokenized_sentences = list()
    for sentence in sentence_tokenized_speech:
      word_tokenized_sentence = [word.lower().strip('.').strip('?').strip('!') for word in sentence.replace(",","").replace("-"," ").replace(":","").split()]
      word_tokenized_sentences.append(word_tokenized_sentence)
    word_tokenized_speeches.append(word_tokenized_sentences)
  return word_tokenized_speeches

processed_speeches = process_speeches(speeches)
#print(processed_speeches)
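
To make the word-level cleanup in `process_speeches` easier to follow, here is a minimal sketch that applies the same string operations to a single sentence. The helper name `normalize_sentence` and the example sentences are made up for illustration and are not part of the notebook.

```python
def normalize_sentence(sentence):
    """Apply the same word-level cleanup as process_speeches to one sentence."""
    # Drop commas and colons, turn hyphens into spaces.
    cleaned = sentence.replace(",", "").replace("-", " ").replace(":", "")
    # Split on whitespace, lowercase, and strip '.', '?' and '!' from token edges.
    return [word.lower().strip(".").strip("?").strip("!") for word in cleaned.split()]


example = "Fellow-citizens, we meet today: let us begin!"
print(normalize_sentence(example))
# ['fellow', 'citizens', 'we', 'meet', 'today', 'let', 'us', 'begin']

# Punctuation that is not handled explicitly stays attached to the token.
print(normalize_sentence("Peace; and liberty."))
# ['peace;', 'and', 'liberty']
```

The second call shows a limitation of this approach: semicolons, quotation marks, and parentheses are not removed, so tokens such as `peace;` are counted separately from `peace`.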

## Merge All Inaugural Speeches

In [107]:
def merge_speeches(speeches):
  all_sentences = list()
  for speech in speeches:
    for sentence in speech:
      all_sentences.append(sentence)
  return all_sentences

all_sentences = merge_speeches(processed_speeches)
#all_sentences

## Bag of Words Dictionary or Most Frequent Words

In [108]:
def most_frequent_words(list_of_sentences, n = None):
  all_words = [word for sentence in list_of_sentences for word in sentence]
  return dict(Counter(all_words).most_common(n))

most_freq_words = most_frequent_words(all_sentences)
#most_freq_words
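
As a quick check of how `most_frequent_words` behaves, the sketch below runs the same flatten-and-count logic on a tiny hand-written input. The variable `toy_sentences` is illustrative only.

```python
from collections import Counter

# Two already-tokenized sentences, in the same shape as all_sentences.
toy_sentences = [["we", "the", "people"], ["the", "people", "of", "the", "nation"]]

# Flatten the nested lists into one list of words, then count them.
all_words = [word for sentence in toy_sentences for word in sentence]
top_two = dict(Counter(all_words).most_common(2))
print(top_two)  # {'the': 3, 'people': 2}
```

Passing `n = None`, the default in `most_frequent_words`, makes `most_common` return every word, ordered from most to least frequent.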

## Find Similar Words

In [109]:
def similar_words(word, n = 10, list_of_sentences = all_sentences):
    # Train a skip-gram (sg = 1) Word2Vec model on the tokenized sentences
    model = gensim.models.Word2Vec(list_of_sentences, vector_size = 1000, window = 5, min_count = 1, workers = 2, sg = 1)
    #model.save("word2vec.model")
    #model = gensim.models.Word2Vec.load("word2vec.model")
    # Return the n nearest neighbours (previously hard-coded to 10, ignoring n)
    similars = model.wv.most_similar(word, topn=n)
    return similars
print(similar_words("freedom"))

[('human', 0.9883624315261841), ('life', 0.9880091547966003), ('domestic', 0.9874228835105896), ('defense', 0.9866711497306824), ('order', 0.984593391418457), ('safety', 0.9840932488441467), ('influence', 0.9840729236602783), ('race', 0.9840136170387268), ('business', 0.9838665127754211), ('preservation', 0.9834845066070557)]
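
The second element of each pair returned by `most_similar` is the cosine similarity between two word vectors. The following sketch computes that measure by hand on small made-up vectors; `cosine_similarity` here is a plain-Python helper written for illustration, not a gensim function.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


a = [1.0, 2.0, 3.0]
b = [2.0, 4.0, 6.0]   # same direction as a
c = [3.0, 0.0, -1.0]  # orthogonal to a

print(math.isclose(cosine_similarity(a, b), 1.0))  # True
print(abs(cosine_similarity(a, c)) < 1e-12)        # True
```

Scores clustered near 0.98 for every neighbour, as in the output above, may say more about the small corpus and the large `vector_size` than about how closely the words are related, so the ranking is probably more informative than the absolute values.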


## For a Specific President

In [110]:
def get_president_sentences(president):
    files = sorted([file for file in list(inaugural.fileids()) if president.lower() in file.lower()])
    # Read each matching address as one raw string
    speeches = [inaugural.raw(filename) for filename in files]
    processed_speeches = process_speeches(speeches)
    all_sentences = merge_speeches(processed_speeches)
    return all_sentences
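
Because `get_president_sentences` selects files by a case-insensitive substring match on the file ID, a last name shared by several presidents pulls in all of their addresses. The sketch below illustrates this with a short hand-written list of IDs that is assumed to follow the corpus naming pattern `YEAR-Name.txt`; `match_fileids` and `sample_fileids` are hypothetical names used only here.

```python
# Hand-written sample, assumed to follow the corpus naming pattern YEAR-Name.txt.
sample_fileids = [
    "1901-McKinley.txt",
    "1905-Roosevelt.txt",
    "1909-Taft.txt",
    "1933-Roosevelt.txt",
    "1937-Roosevelt.txt",
]


def match_fileids(president, fileids):
    """Same filter as get_president_sentences: case-insensitive substring match."""
    return sorted(f for f in fileids if president.lower() in f.lower())


print(match_fileids("Roosevelt", sample_fileids))
# ['1905-Roosevelt.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt']
```

With the name "Roosevelt", the 1905 address by Theodore Roosevelt is merged with those of Franklin D. Roosevelt. Filtering on the year prefix as well would be one way to keep them apart.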

In [111]:
# The president's last name must appear in a file ID from inaugural.fileids()
president_last_name = "Roosevelt"

president_sentences = get_president_sentences(president_last_name)
president_most_freq_words = most_frequent_words(president_sentences)
president_most_freq_words

{'the': 440,
 'of': 366,
 'and': 217,
 'to': 186,
 'we': 163,
 'in': 142,
 'a': 141,
 'that': 114,
 'our': 112,
 'it': 79,
 'is': 78,
 'have': 71,
 'for': 51,
 'be': 50,
 'not': 48,
 'this': 46,
 'which': 45,
 'by': 45,
 'as': 44,
 'are': 42,
 'will': 40,
 'i': 40,
 'but': 38,
 'all': 38,
 'with': 34,
 'has': 33,
 'people': 31,
 'us': 31,
 'nation': 31,
 'they': 30,
 'on': 28,
 'their': 28,
 'government': 26,
 'can': 25,
 'from': 24,
 'no': 23,
 'must': 23,
 'who': 22,
 'its': 22,
 'an': 22,
 'men': 22,
 'shall': 22,
 'life': 21,
 'democracy': 20,
 'been': 19,
 'spirit': 18,
 'if': 18,
 'know': 18,
 'so': 17,
 'because': 17,
 'there': 17,
 'more': 15,
 'great': 15,
 'world': 15,
 'good': 14,
 'national': 14,
 'new': 14,
 'power': 14,
 'every': 14,
 'upon': 14,
 'these': 14,
 'at': 14,
 'those': 13,
 'my': 12,
 'such': 12,
 'other': 12,
 'only': 12,
 'peace': 12,
 'through': 12,
 'years': 12,
 'may': 12,
 'old': 11,
 'out': 11,
 'was': 11,
 'today': 11,
 'do': 11,
 'states': 11,
 'way':