In [1]:
%matplotlib inline
import os
import re
from collections import Counter

import matplotlib.pyplot as plt
import nltk
import numpy as np
import pandas as pd
import scipy
import seaborn as sns
import sklearn
import spacy
from nltk.corpus import inaugural, stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize
from sklearn import ensemble
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split
from sklearn.svm import SVC


## Unsupervised Learning Capstone: Classifying 1st Inaugural Addresses (1909-2009)

US presidents have used their first inaugural addresses as an opportunity to lay out their objectives and hopes for the future as Commander-in-Chief. For this project, I analyze 18 inaugural addresses delivered over the last century by 17 presidents, from President Taft's to President Obama's; all are first addresses except Wilson's second (1917). We will try to classify the addresses by president using a number of supervised and unsupervised learning tools. The texts were pulled from NLTK.

### Data Cleaning and Processing

In [2]:
nltk.download('inaugural')
print(inaugural.fileids())

[nltk_data] Downloading package inaugural to
[nltk_data]     /Users/samuelkim/nltk_data...
[nltk_data]   Package inaugural is already up-to-date!
['1789-Washington.txt', '1793-Washington.txt', '1797-Adams.txt', '1801-Jefferson.txt', '1805-Jefferson.txt', '1809-Madison.txt', '1813-Madison.txt', '1817-Monroe.txt', '1821-Monroe.txt', '1825-Adams.txt', '1829-Jackson.txt', '1833-Jackson.txt', '1837-VanBuren.txt', '1841-Harrison.txt', '1845-Polk.txt', '1849-Taylor.txt', '1853-Pierce.txt', '1857-Buchanan.txt', '1861-Lincoln.txt', '1865-Lincoln.txt', '1869-Grant.txt', '1873-Grant.txt', '1877-Hayes.txt', '1881-Garfield.txt', '1885-Cleveland.txt', '1889-Harrison.txt', '1893-Cleveland.txt', '1897-McKinley.txt', '1901-McKinley.txt', '1905-Roosevelt.txt', '1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1937-Roosevelt.txt', '1941-Roosevelt.txt', '1945-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1

In [3]:
#We pick the 18 inaugural addresses we wish to analyze and put their file ids into a list;
#after cleaning, we will create sentence-level documents for each text.
#Each sentence is then tied to the name of the president who delivered it,
#which we extract from the txt file name, i.e. file[5:-4] gives 'Nixon' for '1969-Nixon.txt'.

#nltk.rename('2001-Bush.txt', '2001-GWBush.txt')

labels = []

file_ids = ['1909-Taft.txt', '1913-Wilson.txt', '1917-Wilson.txt', '1921-Harding.txt', 
            '1925-Coolidge.txt', '1929-Hoover.txt', '1933-Roosevelt.txt', '1949-Truman.txt', '1953-Eisenhower.txt', '1961-Kennedy.txt', 
            '1965-Johnson.txt', '1969-Nixon.txt', '1977-Carter.txt', '1981-Reagan.txt', '1989-Bush.txt', 
            '1993-Clinton.txt', '2001-GWBush.txt', '2009-Obama.txt']
for file in file_ids:
    president = re.sub("[^a-zA-Z]", '', file[5:-4])
    labels.append([file, president])

sent_list = []
pres_list = []

#join each tokenized sentence back into a string and record its president
for file, president in labels:
    for sent in inaugural.sents(file):
        sent_list.append(' '.join(sent))
        pres_list.append(president)

sent_list[:5]

['My fellow citizens : Anyone who has taken the oath I have just taken must feel a heavy weight of responsibility .',
 'If not , he has no conception of the powers and duties of the office upon which he is about to enter , or he is lacking in a proper sense of the obligation which the oath imposes .',
 'The office of an inaugural address is to give a summary outline of the main policies of the new administration , so far as they can be anticipated .',
 'I have had the honor to be one of the advisers of my distinguished predecessor , and , as such , to hold up his hands in the reforms he has initiated .',
 'I should be untrue to myself , to my promises , and to the declarations of the party platform upon which I was elected to office , if I did not make the maintenance and enforcement of those reforms a most important feature of my administration .']
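To make the labelling step concrete, here is a small sketch of the same slice-and-strip logic applied to a few of the file ids listed above. The helper name `president_from_file_id` is hypothetical and not part of the notebook; it only wraps the `re.sub` call used in the loop.

```python
import re

def president_from_file_id(file_id):
    """Hypothetical helper: drop the 'YYYY-' prefix and '.txt' suffix, keep letters only."""
    return re.sub("[^a-zA-Z]", "", file_id[5:-4])

sample_ids = ["1913-Wilson.txt", "1917-Wilson.txt", "1989-Bush.txt", "2001-GWBush.txt"]
sample_labels = [president_from_file_id(f) for f in sample_ids]
print(sample_labels)
```

Two things follow from this: both Wilson addresses end up under a single label, and the two Bush presidencies stay separate only because the 2001 file id in the list carries the GWBush name.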

In [4]:
#replace punctuation and numeric values with ' ', and blank out sentences that are entirely upper-case with ""
sent_list_clean = []
for sent in sent_list:
    sent = re.sub("[^a-zA-Z]", ' ', sent) 
    if sent == sent.upper():              
        sent = ""                         
    sent_list_clean.append(sent)
print(len(sent_list_clean))
sent_list_clean[:5]

1982


['My fellow citizens   Anyone who has taken the oath I have just taken must feel a heavy weight of responsibility  ',
 'If not   he has no conception of the powers and duties of the office upon which he is about to enter   or he is lacking in a proper sense of the obligation which the oath imposes  ',
 'The office of an inaugural address is to give a summary outline of the main policies of the new administration   so far as they can be anticipated  ',
 'I have had the honor to be one of the advisers of my distinguished predecessor   and   as such   to hold up his hands in the reforms he has initiated  ',
 'I should be untrue to myself   to my promises   and to the declarations of the party platform upon which I was elected to office   if I did not make the maintenance and enforcement of those reforms a most important feature of my administration  ']
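The cleaning rule is easy to probe on a few made-up sentences. The sketch below wraps the same two steps in a hypothetical function, `clean_sentence`; the sample strings are invented for illustration and do not come from the corpus.

```python
import re

def clean_sentence(sent):
    """Same rule as the cell above: non-letters become spaces, all-caps sentences become ''."""
    letters_only = re.sub("[^a-zA-Z]", " ", sent)
    if letters_only == letters_only.upper():
        return ""
    return letters_only

samples = ["We the people .", "THE OATH", "1933 .", "I ."]
cleaned = [clean_sentence(s) for s in samples]
print(cleaned)
```

Note that the upper-case test also blanks out sentences made only of digits or punctuation, and very short sentences such as "I .", because a string with no lower-case letters is equal to its own upper-case form.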

In [5]:
#tokenize and lemmatize words; re-joining the tokens also removes the extra spaces left by the previous cell.
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
lemmatizer = WordNetLemmatizer()
lemma_sents = []
for sent in sent_list_clean:
    words = word_tokenize(sent)                                 
    word_lemma = [lemmatizer.lemmatize(word) for word in words] 
    sent_lemma = ' '.join(word_lemma)                           
    lemma_sents.append(sent_lemma)
lemma_sents[:5]

['My fellow citizen Anyone who ha taken the oath I have just taken must feel a heavy weight of responsibility',
 'If not he ha no conception of the power and duty of the office upon which he is about to enter or he is lacking in a proper sense of the obligation which the oath imposes',
 'The office of an inaugural address is to give a summary outline of the main policy of the new administration so far a they can be anticipated',
 'I have had the honor to be one of the adviser of my distinguished predecessor and a such to hold up his hand in the reform he ha initiated',
 'I should be untrue to myself to my promise and to the declaration of the party platform upon which I wa elected to office if I did not make the maintenance and enforcement of those reform a most important feature of my administration']
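The output above contains tokens such as "ha", "wa", "a" and "u", which appear where the cleaned text had "has", "was", "as" and "us": with no part-of-speech argument, WordNetLemmatizer treats every token as a noun and strips the trailing "s". One possible workaround, sketched below under stated assumptions, is to shield a small list of function words from the lemmatizer. The names `PROTECTED`, `lemmatize_tokens` and `strip_plural_s` are hypothetical, and `strip_plural_s` is only a crude stand-in for a real lemmatizer so that the sketch runs without any NLTK data.

```python
# Hypothetical guard list, limited to the words visibly mangled in the notebook output.
PROTECTED = frozenset({"has", "was", "as", "us"})

def lemmatize_tokens(tokens, lemmatize, protected=PROTECTED):
    """Apply `lemmatize` to every token except those listed in `protected`."""
    return [tok if tok.lower() in protected else lemmatize(tok) for tok in tokens]

def strip_plural_s(word):
    """Crude stand-in for a noun lemmatizer, used only to keep this sketch self-contained."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word

tokens = "Anyone who has taken the oath was given powers and reforms".split()
guarded = " ".join(lemmatize_tokens(tokens, strip_plural_s))
print(guarded)
```

In the notebook itself, `lemmatizer.lemmatize` would be passed in place of the stand-in. Another option is to call `lemmatizer.lemmatize(word, pos='v')` for verbs, which requires part-of-speech tags for each token.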

In [6]:
#we can now create a dataframe with our sentences and their associated presidents
df = pd.DataFrame()
df['sent'] = lemma_sents 
df['president'] = pres_list
df = df[df.sent!=""]
df.head()

,sent,president
0,My fellow citizen Anyone who ha taken the oath...,Taft
1,If not he ha no conception of the power and du...,Taft
2,The office of an inaugural address is to give ...,Taft
3,I have had the honor to be one of the adviser ...,Taft
4,I should be untrue to myself to my promise and...,Taft


In [7]:
# we now have a giant document for all 18 texts, but with presidents' names distinguishing them
#I created a dict in case we need to study individual addresses more closely 
president_speech_dict={}
for president in df.president.unique():
    president_speech_dict[president] = df[df.president == president].sent.values.tolist()
president_speech_dict['Bush']

['Mr Chief Justice Mr President Vice President Quayle Senator Mitchell Speaker Wright Senator Dole Congressman Michael and fellow citizen neighbor and friend',
 'There is a man here who ha earned a lasting place in our heart and in our history',
 'President Reagan on behalf of our Nation I thank you for the wonderful thing that you have done for America',
 'I have just repeated word for word the oath taken by George Washington year ago and the Bible on which I placed my hand is the Bible on which he placed his',
 'It is right that the memory of Washington be with u today not only because this is our Bicentennial Inauguration but because Washington remains the Father of our Country',
 'And he would I think be gladdened by this day for today is the concrete expression of a stunning fact our continuity these year since our government began',
 'We meet on democracy s front porch a good place to talk a neighbor and a friend',
 'For this is a day when our nation is made whole when our differ

In [8]:
#character length of each sentence in the last address (Obama's), after data cleaning
for sent in president_speech_dict['Obama']:
    print(len(sent))

17
135
131
57
87
76
217
13
47
57
74
182
48
167
63
174
54
34
54
38
102
166
98
290
88
17
60
127
202
99
106
85
131
127
37
54
59
114
33
121
104
47
137
111
123
85
91
18
19
120
22
163
159
200
49
39
229
294
264
87
177
84
288
147
93
187
32
154
97
129
275
68
74
460
89
178
205
157
201
51
166
97
180
113
140
190
150
24
49
157
20
65
48
359
48
99
310
82
133
24
22
30
129
187
2
7
98
89
285
9
13
42


### TF-IDF Vectorization and Latent Semantic Analysis

We can now proceed with some unsupervised feature generation.

In [9]:
#share of all sentences contributed by each president, from highest to lowest.
df_final = df.groupby('president').count()/df['sent'].count()
df_final = df_final.sort_values(by=['sent'], ascending=False)
df_final

president,sent
Coolidge,0.099395
Taft,0.080222
Hoover,0.079717
Harding,0.075177
Bush,0.073158
Wilson,0.064581
Reagan,0.064077
Eisenhower,0.062059
Truman,0.058527
Obama,0.056509


In [10]:
#train-test split, test size kept at 25 percent
df_train, df_test = train_test_split(df,
                                    test_size=0.25,
                                    random_state=40)

print(df_train.shape)
print(df_test.shape)

(1486, 2)
(496, 2)
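Because the presidents contribute unequal numbers of sentences, a plain random split can leave some of them over- or under-represented in the test set. The sketch below illustrates what a stratified split does, in plain Python with hypothetical names (`stratified_split`, toy labels "A" and "B"); it is not the notebook's method. With scikit-learn, the usual route is to pass `stratify=df['president']` to `train_test_split`.

```python
import random
from collections import Counter, defaultdict

def stratified_split(labels, test_size=0.25, seed=40):
    """Return (train_idx, test_idx) so each label keeps roughly `test_size` of its rows in test."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_label.values():
        rng.shuffle(idxs)                      # shuffle within each class
        n_test = round(len(idxs) * test_size)  # per-class test quota
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)

toy_labels = ["A"] * 40 + ["B"] * 8            # imbalanced toy data
train_idx, test_idx = stratified_split(toy_labels)
test_counts = Counter(toy_labels[i] for i in test_idx)
print(test_counts)
```

Each class contributes a quarter of its rows to the test set, so the 5:1 imbalance of the toy data is preserved on both sides of the split.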


In [11]:
#tf-idf vectorizer; X and Y defined
#each word's tf-idf weight is the product of two quantities:
#its term frequency (tf) within a sentence and its inverse document frequency (idf) across all sentences.
vectorizer = TfidfVectorizer(stop_words='english', 
                             lowercase=True,       
                             min_df=2,             
                             max_df=0.5,           
                             use_idf=True,
                             smooth_idf=True,
                             norm='l2',
                             max_features=1200
                             )
#we kept features at 1200 max.
#min_df=2 keeps only words that appear in at least two sentences; max_df=0.5 discards words that appear in more than 50% of sentences
X_train = df_train['sent']
X_test = df_test['sent']
Y_train = df_train['president']
Y_test = df_test['president']

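#note: fitting on all of df lets the test sentences influence the vocabulary and idf weights;
#fitting on X_train alone would avoid that leakage, but would change the features reported below
vectorizer.fit(df['sent'])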
X_train_tfidf = vectorizer.transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

print(X_train_tfidf.shape)
print(X_test_tfidf.shape)

(1486, 1200)
(496, 1200)


In [12]:
#average tf-idf weight of each word across sentences; top 15 words listed
#get_feature_names_out() needs scikit-learn 1.0 or later; older releases use get_feature_names()
feature_names = vectorizer.get_feature_names_out()

weights = np.asarray(X_train_tfidf.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'word': feature_names, 'avg_weight': weights})
print("Train:\n", weights_df.sort_values(by='avg_weight', ascending=False).head(15))

weights = np.asarray(X_test_tfidf.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame({'word': feature_names, 'avg_weight': weights})
print("\nTest:\n", weights_df.sort_values(by='avg_weight', ascending=False).head(15))

Train:
             word  avg_weight
1190       world    0.025320
712       people    0.021569
456           ha    0.020923
661       nation    0.020221
440   government    0.017588
676          new    0.016393
47       america    0.016049
710        peace    0.014085
418      freedom    0.013564
1083        time    0.013150
442        great    0.012977
220      country    0.012584
1186        work    0.012424
576         life    0.011983
549         know    0.010912

Test:
             word  avg_weight
661       nation    0.026600
456           ha    0.024627
712       people    0.020957
576         life    0.018546
440   government    0.016624
47       america    0.015001
1190       world    0.014699
1083        time    0.014211
220      country    0.013876
572          let    0.013603
442        great    0.012953
145       change    0.012558
959        shall    0.012193
611          man    0.011687
676          new    0.011192


A high tf-idf score means that a word carries a lot of weight in the sentences where it appears without being so common that idf discounts it; the averages above rank words by that weight across all sentences. Ignoring "ha" (the lemmatizer's mangling of "has", a data cleaning misstep), words like "nation", "government", "peace" and "freedom" suggest that presidents have preferred to use inaugural addresses to make promises: promises to do right by core American values as well as by their civic duties.
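To show where these weights come from, the following sketch computes tf-idf by hand on three invented sentences. It follows the smoothed idf formula that scikit-learn documents for `smooth_idf=True`, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalization of each row. The function name `tfidf` and the toy sentences are assumptions for illustration, and the sketch leaves out the stop-word, `min_df`, `max_df` and `max_features` filtering used in the notebook.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (rows, idf): one {term: weight} dict per document, plus the idf table."""
    n = len(docs)
    tokenized = [d.lower().split() for d in docs]
    # document frequency: in how many documents each term occurs
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1 for t, c in df.items()}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)
        raw = {t: tf[t] * idf[t] for t in tf}
        norm = math.sqrt(sum(v * v for v in raw.values()))  # L2 norm of the row
        rows.append({t: v / norm for t, v in raw.items()})
    return rows, idf

docs = ["peace for the nation", "the nation works", "the people want peace"]
rows, idf = tfidf(docs)
for term in ("the", "nation", "for"):
    print(term, round(idf[term], 3))
```

A word that occurs in every sentence ("the") gets the minimum idf of 1, while a word that occurs in a single sentence ("for") gets the largest, which is why rarer words dominate the weight of the sentences they appear in.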

In [13]:
#latent semantic analysis (LSA)
# using singular value decomposition (SVD), a dimensionality reduction tool, we reduce our 1200 tf-idf features to 250 components (about 21%).
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

svd = TruncatedSVD(250)
lsa_pipe = make_pipeline(svd, Normalizer())

X_train_lsa = lsa_pipe.fit_transform(X_train_tfidf)
X_test_lsa = lsa_pipe.transform(X_test_tfidf)

variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print('Percent variance captured:', total_variance*100)

sent_by_component = pd.DataFrame(X_train_lsa, index=X_train)

for i in range(7):
    print('Components {}:'.format(i))
    print(sent_by_component.loc[:, i].sort_values(ascending=False)[:5])

Percent variance captured: 61.191536907593516
Components 0:
sent
I am sure our own people will not misunderstand nor will the world misconstrue                                0.522835
We have come to a new realization of our place in the world and a new appraisal of our Nation by the world    0.472248
I also know the people of the world                                                                           0.462371
Is our world gone                                                                                             0.433433
Across the world we see them embraced and we rejoice                                                          0.433433
Name: 0, dtype: float64
Components 1:
sent
Across the world we see them embraced and we rejoice                                                                        0.628239
Is our world gone                                                                                                           0.628239
Is a new world coming          

Using 250 components, about 21% of the 1,200 features, the model still captures 61% of the text's variance. While the components' messages are generally ones of "hope" for a brighter, united future, components 3 and 4 bring attention to the American people. Component 3 mentions their hard work ("toil and sweat") and cosmopolitan nature, whereas Component 4 also draws attention to their innocence and good intentions ("our own people will not misunderstand nor will the world misconstrue", "Our people must give and take").
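For intuition about what a single LSA component is, the sketch below finds the leading right singular vector of a tiny, invented sentence-by-word matrix using power iteration in plain Python. This is a swapped-in technique chosen to keep the example dependency-free; TruncatedSVD uses a different solver, and its `explained_variance_ratio_` is based on variance rather than the squared-norm share computed here. The names `leading_component`, `vocab` and `toy_tfidf` are hypothetical.

```python
import math

def leading_component(rows, iters=200):
    """Power iteration on X^T X: returns (direction over words, score per sentence)."""
    n_features = len(rows[0])
    v = [1.0 / math.sqrt(n_features)] * n_features
    for _ in range(iters):
        # project every sentence onto the current direction: scores = X v
        scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
        # pull the direction toward the sentences with large scores: X^T scores
        new_v = [sum(s * row[j] for s, row in zip(scores, rows)) for j in range(n_features)]
        norm = math.sqrt(sum(w * w for w in new_v))
        v = [w / norm for w in new_v]
    scores = [sum(x * w for x, w in zip(row, v)) for row in rows]
    return v, scores

vocab = ["world", "peace", "work"]
toy_tfidf = [          # invented weights, one row per sentence
    [1.0, 1.0, 0.0],
    [1.0, 0.8, 0.0],
    [0.0, 0.0, 1.0],
    [0.9, 1.0, 0.1],
]
direction, scores = leading_component(toy_tfidf)
total = sum(x * x for row in toy_tfidf for x in row)
share = sum(s * s for s in scores) / total   # share of squared norm on this component
print(dict(zip(vocab, (round(w, 2) for w in direction))), round(share, 2))
```

The component loads on "world" and "peace" together because those words co-occur in the toy sentences, and the sentences that score highest on it are the ones using both. This is the same reading applied to the component listings above.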

### Supervised Learning: RFC, LR, SVC and Gradient Boosting

In [14]:
#we will now move on to Supervised Learning, using our generated features to obtain Classification accuracy scores. 
def Classify(clf):
    X_train = X_train_tfidf
    X_test = X_test_tfidf
    clf.fit(X_train, Y_train)
    print('TF-IDF:')
    print('Train accuracy:', clf.score(X_train, Y_train))
    print('Test accuracy:', clf.score(X_test, Y_test))
    print('Cross Validation:', cross_val_score(clf, X_train, Y_train, cv=5))
    
    X_train = X_train_lsa
    X_test = X_test_lsa
    clf.fit(X_train, Y_train)
    print('LSA:')
    print('Train accuracy:', clf.score(X_train, Y_train))
    print('Test accuracy:', clf.score(X_test, Y_test))
    print('Cross Validation:', cross_val_score(clf, X_train, Y_train, cv=5))

In [15]:
#Random Forest Classifier
clf = RandomForestClassifier(n_estimators=100,
                             max_depth=4,
                             random_state=42,
                             class_weight=None 
                            )   
Classify(clf)

TF-IDF:
Train accuracy: 0.2126514131897712
Test accuracy: 0.1350806451612903
Cross Validation: [0.14144737 0.1589404  0.17449664 0.14383562 0.15862069]
LSA:
Train accuracy: 0.3882907133243607
Test accuracy: 0.13306451612903225
Cross Validation: [0.13815789 0.16225166 0.16778523 0.15068493 0.16896552]


In [16]:
#Logistic Regression
clf_1 = LogisticRegression(penalty='l2',
                           fit_intercept=False,
                           class_weight=None,
                           random_state=42, 
                           solver='lbfgs'
                          )
Classify(clf_1)

TF-IDF:
Train accuracy: 0.8600269179004038
Test accuracy: 0.27419354838709675
Cross Validation: [0.36513158 0.34437086 0.32214765 0.29452055 0.37586207]
LSA:
Train accuracy: 0.6049798115746972
Test accuracy: 0.2762096774193548
Cross Validation: [0.33223684 0.31125828 0.30536913 0.28082192 0.35517241]


In [17]:
#Support Vector Classifier
clf_2 = SVC(C=0.5, 
            class_weight=None
           )
    
Classify(clf_2)

TF-IDF:
Train accuracy: 0.09757738896366083
Test accuracy: 0.10483870967741936
Cross Validation: [0.09539474 0.09602649 0.09731544 0.09931507 0.1       ]
LSA:
Train accuracy: 0.09757738896366083
Test accuracy: 0.10483870967741936
Cross Validation: [0.09539474 0.09602649 0.09731544 0.09931507 0.1       ]


In [18]:
#Gradient Boosting
clf_3 = GradientBoostingClassifier(learning_rate=0.01)
Classify(clf_3)

TF-IDF:
Train accuracy: 0.509421265141319
Test accuracy: 0.18951612903225806
Cross Validation: [0.18092105 0.20529801 0.22483221 0.21917808 0.2137931 ]
LSA:
Train accuracy: 0.6345895020188426
Test accuracy: 0.1532258064516129
Cross Validation: [0.15460526 0.13576159 0.16107383 0.14726027 0.15517241]


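The training and testing accuracy scores leave much to be desired. Despite tweaking the parameters for each model, there are clear signs of overfitting, particularly for Logistic Regression (train 0.86 versus test 0.27 on TF-IDF) and Gradient Boosting (0.51 versus 0.19). SVC had the lowest scores and underfit severely: its train and test accuracies of about 0.10 are identical across both feature sets and are consistent with the share of the largest class (Coolidge, about 9.9% of sentences), which is what a model that predicts a single president for every sentence would produce. The accuracy scores were generally low, and even with adjustments, none of the supervised learning techniques cracked the desired score of 0.45. Logistic Regression had the highest test accuracy (about 0.27 on both feature sets), with LSA features overfitting less (train 0.60 versus 0.86). Its cross-validation scores (0.28 to 0.38) still fall far below its training accuracy, confirming considerable overfitting.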
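The reported scores can be put in context with a little arithmetic on the figures printed above: a majority-class check for SVC and the train-minus-test gap per model on TF-IDF features. All numbers below are copied or rounded from the notebook output; the variable names are hypothetical.

```python
# Figures copied from the notebook output above.
n_train, n_test = 1486, 496
coolidge_share = 0.099395            # largest class in the sentence-share table
svc_train_acc = 0.09757738896366083
svc_test_acc = 0.10483870967741936

coolidge_sentences = round(coolidge_share * (n_train + n_test))
svc_train_hits = round(svc_train_acc * n_train)
svc_test_hits = round(svc_test_acc * n_test)
print("Coolidge sentences:", coolidge_sentences)
print("SVC correct, train + test:", svc_train_hits + svc_test_hits)

# (train, test) accuracy on TF-IDF features, rounded to four decimals.
tfidf_scores = {
    "Random Forest": (0.2127, 0.1351),
    "Logistic Regression": (0.8600, 0.2742),
    "SVC": (0.0976, 0.1048),
    "Gradient Boosting": (0.5094, 0.1895),
}
gaps = {name: round(train - test, 4) for name, (train, test) in tfidf_scores.items()}
for name, gap in sorted(gaps.items(), key=lambda item: -item[1]):
    print(f"{name}: train-test gap {gap:+.4f}")
```

The number of sentences SVC gets right across train and test equals the number of Coolidge sentences, which is consistent with the model predicting Coolidge for every sentence. The gaps make the overfitting ranking explicit, with Logistic Regression far ahead of the other models.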

### Conclusion

This project was particularly challenging for two main reasons. First, I wasn't sure if the data cleaning method was best suited for this set of texts. Second, the reason for a less-than-desirable accuracy score was not immediately clear to me. The main areas for improvement are the following: 
- better addressing the class imbalances among the inaugural addresses being used (as noted earlier)
- better identifying and removing stop words that have an adverse effect on accuracy scoring
- choosing a larger dataset or incorporating more text files from inaugural.fileids(); prior to exploring the 18 texts, I did not anticipate the individual text files to be so small, but for the sake of computer memory, decided to commit to the initial dataset size. 
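As one possible way to act on the class-imbalance point, the random forest, logistic regression and SVC models above were all run with `class_weight=None`, and each of them also accepts `class_weight='balanced'`. The sketch below reproduces the heuristic that scikit-learn documents for that setting, weight(c) = n_samples / (n_classes * count(c)), on invented label counts. The function name `balanced_class_weights` and the toy counts are assumptions, not figures from the notebook.

```python
from collections import Counter

def balanced_class_weights(labels):
    """weight(c) = n_samples / (n_classes * count(c)), the 'balanced' heuristic."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}

# Invented counts, only to show the effect of imbalance on the weights.
toy_labels = ["Coolidge"] * 6 + ["Taft"] * 3 + ["Kennedy"] * 1
weights = balanced_class_weights(toy_labels)
for president, weight in weights.items():
    print(president, round(weight, 3))
```

Presidents with few sentences receive proportionally larger weights, so their misclassifications cost more during training. Whether this would improve the scores here is untested.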

Some interesting observations: compared to supervised learning, our two unsupervised learning techniques generated far more insight into the dataset's semantics. They stood out in particular against the Random Forest Classifier, whose greatest drawback is its "black box", hard-to-make-sense-of nature. While adjusting the parameters of the supervised models did little to improve the overall scores, it did raise scores or reduce overfitting for some. Lowering the learning_rate for gradient boosting, for example, reduced the train accuracy from the high 90s to about 51% (TF-IDF) and 63% (LSA), lessening a glaring overfitting problem. 