For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be either a series of novels, chapters, or articles. Anything you'd like. It just has to have multiple entries of varying characteristics. At least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic of from the same branch of literature - something to make classification a bit more difficult than obviously different subjects).

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Using those features then build models to attempt to classify your texts by author. Try different permutations of unsupervised and supervised techniques to see which combinations have the best performance.

Lastly return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.

Your end result should be a write up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write up, and remember to include visuals to help tell your story!

In [1]:
# Set up data science environment.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scipy
import spacy
import re
import warnings
from bokeh.io import output_notebook
from bokeh.layouts import column
from bokeh.models import HoverTool, CustomJS, ColumnDataSource, Slider
from bokeh.palettes import all_palettes
from bokeh.plotting import figure, show
from collections import Counter
from gensim import corpora, models
from gensim.models.ldamodel import LdaModel
from nltk.corpus import inaugural, stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler, normalize, Normalizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from typing import Dict
output_notebook()

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format
sns.set_style('white')

# Suppress annoying harmless error.
warnings.filterwarnings(
    action='ignore',
    module='scipy',
    message='internal gelsd'
)

# Data Cleaning, Processing, and Language Parsing

In [2]:
# Create lists for files and presidents.
files = ["1789-Washington.txt",
         "1801-Jefferson.txt",
         "1861-Lincoln.txt",
         "1933-Roosevelt.txt",
         "1953-Eisenhower.txt",
         "1961-Kennedy.txt",
         "1981-Reagan.txt",
         "1989-Bush.txt",
         "1993-Clinton.txt",
         "2009-Obama.txt"]

presidents = ["washington",
              "jefferson",
              "lincoln",
              "fdr",
              "eisenhower",
              "kennedy",
              "reagan",
              "ghwbush",
              "clinton",
              "obama"]

# Control to make sure both lists are the same length.
assert len(files) == len(presidents)

In [3]:
# Quick function to open all files needed (and close them again).
docs = []
for file_name, president in zip(files, presidents):
    with open(f'./inaugural/{file_name}') as f:
        doc = f.read()
        docs.append((doc, president))

In [4]:
# Utility function to clean text.
def text_cleaner(text: str) -> str:
    """Function to strip all characters except letters in words."""
    
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = re.sub("[\<].*?[\>]", "", text)
    text = ' '.join(text.split())
    return text

In [5]:
# Use text_cleaner on the docs, combine them into data frame (clean_docs).
clean_docs = []
for doc, pres in docs:
    clean_doc = text_cleaner(doc)
    clean_docs.append((clean_doc, pres))

In [6]:
# Iterate through each doc and print the first 1000 characters for inspection.
for doc, pres in clean_docs:
    print(doc[:1000], pres.upper()) 
    print()

Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but over

In [7]:
# Define nlp as spacy.
nlp = spacy.load('en')
# Create an empty list for df.
df_list = []


# Create a function to parse data.
def nlp_text(text_file: str) -> doc:
    """Function that takes a text file and tokenizes it with spacy."""
    
    
    doc = nlp(text_file)
    return doc


# Create a function to lemmatize sentences.
def sentences(doc_nlp: str, speaker: str) -> [str, str]:
    """Function that takes two strings, lemmatizes the first string and 
    returns a list with two strings.
    """
    
    
    return [[sent.lemma_, speaker] for sent in doc_nlp.sents]


# Create a function to combine groups of sentences into one data frame.
def sentences_to_df(sents):
    """Function that takes a string and returns a data frame."""

    
    return pd.DataFrame(sents)


# Calling each function.
for doc, pres in clean_docs:
    parsed = nlp_text(doc)
    sents = sentences(parsed, pres)
    df = sentences_to_df(sents)
    df_list.append(df)

In [8]:
# Combine each sentence data frame into one master data frame.
sent_df = pd.concat([*df_list])

In [9]:
# Rename columns.
sent_df.columns = ['sentence', 'President']

# Check the count of sents per President.
sent_df.President.value_counts()

ghwbush       145
lincoln       139
reagan        130
eisenhower    121
obama         113
fdr            86
clinton        82
kennedy        53
jefferson      42
washington     25
Name: President, dtype: int64

In [10]:
# Filter out pronouns from results.
sent_df['sentence'] = sent_df['sentence'].str.replace('-PRON-', '')

# Creating Features

In [11]:
# Splitting the data.
X = sent_df.sentence
y = sent_df.President
X_train_eval, X_holdout, y_train_eval, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=15)

In [12]:
# Splitting into train/eval/holdout groups.
X_train, X_eval, y_train, y_eval = train_test_split(
    X_train_eval, y_train_eval, test_size=0.25, random_state=15)

In [13]:
# Create base parameters dictionary.
base_param_dict = {'strip_accents': 'unicode',
                   'lowercase': True,
                   'stop_words': 'english',
                   'ngram_range': (1, 3),
                   'max_df': 0.5,
                   'min_df': 5,
                   'max_features': 1000}

## Bag of Words

In [14]:
# Instantiate CountVectorizer.
bow = CountVectorizer(**base_param_dict)

In [15]:
# Convert X_train, X_test into dfs of bags of words.
_bow_train = bow.fit_transform(X_train)
_bow_eval = bow.transform(X_eval)
_bow_holdout = bow.transform(X_holdout)
assert len(X_train) == _bow_train.shape[0]  # df and sparse-matrix

# Find feature names.
feature_names = bow.get_feature_names()

# Sparse matrix to data frame.
X_train_bow = pd.DataFrame(_bow_train.toarray(), columns=feature_names)
X_eval_bow = pd.DataFrame(_bow_eval.toarray(), columns=feature_names)
X_holdout_bow = pd.DataFrame(_bow_holdout.toarray(), columns=feature_names)

## Tfidf

In [16]:
# Instantiate Tfidf.
tfidf = TfidfVectorizer(**base_param_dict)

In [17]:
# Convert X_train, X_test into scipy sparse matrices of tfidf values.
_tfidf_train = tfidf.fit_transform(X_train)
_tfidf_eval = tfidf.transform(X_eval)
_tfidf_holdout = tfidf.transform(X_holdout)
assert len(X_train) == _tfidf_train.shape[0]  # df and sparse-matrix

# Find feature names.
feature_names_tfidf = tfidf.get_feature_names()

# Sparse matrix to data frames.
X_train_tfidf = pd.DataFrame(
    _tfidf_train.toarray(), columns=feature_names_tfidf)
X_eval_tfidf = pd.DataFrame(
    _tfidf_eval.toarray(), columns=feature_names_tfidf)
X_holdout_tfidf = pd.DataFrame(
    _tfidf_holdout.toarray(), columns=feature_names_tfidf)

In [18]:
# Calculate weights on training data.
weights = np.asarray(X_train_tfidf.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame(
    {'word': tfidf.get_feature_names(), 'avg_weight': weights})
print("\nTrain Weights:\n", weights_df.sort_values(
    by='avg_weight', ascending=False).head(10))

# Calculate weights on eval data.
weights = np.asarray(X_eval_tfidf.mean(axis=0)).ravel().tolist()
weights_df = pd.DataFrame(
    {'word': tfidf.get_feature_names(), 'avg_weight': weights})
print("\nEval Weights:\n", weights_df.sort_values(
    by='avg_weight', ascending=False).head(10))


Train Weights:
            word  avg_weight
147      people       0.032
227       world       0.028
80   government       0.028
131      nation       0.026
206        time       0.025
108         let       0.023
79         good       0.023
120        make       0.022
111        life       0.022
81        great       0.022

Eval Weights:
            word  avg_weight
135         new       0.029
226        work       0.029
81        great       0.028
74      freedom       0.027
132    national       0.027
194      states       0.025
131      nation       0.024
80   government       0.023
147      people       0.023
158   principle       0.023


## Latent Semantic Analysis (LSA)

In [19]:
# Instantiate MinMaxScaler. Create train/eval/holdout groups for Tfidf.
scaler = MinMaxScaler()

# Tfidf
#X_train_tfidf_scaled = pd.DataFrame(
#    scaler.fit_transform(X_train_tfidf), columns=feature_names_tfidf)
#X_eval_tfidf_scaled = pd.DataFrame(
#    scaler.transform(X_eval_tfidf), columns=feature_names_tfidf)
#X_holdout_tfidf_scaled = pd.DataFrame(
#    scaler.transform(X_holdout_tfidf), columns=feature_names_tfidf)

In [20]:
# Reduce feature space to 100 features.
svd = TruncatedSVD(100)

# Make pipeline to run SVD and normalize results.
lsa_pipe = make_pipeline(svd, Normalizer())

# Fit with training data, transform test data.
X_train_lsa = lsa_pipe.fit_transform(X_train_tfidf)
X_eval_lsa = lsa_pipe.transform(X_eval_tfidf)
X_holdout_lsa = lsa_pipe.transform(X_holdout_tfidf)

# Examine variance captured in reduced feature space.
variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print('Percent variance captured by components:', total_variance*100)

sent_by_component = pd.DataFrame(X_train_lsa, index=X_train)

# Look at values from first 5 components.
for i in range(6):
    print('Component {}:'.format(i))
    print(sent_by_component.loc[:, i].sort_values(ascending=False)[:5])

Percent variance captured by components: 74.96045256340335
Component 0:
sentence
 government have no power except that grant  by the people .                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           

In [21]:
# Create train/eval/holdout groups for LSA.
X_train_lsa_scaled = pd.DataFrame(scaler.fit_transform(X_train_lsa))
X_eval_lsa_scaled = pd.DataFrame(scaler.transform(X_eval_lsa))
X_holdout_lsa_scaled = pd.DataFrame(scaler.transform(X_holdout_lsa))

# Clustering Models

In [22]:
# Clustering models
models = []
names = []
plot_nums = []
silhouettes = []
clust = []

for clusters in range(2, 11):
    models.append(
        (0, 'KMeans', KMeans(n_clusters=clusters,
                             init='k-means++', random_state=15)))
    models.append(
        (1, 'MiniBatch', MiniBatchKMeans(init='random',
                                         n_clusters=clusters,
                                         batch_size=500)))
    #models.append((2, 'Spectral', SpectralClustering(n_clusters=clusters)))

for plot_num, name, model in models:
    names.append(name)
    model.fit(X_train_tfidf)
    labels = model.labels_
    print(model)
    if len(set(labels)) > 1:
        ypred = model.fit_predict(X_train_tfidf)
        silhouette = metrics.silhouette_score(
            X_train_tfidf, labels, metric='euclidean')
        silhouettes.append(silhouette)
        #ax[plot_num].set_title(name)
        #plotting(plot_num, labels, ypred)
        if silhouette > 0:
            print('clusters: {}\t silhouette: {}\n'.format(
                model.n_clusters, silhouette))
            print(name, '\n', pd.crosstab(ypred, labels), '\n')

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=15, tol=0.0001, verbose=0)
clusters: 2	 silhouette: 0.02403808152620107

KMeans 
 col_0    0   1
row_0         
0      487   0
1        0  39 

MiniBatchKMeans(batch_size=500, compute_labels=True, init='random',
        init_size=None, max_iter=100, max_no_improvement=10, n_clusters=2,
        n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0,
        verbose=0)
clusters: 2	 silhouette: 0.020863830775012474

MiniBatch 
 col_0    0    1
row_0          
0      369  143
1       10    4 

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=3, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=15, tol=0.0001, verbose=0)
clusters: 3	 silhouette: 0.024610759668237767

KMeans 
 col_0    0   1   2
row_0             
0      395   0   0
1        0  59   0
2        0   0  72 

Mi

clusters: 10	 silhouette: 0.038866084346253535

KMeans 
 col_0    0   1   2   3   4   5   6   7   8   9
row_0                                         
0      251   0   0   0   0   0   0   0   0   0
1        0  43   0   0   0   0   0   0   0   0
2        0   0  24   0   0   0   0   0   0   0
3        0   0   0  29   0   0   0   0   0   0
4        0   0   0   0  21   0   0   0   0   0
5        0   0   0   0   0  61   0   0   0   0
6        0   0   0   0   0   0  27   0   0   0
7        0   0   0   0   0   0   0  32   0   0
8        0   0   0   0   0   0   0   0  18   0
9        0   0   0   0   0   0   0   0   0  20 

MiniBatchKMeans(batch_size=500, compute_labels=True, init='random',
        init_size=None, max_iter=100, max_no_improvement=10, n_clusters=10,
        n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0,
        verbose=0)
clusters: 10	 silhouette: 0.02822580157981936

MiniBatch 
 col_0    0  1  2   3   4   5  6   7   8  9
row_0                                    

## Tfidf

In [23]:
# Re-run KMeans and extract cluster information.
model_tfidf = KMeans(n_clusters=10, random_state=15).fit(X_train_tfidf)

# Extract cluster assignments for each data point.
labels = model_tfidf.labels_

In [24]:
# Create cluster assignment for eval, holdout groups.
X_eval_tfidf_labels = model_tfidf.predict(X_eval_tfidf)
X_holdout_tfidf_labels = model_tfidf.predict(X_holdout_tfidf)

# Create a column for cluster labels.
X_eval_tfidf['clusters'] = X_eval_tfidf_labels
X_holdout_tfidf['clusters'] = X_holdout_tfidf_labels

X_train_tfidf['clusters'] = labels

In [25]:
# Aggregate by cluster.
X_train_tfidf_clusters = X_train_tfidf.groupby(
    ['clusters'], as_index=False).mean()
X_train_tfidf_clusters

Unnamed: 0,clusters,act,action,administration,america,american,american people,americans,ask,authority,...,way,willing,wish,woman,word,work,world,write,year,young
0,0,0.015,0.005,0.003,0.019,0.002,0.0,0.011,0.01,0.011,...,0.003,0.0,0.004,0.006,0.001,0.0,0.001,0.006,0.003,0.007
1,1,0.0,0.0,0.0,0.007,0.007,0.0,0.014,0.012,0.0,...,0.0,0.013,0.007,0.008,0.009,0.006,0.025,0.0,0.0,0.017
2,2,0.0,0.0,0.0,0.017,0.0,0.0,0.012,0.0,0.0,...,0.03,0.03,0.0,0.0,0.01,0.0,0.017,0.0,0.015,0.0
3,3,0.0,0.0,0.0,0.016,0.021,0.0,0.019,0.0,0.0,...,0.037,0.018,0.0,0.0,0.0,0.0,0.015,0.0,0.0,0.0
4,4,0.0,0.009,0.018,0.0,0.023,0.0,0.043,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.01,0.0,0.0,0.0
5,5,0.005,0.009,0.008,0.029,0.021,0.023,0.0,0.006,0.016,...,0.008,0.0,0.008,0.015,0.005,0.0,0.195,0.0,0.006,0.0
6,6,0.011,0.036,0.0,0.015,0.013,0.014,0.027,0.016,0.0,...,0.0,0.0,0.0,0.0,0.16,0.0,0.025,0.053,0.03,0.012
7,7,0.0,0.016,0.023,0.0,0.007,0.008,0.0,0.037,0.004,...,0.022,0.0,0.01,0.0,0.0,0.0,0.0,0.0,0.041,0.0
8,8,0.027,0.0,0.0,0.021,0.0,0.0,0.036,0.017,0.0,...,0.0,0.019,0.0,0.0,0.015,0.505,0.0,0.018,0.0,0.0
9,9,0.019,0.0,0.0,0.037,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.019,0.0,0.0,0.0,0.0,0.0


In [26]:
# View the ten most common terms in each cluster.
cluster0_train_tfidf = X_train_tfidf[X_train_tfidf['clusters'] == 0]
cluster0_train_tfidf.mean().sort_values(ascending=False)[:10]

day            0.028
freedom        0.022
god            0.021
right          0.020
union          0.019
constitution   0.019
america        0.019
make           0.019
man            0.018
old            0.017
dtype: float64

In [27]:
cluster0_eval_tfidf = X_eval_tfidf[X_eval_tfidf['clusters'] == 0]
cluster0_eval_tfidf.mean().sort_values(ascending=False)[:10]

principle    0.031
law          0.029
believe      0.027
faith        0.026
states       0.024
way          0.023
new          0.023
generation   0.021
union        0.021
case         0.021
dtype: float64

In [28]:
cluster1_train_tfidf = X_train_tfidf[X_train_tfidf['clusters'] == 1]
cluster1_train_tfidf.mean().sort_values(ascending=False)[1:11]

time         0.264
nation       0.228
change       0.047
great        0.040
government   0.035
live         0.032
make         0.032
require      0.031
god          0.029
believe      0.029
dtype: float64

In [29]:
cluster1_eval_tfidf = X_eval_tfidf[X_eval_tfidf['clusters'] == 1]
cluster1_eval_tfidf.mean().sort_values(ascending=False)[1:11]

nation    0.289
time      0.209
freedom   0.134
great     0.101
new       0.095
right     0.094
think     0.075
speak     0.073
power     0.062
future    0.060
dtype: float64

In [30]:
cluster2_train_tfidf = X_train_tfidf[X_train_tfidf['clusters'] == 2]
cluster2_train_tfidf.mean().sort_values(ascending=False)[1:11]

good         0.444
make         0.088
power        0.070
government   0.061
hope         0.060
life         0.043
discipline   0.043
equal        0.040
public       0.036
peace        0.036
dtype: float64

In [31]:
cluster2_eval_tfidf = X_eval_tfidf[X_eval_tfidf['clusters'] == 2]
cluster2_eval_tfidf.mean().sort_values(ascending=False)[1:11]

good       0.357
land       0.191
national   0.175
action     0.141
place      0.138
hold       0.131
friend     0.128
meet       0.126
right      0.116
use        0.101
dtype: float64

In [32]:
cluster3_train_tfidf = X_train_tfidf[
    X_train_tfidf['clusters'] == 3]
cluster3_train_tfidf.mean().sort_values(ascending=False)[1:11]

life      0.323
hand      0.219
heart     0.097
new       0.067
present   0.057
faith     0.052
know      0.043
great     0.041
way       0.037
make      0.030
dtype: float64

In [33]:
cluster3_eval_tfidf = X_eval_tfidf[
    X_eval_tfidf['clusters'] == 3]
cluster3_eval_tfidf.mean().sort_values(ascending=False)[1:11]

life             0.279
hand             0.260
national         0.191
value            0.134
form             0.105
administration   0.096
leader           0.088
majority         0.086
heart            0.080
day              0.075
dtype: float64

In [34]:
cluster4_train_tfidf = X_train_tfidf[
    X_train_tfidf['clusters'] == 4]
cluster4_train_tfidf.mean().sort_values(ascending=False)[1:11]

citizen          0.306
fellow           0.276
fellow citizen   0.175
president        0.095
country          0.064
man              0.054
great            0.049
free             0.048
duty             0.044
americans        0.043
dtype: float64

In [35]:
cluster4_eval_tfidf = X_eval_tfidf[
    X_eval_tfidf['clusters'] == 4]
cluster4_eval_tfidf.mean().sort_values(ascending=False)[1:11]

citizen          0.458
ask              0.247
fellow citizen   0.203
fellow           0.181
america          0.154
today            0.146
hope             0.143
world            0.125
mind             0.120
heart            0.111
dtype: float64

In [36]:
cluster5_train_tfidf = X_train_tfidf[
    X_train_tfidf['clusters'] == 5]
cluster5_train_tfidf.mean().sort_values(ascending=False)[1:11]

world           0.195
people          0.194
free            0.054
states          0.049
united          0.047
great           0.039
shall           0.037
united states   0.036
economic        0.035
peace           0.030
dtype: float64

In [37]:
cluster5_eval_tfidf = X_eval_tfidf[
    X_eval_tfidf['clusters'] == 5]
cluster5_eval_tfidf.mean().sort_values(ascending=False)[1:11]

people            0.177
world             0.175
peace             0.097
america           0.095
great             0.086
american people   0.075
purpose           0.069
american          0.068
year              0.058
americans         0.058
dtype: float64

In [38]:
cluster6_train_tfidf = X_train_tfidf[
    X_train_tfidf['clusters'] == 6]
cluster6_train_tfidf.mean().sort_values(ascending=False)[1:11]

need           0.257
word           0.160
deny           0.098
write          0.053
constitution   0.043
speak          0.040
factory        0.040
great          0.039
material       0.038
moment         0.037
dtype: float64

In [39]:
cluster6_eval_tfidf = X_eval_tfidf[
    X_eval_tfidf['clusters'] == 6]
cluster6_eval_tfidf.mean().sort_values(ascending=False)[1:11]

need        0.501
action      0.152
serve       0.126
care        0.119
congress    0.109
executive   0.107
face        0.106
country     0.100
authority   0.097
today       0.097
dtype: float64

In [40]:
cluster7_train_tfidf = X_train_tfidf[
    X_train_tfidf['clusters'] == 7]
cluster7_train_tfidf.mean().sort_values(ascending=False)[1:11]

government   0.310
support      0.135
shall        0.098
people       0.065
federal      0.057
year         0.041
right        0.041
power        0.039
law          0.039
states       0.038
dtype: float64

In [41]:
cluster7_eval_tfidf = X_eval_tfidf[
    X_eval_tfidf['clusters'] == 7]
cluster7_eval_tfidf.mean().sort_values(ascending=False)[1:11]

government      0.356
states          0.089
opportunity     0.076
continue        0.076
support         0.069
united states   0.067
let             0.063
united          0.061
freedom         0.059
union           0.059
dtype: float64

In [42]:
cluster8_train_tfidf = X_train_tfidf[
    X_train_tfidf['clusters'] == 8]
cluster8_train_tfidf.mean().sort_values(ascending=False)[1:11]

work        0.505
sacrifice   0.063
let         0.056
good        0.051
return      0.049
god         0.046
thing       0.046
moral       0.043
seek        0.042
right       0.039
dtype: float64

In [43]:
cluster8_eval_tfidf = X_eval_tfidf[
    X_eval_tfidf['clusters'] == 8]
cluster8_eval_tfidf.mean().sort_values(ascending=False)[1:11]

work        0.638
know        0.093
new         0.089
people      0.080
task        0.080
old         0.075
friend      0.074
man woman   0.062
woman       0.061
stand       0.060
dtype: float64

In [44]:
cluster9_train_tfidf = X_train_tfidf[
    X_train_tfidf['clusters'] == 9]
cluster9_train_tfidf.mean().sort_values(ascending=False)[1:11]

let          0.506
begin        0.133
fear         0.081
renew        0.058
remember     0.056
hope         0.047
hard         0.041
understand   0.040
america      0.037
far          0.034
dtype: float64

In [45]:
cluster9_eval_tfidf = X_eval_tfidf[
    X_eval_tfidf['clusters'] == 9]
cluster9_eval_tfidf.mean().sort_values(ascending=False)[1:11]

let              0.554
seek             0.197
people           0.159
child            0.140
responsibility   0.138
community        0.135
country          0.120
turn             0.070
end              0.068
generation       0.067
dtype: float64

## LSA

In [46]:
# Re-run KMeans and extract cluster information.
model_lsa = KMeans(n_clusters=10, random_state=42).fit(X_train_lsa_scaled)

# Extract cluster assignments for each data point.
labels = model_lsa.labels_

In [47]:
# Create cluster assignment for eval, holdout groups.
X_eval_lsa_labels = model_lsa.predict(X_eval_lsa_scaled)
X_holdout_lsa_labels = model_lsa.predict(X_holdout_lsa_scaled)

# Create a column for cluster labels.
X_eval_lsa_scaled['clusters'] = X_eval_lsa_labels
X_holdout_lsa_scaled['clusters'] = X_holdout_lsa_labels

X_train_lsa_scaled['clusters'] = labels

In [48]:
# Aggregate by cluster.
X_train_lsa_clusters = X_train_lsa_scaled.groupby(
    ['clusters'], as_index=False).mean()
X_train_lsa_clusters

Unnamed: 0,clusters,0,1,2,3,4,5,6,7,8,...,90,91,92,93,94,95,96,97,98,99
0,0,0.253,0.363,0.51,0.355,0.45,0.421,0.442,0.488,0.485,...,0.425,0.442,0.437,0.629,0.495,0.48,0.51,0.39,0.429,0.39
1,1,0.592,0.36,0.426,0.333,0.484,0.291,0.751,0.573,0.407,...,0.418,0.431,0.435,0.637,0.492,0.478,0.517,0.396,0.424,0.368
2,2,0.249,0.47,0.515,0.375,0.464,0.511,0.502,0.447,0.363,...,0.372,0.312,0.499,0.593,0.452,0.397,0.617,0.507,0.608,0.316
3,3,0.132,0.371,0.499,0.393,0.461,0.414,0.419,0.571,0.369,...,0.407,0.466,0.497,0.593,0.408,0.412,0.548,0.389,0.415,0.447
4,4,0.639,0.43,0.399,0.351,0.48,0.358,0.417,0.458,0.331,...,0.407,0.426,0.445,0.626,0.512,0.479,0.501,0.393,0.429,0.386
5,5,0.4,0.404,0.546,0.353,0.501,0.322,0.485,0.6,0.31,...,0.461,0.46,0.409,0.667,0.524,0.484,0.488,0.366,0.32,0.418
6,6,0.267,0.419,0.426,0.409,0.586,0.375,0.456,0.499,0.412,...,0.344,0.353,0.655,0.71,0.456,0.507,0.615,0.316,0.459,0.476
7,7,0.663,0.345,0.614,0.574,0.38,0.453,0.196,0.777,0.178,...,0.423,0.433,0.441,0.628,0.477,0.471,0.497,0.387,0.417,0.375
8,8,0.41,0.349,0.434,0.284,0.469,0.532,0.398,0.487,0.505,...,0.431,0.413,0.412,0.621,0.485,0.462,0.504,0.398,0.418,0.39
9,9,0.144,0.32,0.454,0.485,0.347,0.355,0.438,0.42,0.551,...,0.461,0.43,0.475,0.619,0.522,0.408,0.454,0.355,0.426,0.347


In [49]:
# View the ten most common terms in each cluster.
cluster0_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 0]
cluster0_train_lsa.mean().sort_values(ascending=False)[:10]

93   0.629
89   0.541
49   0.523
75   0.514
12   0.514
9    0.512
2    0.510
96   0.510
58   0.507
72   0.507
dtype: float64

In [50]:
cluster0_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 0]
cluster0_eval_lsa.mean().sort_values(ascending=False)[:10]

93   0.623
89   0.554
75   0.529
34   0.524
96   0.514
87   0.514
11   0.513
44   0.511
50   0.511
60   0.509
dtype: float64

It appears that cluster0 is a cluster that deals primarily with **people** of one sort or another. Common words in the top ten for each include "man", "woman", "man woman", and "say". 

In [51]:
cluster1_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 1]
cluster1_train_lsa.mean().sort_values(ascending=False)[1:11]

6    0.751
93   0.637
14   0.618
0    0.592
41   0.577
34   0.576
7    0.573
10   0.557
19   0.548
49   0.527
dtype: float64

In [52]:
cluster1_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 1]
cluster1_eval_lsa.mean().sort_values(ascending=False)[1:11]

6    0.721
14   0.672
55   0.634
34   0.633
93   0.620
0    0.618
94   0.588
41   0.572
78   0.571
49   0.571
dtype: float64

Cluster1 has only one word in common for the top ten of both groups: "today". However, it is apparent that this cluster has a lot to do with **ideals**, with words like "good"/"great", "world", "peace", "principle", "nation", "state", and "constitution".

In [53]:
cluster2_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 2]
cluster2_train_lsa.mean().sort_values(ascending=False)[1:11]

61   0.798
32   0.782
63   0.776
66   0.753
64   0.746
87   0.706
41   0.677
34   0.670
44   0.662
89   0.649
dtype: float64

In [54]:
cluster2_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 2]
cluster2_eval_lsa.mean().sort_values(ascending=False)[1:11]

87   0.846
17   0.838
63   0.831
66   0.779
61   0.747
64   0.705
41   0.699
69   0.685
96   0.671
34   0.661
dtype: float64

Cluster2 has a less obvious theme that I would like to call **values**. Two common words are "hand" and "material", but other important words include "strong", "life", "power", "liberty", "trust", and "value".

In [55]:
cluster3_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 3]
cluster3_train_lsa.mean().sort_values(ascending=False)[1:11]

44   0.972
48   0.918
33   0.740
78   0.681
89   0.634
31   0.610
47   0.599
79   0.595
64   0.594
93   0.593
dtype: float64

In [56]:
cluster3_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 3]
cluster3_eval_lsa.mean().sort_values(ascending=False)[1:11]

1    nan
2    nan
3    nan
4    nan
5    nan
6    nan
7    nan
8    nan
9    nan
10   nan
dtype: float64

In [57]:
cluster4_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 4]
cluster4_train_lsa.mean().sort_values(ascending=False)[1:11]

0    0.639
93   0.626
11   0.539
36   0.537
89   0.533
12   0.533
9    0.519
94   0.512
64   0.510
49   0.510
dtype: float64

In [58]:
cluster4_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 4]
cluster4_eval_lsa.mean().sort_values(ascending=False)[1:11]

0    0.668
93   0.610
89   0.577
11   0.570
34   0.544
67   0.542
96   0.533
69   0.530
72   0.528
49   0.519
dtype: float64

In [59]:
cluster5_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 5]
cluster5_train_lsa.mean().sort_values(ascending=False)[1:11]

32   0.830
35   0.760
30   0.727
34   0.724
38   0.673
93   0.667
9    0.648
44   0.621
11   0.614
7    0.600
dtype: float64

In [60]:
cluster5_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 5]
cluster5_eval_lsa.mean().sort_values(ascending=False)[1:11]

32   0.880
35   0.825
30   0.772
34   0.770
38   0.759
75   0.723
93   0.687
83   0.644
44   0.642
12   0.628
dtype: float64

In [61]:
cluster6_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 6]
cluster6_train_lsa.mean().sort_values(ascending=False)[1:11]

37   0.824
72   0.787
57   0.767
93   0.710
75   0.706
63   0.682
92   0.655
51   0.643
69   0.638
9    0.619
dtype: float64

In [62]:
cluster6_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 6]
cluster6_eval_lsa.mean().sort_values(ascending=False)[1:11]

1    nan
2    nan
3    nan
4    nan
5    nan
6    nan
7    nan
8    nan
9    nan
10   nan
dtype: float64

In [63]:
cluster7_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 7]
cluster7_train_lsa.mean().sort_values(ascending=False)[1:11]

7    0.777
0    0.663
93   0.628
2    0.614
3    0.574
19   0.568
41   0.546
87   0.527
33   0.526
34   0.523
dtype: float64

In [64]:
cluster7_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 7]
cluster7_eval_lsa.mean().sort_values(ascending=False)[1:11]

7    0.708
76   0.636
29   0.632
19   0.628
49   0.625
63   0.618
50   0.592
73   0.586
0    0.585
96   0.577
dtype: float64

In [65]:
cluster8_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 8]
cluster8_train_lsa.mean().sort_values(ascending=False)[1:11]

17   0.770
18   0.653
93   0.621
36   0.613
9    0.606
33   0.566
41   0.551
22   0.546
89   0.533
5    0.532
dtype: float64

In [66]:
cluster8_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 8]
cluster8_eval_lsa.mean().sort_values(ascending=False)[1:11]

17   0.847
93   0.635
36   0.634
33   0.607
34   0.601
9    0.582
18   0.559
94   0.550
7    0.547
14   0.545
dtype: float64

In [67]:
cluster9_train_lsa = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == 9]
cluster9_train_lsa.mean().sort_values(ascending=False)[1:11]

33   0.934
34   0.901
44   0.880
28   0.857
38   0.747
20   0.733
26   0.684
49   0.655
9    0.652
24   0.651
dtype: float64

In [68]:
cluster9_eval_lsa = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == 9]
cluster9_eval_lsa.mean().sort_values(ascending=False)[1:11]

1    nan
2    nan
3    nan
4    nan
5    nan
6    nan
7    nan
8    nan
9    nan
10   nan
dtype: float64

Lastly, cluster3 is what I will call **now**. Common words include "time" and "government", and other relevant words include "change", "history", "future", "courage", "generation", and "god" (lemmatized, so we'll say "God").

## Latent Dirichlet Allocation (LDA)

### Set up text for LDA

In [69]:
# Removing numerals.
sent_df['sentence_tokens'] = sent_df.sentence.map(
    lambda x: re.sub(r'\d+', '', x))
# Lower case.
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(lambda x: x.lower())
print(sent_df['sentence_tokens'][0][:500])

0    fellow - citizens of the senate and of the hou...
0    friend and fellow citizens : call upon to unde...
0             fellow - citizens of the united states :
0     be certain that  fellow americans expect that...
0     friend , before  begin the expression of thos...
0    vice president johnson , mr. speaker , mr. chi...
0    senator hatfield , mr. chief justice , mr. pre...
0    mr. chief justice , mr. president , vice presi...
0     fellow citizen , today  celebrate the mystery...
0     fellow citizen :  stand here today humble by ...
Name: sentence_tokens, dtype: object


In [70]:
# Tokenize.
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(
    lambda x: RegexpTokenizer(r'\w+').tokenize(x))
print(sent_df['sentence_tokens'][0][:25])

0    [fellow, citizens, of, the, senate, and, of, t...
0    [friend, and, fellow, citizens, call, upon, to...
0          [fellow, citizens, of, the, united, states]
0    [be, certain, that, fellow, americans, expect,...
0    [friend, before, begin, the, expression, of, t...
0    [vice, president, johnson, mr, speaker, mr, ch...
0    [senator, hatfield, mr, chief, justice, mr, pr...
0    [mr, chief, justice, mr, president, vice, pres...
0    [fellow, citizen, today, celebrate, the, myste...
0    [fellow, citizen, stand, here, today, humble, ...
Name: sentence_tokens, dtype: object


In [71]:
# Stemming.
snowball = SnowballStemmer("english")
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(
    lambda x: [snowball.stem(token) for token in x])
print(sent_df['sentence_tokens'][0][:25])

0    [fellow, citizen, of, the, senat, and, of, the...
0    [friend, and, fellow, citizen, call, upon, to,...
0              [fellow, citizen, of, the, unit, state]
0    [be, certain, that, fellow, american, expect, ...
0    [friend, befor, begin, the, express, of, those...
0    [vice, presid, johnson, mr, speaker, mr, chief...
0    [senat, hatfield, mr, chief, justic, mr, presi...
0    [mr, chief, justic, mr, presid, vice, presid, ...
0    [fellow, citizen, today, celebr, the, mysteri,...
0    [fellow, citizen, stand, here, today, humbl, b...
Name: sentence_tokens, dtype: object


In [72]:
# Stop words.
stop_en = stopwords.words('english')
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(
    lambda x: [t for t in x if t not in stop_en])
print(sent_df['sentence_tokens'][0][:25])

0               [fellow, citizen, senat, hous, repres]
0    [friend, fellow, citizen, call, upon, undertak...
0                       [fellow, citizen, unit, state]
0    [certain, fellow, american, expect, induct, pr...
0    [friend, befor, begin, express, thought, deem,...
0    [vice, presid, johnson, mr, speaker, mr, chief...
0    [senat, hatfield, mr, chief, justic, mr, presi...
0    [mr, chief, justic, mr, presid, vice, presid, ...
0    [fellow, citizen, today, celebr, mysteri, amer...
0    [fellow, citizen, stand, today, humbl, task, b...
Name: sentence_tokens, dtype: object


In [73]:
# Final cleaning.
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(
    lambda x: [t for t in x if len(t) > 1])
print(sent_df['sentence_tokens'][0][:25])

0               [fellow, citizen, senat, hous, repres]
0    [friend, fellow, citizen, call, upon, undertak...
0                       [fellow, citizen, unit, state]
0    [certain, fellow, american, expect, induct, pr...
0    [friend, befor, begin, express, thought, deem,...
0    [vice, presid, johnson, mr, speaker, mr, chief...
0    [senat, hatfield, mr, chief, justic, mr, presi...
0    [mr, chief, justic, mr, presid, vice, presid, ...
0    [fellow, citizen, today, celebr, mysteri, amer...
0    [fellow, citizen, stand, today, humbl, task, b...
Name: sentence_tokens, dtype: object


### Run LDA

In [74]:
texts = sent_df['sentence_tokens'].values
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus,
               id2word=dictionary,
               num_topics=10,
               passes=5,
               minimum_probability=0,
               random_state=15)

In [75]:
# Print topics
lda.print_topics()

[(0,
  '0.013*"america" + 0.012*"nation" + 0.011*"state" + 0.010*"shall" + 0.010*"govern" + 0.009*"god" + 0.008*"let" + 0.008*"world" + 0.007*"renew" + 0.007*"must"'),
 (1,
  '0.010*"great" + 0.010*"nation" + 0.009*"good" + 0.009*"everi" + 0.008*"peopl" + 0.007*"citizen" + 0.007*"mean" + 0.006*"may" + 0.006*"man" + 0.006*"free"'),
 (2,
  '0.016*"may" + 0.013*"constitut" + 0.009*"take" + 0.008*"state" + 0.008*"law" + 0.006*"great" + 0.006*"shall" + 0.006*"say" + 0.006*"union" + 0.006*"first"'),
 (3,
  '0.011*"american" + 0.010*"right" + 0.009*"ani" + 0.009*"onli" + 0.009*"state" + 0.008*"must" + 0.008*"govern" + 0.007*"generat" + 0.007*"peopl" + 0.006*"old"'),
 (4,
  '0.014*"peopl" + 0.012*"govern" + 0.011*"life" + 0.009*"countri" + 0.008*"strength" + 0.007*"ask" + 0.007*"free" + 0.007*"make" + 0.006*"good" + 0.005*"man"'),
 (5,
  '0.013*"peopl" + 0.011*"work" + 0.010*"new" + 0.009*"nation" + 0.008*"make" + 0.008*"great" + 0.007*"american" + 0.007*"man" + 0.007*"freedom" + 0.006*"today"

In [76]:
# Refactoring results of LDA into numpy matrix.
hm = np.array([[y for (x,y) in lda[corpus[i]]] for i in range(len(corpus))])

In [77]:
# Reduce dimensionality using t-SNE.
tsne = TSNE(random_state=15, perplexity=30, early_exaggeration=120)
embedding = tsne.fit_transform(hm)
embedding = pd.DataFrame(embedding, columns=['x','y'])
embedding['hue'] = hm.argmax(axis=1)

In [78]:
# Scatter plot using Bokeh.
source = ColumnDataSource(
    data=dict(x=embedding.x,
              y=embedding.y,
              colors=[all_palettes['Set1'][8] for i in embedding.hue],
              sentence=sent_df.sentence,
              President=sent_df.President,
              alpha=[0.9] * embedding.shape[0],
              size=[7] * embedding.shape[0]
              )
)
hover_tsne = HoverTool(names=["sent_df"], tooltips="""
    <div style="margin: 10">
        <div style="margin: 0 auto; width:300px;">
            <span style="font-size: 12px; font-weight: bold;">Title:</span>
            <span style="font-size: 12px">@title</span>
            <span style="font-size: 12px; font-weight: bold;">Year:</span>
            <span style="font-size: 12px">@year</span>
        </div>
    </div>
    """)
tools_tsne = [hover_tsne, 'pan', 'wheel_zoom', 'reset']
plot_tsne = figure(plot_width=700, plot_height=700,
                   tools=tools_tsne, title='Inaugural Addresses')
plot_tsne.circle('x', 'y', size='size', fill_color='colors',alpha='alpha',
                 line_alpha=0, line_width=0.01, source=source, name="sent_df")

callback = CustomJS(args=dict(source=source), code=
    """var data = source.data;
    var f = cb_obj.value
    x = data['x']
    y = data['y']
    colors = data['colors']
    alpha = data['alpha']
    title = data['title']
    President = data['President']
    size = data['size']
    for (i = 0; i < x.length; i++) {
        if (year[i] <= f) {
            alpha[i] = 0.9
            size[i] = 7
        } else {
            alpha[i] = 0.05
            size[i] = 4
        }
    }
    source.change.emit();
    """)

slider = Slider(
    start=sent_df.President.min(), end=sent_df.sentence.max(),values=
    ['Washington', 'Jefferson', 'Lincoln', 'FDR', 'Eisenhower', 'Kennedy',
     'Reagan', 'GHWBush', 'Clinton', 'Obama'], step=1, title="Inaugural Speeches")
slider.js_on_change('value', callback)

layout = column(plot_tsne)
show(layout)

ValueError: expected a value of type Real, got clinton of type str

# Prepare for Predictive Modeling

In [79]:
'''Create baseline score to beat. GHWBush had the most sentences, so guessing 
him for all sentences would give this percentage.
'''

print('Baseline score to beat:', sum(
    (sent_df.President == 'ghwbush') / len(sent_df.President)))

Baseline score to beat: 0.15491452991452967


In [80]:
# Pipeline helpers.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=15)

In [81]:
# Instantiate the models.
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=15)
tree = DecisionTreeClassifier(random_state=15)
forest = RandomForestClassifier(max_depth=10, random_state=15)
boost = GradientBoostingClassifier(random_state=15)
nb = BernoulliNB()

In [82]:
# Set up _kwargs files for convenience.
tfidf_kwargs = {'X_train': X_train_tfidf,'y_train': y_train,
                'X_eval': X_eval_tfidf,'y_eval': y_eval}
                #'X_holdout': X_holdout_tfidf, 'y_holdout': y_holdout}

lsa_kwargs = {'X_train': X_train_lsa_scaled, 'y_train': y_train,
              'X_eval': X_eval_lsa_scaled, 'y_eval': y_eval}
              #'X_holdout': X_holdout_tfidf_scaled, 'y_holdout': y_holdout}

In [83]:
# Tune parameter grids.
log_reg_params = {'model__C': [1, 10, 100, 1000]}
tree_params = {'model__criterion': ['gini']}
forest_params = {'model__n_estimators': [100, 200, 300,400],
                 'model__max_depth': [None, 5, 10]}
boost_params = {'model__n_estimators': [100]}
nb_params = {'model__alpha': [1]}

In [84]:
# Function to fit and predict all working kernals.


def fit_and_predict(model, params: Dict,
                    X_train: pd.DataFrame,
                    y_train: pd.DataFrame,
                    X_eval: pd.DataFrame,
                    y_eval: pd.DataFrame) -> None:
    """
    Takes an instantiated sklearn model, training data (X_train, y_train), 
    and performs cross-validation and then prints the mean of the cross-
    validation accuracies.
    """
    assert len(X_train) == len(y_train)
    assert len(X_eval) == len(y_eval)
    # assert len(X_holdout) == len(y_holdout)
    pipe = Pipeline(steps=[('model', model)])
    clf = GridSearchCV(pipe, cv=skf, param_grid=params, n_jobs=2)
    clf.fit(X_train, y_train)
    print('The mean cross_val accuracy on train is',
          f'{clf.cv_results_["mean_test_score"]}.')
    print('The std of the cross_val accuracy is',
          f'{clf.cv_results_["std_test_score"]}.')
    y_pred = clf.predict(X_eval)
    print(classification_report(y_eval, y_pred))
    print(confusion_matrix(y_eval, y_pred))

## Logistic Regression

### Tfidf

In [85]:
fit_and_predict(log_reg, params=log_reg_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.34410646 0.34220532 0.31749049 0.28136882].
The std of the cross_val accuracy is [0.0365393  0.01761933 0.0400676  0.05275828].
              precision    recall  f1-score   support

     clinton       0.33      0.12      0.18        16
  eisenhower       0.35      0.42      0.38        19
         fdr       0.25      0.06      0.10        16
     ghwbush       0.15      0.44      0.22        25
   jefferson       1.00      0.10      0.18        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.58      0.62      0.60        29
       obama       0.00      0.00      0.00        25
      reagan       0.24      0.33      0.28        24
  washington       0.00      0.00      0.00         3

   micro avg       0.28      0.28      0.28       176
   macro avg       0.29      0.21      0.19       176
weighted avg       0.30      0.28      0.25       176

[[ 2  1  0  9  0  1  0  0  3  0]
 [ 0  8  0  7  0  0  1  2  1  0]
 [ 0

  'precision', 'predicted', average, warn_for)


### LSA (Latent Semantic Analysis)

In [86]:
fit_and_predict(log_reg, params=log_reg_params, **lsa_kwargs)

The mean cross_val accuracy on train is [0.33460076 0.31178707 0.31178707 0.28136882].
The std of the cross_val accuracy is [0.03091255 0.02803846 0.04177239 0.04674929].
              precision    recall  f1-score   support

     clinton       0.18      0.12      0.15        16
  eisenhower       0.33      0.47      0.39        19
         fdr       0.22      0.12      0.16        16
     ghwbush       0.17      0.40      0.24        25
   jefferson       0.33      0.10      0.15        10
     kennedy       0.50      0.11      0.18         9
     lincoln       0.50      0.55      0.52        29
       obama       0.29      0.08      0.12        25
      reagan       0.26      0.25      0.26        24
  washington       0.50      0.33      0.40         3

   micro avg       0.28      0.28      0.28       176
   macro avg       0.33      0.25      0.26       176
weighted avg       0.31      0.28      0.27       176

[[ 2  1  1  9  0  1  0  0  2  0]
 [ 0  9  0  7  0  0  2  1  0  0]
 [ 0



## Decision Trees

### Tfidf

In [87]:
fit_and_predict(tree, params=tree_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.26996198].
The std of the cross_val accuracy is [0.0617106].
              precision    recall  f1-score   support

     clinton       0.12      0.12      0.12        16
  eisenhower       0.21      0.26      0.23        19
         fdr       0.08      0.06      0.07        16
     ghwbush       0.11      0.28      0.16        25
   jefferson       0.00      0.00      0.00        10
     kennedy       0.40      0.22      0.29         9
     lincoln       0.55      0.41      0.47        29
       obama       0.18      0.08      0.11        25
      reagan       0.33      0.25      0.29        24
  washington       0.00      0.00      0.00         3

   micro avg       0.21      0.21      0.21       176
   macro avg       0.20      0.17      0.17       176
weighted avg       0.24      0.21      0.21       176

[[ 2  1  0  9  0  0  1  1  2  0]
 [ 1  5  2  9  0  0  1  0  1  0]
 [ 1  3  1  7  1  0  0  1  2  0]
 [ 4  4  2  7  0  0  3  2  2  1]
 [ 0 



### LSA

In [88]:
fit_and_predict(tree, params=tree_params, **lsa_kwargs)

The mean cross_val accuracy on train is [0.2148289].
The std of the cross_val accuracy is [0.03451126].
              precision    recall  f1-score   support

     clinton       0.09      0.06      0.07        16
  eisenhower       0.14      0.21      0.17        19
         fdr       0.07      0.06      0.06        16
     ghwbush       0.11      0.20      0.14        25
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.44      0.28      0.34        29
       obama       0.09      0.04      0.06        25
      reagan       0.13      0.17      0.15        24
  washington       0.00      0.00      0.00         3

   micro avg       0.14      0.14      0.14       176
   macro avg       0.11      0.10      0.10       176
weighted avg       0.15      0.14      0.13       176

[[1 2 1 6 0 1 0 1 4 0]
 [1 4 2 9 0 1 1 1 0 0]
 [1 3 1 3 1 1 0 1 2 3]
 [2 5 2 5 0 2 3 1 5 0]
 [1 2 1 0 0 1 2 0 2 1]
 [0 1 1 3 0 0 0 0 3 1]



## Random Forest

### Tfidf

In [89]:
fit_and_predict(forest, params=forest_params, **tfidf_kwargs)



The mean cross_val accuracy on train is [0.30798479 0.31749049 0.32509506 0.32889734 0.28326996 0.29277567
 0.28707224 0.28326996 0.28707224 0.29657795 0.30228137 0.30228137].
The std of the cross_val accuracy is [0.0383545  0.03435724 0.03282714 0.03598003 0.05548412 0.05193235
 0.04677472 0.04852622 0.04649525 0.04687379 0.04612755 0.04741648].
              precision    recall  f1-score   support

     clinton       0.27      0.25      0.26        16
  eisenhower       0.21      0.32      0.25        19
         fdr       0.12      0.06      0.08        16
     ghwbush       0.15      0.32      0.21        25
   jefferson       0.50      0.30      0.37        10
     kennedy       1.00      0.11      0.20         9
     lincoln       0.56      0.52      0.54        29
       obama       0.07      0.04      0.05        25
      reagan       0.27      0.25      0.26        24
  washington       0.00      0.00      0.00         3

   micro avg       0.26      0.26      0.26       176
 

### LSA

In [90]:
fit_and_predict(forest, params=forest_params, **lsa_kwargs)



The mean cross_val accuracy on train is [0.32129278 0.31178707 0.31558935 0.32319392 0.30798479 0.29847909
 0.30988593 0.30798479 0.31178707 0.33840304 0.33079848 0.33460076].
The std of the cross_val accuracy is [0.0440672  0.04180222 0.03493958 0.03524775 0.03959405 0.04954156
 0.0501353  0.0576028  0.03656241 0.04614652 0.03676983 0.0385607 ].
              precision    recall  f1-score   support

     clinton       0.50      0.12      0.20        16
  eisenhower       0.28      0.37      0.32        19
         fdr       0.33      0.06      0.11        16
     ghwbush       0.15      0.44      0.23        25
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.47      0.66      0.55        29
       obama       0.17      0.04      0.06        25
      reagan       0.19      0.21      0.20        24
  washington       0.00      0.00      0.00         3

   micro avg       0.26      0.26      0.26       176
 

  'precision', 'predicted', average, warn_for)


## Gradient Boosting Machines

### Tfidf

In [91]:
fit_and_predict(boost, params=boost_params, **tfidf_kwargs)



The mean cross_val accuracy on train is [0.28326996].
The std of the cross_val accuracy is [0.03678817].
              precision    recall  f1-score   support

     clinton       0.33      0.25      0.29        16
  eisenhower       0.19      0.26      0.22        19
         fdr       0.25      0.12      0.17        16
     ghwbush       0.14      0.36      0.20        25
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.46      0.38      0.42        29
       obama       0.00      0.00      0.00        25
      reagan       0.33      0.33      0.33        24
  washington       0.00      0.00      0.00         3

   micro avg       0.22      0.22      0.22       176
   macro avg       0.17      0.17      0.16       176
weighted avg       0.21      0.22      0.21       176

[[ 4  3  0  6  0  0  1  0  2  0]
 [ 0  5  1  7  0  1  1  3  1  0]
 [ 0  0  2  8  0  0  1  1  4  0]
 [ 1  6  2  9  0  1  2  2  2  0]
 [ 0

### LSA

In [92]:
fit_and_predict(boost, params=boost_params, **lsa_kwargs)

The mean cross_val accuracy on train is [0.27376426].
The std of the cross_val accuracy is [0.03294693].
              precision    recall  f1-score   support

     clinton       0.30      0.19      0.23        16
  eisenhower       0.28      0.37      0.32        19
         fdr       0.14      0.06      0.09        16
     ghwbush       0.15      0.32      0.20        25
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.47      0.55      0.51        29
       obama       0.00      0.00      0.00        25
      reagan       0.25      0.33      0.29        24
  washington       0.00      0.00      0.00         3

   micro avg       0.24      0.24      0.24       176
   macro avg       0.16      0.18      0.16       176
weighted avg       0.20      0.24      0.21       176

[[ 3  1  1  7  0  1  0  1  2  0]
 [ 0  7  0  8  0  0  1  2  1  0]
 [ 0  2  1  7  0  0  4  1  1  0]
 [ 3  2  1  8  0  0  4  1  6  0]
 [ 0

  'precision', 'predicted', average, warn_for)


## Naive Bayes

### Tfidf

In [93]:
fit_and_predict(nb, params=nb_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.36311787].
The std of the cross_val accuracy is [0.05182334].
              precision    recall  f1-score   support

     clinton       0.00      0.00      0.00        16
  eisenhower       0.50      0.37      0.42        19
         fdr       0.40      0.12      0.19        16
     ghwbush       0.21      0.64      0.31        25
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.63      0.59      0.61        29
       obama       0.00      0.00      0.00        25
      reagan       0.22      0.42      0.29        24
  washington       1.00      0.33      0.50         3

   micro avg       0.30      0.30      0.30       176
   macro avg       0.30      0.25      0.23       176
weighted avg       0.27      0.30      0.26       176

[[ 0  1  0 11  0  0  0  0  4  0]
 [ 0  7  0  8  0  0  2  0  2  0]
 [ 0  1  2  6  0  0  0  0  7  0]
 [ 0  1  0 16  0  0  1  0  7  0]
 [ 0

  'precision', 'predicted', average, warn_for)


###  LSA

In [94]:
fit_and_predict(nb, params=nb_params, **lsa_kwargs)

The mean cross_val accuracy on train is [0.16539924].
The std of the cross_val accuracy is [0.01397377].
              precision    recall  f1-score   support

     clinton       0.00      0.00      0.00        16
  eisenhower       0.50      0.05      0.10        19
         fdr       0.00      0.00      0.00        16
     ghwbush       0.14      0.92      0.25        25
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.00      0.00      0.00        29
       obama       0.25      0.04      0.07        25
      reagan       0.00      0.00      0.00        24
  washington       0.00      0.00      0.00         3

   micro avg       0.14      0.14      0.14       176
   macro avg       0.09      0.10      0.04       176
weighted avg       0.11      0.14      0.06       176

[[ 0  0  0 15  0  0  1  0  0  0]
 [ 0  1  0 18  0  0  0  0  0  0]
 [ 0  0  0 16  0  0  0  0  0  0]
 [ 0  0  0 23  0  0  1  1  0  0]
 [ 0

  'precision', 'predicted', average, warn_for)


# Neural Network

## *Note: there are not enough data to effectively run a neural network on this project. Section 5 is merely going through the process for the sake of the capstone.*

In [95]:
# Establish and fit the multi-level perceptron model.
mlp = MLPClassifier(
    hidden_layer_sizes=(3,), random_state=15, max_iter=5000, alpha=0.05)
mlp.fit(X_train_tfidf, y_train)

MLPClassifier(activation='relu', alpha=0.05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(3,), learning_rate='constant',
       learning_rate_init=0.001, max_iter=5000, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [96]:
# Find MLP score.
mlp.score(X_train_tfidf, y_train)

0.7718631178707225

In [97]:
# Find cross-validation score.
cross_val_score(mlp, X_train_tfidf, y_train, cv=5)

array([0.22727273, 0.22222222, 0.24528302, 0.2745098 , 0.24      ])

In [98]:
# Adjust hidden layer parameters.
mlp1 = MLPClassifier(
    hidden_layer_sizes=(5,2,), random_state=15, max_iter=5000, alpha=0.01)
mlp1.fit(X_train_tfidf, y_train)

MLPClassifier(activation='relu', alpha=0.01, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5, 2), learning_rate='constant',
       learning_rate_init=0.001, max_iter=5000, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [99]:
# Find accuracy score.
mlp1.score(X_train_tfidf, y_train)

0.8897338403041825

In [100]:
# Cross-validation.
cross_val_score(mlp1, X_train_tfidf, y_train, cv=5)

array([0.20909091, 0.19444444, 0.19811321, 0.25490196, 0.15      ])

In [101]:
# Adjust hidden layer parameters.
mlp2 = MLPClassifier(
    hidden_layer_sizes=(5,2,), random_state=15, max_iter=5000, alpha=0.05)
mlp2.fit(X_train_tfidf, y_train)

MLPClassifier(activation='relu', alpha=0.05, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(5, 2), learning_rate='constant',
       learning_rate_init=0.001, max_iter=5000, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [102]:
# Find accuracy score.
mlp2.score(X_train_tfidf, y_train)

0.8878326996197718

In [103]:
# Cross-validation.
cross_val_score(mlp2, X_train_tfidf, y_train, cv=5)

array([0.22727273, 0.19444444, 0.19811321, 0.2745098 , 0.15      ])

In [None]:
#fit_and_predict(log_reg, params=log_reg_params, **bow_kwargs)

In [None]:
#fit_and_predict(tree, params=tree_params, **bow_kwargs)

In [None]:
#fit_and_predict(forest, params=forest_params, **bow_kwargs)

In [None]:
#fit_and_predict(boost, params=boost_params, **bow_kwargs)

In [None]:
#fit_and_predict(nb, params=nb_params, **bow_kwargs)