For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be a series of novels, chapters, or articles; anything you'd like. It just has to have multiple entries of varying characteristics. At least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature; something to make classification a bit more difficult than obviously different subjects).

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?
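
One way to check whether authors are consistently grouped into the same cluster is to cross-tabulate cluster labels against author labels and summarize the agreement with the adjusted Rand index. The sketch below uses made-up labels; `authors` and `clusters` are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical labels for six texts: the true author and the assigned cluster.
authors = ['a', 'a', 'b', 'b', 'c', 'c']
clusters = [0, 0, 1, 1, 2, 2]

# Rows are authors, columns are clusters. One dominant cell per row means
# that author's texts were grouped together.
table = pd.crosstab(pd.Series(authors, name='author'),
                    pd.Series(clusters, name='cluster'))
print(table)

# A score of 1.0 means the clusters match the author labels exactly (up to
# renaming); values near 0 mean agreement no better than chance.
ari = adjusted_rand_score(authors, clusters)
print(ari)
```

The same two calls can be pointed at the training labels and again at the holdout labels to see whether the agreement holds up.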

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Then, using those features, build models to classify your texts by author. Try different combinations of unsupervised and supervised techniques to see which perform best.

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.

Your end result should be a write up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write up, and remember to include visuals to help tell your story!

In [1]:
# Set up data science environment.
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.patches as mpatches
import matplotlib.pyplot as plt
import seaborn as sns
import spacy
import pyLDAvis
import re
import warnings
from mpl_toolkits.mplot3d import Axes3D
from bokeh.io import output_notebook
from bokeh.layouts import column
from bokeh.models import HoverTool, CustomJS, ColumnDataSource, Slider
from bokeh.palettes import all_palettes
from bokeh.plotting import figure, show
from collections import Counter
from gensim import corpora, models
from gensim.models.ldamodel import LdaModel
from nltk.corpus import inaugural, stopwords
from nltk.stem.snowball import SnowballStemmer
from nltk.tokenize import RegexpTokenizer
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.manifold import TSNE
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import MinMaxScaler, normalize, Normalizer
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from typing import Dict
output_notebook()

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format
sns.set_style('white')
warnings.filterwarnings(
    action='ignore',
    module='scipy',
    message='internal gelsd'
)



# Data Cleaning, Processing, and Language Parsing

In [2]:
# Create lists for files and presidents.
files = ["1789-Washington.txt",
         "1801-Jefferson.txt",
         "1861-Lincoln.txt",
         "1933-Roosevelt.txt",
         "1953-Eisenhower.txt",
         "1961-Kennedy.txt",
         "1981-Reagan.txt",
         "1989-Bush.txt",
         "1993-Clinton.txt",
         "2009-Obama.txt"]

presidents = ["washington",
              "jefferson",
              "lincoln",
              "fdr",
              "eisenhower",
              "kennedy",
              "reagan",
              "ghwbush",
              "clinton",
              "obama"]

# Control to make sure both lists are the same length.
assert len(files) == len(presidents)

In [3]:
# Loop to open files.
docs = []
for file_name, president in zip(files, presidents):
    with open(f'./inaugural/{file_name}') as f:
        doc = f.read()
        docs.append((doc, president))

In [4]:
def text_cleaner(text: str) -> str:
    """Remove double dashes, digits, and bracketed or tagged text, then
    collapse runs of whitespace into single spaces."""
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\d+', '', text)
    text = re.sub(r"[\[].*?[\]]", "", text)
    text = re.sub(r"[\<].*?[\>]", "", text)
    text = ' '.join(text.split())
    return text
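
To see what each substitution in `text_cleaner` does, the sketch below repeats the function and applies it to a made-up string. The sample text is an assumption for illustration and does not come from the corpus.

```python
import re


def text_cleaner(text: str) -> str:
    """Same cleaning steps as the notebook cell above."""
    text = re.sub(r'--', ' ', text)              # double dashes to spaces
    text = re.sub(r'\d+', '', text)              # drop digits
    text = re.sub(r"[\[].*?[\]]", "", text)      # drop [bracketed] notes
    text = re.sub(r"[\<].*?[\>]", "", text)      # drop <tags>
    text = ' '.join(text.split())                # collapse whitespace
    return text


# Hypothetical sample containing every pattern the cleaner targets.
sample = 'Fellow--Citizens [applause] of 1789 <br> the   Senate'
cleaned = text_cleaner(sample)
print(cleaned)
```

Punctuation other than the double dash is left in place, so later steps (the vectorizers and the tokenizer) still have to handle it.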

In [5]:
# Use text_cleaner on the docs and collect the results in a list (clean_docs).
clean_docs = []
for doc, pres in docs:
    clean_doc = text_cleaner(doc)
    clean_docs.append((clean_doc, pres))

In [6]:
# Iterate through each doc and print the first 100 characters for inspection.
for doc, pres in clean_docs:
    print(doc[:100], pres.upper()) 
    print()

Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident t WASHINGTON

Friends and Fellow Citizens: Called upon to undertake the duties of the first executive office of ou JEFFERSON

Fellow-Citizens of the United States: In compliance with a custom as old as the Government itself, I LINCOLN

I am certain that my fellow Americans expect that on my induction into the Presidency I will address FDR

My friends, before I begin the expression of those thoughts that I deem appropriate to this moment,  EISENHOWER

Vice President Johnson, Mr. Speaker, Mr. Chief Justice, President Eisenhower, Vice President Nixon,  KENNEDY

Senator Hatfield, Mr. Chief Justice, Mr. President, Vice President Bush, Vice President Mondale, Sen REAGAN

Mr. Chief Justice, Mr. President, Vice President Quayle, Senator Mitchell, Speaker Wright, Senator D GHWBUSH

My fellow citizens, today we celebrate the mystery of American renewal. This ceremony is held in the CLINTON

My fell

In [7]:
# Define nlp as spacy.
nlp = spacy.load('en')
# Create an empty list for df.
df_list = []


def nlp_text(text_file: str) -> 'spacy.tokens.Doc':
    """Parse a cleaned text string with spaCy and return the Doc."""
    return nlp(text_file)


def sentences(doc_nlp: 'spacy.tokens.Doc', speaker: str) -> list:
    """Lemmatize each sentence of a parsed Doc and return a list of
    [lemmatized sentence, speaker] pairs.
    """
    return [[sent.lemma_, speaker] for sent in doc_nlp.sents]


def sentences_to_df(sents: list) -> pd.DataFrame:
    """Convert a list of [sentence, speaker] pairs into a data frame."""
    return pd.DataFrame(sents)


# Calling each function.
for doc, pres in clean_docs:
    parsed = nlp_text(doc)
    sents = sentences(parsed, pres)
    df = sentences_to_df(sents)
    df_list.append(df)

In [8]:
# Combine each sentence data frame into one master data frame.
sent_df = pd.concat(df_list, ignore_index=True)

In [9]:
# Rename columns.
sent_df.columns = ['sentence', 'President']

# Check the count of sents per President.
sent_df.President.value_counts()

ghwbush       145
lincoln       139
reagan        130
eisenhower    121
obama         113
fdr            86
clinton        82
kennedy        53
jefferson      42
washington     25
Name: President, dtype: int64

In [10]:
# Filter out pronouns from results.
sent_df['sentence'] = sent_df['sentence'].str.replace('-PRON-', '')

# Creating Features

In [11]:
# Splitting the data.
X = sent_df.sentence
y = sent_df.President
X_train_eval, X_holdout, y_train_eval, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=15)

In [12]:
# Splitting into train/eval/holdout groups.
X_train, X_eval, y_train, y_eval = train_test_split(
    X_train_eval, y_train_eval, test_size=0.25, random_state=15)
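
Because the second split takes 25% of what remains after the first, the three groups are not 50/25/25. The sketch below repeats both splits on a stand-in array of 936 items (the sum of the sentence counts printed above) to show the resulting sizes; the stand-in array is an assumption for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Stand-in for the 936 sentences (sum of the value_counts above).
X = np.arange(936)

# First split: hold out 25% of everything.
X_train_eval, X_holdout = train_test_split(
    X, test_size=0.25, random_state=15)

# Second split: 25% of the remaining 75% becomes the eval group.
X_train, X_eval = train_test_split(
    X_train_eval, test_size=0.25, random_state=15)

sizes = (len(X_train), len(X_eval), len(X_holdout))
print(sizes)
print([round(n / len(X), 3) for n in sizes])
```

The training group ends up at roughly 56% of the data, the eval group at roughly 19%, and the holdout group at 25%. With classes as unbalanced as these, passing `stratify=y` to `train_test_split` is one option for keeping each president's share similar across groups.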

In [13]:
# Create base parameters dictionary.
base_param_dict = {'strip_accents': 'unicode',
                   'lowercase': True,
                   'stop_words': 'english',
                   'ngram_range': (1, 3),
                   'max_df': 0.5,
                   'min_df': 5,
                   'max_features': 1000}
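
The `min_df` and `max_df` settings decide which terms survive into the vocabulary. The sketch below uses a tiny made-up corpus with smaller thresholds than the notebook's, so the pruning is easy to follow; the documents and threshold values are assumptions for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Three hypothetical documents.
docs = ['free people', 'free nation', 'free world people']

# min_df=2 (an integer) keeps terms found in at least 2 documents.
# max_df=0.9 (a fraction) drops terms found in more than 90% of documents.
vec = CountVectorizer(ngram_range=(1, 2), min_df=2, max_df=0.9)
vec.fit(docs)

# 'free' is in every document, so max_df removes it. 'nation', 'world',
# and every bigram appear in only one document, so min_df removes them.
vocab = sorted(vec.vocabulary_)
print(vocab)
```

In the notebook, `min_df=5` plays the same role at sentence level: an n-gram has to occur in at least five training sentences to become a feature.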

## Bag of Words

In [14]:
# Instantiate CountVectorizer.
bow = CountVectorizer(**base_param_dict)

In [15]:
# Convert the train, eval, and holdout sets into bag-of-words matrices.
_bow_train = bow.fit_transform(X_train)
_bow_eval = bow.transform(X_eval)
_bow_holdout = bow.transform(X_holdout)
assert len(X_train) == _bow_train.shape[0]  # df and sparse-matrix

# Find feature names.
feature_names = bow.get_feature_names()

# Sparse matrix to data frame.
X_train_bow = pd.DataFrame(_bow_train.toarray(), columns=feature_names)
X_eval_bow = pd.DataFrame(_bow_eval.toarray(), columns=feature_names)
X_holdout_bow = pd.DataFrame(_bow_holdout.toarray(), columns=feature_names)

In [16]:
# Calculate weights on training data.
weights_bow = np.asarray(X_train_bow.sum(axis=0)).ravel()
weights_bow_df = pd.DataFrame(
    {'word': bow.get_feature_names(), 'cum_weight': weights_bow})
print("\nTrain Weights:\n", weights_bow_df.sort_values(
    by='cum_weight', ascending=False).head(10))

# Calculate weights on eval data.
weights_bow_eval = np.asarray(X_eval_bow.sum(axis=0)).ravel()
weights_bow_eval_df = pd.DataFrame(
    {'word': bow.get_feature_names(), 'cum_weight': weights_bow_eval})
print("\nEval Weights:\n", weights_bow_eval_df.sort_values(
    by='cum_weight', ascending=False).head(10))


Train Weights:
            word  cum_weight
147      people          58
80   government          44
227       world          44
131      nation          39
120        make          37
121         man          35
206        time          33
79         good          32
72         free          31
81        great          29

Eval Weights:
            word  cum_weight
80   government          15
131      nation          14
81        great          14
135         new          13
147      people          12
74      freedom          11
227       world          11
226        work          10
3       america          10
194      states          10


## Tfidf

In [17]:
# Instantiate Tfidf.
tfidf = TfidfVectorizer(**base_param_dict)

In [18]:
# Convert the train, eval, and holdout sets into sparse matrices of tfidf values.
_tfidf_train = tfidf.fit_transform(X_train)
_tfidf_eval = tfidf.transform(X_eval)
_tfidf_holdout = tfidf.transform(X_holdout)
assert len(X_train) == _tfidf_train.shape[0]  # df and sparse-matrix

# Find feature names.
feature_names_tfidf = tfidf.get_feature_names()

# Sparse matrix to data frames.
X_train_tfidf = pd.DataFrame(
    _tfidf_train.toarray(), columns=feature_names_tfidf)
X_eval_tfidf = pd.DataFrame(
    _tfidf_eval.toarray(), columns=feature_names_tfidf)
X_holdout_tfidf = pd.DataFrame(
    _tfidf_holdout.toarray(), columns=feature_names_tfidf)

In [19]:
# Calculate weights on training data.
weights = np.asarray(X_train_tfidf.mean(axis=0)).ravel()
weights_df = pd.DataFrame(
    {'word': tfidf.get_feature_names(), 'avg_weight': weights})
print("\nTrain Weights:\n", weights_df.sort_values(
    by='avg_weight', ascending=False).head(10))

# Calculate weights on eval data.
weights = np.asarray(X_eval_tfidf.mean(axis=0)).ravel()
weights_df = pd.DataFrame(
    {'word': tfidf.get_feature_names(), 'avg_weight': weights})
print("\nEval Weights:\n", weights_df.sort_values(
    by='avg_weight', ascending=False).head(10))


Train Weights:
            word  avg_weight
147      people       0.032
227       world       0.028
80   government       0.028
131      nation       0.026
206        time       0.025
108         let       0.023
79         good       0.023
120        make       0.022
111        life       0.022
81        great       0.022

Eval Weights:
            word  avg_weight
135         new       0.029
226        work       0.029
81        great       0.028
74      freedom       0.027
132    national       0.027
194      states       0.025
131      nation       0.024
80   government       0.023
147      people       0.023
158   principle       0.023
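
For reference, the weights averaged above come from scikit-learn's default tf-idf formula: a smoothed inverse document frequency times the term count, followed by L2 normalization of each row. The sketch below rebuilds one row by hand on a two-document toy corpus (an assumption for illustration) and compares it with `TfidfVectorizer`.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Two hypothetical documents.
docs = ['free people', 'free nation']
vec = TfidfVectorizer()
weights = vec.fit_transform(docs).toarray()

# Vocabulary terms in column order.
vocab = sorted(vec.vocabulary_, key=vec.vocabulary_.get)

# Default scikit-learn formula:
# idf = ln((1 + n_docs) / (1 + doc_freq)) + 1, then L2-normalize each row.
n_docs = len(docs)
doc_freq = {'free': 2, 'nation': 1, 'people': 1}
idf = {w: np.log((1 + n_docs) / (1 + df)) + 1 for w, df in doc_freq.items()}

# Term counts in docs[0] ('free people').
counts = {'free': 1, 'nation': 0, 'people': 1}
raw = np.array([counts[w] * idf[w] for w in vocab])
manual = raw / np.linalg.norm(raw)

print(vocab)
print(weights[0])
print(manual)
```

A word that appears in every document ('free') gets the smallest idf, which is why very common words rank lower under tf-idf than under raw counts.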


In [47]:
tests = {'pos': ['good', 'equality', 'freedom', 'progress'],
         'neg': ['government', 'nation', 'foreign']}

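In [48]:
# Join each word list into one document, then score it with the fitted tfidf.
test_docs = pd.Series({k: ' '.join(v) for k, v in tests.items()})
test_tfidf = pd.DataFrame(tfidf.transform(test_docs).toarray(),
                          index=test_docs.index, columns=feature_names_tfidf)

In [46]:
test_tfidf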


## Latent Semantic Analysis (LSA)

In [None]:
# Reduce feature space to 100 features with SVD.
svd = TruncatedSVD(100)

# Make pipeline to run SVD and normalize results.
lsa_pipe = make_pipeline(svd, Normalizer())

# Fit with training data, transform test data.
X_train_lsa = lsa_pipe.fit_transform(X_train_tfidf)
X_eval_lsa = lsa_pipe.transform(X_eval_tfidf)
X_holdout_lsa = lsa_pipe.transform(X_holdout_tfidf)

# Examine variance captured in reduced feature space.
variance_explained = svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print('Percent variance captured by components:', total_variance*100)

sent_by_component = pd.DataFrame(X_train_lsa, index=X_train)

# Look at values from first 5 components.
for i in range(5):
    print('\nComponent {}:'.format(i))
    print(sent_by_component.loc[:, i].sort_values(ascending=False)[:5])
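
The LSA pipeline does two things: `TruncatedSVD` projects each row onto a smaller number of components, and `Normalizer` rescales each projected row to unit length. The sketch below runs the same pipeline on a random matrix, which is a hypothetical stand-in for the tf-idf features.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

# Hypothetical stand-in for a tf-idf matrix: 50 rows, 20 features.
rng = np.random.RandomState(15)
X_toy = rng.rand(50, 20)

svd = TruncatedSVD(n_components=5, random_state=15)
lsa = make_pipeline(svd, Normalizer())
X_reduced = lsa.fit_transform(X_toy)

# Every row has length 1 after the Normalizer step.
row_norms = np.linalg.norm(X_reduced, axis=1)

# Share of the original variance kept by the 5 components.
explained = svd.explained_variance_ratio_.sum()

print(X_reduced.shape)
print(row_norms[:3])
print(explained)
```

Unit-length rows mean that Euclidean distance between two rows depends only on the angle between them, which is usually what you want when comparing documents of different lengths.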

In [None]:
sns.scatterplot(x=X_train_lsa[:, 0], y=X_train_lsa[:, 1], hue=y_train.values);

In [None]:
pres = set(y_train)
colors = range(10)
pairs = {p: c for p, c in zip(pres, colors)}

In [None]:
# Creating a temp data frame combining lsa features with target.
temp_df = pd.DataFrame(X_train_lsa)
temp_df['target'] = y_train.values

In [None]:
fig = plt.figure(figsize=(12, 8))
ax = fig.add_subplot(111, projection='3d')

# Highlight two presidents in color; plot the rest in gray.
highlight = {'washington': 'r', 'ghwbush': 'c'}

for p in pairs:
    if p in highlight:
        continue
    data = temp_df[temp_df.target == p]
    ax.scatter(data.loc[:, 0], data.loc[:, 1], data.loc[:, 2],
               c='gray', label=p)

for p, c in highlight.items():
    data = temp_df[temp_df.target == p]
    ax.scatter(data.loc[:, 0], data.loc[:, 1], data.loc[:, 2],
               c=c, label=p)

ax.set_xlabel('Component 0')
ax.set_ylabel('Component 1')
ax.set_zlabel('Component 2')
ax.legend(loc=2);

In [None]:
matplotlib.__version__

In [None]:
# Instantiate MinMaxScaler.
scaler = MinMaxScaler()

In [None]:
# Create train/eval/holdout groups for LSA.
X_train_lsa_scaled = pd.DataFrame(scaler.fit_transform(X_train_lsa))
X_eval_lsa_scaled = pd.DataFrame(scaler.transform(X_eval_lsa))
X_holdout_lsa_scaled = pd.DataFrame(scaler.transform(X_holdout_lsa))
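
Because the scaler is fit on the training data only, eval and holdout values are mapped with the training minimum and maximum and can fall outside the 0 to 1 range. A minimal sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Hypothetical one-feature data.
train = np.array([[0.0], [5.0], [10.0]])
new = np.array([[-2.0], [15.0]])

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)   # learns min=0, max=10
new_scaled = scaler.transform(new)           # reuses the training min/max

print(train_scaled.ravel())
print(new_scaled.ravel())
```

This matters later for any model that assumes its inputs lie between 0 and 1.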

# Clustering Models

In [None]:
# Clustering models
models = []
names = []
plot_nums = []
silhouettes = []
clust = []

for clusters in range(2, 11):
    models.append(
        (0, 'KMeans', KMeans(n_clusters=clusters,
                             init='k-means++', random_state=15)))
    models.append(
        (1, 'MiniBatch', MiniBatchKMeans(init='random',
                                         n_clusters=clusters,
                                         batch_size=500)))
# Fit each model once and score its clusters.
for _, name, model in models:
    names.append(name)
    labels = model.fit_predict(X_train_tfidf)
    print(model)
    if len(set(labels)) > 1:
        silhouette = metrics.silhouette_score(
            X_train_tfidf, labels, metric='euclidean')
        silhouettes.append(silhouette)
        if silhouette > 0:
            print('clusters: {}\t silhouette: {}\n'.format(
                model.n_clusters, silhouette))
            # Cross-tabulate presidents against cluster labels.
            print(name, '\n', pd.crosstab(y_train.values, labels), '\n')
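
The loop above scores each candidate number of clusters with the silhouette coefficient. The sketch below shows the same selection idea on synthetic data with three well-separated groups, where the expected answer is known; the blob centers and the `scores` dictionary are assumptions for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data: three tight, well-separated groups.
X_toy, _ = make_blobs(n_samples=150,
                      centers=[[0, 0], [10, 10], [-10, 10]],
                      cluster_std=0.5, random_state=15)

# Score each candidate k by its mean silhouette coefficient.
scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=15)
    labels = model.fit_predict(X_toy)
    scores[k] = silhouette_score(X_toy, labels, metric='euclidean')

# The k with the highest silhouette is the best-separated clustering.
best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

On sparse sentence-level text features the silhouette values are typically much lower and flatter than on data like this, so the score is better read as a relative guide than as a definitive answer.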

## Tfidf

In [None]:
# Re-run KMeans and extract cluster information.
model_tfidf = KMeans(n_clusters=10, random_state=15).fit(X_train_tfidf)

# Extract cluster assignments for each data point.
labels = model_tfidf.labels_

In [None]:
# Create cluster assignment for eval, holdout groups.
X_eval_tfidf_labels = model_tfidf.predict(X_eval_tfidf)
X_holdout_tfidf_labels = model_tfidf.predict(X_holdout_tfidf)

# Create a column for cluster labels.
X_eval_tfidf['clusters'] = X_eval_tfidf_labels
X_holdout_tfidf['clusters'] = X_holdout_tfidf_labels

X_train_tfidf['clusters'] = labels

In [None]:
# Aggregate by cluster.
X_train_tfidf_clusters = X_train_tfidf.groupby(
    ['clusters'], as_index=False).mean()
X_train_tfidf_clusters

In [None]:
num_clusters = model_tfidf.n_clusters
for i in range(num_clusters):
    data_train = X_train_tfidf[X_train_tfidf['clusters'] == i]
    data_eval = X_eval_tfidf[X_eval_tfidf['clusters'] == i]
    print(f'Cluster {i}:')
    # Drop the label column so that only word features are ranked.
    train = (data_train.drop(columns='clusters').mean()
             .sort_values(ascending=False)[:10])
    evl = (data_eval.drop(columns='clusters').mean()
           .sort_values(ascending=False)[:10])
    print(train)
    print()
    print(evl)
    print()
    overlap = set(train.index) & set(evl.index)
    print(f'There are {len(overlap)} words that are in the top ten of '
          'both training and eval.')
    if len(overlap) > 0:
        print(f'These words are: {overlap}.')
    print()

## LSA

In [None]:
# Re-run KMeans and extract cluster information.
model_lsa = KMeans(n_clusters=10, random_state=42).fit(X_train_lsa_scaled)

# Extract cluster assignments for each data point.
labels = model_lsa.labels_

In [None]:
# Create cluster assignment for eval, holdout groups.
X_eval_lsa_labels = model_lsa.predict(X_eval_lsa_scaled)
X_holdout_lsa_labels = model_lsa.predict(X_holdout_lsa_scaled)

# Create a column for cluster labels.
X_eval_lsa_scaled['clusters'] = X_eval_lsa_labels
X_holdout_lsa_scaled['clusters'] = X_holdout_lsa_labels

X_train_lsa_scaled['clusters'] = labels

In [None]:
# Aggregate by cluster.
X_train_lsa_clusters = X_train_lsa_scaled.groupby(
    ['clusters'], as_index=False).mean()
X_train_lsa_clusters

In [None]:
num_clusters = model_lsa.n_clusters
for i in range(num_clusters):
    data_train = X_train_lsa_scaled[X_train_lsa_scaled['clusters'] == i]
    data_eval = X_eval_lsa_scaled[X_eval_lsa_scaled['clusters'] == i]
    print(f'Cluster {i}:')
    # Drop the label column so that only LSA components are ranked.
    train = (data_train.drop(columns='clusters').mean()
             .sort_values(ascending=False)[:10])
    evl = (data_eval.drop(columns='clusters').mean()
           .sort_values(ascending=False)[:10])
    print(train)
    print()
    print(evl)
    print()
    overlap = set(train.index) & set(evl.index)
    print(f'There are {len(overlap)} features that are in the top ten of '
          'both training and eval.')
    if len(overlap) > 0:
        print(f'These features are: {overlap}.')
    print()

## Latent Dirichlet Allocation (LDA)

### Set up text for LDA

In [None]:
# Removing numerals.
sent_df['sentence_tokens'] = sent_df.sentence.map(
    lambda x: re.sub(r'\d+', '', x))
# Lower case.
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(lambda x: x.lower())
print(sent_df['sentence_tokens'].iloc[0][:500])

In [None]:
# Tokenize.
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(
    lambda x: RegexpTokenizer(r'\w+').tokenize(x))
print(sent_df['sentence_tokens'].iloc[0][:25])

In [None]:
# Stemming.
snowball = SnowballStemmer("english")
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(
    lambda x: [snowball.stem(token) for token in x])
print(sent_df['sentence_tokens'].iloc[0][:25])

In [None]:
# Stop words.
stop_en = stopwords.words('english')
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(
    lambda x: [t for t in x if t not in stop_en])
print(sent_df['sentence_tokens'].iloc[0][:25])

In [None]:
# Final cleaning.
sent_df['sentence_tokens'] = sent_df.sentence_tokens.map(
    lambda x: [t for t in x if len(t) > 1])
print(sent_df['sentence_tokens'].iloc[0][:25])

### Run LDA

In [None]:
texts = sent_df['sentence_tokens']
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
lda = LdaModel(corpus,
               id2word=dictionary,
               num_topics=10,
               passes=5,
               minimum_probability=0,
               random_state=15)

In [None]:
# Print topics
lda.print_topics()

In [None]:
# Refactoring results of LDA into numpy matrix.
hm = np.array([[y for (x,y) in lda[corpus[i]]] for i in range(len(corpus))])
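
The comprehension above relies on each document's LDA output being a list of (topic id, probability) pairs, with every topic present because `minimum_probability=0` was set. The sketch below uses a hand-written stand-in for that output (hypothetical values, no gensim required) to show how the pairs become a dense matrix and how the dominant topic is read off.

```python
import numpy as np

# Hypothetical stand-in for `lda[corpus[i]]` on three documents and three
# topics: each document is a list of (topic_id, probability) pairs.
doc_topics = [
    [(0, 0.7), (1, 0.2), (2, 0.1)],
    [(0, 0.1), (1, 0.1), (2, 0.8)],
    [(0, 0.3), (1, 0.6), (2, 0.1)],
]

# Keep only the probabilities: one row per document, one column per topic.
topic_matrix = np.array([[prob for _, prob in doc] for doc in doc_topics])

# The dominant topic is the column with the largest probability in each row.
dominant = topic_matrix.argmax(axis=1)

print(topic_matrix.shape)
print(dominant)
```

If some topics were filtered out for a document, the rows would have different lengths and would not form a rectangular matrix, which is why the zero threshold matters here.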

In [None]:
# Reduce dimensionality using t-SNE.
tsne = TSNE(random_state=15, perplexity=30, early_exaggeration=120)
embedding = tsne.fit_transform(hm)
embedding = pd.DataFrame(embedding, columns=['x','y'])
embedding['hue'] = hm.argmax(axis=1)

In [None]:
# Scatter plot using Bokeh.
# Map each president to the position of their speech in `presidents`.
speech_index = {p: i for i, p in enumerate(presidents)}
palette = all_palettes['Category10'][10]  # One color per LDA topic.
source = ColumnDataSource(
    data=dict(x=embedding.x,
              y=embedding.y,
              colors=[palette[i] for i in embedding.hue],
              sentence=sent_df.sentence.values,
              President=sent_df.President.values,
              speech=sent_df.President.map(speech_index).values,
              alpha=[0.9] * embedding.shape[0],
              size=[7] * embedding.shape[0]
              )
)
hover_tsne = HoverTool(names=["sent_df"], tooltips="""
    <div style="margin: 10">
        <div style="margin: 0 auto; width:300px;">
            <span style="font-size: 12px; font-weight: bold;">President:</span>
            <span style="font-size: 12px">@President</span>
            <span style="font-size: 12px; font-weight: bold;">Sentence:</span>
            <span style="font-size: 12px">@sentence</span>
        </div>
    </div>
    """)
tools_tsne = [hover_tsne, 'pan', 'wheel_zoom', 'reset']
plot_tsne = figure(plot_width=700, plot_height=700,
                   tools=tools_tsne, title='Inaugural Addresses')
plot_tsne.circle('x', 'y', size='size', fill_color='colors', alpha='alpha',
                 line_alpha=0, line_width=0.01, source=source, name="sent_df")

# Fade out sentences from speeches later than the slider position.
callback = CustomJS(args=dict(source=source), code="""
    var data = source.data;
    var f = cb_obj.value;
    var speech = data['speech'];
    var alpha = data['alpha'];
    var size = data['size'];
    for (var i = 0; i < speech.length; i++) {
        if (speech[i] <= f) {
            alpha[i] = 0.9;
            size[i] = 7;
        } else {
            alpha[i] = 0.05;
            size[i] = 4;
        }
    }
    source.change.emit();
    """)
slider = Slider(start=0, end=len(presidents) - 1, value=len(presidents) - 1,
                step=1, title="Inaugural speech (0 = earliest)")
slider.js_on_change('value', callback)

layout = column(plot_tsne, slider)
show(layout)

# Prepare for Predictive Modeling

In [None]:
# Create a baseline score to beat. GHWBush has the most sentences, so
# guessing him for every sentence would give this accuracy.
print('Baseline score to beat:', (sent_df.President == 'ghwbush').mean())
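
The baseline can also be worked out directly from the sentence counts printed earlier: the largest class divided by the total. The sketch below copies those counts into a Series so it runs on its own.

```python
import pandas as pd

# Sentence counts per president, copied from the value_counts output above.
counts = pd.Series({'ghwbush': 145, 'lincoln': 139, 'reagan': 130,
                    'eisenhower': 121, 'obama': 113, 'fdr': 86,
                    'clinton': 82, 'kennedy': 53, 'jefferson': 42,
                    'washington': 25})

# Accuracy of always guessing the most frequent president.
baseline = counts.max() / counts.sum()
print(counts.idxmax(), round(baseline, 3))
```

That works out to about 15.5%, so any classifier below that level is doing worse than a constant guess.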

In [None]:
# Pipeline helpers.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=15)

In [None]:
# Instantiate the models.
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000, random_state=15)
tree = DecisionTreeClassifier(random_state=15)
forest = RandomForestClassifier(max_depth=10, random_state=15)
boost = GradientBoostingClassifier(random_state=15)
nb = BernoulliNB()

In [None]:
# Set up _kwargs dicts for convenience.
tfidf_kwargs = {'X_train': X_train_tfidf,'y_train': y_train,
                'X_eval': X_eval_tfidf,'y_eval': y_eval}
                #'X_holdout': X_holdout_tfidf, 'y_holdout': y_holdout}

lsa_kwargs = {'X_train': X_train_lsa_scaled, 'y_train': y_train,
              'X_eval': X_eval_lsa_scaled, 'y_eval': y_eval}
              #'X_holdout': X_holdout_tfidf_scaled, 'y_holdout': y_holdout}

In [None]:
# Tune parameter grids.
log_reg_params = {'model__C': [1, 10, 100, 1000]}
tree_params = {'model__criterion': ['gini']}
forest_params = {'model__n_estimators': [100, 200, 300,400],
                 'model__max_depth': [None, 5, 10]}
boost_params = {'model__n_estimators': [100]}
nb_params = {'model__alpha': [1]}

In [None]:
# Function to fit, tune, and evaluate each model.


def fit_and_predict(model, params: Dict,
                    X_train: pd.DataFrame,
                    y_train: pd.Series,
                    X_eval: pd.DataFrame,
                    y_eval: pd.Series) -> None:
    """
    Takes an instantiated sklearn model, a parameter grid, and train/eval
    data. Runs a cross-validated grid search on the training data, prints
    the mean and standard deviation of the cross-validation accuracy for
    each parameter combination, then prints a classification report and
    confusion matrix for the eval data.
    """
    assert len(X_train) == len(y_train)
    assert len(X_eval) == len(y_eval)
    # assert len(X_holdout) == len(y_holdout)
    pipe = Pipeline(steps=[('model', model)])
    clf = GridSearchCV(pipe, cv=skf, param_grid=params, n_jobs=2)
    clf.fit(X_train, y_train)
    print('The mean cross_val accuracy on train is',
          f'{clf.cv_results_["mean_test_score"]}.')
    print('The std of the cross_val accuracy is',
          f'{clf.cv_results_["std_test_score"]}.')
    y_pred = clf.predict(X_eval)
    print(classification_report(y_eval, y_pred))
    print(confusion_matrix(y_eval, y_pred))
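
The parameter grids use names like `model__C` because `GridSearchCV` addresses a pipeline step's parameter as `<step name>__<parameter>`. The sketch below shows the same pattern on the iris data set, which is a stand-in chosen for illustration and not part of this project.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline

# Stand-in data set for illustration.
X_toy, y_toy = load_iris(return_X_y=True)

# The step is named 'model', so its C parameter is addressed as 'model__C'.
pipe = Pipeline(steps=[('model', LogisticRegression(max_iter=1000))])
grid = {'model__C': [0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)

clf = GridSearchCV(pipe, param_grid=grid, cv=cv)
clf.fit(X_toy, y_toy)

# One mean score per parameter combination, in grid order.
mean_scores = clf.cv_results_['mean_test_score']
print(clf.best_params_)
print(mean_scores)
```

Since `mean_test_score` holds one entry per combination, a grid with a single value (such as `tree_params`) prints a one-element array.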

## Logistic Regression

### Tfidf

In [None]:
fit_and_predict(log_reg, params=log_reg_params, **tfidf_kwargs)

### LSA (Latent Semantic Analysis)

In [None]:
fit_and_predict(log_reg, params=log_reg_params, **lsa_kwargs)

## Decision Trees

### Tfidf

In [None]:
fit_and_predict(tree, params=tree_params, **tfidf_kwargs)

### LSA

In [None]:
fit_and_predict(tree, params=tree_params, **lsa_kwargs)

## Random Forest

### Tfidf

In [None]:
fit_and_predict(forest, params=forest_params, **tfidf_kwargs)

### LSA

In [None]:
fit_and_predict(forest, params=forest_params, **lsa_kwargs)

## Gradient Boosting Machines

### Tfidf

In [None]:
fit_and_predict(boost, params=boost_params, **tfidf_kwargs)

### LSA

In [None]:
fit_and_predict(boost, params=boost_params, **lsa_kwargs)

## Naive Bayes

### Tfidf

In [None]:
fit_and_predict(nb, params=nb_params, **tfidf_kwargs)

### LSA

In [None]:
fit_and_predict(nb, params=nb_params, **lsa_kwargs)

# Neural Network

## *Note: there are not enough data to effectively run a neural network on this project. Section 5 is merely going through the process for the sake of the capstone.*

In [None]:
# Establish and fit the multilayer perceptron model.
mlp = MLPClassifier(
    hidden_layer_sizes=(3,), random_state=15, max_iter=5000, alpha=0.05)
mlp.fit(X_train_tfidf, y_train)

In [None]:
# Find MLP score.
mlp.score(X_train_tfidf, y_train)

In [None]:
# Find cross-validation score.
cross_val_score(mlp, X_train_tfidf, y_train, cv=5)

In [None]:
# Adjust hidden layer parameters.
mlp1 = MLPClassifier(
    hidden_layer_sizes=(5,2,), random_state=15, max_iter=5000, alpha=0.01)
mlp1.fit(X_train_tfidf, y_train)

In [None]:
# Find accuracy score.
mlp1.score(X_train_tfidf, y_train)

In [None]:
# Cross-validation.
cross_val_score(mlp1, X_train_tfidf, y_train, cv=5)

In [None]:
# Adjust hidden layer parameters.
mlp2 = MLPClassifier(
    hidden_layer_sizes=(5,2,), random_state=15, max_iter=5000, alpha=0.05)
mlp2.fit(X_train_tfidf, y_train)

In [None]:
# Find accuracy score.
mlp2.score(X_train_tfidf, y_train)

In [None]:
# Cross-validation.
cross_val_score(mlp2, X_train_tfidf, y_train, cv=5)

In [None]:
#fit_and_predict(log_reg, params=log_reg_params, **bow_kwargs)

In [None]:
#fit_and_predict(tree, params=tree_params, **bow_kwargs)

In [None]:
#fit_and_predict(forest, params=forest_params, **bow_kwargs)

In [None]:
#fit_and_predict(boost, params=boost_params, **bow_kwargs)

In [None]:
#fit_and_predict(nb, params=nb_params, **bow_kwargs)