For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts. This can be a series of novels, chapters, or articles; anything you'd like, as long as it has multiple entries of varying characteristics. At least 100 entries should be enough. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature), so that classification is harder than it would be with obviously different subjects.

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.
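
One way to reserve the 25% test set is a stratified split, which keeps each author's share roughly equal in both parts. The sketch below uses toy data; the `texts` and `authors` lists are placeholders, not the project corpus.

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy stand-in corpus: 40 short "texts" by 4 hypothetical authors.
texts = [f"sample text number {i}" for i in range(40)]
authors = [f"author_{i % 4}" for i in range(40)]

# stratify=authors keeps each author's proportion similar in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    texts, authors, test_size=0.25, stratify=authors, random_state=15)

test_counts = Counter(y_test)
print(len(X_train), len(X_test))
print(test_counts)
```

Without `stratify`, a small author could end up almost entirely in one split, which makes holdout scores for that author unreliable.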

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?
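
To check whether authors are consistently grouped, one option is to cross-tabulate authors against cluster labels and summarize the agreement with the adjusted Rand index. This is a minimal sketch; the `authors` and `clusters` lists are invented for illustration.

```python
import pandas as pd
from sklearn.metrics import adjusted_rand_score

# Hypothetical author labels and cluster assignments for eight texts.
authors = ["a", "a", "a", "b", "b", "b", "c", "c"]
clusters = [0, 0, 1, 1, 1, 1, 2, 2]

# Rows are authors, columns are clusters; each row sums to 1.
table = pd.crosstab(pd.Series(authors, name="author"),
                    pd.Series(clusters, name="cluster"),
                    normalize="index")
print(table)

# 1.0 means clusters match authors exactly; values near 0 mean chance level.
ari = adjusted_rand_score(authors, clusters)
print(round(ari, 3))
```

A row with most of its mass in one column indicates an author who lands consistently in a single cluster.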

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Then use those features to build models that attempt to classify your texts by author. Try different permutations of unsupervised and supervised techniques to see which combinations perform best.

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?

If there is a divergence in the relative stability of your model and your clusters, delve into why.
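
One simple way to look at cluster stability is to compare the share of points per cluster in the training set with the share in the holdout set. The sketch below assumes KMeans and uses synthetic blobs as a stand-in for the real feature matrix.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix (not the project data).
X, _ = make_blobs(n_samples=400, centers=4, random_state=0)
X_train, X_holdout = train_test_split(X, test_size=0.25, random_state=15)

km = KMeans(n_clusters=4, n_init=10, random_state=42).fit(X_train)

# Share of points per cluster in each split.
train_share = pd.Series(km.labels_).value_counts(normalize=True).sort_index()
holdout_share = (pd.Series(km.predict(X_holdout))
                 .value_counts(normalize=True)
                 .reindex(train_share.index, fill_value=0.0))

comparison = pd.DataFrame({"train": train_share, "holdout": holdout_share})
comparison["abs_diff"] = (comparison["train"] - comparison["holdout"]).abs()
print(comparison)
```

Large values in `abs_diff` would suggest that the holdout texts fall into the clusters in different proportions than the training texts.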

Your end result should be a write-up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write-up, and remember to include visuals to help tell your story!

In [1]:
# Set up data science environment.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scipy
import spacy
import re
import warnings
from nltk.corpus import inaugural, stopwords
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering
from sklearn.cluster import AffinityPropagation, MeanShift, estimate_bandwidth
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, normalize
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from typing import Dict

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format
sns.set_style('white')

# Suppress a harmless scipy warning.
warnings.filterwarnings(
    action='ignore',
    module='scipy',
    message='internal gelsd'
)

# Data Cleaning, Processing, and Language Parsing

In [2]:
# Create lists for files and presidents.
files = ["1789-Washington.txt",
         "1801-Jefferson.txt",
         "1861-Lincoln.txt",
         "1933-Roosevelt.txt",
         "1953-Eisenhower.txt",
         "1961-Kennedy.txt",
         "1981-Reagan.txt",
         "1989-Bush.txt",
         "1993-Clinton.txt",
         "2009-Obama.txt"]

presidents = ["washington",
              "jefferson",
              "lincoln",
              "fdr",
              "eisenhower",
              "kennedy",
              "reagan",
              "bush41",
              "clinton",
              "obama"]

# Control to make sure both lists are the same length.
assert len(files) == len(presidents)

In [3]:
# Read every file into memory (the with-block closes each one again).
docs = []
for file_name, president in zip(files, presidents):
    with open(f'./inaugural/{file_name}') as f:
        doc = f.read()
        docs.append((doc, president))

In [4]:
# Utility function to clean text.
def text_cleaner(text: str) -> str:
    """Replace double dashes with spaces, drop [bracketed] and
    <angle-bracketed> spans, and collapse runs of whitespace."""
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'<.*?>', '', text)
    text = ' '.join(text.split())
    return text
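
To make the cleaner's effect concrete, the sketch below reproduces its substitutions and applies them to a made-up sentence (the `sample` string is invented, not taken from the speeches).

```python
import re


def text_cleaner(text: str) -> str:
    """Same substitutions as the notebook's cleaner, with raw-string patterns."""
    text = re.sub(r'--', ' ', text)        # double dashes -> space
    text = re.sub(r'\[.*?\]', '', text)    # drop [bracketed] spans
    text = re.sub(r'<.*?>', '', text)      # drop <angle-bracketed> spans
    return ' '.join(text.split())          # collapse whitespace and newlines


sample = "We hold--always [applause]  these <b>truths</b>\nto be self-evident."
cleaned = text_cleaner(sample)
print(cleaned)
```

Note that punctuation, digits, and capitalization are left untouched; lowercasing and stop-word removal happen later in the vectorizers.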

In [5]:
# Use text_cleaner on the docs and collect the results in a list (clean_docs).
clean_docs = []
for doc, pres in docs:
    clean_doc = text_cleaner(doc)
    clean_docs.append((clean_doc, pres))

In [6]:
# Iterate through each doc and print the first 1000 characters for inspection.
for doc, pres in clean_docs:
    print(doc[:1000], pres.upper()) 
    print()

Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but ov

In [7]:
# Parse each document with spaCy and split it into lemmatized sentences.
nlp = spacy.load('en')
df_list = []


def nlp_text(text_file):
    doc = nlp(text_file)
    return doc


def sentences(doc_nlp, speaker):
    return [[sent.lemma_, speaker] for sent in doc_nlp.sents]


def sentences_to_df(sents):
    return pd.DataFrame(sents)


for doc, pres in clean_docs:
    parsed = nlp_text(doc)
    sents = sentences(parsed, pres)
    df = sentences_to_df(sents)
    df_list.append(df)

In [8]:
# Combine each sentence data frame into one master data frame.
sent_df = pd.concat([*df_list])

In [9]:
# Rename columns.
sent_df.columns = ['sentence', 'President']

# Check the count of sents per President.
sent_df.President.value_counts()

bush41        145
lincoln       139
reagan        130
eisenhower    121
obama         113
fdr            86
clinton        82
kennedy        53
jefferson      42
washington     25
Name: President, dtype: int64

In [10]:
sent_df['sentence'] = sent_df['sentence'].str.replace('-PRON-', '')

# Creating Features

In [11]:
# Splitting the data.
X = sent_df.sentence
y = sent_df.President
X_train_eval, X_holdout, y_train_eval, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=15)

In [12]:
# Splitting into train/eval/holdout groups.
X_train, X_eval, y_train, y_eval = train_test_split(
    X_train_eval, y_train_eval, test_size=0.25, random_state=15)

In [13]:
# Create base parameters dictionary.
base_param_dict = {'strip_accents': 'unicode',
                   'lowercase': True,
                   'stop_words': 'english',
                   'ngram_range': (1, 3),
                   'max_df': 0.5,
                   'min_df': 5,
                   'max_features': 1000}

## Bag of Words

In [14]:
# Instantiate CountVectorizer.
bow = CountVectorizer(**base_param_dict)

In [15]:
# Convert X_train, X_eval, and X_holdout into dfs of bags of words.
_bow_train = bow.fit_transform(X_train)
_bow_eval = bow.transform(X_eval)
_bow_holdout = bow.transform(X_holdout)
assert len(X_train) == _bow_train.shape[0]  # df and sparse-matrix

# Find feature names.
feature_names = bow.get_feature_names()

# Sparse matrix to data frame.
X_train_bow = pd.DataFrame(_bow_train.toarray(), columns=feature_names)
X_eval_bow = pd.DataFrame(_bow_eval.toarray(), columns=feature_names)
X_holdout_bow = pd.DataFrame(_bow_holdout.toarray(), columns=feature_names)

In [16]:
len(feature_names)

231
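
The vocabulary ends up well below the `max_features` cap of 1000, most likely because `min_df=5` and `max_df=0.5` prune terms before the cap applies. The sketch below shows that pruning effect on a made-up four-sentence corpus, using `min_df=2` so the difference is visible at this scale.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny made-up corpus: "nation" appears in 3 docs, "freedom" in 2.
docs = ["the nation endures",
        "our nation is strong",
        "freedom for the nation",
        "freedom and peace"]

loose = CountVectorizer(stop_words='english', min_df=1, max_features=1000)
strict = CountVectorizer(stop_words='english', min_df=2, max_features=1000)

loose_vocab = sorted(loose.fit(docs).vocabulary_)
strict_vocab = sorted(strict.fit(docs).vocabulary_)

print(loose_vocab)   # all terms that are not stop words
print(strict_vocab)  # only terms found in at least 2 documents
```

With short sentences as documents, a document-frequency floor of 5 removes most rare words and nearly all trigrams, so the cap is never reached.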

## Tfidf

In [17]:
# Instantiate Tfidf.
tfidf = TfidfVectorizer(**base_param_dict)

In [18]:
# Convert X_train, X_eval, and X_holdout into dfs of tfidf values.
_tfidf_train = tfidf.fit_transform(X_train)
_tfidf_eval = tfidf.transform(X_eval)
_tfidf_holdout = tfidf.transform(X_holdout)
assert len(X_train) == _tfidf_train.shape[0]  # df and sparse-matrix

# Find feature names.
feature_names_tfidf = tfidf.get_feature_names()

# Set up data frames.
X_train_tfidf = pd.DataFrame(
    _tfidf_train.toarray(), columns=feature_names_tfidf)
X_eval_tfidf = pd.DataFrame(
    _tfidf_eval.toarray(), columns=feature_names_tfidf)
X_holdout_tfidf = pd.DataFrame(
    _tfidf_holdout.toarray(), columns=feature_names_tfidf)

# Clustering Models

In [19]:
# Instantiate MinMaxScaler. Create train/eval/holdout groups.
scaler = MinMaxScaler()
X_train_tfidf_scaled = pd.DataFrame(scaler.fit_transform(X_train_tfidf))
print(len(X_train_tfidf_scaled))
X_eval_tfidf_scaled = pd.DataFrame(scaler.transform(X_eval_tfidf))
X_holdout_tfidf_scaled = pd.DataFrame(scaler.transform(X_holdout_tfidf))

526
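
Because the scaler learns each feature's minimum and maximum from the training set only, eval or holdout values can land outside [0, 1] after `transform`. A tiny sketch with made-up numbers:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [0.2], [0.4]])
evaluation = np.array([[0.1], [0.5]])   # 0.5 exceeds the training maximum

scaler = MinMaxScaler().fit(train)      # min and max come from train only
scaled_eval = scaler.transform(evaluation)
print(scaled_eval.ravel())
```

This is the expected behavior when the scaler is fit on training data alone, and it explains why a scaled mean above 1 can show up for eval or holdout clusters.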


In [20]:
feature_names = X_train_tfidf.columns

X_train_tfidf_scaled.columns = feature_names
X_eval_tfidf_scaled.columns = feature_names
X_holdout_tfidf_scaled.columns = feature_names

In [21]:
# Clustering models
models = []
names = []
plot_nums = []
silhouettes = []
clust = []

for clusters in range(2,11):
    bandwidth = estimate_bandwidth(
        X_train_tfidf_scaled, quantile=0.2, n_samples=500)
    models.append((0, 'KMeans', KMeans(n_clusters=clusters,
                                   init='k-means++', random_state=42)))
    models.append((1, 'MiniBatch', MiniBatchKMeans(
        init='random', n_clusters=clusters, batch_size=500)))
    #models.append((2, 'MeanShift', MeanShift(
        #bandwidth=bandwidth, bin_seeding=True)))
    models.append((2, 'Spectral', SpectralClustering(n_clusters=clusters)))
    #models.append((4, 'Affinity', AffinityPropagation()))
    
for _, name, model in models:
    names.append(name)
    model.fit(X_train_tfidf_scaled)
    labels = model.labels_
    print(model)
    if len(set(labels)) > 1:
        ypred = model.fit_predict(X_train_tfidf_scaled)
        silhouette = metrics.silhouette_score(
            X_train_tfidf_scaled, labels, metric='euclidean')
        silhouettes.append(silhouette)
        #ax[plot_num].set_title(name)
        #plotting(plot_num, labels, ypred)
        if silhouette > 0:
            print('clusters: {}\t silhouette: {}\n'.format(
                model.n_clusters, silhouette))
            print(name, '\n', pd.crosstab(ypred, labels), '\n')

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=42, tol=0.0001, verbose=0)
clusters: 2	 silhouette: 0.06326433706790946

KMeans 
 col_0   0    1
row_0         
0      30    0
1       0  496 

MiniBatchKMeans(batch_size=500, compute_labels=True, init='random',
        init_size=None, max_iter=100, max_no_improvement=10, n_clusters=2,
        n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0,
        verbose=0)
clusters: 2	 silhouette: 0.03941501150911587

MiniBatch 
 col_0    0   1
row_0         
0      460  58
1        8   0 

SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
          eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None,
          n_clusters=2, n_init=10, n_jobs=None, n_neighbors=10,
          random_state=None)
clusters: 2	 silhouette: 0.12603026347952262

Spectral 
 col_0    0  1
row_0        
0      522  0


SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
          eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None,
          n_clusters=8, n_init=10, n_jobs=None, n_neighbors=10,
          random_state=None)
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=9, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=42, tol=0.0001, verbose=0)
clusters: 9	 silhouette: 0.007891566303634773

KMeans 
 col_0  0   1   2   3   4    5   6   7   8
row_0                                    
0      3   0   0   0   0    0   0   0   0
1      0  30   0   0   0    0   0   0   0
2      0   0  20   0   0    0   0   0   0
3      0   0   0  38   0    0   0   0   0
4      0   0   0   0  10    0   0   0   0
5      0   0   0   0   0  345   0   0   0
6      0   0   0   0   0    0  26   0   0
7      0   0   0   0   0    0   0  23   0
8      0   0   0   0   0    0   0   0  31 

MiniBatchKMeans(batch_size=500, compute_labels=Tr
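
The silhouette comparison can also be collected into a dictionary and reduced to a single best value of k. This sketch assumes KMeans only and uses three synthetic, well-separated blobs rather than the TF-IDF matrix.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic data with 3 well-separated groups (a stand-in, not the real data).
X, _ = make_blobs(n_samples=300, centers=[[0, 0], [10, 10], [-10, 10]],
                  cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels, metric='euclidean')

best_k = max(scores, key=scores.get)
for k, s in scores.items():
    print(f'k={k}\tsilhouette={s:.3f}')
print('best k:', best_k)
```

On sparse sentence-level text features the scores are usually much closer to zero than on blobs like these, so differences between values of k should be read with caution.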

In [22]:
# Re-run KMeans and extract cluster information.
model = KMeans(n_clusters=4, random_state=42).fit(X_train_tfidf_scaled)

# Extract cluster assignments for each data point.
labels = model.labels_

X_train_tfidf_scaled['clusters'] = labels

In [23]:
X_eval_labels = model.predict(X_eval_tfidf_scaled)

In [24]:
X_holdout_labels = model.predict(X_holdout_tfidf_scaled)

In [25]:
X_eval_tfidf_scaled['clusters'] = X_eval_labels

In [26]:
X_holdout_tfidf_scaled['clusters'] = X_holdout_labels

In [27]:
X_train_tfidf_clusters = X_train_tfidf_scaled.groupby(
    ['clusters'], as_index=False).mean()
X_train_tfidf_clusters

   clusters    act  action  administration  america  american  american people  americans    ask  authority  ...    way  willing   wish  woman   word   work  world  write   year  young
0         0    0.0   0.035             0.0     0.03       0.0              0.0      0.012  0.012        0.0  ...    0.0      0.0    0.0  0.151   0.24    0.0  0.038  0.022  0.034  0.023
1         1  0.012   0.012           0.009    0.018     0.012            0.007      0.017  0.011      0.013  ...  0.012    0.007  0.008    0.0  0.001   0.02  0.028   0.01  0.012  0.007
2         2    0.0     0.0             0.0    0.018       0.0              0.0        0.0    0.0        0.0  ...    0.0    0.044    0.0    0.0    0.0   0.02  0.014    0.0    0.0    0.0
3         3    0.0     0.0             0.0      0.0     0.019            0.023      0.028    0.0        0.0  ...    0.0      0.0   0.02    0.0  0.019  0.009  0.029    0.0    0.0  0.044


In [28]:
cluster0_train = X_train_tfidf_scaled[X_train_tfidf_scaled['clusters'] == 0]
cluster0_train.mean().sort_values(ascending=False)[:10]

man         0.343
word        0.240
woman       0.151
man woman   0.131
free        0.115
free man    0.105
say         0.063
moment      0.062
country     0.052
deny        0.050
dtype: float64

In [29]:
cluster0_eval = X_eval_tfidf_scaled[X_eval_tfidf_scaled['clusters'] == 0]
cluster0_eval.mean().sort_values(ascending=False)[:10]

man              0.521
man woman        0.296
woman            0.296
law              0.252
faith            0.230
moral            0.147
fellow citizen   0.128
say              0.125
liberty          0.118
rule             0.115
dtype: float64

In [30]:
cluster1_train = X_train_tfidf_scaled[X_train_tfidf_scaled['clusters'] == 1]
cluster1_train.mean().sort_values(ascending=False)[1:11]

people         0.035
make           0.029
world          0.028
government     0.027
constitution   0.026
let            0.025
power          0.025
nation         0.024
today          0.024
good           0.024
dtype: float64

In [31]:
cluster1_eval = X_eval_tfidf_scaled[X_eval_tfidf_scaled['clusters'] == 1]
cluster1_eval.mean().sort_values(ascending=False)[1:11]

national    0.049
principle   0.048
states      0.043
today       0.035
new         0.031
great       0.030
peace       0.030
work        0.029
way         0.027
union       0.027
dtype: float64

In [32]:
cluster2_train = X_train_tfidf_scaled[X_train_tfidf_scaled['clusters'] == 2]
cluster2_train.mean().sort_values(ascending=False)[1:11]

hand       0.327
offer      0.171
thing      0.164
peace      0.096
material   0.082
strong     0.081
life       0.050
power      0.050
new        0.048
liberty    0.045
dtype: float64

In [33]:
cluster2_eval = X_eval_tfidf_scaled[X_eval_tfidf_scaled['clusters'] == 2]
cluster2_eval.mean().sort_values(ascending=False)[1:11]

hand        0.455
majority    0.364
leader      0.279
form        0.185
hold        0.150
heart       0.140
material    0.130
trust       0.125
value       0.125
political   0.119
dtype: float64

In [34]:
cluster3_train = X_train_tfidf_scaled[X_train_tfidf_scaled['clusters'] == 3]
cluster3_train.mean().sort_values(ascending=False)[1:11]

time         0.680
live         0.115
change       0.087
government   0.065
nation       0.064
history      0.061
courage      0.057
meet         0.049
country      0.049
congress     0.047
dtype: float64

In [35]:
cluster3_eval = X_eval_tfidf_scaled[X_eval_tfidf_scaled['clusters'] == 3]
cluster3_eval.mean().sort_values(ascending=False)[1:11]

time         1.085
speak        0.266
future       0.220
right        0.185
form         0.000
great        0.000
government   0.000
good         0.000
god          0.000
generation   0.000
dtype: float64
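
The `[1:11]` slices in the cells for clusters 1 to 3 skip the first entry, apparently because the `clusters` column itself sorts to the top there (its mean equals the cluster id). A sketch of a helper that drops the label column explicitly, so the same call works for every cluster; `top_terms` and the `toy` frame are hypothetical and not part of the notebook:

```python
import pandas as pd


def top_terms(df: pd.DataFrame, cluster: int, n: int = 3,
              label_col: str = 'clusters') -> pd.Series:
    """Mean feature value per term within one cluster, largest first."""
    members = df[df[label_col] == cluster]
    # Drop the label column so it cannot appear among the "terms".
    return (members.drop(columns=label_col)
                   .mean()
                   .sort_values(ascending=False)
                   .head(n))


# Made-up scaled TF-IDF values for four sentences in two clusters.
toy = pd.DataFrame({'time':     [0.9, 0.8, 0.0, 0.1],
                    'nation':   [0.1, 0.3, 0.7, 0.5],
                    'peace':    [0.0, 0.0, 0.2, 0.6],
                    'clusters': [1, 1, 0, 0]})

print(top_terms(toy, cluster=1))
print(top_terms(toy, cluster=0))
```

Calling the helper on the train and eval frames for the same cluster gives two term lists that can be compared side by side.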

# Neural Network

In [36]:
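# Establish and fit the multi-layer perceptron model.
# Note: ypred holds the cluster assignments left over from the last model
# fit in the clustering loop above, so the network is trained to predict
# cluster membership rather than the author labels in y_train.
mlp = MLPClassifier(hidden_layer_sizes=(20,20,), random_state=15)
mlp.fit(X_train_tfidf, ypred)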



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(20, 20), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [37]:
# Find MLP score.
mlp.score(X_train_tfidf, ypred)

0.9866920152091255

In [38]:
# Find cross-validation score.
cross_val_score(mlp, X_train_tfidf, ypred, cv=5)



array([0.83333333, 0.81481481, 0.79245283, 0.78640777, 0.74257426])

In [39]:
# Adjust hidden layer parameters.
mlp1 = MLPClassifier(hidden_layer_sizes=(10,10,), random_state=15)
mlp1.fit(X_train_tfidf, ypred)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(10, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [40]:
# Find accuracy score.
mlp1.score(X_train_tfidf, ypred)

0.94106463878327

In [41]:
# Cross-validation.
cross_val_score(mlp1, X_train_tfidf, ypred, cv=5)



array([0.73148148, 0.71296296, 0.72641509, 0.70873786, 0.73267327])

In [42]:
# Adjust hidden layer parameters.
mlp2 = MLPClassifier(hidden_layer_sizes=(25,15,), random_state=15)
mlp2.fit(X_train_tfidf, ypred)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(25, 15), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [43]:
# Find accuracy score.
mlp2.score(X_train_tfidf, ypred)

0.9961977186311787

In [44]:
# Cross-validation.
cross_val_score(mlp2, X_train_tfidf, ypred, cv=5)



array([0.84259259, 0.7962963 , 0.82075472, 0.80582524, 0.8019802 ])
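
As written, the networks above are fit on `ypred`, the cluster assignments left over from the clustering loop, so their scores describe how well cluster membership can be recovered. If the aim is author classification instead, the target would be the author labels. A minimal sketch of that variant, using invented sentences and speaker labels:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.neural_network import MLPClassifier

# Invented sentences and speaker labels, purely for illustration.
sentences = ["we must secure the peace", "peace requires strength",
             "the union must be preserved", "the union shall endure",
             "we seek peace with strength", "preserve the union and its laws"]
speakers = ["ike", "ike", "abe", "abe", "ike", "abe"]

X = TfidfVectorizer().fit_transform(sentences)

# The target is the speaker label, not a cluster assignment.
clf = MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=2000,
                    random_state=15)
clf.fit(X, speakers)

train_acc = clf.score(X, speakers)
print(clf.classes_)
print(train_acc)
```

In the notebook this would correspond to passing `y_train` as the target; the resulting scores would then be directly comparable to the baseline and to the other classifiers.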

# Prepare for Predictive Modeling

In [45]:
# Create baseline score to beat.
# Bush41 had the most sentences, so guessing him
# for all sentences would give this percentage.
print('Baseline score to beat:', sum(
    (sent_df.President == 'bush41') / len(sent_df.President)))

Baseline score to beat: 0.15491452991452967
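
The same baseline can be read off as the share of the most frequent class, or obtained from a majority-class dummy model. The sketch below rebuilds the label column from the sentence counts printed earlier; the all-zero feature matrix is a placeholder, since the dummy model ignores features.

```python
import numpy as np
import pandas as pd
from sklearn.dummy import DummyClassifier

# Sentence counts per president, copied from the value_counts output.
counts = {'bush41': 145, 'lincoln': 139, 'reagan': 130, 'eisenhower': 121,
          'obama': 113, 'fdr': 86, 'clinton': 82, 'kennedy': 53,
          'jefferson': 42, 'washington': 25}
y = pd.Series([name for name, n in counts.items() for _ in range(n)])

# Share of the most frequent class.
baseline = y.value_counts(normalize=True).max()

# The same number via a majority-class dummy model (features are ignored).
X_dummy = np.zeros((len(y), 1))
dummy = DummyClassifier(strategy='most_frequent').fit(X_dummy, y)
dummy_acc = dummy.score(X_dummy, y)

print(round(baseline, 4), round(dummy_acc, 4))
```

Strictly speaking, this baseline is computed on the full data set; the majority share within the eval or holdout split may differ slightly.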


In [46]:
# Pipeline helpers.
skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=15)

In [47]:
# Instantiate the models.
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000)
tree = DecisionTreeClassifier()
forest = RandomForestClassifier()
boost = GradientBoostingClassifier()
nb = BernoulliNB()

In [48]:
# Set up _kwargs dictionaries for convenience.
bow_kwargs = {'X_train': X_train_bow, 'y_train': y_train,
              'X_eval': X_eval_bow, 'y_eval': y_eval,
              'X_holdout': X_holdout_bow, 'y_holdout': y_holdout}

tfidf_kwargs = {'X_train': X_train_tfidf_scaled, 'y_train': y_train,
                'X_eval': X_eval_tfidf_scaled, 'y_eval': y_eval,
                'X_holdout': X_holdout_tfidf_scaled, 'y_holdout': y_holdout}

In [49]:
# Parameter grids (a single value per model for now).
log_reg_params = {'model__C': [1]}
tree_params = {'model__criterion': ['gini']}
forest_params = {'model__n_estimators': [100]}
boost_params = {'model__n_estimators': [100]}
nb_params = {'model__alpha': [1]}

In [50]:
# Function to fit a model with cross-validation and evaluate it on the eval set.


def fit_and_predict(model, params: Dict,
                    X_train: pd.DataFrame,
                    y_train: pd.DataFrame,
                    X_eval: pd.DataFrame,
                    y_eval: pd.DataFrame,
                    X_holdout: pd.DataFrame,
                    y_holdout: pd.DataFrame) -> None:
    """
    Takes an instantiated sklearn model, training data (X_train, y_train), 
    and performs cross-validation and then prints the mean of the cross-
    validation accuracies.
    """
    assert len(X_train) == len(y_train)
    assert len(X_eval) == len(y_eval)
    # assert len(X_holdout) == len(y_holdout)
    pipe = Pipeline(steps=[('model', model)])
    clf = GridSearchCV(pipe, cv=skf, param_grid=params, n_jobs=2)
    clf.fit(X_train, y_train)
    print('The mean cross_val accuracy on train is',
          f'{clf.cv_results_["mean_test_score"]}.')
    print('The std of the cross_val accuracy is',
          f'{clf.cv_results_["std_test_score"]}.')
    y_pred = clf.predict(X_eval)
    print(classification_report(y_eval, y_pred))
    print(confusion_matrix(y_eval, y_pred))
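
The function above accepts holdout data but never scores it, and each grid holds a single value. The sketch below shows one way both could be extended, assuming a logistic regression and synthetic three-class data in place of the vectorized sentences; the `grid` values are arbitrary examples.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.pipeline import Pipeline

# Synthetic 3-class data standing in for the vectorized sentences.
X, y = make_classification(n_samples=300, n_features=20, n_informative=6,
                           n_classes=3, random_state=15)
X_train, X_holdout, y_train, y_holdout = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=15)

pipe = Pipeline(steps=[('model', LogisticRegression(max_iter=1000))])
grid = {'model__C': [0.1, 1, 10]}   # several candidates instead of one
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)

clf = GridSearchCV(pipe, param_grid=grid, cv=skf).fit(X_train, y_train)

# Compare cross-validated accuracy with accuracy on the untouched holdout.
cv_acc = clf.best_score_
holdout_acc = accuracy_score(y_holdout, clf.predict(X_holdout))
print(f'best C: {clf.best_params_["model__C"]}')
print(f'cv accuracy: {cv_acc:.3f}  holdout accuracy: {holdout_acc:.3f}')
```

A holdout accuracy far below the cross-validated accuracy would point to overfitting or to a holdout set that differs from the training data.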

## Logistic Regression

### Bag of Words

In [51]:
fit_and_predict(log_reg, params=log_reg_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.36121673].
The std of the cross_val accuracy is [0.04323161].
              precision    recall  f1-score   support

      bush41       0.16      0.40      0.23        25
     clinton       0.40      0.12      0.19        16
  eisenhower       0.29      0.37      0.33        19
         fdr       0.29      0.12      0.17        16
   jefferson       0.50      0.10      0.17        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.63      0.59      0.61        29
       obama       0.14      0.04      0.06        25
      reagan       0.32      0.54      0.40        24
  washington       1.00      0.33      0.50         3

   micro avg       0.31      0.31      0.31       176
   macro avg       0.37      0.26      0.27       176
weighted avg       0.33      0.31      0.28       176

[[10  0  5  1  0  0  2  1  6  0]
 [ 8  2  2  0  0  1  0  0  3  0]
 [ 8  0  7  0  0  0  1  2  1  0]
 [ 5  0  1  2  0  0  0  0  8  0]
 [ 3



### Tfidf

In [52]:
fit_and_predict(log_reg, params=log_reg_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.36692015].
The std of the cross_val accuracy is [0.02287176].
              precision    recall  f1-score   support

      bush41       0.19      0.48      0.27        25
     clinton       0.20      0.06      0.10        16
  eisenhower       0.38      0.47      0.42        19
         fdr       0.38      0.19      0.25        16
   jefferson       0.60      0.30      0.40        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.58      0.62      0.60        29
       obama       0.17      0.04      0.06        25
      reagan       0.26      0.38      0.31        24
  washington       0.00      0.00      0.00         3

   micro avg       0.32      0.32      0.32       176
   macro avg       0.28      0.25      0.24       176
weighted avg       0.31      0.32      0.29       176

[[12  0  4  1  0  0  2  0  6  0]
 [ 6  1  4  0  0  0  0  0  5  0]
 [ 7  0  9  0  0  0  0  2  1  0]
 [ 5  1  1  3  0  0  1  0  5  0]
 [ 1



## Decision Trees

### Bag of Words

In [53]:
fit_and_predict(tree, params=tree_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.30228137].
The std of the cross_val accuracy is [0.03817604].
              precision    recall  f1-score   support

      bush41       0.17      0.40      0.24        25
     clinton       0.25      0.31      0.28        16
  eisenhower       0.17      0.16      0.16        19
         fdr       0.44      0.50      0.47        16
   jefferson       0.50      0.10      0.17        10
     kennedy       0.25      0.11      0.15         9
     lincoln       0.65      0.52      0.58        29
       obama       0.11      0.04      0.06        25
      reagan       0.29      0.25      0.27        24
  washington       0.67      0.67      0.67         3

   micro avg       0.30      0.30      0.30       176
   macro avg       0.35      0.31      0.30       176
weighted avg       0.32      0.30      0.29       176

[[10  3  4  3  0  1  1  0  3  0]
 [ 7  5  1  0  0  0  0  3  0  0]
 [ 9  1  3  0  0  1  2  0  3  0]
 [ 5  0  0  8  0  0  0  1  2  0]
 [ 2



### Tfidf

In [54]:
fit_and_predict(tree, params=tree_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.28326996].
The std of the cross_val accuracy is [0.04593993].
              precision    recall  f1-score   support

      bush41       0.10      0.20      0.13        25
     clinton       0.24      0.25      0.24        16
  eisenhower       0.21      0.32      0.25        19
         fdr       0.24      0.25      0.24        16
   jefferson       0.00      0.00      0.00        10
     kennedy       0.50      0.22      0.31         9
     lincoln       0.50      0.41      0.45        29
       obama       0.29      0.16      0.21        25
      reagan       0.31      0.21      0.25        24
  washington       0.00      0.00      0.00         3

   micro avg       0.24      0.24      0.24       176
   macro avg       0.24      0.20      0.21       176
weighted avg       0.27      0.24      0.24       176

[[ 5  4  5  3  2  0  2  2  1  1]
 [ 5  4  2  0  0  0  1  2  2  0]
 [ 7  0  6  2  0  0  2  1  1  0]
 [ 5  0  3  4  1  0  1  0  2  0]
 [ 3



## Random Forest

### Bag of Words

In [55]:
fit_and_predict(forest, params=forest_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.30798479].
The std of the cross_val accuracy is [0.05299282].
              precision    recall  f1-score   support

      bush41       0.15      0.28      0.20        25
     clinton       0.29      0.38      0.32        16
  eisenhower       0.28      0.37      0.32        19
         fdr       0.25      0.19      0.21        16
   jefferson       0.50      0.20      0.29        10
     kennedy       0.50      0.11      0.18         9
     lincoln       0.62      0.62      0.62        29
       obama       0.10      0.04      0.06        25
      reagan       0.30      0.33      0.31        24
  washington       0.00      0.00      0.00         3

   micro avg       0.30      0.30      0.30       176
   macro avg       0.30      0.25      0.25       176
weighted avg       0.31      0.30      0.29       176

[[ 7  3  3  4  0  0  2  1  5  0]
 [ 8  6  0  0  0  0  1  0  1  0]
 [ 6  0  7  0  0  1  0  2  3  0]
 [ 5  0  1  3  1  0  2  1  3  0]
 [ 1



### Tfidf

In [56]:
fit_and_predict(forest, params=forest_params, **tfidf_kwargs)



The mean cross_val accuracy on train is [0.34030418].
The std of the cross_val accuracy is [0.03311836].
              precision    recall  f1-score   support

      bush41       0.15      0.32      0.21        25
     clinton       0.21      0.19      0.20        16
  eisenhower       0.28      0.37      0.32        19
         fdr       0.08      0.06      0.07        16
   jefferson       0.50      0.30      0.37        10
     kennedy       0.33      0.11      0.17         9
     lincoln       0.52      0.52      0.52        29
       obama       0.10      0.04      0.06        25
      reagan       0.35      0.33      0.34        24
  washington       0.00      0.00      0.00         3

   micro avg       0.27      0.27      0.27       176
   macro avg       0.25      0.22      0.23       176
weighted avg       0.27      0.27      0.26       176

[[ 8  3  4  3  0  0  3  1  3  0]
 [ 8  3  1  0  0  0  1  0  3  0]
 [ 6  0  7  1  1  0  1  2  1  0]
 [ 8  0  2  1  0  0  2  1  2  0]
 [ 0

## Gradient Boosting Machines

### Bag of Words

In [57]:
fit_and_predict(boost, params=boost_params, **bow_kwargs)



The mean cross_val accuracy on train is [0.3269962].
The std of the cross_val accuracy is [0.04869892].
              precision    recall  f1-score   support

      bush41       0.16      0.48      0.24        25
     clinton       0.31      0.25      0.28        16
  eisenhower       0.20      0.16      0.18        19
         fdr       0.20      0.12      0.15        16
   jefferson       0.20      0.10      0.13        10
     kennedy       0.33      0.11      0.17         9
     lincoln       0.56      0.48      0.52        29
       obama       0.29      0.08      0.12        25
      reagan       0.36      0.33      0.35        24
  washington       0.00      0.00      0.00         3

   micro avg       0.27      0.27      0.27       176
   macro avg       0.26      0.21      0.21       176
weighted avg       0.30      0.27      0.26       176

[[12  2  3  2  0  0  2  1  3  0]
 [ 9  4  1  0  0  1  0  0  1  0]
 [12  0  3  1  0  0  1  1  1  0]
 [ 8  0  0  2  1  0  1  0  4  0]
 [ 2 

### Tfidf

In [58]:
fit_and_predict(boost, params=boost_params, **tfidf_kwargs)



The mean cross_val accuracy on train is [0.28897338].
The std of the cross_val accuracy is [0.03636282].
              precision    recall  f1-score   support

      bush41       0.13      0.32      0.18        25
     clinton       0.36      0.25      0.30        16
  eisenhower       0.22      0.26      0.24        19
         fdr       0.30      0.19      0.23        16
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.54      0.45      0.49        29
       obama       0.09      0.04      0.06        25
      reagan       0.36      0.38      0.37        24
  washington       0.00      0.00      0.00         3

   micro avg       0.24      0.24      0.24       176
   macro avg       0.20      0.19      0.19       176
weighted avg       0.25      0.24      0.24       176

[[ 8  1  5  2  0  2  3  2  2  0]
 [ 6  4  3  0  0  0  1  0  2  0]
 [ 6  0  5  1  1  1  1  3  1  0]
 [ 7  0  0  3  1  0  0  1  4  0]
 [ 3

## Naive Bayes

### Bag of Words

In [59]:
fit_and_predict(nb, params=nb_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.36121673].
The std of the cross_val accuracy is [0.0564142].
              precision    recall  f1-score   support

      bush41       0.20      0.64      0.30        25
     clinton       0.00      0.00      0.00        16
  eisenhower       0.50      0.37      0.42        19
         fdr       0.50      0.12      0.20        16
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.68      0.59      0.63        29
       obama       0.50      0.04      0.07        25
      reagan       0.24      0.46      0.32        24
  washington       1.00      0.33      0.50         3

   micro avg       0.31      0.31      0.31       176
   macro avg       0.36      0.26      0.25       176
weighted avg       0.36      0.31      0.27       176

[[16  0  1  0  0  0  1  0  7  0]
 [11  0  1  0  0  0  0  0  4  0]
 [ 8  0  7  0  0  0  2  0  2  0]
 [ 7  0  1  2  0  0  0  0  6  0]
 [ 3 



### Tfidf

In [60]:
fit_and_predict(nb, params=nb_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.36501901].
The std of the cross_val accuracy is [0.05183521].
              precision    recall  f1-score   support

      bush41       0.21      0.68      0.32        25
     clinton       0.00      0.00      0.00        16
  eisenhower       0.50      0.37      0.42        19
         fdr       0.40      0.12      0.19        16
   jefferson       0.00      0.00      0.00        10
     kennedy       0.00      0.00      0.00         9
     lincoln       0.65      0.59      0.62        29
       obama       0.00      0.00      0.00        25
      reagan       0.23      0.42      0.30        24
  washington       1.00      0.33      0.50         3

   micro avg       0.31      0.31      0.31       176
   macro avg       0.30      0.25      0.24       176
weighted avg       0.28      0.31      0.26       176

[[17  0  1  0  0  0  1  0  6  0]
 [11  0  1  0  0  0  0  0  4  0]
 [ 8  0  7  0  0  0  2  0  2  0]
 [ 7  0  1  2  0  0  0  0  6  0]
 [ 3

