For this project you'll dig into a large amount of text and apply most of what you've covered in this unit and in the course so far.

First, pick a set of texts: a series of novels, chapters, or articles, or anything else you'd like. It just has to have multiple entries of varying characteristics; at least 100 should be good. There should also be at least 10 different authors, but try to keep the texts related (either all on the same topic or from the same branch of literature), so that classification is a bit more difficult than it would be with obviously different subjects.

This capstone can be an extension of your NLP challenge if you wish to use the same corpus. If you found problems with that data set that limited your analysis, however, it may be worth using what you learned to choose a new corpus. Reserve 25% of your corpus as a test set.

The first technique is to create a series of clusters. Try several techniques and pick the one you think best represents your data. Make sure there is a narrative and reasoning around why you have chosen the given clusters. Are authors consistently grouped into the same cluster?
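One way to check whether authors are consistently grouped is to cross-tabulate cluster assignments against author labels and summarize the agreement with the adjusted Rand index. The sketch below uses a made-up six-sentence corpus and two hypothetical authors; the names and data are assumptions for illustration only.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score

# Toy sentences and author labels, made up for illustration.
sentences = [
    "The union of the states must be preserved.",
    "The states formed this union together.",
    "No state may leave the union lawfully.",
    "Markets and jobs drive the economy forward.",
    "The economy needs jobs and open markets.",
    "Jobs grow when markets stay open.",
]
authors = ["author_a"] * 3 + ["author_b"] * 3

# Vectorize, then cluster without looking at the author labels.
X = TfidfVectorizer(stop_words="english").fit_transform(sentences)
labels = KMeans(n_clusters=2, n_init=10, random_state=15).fit_predict(X)

# Rows are authors, columns are clusters: an author who is grouped
# consistently has most of their sentences in a single column.
table = pd.crosstab(pd.Series(authors, name="author"),
                    pd.Series(labels, name="cluster"))
print(table)

# Adjusted Rand index: 1.0 is perfect agreement, about 0 is chance level.
ari = adjusted_rand_score(authors, labels)
print(f"adjusted Rand index: {ari:.3f}")
```

A table with one dominant column per author suggests the clusters track authorship; rows spread evenly across columns suggest they track something else, such as topic or sentence length.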

Next, perform some unsupervised feature generation and selection using the techniques covered in this unit and elsewhere in the course. Then use those features to build models that attempt to classify your texts by author. Try different permutations of unsupervised and supervised techniques to see which combinations have the best performance.
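As one possible combination, the sketch below chains tf-idf features, an unsupervised reduction step, and a supervised classifier in a single `Pipeline`. `TruncatedSVD` is an assumption here (it is not among the notebook's imports), and the toy corpus and author names are made up.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline

# Toy corpus with two hypothetical authors (illustrative only).
texts = [
    "The union of the states must be preserved.",
    "The states formed this union together.",
    "No state may leave the union lawfully.",
    "The constitution binds every state in the union.",
    "Markets and jobs drive the economy forward.",
    "The economy needs jobs and open markets.",
    "Jobs grow when markets stay open.",
    "Open markets reward workers and the economy.",
]
authors = ["author_a"] * 4 + ["author_b"] * 4

# Unsupervised steps (tf-idf, then SVD) feed a supervised classifier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("svd", TruncatedSVD(n_components=2, random_state=15)),
    ("clf", LogisticRegression(max_iter=1000)),
])

# The pipeline is refit inside each fold, so the vectorizer and SVD
# never see the fold's test sentences.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=15)
scores = cross_val_score(pipe, texts, authors, cv=cv)
print(scores, scores.mean())
```

Swapping the `svd` or `clf` step for another estimator gives a different permutation to compare under the same cross-validation scheme.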

Lastly, return to your holdout group. Does your clustering on those members perform as you'd expect? Have your clusters remained stable or changed dramatically? What about your model? Is its performance consistent?
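One rough way to judge cluster stability is to fit the clusterer on the training rows only, assign the holdout rows to those fitted clusters, and compare cluster shares and silhouette scores between the two groups. The sketch below uses synthetic blobs as a stand-in for a real feature matrix; the data and the choice of three clusters are assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature matrix (illustrative only).
X, _ = make_blobs(n_samples=200, centers=3, random_state=15)
X_train, X_holdout = train_test_split(X, test_size=0.25, random_state=15)

# Fit on the training rows only, then assign holdout rows to those clusters.
km = KMeans(n_clusters=3, n_init=10, random_state=15).fit(X_train)
train_labels = km.labels_
holdout_labels = km.predict(X_holdout)

# Share of points per cluster; similar shares suggest stable clusters.
train_share = np.bincount(train_labels, minlength=3) / len(train_labels)
holdout_share = np.bincount(holdout_labels, minlength=3) / len(holdout_labels)
print("train share:  ", train_share.round(3))
print("holdout share:", holdout_share.round(3))

# Silhouette on each group; a large drop on holdout hints at instability.
sil_train = silhouette_score(X_train, train_labels)
sil_holdout = silhouette_score(X_holdout, holdout_labels)
print(f"silhouette train={sil_train:.3f} holdout={sil_holdout:.3f}")
```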

If there is a divergence in the relative stability of your model and your clusters, delve into why.

Your end result should be a write-up of how clustering and modeling compare for classifying your texts. What are the advantages of each? Why would you want to use one over the other? Approximately 3-5 pages is a good length for your write-up, and remember to include visuals to help tell your story!

In [1]:
# Set up data science environment.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import math
import scipy
import spacy
import re
import warnings
from nltk.corpus import inaugural, stopwords
from sklearn.cluster import KMeans, MiniBatchKMeans, SpectralClustering
from sklearn.cluster import AffinityPropagation, MeanShift, estimate_bandwidth
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler, normalize
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics
from typing import Dict

# Display preferences.
%matplotlib inline
pd.options.display.float_format = '{:.3f}'.format
sns.set_style('white')

# Suppress a harmless SciPy warning.
warnings.filterwarnings(
    action='ignore',
    module='scipy',
    message='internal gelsd'
)

# Data Cleaning, Processing, and Language Parsing

In [2]:
files = ["1789-Washington.txt",
         "1801-Jefferson.txt",
         "1861-Lincoln.txt",
         "1933-Roosevelt.txt",
         "1953-Eisenhower.txt",
         "1961-Kennedy.txt",
         "1981-Reagan.txt",
         "1989-Bush.txt",
         "1993-Clinton.txt",
         "2009-Obama.txt"]

presidents = ["washington",
              "jefferson",
              "lincoln",
              "fdr",
              "eisenhower",
              "kennedy",
              "reagan",
              "bush41",
              "clinton",
              "obama"]

assert len(files) == len(presidents)

In [3]:
docs = []
for file_name, president in zip(files, presidents):
    with open(f'./inaugural/{file_name}') as f:
        doc = f.read()
        docs.append((doc, president))

In [4]:
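# Utility function to clean text.
def text_cleaner(text: str) -> str:
    """Remove double dashes, bracketed spans, and extra whitespace."""
    # Replace double dashes with a space.
    text = re.sub(r'--', ' ', text)
    # Drop anything in square brackets or angle brackets.
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'<.*?>', '', text)
    # Collapse runs of whitespace (including newlines) to single spaces.
    text = ' '.join(text.split())
    return text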
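A quick check of what the cleaner does, run on a made-up string (the sample text is an assumption, not taken from the corpus). The function is repeated here with raw-string patterns so the block runs on its own.

```python
import re


def text_cleaner(text: str) -> str:
    """Remove double dashes, bracketed spans, and extra whitespace."""
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'<.*?>', '', text)
    return ' '.join(text.split())


# Made-up sample containing each pattern the cleaner targets.
sample = "Fellow citizens--we meet   today [applause] in <b>hope</b>."
cleaned = text_cleaner(sample)
print(cleaned)  # Fellow citizens we meet today in hope.
```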

In [5]:
clean_docs = []
for doc, pres in docs:
    clean_doc = text_cleaner(doc)
    clean_docs.append((clean_doc, pres))

In [6]:
for doc, pres in clean_docs:
    print(doc[:1000], pres.upper()) 
    print()

Fellow-Citizens of the Senate and of the House of Representatives: Among the vicissitudes incident to life no event could have filled me with greater anxieties than that of which the notification was transmitted by your order, and received on the 14th day of the present month. On the one hand, I was summoned by my Country, whose voice I can never hear but with veneration and love, from a retreat which I had chosen with the fondest predilection, and, in my flattering hopes, with an immutable decision, as the asylum of my declining years a retreat which was rendered every day more necessary as well as more dear to me by the addition of habit to inclination, and of frequent interruptions in my health to the gradual waste committed on it by time. On the other hand, the magnitude and difficulty of the trust to which the voice of my country called me, being sufficient to awaken in the wisest and most experienced of her citizens a distrustful scrutiny into his qualifications, could not but ov

In [7]:
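# Load the spaCy English model and define the parsing helpers.
# The 'en' shortcut was removed in spaCy 3; 'en_core_web_sm' is the usual
# replacement (this assumes that model is installed).
nlp = spacy.load('en_core_web_sm')
df_list = []


def nlp_text(text_file):
    """Parse raw text into a spaCy Doc."""
    return nlp(text_file)


def sentences(doc_nlp, speaker):
    """Return [sentence text, speaker] pairs for each sentence in the Doc."""
    return [[sent.text, speaker] for sent in doc_nlp.sents]


def sentences_to_df(sents):
    """Wrap the sentence/speaker pairs in a DataFrame."""
    return pd.DataFrame(sents)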

for doc, pres in clean_docs:
    parsed = nlp_text(doc)
    sents = sentences(parsed, pres)
    df = sentences_to_df(sents)
    df_list.append(df)

In [8]:
sent_df = pd.concat(df_list)

In [9]:
# Rename columns.
sent_df.columns = ['sentence', 'President']

# Check the count of sents per President.
sent_df.President.value_counts()

bush41        145
lincoln       139
reagan        130
eisenhower    121
obama         113
fdr            86
clinton        82
kennedy        53
jefferson      42
washington     25
Name: President, dtype: int64

# Creating Features
## Bag of Words

In [10]:
# Reserve 25% of the sentences as the holdout set.
X = sent_df.sentence
y = sent_df.President
X_train_eval, X_holdout, y_train_eval, y_holdout = train_test_split(
    X, y, test_size=0.25, random_state=15)

In [11]:
# Split the remaining sentences into train and eval groups.
X_train, X_eval, y_train, y_eval = train_test_split(
    X_train_eval, y_train_eval, test_size=0.25, random_state=15)
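The sentence counts above are uneven (145 for one president, 25 for another), so a plain random split can leave a small class thinly represented in the holdout set. A minimal sketch of a stratified split, which keeps class proportions roughly equal across the groups; the toy frame and its labels are assumptions for illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with an imbalanced label, standing in for sent_df.
toy = pd.DataFrame({
    "sentence": [f"sentence {i}" for i in range(40)],
    "President": ["a"] * 30 + ["b"] * 10,
})

# stratify= keeps the a/b ratio roughly the same in both groups.
X_rest, X_hold, y_rest, y_hold = train_test_split(
    toy["sentence"], toy["President"],
    test_size=0.25, random_state=15, stratify=toy["President"])

print(y_rest.value_counts(normalize=True))
print(y_hold.value_counts(normalize=True))
```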

In [12]:
# Create base parameters dictionary.
base_param_dict = {'strip_accents': 'unicode',
                   'lowercase': True,
                   'stop_words': 'english'}

In [13]:
# Instantiate CountVectorizer.
bow = CountVectorizer(**base_param_dict)

In [14]:
# Convert X_train, X_eval, and X_holdout into dfs of bags of words.
_bow_train = bow.fit_transform(X_train)
_bow_eval = bow.transform(X_eval)
_bow_holdout = bow.transform(X_holdout)
assert len(X_train) == _bow_train.shape[0]  # df and sparse-matrix

# Find feature names.
# In scikit-learn 1.2+ this method is get_feature_names_out().
feature_names = bow.get_feature_names()

# Sparse matrix to data frame.
X_train_bow = pd.DataFrame(_bow_train.toarray(), columns=feature_names)
X_eval_bow = pd.DataFrame(_bow_eval.toarray(), columns=feature_names)
X_holdout_bow = pd.DataFrame(_bow_holdout.toarray(), columns=feature_names)

## Tfidf

In [15]:
# Instantiate Tfidf.
tfidf = TfidfVectorizer(**base_param_dict)

In [16]:
# Convert X_train, X_eval, and X_holdout into dfs of tfidf values.
_tfidf_train = tfidf.fit_transform(X_train)
_tfidf_eval = tfidf.transform(X_eval)
_tfidf_holdout = tfidf.transform(X_holdout)
assert len(X_train) == _tfidf_train.shape[0]  # df and sparse-matrix

# Find feature names.
# In scikit-learn 1.2+ this method is get_feature_names_out().
feature_names_tfidf = tfidf.get_feature_names()

# Set up data frames.
X_train_tfidf = pd.DataFrame(
    _tfidf_train.toarray(), columns=feature_names_tfidf)
X_eval_tfidf = pd.DataFrame(
    _tfidf_eval.toarray(), columns=feature_names_tfidf)
X_holdout_tfidf = pd.DataFrame(
    _tfidf_holdout.toarray(), columns=feature_names_tfidf)

# Clustering Models

In [17]:
scaler = MinMaxScaler()
X_train_tfidf_scaled = scaler.fit_transform(X_train_tfidf)
print(len(X_train_tfidf_scaled))
X_eval_tfidf_scaled = scaler.transform(X_eval_tfidf)
X_holdout_tfidf_scaled = scaler.transform(X_holdout_tfidf)

526


In [18]:
# Clustering models
models = []
names = []
plot_nums = []
silhouettes = []
clust = []

# The bandwidth does not depend on the cluster count, so estimate it once.
bandwidth = estimate_bandwidth(
    X_train_tfidf_scaled, quantile=0.2, n_samples=500)

for clusters in range(2, 11):
    models.append((0, 'KMeans', KMeans(n_clusters=clusters,
                                       init='k-means++', random_state=42)))
    models.append((1, 'MiniBatch', MiniBatchKMeans(
        init='random', n_clusters=clusters, batch_size=500)))
    # MeanShift picks its own number of clusters and ignores `clusters`.
    models.append((2, 'MeanShift', MeanShift(
        bandwidth=bandwidth, bin_seeding=True)))
    models.append((3, 'Spectral', SpectralClustering(n_clusters=clusters)))
    #models.append((4, 'Affinity', AffinityPropagation()))

for _, name, model in models:
    names.append(name)
    model.fit(X_train_tfidf_scaled)
    labels = model.labels_
    print(model)
    # Silhouette is undefined when everything lands in a single cluster.
    if len(set(labels)) > 1:
        # This refits the model; without a fixed random_state the second
        # set of assignments can differ from `labels`.
        ypred = model.fit_predict(X_train_tfidf_scaled)
        silhouette = metrics.silhouette_score(
            X_train_tfidf_scaled, labels, metric='euclidean')
        silhouettes.append(silhouette)
        # MeanShift has no n_clusters attribute, so fall back to a count.
        n_found = getattr(model, 'n_clusters', len(set(labels)))
        print('clusters: {}\t silhouette: {}\n'.format(n_found, silhouette))
        print(name, '\n', pd.crosstab(ypred, labels), '\n')

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=2, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=42, tol=0.0001, verbose=0)
clusters: 2	 silhouette: 0.5439510271051321

KMeans 
 col_0  0    1
row_0        
0      1    0
1      0  525 

MiniBatchKMeans(batch_size=500, compute_labels=True, init='random',
        init_size=None, max_iter=100, max_no_improvement=10, n_clusters=2,
        n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0,
        verbose=0)
clusters: 2	 silhouette: 0.20226804816371816

MiniBatch 
 col_0  0    1
row_0        
0      2  523
1      0    1 

MeanShift(bandwidth=3.071450947055253, bin_seeding=True, cluster_all=True,
     min_bin_freq=1, n_jobs=None, seeds=None)
SpectralClustering(affinity='rbf', assign_labels='kmeans', coef0=1, degree=3,
          eigen_solver=None, eigen_tol=0.0, gamma=1.0, kernel_params=None,
          n_clusters=2, n_init=10, n_jobs=None, n_neighbors=10,
          random

KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
    n_clusters=8, n_init=10, n_jobs=None, precompute_distances='auto',
    random_state=42, tol=0.0001, verbose=0)
clusters: 8	 silhouette: -0.05594710095693154

KMeans 
 col_0  0  1  2  3  4    5  6  7
row_0                          
0      1  0  0  0  0    0  0  0
1      0  5  0  0  0    0  0  0
2      0  0  1  0  0    0  0  0
3      0  0  0  3  0    0  0  0
4      0  0  0  0  1    0  0  0
5      0  0  0  0  0  513  0  0
6      0  0  0  0  0    0  1  0
7      0  0  0  0  0    0  0  1 

MiniBatchKMeans(batch_size=500, compute_labels=True, init='random',
        init_size=None, max_iter=100, max_no_improvement=10, n_clusters=8,
        n_init=3, random_state=None, reassignment_ratio=0.01, tol=0.0,
        verbose=0)
clusters: 8	 silhouette: -0.1119446724589565

MiniBatch 
 col_0  0    1  2  3  4  5   6  7
row_0                           
0      0    7  0  0  0  0   1  0
1      1  494  1  1  1  1  12  1
2      0    1 

In [29]:
# Re-run KMeans and extract cluster information.
model = KMeans(n_clusters=2, random_state=15).fit(X_train_tfidf_scaled)

# Extract cluster assignments for each data point.
labels = model.labels_

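# Note: this adds a `clusters` column to the tf-idf feature frame itself,
# so drop it before reusing X_train_tfidf as model input.
X_train_tfidf['clusters'] = labels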

In [30]:
X_train_tfidf_clusters = X_train_tfidf.groupby(
    ['clusters'], as_index=False).mean()
X_train_tfidf_clusters

       clusters    100   14th   1776   1787   1917    200   21st  abdicated  abhorring  ...  write  written  wrong   year  yearn  years    yes  yields  young   zeal
0             0    0.0    0.0    0.0    0.0    0.0  0.008    0.0        0.0        0.0  ...    0.0    0.032    0.0    0.0   0.01  0.018    0.0     0.0    0.0    0.0
1             1  0.001  0.001  0.001  0.001  0.001    0.0  0.001      0.001      0.001  ...  0.001    0.002  0.002  0.001    0.0  0.002  0.003   0.001  0.004  0.001


In [31]:
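# Select the row of mean tf-idf values for cluster 0. With a single row,
# describe() just repeats those means (count is 1 and std is NaN).
cluster0 = X_train_tfidf_clusters[X_train_tfidf_clusters['clusters'] == 0]
cluster0.describe()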

       clusters    100   14th   1776   1787   1917    200   21st  abdicated  abhorring  ...  write  written  wrong   year  yearn  years    yes  yields  young   zeal
count       1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0        1.0        1.0  ...    1.0      1.0    1.0    1.0    1.0    1.0    1.0     1.0    1.0    1.0
mean        0.0    0.0    0.0    0.0    0.0    0.0  0.008    0.0        0.0        0.0  ...    0.0    0.032    0.0    0.0   0.01  0.018    0.0     0.0    0.0    0.0
std         NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN        NaN        NaN  ...    NaN      NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN    NaN
min         0.0    0.0    0.0    0.0    0.0    0.0  0.008    0.0        0.0        0.0  ...    0.0    0.032    0.0    0.0   0.01  0.018    0.0     0.0    0.0    0.0
25%         0.0    0.0    0.0    0.0    0.0    0.0  0.008    0.0        0.0        0.0  ...    0.0    0.032    0.0    0.0   0.01  0.018    0.0     0.0    0.0    0.0
50%         0.0    0.0    0.0    0.0    0.0    0.0  0.008    0.0        0.0        0.0  ...    0.0    0.032    0.0    0.0   0.01  0.018    0.0     0.0    0.0    0.0
75%         0.0    0.0    0.0    0.0    0.0    0.0  0.008    0.0        0.0        0.0  ...    0.0    0.032    0.0    0.0   0.01  0.018    0.0     0.0    0.0    0.0
max         0.0    0.0    0.0    0.0    0.0    0.0  0.008    0.0        0.0        0.0  ...    0.0    0.032    0.0    0.0   0.01  0.018    0.0     0.0    0.0    0.0


In [39]:
# Count how many sentences fall in each cluster. `cluster0` is a DataFrame of
# cluster means, so count on the per-sentence `clusters` column instead.
X_train_tfidf['clusters'].value_counts()

In [32]:
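# Select the row of mean tf-idf values for cluster 1. As with cluster 0,
# describe() on a single row only repeats the means.
cluster1 = X_train_tfidf_clusters[X_train_tfidf_clusters['clusters'] == 1]
cluster1.describe()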

       clusters    100   14th   1776   1787   1917    200   21st  abdicated  abhorring  ...  write  written  wrong   year  yearn  years    yes  yields  young   zeal
count       1.0    1.0    1.0    1.0    1.0    1.0    1.0    1.0        1.0        1.0  ...    1.0      1.0    1.0    1.0    1.0    1.0    1.0     1.0    1.0    1.0
mean        1.0  0.001  0.001  0.001  0.001  0.001    0.0  0.001      0.001      0.001  ...  0.001    0.002  0.002  0.001    0.0  0.002  0.003   0.001  0.004  0.001
std         NaN    NaN    NaN    NaN    NaN    NaN    NaN    NaN        NaN        NaN  ...    NaN      NaN    NaN    NaN    NaN    NaN    NaN     NaN    NaN    NaN
min         1.0  0.001  0.001  0.001  0.001  0.001    0.0  0.001      0.001      0.001  ...  0.001    0.002  0.002  0.001    0.0  0.002  0.003   0.001  0.004  0.001
25%         1.0  0.001  0.001  0.001  0.001  0.001    0.0  0.001      0.001      0.001  ...  0.001    0.002  0.002  0.001    0.0  0.002  0.003   0.001  0.004  0.001
50%         1.0  0.001  0.001  0.001  0.001  0.001    0.0  0.001      0.001      0.001  ...  0.001    0.002  0.002  0.001    0.0  0.002  0.003   0.001  0.004  0.001
75%         1.0  0.001  0.001  0.001  0.001  0.001    0.0  0.001      0.001      0.001  ...  0.001    0.002  0.002  0.001    0.0  0.002  0.003   0.001  0.004  0.001
max         1.0  0.001  0.001  0.001  0.001  0.001    0.0  0.001      0.001      0.001  ...  0.001    0.002  0.002  0.001    0.0  0.002  0.003   0.001  0.004  0.001


In [38]:
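# Count the non-null cluster means per word. Every word gets 2 (one mean per
# cluster), so this does not yet show how many words appear in each cluster.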
X_train_tfidf_clusters.groupby('clusters').count().sum()

100              2
14th             2
1776             2
1787             2
1917             2
200              2
21st             2
abdicated        2
abhorring        2
abide            2
abiding          2
ability          2
able             2
abraham          2
abroad           2
absolute         2
abuses           2
accept           2
accession        2
accidental       2
accomplished     2
according        2
accordingly      2
account          2
achieve          2
achieved         2
achievement      2
acknowledge      2
acknowledging    2
acquiescence     2
                ..
wish             2
wishes           2
withal           2
withered         2
withhold         2
women            2
wonder           2
wonderful        2
wonders          2
word             2
words            2
work             2
workers          2
working          2
world            2
worldly          2
worse            2
worst            2
worth            2
worthy           2
write            2
written     
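The count above is 2 for every word because each cluster has exactly one mean per column. To see how many words actually occur in each cluster, one option is to count the columns whose mean tf-idf is above zero. A minimal sketch on a made-up three-word frame (the data and variable names are assumptions):

```python
import pandas as pd

# Toy tf-idf frame with a cluster assignment per sentence (illustrative).
tfidf_toy = pd.DataFrame({
    "liberty": [0.5, 0.0, 0.0],
    "union": [0.0, 0.4, 0.3],
    "peace": [0.0, 0.0, 0.6],
    "clusters": [0, 1, 1],
})

# Sentences per cluster.
cluster_sizes = tfidf_toy["clusters"].value_counts().sort_index()

# Mean tf-idf per word within each cluster; `clusters` becomes the index.
cluster_means = tfidf_toy.groupby("clusters").mean()

# A word occurs in a cluster if its mean tf-idf there is above zero.
words_per_cluster = (cluster_means > 0).sum(axis=1)

print(cluster_sizes)
print(words_per_cluster)
```

In this toy frame, cluster 0 has one sentence and one word, and cluster 1 has two sentences and two words.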

# Neural Network

In [19]:
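# Establish and fit the multi-layer perceptron model.
# Note: `ypred` holds the cluster assignments left over from the clustering
# loop above, not the presidents; use `y_train` to classify by author.
mlp = MLPClassifier(hidden_layer_sizes=(20,20,), random_state=15)
mlp.fit(X_train_tfidf, ypred)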



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(20, 20), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [20]:
# Find MLP score.
mlp.score(X_train_tfidf, ypred)

0.9923954372623575

In [21]:
# Find cross-validation score.
cross_val_score(mlp, X_train_tfidf, ypred, cv=5)



array([0.75454545, 0.73636364, 0.80769231, 0.80392157, 0.85      ])

In [22]:
# Adjust hidden layer parameters.
mlp1 = MLPClassifier(hidden_layer_sizes=(10,10,), random_state=15)
mlp1.fit(X_train_tfidf, ypred)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(10, 10), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [23]:
# Find accuracy score.
mlp1.score(X_train_tfidf, ypred)

0.9752851711026616

In [24]:
# Cross-validation.
cross_val_score(mlp1, X_train_tfidf, ypred, cv=5)



array([0.74545455, 0.73636364, 0.79807692, 0.79411765, 0.83      ])

In [25]:
# Adjust hidden layer parameters.
mlp2 = MLPClassifier(hidden_layer_sizes=(25,15,), random_state=15)
mlp2.fit(X_train_tfidf, ypred)



MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
       beta_2=0.999, early_stopping=False, epsilon=1e-08,
       hidden_layer_sizes=(25, 15), learning_rate='constant',
       learning_rate_init=0.001, max_iter=200, momentum=0.9,
       n_iter_no_change=10, nesterovs_momentum=True, power_t=0.5,
       random_state=15, shuffle=True, solver='adam', tol=0.0001,
       validation_fraction=0.1, verbose=False, warm_start=False)

In [26]:
# Find accuracy score.
mlp2.score(X_train_tfidf, ypred)

0.9961977186311787

In [27]:
# Cross-validation.
cross_val_score(mlp2, X_train_tfidf, ypred, cv=5)



array([0.76363636, 0.73636364, 0.79807692, 0.80392157, 0.85      ])
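The cells above fit each MLP against `ypred`, the cluster assignments. A minimal sketch of the author-classification variant follows: it uses author labels as the target and compares training accuracy with cross-validated accuracy. The toy corpus, author names, and `max_iter` value are assumptions for illustration, not the notebook's data or settings.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.neural_network import MLPClassifier
from sklearn.pipeline import Pipeline

# Toy corpus with two hypothetical authors (illustrative only).
texts = [
    "The union of the states must be preserved.",
    "The states formed this union together.",
    "No state may leave the union lawfully.",
    "The constitution binds every state in the union.",
    "Markets and jobs drive the economy forward.",
    "The economy needs jobs and open markets.",
    "Jobs grow when markets stay open.",
    "Open markets reward workers and the economy.",
]
authors = ["author_a"] * 4 + ["author_b"] * 4

pipe = Pipeline([
    ("tfidf", TfidfVectorizer(stop_words="english")),
    ("mlp", MLPClassifier(hidden_layer_sizes=(20, 20), max_iter=500,
                          random_state=15)),
])

# Training accuracy: usually optimistic, since the model has seen these rows.
train_score = pipe.fit(texts, authors).score(texts, authors)

# Cross-validated accuracy: each fold is scored on sentences it never saw.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=15)
cv_scores = cross_val_score(pipe, texts, authors, cv=cv)

print(f"train accuracy: {train_score:.3f}")
print("cv accuracy:", cv_scores)
```

A wide gap between the two numbers, like the one between the training and cross-validation scores reported above, is a common sign of overfitting.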