For this challenge, I will use the inaugrual speeches corpus from nltk and create an analysis pipeline that includes the following steps:

- Data cleaning / processing / language parsing

- Create features using two different NLP methods: For example, BoW vs tf-idf.

- Use the features to fit supervised learning models for each feature set to predict the category outcomes.

- Assess your models using cross-validation and determine whether one model performed better.

- Pick one of the models and try to increase accuracy by at least 5 percentage points.

I will be using the first inaugural speech of Obama and Trump and attempt to have my models predict the speaker correctly

In [5]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import re
import nltk
from nltk.corpus import inaugural
import spacy
from collections import Counter
import warnings

warnings.filterwarnings('ignore')

# Cleaning, Processing, and Parsing

In [14]:
obama = inaugural.raw('2009-Obama.txt')
print(obama[:10000])

My fellow citizens:

I stand here today humbled by the task before us, grateful for the trust you have bestowed, mindful of the sacrifices borne by our ancestors. I thank President Bush for his service to our nation, as well as the generosity and cooperation he has shown throughout this transition.

Forty-four Americans have now taken the presidential oath. The words have been spoken during rising tides of prosperity and the still waters of peace. Yet, every so often the oath is taken amidst gathering clouds and raging storms. At these moments, America has carried on not simply because of the skill or vision of those in high office, but because We the People have remained faithful to the ideals of our forbearers, and true to our founding documents.

So it has been. So it must be with this generation of Americans.

That we are in the midst of crisis is now well understood. Our nation is at war, against a far-reaching network of violence and hatred. Our economy is badly weakened, a conse

As we consider the road that unfolds before us, we remember with humble gratitude those brave Americans who, at this very hour, patrol fa


In [17]:
len(obama)

13439

In [19]:
trump = inaugural.raw('2017-Trump.txt')
print(trump[:10000])

Chief Justice Roberts, President Carter, President Clinton, President Bush, President Obama, fellow Americans, and people of the world: Thank you.

We, the citizens of America, are now joined in a great national effort to rebuild our country and restore its promise for all of our people. Together, we will determine the course of America and the world for many, many years to come. We will face challenges, we will confront hardships, but we will get the job done.

Every 4 years, we gather on these steps to carry out the orderly and peaceful transfer of power, and we are grateful to President Obama and First Lady Michelle Obama for their gracious aid throughout this transition. They have been magnificent. Thank you.

Today's ceremony, however, has very special meaning. Because today we are not merely transferring power from one administration to another or from one party to another, but we are transferring power from Washington, DC, and giving it back to you, the people.

For too long, a 




In [21]:
len(trump)

8449

In [23]:
def text_cleaner(text):
    text = re.sub(r'--',' ',text)
    text = ' '.join(text.split())
    return text

obama = text_cleaner(obama)
trump = text_cleaner(trump)

In [24]:
nlp = spacy.load('en')
obama_doc = nlp(obama)
trump_doc = nlp(trump)

# Bag of Words+ Features Model

I will create a model that uses the Bag of Words approach, plus a few extra features

In [39]:
obama_sents = [[sent, 'obama'] for sent in obama_doc.sents]
trump_sents = [[sent, 'trump'] for sent in trump_doc.sents]

sentences = pd.DataFrame(obama_sents + trump_sents)

sentences.head()

Unnamed: 0,0,1
0,"(My, fellow, citizens, :, I, stand, here, toda...",obama
1,"(I, thank, President, Bush, for, his, service,...",obama
2,"(Forty, -, four, Americans, have, now, taken, ...",obama
3,"(The, words, have, been, spoken, during, risin...",obama
4,"(Yet, ,, every, so, often, the, oath, is, take...",obama


In [41]:
# Utility function to create a list of the 500 most common words.
def bag_of_words(text):
    
    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    
    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(500)]

obama_words = bag_of_words(obama_doc)
trump_words = bag_of_words(trump_doc)

common_words = set(obama_words + trump_words)

In [43]:
len(common_words)

758

In [46]:
# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.
# Also features for sentence length, amount of punctuation per sentence,
# and parts of speech counts

def nlp_features(sentences, common_words):
    
    # Scaffold the data frame
    df = pd.DataFrame(columns=common_words)
    df['sentence'] = sentences[0]
    df['speaker'] = sentences[1]
    # Set empty features
    df['sentence_length'] = np.nan
    df['punctuation_count'] = np.nan
    df['noun_count'] = np.nan
    df['verb_count'] = np.nan
    df['adjective_count'] = np.nan
    df['adverb_count'] = np.nan
    df['pronoun_count'] = np.nan
    df['proper_noun_count'] = np.nan
    df['conjunction_count'] = np.nan
    # Initialize word counts to zer0
    df.loc[:, common_words] = 0
    
    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['sentence']):
        
        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        
        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1
        
        # Create the other feature counts
        df.loc[i, 'sentence_length'] = len([token for token in sentence if not token.is_punct])
        df.loc[i, 'punctuation_count'] = len([token for token in sentence if token.is_punct])
        df.loc[i, 'noun_count'] = len([token for token in sentence if token.pos_ == 'NOUN'])
        df.loc[i, 'verb_count'] = len([token for token in sentence if token.pos_ == 'VERB'])
        df.loc[i, 'conjunction_count'] = len([token for token in sentence if token.pos_ == 'CONJ' or 'ADP'])
        df.loc[i, 'adjective_count'] = len([token for token in sentence if token.pos_ == 'ADJ'])
        df.loc[i, 'adverb_count'] = len([token for token in sentence if token.pos_ == 'ADV'])
        df.loc[i, 'pronoun_count'] = len([token for token in sentence if token.pos_ == 'PRON'])
        df.loc[i, 'proper_noun_count'] = len([token for token in sentence if token.pos_ == 'PROPN'])
            
    return df

In [47]:
nlp = nlp_features(sentences, common_words)

nlp.head()

Unnamed: 0,measure,crisis,change,control,favor,peril,scarcely,govern,ceremony,depletion,...,speaker,sentence_length,punctuation_count,noun_count,verb_count,adjective_count,adverb_count,pronoun_count,proper_noun_count,conjunction_count
0,0,0,0,0,0,0,0,0,0,0,...,obama,28.0,4.0,6.0,5.0,5.0,1.0,3.0,0.0,32.0
1,0,0,0,0,0,0,0,0,0,0,...,obama,23.0,2.0,5.0,3.0,2.0,2.0,2.0,2.0,25.0
2,0,0,0,0,0,0,0,0,0,0,...,obama,9.0,2.0,1.0,2.0,1.0,1.0,0.0,1.0,11.0
3,0,0,0,0,0,0,0,0,0,0,...,obama,16.0,1.0,5.0,4.0,0.0,1.0,0.0,0.0,17.0
4,0,0,0,0,0,0,0,0,0,0,...,obama,14.0,2.0,4.0,3.0,0.0,3.0,0.0,0.0,16.0


## Model Validation and Selection

I will use cross validation on Gradient Boosting, Random Forest, and Logistic Regression models and select which performs best to represent the Bag of Words+ feature approach

In [53]:
from sklearn.model_selection import cross_val_score as cvs
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

X = nlp.drop(columns=['sentence', 'speaker'])
y = nlp['speaker']

In [54]:
gbc = GradientBoostingClassifier(n_estimators=1000, max_depth=5, learning_rate=0.1, random_state=42)

cvs(gbc, X, y, cv=5)

array([0.5       , 0.54761905, 0.6097561 , 0.6       , 0.525     ])

In [55]:
gbc = GradientBoostingClassifier(n_estimators=100, max_depth=5, learning_rate=0.1, random_state=42)

cvs(gbc, X, y, cv=5)

array([0.47619048, 0.57142857, 0.63414634, 0.6       , 0.425     ])

In [56]:
gbc = GradientBoostingClassifier(n_estimators=1000, max_depth=2, learning_rate=0.1, random_state=42)

cvs(gbc, X, y, cv=5)

array([0.57142857, 0.5952381 , 0.58536585, 0.525     , 0.525     ])

In [57]:
gbc = GradientBoostingClassifier(n_estimators=1000, max_depth=2, learning_rate=0.01, random_state=42)

cvs(gbc, X, y, cv=5)

array([0.47619048, 0.54761905, 0.63414634, 0.625     , 0.55      ])

In [58]:
rfc = RandomForestClassifier(n_estimators=1000, max_depth=5, random_state=42)

cvs(rfc, X, y, cv=5)

array([0.57142857, 0.54761905, 0.63414634, 0.625     , 0.575     ])

In [59]:
rfc = RandomForestClassifier(n_estimators=1000, max_depth=2, random_state=42)

cvs(rfc, X, y, cv=5)

array([0.54761905, 0.54761905, 0.56097561, 0.55      , 0.525     ])

In [60]:
rfc = RandomForestClassifier(n_estimators=10000, max_depth=2, random_state=42)

cvs(rfc, X, y, cv=5)

array([0.54761905, 0.57142857, 0.56097561, 0.55      , 0.55      ])

In [65]:
rfc = RandomForestClassifier(n_estimators=10000, max_depth=3, random_state=42)

cvs(rfc, X, y, cv=5)

array([0.5952381 , 0.57142857, 0.56097561, 0.55      , 0.575     ])

In [67]:
rfc = RandomForestClassifier(n_estimators=10000, max_depth=5, random_state=42)

cvs(rfc, X, y, cv=5)

array([0.57142857, 0.57142857, 0.63414634, 0.6       , 0.575     ])

In [61]:
lr = LogisticRegression(penalty='l2', solver='lbfgs')

cross_val_score(lr, X, y, cv=5)

array([0.71428571, 0.5       , 0.68292683, 0.6       , 0.575     ])

In [64]:
lr = LogisticRegression(penalty='l1', solver='liblinear')
cross_val_score(lr, X, y, cv=5)

array([0.64285714, 0.45238095, 0.70731707, 0.625     , 0.575     ])

The most stable performing model for this feature set is the Random Forest Classifier with 10,000 trees and a node depth of 3 (a depth of 5 performed slightly better, but had less consistency.

I will move to feature extraction through tfidf to see if I can get a better model.

# Tf-Idf Feature based Model

In [109]:
obama = inaugural.sents('2009-Obama.txt')
trump = inaugural.sents('2017-Trump.txt')

obama_sents = []
for sentence in obama:
    sentence=[re.sub(r'--','',word) for word in sentence]
    #Forming each paragraph into a string and adding it to the list of strings.
    obama_sents.append(' '.join(sentence))
    
trump_sents = []
for sentence in trump:
    sentence=[re.sub(r'--', '', word) for word in para]
    trump_sents.append(' '.join(sentence))


In [110]:
obama_sents = [[sentence, 'Obama'] for sentence in obama_sents]
trump_sents = [[sentence, 'Trump'] for sentence in trump_sents]

obama_df = pd.DataFrame(obama_sents)
trump_df = pd.DataFrame(trump_sents)

sentences = pd.concat([obama_df, trump_df])
sentences.columns = ['sentence', 'speaker']
sentences.head()

Unnamed: 0,sentence,speaker
0,My fellow citizens :,Obama
1,I stand here today humbled by the task before ...,Obama
2,I thank President Bush for his service to our ...,Obama
3,Forty - four Americans have now taken the pres...,Obama
4,The words have been spoken during rising tides...,Obama


In [118]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(max_df=0.50, # drop words that occur in more than half of the sentences
                        min_df=2, # only use words that appear at least twice
                        stop_words='english', 
                        lowercase=True,
                        use_idf=True,
                        norm='l2',
                        smooth_idf=True 
                        )

features = tfidf.fit_transform(sentences.sentence).toarray()
speaker = sentences.speaker
features.shape

(202, 153)

I now have a tf-idf vector with 153 features for all 202 of the sentences in our two speakers' speeches. I will now perform dimension reduction using SVD before putting these features through models.

In [123]:
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import Normalizer

svd = TruncatedSVD(50)
lsa = make_pipeline(svd, Normalizer(copy=False))
# Run SVD on the training data, then project the training data.
features_lsa = lsa.fit_transform(features)

variance_explained=svd.explained_variance_ratio_
total_variance = variance_explained.sum()
print("Percent variance captured by all components:",total_variance*100)

Percent variance captured by all components: 86.08211742480594


In [124]:
features_lsa = pd.DataFrame(features_lsa)

In [125]:
features_lsa.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,40,41,42,43,44,45,46,47,48,49
0,-7.046199e-10,-0.014653,-0.003307,0.021139,-0.039822,0.01309,-0.025672,-0.010206,-0.039347,-0.067744,...,0.074523,-0.312874,0.000251,-0.317145,0.015131,-0.133358,0.002149,-0.075164,0.118999,-0.186333
1,0.0003908475,0.125269,-0.049826,0.173678,0.011714,-0.061401,0.024625,0.294839,0.273731,-0.197758,...,0.157111,-0.156559,0.027084,0.168092,0.323889,0.055963,0.291603,-0.103671,-0.135865,0.124915
2,0.0002252687,0.31522,-0.224252,-0.273194,-0.099771,-0.20205,-0.032729,-0.013078,0.025797,-0.061808,...,0.026618,0.026993,-0.001353,-0.007741,-0.000705,0.004791,-0.001914,-0.032065,0.006595,-0.021935
3,1.885232e-06,0.032362,-0.046252,0.019306,-0.066298,0.22751,-0.087092,-0.057219,0.225864,0.143585,...,0.015019,0.057856,0.012784,-0.041788,-0.043848,0.022843,0.015696,-0.057411,-0.105684,0.011366
4,0.0003821903,0.169116,-0.150595,0.099924,-0.052433,0.023431,-0.008582,-0.026819,-0.044918,0.090594,...,0.214141,0.068422,0.232494,-0.021983,-0.182088,0.074826,0.140921,0.063264,-0.154343,0.020363


In [127]:
X = features_lsa
y = speaker

gbc = GradientBoostingClassifier(n_estimators=1000, max_depth=5, learning_rate=0.1, random_state=42)

cvs(gbc, X, y, cv=5)

array([0.95121951, 1.        , 1.        , 1.        , 0.975     ])

In [128]:
rfc = RandomForestClassifier(n_estimators=10000, max_depth=3, random_state=42)

cvs(rfc, X, y, cv=5)

array([1., 1., 1., 1., 1.])

In [130]:
lr = LogisticRegression(penalty='l2', solver='lbfgs')

cross_val_score(lr, X, y, cv=5)

array([1.   , 1.   , 1.   , 1.   , 0.975])

It appears that I have created some very good models using the tf-idf features with reduced dimensions. Perhaps too good, as overfitting is now a querstion. But within the scope of the questions of predicting whether a sentence came from Obama or Trump's inaugrual speaches, I have some very well-performing models to use.