## Challenge: Build your own NLP model

For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods: For example, BoW vs tf-idf
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes
4. Assess your models using cross-validation and determine whether one model performed better
5. Pick one of the models and try to increase accuracy by at least 5 percentage points

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.

In [1]:
# Import data science environment.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn
import scipy
import re
import spacy
from nltk.corpus import shakespeare, stopwords
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
%matplotlib inline

In [2]:
# Utility function to clean text.
def text_cleaner(text):
    """Function to strip all characters except letters in words."""
    text = re.sub(r'--', ' ', text)
    text = re.sub("[\[].*?[\]]", "", text)
    text = re.sub("[\<].*?[\>]", "", text)
    text = ' '.join(text.split())
    return text

In [3]:
# Import texts
much_ado = open("Much Ado About Nothing.txt", encoding='utf-16')
romeo = open("Romeo and Juliet.txt", encoding='utf-16')

# Read the data.
much_ado_raw = much_ado.read()
romeo_raw = romeo.read()

In [4]:
# Clean the data.
much_ado_clean = text_cleaner(much_ado_raw)
romeo_clean = text_cleaner(romeo_raw)

In [5]:
much_ado_clean

"I learn in this letter that Don Pedro of Arragon comes this night to Messina. He is very near by this: he was not three leagues off when I left him. How many gentlemen have you lost in this action? But few of any sort, and none of name. A victory is twice itself when the achiever brings home full numbers. I find here that Don Pedro hath bestowed much honour on a young Florentine called Claudio. Much deserved on his part and equally remembered by Don Pedro. He hath borne himself beyond the promise of his age, doing in the figure of a lamb the feats of a lion: he hath indeed better bettered expectation than you must expect of me to tell you how. He hath an uncle here in Messina will be very much glad of it. I have already delivered him letters, and there appears much joy in him; even so much that joy could not show itself modest enough without a badge of bitterness. Did he break out into tears? In great measure. A kind overflow of kindness. There are no faces truer than those that are s

In [6]:
romeo_clean



In [7]:
# Parse the data. This can take some time.
nlp = spacy.load('en')
much_ado_doc = nlp(much_ado_clean)
romeo_doc = nlp(romeo_clean)

In [8]:
# Group into sentences.
much_ado_sents = [[sent, "Much Ado"] for sent in much_ado_doc.sents]
romeo_sents = [[sent, "Romeo"] for sent in romeo_doc.sents]

# Combine the sentences from the two plays into one data frame.
sentences = pd.DataFrame(much_ado_sents + romeo_sents)
sentences.head()

Unnamed: 0,0,1
0,"(I, learn, in, this, letter, that, Don, Pedro,...",Much Ado
1,"(He, is, very, near, by, this, :, he, was, not...",Much Ado
2,"(How, many, gentlemen, have, you, lost, in, th...",Much Ado
3,"(But, few, of, any, sort, ,, and, none, of, na...",Much Ado
4,"(A, victory, is, twice, itself, when, the, ach...",Much Ado


In [9]:
# Set up bag of words function for each text.
def bag_of_words(text):
    """Counts the total number of instances of each word in a doc."""
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]
    return [item[0] for item in Counter(allwords).most_common(500)]


def bow_features(sentences, common_words):
    """The 'sentences' variable is a data frame; common_words is a set."""
    df = pd.DataFrame(columns=common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, common_words] = 0

    for i, sentence in enumerate(df['text_sentence']):
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in common_words
                 )]
        for word in words:
            df.loc[i, word] += 1
        if i % 100 == 0:
            print('Processing row {}'.format(i))
    return df

In [10]:
# Set up bags for each play.
much_ado_words = bag_of_words(much_ado_doc)
romeo_words = bag_of_words(romeo_doc)

# Make bag of common words.
common_words = set(much_ado_words + romeo_words)

In [11]:
word_counts = bow_features(sentences, common_words)
word_counts.head()

Processing row 0
Processing row 100
Processing row 200
Processing row 300
Processing row 400
Processing row 500
Processing row 600
Processing row 700
Processing row 800
Processing row 900
Processing row 1000
Processing row 1100
Processing row 1200
Processing row 1300
Processing row 1400
Processing row 1500
Processing row 1600
Processing row 1700
Processing row 1800
Processing row 1900
Processing row 2000
Processing row 2100
Processing row 2200
Processing row 2300
Processing row 2400
Processing row 2500
Processing row 2600
Processing row 2700
Processing row 2800
Processing row 2900
Processing row 3000
Processing row 3100
Processing row 3200
Processing row 3300
Processing row 3400
Processing row 3500
Processing row 3600
Processing row 3700
Processing row 3800
Processing row 3900


Unnamed: 0,earth,bring,now,this,sweet,church,consent,rime,give,guest,...,forth,in,ring,long,from,monument,ursula,heaven,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(I, learn, in, this, letter, that, Don, Pedro,...",Much Ado
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(He, is, very, near, by, this, :, he, was, not...",Much Ado
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(How, many, gentlemen, have, you, lost, in, th...",Much Ado
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(But, few, of, any, sort, ,, and, none, of, na...",Much Ado
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(A, victory, is, twice, itself, when, the, ach...",Much Ado


In [12]:
from sklearn import ensemble

rfc = ensemble.RandomForestClassifier()
Y = word_counts['text_source']
X = np.array(word_counts.drop(['text_sentence', 'text_source'], 1))

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=15
                                                   )
train = rfc.fit(X_train, y_train)

print('Training set score: ', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))



Training set score:  0.9501915708812261

Test set score: 0.6768837803320562


In [13]:
from sklearn.linear_model import LogisticRegression

lr = LogisticRegression()
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

(2349, 695) (2349,)
Training set score: 0.8437633035334184

Test set score: 0.7273307790549169




In [14]:
clf = ensemble.GradientBoostingClassifier()
train = clf.fit(X_train, y_train)

print('Training set score:', clf.score(X_train, y_train))
print('\nTest set score:', clf.score(X_test, y_test))

Training set score: 0.7454235845040442

Test set score: 0.6717752234993615


In [15]:
# SVM model, import packages.
from sklearn.svm import LinearSVC

svc = LinearSVC()
svc.fit(X_train, y_train)
print(svc)
print('Training set score:', svc.score(X_train, y_train))
print('\nTest set score:', svc.score(X_test, y_test))

LinearSVC(C=1.0, class_weight=None, dual=True, fit_intercept=True,
     intercept_scaling=1, loss='squared_hinge', max_iter=1000,
     multi_class='ovr', penalty='l2', random_state=None, tol=0.0001,
     verbose=0)
Training set score: 0.8671775223499362

Test set score: 0.7171136653895275


In [16]:
# Calculate word frequencies.
def word_frequencies(text, include_stop=True):
    """A data frame that will keep track of word usage."""
    # Build a list of words.
    # Strip out punctuation and stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text)
            
    # Build and return a Counter object containing word counts.
    return Counter(words)

# The most frequent words:
much_ado_freq = word_frequencies(
    much_ado_doc, include_stop=False).most_common(100)
print('Much Ado:', much_ado_freq)
romeo_freq = word_frequencies(romeo_doc, include_stop=False).most_common(100)

Much Ado: [('I', 712), ("'s", 170), ('And', 116), ('man', 111), ('love', 90), ('good', 78), ('thou', 74), ('thee', 74), ('shall', 72), ('hath', 67), ('lord', 66), ('God', 63), ('Hero', 63), ('Claudio', 57), ('know', 56), ('Benedick', 56), ('But', 51), ('let', 51), ('prince', 51), ('like', 50), ('thy', 50), ('If', 49), ('lady', 48), ("'ll", 48), ('O', 46), ('come', 45), ('think', 45), ('What', 43), ('You', 43), ('Beatrice', 40), ('tell', 39), ('hear', 38), ('brother', 37), ('The', 37), ('To', 37), ('night', 35), ('No', 35), ('Signior', 33), ('cousin', 33), ('heart', 33), ('Come', 33), ('That', 33), ('Why', 32), ('sir', 31), ('Leonato', 30), ('daughter', 30), ('wit', 29), ('Well', 28), ('speak', 28), ('men', 28), ('In', 27), ('pray', 27), ('How', 25), ('Yea', 25), ('My', 23), ('Is', 23), ('answer', 23), ('true', 22), ('doth', 22), ('For', 22), ('morrow', 22), ('He', 21), ('A', 21), ('faith', 21), ('till', 21), ('hand', 21), ('Nay', 21), ('As', 21), ('She', 21), ('fashion', 20), ('old', 2

In [17]:
# Pull out just the text from our frequency lists.
much_ado_common = [pair[0] for pair in much_ado_freq]
romeo_common = [pair[0] for pair in romeo_freq]

# Use sets to find the unique values in each top 100.
print('Unique to Much Ado:', set(much_ado_common) - set(romeo_common))
print('Unique to Romeo:', set(romeo_common) - set(much_ado_common))

Unique to Much Ado: {'great', 'pray', 'She', 'Don', 'thank', 'Lady', 'brother', 'wit', 'Beatrice', 'husband', 'Yea', 'Nay', 'think', 'answer', 'cousin', 'bid', 'Claudio', 'leave', 'John', 'Well', 'way', 'marry', 'fool', 'till', 'matter', 'Signior', 'Benedick', 'Hero', 'Count', 'Good', 'prince', 'fashion', 'gentleman', 'faith', 'He', 'There', 'daughter', 'Margaret', 'wear', 'said', 'Leonato'}
Unique to Romeo: {'dead', 'sweet', 'hast', 'bed', 'eyes', 'Where', 'Romeo', 'Ay', 'Here', 'With', 'Which', 'father', 'Tybalt', 'Juliet', 'Thou', 'stay', 'light', 'nurse', 'lie', 'Of', 'tears', 'find', 'word', 'Or', 'This', 'Now', 'art', 'time', 'comes', 'gone', 'Go', 'Then', 'house', 'dear', 'stand', 'Montague', 'Paris', 'tis', 'heaven', 'face', 'wilt'}


In [18]:
# Utility function to calculate how frequently lemmas appear in the text.
def lemma_frequencies(text, include_stop=True):
    """Function to identify lemma frequencies"""
    # Build a list of lemmas.
    # Strip out punctuation and stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)
            
    # Build and return a Counter object containing word counts.
    return Counter(lemmas)

# Instantiate our list of most common lemmas.
much_ado_lemma_freq = lemma_frequencies(
    much_ado_doc, include_stop=False).most_common(100)
romeo_lemma_freq = lemma_frequencies(
    romeo_doc, include_stop=False).most_common(100)
print('\nMuch Ado:', much_ado_lemma_freq)
print('Romeo:', romeo_lemma_freq)

# Again, identify the lemmas common to one text but not the other.
much_ado_lemma_common = [pair[0] for pair in much_ado_lemma_freq]
romeo_lemma_common = [pair[0] for pair in romeo_lemma_freq]

print('Unique to Much Ado:', set(much_ado_lemma_common) -
      set(romeo_lemma_common))
print('Unique to Romeo:', set(romeo_lemma_common) -
      set(much_ado_lemma_common))


Much Ado: [('-PRON-', 877), ('man', 140), ('and', 116), ('love', 113), ("'s", 107), ('good', 105), ('be', 104), ('come', 99), ('thou', 86), ('know', 84), ('shall', 82), ('hath', 76), ('thee', 74), ('lord', 73), ('lady', 69), ('god', 68), ('let', 65), ('hero', 63), ('will', 61), ('think', 61), ('prince', 59), ('tell', 58), ('claudio', 57), ('benedick', 56), ('like', 52), ('thy', 52), ('but', 51), ('hear', 51), ('speak', 51), ('o', 49), ('if', 49), ('what', 43), ('signior', 42), ('beatrice', 40), ('brother', 39), ('marry', 39), ('night', 37), ('sir', 37), ('the', 37), ('to', 37), ('cousin', 36), ('heart', 36), ('no', 35), ('pray', 34), ('that', 33), ('why', 32), ('wit', 31), ('daughter', 31), ('leonato', 30), ('look', 29), ('say', 29), ('go', 28), ('well', 28), ('yea', 28), ('count', 28), ('master', 28), ('in', 27), ('swear', 27), ('die', 27), ('word', 26), ('hand', 26), ('answer', 26), ('how', 25), ('true', 25), ('leave', 24), ('faith', 24), ('doth', 24), ('nay', 24), ('bring', 22), ('

In [19]:
# Let's see how many sentences are in each play.
sents_much_ado = list(much_ado_doc.sents)
sents_romeo = list(romeo_doc.sents)

print("Much Ado About Nothing has {} sentences.".format(len(sents_much_ado)))
print("Romeo and Juliet has {} sentences.".format(len(sents_romeo)))

Much Ado About Nothing has 1685 sentences.
Romeo and Juliet has 2230 sentences.


In [20]:
# Copy word_counts data frame as a form of version control.
word_counts2 = word_counts

# Add a column for the word counts in each sentence to the data frame.
word_counts2['sent_length'] = word_counts2.text_sentence.map(lambda x: len(x))

# Let's create a count for parts of speech.
# Adverbs in each sentence.
sentences2 = word_counts2.text_sentence
adv_count = []
for sent in sentences2:
    advs = 0
    for token in sent:
        if token.pos_ == 'ADV':
            advs +=1
    adv_count.append(advs)
    
# Add adverbs column to data frame.
word_counts2['adv_count'] = adv_count

# Verbs in each sentence.
verb_count = []
for sent in sentences2:
    verb = 0
    for token in sent:
        if token.pos_ == 'VERB':
            verb +=1
    verb_count.append(verb)
    
# Add verbs column to data frame.
word_counts2['verb_count'] = verb_count

# Nouns in each sentence:
noun_count = []
for sent in sentences2:
    noun = 0
    for token in sent:
        if token.pos_ == 'NOUN':
            noun +=1
    noun_count.append(noun)
    
# Add nouns column to data frame.
word_counts2['noun_count'] = noun_count

# Punctuation marks in each sentence.
punct_count = []
for sent in sentences2:
    punct = 0
    for token in sent:
        if token.pos_ == 'PUNCT':
            punct +=1
    punct_count.append(punct)
    
# Add punctuation column to data frame.
word_counts2['punct_count'] = punct_count

word_counts2.head()

Unnamed: 0,earth,bring,now,this,sweet,church,consent,rime,give,guest,...,monument,ursula,heaven,text_sentence,text_source,sent_length,adv_count,verb_count,noun_count,punct_count
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(I, learn, in, this, letter, that, Don, Pedro,...",Much Ado,16,0,2,2,1
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(He, is, very, near, by, this, :, he, was, not...",Much Ado,18,3,3,1,2
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(How, many, gentlemen, have, you, lost, in, th...",Much Ado,10,1,2,2,1
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,"(But, few, of, any, sort, ,, and, none, of, na...",Much Ado,11,0,0,3,2
4,0,1,0,0,0,0,0,0,0,0,...,0,0,0,"(A, victory, is, twice, itself, when, the, ach...",Much Ado,13,2,2,4,1


In [27]:
# Let's go back and re-try SVM with the new features.
Y2 = word_counts2['text_source']
X2 = np.array(word_counts2.drop(['text_sentence', 'text_source'], 1))

X2_train, X2_test, y2_train, y2_test = train_test_split(X,Y,test_size=0.4,random_state=15)
svm2 = LinearSVC()
train2 = svm2.fit(X2_train, y2_train)
print('Training set score: ', svm2.score(X2_train, y2_train))
print('\nTest set score: ', svm2.score(X2_test, y2_test))

Training set score:  0.8718603661132397

Test set score:  0.723499361430396




In [28]:
# Let's see this in crosstab.
svm2_predicted = svm2.predict(X_test)
pd.crosstab(y2_test, svm2_predicted)

col_0,Much Ado,Romeo
text_source,Unnamed: 1_level_1,Unnamed: 2_level_1
Much Ado,410,270
Romeo,163,723


In [29]:
# Let's try the Random Forest again with the new features.
rfc2 = ensemble.RandomForestClassifier()
Y3 = word_counts2['text_source']
X3 = np.array(word_counts2.drop(['text_sentence', 'text_source'], 1))

X3_train, X3_test, y3_train, y3_test = train_test_split(X3,
                                                        Y3,
                                                        test_size=0.4,
                                                        random_state=15
                                                       )

train3 = rfc2.fit(X3_train, y3_train)

print('Training set score: ', rfc2.score(X3_train, y3_train))
print('\nTest set score:', rfc2.score(X3_test, y3_test))



Training set score:  0.9753086419753086

Test set score: 0.6877394636015326


In [30]:
# Create new variables that aren't spacy tokens.
much_ado_tfidf = much_ado_clean
romeo_tfidf = romeo_clean

# Group into sentences.
much_ado_tfidf_sents = [[sent, "Much Ado"] for sent in much_ado_tfidf]
romeo_tfidf_sents = [[sent, "Romeo"] for sent in romeo_tfidf]

# Combine sentences from two plays into one data frame.
sentences_tfidf = pd.DataFrame(much_ado_tfidf_sents + romeo_tfidf_sents)
sentences_tfidf.head()

Unnamed: 0,0,1
0,I,Much Ado
1,,Much Ado
2,l,Much Ado
3,e,Much Ado
4,a,Much Ado


In [32]:
# Let's set up our model again.
Y4 = word_counts2['text_source']
X4 = np.array(word_counts2.drop(['text_sentence', 'text_source'], 1))

X4_train, X4_test, y4_train, y4_test = train_test_split(X4,
                                                        Y4,
                                                        test_size=0.4,
                                                        random_state=15)

lr = LogisticRegression(solver='lbfgs', max_iter=5000)
train4 = lr.fit(X4_train, y4_train)
print(X4_train.shape, y4_train.shape)
print('Training set score:', lr.score(X4_train, y4_train))
print('\nTest set score:', lr.score(X4_test, y4_test))

(2349, 700) (2349,)
Training set score: 0.8424861643252448

Test set score: 0.7369093231162197


After trying this many different ways with many Q & A Session visits with multiple mentors, I am unable to improve upon the results of these models. Some of the models do improve as we go on, but nothing close to the 5% improvement that is requested.