## Challenge: Build your own NLP model

For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods: For example, BoW vs tf-idf
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes
4. Assess your models using cross-validation and determine whether one model performed better
5. Pick one of the models and try to increase accuracy by at least 5 percentage points

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.
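
As a rough orientation, steps 2 through 4 can be sketched on a toy corpus with scikit-learn pipelines. This is only an illustration under stated assumptions: the tiny `texts` and `labels` lists and the helper `compare_feature_sets` are hypothetical stand-ins, not the data or code used in the rest of this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

# Hypothetical toy corpus; the notebook itself uses sentences from two plays.
texts = [
    "the prince loves the lady", "the count and the prince jest",
    "a merry war of wit", "the lady mocks the count",
    "the prince woos for the count", "wit and jest at the feast",
    "the nurse calls for juliet", "romeo hides in the night",
    "the friar weds romeo and juliet", "the night hides the lovers",
    "juliet wakes in the tomb", "the nurse and the friar grieve",
]
labels = ["Much Ado"] * 6 + ["Romeo"] * 6


def compare_feature_sets(texts, labels, cv=3):
    """Return the mean cross-validated accuracy for a BoW and a tf-idf pipeline."""
    candidates = {
        "bow": make_pipeline(CountVectorizer(), LogisticRegression(max_iter=1000)),
        "tfidf": make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000)),
    }
    # Each pipeline is refit inside every fold, so the vectorizer never
    # sees the held-out sentences.
    return {
        name: cross_val_score(pipe, texts, labels, cv=cv).mean()
        for name, pipe in candidates.items()
    }


scores = compare_feature_sets(texts, labels)
for name, score in scores.items():
    print(f"{name}: mean CV accuracy = {score:.2f}")
```

Wrapping the vectorizer and the classifier in one pipeline is a design choice: it keeps the feature extraction inside the cross-validation loop, which avoids fitting vocabulary or IDF weights on test folds.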

In [1]:
# Import data science environment.
import numpy as np
import pandas as pd
import re
import spacy
from typing import List, Set
from nltk.corpus import shakespeare, stopwords
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix, classification_report

# Data Cleaning, Processing, Language Parsing

In [2]:
# Import texts and read the data.
with open(r"./Much Ado About Nothing.txt", encoding='utf-16') as much_ado:
    much_ado_raw = much_ado.read()
with open(r"./Romeo and Juliet.txt", encoding='utf-16') as romeo:
    romeo_raw = romeo.read()

In [3]:
# Utility function to clean text.
def text_cleaner(text: str) -> str:
    """Remove double dashes, [bracketed] and <angled> markup, and extra whitespace."""
    text = re.sub(r'--', ' ', text)
    text = re.sub(r'\[.*?\]', '', text)  # raw strings avoid invalid-escape warnings
    text = re.sub(r'<.*?>', '', text)
    text = ' '.join(text.split())
    return text
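
To make the effect of the cleaner concrete, here is a small self-contained check. The function `clean_sample` repeats the same three substitutions under a hypothetical name, and the `sample` string is invented for illustration; it is not taken from the play files.

```python
import re


def clean_sample(text: str) -> str:
    """Apply the same steps as text_cleaner to a single string."""
    text = re.sub(r'--', ' ', text)      # double dashes become spaces
    text = re.sub(r'\[.*?\]', '', text)  # drop [stage directions]
    text = re.sub(r'<.*?>', '', text)    # drop <markup tags>
    return ' '.join(text.split())        # collapse runs of whitespace


# Invented sample with a tag, a stage direction, a double dash and a newline.
sample = "<ACT 1>\n[Enter LEONATO]  I learn in this letter--that Don Pedro\ncomes this night."
cleaned = clean_sample(sample)
print(cleaned)
```

Note that `.` does not match a newline by default, so a bracketed passage that spans several lines would survive these patterns.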

In [4]:
# Clean the data.
much_ado_clean = text_cleaner(much_ado_raw)
romeo_clean = text_cleaner(romeo_raw)

In [5]:
# Print "Much Ado" cleaned text.
much_ado_clean[:1000]

'I learn in this letter that Don Pedro of Arragon comes this night to Messina. He is very near by this: he was not three leagues off when I left him. How many gentlemen have you lost in this action? But few of any sort, and none of name. A victory is twice itself when the achiever brings home full numbers. I find here that Don Pedro hath bestowed much honour on a young Florentine called Claudio. Much deserved on his part and equally remembered by Don Pedro. He hath borne himself beyond the promise of his age, doing in the figure of a lamb the feats of a lion: he hath indeed better bettered expectation than you must expect of me to tell you how. He hath an uncle here in Messina will be very much glad of it. I have already delivered him letters, and there appears much joy in him; even so much that joy could not show itself modest enough without a badge of bitterness. Did he break out into tears? In great measure. A kind overflow of kindness. There are no faces truer than those that are s

In [6]:
# Print "Romeo" cleaned text.
romeo_clean[:1000]

"Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean. From forth the fatal loins of these two foes A pair of star-cross'd lovers take their life; Whose misadventur'd piteous overthrows Do with their death bury their parents' strife. The fearful passage of their death-mark'd love, And the continuance of their parents' rage, Which, but their children's end, nought could remove, Is now the two hours' traffick of our stage; The which if you with patient ears attend, What here shall miss, our toil shall strive to mend. Gregory, o' my word, we'll not carry coals. No. for then we should be colliers. I mean, an we be in choler, we'll draw. Ay, while you live, draw your neck out o' the collar. I strike quickly, being moved. But thou art not quickly moved to strike. A dog of the house of Montague moves me. To move is to stir, and to be valiant is to stand; therefore, if thou art moved,

# Creating Features

## Bag of Words

In [7]:
# Parse the data. This can take some time.
nlp = spacy.load('en')
much_ado_doc = nlp(much_ado_clean)
romeo_doc = nlp(romeo_clean)

In [8]:
# Group into sentences.
much_ado_sents = [[sent, "Much Ado"] for sent in much_ado_doc.sents]
romeo_sents = [[sent, "Romeo"] for sent in romeo_doc.sents]

# Combine the sentences from the two plays into one data frame.
sentences = pd.DataFrame(much_ado_sents + romeo_sents)
sentences.head()

Unnamed: 0,0,1
0,"(I, learn, in, this, letter, that, Don, Pedro,...",Much Ado
1,"(He, is, very, near, by, this, :, he, was, not...",Much Ado
2,"(How, many, gentlemen, have, you, lost, in, th...",Much Ado
3,"(But, few, of, any, sort, ,, and, none, of, na...",Much Ado
4,"(A, victory, is, twice, itself, when, the, ach...",Much Ado


In [9]:
# Set up bag of words function for each text.
def bag_of_words(text: spacy.tokens.doc.Doc) -> List:
    """Counts the total number of instances of each word in a doc."""
    allwords = [token.lemma_  # extracts base from word, pos (-PRON-)
                for token in text
                if not token.is_punct  # eliminates punctuation
                and not token.is_stop]  # eliminates stop words
    var = Counter(allwords).most_common(500)
    #print(f"Counter(allwords).most_common(500) is {var[0]}")
    #print(f"First element(allwords) is {allwords[0]}")
    #print(f"Type of first element is {type(allwords[0])}")
    return [item[0] for item in var]


def bow_features_common(
        sentences: pd.DataFrame, common_words: Set) -> pd.DataFrame:
    """Count, for each sentence, how often each common lemma occurs.

    Returns one row per sentence with a count column per lemma,
    plus the sentence itself and its play label.
    """
    columns = list(common_words)  # pandas needs a list, not a set.
    df = pd.DataFrame(columns=columns)
    df['text_sentence'] = sentences[0]  # spaCy Span for each sentence.
    df['text_source'] = sentences[1]  # play label.
    df.loc[:, columns] = 0  # zero all count cells.

    for i, sentence in enumerate(df['text_sentence']):
        common_lemmas = [token.lemma_  # extracts base from word, pos (-PRON-)
                         for token in sentence  #
                         if (
                             not token.is_punct  # eliminates punctuation
                             and not token.is_stop  # eliminates stop words
                             and token.lemma_ in common_words
                         )]
        # print(words)
        # break

        for lemma in common_lemmas:
            df.loc[i, lemma] += 1
        if i % 100 == 0:
            print('Processing row {}'.format(i))
    return df  # row of df contains Doc, label, common lemmas

In [10]:
# Set up bags for each play.
much_ado_words = bag_of_words(much_ado_doc)
romeo_words = bag_of_words(romeo_doc)

# Make bag of common words.
common_words = set(much_ado_words + romeo_words)

In [11]:
# Set up features from BoW.
lemma_counts = bow_features_common(sentences, common_words)
lemma_counts.head()

Processing row 0
Processing row 100
Processing row 200
Processing row 300
Processing row 400
Processing row 500
Processing row 600
Processing row 700
Processing row 800
Processing row 900
Processing row 1000
Processing row 1100
Processing row 1200
Processing row 1300
Processing row 1400
Processing row 1500
Processing row 1600
Processing row 1700
Processing row 1800
Processing row 1900
Processing row 2000
Processing row 2100
Processing row 2200
Processing row 2300
Processing row 2400
Processing row 2500
Processing row 2600
Processing row 2700
Processing row 2800
Processing row 2900
Processing row 3000
Processing row 3100
Processing row 3200
Processing row 3300
Processing row 3400
Processing row 3500
Processing row 3600
Processing row 3700
Processing row 3800
Processing row 3900


Unnamed: 0,write,liking,face,shalt,change,fair,conrade,blood,ho,goose,...,yet,desperate,for,head,longer,outward,grow,happy,text_sentence,text_source
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(I, learn, in, this, letter, that, Don, Pedro,...",Much Ado
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(He, is, very, near, by, this, :, he, was, not...",Much Ado
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(How, many, gentlemen, have, you, lost, in, th...",Much Ado
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(But, few, of, any, sort, ,, and, none, of, na...",Much Ado
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,"(A, victory, is, twice, itself, when, the, ach...",Much Ado
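
The row-by-row loop above is slow because every `df.loc[i, lemma] += 1` is a separate pandas lookup. One possible alternative, sketched here under assumptions, is to hand pre-tokenized lemma lists to `CountVectorizer` with a fixed vocabulary. The helper name `bow_features_vectorized` and the toy inputs are hypothetical; in the notebook the lemma lists would first have to be extracted from each spaCy sentence.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer


def bow_features_vectorized(lemma_sentences, labels, vocabulary):
    """Build a count-per-lemma frame from lists of lemma strings.

    lemma_sentences: one list of (already filtered) lemmas per sentence.
    vocabulary: the set of common lemmas to keep as columns.
    """
    vocab = sorted(vocabulary)
    # The input is already tokenized, so the analyzer just passes it through.
    vectorizer = CountVectorizer(vocabulary=vocab, analyzer=lambda tokens: tokens)
    # With a fixed vocabulary, transform() can be called without fitting.
    counts = vectorizer.transform(lemma_sentences)
    frame = pd.DataFrame(counts.toarray(), columns=vocab)
    frame['text_source'] = list(labels)
    return frame


# Toy stand-in data (hypothetical).
toy_sentences = [["love", "lord", "love"], ["night", "romeo"], ["lord", "night", "night"]]
toy_labels = ["Much Ado", "Romeo", "Romeo"]
toy_features = bow_features_vectorized(toy_sentences, toy_labels, {"love", "lord", "night"})
print(toy_features)
```

Lemmas outside the vocabulary (here `romeo`) are simply ignored, which mirrors the `token.lemma_ in common_words` filter in the loop version.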


In [12]:
# Calculate word frequencies.
def word_frequencies(text, include_stop=True):
    """Return a Counter of lowercased word frequencies in a spaCy doc."""
    # Build a list of words, skipping punctuation and,
    # unless include_stop is True, stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text.lower())
            
    # Build and return a Counter object containing word counts.
    return Counter(words)

# The most frequent words:
much_ado_freq = word_frequencies(
    much_ado_doc, include_stop=False).most_common(100)
print('Much Ado:', much_ado_freq)
romeo_freq = word_frequencies(romeo_doc, include_stop=False).most_common(100)
print('Romeo:', romeo_freq)

Much Ado: [('i', 712), ("'s", 170), ('and', 116), ('man', 111), ('good', 94), ('love', 92), ('thou', 86), ('shall', 82), ('come', 78), ('hath', 76), ('thee', 74), ('lord', 71), ('lady', 66), ('god', 66), ('let', 65), ('hero', 63), ('know', 59), ('claudio', 57), ('benedick', 56), ('prince', 54), ('thy', 52), ('but', 51), ('like', 51), ('o', 49), ('if', 49), ("'ll", 48), ('think', 47), ('what', 43), ('you', 43), ('signior', 42), ('hear', 41), ('beatrice', 40), ('tell', 39), ('brother', 38), ('sir', 37), ('the', 37), ('to', 37), ('night', 35), ('no', 35), ('cousin', 34), ('heart', 33), ('that', 33), ('why', 32), ('marry', 32), ('pray', 31), ('daughter', 31), ('speak', 31), ('leonato', 30), ('wit', 29), ('men', 29), ('well', 28), ('yea', 28), ('count', 28), ('in', 27), ('how', 25), ('true', 25), ('doth', 24), ('answer', 24), ('nay', 24), ('my', 23), ('faith', 23), ('is', 23), ('till', 22), ('for', 22), ('morrow', 22), ('he', 21), ('a', 21), ('old', 21), ('hand', 21), ('as', 21), ('she', 21

In [13]:
# Pull out just the text from our frequency lists.
much_ado_common = [pair[0] for pair in much_ado_freq]
romeo_common = [pair[0] for pair in romeo_freq]

# Use sets to find the unique values in each top 100.
print('Unique to Much Ado:', set(much_ado_common) - set(romeo_common))
print('Unique to Romeo:', set(romeo_common) - set(much_ado_common))

Unique to Much Ado: {'grace', 'there', 'truly', 'beatrice', 'brother', 'great', 'daughter', 'answer', 'claudio', 'think', 'matter', 'husband', 'john', 'bid', 'margaret', 'she', 'prince', 'troth', 'signior', 'faith', 'yea', 'he', 'fool', 'morrow', 'leave', 'count', 'leonato', 'don', 'well', 'hero', 'wit', 'fashion', 'said', 'nay', 'cousin', 'gentleman', 'pray', 'benedick', 'we'}
Unique to Romeo: {'light', 'word', 'then', 'paris', 'young', 'comes', 'face', 'stay', 'now', 'bed', 'dead', 'madam', 'art', 'stand', 'time', 'dear', 'with', 'romeo', 'eyes', 'ay', 'juliet', 'gone', 'this', 'friar', 'where', 'or', 'wilt', 'tybalt', 'heaven', 'which', 'nurse', 'hast', 'here', 'house', 'of', 'montague', 'life', 'go', 'find'}


In [14]:
# Utility function to calculate how frequently lemmas appear in the text.
def lemma_frequencies(text, include_stop=True):
    """Return a Counter of lemma frequencies in a spaCy doc."""
    # Build a list of lemmas, skipping punctuation and,
    # unless include_stop is True, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)

    # Build and return a Counter object containing lemma counts.
    return Counter(lemmas)

# Instantiate our list of most common lemmas.
much_ado_lemma_freq = lemma_frequencies(
    much_ado_doc, include_stop=False).most_common(500)
romeo_lemma_freq = lemma_frequencies(
    romeo_doc, include_stop=False).most_common(500)
print('\nMuch Ado:', much_ado_lemma_freq)
print('Romeo:', romeo_lemma_freq)

# Again, identify the lemmas common to one text but not the other.
much_ado_lemma_common = [pair[0] for pair in much_ado_lemma_freq]
romeo_lemma_common = [pair[0] for pair in romeo_lemma_freq]

print('Unique to Much Ado:', set(much_ado_lemma_common) -
      set(romeo_lemma_common))
print('Unique to Romeo:', set(romeo_lemma_common) -
      set(much_ado_lemma_common))


Much Ado: [('-PRON-', 877), ('man', 140), ('and', 116), ('love', 113), ("'s", 107), ('good', 105), ('be', 104), ('come', 99), ('thou', 86), ('know', 84), ('shall', 82), ('hath', 76), ('thee', 74), ('lord', 73), ('lady', 69), ('god', 68), ('let', 65), ('hero', 63), ('will', 61), ('think', 61), ('prince', 59), ('tell', 58), ('claudio', 57), ('benedick', 56), ('like', 52), ('thy', 52), ('but', 51), ('hear', 51), ('speak', 51), ('o', 49), ('if', 49), ('what', 43), ('signior', 42), ('beatrice', 40), ('brother', 39), ('marry', 39), ('night', 37), ('sir', 37), ('the', 37), ('to', 37), ('cousin', 36), ('heart', 36), ('no', 35), ('pray', 34), ('that', 33), ('why', 32), ('wit', 31), ('daughter', 31), ('leonato', 30), ('look', 29), ('say', 29), ('go', 28), ('well', 28), ('yea', 28), ('count', 28), ('master', 28), ('in', 27), ('swear', 27), ('die', 27), ('word', 26), ('hand', 26), ('answer', 26), ('how', 25), ('true', 25), ('leave', 24), ('faith', 24), ('doth', 24), ('nay', 24), ('bring', 22), ('

In [15]:
# Let's see how many sentences are in each play.
sents_much_ado = list(much_ado_doc.sents)
sents_romeo = list(romeo_doc.sents)

print("Much Ado About Nothing has {} sentences.".format(len(sents_much_ado)))
print("Romeo and Juliet has {} sentences.".format(len(sents_romeo)))

Much Ado About Nothing has 1685 sentences.
Romeo and Juliet has 2230 sentences.
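
The two classes are not balanced, so any accuracy score is best read against a majority-class baseline. A quick check using the sentence counts printed above (the variable names are introduced here for illustration only):

```python
# Sentence counts as reported by the cell above.
n_much_ado = 1685
n_romeo = 2230

total = n_much_ado + n_romeo
# Accuracy of a classifier that always predicts the larger class.
majority_baseline = max(n_much_ado, n_romeo) / total
print(f"Total sentences: {total}")
print(f"Majority-class baseline accuracy: {majority_baseline:.3f}")
```

A model that always answers "Romeo" would already be right about 57% of the time, so gains should be measured from that floor rather than from 50%.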


In [16]:
'''# Copy lemma_counts data frame as a form of version control.
lemma_counts2 = lemma_counts.copy()

# Add a column for the lemma counts in each sentence to the data frame.
lemma_counts2['sent_length'] = lemma_counts2.text_sentence.map(
    lambda x: len(x))

# Let's create a count for parts of speech.
# Adverbs in each sentence.
sentences = lemma_counts2.text_sentence
adv_count = []
for sent in sentences:
    advs = 0
    for token in sent:
        if token.pos_ == 'ADV':
            advs += 1
    adv_count.append(advs)

# Add adverbs column to data frame.
lemma_counts2['adv_count'] = adv_count

# Verbs in each sentence.
verb_count = []
for sent in sentences:
    verb = 0
    for token in sent:
        if token.pos_ == 'VERB':
            verb += 1
    verb_count.append(verb)

# Add verbs column to data frame.
lemma_counts2['verb_count'] = verb_count

# Nouns in each sentence:
noun_count = []
for sent in sentences:
    noun = 0
    for token in sent:
        if token.pos_ == 'NOUN':
            noun += 1
    noun_count.append(noun)

# Add nouns column to data frame.
lemma_counts2['noun_count'] = noun_count

# Punctuation marks in each sentence.
punct_count = []
for sent in sentences:
    punct = 0
    for token in sent:
        if token.pos_ == 'PUNCT':
            punct += 1
    punct_count.append(punct)

# Add punctuation column to data frame.
lemma_counts2['punct_count'] = punct_count

lemma_counts2.head()'''

"# Copy lemma_counts data frame as a form of version control.\nlemma_counts2 = word_counts\n\n# Add a column for the lemma counts in each sentence to the data frame.\nlemma_counts2['sent_length'] = lemma_counts2.text_sentence.map(\n    lambda x: len(x))\n\n# Let's create a count for parts of speech.\n# Adverbs in each sentence.\nsentences = lemma_counts2.text_sentence\nadv_count = []\nfor sent in sentences:\n    advs = 0\n    for token in sent:\n        if token.pos_ == 'ADV':\n            advs += 1\n    adv_count.append(advs)\n\n# Add adverbs column to data frame.\nlemma_counts2['adv_count'] = adv_count\n\n# Verbs in each sentence.\nverb_count = []\nfor sent in sentences:\n    verb = 0\n    for token in sent:\n        if token.pos_ == 'VERB':\n            verb += 1\n    verb_count.append(verb)\n\n# Add verbs column to data frame.\nlemma_counts2['verb_count'] = verb_count\n\n# Nouns in each sentence:\nnoun_count = []\nfor sent in sentences:\n    noun = 0\n    for token in sent:\n      

## Tfidf

In [17]:
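# Work from the cleaned strings rather than the spaCy docs.
much_ado_tfidf = much_ado_clean
romeo_tfidf = romeo_clean

# TfidfVectorizer takes its options at construction; the text itself is
# passed later to fit_transform(), not to the constructor.
tfidf_vectorizer = TfidfVectorizer()
type(tfidf_vectorizer)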
#df_much_ado = pd.DataFrame(much_ado_tfidf)
# Group into sentences.
#much_ado_tfidf_sents = [[sent, "Much Ado"] for sent in much_ado_tfidf.split()]
#romeo_tfidf_sents = [[sent, "Romeo"] for sent in romeo_tfidf.split()]

# Combine sentences from two plays into one data frame.
#sentences_tfidf = pd.DataFrame(much_ado_tfidf_sents + romeo_tfidf_sents)
#sentences_tfidf.head()

sklearn.feature_extraction.text.TfidfVectorizer

# Creating training/testing splits for both data sets.

In [18]:
# Splitting to train/test for Bag of Words.
Y_bow = lemma_counts['text_source']
X_bow = lemma_counts.drop(['text_sentence', 'text_source'], axis=1).values

# Create train, test sets for model.
X_bow_train, X_bow_test, y_bow_train, y_bow_test = train_test_split(
    X_bow,  # ndarray
    Y_bow,  # Series
    test_size=0.25,
    random_state=15
)

In [19]:
# Splitting to train/test for Tfidf.
Y_tfidf = sentences_tfidf['text_source']
X_tfidf = sentences_tfidf.drop(['text_sentence', 'text_source'], axis=1).values

# Create train, test sets for model.
X_tfidf_train, X_tfidf_test, y_tfidf_train, y_tfidf_test = train_test_split(
    X_tfidf,  # ndarray
    Y_tfidf,  # Series
    test_size=0.25,
    random_state=15
)

NameError: name 'sentences_tfidf' is not defined

In [None]:
sentences_tfidf.head()

In [None]:
# Create a function to fit and show our predictive models.
def fit_and_predict(model, X_train, y_train, X_test, y_test):
    """Fit the model, then print train/test accuracy and test-set diagnostics."""
    model.fit(X_train, y_train)
    print(f'The {model} model scored {model.score(X_train, y_train)} on train.')
    print(f'The {model} model scored {model.score(X_test, y_test)} on test.')
    y_pred = model.predict(X_test)  # predict from features, not from labels
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

In [None]:
# Instantiate Decision Tree Classifier.
dtc = DecisionTreeClassifier()

# Fit the model, then print training and test scores,
# the classification report, and the confusion matrix.
fit_and_predict(dtc, X_bow_train, y_bow_train, X_bow_test, y_bow_test)

In [None]:
# Instantiate Random Forest Classifier.
rfc = RandomForestClassifier()

# Fit model.
rfc.fit(X_bow_train, y_bow_train)

# Print training and test scores.
print('Training set score: ', rfc.score(X_bow_train, y_bow_train))
print('\nTest set score:', rfc.score(X_bow_test, y_bow_test))

In [None]:
# Instantiate Logistic Regression Classifier.
lr = LogisticRegression()
# Fit model.
lr.fit(X_bow_train, y_bow_train)

# Print shape, training and test scores.
print(X_bow_train.shape, y_bow_train.shape)
print('Training set score:', lr.score(X_bow_train, y_bow_train))
print('\nTest set score:', lr.score(X_bow_test, y_bow_test))

In [None]:
# Instantiate Gradient Boosting Classifier.
clf = GradientBoostingClassifier()
# Fit model.
clf.fit(X_bow_train, y_bow_train)

# Print shape, training and test scores.
print(X_bow_train.shape, y_bow_train.shape)
print('Training set score:', clf.score(X_bow_train, y_bow_train))
print('\nTest set score:', clf.score(X_bow_test, y_bow_test))

In [None]:
# Instantiate Support Vector model.
svc = LinearSVC()
# Fit model.
svc.fit(X_bow_train, y_bow_train)

# Print training and test set scores.
print('Training set score:', svc.score(X_bow_train, y_bow_train))
print('\nTest set score:', svc.score(X_bow_test, y_bow_test))

In [None]:
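# Let's go back and re-try SVM with the new features.
# Note: lemma_counts2 comes from the part-of-speech cell (In [16]),
# which is currently wrapped in a string and so does not run.
Y = lemma_counts2['text_source']
X = np.array(lemma_counts2.drop(['text_sentence', 'text_source'], axis=1))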

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=15
                                                   )
svm2 = LinearSVC()
train = svm2.fit(X_train, y_train)
print('Training set score: ', svm2.score(X_train, y_train))
print('\nTest set score: ', svm2.score(X_test, y_test))

In [None]:
# Let's see this in crosstab.
svm2_predicted = svm2.predict(X_test)
pd.crosstab(y_test, svm2_predicted)

In [None]:
# Let's try the Random Forest again with the new features.
rfc2 = RandomForestClassifier()
Y = lemma_counts2['text_source']
X = np.array(lemma_counts2.drop(['text_sentence', 'text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.4,
                                                    random_state=15
                                                   )

train = rfc2.fit(X_train, y_train)

print('Training set score: ', rfc2.score(X_train, y_train))
print('\nTest set score:', rfc2.score(X_test, y_test))

In [None]:
# Let's set up our model again.
Y = lemma_counts2['text_source']
X = np.array(lemma_counts2.drop(['text_sentence', 'text_source'], axis=1))

X_train, X_test, y_train, y_test = train_test_split(X,
                                                    Y,
                                                    test_size=0.25,
                                                    random_state=15
                                                   )

lr = LogisticRegression(solver='lbfgs', max_iter=5000)
train = lr.fit(X_train, y_train)
print(X_train.shape, y_train.shape)
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

After trying several different approaches and consulting multiple mentors in Q&A sessions, I was unable to improve on the results of these models. Some of the models do improve as we go on, but none comes close to the requested improvement of 5 percentage points.
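
One avenue not tried above is a cross-validated hyperparameter search over both the feature extraction and the classifier. The sketch below is an assumption-laden illustration, not a result: the helper `tune_tfidf_svc`, the parameter grid, and the toy sentences are all hypothetical, and there is no guarantee that such a search reaches the 5-point target on the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def tune_tfidf_svc(texts, labels, cv=3):
    """Grid-search tf-idf and LinearSVC settings with cross-validation."""
    pipeline = Pipeline([
        ('tfidf', TfidfVectorizer()),
        ('svc', LinearSVC()),
    ])
    # Hypothetical grid: unigrams vs. bigrams, raw vs. log-scaled term
    # frequency, and three regularization strengths.
    param_grid = {
        'tfidf__ngram_range': [(1, 1), (1, 2)],
        'tfidf__sublinear_tf': [False, True],
        'svc__C': [0.1, 1.0, 10.0],
    }
    search = GridSearchCV(pipeline, param_grid, cv=cv, scoring='accuracy')
    search.fit(texts, labels)
    return search


# Toy stand-in sentences (hypothetical), six per play.
toy_texts = [
    "the prince loves the lady", "the count and the prince jest",
    "a merry war of wit", "the lady mocks the count",
    "the prince woos for the count", "wit and jest at the feast",
    "the nurse calls for juliet", "romeo hides in the night",
    "the friar weds romeo and juliet", "the night hides the lovers",
    "juliet wakes in the tomb", "the nurse and the friar grieve",
]
toy_labels = ["Much Ado"] * 6 + ["Romeo"] * 6

search = tune_tfidf_svc(toy_texts, toy_labels)
print("Best parameters:", search.best_params_)
print(f"Best mean CV accuracy: {search.best_score_:.2f}")
```

Comparing `search.best_score_` with the cross-validated score of the untuned pipeline would show whether tuning alone moves accuracy by the required margin.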