## Challenge: Build your own NLP model

For this challenge, choose a corpus from nltk or another source that includes categories you can predict, then build an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods, for example BoW vs. tf-idf
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes
4. Assess your models using cross-validation and determine whether one model performed better
5. Pick one of the models and try to increase accuracy by at least 5 percentage points

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.
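
Step 4 asks for cross-validation. As a minimal, library-free sketch of what that involves, the code below splits labeled examples into k folds and scores a majority-class baseline on each held-out fold. The helper names `k_fold_indices` and `majority_baseline_accuracy` and the toy labels are assumptions made for illustration; they are not part of this notebook.

```python
import random
from collections import Counter


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k shuffled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Deal the shuffled indices into k folds of near-equal size.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train_idx, test_idx


def majority_baseline_accuracy(labels, k=5):
    """Per-fold accuracy of always predicting the most common training label."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(labels), k):
        majority = Counter(labels[j] for j in train_idx).most_common(1)[0][0]
        hits = sum(labels[j] == majority for j in test_idx)
        scores.append(hits / len(test_idx))
    return scores


# Toy labels standing in for the play each sentence came from.
toy_labels = ["Hamlet"] * 60 + ["Othello"] * 40
fold_scores = majority_baseline_accuracy(toy_labels, k=5)
print([round(s, 2) for s in fold_scores])
print("mean accuracy:", round(sum(fold_scores) / len(fold_scores), 2))
```

A baseline like this gives the 5-percentage-point target in step 5 something concrete to be measured against: a model that does not beat it has learned little beyond the class frequencies.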

In [1]:
%matplotlib inline
import numpy as np
import pandas as pd
import scipy
import sklearn
import matplotlib.pyplot as plt
import seaborn as sns
import re
import spacy
from nltk.corpus import shakespeare, gutenberg, stopwords

In [2]:
# Utility function to clean text.
def text_cleaner(text):

    # Visual inspection shows spaCy does not recognize the double dash '--'.
    # Better get rid of it now!
    text = re.sub(r'--', ' ', text)

    # Get rid of headings in square brackets.
    text = re.sub(r'\[.*?\]', '', text)

    # Get rid of text in angled brackets (<>).
    text = re.sub(r'<.*?>', '', text)

    # Get rid of chapter titles.
    text = re.sub(r'Chapter \d+', '', text)

    # Get rid of extra whitespace.
    text = ' '.join(text.split())

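    # Truncate so the text stays under spaCy's default limit of
    # 1,000,000 characters per document (nlp.max_length).
    return text[0:900000]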


# Read each play, closing the file once its contents are loaded.
def read_play(path):
    with open(path, encoding='utf-16') as play_file:
        return play_file.read()


much_ado_raw = read_play("Much Ado About Nothing.txt")
midsummer_raw = read_play("A Midsummer-Night's Dream.txt")
twelfth_raw = read_play("Twelfth-Night; or What You Will.txt")
merchant_raw = read_play("The Merchant of Venice.txt")
romeo_raw = read_play("Romeo and Juliet.txt")
hamlet_raw = read_play("Hamlet, Prince of Denmark.txt")
othello_raw = read_play("Othello, the Moor of Venice.txt")

# Clean the data.
much_ado_clean = text_cleaner(much_ado_raw)
midsummer_clean = text_cleaner(midsummer_raw)
twelfth_clean = text_cleaner(twelfth_raw)
merchant_clean = text_cleaner(merchant_raw)
romeo_clean = text_cleaner(romeo_raw)
hamlet_clean = text_cleaner(hamlet_raw)
othello_clean = text_cleaner(othello_raw)
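
To check what `text_cleaner` actually removes, it can help to run the same substitutions on a short made-up string. The sketch below repeats the function's regular expressions (written as raw strings) and leaves out the final truncation. The name `clean_sample` and the sample text are invented for illustration.

```python
import re


def clean_sample(text):
    """Apply the same substitutions as text_cleaner, minus the truncation."""
    text = re.sub(r'--', ' ', text)           # double dashes
    text = re.sub(r'\[.*?\]', '', text)       # [headings in square brackets]
    text = re.sub(r'<.*?>', '', text)         # <text in angled brackets>
    text = re.sub(r'Chapter \d+', '', text)   # chapter titles
    return ' '.join(text.split())             # collapse whitespace


sample = "<B>Enter</B> [Aside] Well--I   am\nglad. Chapter 12"
print(clean_sample(sample))
```

Because `.` does not match a newline by default, bracketed text that spans a line break is left in place by these patterns.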

In [3]:
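# Parse the data. This can take some time.
# The 'en' shortcut works on spaCy 2.x; spaCy 3 and later require a full
# model name such as 'en_core_web_sm'.
nlp = spacy.load('en')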
much_ado_doc = nlp(much_ado_clean)
midsummer_doc = nlp(midsummer_clean)
twelfth_doc = nlp(twelfth_clean)
merchant_doc = nlp(merchant_clean)
romeo_doc = nlp(romeo_clean)
hamlet_doc = nlp(hamlet_clean)
othello_doc = nlp(othello_clean)

In [4]:
print(much_ado_doc[:25750])

I learn in this letter that Don Pedro of Arragon comes this night to Messina. He is very near by this: he was not three leagues off when I left him. How many gentlemen have you lost in this action? But few of any sort, and none of name. A victory is twice itself when the achiever brings home full numbers. I find here that Don Pedro hath bestowed much honour on a young Florentine called Claudio. Much deserved on his part and equally remembered by Don Pedro. He hath borne himself beyond the promise of his age, doing in the figure of a lamb the feats of a lion: he hath indeed better bettered expectation than you must expect of me to tell you how. He hath an uncle here in Messina will be very much glad of it. I have already delivered him letters, and there appears much joy in him; even so much that joy could not show itself modest enough without a badge of bitterness. Did he break out into tears? In great measure. A kind overflow of kindness. There are no faces truer than those that are so

In [5]:
print(midsummer_doc[:20120])


In [6]:
print(twelfth_doc[:24270])

If music be the food of love, play on; Give me excess of it, that, surfeiting, The appetite may sicken, and so die. That strain again! it had a dying fall: O! it came o'er my ear like the sweet sound That breathes upon a bank of violets, Stealing and giving odour. Enough! no more: 'Tis not so sweet now as it was before. O spirit of love! how quick and fresh art thou, That, notwithstanding thy capacity Receiveth as the sea, nought enters there, Of what validity and pitch soe'er, But falls into abatement and low price, Even in a minute: so full of shapes is fancy, That it alone is high fantastical. Will you go hunt, my lord? What, Curio? The hart. Why, so I do, the noblest that I have. O! when mine eyes did see Olivia first, Methought she purg'd the air of pestilence. That instant was I turn'd into a hart, And my desires, like fell and cruel hounds, E'er since pursue me. How now! what news from her? So please my lord, I might not be admitted; But from her handmaid do return this answer:

In [7]:
print(merchant_doc[:25550])



In [8]:
print(romeo_doc[:30245])


In [9]:
print(hamlet_doc[:36950])


In [10]:
print(othello_doc[:32350])

Tush! Never tell me; I take it much unkindly That thou, Iago, who hast had my purse As if the strings were thine, shouldst know of this. 'Sblood, but you will not hear me: If ever I did dream of such a matter, Abhor me. Thou told'st me thou didst hold him in thy hate. Despise me if I do not. Three great ones of the city, In personal suit to make me his lieutenant, Off-capp'd to him; and, by the faith of man. I know my price, I am worth no worse a place; But he, as loving his own pride and purposes, Evades them, with a bombast circumstance Horribly stuff'd with epithets of war; And, in conclusion, Nonsuits my mediators; for, 'Certes,' says he, 'I have already chose my officer.' And what was he? Forsooth, a great arithmetician, One Michael Cassio, a Florentine, A fellow almost damn'd in a fair wife; That never set a squadron in the field, Nor the division of a battle knows More than a spinster; unless the bookish theoric, Wherein the toged consuls can propose As masterly as he: mere prat

In [11]:
# Group into sentences.
much_ado_sents = [[sent, "Much Ado"] for sent in much_ado_doc.sents]
midsummer_sents = [[sent, "Midsummer"] for sent in midsummer_doc.sents]
twelfth_sents = [[sent, "Twelfth"] for sent in twelfth_doc.sents]
merchant_sents = [[sent, "Merchant"] for sent in merchant_doc.sents]
romeo_sents = [[sent, "Romeo"] for sent in romeo_doc.sents]
hamlet_sents = [[sent, "Hamlet"] for sent in hamlet_doc.sents]
othello_sents = [[sent, "Othello"] for sent in othello_doc.sents]

# Combine the sentences from all seven plays into one data frame.
sentences = pd.DataFrame(much_ado_sents +
                         midsummer_sents +
                         twelfth_sents +
                         merchant_sents +
                         romeo_sents +
                         hamlet_sents +
                         othello_sents
                         )
sentences.head()

                                                   0         1
0  (I, learn, in, this, letter, that, Don, Pedro,...  Much Ado
1  (He, is, very, near, by, this, :, he, was, not...  Much Ado
2  (How, many, gentlemen, have, you, lost, in, th...  Much Ado
3  (But, few, of, any, sort, ,, and, none, of, na...  Much Ado
4  (A, victory, is, twice, itself, when, the, ach...  Much Ado


In [12]:
from collections import Counter

# Utility function to calculate how frequently words appear in the text.


def word_frequencies(text, include_stop=True):

    # Build a list of words.
    # Strip out punctuation and, optionally, stop words.
    words = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            words.append(token.text)

    # Build and return a Counter object containing word counts.
    return Counter(words)


# The most frequent words:
much_ado_freq = word_frequencies(much_ado_doc,
                                 include_stop=False).most_common(10)
midsummer_freq = word_frequencies(midsummer_doc,
                                  include_stop=False).most_common(10)
twelfth_freq = word_frequencies(twelfth_doc,
                                include_stop=False).most_common(10)
merchant_freq = word_frequencies(merchant_doc,
                                 include_stop=False).most_common(10)
romeo_freq = word_frequencies(romeo_doc,
                              include_stop=False).most_common(10)
hamlet_freq = word_frequencies(hamlet_doc,
                               include_stop=False).most_common(10)
othello_freq = word_frequencies(othello_doc,
                                include_stop=False).most_common(10)
print('Much_Ado: ', much_ado_freq)
print('Midsummer: ', midsummer_freq)
print('Twelfth: ', twelfth_freq)
print('Merchant: ', merchant_freq)
print('Romeo: ', romeo_freq)
print('Hamlet: ', hamlet_freq)
print('Othello: ', othello_freq)

Much_Ado:  [('I', 712), ("'s", 170), ('And', 116), ('man', 111), ('love', 90), ('good', 78), ('thou', 74), ('thee', 74), ('shall', 72), ('hath', 67)]
Midsummer:  [('I', 461), ('And', 199), ("'s", 126), ('love', 103), ('thou', 98), ('The', 78), ('shall', 65), ('thee', 63), ('But', 61), ('To', 61)]
Twelfth:  [('I', 650), ("'s", 187), ('sir', 117), ('thou', 114), ('thy', 99), ('And', 91), ('thee', 91), ('love', 80), ("'ll", 76), ('What', 72)]
Merchant:  [('I', 693), ('And', 182), ("'s", 142), ('To', 98), ('shall', 98), ('The', 88), ('thou', 82), ('But', 74), ('That', 69), ('Jew', 67)]
Romeo:  [('I', 644), ("'s", 266), ('thou', 238), ('And', 228), ('O', 150), ('thy', 149), ('love', 144), ('thee', 134), ('Romeo', 126), ('shall', 94)]
Hamlet:  [('I', 598), ('And', 268), ("'s", 234), ('lord', 205), ('The', 146), ('That', 134), ('To', 125), ('O', 115), ('shall', 111), ('But', 110)]
Othello:  [('I', 881), ("'s", 217), ('And', 199), ('O', 148), ('thou', 122), ('Cassio', 122), ('That', 114), ('Wh
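
'And', 'The', 'To' and 'But' appear in the top-ten lists above even though stop words were excluded, which suggests the stop-word check treated these capitalized forms differently from their lowercase versions. One possible remedy, sketched here with plain strings rather than spaCy tokens, is to lowercase each word before checking it against the stop list. The tiny `STOP_WORDS` set, the function name `word_counts`, and the sample line are assumptions for illustration only.

```python
from collections import Counter

# Hypothetical, deliberately tiny stop list; a real one would be much longer.
STOP_WORDS = {"and", "the", "to", "but", "of", "a"}


def word_counts(words, include_stop=True):
    """Count words case-insensitively, optionally dropping stop words."""
    counts = Counter()
    for word in words:
        lowered = word.lower()
        if include_stop or lowered not in STOP_WORDS:
            counts[lowered] += 1
    return counts


line = "And the lady and The lord spoke But the lady left".split()
print(word_counts(line, include_stop=False).most_common(3))
```

With spaCy tokens, the same idea would amount to testing a lowercased form of the token against the stop list instead of relying on the token's own stop-word flag.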

In [13]:
# Pull out just the text from our frequency lists.
much_ado_common = [pair[0] for pair in much_ado_freq]
midsummer_common = [pair[0] for pair in midsummer_freq]
twelfth_common = [pair[0] for pair in twelfth_freq]
merchant_common = [pair[0] for pair in merchant_freq]
romeo_common = [pair[0] for pair in romeo_freq]
hamlet_common = [pair[0] for pair in hamlet_freq]
othello_common = [pair[0] for pair in othello_freq]

# Use sets to find the unique values in each top ten.
print('Unique to Much_Ado:', set(much_ado_common) - set(midsummer_common)
      - set(twelfth_common) - set(merchant_common) - set(romeo_common)
      - set(hamlet_common) - set(othello_common))

print('Unique to Midsummer:', set(midsummer_common) - set(much_ado_common)
      - set(twelfth_common) - set(merchant_common) - set(romeo_common)
      - set(hamlet_common) - set(othello_common))

print('Unique to Twelfth:', set(twelfth_common) - set(midsummer_common)
      - set(much_ado_common) - set(merchant_common) - set(romeo_common)
      - set(hamlet_common) - set(othello_common))

print('Unique to Merchant:', set(merchant_common) - set(midsummer_common)
      - set(twelfth_common) - set(much_ado_common) - set(romeo_common)
      - set(hamlet_common) - set(othello_common))

print('Unique to Romeo:', set(romeo_common) - set(midsummer_common)
      - set(twelfth_common) - set(merchant_common) - set(much_ado_common)
      - set(hamlet_common) - set(othello_common))

print('Unique to Hamlet:', set(hamlet_common) - set(midsummer_common)
      - set(twelfth_common) - set(merchant_common) - set(romeo_common)
      - set(much_ado_common) - set(othello_common))

print('Unique to Othello:', set(othello_common) - set(midsummer_common)
      - set(twelfth_common) - set(merchant_common) - set(romeo_common)
      - set(hamlet_common) - set(much_ado_common))

Unique to Much_Ado: {'man', 'hath', 'good'}
Unique to Midsummer: set()
Unique to Twelfth: {'sir', "'ll"}
Unique to Merchant: {'Jew'}
Unique to Romeo: {'Romeo'}
Unique to Hamlet: {'lord'}
Unique to Othello: {'Cassio'}
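
The seven near-identical `print` calls above all compute the same thing: one play's words minus the union of every other play's words. A small helper can express that once. This is a sketch on toy lists; the name `unique_to_each` and the sample data are hypothetical.

```python
def unique_to_each(word_lists):
    """Map each name to the words that appear in no other list."""
    unique = {}
    for name, words in word_lists.items():
        others = set()
        for other_name, other_words in word_lists.items():
            if other_name != name:
                others.update(other_words)
        unique[name] = set(words) - others
    return unique


# Toy top-word lists standing in for much_ado_common, twelfth_common, etc.
top_words = {
    "Much Ado": ["I", "man", "love", "good"],
    "Twelfth": ["I", "sir", "love"],
    "Merchant": ["I", "Jew", "shall"],
}
for play, words in unique_to_each(top_words).items():
    print("Unique to {}: {}".format(play, sorted(words)))
```

Passing a dictionary keyed by play name also makes it harder to drop or repeat a play by accident when the list of plays changes.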


In [14]:
# Utility function to calculate how frequently lemmas appear in the text.
def lemma_frequencies(text, include_stop=True):

    # Build a list of lemmas.
    # Strip out punctuation and, optionally, stop words.
    lemmas = []
    for token in text:
        if not token.is_punct and (not token.is_stop or include_stop):
            lemmas.append(token.lemma_)

    # Build and return a Counter object containing lemma counts.
    return Counter(lemmas)


# Instantiate our list of most common lemmas.
much_ado_lemma_freq = lemma_frequencies(
    much_ado_doc, include_stop=False).most_common()
midsummer_lemma_freq = lemma_frequencies(
    midsummer_doc, include_stop=False).most_common()
twelfth_lemma_freq = lemma_frequencies(
    twelfth_doc, include_stop=False).most_common()
merchant_lemma_freq = lemma_frequencies(
    merchant_doc, include_stop=False).most_common()
romeo_lemma_freq = lemma_frequencies(
    romeo_doc, include_stop=False).most_common()
hamlet_lemma_freq = lemma_frequencies(
    hamlet_doc, include_stop=False).most_common()
othello_lemma_freq = lemma_frequencies(
    othello_doc, include_stop=False).most_common()
print('\nMuch Ado:', much_ado_lemma_freq)
print('\nMidsummer:', midsummer_lemma_freq)
print('\nTwelfth:', twelfth_lemma_freq)
print('\nMerchant:', merchant_lemma_freq)
print('\nRomeo:', romeo_lemma_freq)
print('\nHamlet:', hamlet_lemma_freq)
print('\nOthello:', othello_lemma_freq)

# Pull out just the text from our frequency lists.
much_ado_lemma_common = [pair[0] for pair in much_ado_lemma_freq]
midsummer_lemma_common = [pair[0] for pair in midsummer_lemma_freq]
twelfth_lemma_common = [pair[0] for pair in twelfth_lemma_freq]
merchant_lemma_common = [pair[0] for pair in merchant_lemma_freq]
romeo_lemma_common = [pair[0] for pair in romeo_lemma_freq]
hamlet_lemma_common = [pair[0] for pair in hamlet_lemma_freq]
othello_lemma_common = [pair[0] for pair in othello_lemma_freq]

# Use sets to find the unique values in each play.
print('Unique to Much_Ado:',
      set(much_ado_lemma_common)
      - set(midsummer_lemma_common)
      - set(twelfth_lemma_common)
      - set(merchant_lemma_common)
      - set(romeo_lemma_common)
      - set(hamlet_lemma_common)
      - set(othello_lemma_common))

print('Unique to Midsummer:',
      set(midsummer_lemma_common)
      - set(much_ado_lemma_common)
      - set(twelfth_lemma_common)
      - set(merchant_lemma_common)
      - set(romeo_lemma_common)
      - set(hamlet_lemma_common)
      - set(othello_lemma_common))

print('Unique to Twelfth:',
      set(twelfth_lemma_common)
      - set(midsummer_lemma_common)
      - set(much_ado_lemma_common)
      - set(merchant_lemma_common)
      - set(romeo_lemma_common)
      - set(hamlet_lemma_common)
      - set(othello_lemma_common))

print('Unique to Merchant:',
      set(merchant_lemma_common)
      - set(midsummer_lemma_common)
      - set(twelfth_lemma_common)
      - set(much_ado_lemma_common)
      - set(romeo_lemma_common)
      - set(hamlet_lemma_common)
      - set(othello_lemma_common))

print('Unique to Romeo:',
      set(romeo_lemma_common)
      - set(midsummer_lemma_common)
      - set(twelfth_lemma_common)
      - set(merchant_lemma_common)
      - set(much_ado_lemma_common)
      - set(hamlet_lemma_common)
      - set(othello_lemma_common))

print('Unique to Hamlet:',
      set(hamlet_lemma_common)
      - set(midsummer_lemma_common)
      - set(twelfth_lemma_common)
      - set(merchant_lemma_common)
      - set(romeo_lemma_common)
      - set(much_ado_lemma_common)
      - set(othello_lemma_common))

print('Unique to Othello:',
      set(othello_lemma_common)
      - set(midsummer_lemma_common)
      - set(twelfth_lemma_common)
      - set(merchant_lemma_common)
      - set(romeo_lemma_common)
      - set(hamlet_lemma_common)
      - set(much_ado_lemma_common))


Much Ado: [('-PRON-', 877), ('man', 140), ('and', 116), ('love', 113), ("'s", 107), ('good', 105), ('be', 104), ('come', 99), ('thou', 86), ('know', 84), ('shall', 82), ('hath', 76), ('thee', 74), ('lord', 73), ('lady', 69), ('god', 68), ('let', 65), ('hero', 63), ('will', 61), ('think', 61), ('prince', 59), ('tell', 58), ('claudio', 57), ('benedick', 56), ('like', 52), ('thy', 52), ('but', 51), ('hear', 51), ('speak', 51), ('o', 49), ('if', 49), ('what', 43), ('signior', 42), ('beatrice', 40), ('brother', 39), ('marry', 39), ('night', 37), ('sir', 37), ('the', 37), ('to', 37), ('cousin', 36), ('heart', 36), ('no', 35), ('pray', 34), ('that', 33), ('why', 32), ('wit', 31), ('daughter', 31), ('leonato', 30), ('look', 29), ('say', 29), ('go', 28), ('well', 28), ('yea', 28), ('count', 28), ('master', 28), ('in', 27), ('swear', 27), ('die', 27), ('word', 26), ('hand', 26), ('answer', 26), ('how', 25), ('true', 25), ('leave', 24), ('faith', 24), ('doth', 24), ('nay', 24), ('bring', 22), ('

Unique to Twelfth: {'damask', 'votre', 'dumbness', 'madonna', 'swarth', 'clerestory', 'covetousness', 'roam', 'competitor', 'profanation', 'pension', "cruell'st", 'whence', 'pacify', 'exalt', 'purposely', 'kickchaws', 'induce', 'trice', 'extracting', 'rapi', "sow'd", 'od', 'malignancy', 'me,—', 'pilchard', 'barful', 'odour', 'inhabit', 'day!—', 'unsuitable', 'rudely', 'curio', 'gaskin', 'denial', 'hyperbolical', 'relique', 'sportful', 'obstacle', "knight,'—", 'geck', 'duello', 'prose', "does't", 'twang', "have't", 'mollification', 'dwelt', 'swaggering', 'wavering', 'wittily', 'deceivable', 'handmaid', 'addict', "unclasp'd", 'provident', 'masculine', 'sicken', 'leasing', "garter'd", 'bloodless', 'implacable', 'consideration', 'distractedly', 'not,—', 'quinapalus', 'firenew', 'phrygia', 'hung', 'vouchsafed', 'incredulous', 'sot', 'indignation', 'arion', "usurp'd", 'convent', 'halting', 'tuck', 'augmentation', 'negative', 'nonpareil', 'welkin', 'firago', 'agone', "divulg'd", 'unprizable',

In [15]:
tragedies_lemma_freq = romeo_lemma_freq + \
    hamlet_lemma_freq + othello_lemma_freq
print(tragedies_lemma_freq)

comedies_lemma_freq = much_ado_lemma_freq + \
    midsummer_lemma_freq + twelfth_lemma_freq + merchant_lemma_freq
print(comedies_lemma_freq)

[('-PRON-', 877), ('man', 140), ('and', 116), ('love', 113), ("'s", 107), ('good', 105), ('be', 104), ('come', 99), ('thou', 86), ('know', 84), ('shall', 82), ('hath', 76), ('thee', 74), ('lord', 73), ('lady', 69), ('god', 68), ('let', 65), ('hero', 63), ('will', 61), ('think', 61), ('prince', 59), ('tell', 58), ('claudio', 57), ('benedick', 56), ('like', 52), ('thy', 52), ('but', 51), ('hear', 51), ('speak', 51), ('o', 49), ('if', 49), ('what', 43), ('signior', 42), ('beatrice', 40), ('brother', 39), ('marry', 39), ('night', 37), ('sir', 37), ('the', 37), ('to', 37), ('cousin', 36), ('heart', 36), ('no', 35), ('pray', 34), ('that', 33), ('why', 32), ('wit', 31), ('daughter', 31), ('leonato', 30), ('look', 29), ('say', 29), ('go', 28), ('well', 28), ('yea', 28), ('count', 28), ('master', 28), ('in', 27), ('swear', 27), ('die', 27), ('word', 26), ('hand', 26), ('answer', 26), ('how', 25), ('true', 25), ('leave', 24), ('faith', 24), ('doth', 24), ('nay', 24), ('bring', 22), ('fashion', 2

In [16]:
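# Note: these lists hold (lemma, count) tuples, so the set differences below
# compare pairs. A lemma shared by both genres still appears in the result
# whenever its counts differ.
tragedy_only = set(tragedies_lemma_freq) - set(comedies_lemma_freq)
comedy_only = set(comedies_lemma_freq) - set(tragedies_lemma_freq)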

print("Tragedy only: ", tragedy_only)
print("Comedy only: ", comedy_only)

Comedy only:  {('saw', 1), ('twist', 1), ('holp', 2), ('flatterer', 1), ('claim', 2), ('talk', 12), ('injurious', 1), ('sit', 6), ('drift', 1), ('ursula', 6), ('gentlewoman', 5), ('impose', 1), ('defy', 3), ('making', 1), ('bloody', 2), ('four', 2), ('shall', 72), ('agone', 1), ('peaceably', 1), ('bless', 12), ('harmless', 1), ('law—', 1), ('hope', 12), ('qualm', 1), ('odour', 3), ('want', 1), ('onion', 1), ('live', 12), ('cat', 3), ('wearer', 1), ('just', 1), ('belie', 4), ('pandar', 1), ('slaughter', 1), ("preach'd", 1), ('hue', 2), ('notable', 5), ('noise', 1), ('there', 26), ('cross', 5), ('appetite', 3), ('send', 3), ('friend?—', 1), ('out', 6), ('outswear', 1), ('coward', 6), ('vulgo', 1), ("imprison'd", 1), ('fog', 1), ("infus'd", 1), ('hear', 31), ('disfigure', 2), ('honour', 9), ('amend', 3), ('rude', 1), ('trifle', 2), ('belong', 1), ('stream', 2), ('suspect', 3), ('reveller', 1), ('robin', 7), ('unhospitable', 1), ('freedom', 1), ('judge', 23), ('shouldst', 1), ("deform'd", 
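
Each element in the sets above is a `(lemma, count)` tuple, so a lemma that both genres share is still reported as genre-specific whenever its counts differ. That is probably why everyday words such as `('shall', 72)` and `('hear', 31)` show up under "Comedy only". A sketch of one alternative, assuming the goal is lemmas that never occur in the other genre, is to sum the per-play counters and compare only their keys. All counts below are made up.

```python
from collections import Counter

# Made-up per-play lemma counts standing in for the lemma_frequencies() results.
comedy_plays = [Counter({"love": 113, "shall": 82, "ursula": 6}),
                Counter({"love": 103, "shall": 65, "robin": 7})]
tragedy_plays = [Counter({"love": 144, "shall": 94, "tybalt": 30})]

# Adding Counters merges the plays and sums the counts for shared lemmas.
comedy_counts = sum(comedy_plays, Counter())
tragedy_counts = sum(tragedy_plays, Counter())

# Compare lemmas only, ignoring how often each one occurs.
comedy_only = set(comedy_counts) - set(tragedy_counts)
tragedy_only = set(tragedy_counts) - set(comedy_counts)

print("Comedy only:", sorted(comedy_only))
print("Tragedy only:", sorted(tragedy_only))
print("Combined 'love' count in comedies:", comedy_counts["love"])
```

Keeping the summed `Counter` objects around also preserves the genre-level counts, which the concatenated lists of tuples do not.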

In [17]:
'''# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
sentences = []
for sentence in much_ado_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)
    
print(sentences[20])
print('We have {} sentences and {} tokens.'.format(len(sentences), len(much_ado_clean)))'''

In [18]:
'''# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
for sentence in midsummer_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)
    
print(sentences[20])
print('We have {} sentences and {} tokens.'.format(len(sentences), len(midsummer_clean)))'''

In [19]:
'''# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
for sentence in twelfth_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)
    
print(sentences[20])
print('We have {} sentences and {} tokens.'.format(len(sentences), len(twelfth_clean)))'''

In [20]:
'''# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
for sentence in merchant_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)
    
print(sentences[20])
print('We have {} sentences and {} tokens.'.format(len(sentences), len(merchant_clean)))'''

In [21]:
'''# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
for sentence in romeo_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)
    
print(sentences[20])
print('We have {} sentences and {} tokens.'.format(len(sentences), len(romeo_clean)))'''

In [22]:
'''# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
for sentence in hamlet_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)
    
print(sentences[20])
print('We have {} sentences and {} tokens.'.format(len(sentences), len(hamlet_clean)))'''

In [23]:
'''# Organize the parsed doc into sentences, while filtering out punctuation
# and stop words, and converting words to lower case lemmas.
for sentence in othello_doc.sents:
    sentence = [
        token.lemma_.lower()
        for token in sentence
        if not token.is_stop
        and not token.is_punct
    ]
    sentences.append(sentence)
    
print(sentences[20])
print('We have {} sentences and {} tokens.'.format(len(sentences), len(othello_clean)))'''

In [24]:
# Utility function to create a list of the 2000 most common words.
def bag_of_words(text):

    # Filter out punctuation and stop words.
    allwords = [token.lemma_
                for token in text
                if not token.is_punct
                and not token.is_stop]

    # Return the most common words.
    return [item[0] for item in Counter(allwords).most_common(2000)]

# Creates a data frame with features for each word in our common word set.
# Each value is the count of the times the word appears in each sentence.


def bow_features(sentences, comedy_common_words):

    # Scaffold the data frame and initialize counts to zero.
    df = pd.DataFrame(columns=comedy_common_words)
    df['text_sentence'] = sentences[0]
    df['text_source'] = sentences[1]
    df.loc[:, comedy_common_words] = 0

    # Process each row, counting the occurrence of words in each sentence.
    for i, sentence in enumerate(df['text_sentence']):

        # Convert the sentence to lemmas, then filter out punctuation,
        # stop words, and uncommon words.
        words = [token.lemma_
                 for token in sentence
                 if (
                     not token.is_punct
                     and not token.is_stop
                     and token.lemma_ in comedy_common_words
                 )]

        # Populate the row with word counts.
        for word in words:
            df.loc[i, word] += 1

        # This counter is just to make sure the kernel didn't hang.
        if i % 50 == 0:
            print("Processing row{}".format(i))

    return df


# Set up the bags.
much_ado_words = bag_of_words(much_ado_doc)
midsummer_words = bag_of_words(midsummer_doc)
twelfth_words = bag_of_words(twelfth_doc)
merchant_words = bag_of_words(merchant_doc)

# Combine bags to create a set of unique words.
comedy_common_words = set(
    much_ado_words + midsummer_words + twelfth_words + merchant_words)

In [25]:
# Create our data frame with features. This can take a while to run.
comedy_word_counts = bow_features(sentences, comedy_common_words)
comedy_word_counts.head()

Processing row0
Processing row50
Processing row100
...

   brier  infirmity  owe  nymph  retention  hound  richly  dumbness  oberon  canopy  ...  couldst  alive  greet  place  radiant  wick  chapel  roof                                      text_sentence  text_source
0      0          0    0      0          0      0       0         0       0       0  ...        0      0      0      0        0     0       0     0  (I, learn, in, this, letter, that, Don, Pedro,...     Much Ado
1      0          0    0      0          0      0       0         0       0       0  ...        0      0      0      0        0     0       0     0  (He, is, very, near, by, this, :, he, was, not...     Much Ado
2      0          0    0      0          0      0       0         0       0       0  ...        0      0      0      0        0     0       0     0  (How, many, gentlemen, have, you, lost, in, th...     Much Ado
3      0          0    0      0          0      0       0         0       0       0  ...        0      0      0      0        0     0       0     0  (But, few, of, any, sort, ,, and, none, of, na...     Much Ado
4      0          0    0      0          0      0       0         0       0       0  ...        0      0      0      0        0     0       0     0  (A, victory, is, twice, itself, when, the, ach...     Much Ado
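
Updating the data frame one cell at a time with `df.loc[i, word] += 1` is likely the slow part of `bow_features`. A possible alternative, sketched here with plain lists of lemmas instead of spaCy spans, is to count each sentence with a `Counter` and build all the rows first; the resulting list of dictionaries could then be handed to `pd.DataFrame` in a single step. The function name `bow_rows` and the sample data are hypothetical.

```python
from collections import Counter


def bow_rows(sentences, vocabulary):
    """Return one {word: count} dict per sentence, restricted to the vocabulary."""
    vocab = sorted(vocabulary)  # fixed column order
    rows = []
    for lemmas in sentences:
        counts = Counter(lemma for lemma in lemmas if lemma in vocabulary)
        # Counter returns 0 for words that did not occur in this sentence.
        rows.append({word: counts[word] for word in vocab})
    return rows


vocabulary = {"love", "lord", "night"}
sentences = [
    ["love", "be", "love"],
    ["good", "night", "lord"],
]
for row in bow_rows(sentences, vocabulary):
    print(row)
```

Every row has the same keys in the same order, so the columns line up when the rows are turned into a data frame.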


In [26]:
print(sentences.head())


                                                   0         1
0  (I, learn, in, this, letter, that, Don, Pedro,...  Much Ado
1  (He, is, very, near, by, this, :, he, was, not...  Much Ado
2  (How, many, gentlemen, have, you, lost, in, th...  Much Ado
3  (But, few, of, any, sort, ,, and, none, of, na...  Much Ado
4  (A, victory, is, twice, itself, when, the, ach...  Much Ado


In [5]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

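# NOTE: much_ado_paras (the list of paragraph strings to vectorize) must be
# defined before this cell is run; it is not created anywhere above.
X_train, X_test = train_test_split(
    much_ado_paras, test_size=0.4, random_state=0)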

vectorizer = TfidfVectorizer(
    # Drop words that occur in more than half the paragraphs.
    max_df=0.5,
    # Only use words that appear in at least two paragraphs.
    min_df=2,
    stop_words='english',
    # Convert everything to lower case.
    lowercase=True,
    # We definitely want to use inverse document frequencies in our
    # weighting.
    use_idf=True,
    # Applies a correction factor so that longer and shorter paragraphs
    # get treated equally.
    norm='l2',
    # Adds 1 to all document frequencies, as if an extra document existed
    # that used every word once. Prevents divide-by-zero errors.
    smooth_idf=True
)

# Applying the vectorizer.
much_ado_paras_tfidf = vectorizer.fit_transform(much_ado_paras)
print("Number of features: %d" % much_ado_paras_tfidf.get_shape()[1])

# Splitting into training and test sets.
X_train_tfidf, X_test_tfidf = train_test_split(
    much_ado_paras_tfidf, test_size=0.4, random_state=0)

# Reshapes the vectorizer output into something people can read.
X_train_tfidf_csr = X_train_tfidf.tocsr()

# Number of paragraphs.
n = X_train_tfidf_csr.shape[0]

# A list of dictionaries, one per paragraph.
tfidf_bypara = [{} for _ in range(0, n)]

# List of features. On scikit-learn 1.0 and later, use
# vectorizer.get_feature_names_out() instead.
terms = vectorizer.get_feature_names()

# For each paragraph, lists the feature words and their tf-idf scores.
for i, j in zip(*X_train_tfidf_csr.nonzero()):
    tfidf_bypara[i][terms[j]] = X_train_tfidf_csr[i, j]

# Only nonzero scores are stored, so a word missing from a paragraph's
# dictionary has a tf-idf score of 0 in that paragraph.
print('Original sentence:', X_train[5])
print('Tf_idf vector:', tfidf_bypara[5])

NameError: name 'much_ado_paras' is not defined
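
For reference, the weighting this cell is meant to compute can be worked through by hand. The sketch below follows the smoothed formula documented for scikit-learn's `TfidfVectorizer` with `smooth_idf=True` and `norm='l2'`, namely idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by scaling each row to unit length. It is an illustration on toy data, not a replacement for the vectorizer, and it skips the `max_df`, `min_df`, and stop-word filtering. The function name `tfidf` and the sample documents are assumptions.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    idf = {term: math.log((1 + n_docs) / (1 + df)) + 1
           for term, df in doc_freq.items()}

    weighted = []
    for doc in docs:
        raw = {term: count * idf[term] for term, count in Counter(doc).items()}
        # L2 normalization: scale so the squared weights sum to 1.
        # An empty document has length 0, so fall back to 1.0 to avoid
        # dividing by zero.
        length = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        weighted.append({term: w / length for term, w in raw.items()})
    return weighted


docs = [["love", "lord"], ["love", "night"], ["love", "love", "lady"]]
for vector in tfidf(docs):
    print({term: round(weight, 3) for term, weight in vector.items()})
```

Under this formula a term that occurs in a document always receives a weight above zero, and a term found in every document (here "love") is weighted lower than a rarer term with the same count.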