# Problem 1

Converting words or sentences into numeric vectors is fundamental when working with text data. To make sure that you have a solid handle on how these vectors work, generate the TF-IDF vectors for the last three sentences of the example from the beginning of this checkpoint (from the BoW revisited: TF-IDF section).

**Sentence 4:** "The Lumberjack Song is the funniest Monty Python bit; I can't think of it without laughing."

**Sentence 5:** "I would rather put strawberries on my ice cream for dessert; they have the best taste."

**Sentence 6:** "The taste of caramel is a fantastic accompaniment to tasty mint ice cream."

The table with each term's *document frequency* and *inverse document frequency* is shown below.

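Here, N is the total number of documents (6 sentences), and each term's inverse document frequency is idf = log2(N / df) = log2(6 / df), where df is the term's document frequency.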

| Term | Document Frequency | Inverse Document Frequency |
|---|---|---|
| Monty | 2 | 1.585 | 
| Python | 3 | 1 |
| sketch | 2 | 1.585 |
| laugh | 3 | 1 |
| funny | 2 | 1.585 |
| best | 4 | 0.585 |
| ice cream | 3 | 1 |
| dessert | 2 | 1.585 |
| taste | 3 | 1 |
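
The IDF column can be reproduced from the document frequencies with a few lines of Python. This is only a sketch: the variable names are illustrative, and the document frequencies are copied by hand from the table above.

```python
import math

N_DOCS = 6  # total number of sentences (documents)

# Document frequencies copied from the table above
doc_freq = {
    "Monty": 2, "Python": 3, "sketch": 2, "laugh": 3, "funny": 2,
    "best": 4, "ice cream": 3, "dessert": 2, "taste": 3,
}

# idf = log2(N / df), rounded to three decimals to match the table
idf = {term: round(math.log2(N_DOCS / df), 3) for term, df in doc_freq.items()}

for term, value in idf.items():
    print(f"{term:<10} {value}")
```

Terms that appear in fewer sentences (df = 2) receive the highest weight, while "best", which appears in four of the six sentences, receives the lowest.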

The first table below shows each term's term frequency in sentences 4 through 6; the second shows the resulting TF-IDF values.


**Term Frequency Matrix for Final 3 Sentences**

| Term | Sentence 4 | Sentence 5 | Sentence 6 |
|---|---|---|---|
| Monty | 1 | 0 | 0 |
| Python | 1 | 0 | 0 |
| sketch | 0 | 0 | 0 |
| laugh | 1 | 0 | 0 |
| funny | 1 | 0 | 0 |
| best | 0 | 1 | 0 |
| ice cream | 0 | 1 | 1 |
| dessert | 0 | 1 | 0 |
| taste | 0 | 1 | 2 |

Each term frequency is then multiplied by the term's IDF to give that term's TF-IDF within each sentence, which produces the following TF-IDF matrix.


**TF-IDF Matrix for Final 3 Sentences**

| Term | Sentence 4 | Sentence 5 | Sentence 6 |
|---|---|---|---|
| Monty | 1.585 | 0 | 0 |
| Python | 1 | 0 | 0 |
| sketch | 0 | 0 | 0 |
| laugh | 1 | 0 | 0 |
| funny | 1.585 | 0 | 0 |
| best | 0 | 0.585 | 0 |
| ice cream | 0 | 1 | 1 |
| dessert | 0 | 1.585 | 0 |
| taste | 0 | 1 | 2 |
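
The multiplication step can be checked with a short sketch. The term frequencies and IDF values below are copied by hand from the tables in this problem, and the dictionary names are illustrative only.

```python
# IDF values from the document-frequency table
idf = {
    "Monty": 1.585, "Python": 1.0, "sketch": 1.585, "laugh": 1.0, "funny": 1.585,
    "best": 0.585, "ice cream": 1.0, "dessert": 1.585, "taste": 1.0,
}

# Term frequencies for sentences 4, 5, and 6 (in that order)
tf = {
    "Monty": (1, 0, 0), "Python": (1, 0, 0), "sketch": (0, 0, 0),
    "laugh": (1, 0, 0), "funny": (1, 0, 0), "best": (0, 1, 0),
    "ice cream": (0, 1, 1), "dessert": (0, 1, 0), "taste": (0, 1, 2),
}

# TF-IDF = term frequency * inverse document frequency, per sentence
tfidf = {
    term: tuple(round(count * idf[term], 3) for count in counts)
    for term, counts in tf.items()
}

for term, row in tfidf.items():
    print(f"{term:<10} {row}")
```

Each sentence's TF-IDF vector is one column of this result. For example, sentence 6 is non-zero only for "ice cream" and "taste".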

# Problem 2

In the 2-grams example above, you only used 2-grams as your features. This time, use both 1-grams and 2-grams together as your feature set. Run the same models as in the example and compare the results.

In [22]:
# reuse the data-preparation steps from the checkpoint notebook here,
# then apply the combined 1-gram and 2-gram vectorizer in a later cell

# import and gutenberg download
import numpy as np
import pandas as pd
import sklearn
import spacy
import re
from nltk.corpus import gutenberg
import nltk

nltk.download('gutenberg')

# text-cleaning function for the 2 texts from gutenberg
def text_cleaner(text):
    text = re.sub(r'--', ' ', text)  # replace double dashes with a space
    text = re.sub(r"\[.*?\]", "", text)  # remove bracketed text
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)  # remove standalone numbers
    text = ' '.join(text.split())  # collapse runs of whitespace and trim the ends
    return text

# load text files
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# get rid of chapter indicators
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)
    
# apply the text-cleaning function
alice = text_cleaner(alice)
persuasion = text_cleaner(persuasion)

# parse the text files with spacy (tokenization)
nlp = spacy.load('en_core_web_sm')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

# group into sentences
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# Combine the sentences from the two novels into one DataFrame
# DataFrame is made by combining two lists of lists,
# where each list position contains a list with 2 items, the sentences and the author
sentences_df = pd.DataFrame(alice_sents + persuasion_sents, columns = ["text", "author"])

# get rid of stop words and punctuation, lemmatize
for i, sentence in enumerate(sentences_df["text"]):
    sentences_df.loc[i, "text"] = " ".join(
        [token.lemma_ for token in sentence if not token.is_punct and not token.is_stop])

# check it out
sentences_df

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Package gutenberg is already up-to-date!
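
To see what each substitution in the cleaning function does, here is a small standalone sketch that applies the same regular expressions to a made-up string. The sample text is an assumption for illustration and does not come from either novel.

```python
import re

def text_cleaner(text):
    text = re.sub(r'--', ' ', text)  # replace double dashes with a space
    text = re.sub(r"\[.*?\]", "", text)  # remove bracketed text
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)  # remove standalone numbers
    text = ' '.join(text.split())  # collapse runs of whitespace and trim the ends
    return text

# Hypothetical sample containing a bracketed note, a double dash,
# a doubled space, and a number
sample = "[Illustration] Alice waited--and waited  for 10 minutes."
cleaned = text_cleaner(sample)
print(cleaned)  # Alice waited and waited for minutes.
```

Note that the bracket pattern only removes bracketed text that sits on a single line, because `.` does not match newlines by default.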


In [26]:
# in this cell we load the vectorizer using BOTH 1 and 2-grams
# and make the final DataFrame used for the models
from sklearn.feature_extraction.text import TfidfVectorizer

# create vectorizer
vectorizer = TfidfVectorizer(
    max_df=0.5, min_df=2, use_idf=True, norm='l2', smooth_idf=True, ngram_range=(1, 2)
)

# apply the vectorizer
X = vectorizer.fit_transform(sentences_df["text"])

tfidf_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences_df = pd.concat([tfidf_df, sentences_df[["text", "author"]]], axis=1)

# check out the model DataFrames with both 1 and 2-grams
sentences_df

Unnamed: 0,abide,ability,able,able bear,able persuade,abominate,abroad,absence,absence home,absent,absolute,absolute necessity,absolutely,absolutely hopeless,absurd,abuse,accept,acceptable,acceptance,accession,accident,accident lyme,accidentally,accidentally hear,accommodate,accommodation,accommodation man,accompany,accomplish,accomplishment,accord,accordingly,account,account louisa,account small,accuse,acknowledge,acknowledgement,acquaint,acquaint captain,...,wrong,wrought,yard,yarmouth,yawn,ye,year,year ago,year anne,year go,year half,year monkford,year old,year pass,year school,year year,yer,yer honour,yes,yes mr,yes say,yes yes,yesterday,yield,young,young child,young fellow,young friend,young lady,young man,young people,young person,young sister,young woman,youth,youth say,zeal,zealous,text,author
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Alice begin tired sit sister bank have twice p...,Carroll
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,consider mind hot day feel sleepy stupid pleas...,Carroll
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,remarkable Alice think way hear Rabbit oh dear,Carroll
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,oh dear,Carroll
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,shall late,Carroll
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5843,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,spring felicity glow spirit friend Anne warmth...,Austen
5844,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,Anne tenderness worth Captain Wentworth affection,Austen
5845,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,profession friend wish tenderness dread future...,Austen
5846,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,glory sailor wife pay tax quick alarm belong p...,Austen
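
The column names in this output mix single words ("able") with word pairs ("able bear"), which is the effect of `ngram_range=(1, 2)`. The plain-Python sketch below approximates how those features are formed for one sentence. The helper `word_ngrams` is hypothetical: it only lowercases and splits on whitespace, whereas `TfidfVectorizer` also applies its own token pattern and the `min_df`/`max_df` filters.

```python
def word_ngrams(text, ngram_range=(1, 2)):
    """Return all word n-grams of the requested sizes, shortest first."""
    tokens = text.lower().split()
    smallest, largest = ngram_range
    grams = []
    for n in range(smallest, largest + 1):
        # slide a window of n tokens across the sentence
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

# One of the lemmatized sentences shown in the output above
print(word_ngrams("shall late"))
# 2-grams only, as in the earlier example from the checkpoint
print(word_ngrams("Anne tenderness worth", ngram_range=(2, 2)))
```

Because every 1-gram is kept alongside the 2-grams, the combined feature set is a superset of both individual feature sets before frequency filtering is applied.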


In [27]:
# imports for models
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences_df.author
X = np.array(sentences_df.drop(columns=["text", "author"]))

# split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=70)

# models themselves
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("LR SCORES")
print("Training set: ", lr.score(X_train, y_train))
print("Test set: ", lr.score(X_test, y_test))
print("\n")

print("RFC SCORES")
print("Training set: ", rfc.score(X_train, y_train))
print("Test set: ", rfc.score(X_test, y_test))
print("\n")

print("GBC SCORES")
print("Training set: ", gbc.score(X_train, y_train))
print("Test set: ", gbc.score(X_test, y_test))

LR SCORES
Training set:  0.9042189281641961
Test set:  0.8555555555555555


RFC SCORES
Training set:  0.9640820980615735
Test set:  0.8474358974358974


GBC SCORES
Training set:  0.822690992018244
Test set:  0.805982905982906


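Using both 1-grams and 2-grams gave test accuracies of about 0.856 (logistic regression), 0.847 (random forest), and 0.806 (gradient boosting). This is at least as good as the 1-gram model and better than the 2-gram model from the checkpoint notebook. The random forest shows the largest gap between training accuracy (0.964) and test accuracy, which suggests it overfits the most of the three.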
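
The gap between training and test accuracy is a rough indicator of overfitting. As a small sketch, the scores printed above can be compared directly. The dictionary simply copies those printed values, and the variable names are illustrative.

```python
# (training accuracy, test accuracy) copied from the printed scores
scores = {
    "LR": (0.9042189281641961, 0.8555555555555555),
    "RFC": (0.9640820980615735, 0.8474358974358974),
    "GBC": (0.822690992018244, 0.805982905982906),
}

# Train-minus-test gap per model; a larger gap hints at more overfitting
gaps = {name: round(train - test, 3) for name, (train, test) in scores.items()}

# Model with the highest test accuracy
best_test = max(scores, key=lambda name: scores[name][1])

for name, gap in gaps.items():
    print(f"{name}: test={scores[name][1]:.3f}, train-test gap={gap}")
print("Best test accuracy:", best_test)
```

On these numbers, logistic regression has the best test accuracy, while gradient boosting has the smallest gap between its training and test scores.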