# Problem 1

Your task is to increase the performance of the models that you implemented in the bag-of-words example.

## Importing and Text Preprocessing

In [1]:
import numpy as np
import pandas as pd
import sklearn
import re
import spacy
from nltk.corpus import gutenberg
import nltk
import warnings
warnings.filterwarnings("ignore")

# get gutenberg in here
nltk.download('gutenberg')

[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.


True

In [2]:
# bring over the text cleaning function from the checkpoint notebook
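def text_cleaner(text):
    # replace double dashes with a space
    text = re.sub(r'--', ' ', text)
    # remove anything in square brackets (raw string avoids invalid-escape warnings)
    text = re.sub(r"[\[].*?[\]]", "", text)
    # remove standalone numbers
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)
    # collapse runs of whitespace
    text = ' '.join(text.split())
    return text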


# load and clean the data as shown in the checkpoint notebook
persuasion = gutenberg.raw('austen-persuasion.txt')
alice = gutenberg.raw('carroll-alice.txt')

# get rid of the chapter headings
persuasion = re.sub(r'Chapter \d+', '', persuasion)
alice = re.sub(r'CHAPTER .*', '', alice)

# apply the text cleaning function
persuasion = text_cleaner(persuasion)
alice = text_cleaner(alice)
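
As a quick check of what the cleaning function does, here is a minimal sketch that runs it on a made-up sentence. The sample string is hypothetical and not taken from either novel; the function body is the one from the cell above, written with a raw string for the bracket pattern.

```python
import re


def text_cleaner(text):
    text = re.sub(r'--', ' ', text)                # double dashes -> space
    text = re.sub(r"[\[].*?[\]]", "", text)        # drop [bracketed] spans
    text = re.sub(r"(\b|\s+\-?|^\-?)(\d+|\d*\.\d+)\b", " ", text)  # drop numbers
    text = ' '.join(text.split())                  # collapse whitespace
    return text


# Hypothetical sample text, only for illustration.
sample = "[Alice's Adventures 1865] Alice was tired--very   tired. She waited 10 minutes."
cleaned = text_cleaner(sample)
print(cleaned)
# Alice was tired very tired. She waited minutes.
```

The bracketed title, the double dash, the number, and the extra spaces are all removed, while ordinary punctuation is left for spaCy to handle later.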

In [3]:
# parse them using spacy
nlp = spacy.load('en_core_web_sm')
alice_doc = nlp(alice)
persuasion_doc = nlp(persuasion)

In [4]:
# group the text data into sentences, so 1 doc = 1 sent
alice_sents = [[sent, "Carroll"] for sent in alice_doc.sents]
persuasion_sents = [[sent, "Austen"] for sent in persuasion_doc.sents]

# combine the sentences from both novels into one dataframe
sentences_df = pd.DataFrame(alice_sents + persuasion_sents, columns=["text", "author"])
sentences_df.head()

|  | text | author |
|---|---|---|
| 0 | (Alice, was, beginning, to, get, very, tired, ... | Carroll |
| 1 | (So, she, was, considering, in, her, own, mind... | Carroll |
| 2 | (There, was, nothing, so, VERY, remarkable, in... | Carroll |
| 3 | (Oh, dear, !) | Carroll |
| 4 | (I, shall, be, late, !, ') | Carroll |


In [5]:
# get rid of stop words and punctuation, and lemmatize
sentences_df["lemmas"] = sentences_df["text"].apply(
    lambda sentence: " ".join(
        token.lemma_ for token in sentence
        if not token.is_punct and not token.is_stop
    )
)

In [6]:
sentences_df.head()

|  | text | author | lemmas |
|---|---|---|---|
| 0 | (Alice, was, beginning, to, get, very, tired, ... | Carroll | Alice begin tired sit sister bank have twice p... |
| 1 | (So, she, was, considering, in, her, own, mind... | Carroll | consider mind hot day feel sleepy stupid pleas... |
| 2 | (There, was, nothing, so, VERY, remarkable, in... | Carroll | remarkable Alice think way hear Rabbit oh dear |
| 3 | (Oh, dear, !) | Carroll | oh dear |
| 4 | (I, shall, be, late, !, ') | Carroll | shall late |


## Initial Model

|  | LR Score | RF Score | GB Score |
|---|---|---|---|
| Training Set | 0.926 | 0.972 | 0.828 |
| Test Set | 0.864 | 0.826 | 0.813 |

In [7]:
# prepare the initial model dataframe from the checkpoint notebook
# using bag of words
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(sentences_df['lemmas'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences_df = pd.concat([bow_df, sentences_df[["lemmas", "author", "text"]]], axis=1)

In [8]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences_df['author'] # target variable, who is the author?
X = np.array(sentences_df.drop(['text', 'author', 'lemmas'], axis=1)) # only want the b-o-w

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9261687571265679

Test set score: 0.8641025641025641
----------------------Random Forest Scores----------------------
Training set score: 0.9723489167616876

Test set score: 0.8264957264957264
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8275370581527937

Test set score: 0.8128205128205128
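
The fit-and-score block above is repeated for every feature set in this notebook. One way to avoid the repetition is a small helper like the sketch below. The name `fit_and_score` is hypothetical, and `max_iter=1000` and the fixed `random_state` on the tree models are assumptions added here for convergence and reproducibility; the notebook cells use the default settings, so scores from this helper would not match the reported ones exactly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split


def fit_and_score(X, y, test_size=0.4, random_state=123):
    """Fit the three classifiers; return {name: (train_score, test_score)}."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )
    models = {
        # max_iter and random_state are assumptions, not the notebook's settings
        "Logistic Regression": LogisticRegression(max_iter=1000),
        "Random Forest": RandomForestClassifier(random_state=random_state),
        "Gradient Boosting": GradientBoostingClassifier(random_state=random_state),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = (model.score(X_train, y_train), model.score(X_test, y_test))
    return scores


# Synthetic stand-in data, only to show the call; not the notebook's features.
X_demo, y_demo = make_classification(n_samples=200, n_features=20, random_state=0)
demo_scores = fit_and_score(X_demo, y_demo)
for name, (train, test) in demo_scores.items():
    print(f"{name}: train={train:.3f}, test={test:.3f}")
```

With the real data, the call would be `fit_and_score(X, Y)` using the `X` and `Y` built in the cell above.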


## Using Parts of Speech

Both part-of-speech models perform worse than the initial model: the best test score in this section is about 0.70, below every test score of the initial model. The tables below summarize the cell outputs in this section.

**Model with "1-gram" parts of speech**

|  | LR Score | RF Score | GB Score |
|---|---|---|---|
| Training Set | 0.647 | 0.792 | 0.718 |
| Test Set | 0.648 | 0.682 | 0.690 |

**Model with "2-grams" parts of speech**

|  | LR Score | RF Score | GB Score |
|---|---|---|---|
| Training Set | 0.700 | 0.835 | 0.715 |
| Test Set | 0.697 | 0.657 | 0.703 |

For the purposes of this notebook, I want to explore a feature set that counts the instances of each part of speech in a sentence.

In the following two cells, I create a column with the parts of speech used in each sentence, and then apply the bag-of-words method to that column. Note that the models in this section are trained on the part-of-speech counts alone; the word counts from the initial model are dropped.

In [9]:
# get the pos
sentences_df["pos"] = sentences_df["text"].apply(
    lambda sentence: " ".join(
        token.pos_ for token in sentence
        if not token.is_punct and not token.is_stop
    )
)

In [10]:
# apply bag-of-words method to the 'pos' column
vectorizer = CountVectorizer(analyzer='word')
X = vectorizer.fit_transform(sentences_df['pos'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences_df = pd.concat([bow_df, sentences_df[["lemmas", "author", "text", "pos"]]], axis=1)

In [11]:
# check it out
sentences_df

Unnamed: 0,adj,adp,adv,det,intj,noun,num,part,pron,propn,punct,sconj,verb,lemmas,author,text,pos
0,1,0,1,0,0,10,0,0,0,2,0,0,6,Alice begin tired sit sister bank have twice p...,Carroll,"(Alice, was, beginning, to, get, very, tired, ...",PROPN VERB ADJ VERB NOUN NOUN VERB ADV VERB NO...
1,5,0,2,0,0,8,0,0,0,2,0,0,6,consider mind hot day feel sleepy stupid pleas...,Carroll,"(So, she, was, considering, in, her, own, mind...",VERB NOUN ADJ NOUN VERB ADJ ADJ NOUN VERB NOUN...
2,1,0,0,0,2,1,0,0,0,2,0,0,2,remarkable Alice think way hear Rabbit oh dear,Carroll,"(There, was, nothing, so, VERY, remarkable, in...",ADJ PROPN VERB NOUN VERB PROPN INTJ INTJ
3,0,0,0,0,2,0,0,0,0,0,0,0,0,oh dear,Carroll,"(Oh, dear, !)",INTJ INTJ
4,1,0,0,0,0,0,0,0,0,0,0,0,1,shall late,Carroll,"(I, shall, be, late, !, ')",VERB ADJ
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
5843,0,0,0,0,0,7,0,0,0,1,0,0,0,spring felicity glow spirit friend Anne warmth...,Austen,"(Her, spring, of, felicity, was, in, the, glow...",NOUN NOUN NOUN NOUN NOUN PROPN NOUN NOUN
5844,0,1,0,0,0,2,0,0,0,3,0,0,0,Anne tenderness worth Captain Wentworth affection,Austen,"(Anne, was, tenderness, itself, ,, and, she, h...",PROPN ADP NOUN PROPN PROPN NOUN
5845,1,0,0,0,0,5,0,0,0,0,0,0,3,profession friend wish tenderness dread future...,Austen,"(His, profession, was, all, that, could, ever,...",NOUN NOUN VERB VERB NOUN ADJ NOUN VERB NOUN
5846,5,0,0,0,0,7,0,0,0,0,0,0,3,glory sailor wife pay tax quick alarm belong p...,Austen,"(She, gloried, in, being, a, sailor, 's, wife,...",VERB NOUN NOUN VERB NOUN ADJ NOUN VERB NOUN AD...


In [12]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences_df['author'] # target variable, who is the author?
X = np.array(sentences_df.drop(['text', 'author', 'lemmas', 'pos'], axis=1)) # only want the b-o-w

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.6468072976054732

Test set score: 0.6482905982905983
----------------------Random Forest Scores----------------------
Training set score: 0.7916191562143672

Test set score: 0.6816239316239316
----------------------Gradient Boosting Scores----------------------
Training set score: 0.7183580387685291

Test set score: 0.6897435897435897


In [13]:
# what if we tried "2-grams" with the pos?
vectorizer = CountVectorizer(analyzer='word', ngram_range=(2,2))
X = vectorizer.fit_transform(sentences_df['pos'])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences_df = pd.concat([bow_df, sentences_df[["lemmas", "author", "text", "pos"]]], axis=1)

In [14]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.model_selection import train_test_split

Y = sentences_df['author'] # target variable, who is the author?
X = np.array(sentences_df.drop(['text', 'author', 'lemmas', 'pos'], axis=1)) # only want the b-o-w

# Split the dataset into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.6998289623717218

Test set score: 0.6974358974358974
----------------------Random Forest Scores----------------------
Training set score: 0.8349486887115165

Test set score: 0.6568376068376068
----------------------Gradient Boosting Scores----------------------
Training set score: 0.7152223489167617

Test set score: 0.7029914529914529


# Problem 2

In the 2-gram example above, you only used 2-gram as your features. This time, use both 1-gram and 2-gram features together as your feature set. Run the same models as in the example and compare the results.

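Judging by the cell outputs in this notebook, the model with both 1-grams and 2-grams performs best on the test set, though its margin over the 1-gram model is small (for example, 0.866 vs. 0.864 for logistic regression).

**1-gram model**

|  | LR Score | RF Score | GB Score |
|---|---|---|---|
| Training Set | 0.926 | 0.972 | 0.828 |
| Test Set | 0.864 | 0.826 | 0.813 |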

**2-gram model**

|  | LR Score | RF Score | GB Score |
|---|---|---|---|
| Training Set | 0.917 | 0.953 | 0.766 |
| Test Set | 0.783 | 0.798 | 0.762 |

**Both 1 and 2-gram model**

|  | LR Score | RF Score | GB Score |
|---|---|---|---|
| Training Set | 0.944 | 0.972 | 0.829 |
| Test Set | 0.866 | 0.830 | 0.815 |


In [15]:
# we'll use both 1 and 2-grams
vectorizer = CountVectorizer(analyzer='word', ngram_range=(1,2)) # here
X = vectorizer.fit_transform(sentences_df["lemmas"])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
bow_df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names_out())
sentences_df = pd.concat([bow_df, sentences_df[["author"]]], axis=1)
sentences_df.head()

Unnamed: 0,1st,29th,29th september,abbreviation,abbreviation living,abdication,abdication neighbour,abide,abide consequence,abide figure,ability,ability affection,ability awkwardness,ability difficulty,able,able attempt,able avail,able avoid,able bear,able convince,able devise,able eat,able far,able feign,able join,able judge,able leave,able letter,able live,able marry,able persuade,able regard,able remain,able return,able ring,able rise,able set,able shew,able speak,able tell,...,young woman,young young,younker,youth,youth beauty,youth bloom,youth early,youth father,youth fine,youth hardly,youth hope,youth jaw,youth kill,youth learn,youth like,youth mention,youth possibly,youth restore,youth say,youth spring,youth value,youth vigour,youthful,youthful infatuation,zeal,zeal business,zeal common,zeal dwell,zeal sport,zeal think,zealand,zealand australia,zealous,zealous officer,zealous subject,zealously,zealously discharge,zigzag,zigzag go,author
0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Carroll
1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Carroll
2,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Carroll
3,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Carroll
4,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,Carroll


In [16]:
Y = sentences_df['author']
X = np.array(sentences_df.drop(['author'], axis=1))

# We split the dataset into train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.4, random_state=123)

# Models
lr = LogisticRegression()
rfc = RandomForestClassifier()
gbc = GradientBoostingClassifier()

lr.fit(X_train, y_train)
rfc.fit(X_train, y_train)
gbc.fit(X_train, y_train)

print("----------------------Logistic Regression Scores----------------------")
print('Training set score:', lr.score(X_train, y_train))
print('\nTest set score:', lr.score(X_test, y_test))

print("----------------------Random Forest Scores----------------------")
print('Training set score:', rfc.score(X_train, y_train))
print('\nTest set score:', rfc.score(X_test, y_test))

print("----------------------Gradient Boosting Scores----------------------")
print('Training set score:', gbc.score(X_train, y_train))
print('\nTest set score:', gbc.score(X_test, y_test))

----------------------Logistic Regression Scores----------------------
Training set score: 0.9444127708095781

Test set score: 0.8662393162393163
----------------------Random Forest Scores----------------------
Training set score: 0.9723489167616876

Test set score: 0.82991452991453
----------------------Gradient Boosting Scores----------------------
Training set score: 0.8292474344355758

Test set score: 0.8153846153846154
