## Challenge: Build your own NLP model

For this challenge, you will need to choose a corpus of data from nltk or another source that includes categories you can predict and create an analysis pipeline that includes the following steps:

1. Data cleaning / processing / language parsing
2. Create features using two different NLP methods: For example, BoW vs tf-idf
3. Use the features to fit supervised learning models for each feature set to predict the category outcomes
4. Assess your models using cross-validation and determine whether one model performed better
5. Pick one of the models and try to increase accuracy by at least 5 percentage points

Write up your report in a Jupyter notebook. Be sure to explicitly justify the choices you make throughout, and submit it below.

In [1]:
import pandas as pd
import re
import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.naive_bayes import BernoulliNB
from typing import Dict

# Data Cleaning, Processing, and Language Parsing

In [2]:
# Read the raw text of both plays (cleaning happens in the next cells).
with open(r"./Much Ado About Nothing.txt", encoding='utf-16') as much_ado:
    much_ado_raw = much_ado.read()
with open(r"./Romeo and Juliet.txt", encoding='utf-16') as romeo:
    romeo_raw = romeo.read()

In [3]:
# Utility function to clean text.
def text_cleaner(text: str) -> str:
    """Function to strip all characters except letters in words."""
    text = re.sub(r'--', ' ', text)
    text = re.sub(r"[\[].*?[\]]", "", text)
    text = re.sub(r"[\<].*?[\>]", "", text)
    text = ' '.join(text.split())
    return text
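
A quick check of what `text_cleaner` does, as a minimal sketch. The sample string is made up for illustration (it is not from the plays), and the function is repeated so the snippet runs on its own. Note that punctuation and apostrophes are kept.

```python
import re


def text_cleaner(text: str) -> str:
    """Same cleaning steps as the cell above, repeated so this runs alone."""
    text = re.sub(r'--', ' ', text)            # double dashes become spaces
    text = re.sub(r"[\[].*?[\]]", "", text)    # drop [bracketed] text
    text = re.sub(r"[\<].*?[\>]", "", text)    # drop <angle-bracketed> text
    return ' '.join(text.split())              # collapse runs of whitespace


# Hypothetical sample, not taken from the corpus.
sample = "[Enter LEONATO]  I learn--in this <b>letter</b>\n that he comes."
cleaned = text_cleaner(sample)
print(cleaned)
```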

In [4]:
# Clean the data.
much_ado_clean = text_cleaner(much_ado_raw)
romeo_clean = text_cleaner(romeo_raw)

In [5]:
# Print "Much Ado" cleaned text.
much_ado_clean[:1000]

'I learn in this letter that Don Pedro of Arragon comes this night to Messina. He is very near by this: he was not three leagues off when I left him. How many gentlemen have you lost in this action? But few of any sort, and none of name. A victory is twice itself when the achiever brings home full numbers. I find here that Don Pedro hath bestowed much honour on a young Florentine called Claudio. Much deserved on his part and equally remembered by Don Pedro. He hath borne himself beyond the promise of his age, doing in the figure of a lamb the feats of a lion: he hath indeed better bettered expectation than you must expect of me to tell you how. He hath an uncle here in Messina will be very much glad of it. I have already delivered him letters, and there appears much joy in him; even so much that joy could not show itself modest enough without a badge of bitterness. Did he break out into tears? In great measure. A kind overflow of kindness. There are no faces truer than those that are s

In [6]:
# Print "Romeo" cleaned text.
romeo_clean[:1000]

"Two households, both alike in dignity, In fair Verona, where we lay our scene, From ancient grudge break to new mutiny, Where civil blood makes civil hands unclean. From forth the fatal loins of these two foes A pair of star-cross'd lovers take their life; Whose misadventur'd piteous overthrows Do with their death bury their parents' strife. The fearful passage of their death-mark'd love, And the continuance of their parents' rage, Which, but their children's end, nought could remove, Is now the two hours' traffick of our stage; The which if you with patient ears attend, What here shall miss, our toil shall strive to mend. Gregory, o' my word, we'll not carry coals. No. for then we should be colliers. I mean, an we be in choler, we'll draw. Ay, while you live, draw your neck out o' the collar. I strike quickly, being moved. But thou art not quickly moved to strike. A dog of the house of Montague moves me. To move is to stir, and to be valiant is to stand; therefore, if thou art moved,

In [7]:
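# Parse the data. This can take some time.
# Note: the 'en' shortcut only works in spaCy 2.x; newer versions need the
# full model name, e.g. spacy.load('en_core_web_sm').
nlp = spacy.load('en')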
much_ado_doc = nlp(much_ado_clean)
romeo_doc = nlp(romeo_clean)

In [8]:
# Group into sentences using spacy.
much_ado_sents = [(sent.text, 'much_ado') for sent in
                  much_ado_doc.sents if len(sent) > 2]
romeo_sents = [(sent.text, 'romeo') for sent in
               romeo_doc.sents if len(sent) > 2]

# Convert list of sents to a df and add a pd.Series with the title.
much_ado_sent_df = pd.DataFrame(much_ado_sents)
romeo_sent_df = pd.DataFrame(romeo_sents)
clean_sents = pd.concat([much_ado_sent_df, romeo_sent_df])
assert (len(much_ado_sent_df) + len(romeo_sent_df)) == len(clean_sents)

# Rename columns.
clean_sents.columns = ['sentence', 'play_title']

In [9]:
# Check the count of sents per title.
clean_sents.play_title.value_counts()

romeo       2076
much_ado    1607
Name: play_title, dtype: int64

# Creating Features
## Bag of Words

In [10]:
# Splitting the data
X = clean_sents.sentence
y = clean_sents.play_title
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=15)

In [11]:
# Create base parameters dictionary.
base_param_dict = {'strip_accents': 'unicode',
                   'lowercase': True,
                   'stop_words': 'english'}

In [12]:
# Instantiate CountVectorizer.
bow = CountVectorizer(**base_param_dict)

In [13]:
# Convert X_train, X_test into dfs of bags of words.
_bow_train = bow.fit_transform(X_train)
_bow_test = bow.transform(X_test)
assert len(X_train) == _bow_train.shape[0]  # df and sparse-matrix

# Find feature names. (scikit-learn >= 1.0 provides get_feature_names_out();
# get_feature_names() was removed in 1.2.)
feature_names = bow.get_feature_names()

# Set up data frames.
X_train_bow = pd.DataFrame(_bow_train.toarray(), columns=feature_names)
X_test_bow = pd.DataFrame(_bow_test.toarray(), columns=feature_names)

## Tfidf

In [14]:
# Instantiate Tfidf.
tfidf = TfidfVectorizer(**base_param_dict)

# Creating Training/Testing Splits for Both Data Sets

In [15]:
# Convert X_train, X_test into dfs of tfidf values.
_tfidf_train = tfidf.fit_transform(X_train)
_tfidf_test = tfidf.transform(X_test)
assert len(X_train) == _tfidf_train.shape[0]  # df and sparse-matrix

# Find feature names. (scikit-learn >= 1.0 provides get_feature_names_out();
# get_feature_names() was removed in 1.2.)
feature_names_tfidf = tfidf.get_feature_names()

# Set up data frames.
X_train_tfidf = pd.DataFrame(
    _tfidf_train.toarray(), columns=feature_names_tfidf)
X_test_tfidf = pd.DataFrame(
    _tfidf_test.toarray(), columns=feature_names_tfidf)
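
To make the difference between the two feature sets concrete, here is a small sketch on a made-up three-sentence corpus (an assumption for illustration, not the play data). Both vectorizers build the same vocabulary; they differ only in the values: raw counts for BoW, and idf-weighted, L2-normalized values for tf-idf.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Hypothetical three-sentence corpus, for illustration only.
docs = ["love is sweet", "sweet sweet sorrow", "love and sorrow"]

bow_vec = CountVectorizer()
tfidf_vec = TfidfVectorizer()
counts = bow_vec.fit_transform(docs).toarray()
weights = tfidf_vec.fit_transform(docs).toarray()

# Both vectorizers tokenize the same way, so the vocabularies match.
same_vocab = bow_vec.vocabulary_ == tfidf_vec.vocabulary_

# tf-idf rows are L2-normalized by default; raw count rows are not.
row_norms = np.linalg.norm(weights, axis=1)

print(same_vocab)
print(counts)
print(np.round(weights, 2))
print(row_norms)
```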

# Creating Baseline Models

In [16]:
# Create baseline model.
print('Baseline score to beat:', sum(
    clean_sents.play_title == 'romeo') / len(clean_sents))

Baseline score to beat: 0.5636709204452892
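
The same baseline can be reproduced with scikit-learn's `DummyClassifier`, which always predicts the most frequent class. This is a minimal sketch that rebuilds the labels from the class counts printed earlier (2076 and 1607) rather than from the real data frame.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Class counts taken from the value_counts() output above.
y = np.array(['romeo'] * 2076 + ['much_ado'] * 1607)
X = np.zeros((len(y), 1))  # DummyClassifier ignores the features

baseline = DummyClassifier(strategy='most_frequent')
baseline.fit(X, y)
baseline_acc = baseline.score(X, y)
print(baseline_acc)
```

This baseline is computed on the full data set. On the 921-sentence test split used below, always predicting `romeo` gives 516 / 921, or about 0.56 as well.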


In [17]:
# Create pipeline helpers.
scaler = StandardScaler()
skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=15)

In [18]:
# Instantiate models.
log_reg = LogisticRegression(solver='lbfgs', max_iter=1000)
tree = DecisionTreeClassifier()
forest = RandomForestClassifier()
boost = GradientBoostingClassifier()
nb = BernoulliNB()

In [19]:
# Modeling arg dicts.
bow_kwargs = {'X_train': X_train_bow, 'y_train': y_train,
              'X_test': X_test_bow, 'y_test': y_test}
tfidf_kwargs = {'X_train': X_train_tfidf, 'y_train': y_train,
                'X_test': X_test_tfidf, 'y_test': y_test}

In [20]:
# Model tuning param_grids.
log_reg_params = {'model__C': [1]}
tree_params = {'model__criterion': ['gini']}
forest_params = {'model__n_estimators': [100]}
boost_params = {'model__n_estimators': [100]}
nb_params = {'model__alpha': [1]}
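
Each grid above holds a single value, so `GridSearchCV` cross-validates one configuration per model rather than searching over alternatives. Below is a sketch of what a multi-valued grid could look like. The synthetic data and the candidate `alpha` values are assumptions chosen purely for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.naive_bayes import BernoulliNB
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the sentence features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=200, n_features=20, random_state=15)

pipe = Pipeline(steps=[('sc', StandardScaler()), ('model', BernoulliNB())])

# Illustrative candidates; these values are not from the notebook.
wide_nb_params = {'model__alpha': [0.01, 0.1, 1.0, 10.0]}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)
search = GridSearchCV(pipe, param_grid=wide_nb_params, cv=cv)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(search.cv_results_['mean_test_score'])
```

With a grid like this, `cv_results_["mean_test_score"]` holds one score per candidate instead of a single-element array.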

In [21]:
# Create a function to fit and show our predictive models.


def fit_and_predict(
        model, params: Dict, X_train: pd.DataFrame, y_train: pd.Series,
        X_test: pd.DataFrame, y_test: pd.Series) -> None:
    """
    Takes an instantiated sklearn model, a param_grid, and train/test data.
    Runs a cross-validated grid search on the training data, prints the mean
    and std of the cross-validation accuracies, then prints a classification
    report and confusion matrix for the held-out test data.
    """
    assert len(X_train) == len(y_train)
    assert len(X_test) == len(y_test)
    pipe = Pipeline(steps=[('sc', scaler), ('model', model)])
    clf = GridSearchCV(pipe, cv=skf, param_grid=params, n_jobs=2)
    clf.fit(X_train, y_train)
    print('The mean cross_val accuracy on train is',
          f'{clf.cv_results_["mean_test_score"]}.')
    print('The std of the cross_val accuracy in ',
          f'{clf.cv_results_["std_test_score"]}.')
    y_pred = clf.predict(X_test)
    print(classification_report(y_test, y_pred))
    print(confusion_matrix(y_test, y_pred))

In [22]:
# Bag of Words with Logistic Regression.
fit_and_predict(log_reg, params=log_reg_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.6647357].
The std of the cross_val accuracy in  [0.01013758].
              precision    recall  f1-score   support

    much_ado       0.64      0.62      0.63       405
       romeo       0.71      0.72      0.71       516

   micro avg       0.68      0.68      0.68       921
   macro avg       0.67      0.67      0.67       921
weighted avg       0.68      0.68      0.68       921

[[252 153]
 [144 372]]
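
As a worked check of how the report relates to the confusion matrix, the numbers above can be recomputed by hand. This sketch assumes scikit-learn's usual layout (rows are true labels, columns are predictions, labels in sorted order), which matches the support of 405 for `much_ado` in the first row.

```python
import numpy as np

# Confusion matrix printed above: rows are true labels, columns are
# predictions, in the order [much_ado, romeo].
cm = np.array([[252, 153],
               [144, 372]])

accuracy = np.trace(cm) / cm.sum()               # (252 + 372) / 921
precision_much_ado = cm[0, 0] / cm[:, 0].sum()   # 252 / (252 + 144)
recall_much_ado = cm[0, 0] / cm[0, :].sum()      # 252 / (252 + 153)

print(round(accuracy, 2), round(precision_much_ado, 2),
      round(recall_much_ado, 2))
```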


In [23]:
# Bag of Words with Decision Tree.
fit_and_predict(tree, params=tree_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.62853005].
The std of the cross_val accuracy in  [0.00796524].
              precision    recall  f1-score   support

    much_ado       0.60      0.61      0.61       405
       romeo       0.69      0.68      0.69       516

   micro avg       0.65      0.65      0.65       921
   macro avg       0.65      0.65      0.65       921
weighted avg       0.65      0.65      0.65       921

[[249 156]
 [164 352]]


In [24]:
# Bag of Words with Random Forest.
fit_and_predict(forest, params=forest_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.67414917].
The std of the cross_val accuracy in  [0.00434468].
              precision    recall  f1-score   support

    much_ado       0.66      0.63      0.65       405
       romeo       0.72      0.74      0.73       516

   micro avg       0.69      0.69      0.69       921
   macro avg       0.69      0.69      0.69       921
weighted avg       0.69      0.69      0.69       921

[[257 148]
 [134 382]]


In [27]:
# Bag of Words with Gradient Boosting.
fit_and_predict(boost, params=boost_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.66582187].
The std of the cross_val accuracy in  [0.00398262].
              precision    recall  f1-score   support

    much_ado       0.78      0.36      0.49       405
       romeo       0.65      0.92      0.76       516

   micro avg       0.68      0.68      0.68       921
   macro avg       0.72      0.64      0.63       921
weighted avg       0.71      0.68      0.64       921

[[146 259]
 [ 40 476]]


In [28]:
# Bag of Words with Naive Bayes.
fit_and_predict(nb, params=nb_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.71035482].
The std of the cross_val accuracy in  [0.01593049].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


In [29]:
# Tfidf with Logistic Regression.
fit_and_predict(log_reg, params=log_reg_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.67233888].
The std of the cross_val accuracy in  [0.0141202].
              precision    recall  f1-score   support

    much_ado       0.63      0.65      0.64       405
       romeo       0.72      0.70      0.71       516

   micro avg       0.68      0.68      0.68       921
   macro avg       0.67      0.67      0.67       921
weighted avg       0.68      0.68      0.68       921

[[262 143]
 [156 360]]


In [30]:
# Tfidf with Decision Tree.
fit_and_predict(tree, params=tree_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.64627082].
The std of the cross_val accuracy in  [0.00036206].
              precision    recall  f1-score   support

    much_ado       0.60      0.60      0.60       405
       romeo       0.69      0.69      0.69       516

   micro avg       0.65      0.65      0.65       921
   macro avg       0.65      0.65      0.65       921
weighted avg       0.65      0.65      0.65       921

[[242 163]
 [158 358]]


In [31]:
# Tfidf with Random Forest.
fit_and_predict(forest, params=forest_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.68283852].
The std of the cross_val accuracy in  [0.00289645].
              precision    recall  f1-score   support

    much_ado       0.70      0.57      0.63       405
       romeo       0.70      0.81      0.75       516

   micro avg       0.70      0.70      0.70       921
   macro avg       0.70      0.69      0.69       921
weighted avg       0.70      0.70      0.70       921

[[229 176]
 [ 97 419]]


In [32]:
# Tfidf with Gradient Boosting.
fit_and_predict(boost, params=boost_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.6622013].
The std of the cross_val accuracy in  [0.00325851].
              precision    recall  f1-score   support

    much_ado       0.78      0.39      0.52       405
       romeo       0.66      0.91      0.76       516

   micro avg       0.68      0.68      0.68       921
   macro avg       0.72      0.65      0.64       921
weighted avg       0.71      0.68      0.66       921

[[157 248]
 [ 44 472]]


In [33]:
# Tfidf with Naive Bayes.
fit_and_predict(nb, params=nb_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.71035482].
The std of the cross_val accuracy in  [0.01593049].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]
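
The Naive Bayes results are identical for BoW and tf-idf. One likely explanation (an assumption, not verified against the play data) is that `BernoulliNB` binarizes its input with `binarize=0.0` by default. After `StandardScaler`, absent words end up below zero and present words usually end up above it, so both feature sets collapse to much the same word-present/word-absent matrix. The sketch below shows this on a made-up corpus in which every word occurs in at most two sentences.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.preprocessing import StandardScaler

# Hypothetical toy corpus; every word occurs in at most two sentences.
docs = [
    "love sweet night",
    "sweet sorrow parting",
    "night moon stars",
    "prince claudio hero",
    "hero wedding church",
    "claudio lady wedding",
]


def scaled_presence(vectorizer):
    """Vectorize, standardize, then threshold at 0 as BernoulliNB would."""
    dense = vectorizer.fit_transform(docs).toarray()
    scaled = StandardScaler().fit_transform(dense)
    return (scaled > 0.0).astype(int)


bow_bits = scaled_presence(CountVectorizer())
tfidf_bits = scaled_presence(TfidfVectorizer())
identical = np.array_equal(bow_bits, tfidf_bits)
print(identical)
```

If this is what happens on the real data, the choice between BoW and tf-idf cannot affect this particular model, and a variant that uses the actual values (for example `MultinomialNB` on unscaled features) would be needed to compare the two feature sets.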


# Pick One Model and Increase Accuracy by 5 Percentage Points

In [34]:
# Try GaussianNB for better results.
from sklearn.naive_bayes import GaussianNB

# Instantiate GaussianNB.
gnb = GaussianNB()

gnb_params = {}

# Bag of Words with Gaussian.
fit_and_predict(gnb, params=gnb_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.60934106].
The std of the cross_val accuracy in  [0.00470673].
              precision    recall  f1-score   support

    much_ado       0.54      0.86      0.66       405
       romeo       0.79      0.42      0.54       516

   micro avg       0.61      0.61      0.61       921
   macro avg       0.66      0.64      0.60       921
weighted avg       0.68      0.61      0.60       921

[[347  58]
 [301 215]]


In [35]:
# Try adjusting KFold splits.
skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=15)

# Bag of Words with Naive Bayes (3 splits).
fit_and_predict(nb, params=nb_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.72411296].
The std of the cross_val accuracy in  [0.00833298].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


In [36]:
# Tfidf with Naive Bayes (3 splits).
fit_and_predict(nb, params=nb_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.72411296].
The std of the cross_val accuracy in  [0.00833298].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


In [37]:
# Try adjusting KFold splits.
skf = StratifiedKFold(n_splits=4, shuffle=True, random_state=15)

# Bag of Words with Naive Bayes (4 splits).
fit_and_predict(nb, params=nb_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.72737147].
The std of the cross_val accuracy in  [0.01456765].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


In [38]:
# Tfidf with Naive Bayes (4 splits).
fit_and_predict(nb, params=nb_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.72737147].
The std of the cross_val accuracy in  [0.01456765].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


In [39]:
# Try adjusting KFold splits.
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=15)

# Bag of Words with Naive Bayes (5 splits).
fit_and_predict(nb, params=nb_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.73968139].
The std of the cross_val accuracy in  [0.01300631].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


In [40]:
# Tfidf with Naive Bayes (5 splits).
fit_and_predict(nb, params=nb_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.73968139].
The std of the cross_val accuracy in  [0.01300631].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


In [41]:
# Try adjusting KFold splits.
skf = StratifiedKFold(n_splits=6, shuffle=True, random_state=15)

# Bag of Words with Naive Bayes (6 splits).
fit_and_predict(nb, params=nb_params, **bow_kwargs)

The mean cross_val accuracy on train is [0.7346126].
The std of the cross_val accuracy in  [0.02212966].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


In [42]:
# Tfidf with Naive Bayes (6 splits).
fit_and_predict(nb, params=nb_params, **tfidf_kwargs)

The mean cross_val accuracy on train is [0.7346126].
The std of the cross_val accuracy in  [0.02212966].
              precision    recall  f1-score   support

    much_ado       0.79      0.57      0.66       405
       romeo       0.72      0.88      0.79       516

   micro avg       0.74      0.74      0.74       921
   macro avg       0.76      0.72      0.73       921
weighted avg       0.75      0.74      0.74       921

[[229 176]
 [ 60 456]]


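Each model was run through GridSearchCV with the parameter values that seemed most reasonable, so the only setting that made a noticeable difference was the number of folds in StratifiedKFold. Raising it from 2 to 5 gave the best improvement in mean cross-validation accuracy for Naive Bayes (from 71.04% to 73.97%, about 2.9 percentage points, which is short of the 5-point target); 6 folds gave 73.46%. The held-out test results were the same for every fold count (accuracy 0.74), so the gain reflects how the cross-validation score is estimated rather than a change in the fitted model. Because each parameter grid contains only one value, GridSearchCV evaluated a single configuration per model; improving the models further would require wider grids or different features.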
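
A minimal sketch of why the fold count moves the cross-validation score but not the test results: with the default `refit=True`, `GridSearchCV` refits the chosen configuration on the whole training set, so the final model does not depend on `n_splits`. The synthetic data here is an assumption for illustration only.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import (
    GridSearchCV, StratifiedKFold, train_test_split)
from sklearn.naive_bayes import BernoulliNB

# Synthetic stand-in for the sentence features (an assumption).
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, random_state=15)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=15)

cv_means = {}
test_preds = {}
for k in (2, 5):
    cv = StratifiedKFold(n_splits=k, shuffle=True, random_state=15)
    search = GridSearchCV(BernoulliNB(), param_grid={'alpha': [1]}, cv=cv)
    search.fit(X_tr, y_tr)
    # The CV estimate depends on how the training data is split ...
    cv_means[k] = search.cv_results_['mean_test_score'][0]
    # ... but the refit model, and so its test predictions, does not.
    test_preds[k] = search.predict(X_te)

same_test_predictions = np.array_equal(test_preds[2], test_preds[5])
print(cv_means)
print(same_test_predictions)
```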