# Models

Now to try to predict which lines are Cartman's.

In [1]:
# Load in the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')

Now to set up the data again. As before, refer to notebooks 1 and 2 for more information.

In [2]:
# Set up the data
lines = pd.read_csv('../data/All-seasons.csv')

lines = lines[lines.Season != 'Season']

lines[['Season', 'Episode']] = lines[['Season', 'Episode']].astype('int64')

lines['is_cartman'] = 0

lines.loc[lines.Character == 'Cartman', 'is_cartman'] = 1

lines.head(3)

Unnamed: 0,Season,Episode,Character,Line,is_cartman
0,10,1,Stan,"You guys, you guys! Chef is going away. \n",0
1,10,1,Kyle,Going away? For how long?\n,0
2,10,1,Stan,Forever.\n,0


### Corpus

The corpus is established here, with steps to convert everything to lowercase and strip punctuation from the start and end of each word.

In [3]:
import re, string

corpus = lines.Line.tolist()

for line in range(len(corpus)):
    corpus[line] = re.sub('\\n', '', corpus[line].rstrip()).lower()
    corpus[line] = " ".join(word.strip(string.punctuation) for word in corpus[line].split())
    
corpus[7:10]

["what's the meaning of life why are we here",
 "i hope you're making the right choice",
 "i'm gonna miss him i'm gonna miss chef and i...and i don't know how to tell him"]

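As a minimal sketch of what the cleaning step above does to a single line (the helper name `clean_line` is made up for illustration), note that `str.strip(string.punctuation)` only touches the two ends of each word, so apostrophes and ellipses inside a word survive:

```python
import string

def clean_line(raw):
    """Lowercase a line and strip punctuation from both ends of each word."""
    words = raw.strip().lower().split()
    return " ".join(word.strip(string.punctuation) for word in words)

cleaned = clean_line("What's the meaning of life? Why are we here?\n")
print(cleaned)

# Punctuation inside a word is kept
print(clean_line("and I...and I don't know!"))

# A token made only of punctuation becomes an empty string,
# which leaves a double space behind after the join
print(repr(clean_line("a - b")))
```

The leftover double spaces are harmless here, since the later tokenizing steps split on whitespace anyway.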
<b>Improving the corpus by expanding contractions and lemmatizing words</b>

In [4]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import contractions

Here is a loop to expand all contractions:

In [5]:
for line in range(len(corpus)):
    corpus[line] = contractions.fix(corpus[line])
                                  
corpus[7:10]

['what is the meaning of life why are we here',
 'i hope you are making the right choice',
 'I am going to miss him I am going to miss chef and i...and i do not know how to tell him']

Now a function to lemmatize all verbs and nouns:

In [6]:
lem = WordNetLemmatizer()

def lemmatize_lines(line):
    word_list = word_tokenize(line)
    
    word_list = [lem.lemmatize(w, pos='v') for w in word_list]
    
    lem_line = ' '.join([lem.lemmatize(w) for w in word_list])
    
    return lem_line

And finally, a loop using the previous function to execute the lemmatization:

In [7]:
for line in range(len(corpus)):
    corpus[line] = lemmatize_lines(corpus[line])
    
corpus[7:9]

['what be the mean of life why be we here',
 'i hope you be make the right choice']

### Stop words

The list of stop words determined in notebook 2 needs to be instantiated.

In [8]:
sw = ['be', 'you', 'i', 'to', 'the', 'do', 'it',\
        'a', 'we', 'that', 'and', 'have', 'go', 'what',\
        'get', 'of', 'this', 'in', 'on', 'all', 'just',\
        'for', 'he', 'know', 'will', 'but', 'with', 'so',\
        'they', 'now', 'well', "'s", 'guy', 'u', 'come',\
        'like', 'there', 'at', 'would', 'who', 'him',\
        'them', 'his', 'thing', 'where', 'should', 'an',\
        'please', 'maybe', 'their', 'even', 'any', 'than']
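To illustrate the effect of the list, here is a rough sketch that filters one of the lemmatized lines from above with a small subset of `sw`. The helper `drop_stop_words` is hypothetical and splits on whitespace only; the vectorizers used later apply their own tokenizer, whose default pattern keeps only tokens of two or more word characters.

```python
# Small subset of the `sw` list above, as a set for fast lookups
stop_words = {'be', 'you', 'i', 'to', 'the', 'what', 'of', 'we'}

def drop_stop_words(line, stop_words):
    """Return the whitespace-separated tokens that are not stop words."""
    return [tok for tok in line.split() if tok not in stop_words]

kept = drop_stop_words('what be the mean of life why be we here', stop_words)
print(kept)
```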

### Word vectors and Data splits

The words and the corpus have been preprocessed, so now it's time to continue the setup by building the word vectors and then splitting the data into train and test sets. From there, we can tune different models and try to determine the best predictor. We start with CountVectorizer to establish a basic bag-of-words, combined with a Naive Bayes algorithm for training.<br>
<br>
<b>Establishing the vector</b>

In [95]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [96]:
vectorizer = CountVectorizer(stop_words=sw, ngram_range=(1,3))

X = vectorizer.fit_transform(corpus)
y = lines.is_cartman

state = 3

# How many features are there?
X.shape[1]

536629
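The feature count is large because every distinct one-, two-, and three-word sequence becomes its own column. A simplified sketch of the idea (the `ngrams` helper is an illustration, not CountVectorizer's actual implementation):

```python
def ngrams(tokens, n_min, n_max):
    """Return all word n-grams with n_min <= n <= n_max, in order of n."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

tokens = "hope make right choice".split()
features = ngrams(tokens, 1, 3)
print(features)
print(len(ngrams(tokens, 1, 1)), len(features))
```

A four-word line already yields nine features with `ngram_range=(1,3)` compared with four for unigrams alone, and most of the longer n-grams will be unique to a single line.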

<b>Splitting the data and training Multinomial Naive Bayes</b><br>
Now the features, X, and the target labels, y, need to be split into test and training sets. Then we can fit the data to a Multinomial Naive Bayes model and cross-validate on the training data before checking against the test data.

In [97]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=state)

In [98]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

mnb = MultinomialNB()

cv_scores = cross_val_score(mnb, X_train, y_train, cv=5)

In [99]:
print('Scores: ', cv_scores)
print('Average score: ', np.mean(cv_scores))

Scores:  [0.86255542 0.86174929 0.86213847 0.86242693 0.8612175 ]
Average score:  0.8620175217113915


We would certainly like to see better results, but it's not terrible for the first run.<br>
<br>
The immediate next steps would be to try different n-gram ranges for the count vectorizer, and different values of *alpha* for MultinomialNB. Let's experiment with n-grams first.

<b>Results for different n-gram ranges: MultinomialNB</b>

| ngram_range | Number of features | Mean cv score        |
|-------------|--------------------|----------------------|
|  (1,1)      | 20,939             | 0.856                 |
| (1,2)       | 232,654            | 0.860                 |
| (1,3)       | 536,629            | 0.862                 |
| (1,5)       | 1,060,181          | 0.849                 |
| (2,2)       | 211,715            | 0.812                 |
| (3,3)       | 303,975            | 0.720                 |

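Increasing the n-gram range to (1,3) improves the accuracy slightly (0.856 to 0.862) while increasing the number of features more than 25-fold. Going to (1,5) roughly doubles the feature count again and lowers the score, and dropping unigrams, as in (2,2) and (3,3), hurts even more.<br>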
<br>
Now we can experiment with values of *alpha* using `GridSearchCV`.

In [100]:
from sklearn.model_selection import GridSearchCV

params = {'alpha': [1.1, 1.15, 1.17, 1.2, 1.25, 1.3]}

nb_grid = GridSearchCV(mnb, params, cv=5)

nb_grid.fit(X_train, y_train)

print('Best alpha: ', nb_grid.best_estimator_)
print('Best score: ', nb_grid.best_score_)

Best alpha:  MultinomialNB(alpha=1.15, class_prior=None, fit_prior=True)
Best score:  0.8629849843797238
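For context, *alpha* is the additive smoothing term in the Naive Bayes likelihood estimate, (count + alpha) / (class total + alpha * vocabulary size). Below is a small sketch with made-up word counts for one class, just to show how a larger alpha shifts probability toward words never seen in that class:

```python
def smoothed_prob(word_count, class_total, vocab_size, alpha):
    """Additively smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Hypothetical counts of each vocabulary word within one class
counts = {"respect": 4, "authority": 3, "mom": 3, "kyle": 0, "chef": 0}
class_total = sum(counts.values())
vocab_size = len(counts)

for alpha in (1.0, 1.15, 1.3):
    unseen = smoothed_prob(counts["kyle"], class_total, vocab_size, alpha)
    seen = smoothed_prob(counts["respect"], class_total, vocab_size, alpha)
    print(f"alpha={alpha}: unseen={unseen:.4f} seen={seen:.4f}")
```

With values this close together the estimates barely move, which fits with the small change in score.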


*Alpha* helps, but only slightly.<br>
<br>
Let's check the confusion matrix.

In [101]:
from sklearn.metrics import confusion_matrix

confusion_matrix(y_train, nb_grid.predict(X_train))

array([[42748,    25],
       [ 3365,  3477]], dtype=int64)

In [102]:
from sklearn.metrics import classification_report

# `labels` has to be passed by keyword in recent scikit-learn versions
print(classification_report(y_train, nb_grid.predict(X_train), labels=np.unique(y)))

              precision    recall  f1-score   support

           0       0.93      1.00      0.96     42773
           1       0.99      0.51      0.67      6842

    accuracy                           0.93     49615
   macro avg       0.96      0.75      0.82     49615
weighted avg       0.94      0.93      0.92     49615
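As a sanity check, the Cartman-class numbers in the report can be recomputed by hand from the confusion matrix above, reading it as rows for true labels and columns for predicted labels:

```python
# Training confusion matrix from above: [[TN, FP], [FN, TP]]
cm = [[42748, 25],
      [3365, 3477]]

tn, fp = cm[0]
fn, tp = cm[1]

precision = tp / (tp + fp)   # share of predicted Cartman lines that are correct
recall = tp / (tp + fn)      # share of actual Cartman lines that were found
f1 = 2 * precision * recall / (precision + recall)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(round(precision, 2), round(recall, 2), round(f1, 2), round(accuracy, 2))
```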



These results are on the training data, so they are optimistic. The predictions were labelled correctly 93% of the time, and the model almost never mislabelled the non-Cartman class (25 false positives). Only about half of the Cartman lines were recovered (recall of 0.51).<br>
<br>
Now to check against the test set:

In [103]:
print(confusion_matrix(y_test, nb_grid.predict(X_test)))
print(classification_report(y_test, nb_grid.predict(X_test), labels=np.unique(y)))

[[17953   379]
 [ 2612   320]]
              precision    recall  f1-score   support

           0       0.87      0.98      0.92     18332
           1       0.46      0.11      0.18      2932

    accuracy                           0.86     21264
   macro avg       0.67      0.54      0.55     21264
weighted avg       0.82      0.86      0.82     21264



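Test accuracy (0.86) matches the cross-validation score, but the f1 score for the minority label drops from 0.67 on the training data to 0.18. The training-set report scored the model on lines it had already seen, so a gap like this points to overfitting. We'll see if this trend continues with further testing.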
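To put the accuracy figures in context, it helps to compare them with a baseline that always predicts the majority (non-Cartman) class. This sketch uses only the support counts and the test confusion matrix printed above:

```python
# Class supports taken from the classification reports above
train_support = {0: 42773, 1: 6842}
test_support = {0: 18332, 1: 2932}

def majority_baseline(support):
    """Accuracy obtained by always predicting the most frequent class."""
    return max(support.values()) / sum(support.values())

baseline_train = majority_baseline(train_support)
baseline_test = majority_baseline(test_support)

# Test accuracy of the Naive Bayes model, from its test confusion matrix
nb_test_accuracy = (17953 + 320) / sum(test_support.values())

print(f"baseline (train): {baseline_train:.4f}")
print(f"baseline (test):  {baseline_test:.4f}")
print(f"Naive Bayes test: {nb_test_accuracy:.4f}")
```

Both baselines come out at about 0.862, so accuracy alone says little on this data. The minority-class precision, recall, and f1 are the more informative numbers.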

<b>Common phrases</b><br>
We can also use the CountVectorizer along with different n-gram ranges to find common phrases used by Cartman. By using different ranges we can examine phrases of differing lengths.<br>
<br>
First, we use the `is_cartman` series to find the indices for all of Cartman's lines. These indices will then be used to select which arrays to pull using the `X` variable that was established earlier by fitting the corpus to the vectorizer.

In [29]:
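# Use positional row numbers; the labels in y.index would not line up with
# the rows of X if the earlier filter on `lines` dropped any rows
ind = np.flatnonzero(y.values == 1)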

Next, we loop over each of those rows, looking for any terms that appear in that document. This doesn't give the actual term, but it does give the term's column index, which can then be looked up in the vectorizer vocabulary to find the actual term.

In [30]:
vocab_values = []
    
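for i in ind:
    # Convert one sparse row at a time to a dense list
    vec = X[i].toarray().tolist()
    vec = vec[0]
    # Separate name for the column index so it doesn't shadow the row index `i`
    for j, el in enumerate(vec):
        if el > 0:
            vocab_values.append(j)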

The `vocabulary_` attribute provides a dictionary of every term in the vectorizer along with the column index that corresponds to that term. These keys and values are then swapped in the `new_vocab` dictionary so that terms can be looked up by index. Then, the indices from the `vocab_values` list are used to create a simple list, not a set, of all the words or phrases used by Cartman, so that repeats are kept for counting.

In [31]:
vocab = vectorizer.vocabulary_

# Invert the mapping: column index -> term
new_vocab = {v: k for k, v in vocab.items()}

# Look up the term for every recorded column index (repeats included)
vocab_words = [new_vocab[i] for i in vocab_values]

Finally, a simple counter can be used on the `vocab_words` list to find the most common words and phrases.

In [49]:
from collections import Counter

cartman_phrases = Counter(vocab_words)
cartman_phrases.most_common(10)

[('oh my god', 72),
 ('can not believe', 15),
 ('not tell me', 12),
 ('no no no', 12),
 ('can not wait', 10),
 ('can not let', 10),
 ('let me see', 9),
 ('dude can not', 9),
 ('no can not', 9),
 ('why can not', 9)]
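The lookup-and-count steps above can be sketched on a toy example. The vocabulary and count rows here are invented for illustration. Note that a phrase is recorded once per line it appears in, so the totals are the number of lines containing the phrase rather than the total number of occurrences:

```python
from collections import Counter

# Hypothetical vocabulary in the same shape as `vectorizer.vocabulary_`
vocab = {'can not believe': 0, 'no no no': 1, 'oh my god': 2}
index_to_phrase = {idx: phrase for phrase, idx in vocab.items()}

# Toy count rows: one row per line, one column per phrase
rows = [[0, 0, 1],
        [1, 0, 1],
        [0, 2, 1]]

counts = Counter()
for row in rows:
    for col, value in enumerate(row):
        if value > 0:
            # Counted once per line, even when value > 1
            counts[index_to_phrase[col]] += 1

print(counts.most_common(3))
```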

<b>Most common words and phrases</b><br>
<br>

| Single words\: (1,1)        | Two word phrases\: (2,2)| Three word phrases\: (3,3)|
|-----------------------------|-------------------------|---------------------------|
| not                         |  can not                |  oh my god                |
| my                          |  my god                 | can not believe           |
| can                         |  oh my                  | not tell me               |
| me                          |  not want               | no no no                  |
| oh                          |  not think              | can not wait              |
| your                        | south park              | can not let               |
| no                          | no not                  | let me see                |
| yeah                        | why not                 | dude can not              |
| here                        | tell me                 | no can not                |
| right                       | no no                   | why can not               |

The entries in the table above aren't as distinct as I had hoped. All the negative words get in the way but, on the other hand, they show how central negation is to the tone of the dialogue. The sentiment of these phrases would be much different if 'no' and 'not' were removed from the corpus. Also, the expanded contractions and other preprocessing steps dilute the phrases somewhat.

### Tf-idf

To try to make further improvements, we can use a weighted tf-idf vectorizer instead of a basic CountVectorizer. Stop words aren't as big of a concern here because tf-idf already down-weights terms that appear in many lines, but since we already have a list we may as well use it. Also, because this is a new vectorizer we will have to create new train and test splits.

In [104]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vect = TfidfVectorizer(stop_words=sw, ngram_range=(1,1))

X = tf_vect.fit_transform(corpus)
y = lines.is_cartman

In [105]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=state)
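For intuition about the weighting, here is a plain-Python sketch of tf-idf on three made-up documents. It assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, and L2 row normalization that scikit-learn documents as the TfidfVectorizer defaults; it is not the library's actual code.

```python
import math
from collections import Counter

# Three tiny tokenized "lines" (invented for illustration)
docs = [["respect", "authority"],
        ["respect", "kyle"],
        ["respect", "chef", "chef"]]

n_docs = len(docs)
# Document frequency: in how many lines does each term appear?
df = Counter(term for doc in docs for term in set(doc))
# Smoothed inverse document frequency
idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}

def tfidf_vector(doc):
    """Return an L2-normalized {term: weight} mapping for one document."""
    tf = Counter(doc)
    weights = {term: tf[term] * idf[term] for term in tf}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}

vectors = [tfidf_vector(doc) for doc in docs]
print(idf)
print(vectors[0])
```

A term that shows up in every line ("respect") gets the minimum idf of 1, while rarer terms get more weight. This is the built-in adjustment that makes the stop word list less critical.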

Now to run the new splits with MultinomialNB and check the scores.

In [106]:
params = {'alpha': [1.0, 1.25, 1.35, 1.4]}

t_nb_grid = GridSearchCV(mnb, params, cv=5)

t_nb_grid.fit(X_train, y_train)

print('Best alpha: ', t_nb_grid.best_estimator_)
print('Best score: ', t_nb_grid.best_score_)

Best alpha:  MultinomialNB(alpha=1.35, class_prior=None, fit_prior=True)
Best score:  0.8624811045046861


In [107]:
confusion_matrix(y_train, t_nb_grid.predict(X_train))

array([[42766,     7],
       [ 6733,   109]], dtype=int64)

In [108]:
print(classification_report(y_train, t_nb_grid.predict(X_train), labels=np.unique(y)))

              precision    recall  f1-score   support

           0       0.86      1.00      0.93     42773
           1       0.94      0.02      0.03      6842

    accuracy                           0.86     49615
   macro avg       0.90      0.51      0.48     49615
weighted avg       0.87      0.86      0.80     49615



Combined with MultinomialNB, the tf-idf vectorizer has about the same cross-validated accuracy as CountVectorizer (0.862 vs. 0.863), but the confusion matrix shows a significant drop in predictive ability. Even on the training data, precision for Cartman predictions is high (0.94), but recall is terrible (0.02), lowering the f1 score to 0.03. However, there may be other algorithms better suited to the weighted vectors.

### Trying other models

<b>Random Forest</b><br>
The results so far have been less than stellar, but maybe there are other algorithms that will perform better than Naive Bayes. First, let's try Random Forest, still with the tf-idf vectors. With `oob_score` (out-of-bag) set to `True`, we use the out-of-bag samples as a sort of validation set. I also set `class_weight` to `'balanced'` to try to account for the class imbalance. This lowers the accuracy, but should improve the other metrics.

In [110]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=400,\
                            max_features='sqrt',\
                            class_weight='balanced',\
                            max_depth=5,\
                            oob_score=True,\
                            random_state=state,\
                            n_jobs=-1)

rf.fit(X_train, y_train)
print(rf.oob_score_)

0.7751284893681346
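For reference, scikit-learn documents the `'balanced'` setting as weighting each class by n_samples / (n_classes * class count). A quick sketch of what that works out to for the training split, using the class counts from the reports above:

```python
# Training-set class counts (supports from the classification reports)
class_counts = {0: 42773, 1: 6842}

n_samples = sum(class_counts.values())
n_classes = len(class_counts)

# 'balanced' heuristic: n_samples / (n_classes * count) for each class
balanced = {label: n_samples / (n_classes * count)
            for label, count in class_counts.items()}

print({label: round(weight, 2) for label, weight in balanced.items()})
```

Each Cartman line ends up counting a little over six times as much as a non-Cartman line, so both classes carry the same total weight.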


In [111]:
confusion_matrix(y_train, rf.predict(X_train))

array([[35849,  6924],
       [ 3402,  3440]], dtype=int64)

In [112]:
print(classification_report(y_train, rf.predict(X_train), labels=np.unique(y)))

              precision    recall  f1-score   support

           0       0.91      0.84      0.87     42773
           1       0.33      0.50      0.40      6842

    accuracy                           0.79     49615
   macro avg       0.62      0.67      0.64     49615
weighted avg       0.83      0.79      0.81     49615



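The oob score (0.775) is lower than the accuracies seen so far, which is expected with balanced class weights, but the minority-class results in the classification report are far better than those from Naive Bayes on the same tf-idf vectors: recall is 0.50 instead of 0.02 and f1 is 0.40 instead of 0.03. The predictions for the Cartman class are still questionable, with a precision of only 0.33. Let's see if it holds up.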

In [113]:
rf_pred = rf.predict(X_test)

accuracy_score(y_test, rf_pred)

0.7787810383747178

In [114]:
confusion_matrix(y_test, rf_pred)

array([[15211,  3121],
       [ 1583,  1349]], dtype=int64)

In [115]:
print(classification_report(y_test, rf_pred, labels=np.unique(y)))

              precision    recall  f1-score   support

           0       0.91      0.83      0.87     18332
           1       0.30      0.46      0.36      2932

    accuracy                           0.78     21264
   macro avg       0.60      0.64      0.62     21264
weighted avg       0.82      0.78      0.80     21264



Performance dropped some, but the Cartman f1 score on the test set (0.36) is still better than the 0.18 from Naive Bayes with CountVectorizer.

<b>SVM</b><br>
We can also try Support Vector Machines.

In [128]:
from sklearn.svm import LinearSVC

svm = LinearSVC(max_iter=5000, class_weight={0: 0.4, 1: 2}, random_state=3)

params = {'C': [0.5, 0.6, 0.7, 0.8]}

sv_grid = GridSearchCV(svm, params, cv=5)

sv_grid.fit(X_train, y_train)

print('Best C: ', sv_grid.best_estimator_)
print('Best score: ', sv_grid.best_score_)

Best C:  LinearSVC(C=0.5, class_weight={0: 0.4, 1: 2}, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=5000,
          multi_class='ovr', penalty='l2', random_state=3, tol=0.0001,
          verbose=0)
Best score:  0.7897813161342336


In [129]:
confusion_matrix(y_train, sv_grid.predict(X_train))

array([[37131,  5642],
       [ 1765,  5077]], dtype=int64)

Recall on the training data improves to about 0.74 (5,077 of 6,842 Cartman lines). Now to check the test set.

In [130]:
confusion_matrix(y_test, sv_grid.predict(X_test))

array([[15230,  3102],
       [ 1493,  1439]], dtype=int64)

In [131]:
print(classification_report(y_test, sv_grid.predict(X_test), labels=np.unique(y)))

              precision    recall  f1-score   support

           0       0.91      0.83      0.87     18332
           1       0.32      0.49      0.39      2932

    accuracy                           0.78     21264
   macro avg       0.61      0.66      0.63     21264
weighted avg       0.83      0.78      0.80     21264



As before, the test accuracy (0.78) is close to the cross-validation score (0.79), but the confusion matrix changes significantly: Cartman recall falls from about 0.74 on the training data to 0.49 on the test data.

<b>Logistic Regression</b>

In [132]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=4000, multi_class='multinomial', random_state=3)

params = {'C': [1.0, 1.5, 1.75],
         'solver': ['newton-cg', 'sag']}

log_grid = GridSearchCV(log_reg, params, cv=5)

log_grid.fit(X_train, y_train)

print('Best C: ', log_grid.best_estimator_)
print('Best score: ', log_grid.best_score_)

Best C:  LogisticRegression(C=1.5, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=4000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=3, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
Best score:  0.8678222311800867


In [133]:
confusion_matrix(y_train, log_grid.predict(X_train))

array([[42396,   377],
       [ 5546,  1296]], dtype=int64)

Playing around with different GridSearch parameters yields some better results, but there's still a similar tradeoff. On the training data, precision for the Cartman class is higher here (about 0.77), but recall is poor (about 0.19).

<b>Boosting</b><br>
<br>
Here's GradientBoosting:

In [47]:
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(max_depth=3, n_estimators=150, learning_rate=0.05, random_state=state)

cv_boost = cross_val_score(gb, X_train, y_train, cv=5)

In [48]:
print('Scores: ', cv_boost)
print('Average score: ', np.mean(cv_boost))

Scores:  [0.86648529 0.86517533 0.86375088 0.86585366 0.86655916]
Average score:  0.8655648645006405


In [50]:
gb.fit(X_train, y_train)

confusion_matrix(y_train, gb.predict(X_train))

array([[42739,    34],
       [ 6525,   317]], dtype=int64)

And now to validate:

In [74]:
print(confusion_matrix(y_test, gb.predict(X_test)))
print(classification_report(y_test, gb.predict(X_test), labels=np.unique(y)))

[[18297    35]
 [ 2841    91]]
              precision    recall  f1-score   support

           0       0.87      1.00      0.93     18332
           1       0.72      0.03      0.06      2932

    accuracy                           0.86     21264
   macro avg       0.79      0.51      0.49     21264
weighted avg       0.85      0.86      0.81     21264



Still a lousy tradeoff between precision and recall for the minority class.<br>
<br>
Now AdaBoost:

In [77]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

dt_b = DecisionTreeClassifier(max_depth=1, class_weight='balanced', random_state=state)

adb = AdaBoostClassifier(base_estimator=dt_b, n_estimators=200)

adb.fit(X_train, y_train)

#cv_ada = cross_val_score(adb, X_train, y_train, cv=5)

AdaBoostClassifier(algorithm='SAMME.R',
                   base_estimator=DecisionTreeClassifier(class_weight='balanced',
                                                         criterion='gini',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         presort=False,
                                                         random_state=3,
                          

In [79]:
print('Score: ', accuracy_score(y_train, adb.predict(X_train)))
#print('Average score: ', np.mean(cv_ada))

Score:  0.7622291645671672


In [80]:
confusion_matrix(y_train, adb.predict(X_train))

array([[33925,  8848],
       [ 2949,  3893]], dtype=int64)

Now to validate:

In [81]:
print(confusion_matrix(y_test, adb.predict(X_test)))
print(classification_report(y_test, adb.predict(X_test), labels=np.unique(y)))

[[14396  3936]
 [ 1424  1508]]
              precision    recall  f1-score   support

           0       0.91      0.79      0.84     18332
           1       0.28      0.51      0.36      2932

    accuracy                           0.75     21264
   macro avg       0.59      0.65      0.60     21264
weighted avg       0.82      0.75      0.78     21264



AdaBoost is an improvement over gradient boosting (Cartman f1 of 0.36 vs. 0.06 on the test set), but it still hasn't exceeded the SVM (0.39), and the computational cost is much higher.
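To compare the models side by side, the minority-class f1 can be recomputed directly from the test confusion matrices printed above, using f1 = 2TP / (2TP + FP + FN). Logistic regression is left out because no test matrix was shown for it. The dictionary labels are just shorthand for this sketch.

```python
# Test-set confusion matrices copied from the outputs above: [[TN, FP], [FN, TP]]
test_matrices = {
    "MultinomialNB": [[17953, 379], [2612, 320]],
    "RandomForest": [[15211, 3121], [1583, 1349]],
    "LinearSVC": [[15230, 3102], [1493, 1439]],
    "GradientBoosting": [[18297, 35], [2841, 91]],
    "AdaBoost": [[14396, 3936], [1424, 1508]],
}

def minority_f1(cm):
    """f1 for the positive (Cartman) class from a 2x2 confusion matrix."""
    (_, fp), (fn, tp) = cm
    return 2 * tp / (2 * tp + fp + fn)

ranking = sorted(((minority_f1(cm), name) for name, cm in test_matrices.items()),
                 reverse=True)

for score, name in ranking:
    print(f"{name:18s} {score:.2f}")
```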