# Models

Now we try to predict which character spoke each line.

In [1]:
# Load in the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')

Now to set up the data again. As before, refer to notebooks 1 and 2 for more information.

In [2]:
# Set up the data
lines = pd.read_csv('../data/All-seasons.csv')

lines = lines[lines.Season != 'Season']

lines[['Season', 'Episode']] = lines[['Season', 'Episode']].astype('int64')

support_chars = ['Mr. Garrison', 'Chef', 'Sharon',\
                 'Mr. Mackey', 'Gerald', 'Liane', 'Sheila',\
                 'Stephen', 'Ms. Garrison', 'Mrs. Garrison']

lines.loc[lines.Character.isin(support_chars), 'Character'] = 'Support Character'

final_labels = ['Cartman', 'Stan', 'Kyle', 'Butters', 'Randy', 'Support Character']

lines_final = lines[lines.Character.isin(final_labels)]

In [3]:
lines_final.head(3)

Unnamed: 0,Season,Episode,Character,Line
0,10,1,Stan,"You guys, you guys! Chef is going away. \n"
1,10,1,Kyle,Going away? For how long?\n
2,10,1,Stan,Forever.\n


### Corpus

The corpus is established here, with steps to convert everything to lowercase and strip punctuation from both ends of each word. Punctuation inside a word, such as the apostrophe in a contraction, is left in place.

In [4]:
import re, string

corpus = lines_final.Line.tolist()

for line in range(len(corpus)):
    corpus[line] = re.sub('\\n', '', corpus[line].rstrip()).lower()
    corpus[line] = " ".join(word.strip(string.punctuation) for word in corpus[line].split())
    
corpus[:3]

['you guys you guys chef is going away', 'going away for how long', 'forever']
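The same cleaning steps can be wrapped in a small function, which makes it easier to check how individual lines are handled. This is a minimal sketch; `clean_line` is a hypothetical helper name, not something defined elsewhere in these notebooks.

```python
import re
import string

def clean_line(line):
    """Lowercase a line and strip punctuation from both ends of each word."""
    # Drop trailing whitespace and any newline characters, then lowercase
    line = re.sub(r'\n', '', line.rstrip()).lower()
    # str.strip only touches the ends of each word, so "don't" keeps its apostrophe
    return ' '.join(word.strip(string.punctuation) for word in line.split())

print(clean_line("You guys, you guys! Chef is going away. \n"))
print(clean_line("I don't know... what-ever!"))
```

Because internal punctuation survives, contractions are still intact for the expansion step that follows.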

<b>Improving the corpus by expanding contractions and lemmatizing words</b>

In [5]:
from nltk.stem import WordNetLemmatizer
from nltk import word_tokenize
import contractions

Here is a loop to expand all contractions:

In [6]:
for line in range(len(corpus)):
    corpus[line] = contractions.fix(corpus[line])
                                  
corpus[7:10]

['what is the meaning of life why are we here',
 'i hope you are making the right choice',
 'I am going to miss him I am going to miss chef and i...and i do not know how to tell him']

Now a function to lemmatize all verbs and nouns:

In [7]:
lem = WordNetLemmatizer()

def lemmatize_lines(line):
    word_list = word_tokenize(line)
    
    word_list = [lem.lemmatize(w, pos='v') for w in word_list]
    
    lem_line = ' '.join([lem.lemmatize(w) for w in word_list])
    
    return lem_line

And finally, a loop using the previous function to execute the lemmatization:

In [8]:
for line in range(len(corpus)):
    corpus[line] = lemmatize_lines(corpus[line])
    
corpus[7:9]

['what be the mean of life why be we here',
 'i hope you be make the right choice']

### Stop words

The list of stop words determined in notebook 2 needs to be instantiated.

In [9]:
sw = ['be', 'you', 'i', 'to', 'the', 'do', 'it',\
        'a', 'we', 'that', 'and', 'have', 'go', 'what',\
        'get', 'of', 'this', 'in', 'on', 'all', 'just',\
        'for', 'he', 'know', 'will', 'but', 'with', 'so',\
        'they', 'now', 'well', "'s", 'guy', 'u', 'come',\
        'like', 'there', 'at', 'would', 'who', 'him',\
        'them', 'his', 'thing', 'where', 'should', 'an',\
        'please', 'maybe', 'their', 'even', 'any', 'than']
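One detail worth knowing about this list: assuming the vectorizers below keep scikit-learn's default `token_pattern` (the notebook does not override it), tokens shorter than two characters are dropped during tokenization, before stop words are applied. Entries such as `'i'`, `'a'`, `'u'` and `"'s"` are therefore harmless but have no effect. The sketch below reproduces the default pattern with plain `re`; `default_tokens` is a hypothetical helper used only for illustration.

```python
import re

# Default token_pattern used by CountVectorizer and TfidfVectorizer
TOKEN_PATTERN = r"(?u)\b\w\w+\b"

def default_tokens(text):
    """Tokenize text the way the default vectorizer pattern would."""
    return re.findall(TOKEN_PATTERN, text.lower())

tokens = default_tokens("i hope you be make the right choice 's u a")
print(tokens)
```

Only tokens of two or more word characters come back, so the single-character entries never reach the stop-word filter.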

### Word vectors and Data splits

The words and the corpus have been preprocessed. Now it's time to continue the setup by establishing the word vectors and then splitting the data into train and test sets. From there, we can tune different models and try to determine the best predictor. We start with CountVectorizer to establish a basic bag-of-words, combined with a Naive Bayes algorithm for training.<br>
<br>
<b>Establishing the vector</b>

In [10]:
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

In [36]:
vectorizer = CountVectorizer(stop_words=sw, ngram_range=(1,1))

X = vectorizer.fit_transform(corpus)
y = lines_final.Character

state = 3

# How many features are there?
X.shape[1]

12333

<b>Splitting the data and training Multinomial Naive Bayes</b><br>
Now the features, X, and the target labels, y, need to be split into test and training sets. Then we can fit the data to a Multinomial Naive Bayes model and cross validate on the training data before checking against the test data.

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=state)

In [38]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
from sklearn.model_selection import cross_val_score

mnb = MultinomialNB()

cv_scores = cross_val_score(mnb, X_train, y_train, cv=5)

In [39]:
print('Scores: ', cv_scores)
print('Average score: ', np.mean(cv_scores))

Scores:  [0.40567747 0.41510574 0.40407094 0.41031842 0.4069744 ]
Average score:  0.4084293941887636
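To put an accuracy like this in context, it helps to compare it with the score obtained by always predicting the most common label. The sketch below uses made-up labels, not the real class counts, and `majority_baseline` is a hypothetical helper.

```python
from collections import Counter

def majority_baseline(labels):
    """Return the most frequent label and the accuracy of always predicting it."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Toy labels for illustration only; these are not the notebook's class counts
toy_labels = ['Cartman'] * 4 + ['Stan'] * 3 + ['Kyle'] * 3
label, baseline_accuracy = majority_baseline(toy_labels)
print(label, baseline_accuracy)
```

In the notebook, the same function could be called on `y_train` to see how far the cross-validation scores sit above the trivial baseline.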


It's only the first run, but we would certainly like to see better scores than that.<br>
<br>
The immediate next steps would be to try different n-gram ranges for the count vectorizer, and different values of *alpha* for MultinomialNB. Let's experiment with n-grams first.

<b>Results for different n-gram ranges: MultinomialNB</b>

| ngram_range | Number of features | Mean cv score |
|-------------|-----------------|----------------------|
|  (1,1)      | 12,333           | 0.408                 |
| (1,2)       | 119,582          | 0.406                 |
| (1,3)       | 265,707          | 0.406                 |
| (1,5)       | 511,383          | 0.404                 |
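The feature counts grow quickly because every distinct word sequence up to the maximum length becomes its own column. A minimal sketch of that growth on a toy corpus follows. `count_ngram_features` is a hypothetical helper that splits on whitespace, so its counts will not match CountVectorizer exactly.

```python
def count_ngram_features(docs, max_n):
    """Count distinct word n-grams of length 1..max_n across all documents."""
    features = set()
    for doc in docs:
        words = doc.split()
        for n in range(1, max_n + 1):
            # Slide a window of n words across the document
            for i in range(len(words) - n + 1):
                features.add(' '.join(words[i:i + n]))
    return len(features)

toy_corpus = ['chef go away', 'go away how long', 'how long chef go']
for max_n in (1, 2, 3):
    print(max_n, count_ngram_features(toy_corpus, max_n))
```

Most longer n-grams occur in only one or two lines, which gives the classifier many columns but little evidence per column.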

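Increasing the n-gram range multiplies the number of features (from 12,333 to over 500,000) while the mean CV score stays flat or dips slightly, so the extra n-grams appear to add noise rather than signal. It appears best to stick with a range of (1,1).<br>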
<br>
Now we can experiment with values of *alpha* using `GridSearchCV`.

In [41]:
from sklearn.model_selection import GridSearchCV

params = {'alpha': [0.1, 0.2, 0.3, 0.5, 0.75, 0.9, 1.0]}

nb_grid = GridSearchCV(mnb, params, cv=5)

nb_grid.fit(X_train, y_train)

print('Best alpha: ', nb_grid.best_estimator_)
print('Best score: ', nb_grid.best_score_)

Best alpha:  MultinomialNB(alpha=0.5, class_prior=None, fit_prior=True)
Best score:  0.4144336543498408


Tuning *alpha* helps, but only slightly: the mean CV score rises from 0.408 to 0.414.
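For reference, *alpha* is the additive smoothing term in the per-class word likelihoods, which scikit-learn documents as (count + alpha) / (class total + alpha * vocabulary size). The sketch below shows how a smaller *alpha* gives less probability to words a character never said. `smoothed_likelihood` is a hypothetical helper and the counts are made up.

```python
def smoothed_likelihood(word_count, class_total, vocab_size, alpha=1.0):
    """Additively smoothed estimate of P(word | class)."""
    return (word_count + alpha) / (class_total + alpha * vocab_size)

# Toy numbers: a class with 10 word occurrences over a 5-word vocabulary
for alpha in (1.0, 0.5, 0.1):
    unseen = smoothed_likelihood(0, 10, 5, alpha)
    print(alpha, round(unseen, 4))
```

With a vocabulary of over 12,000 words and short lines, most words are unseen for most characters, so this term influences many of the likelihoods.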

### Tf-idf

To try to make further improvements, we can use a weighted tf-idf vectorizer instead of a basic CountVectorizer. Stop words aren't as big a concern here because tf-idf already down-weights terms that appear in many lines, but since we already have a list we may as well use it. Also, because this is a new vectorizer, we will have to create new train and test splits.

In [43]:
from sklearn.feature_extraction.text import TfidfVectorizer

tf_vect = TfidfVectorizer(stop_words=sw)

X = tf_vect.fit_transform(corpus)
y = lines_final.Character

In [44]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, stratify=y, random_state=state)
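As a rough illustration of the weighting, TfidfVectorizer's default (smoothed) inverse document frequency is ln((1 + n) / (1 + df)) + 1, where n is the number of documents and df is the number of documents containing the term. The sketch below computes only that idf factor. `smooth_idf` is a hypothetical helper and the document counts are made up.

```python
import math

def smooth_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency, as in TfidfVectorizer's default."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

# Toy numbers: a term found in 900 of 1000 lines versus one found in 5
common = smooth_idf(1000, 900)
rare = smooth_idf(1000, 5)
print(round(common, 3), round(rare, 3))
```

A term that appears in almost every line gets a weight close to 1, while a rare term is weighted several times higher. The vectorizer then multiplies these weights by the term counts and normalizes each row.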

Now to run the new splits with MultinomialNB and check the scores.

In [47]:
nb_grid.fit(X_train, y_train)

print('Best alpha: ', nb_grid.best_estimator_)
print('Best score: ', nb_grid.best_score_)

Best alpha:  MultinomialNB(alpha=0.1, class_prior=None, fit_prior=True)
Best score:  0.40520610871579965


Well, that is frustrating: the best tf-idf score (0.405) is slightly below the best count-vector score (0.414).

### Trying other models

<b>Random Forest</b><br>
The results so far have been disappointing, but maybe there are other algorithms that will perform better than Naive Bayes. First, let's try Random Forest, still with the tf-idf vectors. With `oob_score=True`, the out-of-bag samples (those left out of each tree's bootstrap sample) serve as a sort of validation set.

In [52]:
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_estimators=100,\
                            max_features='sqrt',\
                            oob_score=True,\
                            random_state=state,\
                            n_jobs=-1)

rf.fit(X_train, y_train)
print(rf.oob_score_)

0.40653584236612


The score is pretty in line with all the other results we've seen. There are other parameters we could play with, but the results don't look any more promising than Naive Bayes, so let's try some others.
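For context on the out-of-bag score used above: each tree is trained on a bootstrap sample drawn with replacement, so some training rows are never drawn for a given tree and can be used to score it. The simulation below estimates how large that left-out share is. `oob_fraction` is a hypothetical helper and does not use the notebook's data.

```python
import random

def oob_fraction(n_samples, seed=0):
    """Fraction of rows left out of one bootstrap sample of size n_samples."""
    rng = random.Random(seed)
    # Draw n_samples row indices with replacement and keep the distinct ones
    drawn = {rng.randrange(n_samples) for _ in range(n_samples)}
    return 1 - len(drawn) / n_samples

fraction = oob_fraction(10_000)
print(round(fraction, 3))
```

The result should land near 1/e, or roughly 37%, which is why the out-of-bag estimate works without a separate validation split.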

<b>SVM</b><br>
We can also try Support Vector Machines.

In [62]:
from sklearn.svm import LinearSVC

svm = LinearSVC(max_iter=5000, random_state=3)

params = {'C': [0.15, 0.2, 0.25]}

sv_grid = GridSearchCV(svm, params, cv=5)

sv_grid.fit(X_train, y_train)

print('Best C: ', sv_grid.best_estimator_)
print('Best score: ', sv_grid.best_score_)

Best C:  LinearSVC(C=0.2, class_weight=None, dual=True, fit_intercept=True,
          intercept_scaling=1, loss='squared_hinge', max_iter=5000,
          multi_class='ovr', penalty='l2', random_state=3, tol=0.0001,
          verbose=0)
Best score:  0.42761010597574245


Slightly better at 0.428, but still in the same range as the other models.

<b>Logistic Regression</b>

In [70]:
from sklearn.linear_model import LogisticRegression

log_reg = LogisticRegression(max_iter=5000, multi_class='multinomial', random_state=3)

params = {'C': [0.2, 1.0, 2.0, 5.0, 10.0],
         'class_weight': [None],
         'solver': ['newton-cg', 'sag']}

log_grid = GridSearchCV(log_reg, params, cv=5)

log_grid.fit(X_train, y_train)

print('Best C: ', log_grid.best_estimator_)
print('Best score: ', log_grid.best_score_)

Best C:  LogisticRegression(C=2.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=5000,
                   multi_class='multinomial', n_jobs=None, penalty='l2',
                   random_state=3, solver='newton-cg', tol=0.0001, verbose=0,
                   warm_start=False)
Best score:  0.42474916387959866


Playing around with different GridSearch parameters yields slightly better results, but it seems clear at this point that cracking 50% is likely out of reach.
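A final check against the held-out test set is not shown in this notebook. One way it could look is sketched below, with the vectorizer and classifier wrapped in a scikit-learn `Pipeline` so that the vocabulary and idf weights are learned from the training lines only. The toy lines, labels and variable names are all hypothetical stand-ins for `corpus` and `lines_final.Character`, and the accuracy it prints says nothing about the real data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Hypothetical toy data standing in for the real corpus and labels
toy_lines = [
    'respect my authority', 'screw you guy i go home', 'mom i want cheesy poof',
    'i hate you guy', 'respect my authority now', 'give me my cheesy poof',
    'oh my god they kill kenny', 'dude this be pretty mess up', 'come on dude',
    'dude that be not cool', 'oh my god dude', 'this be mess up dude',
]
toy_labels = ['Cartman'] * 6 + ['Stan'] * 6

# Split the raw text first, so the test lines never influence the vectorizer
train_lines, test_lines, train_labels, test_labels = train_test_split(
    toy_lines, toy_labels, test_size=0.25, stratify=toy_labels, random_state=3)

model = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('clf', LogisticRegression(C=2.0, max_iter=5000)),
])
model.fit(train_lines, train_labels)

test_accuracy = accuracy_score(test_labels, model.predict(test_lines))
print('Test accuracy:', test_accuracy)
```

The same pipeline object can be passed to `GridSearchCV` in place of a bare estimator, with parameter names prefixed by the step name (for example `clf__C`).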