# Models

Now to build models that try to predict which character spoke a given line.

In [83]:
# Load in the packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

import seaborn as sns
%matplotlib inline
sns.set_style('darkgrid')

Now to set up the data again. As before, refer to notebooks 1 and 2 for more information.

In [84]:
# Set up the data
lines = pd.read_csv('../data/All-seasons.csv')

lines = lines[lines.Season != 'Season']

lines[['Season', 'Episode']] = lines[['Season', 'Episode']].astype('int64')

support_chars = ['Mr. Garrison', 'Chef', 'Sharon',
                 'Mr. Mackey', 'Gerald', 'Liane', 'Sheila',
                 'Stephen', 'Ms. Garrison', 'Mrs. Garrison']

lines.loc[lines.Character.isin(support_chars), 'Character'] = 'Support Character'

final_labels = ['Cartman', 'Stan', 'Kyle', 'Butters', 'Randy', 'Support Character']

lines_final = lines[lines.Character.isin(final_labels)]

In [85]:
lines_final.head(3)

|   | Season | Episode | Character | Line |
|---|--------|---------|-----------|------|
| 0 | 10 | 1 | Stan | You guys, you guys! Chef is going away. \n |
| 1 | 10 | 1 | Kyle | Going away? For how long?\n |
| 2 | 10 | 1 | Stan | Forever.\n |


### Corpus

The corpus is established here, with steps to convert everything to lowercase and strip punctuation from both ends of each word. Punctuation inside a word, such as the apostrophe in a contraction, is kept.

In [86]:
import re, string

corpus = lines_final.Line.tolist()

for i, line in enumerate(corpus):
    # Drop the trailing newline and lowercase the line
    line = re.sub(r'\n', '', line.rstrip()).lower()
    # Trim punctuation from both ends of each word
    corpus[i] = " ".join(word.strip(string.punctuation) for word in line.split())
    
corpus[:3]

['you guys you guys chef is going away', 'going away for how long', 'forever']
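The cleaning step above can also be written as a small reusable function. This is a minimal sketch, not the notebook's own code: the name `clean_line` and the sample lines are hypothetical, and unlike the loop above it also drops tokens that consist only of punctuation. It shows that `str.strip(string.punctuation)` trims both ends of a word while leaving contractions intact.

```python
import string

def clean_line(line):
    """Lowercase a line and trim punctuation from both ends of each word."""
    words = line.strip().lower().split()
    stripped = (word.strip(string.punctuation) for word in words)
    # Drop tokens that were pure punctuation (e.g. a lone "-");
    # this is an addition, the original loop keeps them as empty strings
    return " ".join(word for word in stripped if word)

# Hypothetical sample lines in the same shape as the Line column
sample = ["You guys, you guys! Chef is going away. \n", "Don't go - please!\n"]
cleaned = [clean_line(line) for line in sample]
print(cleaned)
```

Because the apostrophe in "don't" sits inside the word, it survives the cleaning, which matters for the contractions in the stop word list.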

### Stop words

The list of stop words determined in notebook 2 needs to be instantiated.

In [87]:
sw = ['you', 'the', 'i', 'to', 'a', 'and', 'it', 'that',
      'we', 'is', 'of', 'what', 'this', 'in', 'have', 'all',
      'just', 'do', 'for', "don't", 'are', 'be', "it's", 'get',
      'but', 'with', 'know', 'so', 'go', 'can', 'right', 'out',
      'like', 'was', 'gonna', "that's", 'here', 'up', 'about',
      "you're", 'he', 'come', 'they', 'okay', 'see', 'our',
      'how', 'if', 'think', 'at', 'us', "can't", "we're", 'got',
      'there', 'look', 'did', 'why', 'then', 'him', 'time',
      'back', 'one', 'going', 'want', 'who', "he's", 'from',
      'some', 'his', 'will', 'need', 'make', 'take', 'yes',
      "let's", 'because', 'them', 'has', 'as', "what's",
      "there's", 'too', 'an', 'when', 'been', 'where', 'or',
      'were', 'had', "they're", 'her', 'by', 'their', 'those',
      'she', 'these', 'any', 'into', "we've", 'two', 'does',
      'much', 'being', 'am', 'than']

### Data splits

In [88]:
X = pd.Series(corpus)
y = lines_final.Character

state = 33

In [89]:
from sklearn.model_selection import train_test_split

X_train, holdout_X, y_train, holdout_y = train_test_split(X, y, test_size=0.20, stratify=y, random_state=state)

Now we have a final holdout set, which is a portion of the corpus with its corresponding character labels, and a training set. The training corpus will be used to create the word vectors and train the models, and the final model will then be tested on the holdout set.

To evaluate models while they are being developed, the training data needs to be split again into training and validation sets. Here I use a smaller test_size when creating the validation set to maximize the data used to train the model.

In [90]:
X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.12, stratify=y_train, random_state=state)
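The two splits compound: 20% of the data is held out first, and 12% of the remaining 80% becomes the validation set, so roughly 70.4% of the lines are used for fitting, 9.6% for validation, and 20% for the holdout. The sketch below illustrates this on synthetic stand-in data (the `X_demo` and `y_demo` arrays and the other names are hypothetical, not the notebook's variables).

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 samples, two balanced classes
X_demo = np.arange(1000)
y_demo = np.array([0, 1] * 500)

# First split: set aside 20% as the holdout set
X_rest, X_hold, y_rest, y_hold = train_test_split(
    X_demo, y_demo, test_size=0.20, stratify=y_demo, random_state=33)

# Second split: 12% of the remaining data becomes the validation set
X_fit, X_valid, y_fit, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.12, stratify=y_rest, random_state=33)

sizes = {'train': len(X_fit), 'validation': len(X_valid), 'holdout': len(X_hold)}
print(sizes)

# Stratification keeps the class balance in each piece
print(y_hold.mean(), y_valid.mean())
```

Stratifying both splits on the labels keeps the character proportions similar across the three sets, which matters when some characters have far fewer lines than others.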

<b>Establishing the CountVectorizer</b><br>
Now the word vectors are created with CountVectorizer and are fit to the remaining training data.

In [91]:
from sklearn.feature_extraction.text import CountVectorizer

cv = CountVectorizer(ngram_range=(1,1), stop_words=sw)

cv_train = cv.fit_transform(X_tr)
cv_val = cv.transform(X_val)
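scikit-learn warns when the stop word list is inconsistent with the vectorizer's tokenization. With the default `token_pattern`, apostrophes are not part of a token, so a contraction such as "don't" is tokenized as "don" and never matches the stop word "don't". The sketch below illustrates this on a toy example and shows one possible workaround: a token pattern that keeps internal apostrophes. The regular expression and the toy data are assumptions for illustration, not the notebook's configuration, and changing the tokenization would change the scores reported here.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy documents and stop words (hypothetical, for illustration only)
docs = ["don't go there", "it's going away"]
stop_demo = ["don't", "it's", "there"]

# Default tokenization: contractions are split at the apostrophe,
# so "don't" and "it's" are not removed (this fit emits a UserWarning)
default_cv = CountVectorizer(stop_words=stop_demo)
default_cv.fit(docs)
default_vocab = sorted(default_cv.vocabulary_)

# Assumed alternative pattern: tokens of 2+ characters that may
# contain apostrophes in the middle
apostrophe_cv = CountVectorizer(stop_words=stop_demo,
                                token_pattern=r"(?u)\b\w[\w']*\w\b")
apostrophe_cv.fit(docs)
apostrophe_vocab = sorted(apostrophe_cv.vocabulary_)

print(default_vocab)
print(apostrophe_vocab)
```

In the default case the fragments "don" and "it" end up in the vocabulary, while the apostrophe-aware pattern lets the contractions be filtered as intended.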



<b>Using Naive Bayes with the vectorizer</b>

In [92]:
from sklearn.naive_bayes import MultinomialNB

nb = MultinomialNB(alpha=0.25)

_ = nb.fit(cv_train, y_tr)

In [93]:
from sklearn.metrics import accuracy_score

nb_tr_pred = nb.predict(cv_train)
nb_val_pred = nb.predict(cv_val)

print('Training accuracy is: ', accuracy_score(y_tr, nb_tr_pred))
print('Validation accuracy is: ', accuracy_score(y_val, nb_val_pred))

Training accuracy is:  0.6250250410673505
Validation accuracy is:  0.4283196239717979


In [82]:
from sklearn.model_selection import cross_val_score

cross_val_score(nb, cv_train, y_tr, cv=5)

array([0.42362362, 0.41389668, 0.41583166, 0.41823647, 0.41002004])
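One caveat with cross-validating on `cv_train`: the vectorizer was fit on all of `X_tr`, so each fold's vocabulary includes words from its own test split. The effect is probably small here, but wrapping the vectorizer and classifier in a pipeline lets the vocabulary be refit inside every fold. The following is a minimal sketch on a toy corpus; the texts and labels are made up, and in the notebook the raw `X_tr` and `y_tr` would be passed instead.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus (hypothetical lines and labels)
texts = ["screw you guys", "oh my god", "screw you hippies", "oh my god dude",
         "respect my authority", "dude this is weak", "screw you guys home", "oh dude"]
labels = ["Cartman", "Stan", "Cartman", "Stan", "Cartman", "Stan", "Cartman", "Stan"]

# The vectorizer is refit on the training part of each fold
pipe = make_pipeline(CountVectorizer(), MultinomialNB(alpha=0.25))
scores = cross_val_score(pipe, texts, labels, cv=2)
print(scores)
```

The same pipeline object can later be fit on the full training text and applied directly to raw holdout lines, which avoids keeping the vectorizer and model in sync by hand.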

<b>Results for different n-gram ranges</b>

| ngram_range | Train set score | Validation set score |
|-------------|-----------------|----------------------|
|  (1,1)      | 0.58            | 0.42                 |
| (1,2)       | 0.74            | 0.41                 |
| (1,3)       | 0.78            | 0.40                 |
| (1,5)       | 0.79            | 0.40                 |
| (1,7)       | 0.79            | 0.399                |
| (1,10)      | 0.79            | 0.399                |
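A table like the one above can be produced with a loop over n-gram ranges. This is a sketch of one way to do it, not necessarily how the figures above were generated: the helper name `score_ngram_ranges` and the toy data are hypothetical, and `alpha=0.25` is assumed to match the model fitted earlier.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB

def score_ngram_ranges(train_text, train_y, val_text, val_y, ranges, stop_words=None):
    """Fit a vectorizer and Naive Bayes model per n-gram range; return accuracy rows."""
    rows = []
    for ngram_range in ranges:
        vec = CountVectorizer(ngram_range=ngram_range, stop_words=stop_words)
        train_vecs = vec.fit_transform(train_text)
        val_vecs = vec.transform(val_text)
        model = MultinomialNB(alpha=0.25).fit(train_vecs, train_y)
        train_acc = accuracy_score(train_y, model.predict(train_vecs))
        val_acc = accuracy_score(val_y, model.predict(val_vecs))
        rows.append((ngram_range, train_acc, val_acc))
    return rows

# Toy stand-in data; in the notebook these would be X_tr, y_tr, X_val, y_val and sw
train_text = ["screw you guys", "oh my god dude", "screw you hippies", "dude this is weak"]
train_y = ["Cartman", "Stan", "Cartman", "Stan"]
val_text = ["screw you", "oh dude"]
val_y = ["Cartman", "Stan"]

rows = score_ngram_ranges(train_text, train_y, val_text, val_y, [(1, 1), (1, 2)])
for ngram_range, train_acc, val_acc in rows:
    print(ngram_range, round(train_acc, 2), round(val_acc, 2))
```

The pattern in the table, where the training score rises with longer n-grams while the validation score stays flat or drops, is the usual sign that the extra features are being memorized rather than generalizing.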

<b>Tf-idf</b>

In [95]:
from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer(stop_words=sw)

tv_train = tv.fit_transform(X_tr)
tv_val = tv.transform(X_val)



In [96]:
_ = nb.fit(tv_train, y_tr)

In [97]:
nb_tr_pred = nb.predict(tv_train)
nb_val_pred = nb.predict(tv_val)

print('Training accuracy is: ', accuracy_score(y_tr, nb_tr_pred))
print('Validation accuracy is: ', accuracy_score(y_val, nb_val_pred))

Training accuracy is:  0.6214191273688849
Validation accuracy is:  0.4203877790834313
