# Baseline Model
## Game of Thrones Text Classification
### T. P. Goter
### W266 Final Project
### Fall 2019

This notebook builds a Naive Bayes baseline for text classification using training data generated from the Game of Thrones books. For each model, a simple grid search with cross-validation empirically determines the best value of the smoothing parameter (alpha). Both unigram and bigram models are considered, as is tf-idf vectorization in place of simple word-count vectorization. All four models perform about the same, with precision, recall, and F1 scores near 0.70. Class 5 (i.e., Book 5) is the most easily predicted. This is not surprising, given that books four and five each focus on a subset of the characters, so word counts of character names are likely better indicators for these books. With roughly 70% accuracy as our baseline score on the development set, there is plenty of room for improvement with BERT and the UDA techniques.
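
To make the role of alpha concrete, here is a minimal plain-Python sketch of additive smoothing for a single class, following the formula P(w | c) = (count(w, c) + alpha) / (total(c) + alpha * |V|) that multinomial Naive Bayes uses. The toy counts, the vocabulary, and the helper name `smoothed_log_probs` are assumptions made for illustration and do not come from the notebook.

```python
import math

def smoothed_log_probs(word_counts, vocabulary, alpha):
    """Return log P(word | class) with additive smoothing.

    P(w | c) = (count(w, c) + alpha) / (total_count(c) + alpha * |V|)
    """
    total = sum(word_counts.get(w, 0) for w in vocabulary)
    denom = total + alpha * len(vocabulary)
    return {w: math.log((word_counts.get(w, 0) + alpha) / denom)
            for w in vocabulary}

# Hypothetical word counts for one class (illustration only).
vocabulary = ["jon", "snow", "dragon", "wall"]
counts_book = {"jon": 6, "snow": 3, "wall": 1}  # "dragon" was never seen

for alpha in (0.01, 1.0, 10.0):
    log_probs = smoothed_log_probs(counts_book, vocabulary, alpha)
    probs = {w: round(math.exp(lp), 3) for w, lp in log_probs.items()}
    print(alpha, probs)
```

Small alpha values keep the estimates close to the raw frequencies, while large values pull every word toward a uniform distribution, which is why the grid search spans several orders of magnitude.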

In [2]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# import the tokenization module
from utils import tokenization

## Load Data

In [3]:
# Read the data into dataframes stored in a dictionary
dfs = {}
for data in 'train dev test'.split():
    dfs[data] = pd.read_pickle('Data/' + data + '/' + data + '.pkl')


In [4]:
def create_basic_model(df_train, df_dev, train_size=None, bigram=False, tfidf=False, tokenize=False):
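    '''
    Function to train and evaluate a multinomial naive bayes model.
    :param: df_train: DataFrame of training data with text, tokens and label columns
    :param: df_dev: DataFrame of development data with text, tokens and label columns
    :param: train_size: int or None, number of training rows to sample (None uses all rows)
    :param: bigram: Boolean, use unigrams and bigrams
    :param: tfidf: Boolean, use tfidf weighting during feature vectorization
    :param: tokenize: Boolean, use WordPiece tokenization with BERT input vocabulary
    :return: feature log probabilities of the best estimator
    '''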
    # Set up a range of alphas to test
    alphas = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
     
    # Get data and labels from dataframe
    if tokenize: 
        if train_size:
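            # Sample from the full training set, then use the sampled rows
            # for both the text and the labels so they stay aligned
            reduced_df = df_train.sample(train_size)
            train_data = reduced_df.tokens.map(lambda x: " ".join(x))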
            train_y = reduced_df.label
        else:      
            train_data = df_train.tokens.map(lambda x: " ".join(x))
            train_y = df_train.label
        dev_data = df_dev.tokens.map(lambda x: " ".join(x))
    else:
        if train_size:
            reduced_df = df_train.sample(train_size)
            train_data = reduced_df.text
            train_y = reduced_df.label
        else:
            train_data = df_train.text
            train_y = df_train.label
        dev_data = df_dev.text
    
    
    dev_y = df_dev.label
    
    # Instantiate the vectorizer (raw counts or tf-idf weights)
    if bigram:
        if tfidf:
            vectorizer = TfidfVectorizer(ngram_range=(1,2))
        else:
            vectorizer = CountVectorizer(ngram_range=(1,2))
    else:
        if tfidf:
            vectorizer = TfidfVectorizer()
        else:
            vectorizer = CountVectorizer()
    
    # Generate the Feature Vectors
    train_X = vectorizer.fit_transform(train_data)    
    
    # Transform the dev data using the vocabulary learned from the training data
    dev_X = vectorizer.transform(dev_data)
       
    # Instantiate the classifier
    clf = MultinomialNB()
    
    # Run a grid search over alpha (smoothing) values to determine the best one
    gs_clf = GridSearchCV(clf, param_grid=alphas, cv=4, return_train_score=True)
    gs_clf.fit(train_X, train_y)
    
    # Display the best parameter
    print(50 * "=")
    print("The best alpha value was determined to be {}".format(gs_clf.best_params_['alpha']))
    print(50 * "=")

    # Let's make some predictions using the best classifier
    y_pred = gs_clf.best_estimator_.predict(dev_X)
       
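    # classification_report expects (y_true, y_pred).
    # NOTE: the reports shown below for this function were generated with the
    # arguments reversed, so their precision and recall columns are swapped
    # and "support" counts predictions rather than true labels.
    print(classification_report(dev_y, y_pred))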
    print(confusion_matrix(dev_y, y_pred))
    
    return gs_clf.best_estimator_.feature_log_prob_
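
The function above fits the vectorizer on the full training set before the grid search, so every cross-validation fold shares one vocabulary (and, with tf-idf, one set of idf weights). One alternative, sketched below and not what the notebook does, places the vectorizer inside the `Pipeline` class that is already imported, so that it is refit on each fold and its options can be searched alongside alpha. The toy corpus, its labels, and the step names `vect` and `nb` are invented for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus standing in for the book passages (illustration only).
texts = [
    "jon snow guards the wall", "the wall is cold",
    "jon rides north of the wall", "snow falls on the wall",
    "the dragon queen sails", "dragons burn the fleet",
    "the queen feeds her dragons", "a dragon circles the queen",
]
labels = [1, 1, 1, 1, 5, 5, 5, 5]

# The vectorizer is a pipeline step, so it is refit inside every CV fold.
pipe = Pipeline([
    ("vect", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Search the n-gram range and the smoothing parameter together.
param_grid = {
    "vect__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.1, 0.5, 1.0],
}
search = GridSearchCV(pipe, param_grid=param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
print(search.predict(["the wall of snow", "her dragon"]))
```

With this layout, swapping `CountVectorizer` for `TfidfVectorizer` is a one-line change to the pipeline definition.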


## Basic Unigram Model

Metrics we will consider are precision, recall, and F1-score. Remember that:

- Precision: Fraction of items labeled as class A that are actually class A. It is lowered by false positives.
- Recall: Fraction of items that are actually class A that are labeled as class A. It is lowered by false negatives.
- F1 Score: Harmonic mean of precision and recall.
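
As a small worked example of these definitions, the sketch below derives all three metrics from a confusion matrix. It assumes the scikit-learn convention that rows are true classes and columns are predicted classes; the 3-class matrix and the helper name `per_class_metrics` are hypothetical.

```python
def per_class_metrics(cm):
    """Compute (precision, recall, f1) per class from a confusion matrix.

    Assumes rows are true classes and columns are predicted classes.
    """
    n = len(cm)
    results = []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum
        actual_k = sum(cm[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        results.append((precision, recall, f1))
    return results

# Hypothetical confusion matrix for three classes (illustration only).
cm = [[8, 1, 1],
      [3, 6, 1],
      [0, 3, 7]]

for k, (p, r, f1) in enumerate(per_class_metrics(cm)):
    print("class {}: precision={:.2f} recall={:.2f} f1={:.2f}".format(k, p, r, f1))

# Accuracy is the diagonal (correct predictions) over the total count.
accuracy = sum(cm[k][k] for k in range(len(cm))) / sum(map(sum, cm))
print("accuracy={:.2f}".format(accuracy))
```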


In [4]:
uni_probs = create_basic_model(dfs['train'], dfs['dev'])

The best alpha value was determined to be 0.5
              precision    recall  f1-score   support

           1       0.79      0.70      0.74       481
           2       0.58      0.70      0.63       407
           3       0.63      0.65      0.64       577
           4       0.76      0.70      0.73       446
           5       0.78      0.77      0.77       590

   micro avg       0.70      0.70      0.70      2501
   macro avg       0.71      0.70      0.70      2501
weighted avg       0.71      0.70      0.71      2501

[[335  34  39   7   9]
 [ 75 284  70  26  34]
 [ 48  66 375  46  57]
 [  8  14  40 314  35]
 [ 15   9  53  53 455]]


## Bigram Model
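
Setting `ngram_range=(1,2)` means the feature set contains both single tokens and adjacent token pairs. The following plain-Python sketch illustrates the idea on an already tokenized sentence; the helper `word_ngrams` is hypothetical and skips the lowercasing and tokenization that the vectorizer performs first.

```python
def word_ngrams(tokens, ngram_range=(1, 2)):
    """Return all word n-grams with n in [min_n, max_n], joined by spaces."""
    min_n, max_n = ngram_range
    grams = []
    for n in range(min_n, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

tokens = "winter is coming".split()
print(word_ngrams(tokens))
# ['winter', 'is', 'coming', 'winter is', 'is coming']
```

Because every bigram becomes its own feature, the vocabulary grows considerably, which is one reason the best alpha can differ from the unigram case.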

In [5]:
bi_probs = create_basic_model(dfs['train'], dfs['dev'], bigram=True)

The best alpha value was determined to be 0.1
              precision    recall  f1-score   support

           1       0.74      0.73      0.74       432
           2       0.54      0.69      0.61       384
           3       0.72      0.65      0.68       660
           4       0.74      0.75      0.74       409
           5       0.80      0.76      0.78       616

   micro avg       0.71      0.71      0.71      2501
   macro avg       0.71      0.71      0.71      2501
weighted avg       0.72      0.71      0.71      2501

[[315  45  43   4  17]
 [ 65 265  97  26  36]
 [ 33  50 426  28  55]
 [  9  13  42 305  42]
 [ 10  11  52  46 466]]


## Unigram Model with Tf-Idf Weighting
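
For intuition about what tf-idf weighting changes, the sketch below reimplements the default `TfidfVectorizer` weighting in plain Python: raw term counts multiplied by a smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalization of each document. It uses whitespace tokenization for brevity, and the example sentences and the helper name `tfidf_vectors` are made up for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Tf-idf with smoothed idf and L2 normalization per document."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        weights = {t: c * idf[t] for t, c in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors

# Hypothetical documents (illustration only).
docs = ["the wolf howled", "the dragon roared", "the wolf and the dragon"]
vecs = tfidf_vectors(docs)
for term, weight in sorted(vecs[0].items(), key=lambda kv: -kv[1]):
    print("{:8s} {:.3f}".format(term, weight))
```

Terms that appear in every document, such as "the", receive the lowest weight, while terms unique to one document receive the highest.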

In [6]:
uni_tfidf_probs = create_basic_model(dfs['train'], dfs['dev'], tfidf=True)

The best alpha value was determined to be 0.1
              precision    recall  f1-score   support

           1       0.71      0.76      0.74       394
           2       0.50      0.71      0.59       347
           3       0.74      0.58      0.65       756
           4       0.68      0.77      0.72       363
           5       0.81      0.74      0.77       641

   micro avg       0.69      0.69      0.69      2501
   macro avg       0.69      0.71      0.69      2501
weighted avg       0.71      0.69      0.70      2501

[[301  33  70   4  16]
 [ 54 245 122  23  45]
 [ 24  46 440  24  58]
 [  6  13  62 280  50]
 [  9  10  62  32 472]]


## Bigram Model with Tf-Idf Weighting

In [7]:
bi_tfidf_probs = create_basic_model(dfs['train'], dfs['dev'], tfidf=True, bigram=True)

The best alpha value was determined to be 0.01
              precision    recall  f1-score   support

           1       0.69      0.74      0.72       396
           2       0.50      0.67      0.58       365
           3       0.74      0.61      0.67       715
           4       0.70      0.77      0.73       373
           5       0.79      0.71      0.75       652

   micro avg       0.69      0.69      0.69      2501
   macro avg       0.69      0.70      0.69      2501
weighted avg       0.71      0.69      0.69      2501

[[294  48  53   3  26]
 [ 59 246 117  20  47]
 [ 25  45 438  22  62]
 [ 10  15  47 286  53]
 [  8  11  60  42 464]]


## Reduced Training Set Sizes
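
The runs in this section draw a simple random sample of the training rows, so very small samples can end up with uneven class counts and the draw changes on every run. One possible refinement, shown here only as a sketch and not used in this notebook, is a seeded sample that takes the same number of rows from each class. The helper name `stratified_sample` and the toy labels are hypothetical.

```python
import random
from collections import Counter, defaultdict

def stratified_sample(labels, n_per_class, seed=0):
    """Return sorted row indices with n_per_class rows drawn from each label.

    Raises ValueError if a class has fewer than n_per_class rows.
    """
    rng = random.Random(seed)  # fixed seed makes the draw repeatable
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    chosen = []
    for label in sorted(by_label):
        chosen.extend(rng.sample(by_label[label], n_per_class))
    return sorted(chosen)

# Hypothetical labels for 50 rows, 10 per book (illustration only).
labels = [1, 2, 3, 4, 5] * 10
idx = stratified_sample(labels, n_per_class=4)
print(len(idx), Counter(labels[i] for i in idx))
```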

In [14]:
uni_probs = create_basic_model(dfs['train'], dfs['dev'], train_size=12000)

The best alpha value was determined to be 0.5
              precision    recall  f1-score   support

           1       0.79      0.70      0.74       478
           2       0.57      0.69      0.62       403
           3       0.66      0.65      0.65       599
           4       0.76      0.72      0.74       435
           5       0.77      0.77      0.77       586

   micro avg       0.71      0.71      0.71      2501
   macro avg       0.71      0.71      0.71      2501
weighted avg       0.71      0.71      0.71      2501

[[335  31  39   8  11]
 [ 78 277  73  22  39]
 [ 42  67 389  43  51]
 [  8  16  39 314  34]
 [ 15  12  59  48 451]]


In [15]:
uni_probs = create_basic_model(dfs['train'], dfs['dev'], train_size=5000)

The best alpha value was determined to be 0.5
              precision    recall  f1-score   support

           1       0.72      0.68      0.70       449
           2       0.49      0.66      0.56       362
           3       0.66      0.60      0.63       647
           4       0.69      0.68      0.68       414
           5       0.76      0.71      0.73       629

   micro avg       0.66      0.66      0.66      2501
   macro avg       0.66      0.67      0.66      2501
weighted avg       0.68      0.66      0.67      2501

[[306  32  52  10  24]
 [ 67 240 103  28  51]
 [ 42  48 390  48  64]
 [ 17  17  50 282  45]
 [ 17  25  52  46 445]]


In [16]:
uni_probs = create_basic_model(dfs['train'], dfs['dev'], train_size=2000)

The best alpha value was determined to be 0.5
              precision    recall  f1-score   support

           1       0.56      0.68      0.62       353
           2       0.42      0.49      0.45       418
           3       0.64      0.51      0.56       747
           4       0.58      0.61      0.59       390
           5       0.65      0.64      0.65       593

   micro avg       0.58      0.58      0.58      2501
   macro avg       0.57      0.58      0.57      2501
weighted avg       0.58      0.58      0.58      2501

[[239  69  70  16  30]
 [ 54 206 130  36  63]
 [ 32  69 378  45  68]
 [ 11  36  75 237  52]
 [ 17  38  94  56 380]]


In [17]:
uni_probs = create_basic_model(dfs['train'], dfs['dev'], train_size=200)

The best alpha value was determined to be 0.1
              precision    recall  f1-score   support

           1       0.41      0.38      0.39       465
           2       0.33      0.29      0.31       557
           3       0.29      0.37      0.32       464
           4       0.36      0.40      0.38       372
           5       0.49      0.44      0.46       643

   micro avg       0.38      0.38      0.38      2501
   macro avg       0.38      0.38      0.37      2501
weighted avg       0.38      0.38      0.38      2501

[[175  88  66  32  63]
 [ 99 162  89  35 104]
 [ 96 128 171  80 117]
 [ 37  90  59 150  75]
 [ 58  89  79  75 284]]




In [18]:
uni_probs = create_basic_model(dfs['train'], dfs['dev'], train_size=20)



The best alpha value was determined to be 1.0
              precision    recall  f1-score   support

           1       0.01      0.18      0.01        17
           2       0.91      0.20      0.32      2272
           3       0.06      0.26      0.10       145
           4       0.03      0.36      0.06        39
           5       0.01      0.18      0.02        28

   micro avg       0.20      0.20      0.20      2501
   macro avg       0.20      0.23      0.10      2501
weighted avg       0.83      0.20      0.30      2501

[[  3 383  22  10   6]
 [  3 446  27   5   8]
 [  6 538  37   7   4]
 [  2 356  34  14   5]
 [  3 549  25   3   5]]




## Use Consistent Training Data with UDA Models

In [24]:
def create_basic_bert_model(df_train, df_dev, bigram=False, tfidf=False):
    '''
    Function to train and evaluate a multinomial naive bayes model on pre-tokenized (WordPiece) sequences.
    :param: df_train: DataFrame of training data with seq and label columns
    :param: df_dev: DataFrame of development data with tokens and label columns
    :param: bigram: Boolean, use unigrams and bigrams
    :param: tfidf: Boolean, use tfidf weighting during feature vectorization
    '''
    # Set up a range of alphas to test
    alphas = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
     
    # Get data and labels from dataframe
    train_data = df_train.seq
    train_y = df_train.label.map(lambda x: str(x))

    dev_data = df_dev.tokens.map(lambda x: " ".join(x))
    dev_y = df_dev.label
    
    # Instantiate the vectorizer (raw counts or tf-idf weights)
    if bigram:
        if tfidf:
            vectorizer = TfidfVectorizer(ngram_range=(1,2))
        else:
            vectorizer = CountVectorizer(ngram_range=(1,2))
    else:
        if tfidf:
            vectorizer = TfidfVectorizer()
        else:
            vectorizer = CountVectorizer()
    
    # Generate the Feature Vectors
    train_X = vectorizer.fit_transform(train_data)    
    
    # Transform the dev data using the vocabulary learned from the training data
    dev_X = vectorizer.transform(dev_data)
       
    # Instantiate the classifier
    clf = MultinomialNB()
    
    # Run a grid search over alpha (smoothing) values to determine the best one
    gs_clf = GridSearchCV(clf, param_grid=alphas, cv=4, return_train_score=True)
    gs_clf.fit(train_X, train_y)
    
    # Display the best parameter
    print(50 * "=")
    print("The best alpha value was determined to be {}".format(gs_clf.best_params_['alpha']))
    print(50 * "=")

    # Let's make some predictions using the best classifier
    y_pred = gs_clf.best_estimator_.predict(dev_X)
       
    print(classification_report(dev_y,y_pred))
    print(confusion_matrix(dev_y, y_pred))
    print("Accuracy determined to be: {:.3f}".format(np.mean(y_pred==dev_y)))
    
    return gs_clf.best_estimator_.feature_log_prob_

In [19]:
df_20 = pd.read_pickle('Data/proc_data/GoT/train_20/train_20.pkl')
df_20.head(20)

Unnamed: 0,seq,label
0,[CLS] you ’ ve been hurt . he took her hands i...,1
1,[CLS] pinch ##bot ##tom pet ##to and sl ##oe -...,2
2,[CLS] is it his fault the old man died ? stan ...,2
3,[CLS] the battle ##ments br ##istle ##d with s...,1
4,"[CLS] for all his ru ##tting , he has not prov...",5
5,[CLS] he called his servants and sent them run...,1
6,"[CLS] they have no trees , she realized . bra ...",4
7,[CLS] lord wal ##der might well con ##st ##ru ...,3
8,"[CLS] still fighting on the fist , amidst heav...",3
9,[CLS] twenty mule ##s awaited them within the ...,4
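
When reading results for these sequences, note that the vectorizers apply their own default tokenization on top of the WordPiece output: text is lowercased and split with the pattern `(?u)\b\w\w+\b`, which keeps only runs of two or more word characters. The sketch below applies that same pattern with the standard `re` module to show the effect; the helper name `default_tokens` is hypothetical.

```python
import re

# Default token pattern of CountVectorizer and TfidfVectorizer:
# runs of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")

def default_tokens(text):
    """Lowercase the text and split it with the default token pattern."""
    return TOKEN_PATTERN.findall(text.lower())

seq = "[CLS] pinch ##bot ##tom pet ##to and sl ##oe"
print(default_tokens(seq))
# ['cls', 'pinch', 'bot', 'tom', 'pet', 'to', 'and', 'sl', 'oe']
```

The `##` continuation markers and the brackets around `[CLS]` are dropped, so word pieces such as `##to` become indistinguishable from the standalone word `to`, and punctuation and single-character tokens are discarded entirely.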


In [25]:
uni_probs = create_basic_bert_model(df_20, dfs['dev'])

The best alpha value was determined to be 1.0
              precision    recall  f1-score   support

           1       0.35      0.25      0.29       424
           2       0.22      0.35      0.27       489
           3       0.26      0.31      0.28       592
           4       0.20      0.13      0.16       411
           5       0.33      0.24      0.28       585

   micro avg       0.26      0.26      0.26      2501
   macro avg       0.27      0.26      0.26      2501
weighted avg       0.27      0.26      0.26      2501

[[105 116 113  40  50]
 [ 50 173 141  48  77]
 [ 64 169 184  65 110]
 [ 32 128 142  55  54]
 [ 47 197 129  70 142]]
Accuracy determined to be: 0.263


In [26]:
df_200 = pd.read_pickle('Data/proc_data/GoT/train_200/train_200.pkl')
uni_probs_200 = create_basic_bert_model(df_200, dfs['dev'])

The best alpha value was determined to be 0.1
              precision    recall  f1-score   support

           1       0.39      0.44      0.41       424
           2       0.30      0.35      0.32       489
           3       0.37      0.30      0.33       592
           4       0.44      0.35      0.39       411
           5       0.47      0.51      0.49       585

   micro avg       0.39      0.39      0.39      2501
   macro avg       0.39      0.39      0.39      2501
weighted avg       0.39      0.39      0.39      2501

[[187  97  60  19  61]
 [ 92 171 116  27  83]
 [108 116 177  68 123]
 [ 42  95  60 145  69]
 [ 54  95  66  71 299]]
Accuracy determined to be: 0.391


In [27]:
df_2000 = pd.read_pickle('Data/proc_data/GoT/train_2000/train_2000.pkl')
uni_probs_2000 = create_basic_bert_model(df_2000, dfs['dev'])

The best alpha value was determined to be 2.0
              precision    recall  f1-score   support

           1       0.54      0.75      0.63       424
           2       0.55      0.34      0.42       489
           3       0.51      0.52      0.51       592
           4       0.56      0.66      0.60       411
           5       0.68      0.61      0.64       585

   micro avg       0.57      0.57      0.57      2501
   macro avg       0.57      0.58      0.56      2501
weighted avg       0.57      0.57      0.56      2501

[[320  31  51  10  12]
 [118 167 117  42  45]
 [ 85  54 308  80  65]
 [ 18  23  55 270  45]
 [ 47  28  74  80 356]]
Accuracy determined to be: 0.568


In [28]:
df_5000 = pd.read_pickle('Data/proc_data/GoT/train_5000/train_5000.pkl')
uni_probs_5000 = create_basic_bert_model(df_5000, dfs['dev'])

The best alpha value was determined to be 1.0
              precision    recall  f1-score   support

           1       0.64      0.75      0.69       424
           2       0.60      0.45      0.51       489
           3       0.55      0.63      0.58       592
           4       0.64      0.69      0.67       411
           5       0.75      0.65      0.69       585

   micro avg       0.63      0.63      0.63      2501
   macro avg       0.63      0.63      0.63      2501
weighted avg       0.63      0.63      0.63      2501

[[318  39  48   7  12]
 [ 83 219 119  33  35]
 [ 58  62 372  48  52]
 [ 14  15  66 285  31]
 [ 25  32  76  72 380]]
Accuracy determined to be: 0.629


In [29]:
df_12000 = pd.read_pickle('Data/proc_data/GoT/train_12000/train_12000.pkl')
uni_probs_12000 = create_basic_bert_model(df_12000, dfs['dev'])

The best alpha value was determined to be 1.0
              precision    recall  f1-score   support

           1       0.68      0.75      0.71       424
           2       0.63      0.49      0.55       489
           3       0.58      0.67      0.62       592
           4       0.67      0.73      0.70       411
           5       0.78      0.68      0.73       585

   micro avg       0.66      0.66      0.66      2501
   macro avg       0.67      0.67      0.66      2501
weighted avg       0.67      0.66      0.66      2501

[[316  40  43  11  14]
 [ 75 242 108  36  28]
 [ 42  66 395  44  45]
 [ 11  14  55 302  29]
 [ 20  22  84  59 400]]
Accuracy determined to be: 0.662
