# Baseline Model
## Game of Thrones Text Classification
### T. P. Goter
### W266 Final Project
### Fall 2019

This notebook builds a Naive Bayes baseline for text classification using training data generated from the Game of Thrones novels. For each model, a simple grid search with cross-validation empirically determines the best smoothing parameter (i.e., alpha). Both unigram and bigram models are considered, as is tf-idf vectorization in place of simple word-count vectorization. The results show that all four models perform about the same, with precision, recall, and F1 scores near 0.70. Class 5 (i.e., Book 5) is the most easily predicted. This is not surprising, given that books four and five each focus on a subset of the characters, so word counts of character names are likely better indicators for these books. With 70% accuracy as our baseline score on the development set, there is plenty of room for further improvement with BERT and the UDA techniques.
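For reference, the smoothing parameter enters the multinomial Naive Bayes likelihood estimate as shown below. This is the standard additive-smoothing form that scikit-learn documents for `MultinomialNB`; the symbols are notation introduced here for illustration and do not appear elsewhere in the notebook.

```latex
\hat{\theta}_{ci} = \frac{N_{ci} + \alpha}{N_{c} + \alpha n}
```

Here $N_{ci}$ is the total count of feature $i$ across training samples of class $c$, $N_{c} = \sum_{i} N_{ci}$, and $n$ is the number of features. Setting $\alpha = 1$ gives Laplace smoothing, and $\alpha < 1$ gives Lidstone smoothing, so the small alphas in the grid search correspond to light smoothing of unseen words.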

In [4]:
# This tells matplotlib not to try opening a new window for each plot.
%matplotlib inline

# General libraries.
import re
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# SK-learn libraries for learning.
from sklearn.pipeline import Pipeline
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import GridSearchCV

# SK-learn libraries for evaluation.
from sklearn.metrics import confusion_matrix
from sklearn import metrics
from sklearn.metrics import classification_report

# SK-learn libraries for feature extraction from text.
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# import the tokenization module
from utils import tokenization

## Load Data

In [29]:
# Read the data into dataframes stored in a dictionary
dfs = {}
for data in 'train dev test'.split():
    dfs[data] = pd.read_pickle('Data/' + data + '/' + data + '.pkl')


In [118]:
def create_basic_model(df_train, df_dev, bigram=False, tfidf=False, tokenize=False):
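    '''
    Function to train and evaluate a multinomial naive Bayes model.
    :param df_train: DataFrame with the training text (or tokens) and labels
    :param df_dev: DataFrame with the development text (or tokens) and labels
    :param bigram: Boolean, use unigrams and bigrams
    :param tfidf: Boolean, use tfidf weighting during feature vectorization
    :param tokenize: Boolean, use the pre-tokenized `tokens` column (WordPiece tokenization with BERT input vocabulary)
    :return: per-class feature log probabilities of the best estimator
    '''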
    # Set up a range of alphas to test
    alphas = {'alpha': [0.0001, 0.001, 0.01, 0.1, 0.5, 1.0, 2.0, 10.0]}
     
    # Get data and labels from dataframe
    if tokenize: 
        train_data = df_train.tokens.map(lambda x: " ".join(x))
        dev_data = df_dev.tokens.map(lambda x: " ".join(x))
    else:
        train_data = df_train.text
        dev_data = df_dev.text
    
    train_y = df_train.label
    dev_y = df_dev.label
    
    # Instantiate the vectorizer (raw counts or tf-idf; unigrams only or unigrams + bigrams)
    if bigram:
        if tfidf:
            vectorizer = TfidfVectorizer(ngram_range=(1,2))
        else:
            vectorizer = CountVectorizer(ngram_range=(1,2))
    else:
        if tfidf:
            vectorizer = TfidfVectorizer()
        else:
            vectorizer = CountVectorizer()
    
    # Learn the vocabulary from the training data and generate its feature vectors
    train_X = vectorizer.fit_transform(train_data)

    # Vectorize the dev data using the vocabulary learned from the training data
    dev_X = vectorizer.transform(dev_data)
       
    # Instantiate the classifier
    clf = MultinomialNB()

    # Run a 5-fold cross-validated grid search over alpha (smoothing) values to find the best one
    gs_clf = GridSearchCV(clf, param_grid=alphas, cv=5, return_train_score=True)
    gs_clf.fit(train_X, train_y)
    
    # Display the best parameter
    print(50 * "=")
    print("The best alpha value was determined to be {}".format(gs_clf.best_params_['alpha']))
    print(50 * "=")

    # Let's make some predictions using the best classifier
    y_pred = gs_clf.best_estimator_.predict(dev_X)
       
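    # Note: classification_report expects (y_true, y_pred). With the predictions
    # passed first, as here, the precision and recall columns are swapped and
    # "support" counts predicted labels; confusion_matrix below uses the usual order.
    print(classification_report(y_pred, dev_y))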
    print(confusion_matrix(dev_y, y_pred))
    
    return gs_clf.best_estimator_.feature_log_prob_
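
`create_basic_model` fits the vectorizer on the full training set before the grid search, so each cross-validation fold is scored with a vocabulary that already includes its held-out rows. One possible alternative, sketched here under assumptions, is to wrap the vectorizer and classifier in a `Pipeline` so that the vocabulary is refit inside every fold and the n-gram range is searched together with alpha. The toy corpus, its labels, the step names `vec` and `nb`, and the reduced alpha grid are hypothetical and chosen only to keep the example small; this is not the configuration used for the results in this notebook.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus with two classes (stand-ins for book labels).
texts = [
    "the wolf howled in the north",
    "winter is coming to the north",
    "the wolf ran north through snow",
    "snow fell on the northern wall",
    "the dragon flew over the sea",
    "fire and blood across the sea",
    "the dragon burned the ships at sea",
    "ships sailed east with fire",
]
labels = [1, 1, 1, 1, 2, 2, 2, 2]

# Vectorizer and classifier in one estimator, so the vocabulary is
# learned only from the training portion of each cross-validation fold.
pipe = Pipeline([
    ("vec", CountVectorizer()),
    ("nb", MultinomialNB()),
])

# Search unigram vs. unigram+bigram features together with the smoothing value.
param_grid = {
    "vec__ngram_range": [(1, 1), (1, 2)],
    "nb__alpha": [0.01, 0.1, 1.0],
}

search = GridSearchCV(pipe, param_grid=param_grid, cv=2)
search.fit(texts, labels)

print(search.best_params_)
```

Swapping `CountVectorizer` for `TfidfVectorizer` in the `vec` step would give the tf-idf variant of the same search.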


## Basic Unigram Model

Metrics we will consider are precision, recall, and F1-score. Remember that:

- Precision: Fraction of items labeled as class A that are actually class A. A low value means many false positives.
- Recall: Fraction of items that are actually class A that were labeled as class A. A low value means many false negatives.
- F1 Score: Harmonic mean of precision and recall.
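
As a small worked illustration of these definitions, the sketch below computes all three metrics per class directly from a confusion matrix. The 2-class matrix is hypothetical and not taken from the results in this notebook; it assumes the scikit-learn convention that rows are true classes and columns are predicted classes.

```python
import numpy as np

# Hypothetical confusion matrix: rows = true class, columns = predicted class.
cm = np.array([[8, 2],
               [1, 9]])

# Correct predictions for each class sit on the diagonal.
tp = np.diag(cm)

# Precision: correct predictions divided by everything predicted as that class (column sums).
precision = tp / cm.sum(axis=0)

# Recall: correct predictions divided by everything truly in that class (row sums).
recall = tp / cm.sum(axis=1)

# F1: harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

for label, (p, r, f) in enumerate(zip(precision, recall, f1)):
    print("class {}: precision={:.2f} recall={:.2f} f1={:.2f}".format(label, p, r, f))
```

Reading down a column of the confusion matrix therefore gives the precision denominator, and reading across a row gives the recall denominator.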


In [119]:
uni_probs = create_basic_model(dfs['train'], dfs['dev'])

The best alpha value was determined to be 0.5
              precision    recall  f1-score   support

           1       0.79      0.73      0.76       427
           2       0.61      0.70      0.65       400
           3       0.64      0.62      0.63       608
           4       0.79      0.73      0.75       484
           5       0.76      0.79      0.77       582

   micro avg       0.71      0.71      0.71      2501
   macro avg       0.72      0.71      0.71      2501
weighted avg       0.72      0.71      0.71      2501

[[311  35  39   4   7]
 [ 45 280  85  29  20]
 [ 43  56 380  50  66]
 [ 10  17  37 351  31]
 [ 18  12  67  50 458]]


## Bigram Model

In [120]:
bi_probs = create_basic_model(dfs['train'], dfs['dev'], bigram=True)

The best alpha value was determined to be 0.1
              precision    recall  f1-score   support

           1       0.75      0.75      0.75       395
           2       0.57      0.67      0.61       390
           3       0.68      0.62      0.65       650
           4       0.74      0.75      0.74       439
           5       0.80      0.77      0.78       627

   micro avg       0.71      0.71      0.71      2501
   macro avg       0.70      0.71      0.71      2501
weighted avg       0.71      0.71      0.71      2501

[[296  39  45   8   8]
 [ 53 260  95  25  26]
 [ 29  55 403  36  72]
 [  7  23  49 328  39]
 [ 10  13  58  42 482]]


## Unigram Model with Tf-Idf Weighting

In [121]:
uni_tfidf_probs = create_basic_model(dfs['train'], dfs['dev'], tfidf=True)

The best alpha value was determined to be 0.1
              precision    recall  f1-score   support

           1       0.70      0.77      0.73       360
           2       0.54      0.71      0.61       346
           3       0.74      0.56      0.64       790
           4       0.69      0.80      0.74       384
           5       0.78      0.76      0.77       621

   micro avg       0.70      0.70      0.70      2501
   macro avg       0.69      0.72      0.70      2501
weighted avg       0.71      0.70      0.69      2501

[[276  36  73   2   9]
 [ 35 246 129  21  28]
 [ 27  36 440  25  67]
 [  9  18  66 307  46]
 [ 13  10  82  29 471]]


## Bigram Model with Tf-Idf Weighting

In [122]:
bi_tfidf_probs = create_basic_model(dfs['train'], dfs['dev'], tfidf=True, bigram=True)

The best alpha value was determined to be 0.01
              precision    recall  f1-score   support

           1       0.71      0.76      0.73       369
           2       0.53      0.64      0.58       377
           3       0.69      0.57      0.63       714
           4       0.69      0.77      0.73       395
           5       0.79      0.74      0.76       646

   micro avg       0.69      0.69      0.69      2501
   macro avg       0.68      0.70      0.69      2501
weighted avg       0.69      0.69      0.69      2501

[[281  47  54   2  12]
 [ 46 241 122  22  28]
 [ 26  54 410  28  77]
 [  6  19  63 306  52]
 [ 10  16  65  37 477]]
