## Naive Bayes & Random Forest Classification

This notebook trains Naive Bayes and Random Forest classifiers on the history of philosophy dataset, predicting the school of thought (`school`) that each sentence belongs to.

### Imports and Loading Data

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import ConfusionMatrixDisplay, classification_report
from sklearn.utils.class_weight import compute_class_weight, compute_sample_weight
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler


def plot_pretty_cf(predictor, xtest, ytest, cmap='Greys', normalize='true',
                   title=None, label_dict=None):
    """Plot a normalized confusion matrix with readable class labels."""
    fig, ax = plt.subplots(figsize=(8, 8))
    # ConfusionMatrixDisplay.from_estimator requires scikit-learn >= 1.0
    ConfusionMatrixDisplay.from_estimator(predictor, xtest, ytest, cmap=cmap,
                                          normalize=normalize, ax=ax)
    ax.set_title(title, size='xx-large', pad=20, fontweight='bold')
    if label_dict:
        # integer-encoded targets: map each tick back to its original label
        xlabels = [label_dict[int(x.get_text())] for x in ax.get_xticklabels()]
        ylabels = [label_dict[int(x.get_text())] for x in ax.get_yticklabels()]
    else:
        # string targets: tidy them up, e.g. 'some_label' -> 'Some Label'
        xlabels = [x.get_text().replace('_', ' ').title() for x in ax.get_xticklabels()]
        ylabels = [x.get_text().replace('_', ' ').title() for x in ax.get_yticklabels()]
    ax.set_xticklabels(xlabels, rotation=35)
    ax.set_yticklabels(ylabels)
    ax.set_xlabel('Predicted Label', size='x-large')
    ax.set_ylabel('True Label', size='x-large')
    plt.show()
    
    
def class_weight_applier(y_train, y_test):
    """Integer-encode the targets and return an integer -> label lookup."""
    le = LabelEncoder()
    y_integers = le.fit_transform(y_train)

    # create a dict of labels : their integer representations
    label_dict = dict(zip(le.classes_, np.unique(y_integers)))
    flipped_dict = {value: key for key, value in label_dict.items()}

    # get the balanced class and sample weights
    # (computed but not returned; the models below use resampling instead)
    class_weights = compute_class_weight('balanced', classes=np.unique(y_integers), y=y_integers)
    sample_weights = compute_sample_weight('balanced', y_integers)
    class_weights_dict = dict(zip(le.transform(le.classes_), class_weights))

    # convert the target to the numerical categories
    y_train = y_train.apply(lambda x: label_dict[x])
    y_test = y_test.apply(lambda x: label_dict[x])

    return y_train, y_test, flipped_dict

In [None]:
df = pd.read_csv('../input/history-of-philosophy/phil_nlp.csv')

df.sample(5)

### Baseline Naive Bayes Model

First we need to split the data into training and test sets.

In [None]:
# split the data
x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['school'])

Then we vectorize. After a few attempts, we found that the best models were those where no stop words were removed, so we pass an empty stop word list.

In [None]:
# vectorize
tfidvectorizer = TfidfVectorizer(decode_error='ignore', stop_words=[])
tf_idf_data_train = tfidvectorizer.fit_transform(x_train)
tf_idf_data_test = tfidvectorizer.transform(x_test)
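
As a small illustration of what the empty stop word list changes, the sketch below fits two vectorizers on a pair of made-up sentences (not taken from the dataset) and compares their vocabularies. It assumes scikit-learn's built-in `'english'` list as the alternative; the attempts mentioned above may have used a different list.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Two made-up sentences, for illustration only (not from the dataset).
toy_sentences = [
    "the self is a relation that relates itself to itself",
    "the meaning of a word is its use in the language",
]

# stop_words=[] removes nothing, so function words stay in the vocabulary.
keep_all = TfidfVectorizer(stop_words=[]).fit(toy_sentences)
# stop_words='english' drops scikit-learn's built-in English stop word list.
drop_common = TfidfVectorizer(stop_words='english').fit(toy_sentences)

vocab_all = set(keep_all.vocabulary_)
vocab_filtered = set(drop_common.vocabulary_)

# Words that only survive when no stop words are removed.
print(sorted(vocab_all - vocab_filtered))
```

Function words such as "the" are kept only by the first vectorizer. If authors differ in how they use such words, keeping them could plausibly help the classifier, which would be consistent with the observation above.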

In [None]:
# build the classifier, train it, get predictions
nb_classifier = MultinomialNB()
nb_classifier.fit(tf_idf_data_train, y_train)
nb_classifier_preds = nb_classifier.predict(tf_idf_data_test)

In [None]:
plot_pretty_cf(nb_classifier, tf_idf_data_test, y_test, title='Baseline NB Model')
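
The plot uses `normalize='true'`, so each row of the matrix is divided by the number of true samples in that class. A minimal sketch on made-up labels (the class names `school_a` and `school_b` are hypothetical) shows that the diagonal of such a matrix is the per-class recall:

```python
import numpy as np
from sklearn.metrics import confusion_matrix, recall_score

# Made-up true and predicted labels for two hypothetical classes.
y_true = ["school_a"] * 4 + ["school_b"] * 2
y_pred = ["school_a", "school_a", "school_a", "school_b", "school_b", "school_a"]
labels = ["school_a", "school_b"]

# normalize='true' divides every row by its number of true samples.
cm = confusion_matrix(y_true, y_pred, labels=labels, normalize="true")
per_class_recall = recall_score(y_true, y_pred, labels=labels, average=None)

print(cm)
print(per_class_recall)
```

Reading the diagonal this way makes it easy to spot which classes the model misses most often, independent of how many samples each class has.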

In [None]:
print(classification_report(y_test, nb_classifier_preds))

Accuracy in the low 70s over 10 classes is not too bad, but we can at least aim higher than this. Looking at the confusion matrix, a lot of the failures follow the lines of class imbalance.

In [None]:
df['school'].value_counts(normalize=True)
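
One way to read these proportions: the largest share is the accuracy a model would get by always predicting the most common class. The sketch below uses a made-up label column (the names and counts are hypothetical, not the dataset's) to compute that baseline and a simple imbalance ratio.

```python
import pandas as pd

# Hypothetical label column: 6, 3 and 1 sentences per class.
toy_schools = pd.Series(["school_a"] * 6 + ["school_b"] * 3 + ["school_c"] * 1)

shares = toy_schools.value_counts(normalize=True)

# Accuracy of always predicting the most frequent class.
majority_baseline = shares.max()
# How many times larger the biggest class is than the smallest.
imbalance_ratio = shares.max() / shares.min()

print(shares)
print(majority_baseline, imbalance_ratio)
```

Running the same two lines on `df['school']` would give a reference point for judging the accuracy figures reported in this notebook.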

Perhaps correcting for class imbalance could improve the model. 

### NB Corrected for Class Imbalance

Here we will use imblearn's `RandomOverSampler` and `RandomUnderSampler` to correct for class imbalance.
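
As a rough picture of what the two strategies do to the class counts, here is a numpy-only sketch on toy data. It is not imblearn's implementation; the helper names `random_oversample` and `random_undersample` are hypothetical and only mimic the basic idea of duplicating minority rows or discarding majority rows at random.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy imbalanced training labels (hypothetical): 6 / 3 / 1 samples per class.
y = np.array([0] * 6 + [1] * 3 + [2] * 1)
X = np.arange(len(y)).reshape(-1, 1)  # stand-in features: one row id per sample


def random_oversample(X, y):
    """Keep every row and add random duplicates until all classes match the largest."""
    target = np.bincount(y).max()
    picked = []
    for label in np.unique(y):
        rows = np.flatnonzero(y == label)
        extra = rng.choice(rows, size=target - len(rows), replace=True)
        picked.append(np.concatenate([rows, extra]))
    picked = np.concatenate(picked)
    return X[picked], y[picked]


def random_undersample(X, y):
    """Randomly keep only as many rows per class as the smallest class has."""
    target = np.bincount(y).min()
    picked = []
    for label in np.unique(y):
        rows = np.flatnonzero(y == label)
        picked.append(rng.choice(rows, size=target, replace=False))
    picked = np.concatenate(picked)
    return X[picked], y[picked]


X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)

print(np.bincount(y_over))   # every class grown to the majority size
print(np.bincount(y_under))  # every class shrunk to the minority size
```

Oversampling keeps all of the original information but repeats minority rows, while undersampling throws data away. In both cases only the training set should be resampled, so that the test set keeps the original class distribution.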

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['school'])

In [None]:
tfidvectorizer = TfidfVectorizer(decode_error='ignore', stop_words=[])
tf_idf_data_train = tfidvectorizer.fit_transform(x_train)
tf_idf_data_test = tfidvectorizer.transform(x_test)

In [None]:
y_train, y_test, flipped_dict = class_weight_applier(y_train, y_test)

#### Oversampling

In [None]:
ros = RandomOverSampler(sampling_strategy='all')

In [None]:
x_over, y_over = ros.fit_resample(tf_idf_data_train, y_train)

In [None]:
nb_oversampled = MultinomialNB()
nb_oversampled.fit(x_over, y_over)
nb_oversampled_preds = nb_oversampled.predict(tf_idf_data_test)

In [None]:
plot_pretty_cf(nb_oversampled, tf_idf_data_test, y_test, 
               title='NB w/ Oversampling', label_dict=flipped_dict)

In [None]:
print(classification_report(y_test, nb_oversampled_preds))

Not bad, we got a solid increase in accuracy. Let's check if undersampling helps any more.

#### Undersampling

In [None]:
rus = RandomUnderSampler(sampling_strategy='all')

In [None]:
x_under, y_under = rus.fit_resample(tf_idf_data_train, y_train)

In [None]:
nb_undersampled = MultinomialNB()
nb_undersampled.fit(x_under, y_under)
nb_undersampled_preds = nb_undersampled.predict(tf_idf_data_test)

In [None]:
plot_pretty_cf(nb_undersampled, tf_idf_data_test, y_test, 
               title='NB w/ Undersampling', label_dict=flipped_dict)

In [None]:
print(classification_report(y_test, nb_undersampled_preds))

Unsurprisingly, the result is not much different. It seems like Multinomial Naive Bayes can give us about 77% accuracy.

It's perhaps worth checking if lemmatization can help the model.

### NB with Lemmatization

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df['lemmatized_str'], df['school'])

# vectorize
tfidvectorizer = TfidfVectorizer(decode_error='ignore', stop_words=[])
tf_idf_data_train = tfidvectorizer.fit_transform(x_train)
tf_idf_data_test = tfidvectorizer.transform(x_test)

In [None]:
y_train, y_test, flipped_dict = class_weight_applier(y_train, y_test)

In [None]:
rus = RandomUnderSampler(sampling_strategy='all')

x_under_lemma, y_under_lemma = rus.fit_resample(tf_idf_data_train, y_train)

In [None]:
nb_lemma = MultinomialNB()
nb_lemma.fit(x_under_lemma, y_under_lemma)
nb_lemma_preds = nb_lemma.predict(tf_idf_data_test)

plot_pretty_cf(nb_lemma, tf_idf_data_test, y_test, 
               title='NB w/ Lemmatization', label_dict=flipped_dict)

In [None]:
print(classification_report(y_test, nb_lemma_preds))

Not great, and worse than non-lemmatized versions. This makes sense since lemmatization essentially masks information that might have had some small part to play in the classification math.
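
To make the "masking" concrete, the sketch below compares vocabularies before and after lemmatization on two made-up sentences. The lemma table `TOY_LEMMAS` and the helper `toy_lemmatize` are hypothetical stand-ins; the lemmatizer that produced the `lemmatized_str` column is not shown in this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical hand-written lemma table, for illustration only.
TOY_LEMMAS = {
    "thinks": "think", "thinking": "think", "thought": "think",
    "is": "be", "was": "be", "ideas": "idea",
}


def toy_lemmatize(sentence):
    """Replace each whitespace-separated token with its lemma, if one is listed."""
    return " ".join(TOY_LEMMAS.get(token, token) for token in sentence.split())


raw_sentences = [
    "whoever thinks is thinking of ideas",
    "what was thought is an idea",
]
lemma_sentences = [toy_lemmatize(s) for s in raw_sentences]

raw_vocab = set(TfidfVectorizer(stop_words=[]).fit(raw_sentences).vocabulary_)
lemma_vocab = set(TfidfVectorizer(stop_words=[]).fit(lemma_sentences).vocabulary_)

print(len(raw_vocab), len(lemma_vocab))
```

Distinct surface forms such as "thinks", "thinking" and "thought" collapse into one feature, so any signal carried by tense or number is no longer available to the classifier.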

### NB with Bigrams

While single words may not always be indicative of a school, certain phrases are almost entirely exclusive to one school. So it stands to reason that incorporating bigrams into our data would help the model.
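
As a quick illustration of what `ngram_range=(1, 2)` adds, this sketch fits a unigram-only vectorizer and a unigram-plus-bigram vectorizer on one made-up sentence (not from the dataset) and compares the resulting features.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# One made-up sentence, for illustration only.
toy_sentences = ["the categorical imperative binds every rational will"]

unigrams = TfidfVectorizer(stop_words=[], ngram_range=(1, 1)).fit(toy_sentences)
with_bigrams = TfidfVectorizer(stop_words=[], ngram_range=(1, 2)).fit(toy_sentences)

# Features that exist only once bigrams are included.
bigram_only = set(with_bigrams.vocabulary_) - set(unigrams.vocabulary_)

print(len(unigrams.vocabulary_), len(with_bigrams.vocabulary_))
print(sorted(bigram_only))
```

The phrase "categorical imperative" becomes a feature of its own, but the feature count also nearly doubles for a single sentence. On a full corpus the feature space grows much faster, which is worth keeping in mind when judging the results.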

In [None]:
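# start from a fresh split of the raw sentences; the previous section
# used the lemmatized text and already integer-encoded y_train and y_test
x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['school'])

# vectorize, this time adjusting the ngram range to include bigrams
tfidvectorizer = TfidfVectorizer(decode_error='ignore', 
                                 stop_words=[], 
                                 ngram_range=(1,2))
tf_idf_data_train = tfidvectorizer.fit_transform(x_train)
tf_idf_data_test = tfidvectorizer.transform(x_test)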

In [None]:
y_train, y_test, flipped_dict = class_weight_applier(y_train, y_test)

In [None]:
rus = RandomUnderSampler(sampling_strategy='all')

x_under_bgram, y_under_bgram = rus.fit_resample(tf_idf_data_train, y_train)

In [None]:
nb_bigrams = MultinomialNB()
nb_bigrams.fit(x_under_bgram, y_under_bgram)
nb_bigrams_preds = nb_bigrams.predict(tf_idf_data_test)

plot_pretty_cf(nb_bigrams, tf_idf_data_test, y_test, 
               title='NB w/ Bigrams', label_dict=flipped_dict)

In [None]:
print(classification_report(y_test, nb_bigrams_preds))

It seems like bigrams actually made the model worse. Let's try something totally different - random forests!

### Random Forest Classifier

Random Forests don't always do well on this kind of task, but it's perhaps worth trying. We'll just do an untuned model to see if it gets any kind of results worth exploring. 

In [None]:
x_train, x_test, y_train, y_test = train_test_split(df['sentence'], df['school'])

# vectorize
tfidvectorizer = TfidfVectorizer(decode_error='ignore', 
                                 stop_words=[])
tf_idf_data_train = tfidvectorizer.fit_transform(x_train)
tf_idf_data_test = tfidvectorizer.transform(x_test)

y_train, y_test, flipped_dict = class_weight_applier(y_train, y_test)

In [None]:
rus = RandomUnderSampler(sampling_strategy='all')

x_under, y_under = rus.fit_resample(tf_idf_data_train, y_train)

In [None]:
rf = RandomForestClassifier()
rf.fit(x_under, y_under)
rf_preds = rf.predict(tf_idf_data_test)

plot_pretty_cf(rf, tf_idf_data_test, y_test, 
               title='Untuned RF', label_dict=flipped_dict)

In [None]:
print(classification_report(y_test, rf_preds))

Unfortunately the random forest model got only 60% accuracy, worse than any of the Naive Bayes models. A result like this is poor enough that spending time refining it may just not be worth the effort, especially when there are more promising avenues still to explore. 

Overall, the Naive Bayes models were able to reach 77% accuracy when corrected for class imbalance. Given the number of classes involved (10), that is a respectable result. 