# Naive-Bayes Classifier

Let's try using a Naive-Bayes Classifier on this data.

First, we import the necessary libraries.

In [1]:
import json
import numpy as np
from naive_bayes import load_data, train_naive_bayes, evaluate_naive_bayes

Then we load the data.

In [2]:
with open('../data/NewsMTSC-dataset/train_preprocessed.jsonl', 'r') as f:
    train_data = [json.loads(line) for line in f]

with open('../data/NewsMTSC-dataset/devtest_mt_preprocessed.jsonl', 'r') as f:
    devtest_mt_data = [json.loads(line) for line in f]

with open('../data/NewsMTSC-dataset/devtest_rw_preprocessed.jsonl', 'r') as f:
    devtest_rw_data = [json.loads(line) for line in f]

print('train_data:', len(train_data))
print('devtest_mt_data:', len(devtest_mt_data))
print('devtest_rw_data:', len(devtest_rw_data))

print('train_data[0]:', train_data[0])
print('devtest_mt_data[0]:', devtest_mt_data[0])
print('devtest_rw_data[0]:', devtest_rw_data[0])

train_data: 8739
devtest_mt_data: 1476
devtest_rw_data: 1146
train_data[0]: {'gid': 'allsides_1000_401_25_Reality Leigh Winner_0_6', 'sentence_normalized': 'Winner wrote 30minute private meeting Republican lawmakers state policy director', 'polarity': 4.0}
devtest_mt_data[0]: {'gid': 'allsides_1002_402_12_former FBI director James B. Comey_51_56', 'sentence_normalized': 'While White House officials said days Comeys dismissal largely result memo written Deputy Attorney General Rod J Rosenstein criticizing FBI directors handling investigation Hillary Clintons use private email server secretary state Trump suggested NBC interview Russian investigation played role decision', 'polarity': 2.0}
devtest_rw_data[0]: {'gid': 'allsides_703_283_55_Mr. Trump_124_133', 'sentence_normalized': 'A group congressional Democrats said Wednesday ask Congress take rare step officially censuring Mr Trump', 'polarity': 2.0}


In [3]:
text_data, labels = load_data(train_data)
text_data_mt, labels_mt = load_data(devtest_mt_data)
text_data_rw, labels_rw = load_data(devtest_rw_data)

Then we train a baseline model.

In [4]:
alpha = 1.0
fit_prior = True
clf, vectorizer = train_naive_bayes(text_data, labels, alpha, fit_prior)

Finally, we evaluate the model on the training dataset and on both test datasets.

In [5]:
print('Train Classification Report:')
accuracy, roc_auc, report = evaluate_naive_bayes(clf, vectorizer, text_data, labels)

print('Devtest MT Classification Report:')
accuracy_mt, roc_auc_mt, report_mt = evaluate_naive_bayes(clf, vectorizer, text_data_mt, labels_mt)

print('Devtest RW Classification Report:')
accuracy_rw, roc_auc_rw, report_rw = evaluate_naive_bayes(clf, vectorizer, text_data_rw, labels_rw)

Train Classification Report:
Classification Report:
{'0': {'precision': 0.7788533134772897, 'recall': 0.9463208685162847, 'f1-score': 0.8544588155207624, 'support': 3316}, '1': {'precision': 0.8443976115208992, 'recall': 0.7939233817701453, 'f1-score': 0.8183829787234043, 'support': 3028}, '2': {'precision': 0.9312936124530328, 'recall': 0.7244258872651357, 'f1-score': 0.8149365899483325, 'support': 2395}, 'accuracy': 0.8327039707060304, 'macro avg': {'precision': 0.8515148458170739, 'recall': 0.8215567125171885, 'f1-score': 0.8292594613974997, 'support': 8739}, 'weighted avg': {'precision': 0.8433415444560006, 'recall': 0.8327039707060304, 'f1-score': 0.8311273858299086, 'support': 8739}}
Accuracy: 0.8327
ROC-AUC: 0.9581
Devtest MT Classification Report:
Classification Report:
{'0': {'precision': 0.49407114624505927, 'recall': 0.7780082987551867, 'f1-score': 0.6043513295729251, 'support': 482}, '1': {'precision': 0.6998284734133791, 'recall': 0.5454545454545454, 'f1-score': 0.61307287

Okay, the classification report includes all the results we want to see except for the ROC-AUC score. We can add that score to each report so that we have all the results in one place.

In [6]:
report['roc_auc'] = roc_auc
report_mt['roc_auc'] = roc_auc_mt
report_rw['roc_auc'] = roc_auc_rw

Now, let's create a master dictionary that holds details about the model (type and hyperparameters) and the reported results from all three dataset splits.

In [7]:
result = dict()
result['model'] = {'type': 'naive_bayes', 'alpha': alpha, 'fit_prior': fit_prior}
result['train'] = report
result['devtest_mt'] = report_mt
result['devtest_rw'] = report_rw

We get 83.3% accuracy on the training dataset (with a ROC-AUC of 0.958) but 55.6% on the mt dataset and 54.0% on the rw dataset. There is a significant difference between the performance on the training dataset and the test datasets. This could be due to overfitting, or the model may not be learning the right features. We can try to improve the model by tuning the hyperparameters (alpha, fit_prior), changing the feature representation, or changing the distribution of the datasets. For the sake of this project, we will try to improve the models by tuning the hyperparameters.
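
The gap between training and test accuracy can be read directly from the `result` dictionary built above. The following is a minimal sketch, assuming each split's report carries an `accuracy` key as in the classification report output. `accuracy_gap` is a hypothetical helper, and the numbers in the example are toy values, not results from this notebook.

```python
def accuracy_gap(result, test_splits=("devtest_mt", "devtest_rw")):
    """Return train accuracy minus test accuracy for each test split."""
    train_acc = result["train"]["accuracy"]
    return {split: train_acc - result[split]["accuracy"] for split in test_splits}


# Toy values for illustration only; not results from this notebook.
toy_result = {
    "train": {"accuracy": 0.90},
    "devtest_mt": {"accuracy": 0.60},
    "devtest_rw": {"accuracy": 0.50},
}
gaps = accuracy_gap(toy_result)
```

A large positive gap on both splits is consistent with overfitting, although it does not rule out a shift between the training and test distributions.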

## Hyperparameter tuning

Let's try to tune the hyperparameters of the model to improve the performance. We will run experiments and compare the results.

We will tune the following hyperparameters:
- alpha: Smoothing parameter
- fit_prior: Whether to learn class prior probabilities or not
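
To make the two hyperparameters concrete, here is a minimal pure-Python sketch of how additive smoothing and the class prior enter a multinomial Naive Bayes model. It is an illustration under assumptions, not the project's code: `train_naive_bayes` lives in the `naive_bayes` module and is not shown in this notebook, and the helper names `fit_counts` and `log_posteriors` as well as the toy documents are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_counts(docs, labels):
    """Count word occurrences per class and documents per class."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for doc, label in zip(docs, labels):
        class_counts[label] += 1
        word_counts[label].update(doc.split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, class_counts, vocab


def log_posteriors(doc, word_counts, class_counts, vocab, alpha=1.0, fit_prior=True):
    """Return the unnormalized log P(class | doc) for every class."""
    n_docs = sum(class_counts.values())
    n_classes = len(class_counts)
    scores = {}
    for label in class_counts:
        # fit_prior=True uses class frequencies; False uses a uniform prior.
        prior = class_counts[label] / n_docs if fit_prior else 1.0 / n_classes
        total = sum(word_counts[label].values())
        score = math.log(prior)
        for word in doc.split():
            if word not in vocab:
                continue  # words never seen in training are ignored
            # Additive smoothing: alpha is added to every word count,
            # so no word gets a zero probability in any class.
            p = (word_counts[label][word] + alpha) / (total + alpha * len(vocab))
            score += math.log(p)
        scores[label] = score
    return scores


# Toy corpus (hypothetical): class 0 has two documents, class 1 has one.
docs = ["good great good", "bad awful", "bad bad poor"]
labels = [1, 0, 0]
wc, cc, vocab = fit_counts(docs, labels)
scores = log_posteriors("good bad", wc, cc, vocab, alpha=1.0, fit_prior=True)
predicted = max(scores, key=scores.get)
```

In this toy corpus the learned prior favors class 0, so "good bad" is assigned to class 0 with `fit_prior=True` and to class 1 with `fit_prior=False`. A larger `alpha` pulls all word probabilities toward uniform, which reduces the influence of rare words.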

In [8]:
# Suppress the reports that evaluate_naive_bayes prints during the sweep
import contextlib
import os
with open(os.devnull, 'w') as devnull, contextlib.redirect_stdout(devnull):

    alpha_list = np.linspace(0.1, 2.0, num=20)
    fit_prior_list = [True, False]
    result_list = []

    for alpha in alpha_list:
        for fit_prior in fit_prior_list:
            clf, vectorizer = train_naive_bayes(text_data, labels, alpha, fit_prior)
            accuracy, roc_auc, report = evaluate_naive_bayes(clf, vectorizer, text_data, labels)
            accuracy_mt, roc_auc_mt, report_mt = evaluate_naive_bayes(clf, vectorizer, text_data_mt, labels_mt)
            accuracy_rw, roc_auc_rw, report_rw = evaluate_naive_bayes(clf, vectorizer, text_data_rw, labels_rw)
            report['roc_auc'] = roc_auc
            report_mt['roc_auc'] = roc_auc_mt
            report_rw['roc_auc'] = roc_auc_rw
            result = dict()
            result['model'] = {'type': 'naive_bayes', 'alpha': alpha, 'fit_prior': fit_prior}
            result['train'] = report
            result['devtest_mt'] = report_mt
            result['devtest_rw'] = report_rw
            result_list.append(result)

We write the results to a JSON file. The file holds a list with one entry per hyperparameter combination, and each entry has the following structure (the `macro avg` and `weighted avg` blocks of each report are omitted here for brevity):

```json
{
    "model": {
        "type": [model_type],
        "alpha": [alpha],
        "fit_prior": [fit_prior]
    },
    "train": {
        "accuracy": [train_acc],
        "roc_auc": [train_roc_auc],
        "0": {
            "precision": [train_precision_0],
            "recall": [train_recall_0],
            "f1-score": [train_f1_0],
            "support": [train_support_0]
        },
        "1": {
            "precision": [train_precision_1],
            "recall": [train_recall_1],
            "f1-score": [train_f1_1],
            "support": [train_support_1]
        },
        "2": {
            "precision": [train_precision_2],
            "recall": [train_recall_2],
            "f1-score": [train_f1_2],
            "support": [train_support_2]
        }
    },
    "devtest_mt": {
        "accuracy": [devtest_mt_acc],
        "roc_auc": [devtest_mt_roc_auc],
        "0": {
            "precision": [devtest_mt_precision_0],
            "recall": [devtest_mt_recall_0],
            "f1-score": [devtest_mt_f1_0],
            "support": [devtest_mt_support_0]
        },
        "1": {
            "precision": [devtest_mt_precision_1],
            "recall": [devtest_mt_recall_1],
            "f1-score": [devtest_mt_f1_1],
            "support": [devtest_mt_support_1]
        },
        "2": {
            "precision": [devtest_mt_precision_2],
            "recall": [devtest_mt_recall_2],
            "f1-score": [devtest_mt_f1_2],
            "support": [devtest_mt_support_2]
        }
    },
    "devtest_rw": {
        "accuracy": [devtest_rw_acc],
        "roc_auc": [devtest_rw_roc_auc],
        "0": {
            "precision": [devtest_rw_precision_0],
            "recall": [devtest_rw_recall_0],
            "f1-score": [devtest_rw_f1_0],
            "support": [devtest_rw_support_0]
        },
        "1": {
            "precision": [devtest_rw_precision_1],
            "recall": [devtest_rw_recall_1],
            "f1-score": [devtest_rw_f1_1],
            "support": [devtest_rw_support_1]
        },
        "2": {
            "precision": [devtest_rw_precision_2],
            "recall": [devtest_rw_recall_2],
            "f1-score": [devtest_rw_f1_2],
            "support": [devtest_rw_support_2]
        }
    }
}
```

In [None]:
with open('naive_bayes_results.json', 'w') as f:
    json.dump(result_list, f, indent=4)
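
Once the file is written, the runs can be compared by loading it back and sorting on a metric of interest. The following is a minimal sketch, assuming the structure described above. `load_results` and `rank_results` are hypothetical helpers, the choice of `devtest_mt` accuracy as the ranking key is an assumption, and the entries in `toy_results` are made-up values for illustration, not results from this notebook.

```python
import json


def load_results(path="naive_bayes_results.json"):
    """Read the list of result dictionaries written by the sweep."""
    with open(path, "r") as f:
        return json.load(f)


def rank_results(results, split="devtest_mt", metric="accuracy"):
    """Sort runs from best to worst by results[i][split][metric]."""
    return sorted(results, key=lambda r: r[split][metric], reverse=True)


# Toy entries for illustration only; not results from this notebook.
toy_results = [
    {"model": {"type": "naive_bayes", "alpha": 0.1, "fit_prior": True},
     "devtest_mt": {"accuracy": 0.50, "roc_auc": 0.70}},
    {"model": {"type": "naive_bayes", "alpha": 0.5, "fit_prior": False},
     "devtest_mt": {"accuracy": 0.60, "roc_auc": 0.65}},
]
best = rank_results(toy_results)[0]
best_by_auc = rank_results(toy_results, metric="roc_auc")[0]
```

With the real file, `rank_results(load_results())` would give the same ordering over all 40 runs. Note that picking hyperparameters on the test splits tends to give optimistic estimates, so a separate validation split would be a cleaner basis for the final choice.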