# Support Vector Machine (SVM)

Let's try an SVM model.

First, we import the necessary libraries.

In [1]:
import json
import numpy as np
import pandas as pd
from svm import load_data, train_svm, evaluate_svm

Then we load the data.

In [2]:
with open('../data/NewsMTSC-dataset/train_preprocessed.jsonl', 'r') as f:
    train_data = [json.loads(line) for line in f]

with open('../data/NewsMTSC-dataset/devtest_mt_preprocessed.jsonl', 'r') as f:
    devtest_mt_data = [json.loads(line) for line in f]

with open('../data/NewsMTSC-dataset/devtest_rw_preprocessed.jsonl', 'r') as f:
    devtest_rw_data = [json.loads(line) for line in f]

print('train_data:', len(train_data))
print('devtest_mt_data:', len(devtest_mt_data))
print('devtest_rw_data:', len(devtest_rw_data))

print('train_data[0]:', train_data[0])
print('devtest_mt_data[0]:', devtest_mt_data[0])
print('devtest_rw_data[0]:', devtest_rw_data[0])

train_data: 8739
devtest_mt_data: 1476
devtest_rw_data: 1146
train_data[0]: {'gid': 'allsides_1000_401_25_Reality Leigh Winner_0_6', 'sentence_normalized': 'Winner wrote 30minute private meeting Republican lawmakers state policy director', 'polarity': 4.0}
devtest_mt_data[0]: {'gid': 'allsides_1002_402_12_former FBI director James B. Comey_51_56', 'sentence_normalized': 'While White House officials said days Comeys dismissal largely result memo written Deputy Attorney General Rod J Rosenstein criticizing FBI directors handling investigation Hillary Clintons use private email server secretary state Trump suggested NBC interview Russian investigation played role decision', 'polarity': 2.0}
devtest_rw_data[0]: {'gid': 'allsides_703_283_55_Mr. Trump_124_133', 'sentence_normalized': 'A group congressional Democrats said Wednesday ask Congress take rare step officially censuring Mr Trump', 'polarity': 2.0}


In [3]:
text_data, labels = load_data(train_data)
text_data_mt, labels_mt = load_data(devtest_mt_data)
text_data_rw, labels_rw = load_data(devtest_rw_data)

Then we train the model.

In [4]:
C = 1.0
gamma = 0.7
kernel = 'rbf'
degree = 3

clf, vectorizer = train_svm(text_data, labels, C, kernel, degree, gamma)

Finally, we evaluate the model on the test datasets.

In [5]:
print('Training data:')
accuracy, roc_auc, report = evaluate_svm(clf, vectorizer, text_data_mt, labels_mt)
print('MT devtest data:')
accuracy_mt, roc_auc_mt, report_mt = evaluate_svm(clf, vectorizer, text_data_mt, labels_mt)
print('RW devtest data:')
accuracy_rw, roc_auc_rw, report_rw = evaluate_svm(clf, vectorizer, text_data_rw, labels_rw)

Training data:
Classification Report:
{'0': {'precision': 0.5462328767123288, 'recall': 0.6618257261410788, 'f1-score': 0.5984990619136961, 'support': 482}, '1': {'precision': 0.6680216802168022, 'recall': 0.6590909090909091, 'f1-score': 0.6635262449528937, 'support': 748}, '2': {'precision': 0.487012987012987, 'recall': 0.3048780487804878, 'f1-score': 0.37500000000000006, 'support': 246}, 'accuracy': 0.6009485094850948, 'macro avg': {'precision': 0.5670891813140394, 'recall': 0.5419315613374919, 'f1-score': 0.5456751022888633, 'support': 1476}, 'weighted avg': {'precision': 0.5980824242430253, 'recall': 0.6009485094850948, 'f1-score': 0.5942033733517385, 'support': 1476}}
Accuracy: 0.6009
ROC-AUC: 0.7526
MT devtest data:
Classification Report:
{'0': {'precision': 0.5462328767123288, 'recall': 0.6618257261410788, 'f1-score': 0.5984990619136961, 'support': 482}, '1': {'precision': 0.6680216802168022, 'recall': 0.6590909090909091, 'f1-score': 0.6635262449528937, 'support': 748}, '2': {'p

Let's what the report looks like.

In [6]:
report_rw

{'0': {'precision': 0.6116071428571429,
  'recall': 0.6386946386946387,
  'f1-score': 0.6248574686431015,
  'support': 429},
 '1': {'precision': 0.5709219858156028,
  'recall': 0.7076923076923077,
  'f1-score': 0.6319921491658489,
  'support': 455},
 '2': {'precision': 0.5522388059701493,
  'recall': 0.2824427480916031,
  'f1-score': 0.3737373737373738,
  'support': 262},
 'accuracy': 0.5846422338568935,
 'macro avg': {'precision': 0.5782559782142983,
  'recall': 0.5429432314928498,
  'f1-score': 0.5435289971821081,
  'support': 1146},
 'weighted avg': {'precision': 0.5818809205898715,
  'recall': 0.5846422338568935,
  'f1-score': 0.5702787729821498,
  'support': 1146}}

Okay, it includes all the results we want to see except for the ROC-AUC score. We can add that to this report so that we have all the results in one place. 

In [7]:
report['roc_auc'] = roc_auc
report_mt['roc_auc'] = roc_auc_mt
report_rw['roc_auc'] = roc_auc_rw

Now, let's create a master dictionary that will have details about the model (type and hyperparameters), and the reported results from all three datasets splits.

In [8]:
result = dict()
result['model'] = {'type': 'svm', 'C': C, 'kernel': kernel, 'degree': degree, 'gamma': gamma}
result['train'] = report
result['devtest_mt'] = report_mt
result['devtest_rw'] = report_rw

We get 95.3% on the training dataset but 55.6% on the mt dataset and 54.0% on the rw dataset. There is a significant difference between the performance on the training dataset and the test datasets. This could be due to overfitting, or the model is not learning the right features. We can try to improve the model by tuning the hyperparameters (epochs, le), modifying the architecture, or changing the distribution of the datasets. For the sake of this project, we will try to improve models by tuning the hyperparameters.

## Hyperparameter tuning

Let's try to tune the hyperparameters of the model to improve the performance. We will run experiments and compare the results.

We will tune the following hyperparameters:
- C: Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
- kernel: Specifies the kernel type to be used in the algorithm. It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. If none is given, 'rbf' will be used. If a callable is given it is used to pre-compute the kernel matrix from data matrices; that matrix should be an array of shape (n_samples, n_samples).
- degree: Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.
- gamma: Kernel coefficient for 'rbf', 'poly' and 'sigmoid'. If gamma is 'auto' then 1/n_features will be used instead.

In [None]:
# Suppress output
import contextlib
with contextlib.redirect_stdout(open('/dev/null', 'w')):

    C_list = [0.1, 1.0]
    gamma_list = [0.1, 0.7, 1.0]
    kernel_list = ['linear', 'poly', 'rbf', 'sigmoid']
    degree_list = [2, 3]

    results = []

    for C in C_list:
        for gamma in gamma_list:
            for kernel in kernel_list:
                for degree in degree_list:
                    clf, vectorizer = train_svm(text_data, labels, C, kernel, degree, gamma)
                    accuracy, roc_auc, report = evaluate_svm(clf, vectorizer, text_data_mt, labels_mt)
                    accuracy_mt, roc_auc_mt, report_mt = evaluate_svm(clf, vectorizer, text_data_mt, labels_mt)
                    accuracy_rw, roc_auc_rw, report_rw = evaluate_svm(clf, vectorizer, text_data_rw, labels_rw)
                    report['roc_auc'] = roc_auc
                    report_mt['roc_auc'] = roc_auc_mt
                    report_rw['roc_auc'] = roc_auc_rw
                    result = dict()
                    result['model'] = {'type': 'svm', 'C': C, 'kernel': kernel, 'degree': degree, 'gamma': gamma}
                    result['train'] = report
                    result['devtest_mt'] = report_mt
                    result['devtest_rw'] = report_rw
                    results.append(result)

We will write the results of the experiments in a json file, where each entry looks like the following:
    
    ```json
    {
        "model": {
            "type": [model_type],
            "C": [C],
            "kernel": [kernel],
            "degree": [degree],
            "gamma": [gamma]
        },
        "train": {
            "accuracy": [train_acc],
            "roc_auc": [train_roc_auc]
            "0": {
                "precision": [train_precision_0],
                "recall": [train_recall_0],
                "f1": [train_f1_0],
                "support": [train_support_0]
            },
            "1": {
                "precision": [train_precision_1],
                "recall": [train_recall_1],
                "f1": [train_f1_1],
                "support": [train_support_1]
            },
            "2": {
                "precision": [train_precision_2],
                "recall": [train_recall_2],
                "f1": [train_f1_2],
                "support": [train_support_2]
            }
        },
        "devtest_mt": {
            "accuracy": [devtest_mt_acc],
            "roc_auc": [devtest_mt_roc_auc]
            "0": {
                "precision": [devtest_mt_precision_0],
                "recall": [devtest_mt_recall_0],
                "f1": [devtest_mt_f1_0],
                "support": [devtest_mt_support_0]
            },
            "1": {
                "precision": [devtest_mt_precision_1],
                "recall": [devtest_mt_recall_1],
                "f1": [devtest_mt_f1_1],
                "support": [devtest_mt_support_1]
            },
            "2": {
                "precision": [devtest_mt_precision_2],
                "recall": [devtest_mt_recall_2],
                "f1": [devtest_mt_f1_2],
                "support": [devtest_mt_support_2]
            }
        },
        "devtest_rw": {
            "accuracy": [devtest_rw_acc],
            "roc_auc": [devtest_rw_roc_auc]
            "0": {
                "precision": [devtest_rw_precision_0],
                "recall": [devtest_rw_recall_0],
                "f1": [devtest_rw_f1_0],
                "support": [devtest_rw_support_0]
            },
            "1": {
                "precision": [devtest_rw_precision_1],
                "recall": [devtest_rw_recall_1],
                "f1": [devtest_rw_f1_1],
                "support": [devtest_rw_support_1]
            },
            "2": {
                "precision": [devtest_rw_precision_2],
                "recall": [devtest_rw_recall_2],
                "f1": [devtest_rw_f1_2],
                "support": [devtest_rw_support_2]
            }
        }
    }
    ```

In [None]:
with open('svm_results.json', 'w') as f:
    json.dump(results, f, indent=4)

NameError: name 'json' is not defined