### Coursework 2

In this coursework you will be aiming to complete two classification tasks. 
Both the classification tasks relate to text classification tasks. 

One task is to be solved using Support Vector Machines. The other has to be solved using Boosting.

The specific tasks and the marking for the various tasks are provided in the notebook. Each task is expected to be accompanied by a lab-report. Each task can have a concise lab report that is maximum of one page in an A4 size. You will be expected to submit your Jupyter Notebook and all lab reports as a single zip file. You could have additional functions implemented that you require for carrying out each task.


#### Task 1

In this task, you need to obtain sentiment analysis for the provided dataset. The dataset consists of movie reviews with the sentiments being provided. The sentiments are either positive or negative. You need to train an SVM based classifier to obtain train and check on the sample test dataset provided. The method will be evaluated also against an external test set. Please do not hardcode any dimensions or number of samples while writing the code. It should be possible to automate the testing and hardcoding values does not allow for automated testing. 

You are allowed to use scikit-learn to implement the SVM. However, you are expected to write your own kernels.

You are allowed to use the existing library functions such as scikit-learn or numpy for obtaining the SVM. The main idea is to analyse the dataset using different kind of kernels. You are also supposed to write your own custom text kernels. Refer to the documentation provided [here](https://scikit-learn.org/stable/modules/svm.html) at 1.4.6.2 and an example [here](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html) for writing your own kernels.

Details regarding the marking have been provided in the coursework specification file. Ensure that the code can be run with different test files. 

#### Process the text and obtain a bag of words-based features 

In [1]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from nltk.metrics.distance import jaccard_distance
from bs4 import BeautifulSoup
from nltk.util import ngrams
from nltk.corpus import brown
from nltk import FreqDist
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer

nltk.download('stopwords')
nltk.download('words')
nltk.download('brown')

def pre_processing(dataset):
    
    for i in range(len(dataset)):
        #print(i)
        # remove html tags
        dataset[i] = BeautifulSoup(dataset[i]).get_text()
        
        # convert to lower case
        dataset[i] = dataset[i].lower()
        
        # tokenize
        dataset[i] = word_tokenize(dataset[i])
                
        # remove punctuation
        dataset[i] = [word for word in dataset[i] if word.isalpha()]
        
        # remove lengthened words e.g. finallllly
        pattern = re.compile(r"(.)\1{2,}")
        dataset[i] = [pattern.sub(r"\1\1", word) for word in dataset[i]]
        
        # remove stop words
        for word in dataset[i]:
            if word in stopwords.words('english'):
                dataset[i].remove(word)
        
        # stemming
        stemmer = PorterStemmer()
        dataset[i] = [stemmer.stem(word) for word in dataset[i]]
        
        # join words with space
        dataset[i] = ' '.join(dataset[i])
             
    return dataset

def extract_bag_of_words_train_test(train_file, test_file):
    
    # Read the CSV files for training and test sets
    train = pd.read_csv(train_file)
    test = pd.read_csv(test_file)
    
    X_train = np.array(train.review)
    X_test = np.array(test.review)
    y_train = np.array(train.sentiment)
    y_test = np.array(test.sentiment)
    
    # Extract bag of words features
    X_train = pre_processing(X_train) 
    print('Train set: done processing data')
    X_test = pre_processing(X_test)
    print('Test set: done processing data')
    
    return (X_train,y_train,X_test,y_test)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Sayuri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package words to /Users/Sayuri/nltk_data...
[nltk_data]   Package words is already up-to-date!
[nltk_data] Downloading package brown to /Users/Sayuri/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [77]:
from sklearn import svm
from sklearn import metrics
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import accuracy_score 
from warnings import warn
from grakel.kernels import ShortestPath
from nltk.tokenize import word_tokenize
from grakel.kernels import WeisfeilerLehman, VertexHistogram

class SVMClassifier:
    def __init__(self, kernel, C=None, gamma=None, coef=None, d=None):
        
        #implement initialisation
        self.kernel = kernel
        self.tf_idf = None
        self.graph_kernel = None
        
        # for cross-validation 
        ## to be implemented 
        self.train_val_split = 0.8
        self.num_fold = 5
        
        # regularization parameter
        self.C = C # penalty parameter
        
        # kernel parameters
        self.gamma = gamma # kernel coef for rbf, poly, sigmoid
        self.coef = coef #independent term for polynomial and sigmoid kernels
        self.d = d # degree; for polynomial kernel

        
    # define your own kernel here
    # Refer to the documentation here: https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html
    
    def custom_graph_kernel(self, train, val): #to be implemented
        # reference: https://www.jmlr.org/papers/volume21/18-370/18-370.pdf
        
        self.graph_kernel = ShortestPath()
        G_train = self.graph_kernel.fit_transform(train)
        G_val = self.graph_kernel.transform(val)
        
        return G_train, G_val
        
    def fit(self, X, y):
        # training of the SVM
        # Ensure you call your own defined kernel here
        
        # cross validation
        
        skf = StratifiedKFold(n_splits=self.num_fold, shuffle=True, random_state=1)
        fold_acc_lst = []
  
        for train_idx, val_idx in skf.split(X, y):
        
            x_train_fold, x_val_fold = X[train_idx], X[val_idx]
            y_train_fold, y_val_fold = y[train_idx], y[val_idx]
            
            # get feature vectors for train and val sets
            if self.kernel != 'custom':
                self.tf_idf = TfidfVectorizer()
                x_train_fold = self.tf_idf.fit_transform(x_train_fold)
                #print(x_train_fold)

                # calling diff kernels
                if self.kernel == 'linear':
                    self.clf = svm.SVC(kernel='linear', C=self.C)

                elif self.kernel == 'polynomial':
                    self.clf = svm.SVC(kernel='poly', C=self.C, gamma=self.gamma, coef0=self.coef, degree=self.d)

                elif self.kernel == 'sigmoid':
                    self.clf = svm.SVC(kernel='sigmoid', C=self.C, gamma=self.gamma, coef0=self.coef)

                elif self.kernel == 'rbf':
                    self.clf = svm.SVC(kernel='rbf', C=self.C, gamma=self.gamma, coef0=self.coef)
            
            elif self.kernel == 'custom':
                x_train_fold = [word_tokenize(w) for w in x_train_fold]
                print(x_train_fold)
                G_train, G_test = self.custom_graph_kernel(x_train_fold, x_val_fold)
                self.clf = svm.SVC(kernel='precomputed', C=self.C)
                
            self.clf.fit(x_train_fold, y_train_fold)
            #print(y_train_fold)
            y_pred_fold = self.clf.predict(x_val_fold)
            fold_acc_lst.append(accuracy_score(y_val_fold, y_pred_fold))
            
        return fold_acc_lst
        
        
        # for submission
        #self.td_idf = TfidfVectorizer()
        #X = self.tf_idf.fit_transform(X)
        #return self.clf.fit(X, y) 
    
    def predict(self, X):
        
        # get tfidf feature vector
        
        if self.kernel == 'custom':
            X = self.graph_kernel.transform(X)
        else:
            X = self.tf_idf.transform(X)
        # prediction routine for the SVM
        return self.clf.predict(X)

In [4]:
# self testing code - remove before submission
(X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test("movie_review_train.csv", "movie_review_test.csv")

Train set: done processing data
Test set: done processing data


In [39]:
set(Y_train) - set(Y_test)

set()

In [48]:
# self testing code - remove before submission
# Hyperparameter tuning for linear kernel
from sklearn.metrics import classification_report

linear_perf = []
linear_rpt = []

# for linear kernel
k_rng = np.linspace(-2.84,-0.4,10)

run = 1

for k in k_rng:
    print('run: ', run)
    c = 2**k
    print('k: ', k)
    print('c: ', c)

    sc = SVMClassifier(kernel='linear', C=c)

    acc_list = np.array(sc.fit(X_train, Y_train))
    print(acc_list)
    print('mean score: ', np.mean(acc_list))
    Y_Pred = sc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    print(classification_report(Y_test, Y_Pred))
    linear_perf.append(acc_list)
    linear_rpt.append(classification_report(Y_test, Y_Pred))
    
    run += 1

run:  1
k:  -2.84
c:  0.1396608922590275
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.823 0.833 0.83  0.825 0.836]
mean score:  0.8294
Accuracy: 0.8366666666666667
              precision    recall  f1-score   support

    negative       0.83      0.84      0.83       731
    positive       0.85      0.83      0.84       769

    accuracy                           0.84      1500
   macro avg       0.84      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500

run:  2
k:  -2.568888888888889
c:  0.1685339458372455
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'nega

In [52]:
# self testing code - remove before submission
# Hyperparameter tuning for RBF kernel
from sklearn.metrics import classification_report

rbf_perf = []
rbf_rpt = []

# for rbf kernel
k1_rng = np.linspace(-2,2,5)
C_rng = [2**k for k in k1_rng]
k2_rng = np.linspace(-2,2,5)
gamma_rng = [2**k for k in k2_rng]
k3_rng = np.linspace(-1.2,2,5)
coef_rng = [2**k for k in k3_rng]

run = 1

for c in C_rng:
    print('run: ', run)
    print('c: ', c)
    
    for gamma in gamma_rng:
        
        print('gamma: ', gamma)
        
        for coef in coef_rng:
            
            print('coef: ', coef)

            sc = SVMClassifier(kernel='rbf', C=c, gamma=gamma, coef=coef)

            acc_list = np.array(sc.fit(X_train, Y_train))
            print(acc_list)
            print('mean score: ', np.mean(acc_list))
            Y_Pred = sc.predict(X_test)
            acc = accuracy_score(Y_test, Y_Pred)
            print("Accuracy:",acc)
            print(classification_report(Y_test, Y_Pred))
            rbf_perf.append(acc_list)
            rbf_rpt.append(classification_report(Y_test, Y_Pred))
    
            run += 1

run:  1
c:  0.25
gamma:  0.25
coef:  0.25
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.818 0.828 0.826 0.826 0.833]
mean score:  0.8262
Accuracy: 0.8313333333333334
              precision    recall  f1-score   support

    negative       0.83      0.83      0.83       731
    positive       0.84      0.83      0.84       769

    accuracy                           0.83      1500
   macro avg       0.83      0.83      0.83      1500
weighted avg       0.83      0.83      0.83      1500

coef:  0.5
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' .

Accuracy: 0.834
              precision    recall  f1-score   support

    negative       0.83      0.84      0.83       731
    positive       0.84      0.83      0.84       769

    accuracy                           0.83      1500
   macro avg       0.83      0.83      0.83      1500
weighted avg       0.83      0.83      0.83      1500

coef:  0.5
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.82  0.834 0.83  0.834 0.84 ]
mean score:  0.8316000000000001
Accuracy: 0.834
              precision    recall  f1-score   support

    negative       0.83      0.84      0.83       731
    positive       0.84      0.83      0.84       769

    accuracy                           0.8

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.506 0.506 0.506 0.507 0.507]
mean score:  0.5064
Accuracy: 0.48733333333333334
              precision    recall  f1-score   support

    negative       0.49      1.00      0.66       731
    positive       0.00      0.00      0.00       769

    accuracy                           0.49      1500
   macro avg       0.24      0.50      0.33      1500
weighted avg       0.24      0.49      0.32      1500

coef:  1.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.506 0.506 0.506 0.507 0.507]
mean score:  0.5064
Accuracy: 0.48733333333333334
              precision    recall  f1-score   support

    negative       0.49      1.00      0.66       731
    positive       0.00      0.00      0.00       769

    accuracy                           0.49      1500
   macro avg       0.24      0.50      0.33      1500
weighted avg       0.24      0.49      0.32      1500

coef:  2.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.506 0.506 0.506 0.507 0.507]
mean score:  0.5064
Accuracy: 0.48733333333333334
              precision    recall  f1-score   support

    negative       0.49      1.00      0.66       731
    positive       0.00      0.00      0.00       769

    accuracy                           0.49      1500
   macro avg       0.24      0.50      0.33      1500
weighted avg       0.24      0.49      0.32      1500

coef:  4.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.506 0.506 0.506 0.507 0.507]
mean score:  0.5064
Accuracy: 0.48733333333333334
              precision    recall  f1-score   support

    negative       0.49      1.00      0.66       731
    positive       0.00      0.00      0.00       769

    accuracy                           0.49      1500
   macro avg       0.24      0.50      0.33      1500
weighted avg       0.24      0.49      0.32      1500

run:  26
c:  0.5
gamma:  0.25
coef:  0.25


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.822 0.834 0.835 0.828 0.842]
mean score:  0.8321999999999999
Accuracy: 0.8373333333333334
              precision    recall  f1-score   support

    negative       0.83      0.84      0.83       731
    positive       0.84      0.84      0.84       769

    accuracy                           0.84      1500
   macro avg       0.84      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500

coef:  0.5
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'nega

[0.817 0.839 0.828 0.825 0.842]
mean score:  0.8301999999999999
Accuracy: 0.8426666666666667
              precision    recall  f1-score   support

    negative       0.83      0.84      0.84       731
    positive       0.85      0.84      0.85       769

    accuracy                           0.84      1500
   macro avg       0.84      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500

coef:  0.5
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.817 0.839 0.828 0.825 0.842]
mean score:  0.8301999999999999
Accuracy: 0.8426666666666667
              precision    recall  f1-score   support

    negative       0.83      0.84      0.84       731
    pos

['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.534 0.541 0.538 0.546 0.535]
mean score:  0.5388000000000001
Accuracy: 0.526
              precision    recall  f1-score   support

    negative       0.51      1.00      0.67       731
    positive       1.00      0.08      0.14       769

    accuracy                           0.53      1500
   macro avg       0.75      0.54      0.41      1500
weighted avg       0.76      0.53      0.40      1500

coef:  1.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['posi

Accuracy: 0.8433333333333334
              precision    recall  f1-score   support

    negative       0.83      0.85      0.84       731
    positive       0.85      0.84      0.85       769

    accuracy                           0.84      1500
   macro avg       0.84      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500

coef:  1.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.826 0.831 0.826 0.829 0.847]
mean score:  0.8318000000000001
Accuracy: 0.8433333333333334
              precision    recall  f1-score   support

    negative       0.83      0.85      0.84       731
    positive       0.85      0.84      0.85       769

    accuracy    

['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.828 0.84  0.834 0.829 0.845]
mean score:  0.8351999999999998
Accuracy: 0.8446666666666667
              precision    recall  f1-score   support

    negative       0.84      0.85      0.84       731
    positive       0.85      0.84      0.85       769

    accuracy                           0.84      1500
   macro avg       0.84      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500

coef:  2.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'nega

Accuracy: 0.8453333333333334
              precision    recall  f1-score   support

    negative       0.84      0.85      0.84       731
    positive       0.85      0.84      0.85       769

    accuracy                           0.85      1500
   macro avg       0.85      0.85      0.85      1500
weighted avg       0.85      0.85      0.85      1500

coef:  2.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.827 0.83  0.825 0.824 0.847]
mean score:  0.8306000000000001
Accuracy: 0.8453333333333334
              precision    recall  f1-score   support

    negative       0.84      0.85      0.84       731
    positive       0.85      0.84      0.85       769

    accuracy    

['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.824 0.827 0.828 0.831 0.85 ]
mean score:  0.8319999999999999
Accuracy: 0.836
              precision    recall  f1-score   support

    negative       0.83      0.84      0.83       731
    positive       0.84      0.83      0.84       769

    accuracy                           0.84      1500
   macro avg       0.84      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500

coef:  4.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['posi

[0.815 0.816 0.831 0.805 0.821]
mean score:  0.8176
Accuracy: 0.8253333333333334
              precision    recall  f1-score   support

    negative       0.81      0.84      0.82       731
    positive       0.84      0.81      0.83       769

    accuracy                           0.83      1500
   macro avg       0.83      0.83      0.83      1500
weighted avg       0.83      0.83      0.83      1500

coef:  4.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.815 0.816 0.831 0.805 0.821]
mean score:  0.8176
Accuracy: 0.8253333333333334
              precision    recall  f1-score   support

    negative       0.81      0.84      0.82       731
    positive       0.84      0.

['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.823 0.825 0.825 0.822 0.846]
mean score:  0.8282
Accuracy: 0.83
              precision    recall  f1-score   support

    negative       0.82      0.83      0.83       731
    positive       0.84      0.83      0.83       769

    accuracy                           0.83      1500
   macro avg       0.83      0.83      0.83      1500
weighted avg       0.83      0.83      0.83      1500

gamma:  1.0
coef:  0.25
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['posi

[0.826 0.838 0.835 0.83  0.848]
mean score:  0.8353999999999999
Accuracy: 0.8386666666666667
              precision    recall  f1-score   support

    negative       0.83      0.84      0.84       731
    positive       0.85      0.84      0.84       769

    accuracy                           0.84      1500
   macro avg       0.84      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500

gamma:  4.0
coef:  0.25
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.815 0.816 0.831 0.805 0.821]
mean score:  0.8176
Accuracy: 0.8253333333333334
              precision    recall  f1-score   support

    negative       0.81      0.84      0.82       731
    po

In [46]:
# self testing code - remove before submission
# Hyperparameter tuning for polynomial kernel

from sklearn.metrics import classification_report

poly_perf = []
poly_rpt = []

k1_rng = np.linspace(-5,5,5)
C_rng = [2**k for k in k1_rng]
k2_rng = np.linspace(-5,5,5)
gamma_rng = [2**k for k in k2_rng]
k3_rng = np.linspace(2,5,5)
coef_rng = [2**k for k in k3_rng]
d_rng = np.linspace(3,5,3)


# for polynomial kernel

run = 1

for d in d_rng:
    print('dimension: ', d)
    
    for c in C_rng:
        print('c: ', c)
        
        for gamma in gamma_rng:
            print('gamma: ', gamma)
            
            for coef in coef_rng:
                print('coef: ', coef)

                sc = SVMClassifier(kernel='polynomial', C=c, gamma=gamma, d=d, coef=coef)

                acc_list = np.array(sc.fit(X_train, Y_train))
                print(acc_list)
                print('mean score: ', np.mean(acc_list))
                Y_Pred = sc.predict(X_test)
                print(Y_test, ' ', Y_Pred)
                acc = accuracy_score(Y_test, Y_Pred)
                print("Accuracy:",acc)
                print(classification_report(Y_test, Y_Pred))
                poly_perf.append(acc_list)
                poly_rpt.append(classification_report(Y_test, Y_Pred))

                run += 1

dimension:  3.0
c:  0.03125
gamma:  0.03125
coef:  2.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.506 0.506 0.506 0.507 0.507]
mean score:  0.5064
['positive' 'positive' 'positive' ... 'positive' 'positive' 'negative']   ['negative' 'negative' 'negative' ... 'negative' 'negative' 'negative']
Accuracy: 0.48733333333333334
              precision    recall  f1-score   support

    negative       0.49      1.00      0.66       731
    positive       0.00      0.00      0.00       769

    accuracy                           0.49      1500
   macro avg       0.24      0.50      0.33      1500
weighted avg       0.24      0.49      0.32      1500

coef:  4.0


  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.812 0.792 0.841 0.809 0.83 ]
mean score:  0.8168000000000001
['positive' 'positive' 'positive' ... 'positive' 'positive' 'negative']   ['negative' 'positive' 'negative' ... 'negative' 'positive' 'negative']
Accuracy: 0.828
              precision    recall  f1-score   support

    negative       0.81      0.85      0.83       731
    positive       0.85      0.81      0.83       769

    accuracy                           0.83      1500
   macro avg       0.83      0.83      0.83      1500
weighted avg       0.83      0.83      0.83      1500

coef:  8.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['po

['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.85  0.852 0.859 0.87  0.871]
mean score:  0.8603999999999999
['positive' 'positive' 'positive' ... 'positive' 'positive' 'negative']   ['negative' 'positive' 'positive' ... 'negative' 'positive' 'negative']
Accuracy: 0.8706666666666667
              precision    recall  f1-score   support

    negative       0.88      0.85      0.87       731
    positive       0.86      0.89      0.88       769

    accuracy                           0.87      1500
   macro avg       0.87      0.87      0.87      1500
weighted avg       0.87      0.87      0.87      1500

coef:  4.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'po

['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.834 0.843 0.859 0.847 0.863]
mean score:  0.8492000000000001
['positive' 'positive' 'positive' ... 'positive' 'positive' 'negative']   ['negative' 'positive' 'positive' ... 'negative' 'positive' 'negative']
Accuracy: 0.8473333333333334
              precision    recall  f1-score   support

    negative       0.84      0.85      0.84       731
    positive       0.85      0.85      0.85       769

    accuracy                           0.85      1500
   macro avg       0.85      0.85      0.85      1500
weighted avg       0.85      0.85      0.85      1500

gamma:  32.0
coef:  2.0
['positive' 'negative' 'positive' ... 'negative' '

['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.83  0.843 0.844 0.843 0.857]
mean score:  0.8433999999999999
['positive' 'positive' 'positive' ... 'positive' 'positive' 'negative']   ['negative' 'positive' 'positive' ... 'negative' 'positive' 'negative']
Accuracy: 0.842
              precision    recall  f1-score   support

    negative       0.84      0.84      0.84       731
    positive       0.85      0.85      0.85       769

    accuracy                           0.84      1500
   macro avg       0.84      0.84      0.84      1500
weighted avg       0.84      0.84      0.84      1500

coef:  32.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['p

['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'negative' ... 'positive' 'negative' 'negative']
['positive' 'positive' 'positive' ... 'positive' 'negative' 'negative']
[0.833 0.843 0.856 0.848 0.864]
mean score:  0.8488
['positive' 'positive' 'positive' ... 'positive' 'positive' 'negative']   ['negative' 'positive' 'positive' ... 'negative' 'positive' 'negative']
Accuracy: 0.848
              precision    recall  f1-score   support

    negative       0.84      0.85      0.84       731
    positive       0.85      0.85      0.85       769

    accuracy                           0.85      1500
   macro avg       0.85      0.85      0.85      1500
weighted avg       0.85      0.85      0.85      1500

coef:  16.0
['positive' 'negative' 'positive' ... 'negative' 'positive' 'positive']
['positive' 'po

KeyboardInterrupt: 

In [79]:
# self testing code - remove before submission
# Hyperparameter tuning for polynomial kernel

from sklearn.metrics import classification_report

custom_perf = []
custom_rpt = []

# for polynomial kernel
sc = SVMClassifier(kernel='custom')
acc_list = np.array(sc.fit(X_train, Y_train))
print(acc_list)
print('mean score: ', np.mean(acc_list))
Y_Pred = sc.predict(X_test)
print(Y_test, ' ', Y_Pred)
acc = accuracy_score(Y_test, Y_Pred)
print("Accuracy:",acc)
print(classification_report(Y_test, Y_Pred))
poly_perf.append(acc_list)
poly_rpt.append(classification_report(Y_test, Y_Pred))

IOPub data rate exceeded.
The notebook server will temporarily stop sending output
to the client in order to avoid crashing it.
To change this limit, set the config variable
`--NotebookApp.iopub_data_rate_limit`.

Current values:
NotebookApp.iopub_data_rate_limit=1000000.0 (bytes/sec)
NotebookApp.rate_limit_window=3.0 (secs)



ValueError: Unsupported input type. For more information check the documentation, concerning valid input types for graph type object.

### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [180]:
def test_func_svm(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score  
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    sc = SVMClassifier()
    sc.fit(X_train, Y_train)
    Y_Pred = sc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [100]:
acc = test_func_svm("movie_review_train.csv", "movie_review_test.csv")

NameError: name 'linear_kernel' is not defined

### Task 2

In this task you need to implement a boosting based classifier that can be used to classify the images. 

Details regarding the marking for the coursework are provided in the coursework specification file. Please ensure that your code will work with a different test file than the one provided with the coursework.

Note that the boosting classifier you implement can include decision trees from scikit-learn or your own decision trees. Use the same sentiment analysis dataset for evaluation.

In [4]:
class BoostingClassifier:
    # You need to implement this classifier. 
    def __init__(self):
        import numpy as np
        #implement initialisation
        self.some_paramter=1
    def fit(self, X,y):
        from sklearn.tree import DecisionTreeClassifier
        import numpy as np
        #implement training of the boosting classifier
        return 
    def predict(self, X):
        # implement prediction of the boosting classifier
        return

### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [5]:
def test_func_boosting(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score    
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    Y_Pred = bc.predict(X_test)    
    acc = accuracy_score(Y_test, Y_Pred)
    return acc

In [None]:
acc = test_func_boosting("movie_review_train.csv", "movie_review_test.csv")