### Coursework 2

In this coursework you will complete two text classification tasks.

One task is to be solved using Support Vector Machines. The other has to be solved using Boosting.

The specific tasks and their marking are given in this notebook. Each task must be accompanied by a concise lab report of at most one A4 page. Submit your Jupyter Notebook and all lab reports as a single zip file. You may implement any additional functions you need for each task.


#### Task 1

In this task you perform sentiment analysis on the provided dataset. The dataset consists of movie reviews, each labelled with a positive or negative sentiment. Train an SVM-based classifier on the training set and check it on the sample test set provided. Your method will also be evaluated against an external test set. Do not hardcode any dimensions or sample counts in your code: testing is automated, and hardcoded values prevent it from working on other files.

You are allowed to use scikit-learn to implement the SVM. However, you are expected to write your own kernels.

You may use existing libraries such as scikit-learn and numpy to train the SVM. The main aim is to analyse the dataset using different kinds of kernels, including custom text kernels that you write yourself. See section 1.4.6.2 of the scikit-learn documentation [here](https://scikit-learn.org/stable/modules/svm.html) and the example [here](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html) for how to write your own kernels.
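
As a rough illustration of what a callable kernel looks like, the sketch below passes a cosine-similarity kernel to `SVC`. The kernel choice, the name `cosine_kernel`, and the toy data are assumptions made for this example only; they are not part of the coursework specification.

```python
import numpy as np
from sklearn.svm import SVC


def cosine_kernel(X, Y):
    """Cosine similarity between every row of X and every row of Y.

    Returns a Gram matrix of shape (X.shape[0], Y.shape[0]), which is the
    shape scikit-learn expects from a callable kernel.
    """
    X_norm = np.linalg.norm(X, axis=1, keepdims=True)
    Y_norm = np.linalg.norm(Y, axis=1, keepdims=True)
    # Guard against all-zero rows (e.g. a review with no known words)
    X_norm[X_norm == 0] = 1.0
    Y_norm[Y_norm == 0] = 1.0
    return (X / X_norm) @ (Y / Y_norm).T


# Toy data standing in for the text features; all sizes are read from the arrays
rng = np.random.default_rng(0)
X_toy = rng.random((40, 5))
y_toy = np.where(X_toy[:, 0] > 0.5, 1, -1)

clf = SVC(kernel=cosine_kernel)
clf.fit(X_toy, y_toy)
toy_pred = clf.predict(X_toy)

# At prediction time the kernel is called with (test rows, training rows)
gram = cosine_kernel(X_toy, X_toy[:3])
```

The only contract is the output shape: one row per sample in the first argument and one column per sample in the second, so nothing about the dataset size needs to be hardcoded.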

Details regarding the marking have been provided in the coursework specification file. Ensure that the code can be run with different test files. 

In [1]:
import pandas as pd
import numpy as np
import csv
import nltk
import string
import re
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix, classification_report

## For stopwords that we need to exclude
from nltk.corpus import stopwords

## For ngrams example in task2
from nltk import ngrams

## For Stemming
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

import seaborn as sns
import matplotlib.pyplot as plt

In [2]:
# Build the stop-word list, keeping negations because they carry sentiment
all_stopwords = stopwords.words('english')
negations = {'not', "don't", "don'"}
new_stop = [word for word in all_stopwords if word not in negations]

In [3]:
def word_weight(y_pred_test):
    """Sum the feature vectors of the misclassified test reviews.

    Uses the global X_test and Y_test (labels in {-1, 1}) created in a later cell.
    Returns per-feature totals for the false positives and for the false negatives.
    """
    # Indices of reviews predicted positive but labelled negative, and vice versa
    false_pos = [i for i in range(len(y_pred_test))
                 if y_pred_test[i] == 1 and Y_test[i] == -1]
    false_neg = [i for i in range(len(y_pred_test))
                 if y_pred_test[i] == -1 and Y_test[i] == 1]

    false_pos_lst = [X_test[i] for i in false_pos]
    false_neg_lst = [X_test[i] for i in false_neg]

    false_pos_weight = np.sum(false_pos_lst, axis=0).tolist()
    false_neg_weight = np.sum(false_neg_lst, axis=0).tolist()

    return false_pos_weight, false_neg_weight

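def Nmaxelements(list1, N):
    # Return up to N of the largest positive entries of list1
    # as (index, value) pairs, largest first.
    final_list = []
    list2 = list(enumerate(list1))
    for i in range(0, N):
        max1 = 0
        max_position = None

        for j in range(len(list2)):
            if list2[j][1] > max1:
                max1 = list2[j][1]
                max_position = list2[j][0]

        # Stop early when no positive entry is left
        if max_position is None:
            break
        final_list.append((max_position, max1))
        list2.remove((max_position, max1))

    return final_list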
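
The same kind of top-N selection can be sketched more compactly with `np.argsort`. The helper name `top_n_indices` is hypothetical, and unlike `Nmaxelements` this version also returns zero or negative entries if fewer than N positive values exist.

```python
import numpy as np


def top_n_indices(values, n):
    """Return the n largest entries of `values` as (index, value) pairs, largest first."""
    values = np.asarray(values, dtype=float)
    n = min(n, values.size)  # never ask for more entries than exist
    order = np.argsort(values)[::-1][:n]
    return [(int(i), float(values[i])) for i in order]


example = top_n_indices([0.1, 0.7, 0.0, 0.4], 2)
print(example)  # [(1, 0.7), (3, 0.4)]
```

Pairing the returned indices with the vectorizer's feature names gives the words that contribute most to the misclassified reviews.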

In [4]:
def clean(dataset):
    # Strip non-letters, lower-case, drop stop words, stem,
    # and discard tokens shorter than 3 characters.
    ps = PorterStemmer()
    stop_set = set(new_stop)  # build the set once instead of once per word
    corpus = []
    for text in dataset['review']:
        review = re.sub('[^a-zA-Z]', ' ', text).lower().split()
        review = [ps.stem(word) for word in review if word not in stop_set]
        review = [word for word in review if len(word) >= 3]
        corpus.append(' '.join(review))
    return corpus

def numerial(x):
    if x == 'positive':
        return 1
    else:
        return 0
    
cv = TfidfVectorizer()

# Process the text and obtain bag-of-words features

In [5]:
def extract_bag_of_words_train_test(train_file, test_file):
    # Read the CSV files and extract TF-IDF bag-of-words features
    train = pd.read_csv(train_file)
    test_old = pd.read_csv(test_file)
    # Keep only the first two columns; copy so the assignment below does not write to a view
    test = test_old.iloc[:, :2].copy()

    # Map 'positive' to 1 and every other non-null label to 0
    train['sentiment'] = train[pd.notnull(train['sentiment'])]['sentiment'].apply(numerial)
    test['sentiment'] = test[pd.notnull(test['sentiment'])]['sentiment'].apply(numerial)

    corpus_train = clean(train)
    corpus_test = clean(test)

    # Fit the vectorizer on the training corpus only, then reuse its vocabulary for the test corpus
    X_train = cv.fit_transform(corpus_train).toarray()
    # Labels are read from the last column and rescaled from {0, 1} to {-1, 1}
    y_train = train.iloc[:, -1].values
    y_train = y_train*2-1
    X_test = cv.transform(corpus_test).toarray()
    y_test = test.iloc[:, -1].values
    y_test = y_test*2-1

    return (X_train,y_train,X_test,y_test)

(X_train,Y_train,X_test,Y_test) = extract_bag_of_words_train_test('movie_review_train.csv', 'movie_review_test.csv')
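
The fit-on-train, transform-on-test pattern used above can be seen in isolation on a few made-up documents. The strings below are invented for illustration and are not taken from the coursework dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Invented, already-cleaned documents
train_docs = ["great movi love", "terribl plot bore", "love act great"]
test_docs = ["bore movi", "unseen word great"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_docs).toarray()  # vocabulary is learned here
X_te = vec.transform(test_docs).toarray()       # unseen words are ignored

# Dimensions come from the data, so nothing is hardcoded
n_train, n_features = X_tr.shape
```

Because the test matrix is built with the training vocabulary, both matrices have the same number of columns whatever test file is supplied.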

# Hold-out Validation Split

In [6]:
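# get_feature_names() was removed in scikit-learn 1.2;
# get_feature_names_out() requires scikit-learn >= 1.0
words = cv.get_feature_names_out()
from sklearn.model_selection import train_test_split
# Hold out 20% of the training data for validation
X_train, X_val, Y_train, Y_val = train_test_split(X_train, Y_train, test_size = 0.20, random_state = 32)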
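
One possible use of such a hold-out split is to compare kernel settings before touching the test set. The sketch below does this on synthetic data; the factory `make_poly_kernel`, the candidate degrees, and the data are all assumptions for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC


def make_poly_kernel(degree, gamma=1.0):
    """Build a callable kernel K(X, Y) = (gamma * X Y^T) ** degree."""
    def poly_kernel(X, Y):
        return (gamma * X @ Y.T) ** degree
    return poly_kernel


# Synthetic stand-in for the feature matrix and the {-1, 1} labels
rng = np.random.default_rng(0)
X_demo = rng.random((200, 10))
y_demo = np.where(X_demo[:, 0] + X_demo[:, 1] > 1.0, 1, -1)

X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=32)

# Scale gamma by the number of features, read from the data
gamma = 1.0 / X_tr.shape[1]

val_scores = {}
for degree in (1, 2, 3):
    clf = SVC(kernel=make_poly_kernel(degree, gamma))
    clf.fit(X_tr, y_tr)
    val_scores[degree] = accuracy_score(y_va, clf.predict(X_va))

best_degree = max(val_scores, key=val_scores.get)
```

Whichever setting scores best on the validation rows would then be retrained on the full training set before the final test run.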

# Custom Kernel (Polynomial Kernel)

In [7]:
gamma = 1
degree = 2

def kernel(X, Y):
    # Homogeneous polynomial kernel: K[i, j] = (gamma * <X[i], Y[j]>) ** degree
    return (gamma * X.dot(Y.T)) ** degree

class SVMClassifier:
    def __init__(self):
        self.clf = SVC(kernel = kernel)
        
    def fit(self, X, Y):
        return self.clf.fit(X,Y)
        
    def predict(self, X):
        return self.clf.predict(X)
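
A quick way to sanity-check a hand-written kernel is to confirm that its Gram matrix on some sample points is symmetric and positive semi-definite. The sketch below does this for a polynomial kernel of the same form as the one above; the random data and the tolerance are assumptions.

```python
import numpy as np


def poly_kernel(X, Y, gamma=1.0, degree=2):
    """Same form as the notebook's kernel, restated so the check runs on its own."""
    return (gamma * X @ Y.T) ** degree


rng = np.random.default_rng(0)
X_check = rng.random((30, 8))

K = poly_kernel(X_check, X_check)
is_symmetric = np.allclose(K, K.T)

# Eigenvalues of the symmetrised Gram matrix; allow tiny negative round-off
eigvals = np.linalg.eigvalsh((K + K.T) / 2)
is_psd = eigvals.min() > -1e-8 * abs(eigvals.max())
```

A kernel that fails this check on real feature rows is likely to make the SVM optimisation ill-posed, so it is worth running before a long training job.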

### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [8]:
def test_func_svm(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score  
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    sc = SVMClassifier()
    sc.fit(X_train, Y_train)
    Y_Pred = sc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [9]:
acc = test_func_svm("movie_review_train.csv", "movie_review_test.csv")

Accuracy: 0.8753333333333333


#### Task 2

In this task you need to implement a boosting-based classifier that can be used to classify the movie reviews.

Details regarding the marking for the coursework are provided in the coursework specification file. Please ensure that your code will work with a different test file than the one provided with the coursework.

Note that the boosting classifier you implement can include decision trees from scikit-learn or your own decision trees. Use the same sentiment analysis dataset for evaluation.

In [10]:
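def compute_error(y, y_pred, w):
    # Weighted misclassification rate of a weak classifier
    return np.sum(w * np.not_equal(y, y_pred).astype(int)) / np.sum(w)

def compute_alpha(error):
    # Weight of a weak classifier in the final vote;
    # clip the error to avoid division by zero or log(0)
    error = np.clip(error, 1e-10, 1 - 1e-10)
    return np.log((1 - error) / error)

def update_weights(w, alpha, y, y_pred):
    # Multiply the weights of misclassified samples by exp(alpha)
    return w * np.exp(alpha * np.not_equal(y, y_pred).astype(int))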
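
To make the three helpers concrete, here is one boosting round worked through on five hand-picked labels. The helpers are restated in compact form so the snippet runs on its own; the label arrays are invented.

```python
import numpy as np


def compute_error(y, y_pred, w):
    return np.sum(w * (y != y_pred)) / np.sum(w)


def compute_alpha(error):
    return np.log((1 - error) / error)


def update_weights(w, alpha, y, y_pred):
    return w * np.exp(alpha * (y != y_pred))


y = np.array([1, 1, -1, -1, 1])
y_pred = np.array([1, -1, -1, -1, 1])  # only the second sample is wrong
w = np.ones(len(y)) / len(y)           # uniform starting weights of 0.2

error = compute_error(y, y_pred, w)           # 0.2
alpha = compute_alpha(error)                  # log(0.8 / 0.2) = log(4)
w_new = update_weights(w, alpha, y, y_pred)   # [0.2, 0.8, 0.2, 0.2, 0.2]
```

The misclassified sample's weight grows by a factor of exp(alpha) = 4, so the next weak classifier is pushed to get that sample right.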

In [11]:
from sklearn.tree import DecisionTreeClassifier
class BoostingClassifier():
    
    def __init__(self):
        self.alphas = []
        self.weak_classifier = []
        self.n_estimater = None
        self.train_error = []

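    def fit(self, X, y, n_estimater = 400):
        # Reset all state so that calling fit() again does not keep stale weak classifiers
        self.alphas = [] 
        self.weak_classifier = []
        self.train_error = []
        self.n_estimater = n_estimater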

        # Iterate weak classifiers
        for m in range(0, n_estimater):
            
            # Set weights for current boosting
            if m == 0:
                w = np.ones(len(y)) * 1 / len(y)
            else:
                # Update weight
                w = update_weights(w, alpha_m, y, y_pred)
            
            # Fit weak classifier and predict
            tree = DecisionTreeClassifier(max_depth=1)
            tree.fit(X, y, sample_weight = w)
            y_pred = tree.predict(X)
            
            # Save to list of weak classifiers
            self.weak_classifier.append(tree)

            # Compute error
            error_m = compute_error(y, y_pred, w)
            self.train_error.append(error_m)

            # Compute alpha
            alpha_m = compute_alpha(error_m)
            self.alphas.append(alpha_m)

        assert len(self.weak_classifier) == len(self.alphas)
        
    def predict(self, X):

        # Accumulate the alpha-weighted votes of the weak classifiers
        scores = np.zeros(len(X))
        for m in range(self.n_estimater):
            scores += self.alphas[m] * self.weak_classifier[m].predict(X)

        # Final prediction is the sign of the weighted vote
        y_pred = np.sign(scores).astype(int)

        return y_pred
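
For the lab report it can be useful to see how accuracy changes as weak classifiers are added. The sketch below computes such a curve on synthetic data. The helper `staged_accuracy`, the toy data, the 20 rounds, and the weight normalisation inside the loop are assumptions of this example rather than part of the class above.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier


def staged_accuracy(classifiers, alphas, X, y):
    """Accuracy of the weighted vote after 1, 2, ..., M weak classifiers."""
    preds = np.array([clf.predict(X) for clf in classifiers])  # (M, n_samples)
    weighted = np.asarray(alphas)[:, None] * preds
    running = np.cumsum(weighted, axis=0)                       # partial votes
    staged_pred = np.where(running >= 0, 1, -1)
    return (staged_pred == np.asarray(y)).mean(axis=1)


# Synthetic data with labels in {-1, 1}
rng = np.random.default_rng(0)
X_toy = rng.random((200, 4))
y_toy = np.where(X_toy[:, 0] + X_toy[:, 1] > 1.0, 1, -1)

# A short boosting loop with decision stumps
stumps, alphas = [], []
w = np.full(len(y_toy), 1.0 / len(y_toy))
for _ in range(20):
    stump = DecisionTreeClassifier(max_depth=1).fit(X_toy, y_toy, sample_weight=w)
    miss = stump.predict(X_toy) != y_toy
    err = np.clip(np.sum(w * miss) / np.sum(w), 1e-10, 1 - 1e-10)
    alpha = np.log((1 - err) / err)
    w = w * np.exp(alpha * miss)
    w = w / w.sum()  # normalisation keeps the weights numerically stable
    stumps.append(stump)
    alphas.append(alpha)

acc_curve = staged_accuracy(stumps, alphas, X_toy, y_toy)
```

Plotting `acc_curve` for both the training and validation rows is one way to choose the number of estimators without refitting the ensemble for each candidate value.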

### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [12]:
def test_func_boosting(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    Y_Pred = bc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [13]:
acc = test_func_boosting("movie_review_train.csv", "movie_review_test.csv")

Accuracy: 0.8346666666666667
