### Coursework 2

In this coursework you will complete two classification tasks, both of which are text classification tasks.

One task is to be solved using Support Vector Machines. The other has to be solved using Boosting.

The specific tasks and their marking are described in the notebook. Each task must be accompanied by a concise lab report of at most one A4 page. Submit your Jupyter Notebook and all lab reports as a single zip file. You may implement any additional functions that you need for each task.


#### Task 1

In this task, you need to perform sentiment analysis on the provided dataset. The dataset consists of movie reviews, each labelled with a sentiment that is either positive or negative. You need to train an SVM-based classifier on the training set and check it on the sample test dataset provided. The method will also be evaluated against an external test set. Please do not hardcode any dimensions or numbers of samples in your code: testing is automated, and hardcoded values break automated testing.

You may use existing library functions from scikit-learn or NumPy to implement the SVM. However, you are expected to write your own kernels, including custom text kernels. The main idea is to analyse the dataset using different kinds of kernels. Refer to the documentation provided [here](https://scikit-learn.org/stable/modules/svm.html) (Section 1.4.6.2) and an example [here](https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html) for writing your own kernels.
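
As a minimal sketch of what "writing your own kernel" means in scikit-learn: `svm.SVC` accepts a Python callable as `kernel`, calls it with two data matrices, and expects the Gram matrix back. The function name `linear_kernel` and the toy data below are illustrative assumptions, not part of the coursework material.

```python
import numpy as np
from sklearn import svm

def linear_kernel(X, Y):
    """Gram matrix of plain dot products, shape (n_samples_X, n_samples_Y)."""
    return X @ Y.T

# Tiny, made-up, linearly separable data (illustrative only)
X_toy = np.array([[0.0, 0.0], [0.2, 0.1], [1.0, 1.0], [0.9, 1.1]])
y_toy = np.array([-1, -1, 1, 1])

# Pass the callable itself; SVC invokes it during fit and predict
toy_clf = svm.SVC(kernel=linear_kernel, C=1.0)
toy_clf.fit(X_toy, y_toy)
toy_pred = toy_clf.predict(np.array([[0.1, 0.0], [1.0, 0.9]]))
```

Any function with the same two-argument signature that returns a valid (positive semi-definite) Gram matrix can be swapped in.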

Details regarding the marking have been provided in the coursework specification file. Ensure that the code can be run with different test files. 

#### Process the text and obtain bag-of-words features

In [110]:
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize
from bs4 import BeautifulSoup
from nltk.util import ngrams
from nltk import FreqDist
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
import unidecode
import spacy
import en_core_web_sm
import contractions
import string
from tqdm.notebook import tqdm # for showing progress bar
from sklearn.model_selection import GridSearchCV
from textblob import TextBlob
from collections import defaultdict

nltk.download('stopwords')

# initialization
pattern = re.compile(r"(.)\1{2,}")
punc_translator = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
num_translator = str.maketrans(string.digits, ' ' * len(string.digits))
nlp = en_core_web_sm.load()
stopwords = stopwords.words('english')
custom_negation = ['rather', 'instead']
    
def pre_processing(dataset):
    
    to_return = []
    
    for i in tqdm(range(len(dataset))):
        filtered_data = []
        
        # (1) remove html tags
        dataset[i] = BeautifulSoup(dataset[i], "html.parser").get_text() # explicit parser avoids the "no parser specified" warning
        
        # (2) remove urls
        dataset[i] = re.sub(r'http\S+', '', dataset[i])
        dataset[i] = re.sub(r'www\S+', '', dataset[i])
        
        # (3) remove email addresses
        dataset[i] = re.sub(r'\S*@\S*\s?', '', dataset[i])
        
        # (3b) convert to lower case
        dataset[i] = dataset[i].casefold()
        
        # (4) convert accented character
        dataset[i] = unidecode.unidecode(dataset[i]) 
        
        # (5) if there are >2 consecutive duplicated characters, convert to 2 consecutive duplicated characters
        # e.g. finallllly --> finally
        dataset[i] = pattern.sub(r"\1\1", dataset[i]) 
        
        # (6) expand contractions
        dataset[i] = contractions.fix(dataset[i])
        
        # (7) replace punctuation with space
        dataset[i] = dataset[i].translate(punc_translator)
        
        # (8) replace numbers with space
        dataset[i] = dataset[i].translate(num_translator)
        
        # (9) spacy tokenization
        tokens = nlp(dataset[i])
            
        # texts of tokens tagged as named entities (e.g. names), computed once per review
        names = {ent.text for ent in tokens if ent.ent_type_}
            
        for token in tokens:
            
            # Lemmatisation
            word = token.lemma_
            
            # filter out words that are:
            # - stopwords
            # - with length <= 2
            # - demonstratives (e.g. this, that, those)
            # - pronoun and proper nouns (e.g. names)
            # - spaces
            
            # note: the "-PRON-" lemma is only produced by spaCy v2
            if (word != "-PRON-") and (word !="-PROPN-") and (word not in names) and (not token.is_space):
               
                #print(word)

                if (token.dep_ == 'neg') or (word in custom_negation):
                    filtered_data.append('_NEG_')
                    continue
                
                # remove the word "like" when it is used as preposition
                if (word == 'like' and token.dep_ == 'prep'):
                    continue
                
                # remove stopwords
                if (word in stopwords):
                    continue

                # remove words with len <= 2
                elif (len(word) <= 2):
                    continue

                else:
                    filtered_data.append(word)
        
        # join words
        filtered_data = ' '.join(filtered_data)
        
        # Negation tagging
        filtered_data = re.sub(r'_NEG_\s', '_NEG_', filtered_data)
        filtered_data = re.sub(r"(_NEG_)\1{1,}", '_NEG_', filtered_data) # remove duplicated negation tagging
        
        to_return.append(filtered_data)
        
    return to_return

def extract_bag_of_words_train_test(train_file, test_file):
    
    # Read the CSV files for training and test sets
    train = pd.read_csv(train_file)
    test = pd.read_csv(test_file)
    
    X_train = np.array(train.review)
    X_test = np.array(test.review)
    
    y_train = np.array(train.sentiment)
    y_train[y_train=='positive'] = 1
    y_train[y_train=='negative'] = -1
    y_train = y_train.astype('int')
    
    y_test = np.array(test.sentiment)
    y_test[y_test=='positive'] = 1
    y_test[y_test=='negative'] = -1
    y_test = y_test.astype('int')
    
    # Extract bag of words features
    print("Train set: ")
    print("Preprocessing progress: ")
    X_train = pre_processing(X_train) 
    print("--Done--\n")
    print("Test set: ")
    print("Preprocessing progress: ")
    X_test = pre_processing(X_test)
    print('--Done--')
    
    return (X_train,y_train,X_test,y_test)

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/Sayuri/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
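
The two regex steps in `pre_processing` (squeezing elongated characters and merging negation tags into the following word) are easy to check in isolation. The sketch below reuses the same patterns; the helper names `squeeze` and `merge_negation` and the sample strings are made up for illustration.

```python
import re

# Same pattern as in the notebook: any character repeated 3 or more times
elongation = re.compile(r"(.)\1{2,}")

def squeeze(text):
    """Collapse runs of 3+ identical characters down to exactly 2."""
    return elongation.sub(r"\1\1", text)

def merge_negation(text):
    """Glue each _NEG_ tag to the next word and drop repeated tags."""
    text = re.sub(r"_NEG_\s", "_NEG_", text)
    return re.sub(r"(_NEG_)\1{1,}", "_NEG_", text)

squeezed = squeeze("finallllly sooooo good")
tagged = merge_negation("_NEG_ _NEG_ good film")
```

Note that squeezing leaves two repeated characters, so a word such as "sooooo" becomes "soo" rather than "so"; the later lemmatisation and stopword steps do not undo this.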


In [111]:
# self testing code - remove before submission
(X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test("movie_review_train.csv", "movie_review_test.csv")

Train set: 
Preprocessing progress: 


  0%|          | 0/5000 [00:00<?, ?it/s]

--Done--

Test set: 
Preprocessing progress: 


  0%|          | 0/1500 [00:00<?, ?it/s]

--Done--
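
Both classifiers in this notebook convert the preprocessed strings into TF-IDF vectors with `TfidfVectorizer`. The following sketch shows the fit/transform split on three made-up documents; the smaller `ngram_range` and the variable names are assumptions chosen to keep the example readable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs_fit = ["good film", "_NEG_good film", "great acting"]
docs_new = ["good acting", "unseen words only"]

toy_vec = TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True)

# The vocabulary is learned from the fitting documents only
A = toy_vec.fit_transform(docs_fit)

# New documents are mapped onto that fixed vocabulary;
# terms never seen during fitting are silently ignored
B = toy_vec.transform(docs_new)

# TfidfVectorizer lowercases by default, so the tag is stored as "_neg_good"
has_neg_feature = "_neg_good" in toy_vec.vocabulary_
```

Because the vectorizer is fitted on the training reviews only, the test matrix always has the same number of columns as the training matrix, whatever the size of the test file.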


In [113]:
from sklearn import svm
from sklearn import metrics
from sklearn.metrics import accuracy_score 
from collections import defaultdict

class SVMClassifier:
    def __init__(self, kernel='rbf', C=1.6058997806999291, gamma=0.8101577349324269, degree=3):
        
        #implement initialisation
        self.clf = svm.SVC()
        self.kernel = kernel
        
        # regularization parameter
        self.C = C # penalty parameter
        
        # kernel parameters
        self.gamma = gamma # RBF kernel coefficient
        self.d = degree # polynomial degree, read by fit() when kernel='poly'

        self.vectorizer = TfidfVectorizer(min_df = 2, # remove words that appear too rarely
                                          max_df = 0.7, # remove words that appear too often
                                          ngram_range=(1,5), # 1- to 5-grams
                                          max_features=30000,
                                          smooth_idf = True, # +1 to all frequencies, prevent division by zero
                                          sublinear_tf = True #use log for TF, clip extreme values
                                          )
        
    # define your own kernel here
    # Refer to the documentation here: https://scikit-learn.org/stable/auto_examples/svm/plot_custom_kernel.html
  
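    def custom_kernel(self, X, Y):
        # SVC calls this with two sparse TF-IDF matrices and expects
        # the Gram matrix of shape (X.shape[0], Y.shape[0])
        
        # densify the sparse inputs
        X = X.toarray()
        Y = Y.toarray()
        
        print('Computing custom kernel...')
        
        # 1. Histogram intersection kernel (0.86)
        # ---------------- BEGIN ----------------#
        kernel = np.zeros((X.shape[0], Y.shape[0]))

        # accumulate min(X[i, d], Y[j, d]) one feature column at a time
        for d in tqdm(range(X.shape[1])):
            column_1 = X[:, d].reshape(-1, 1)
            column_2 = Y[:, d].reshape(-1, 1)
            kernel += np.minimum(column_1, column_2.T)

        # ------------------ END -----------------#
        
        return kernel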
    
    def fit(self, X, y):
        # training of the SVM
        # Ensure you call your own defined kernel here

        # Transform data into tfidf feature vectors
        X = self.vectorizer.fit_transform(X)

        # calling diff kernels
        if self.kernel == 'linear':
            self.clf = svm.SVC(kernel='linear', C=self.C)

        elif self.kernel == 'poly':
            self.clf = svm.SVC(kernel='poly', C=self.C, degree=self.d)

        elif self.kernel == 'rbf':
            # for hyperparameter tuning
            self.clf = svm.SVC(kernel='rbf', C=self.C, gamma=self.gamma)

        elif self.kernel == 'custom':
            self.clf = svm.SVC(kernel=self.custom_kernel, C=self.C)
        
        self.clf.fit(X,y)
    
    def predict(self, X):
        
        # prediction routine for the SVM
        X = self.vectorizer.transform(X)
        
        return self.clf.predict(X)

### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [114]:
def test_func_svm(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score  
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    sc = SVMClassifier()
    sc.fit(X_train, Y_train)
    Y_Pred = sc.predict(X_test)
    acc = accuracy_score(Y_test, Y_Pred)
    print("Accuracy:",acc)
    return acc

In [115]:
acc = test_func_svm("movie_review_train.csv", "movie_review_test.csv")

Train set: 
Preprocessing progress: 


  0%|          | 0/5000 [00:00<?, ?it/s]

--Done--

Test set: 
Preprocessing progress: 


  0%|          | 0/1500 [00:00<?, ?it/s]

--Done--
Accuracy: 0.888
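
The histogram intersection kernel used as the custom kernel is defined as K(x, y) = sum over d of min(x_d, y_d). As a hedged sanity check, the sketch below compares a column-by-column accumulation with a direct broadcasted evaluation of that definition on random non-negative data; the function name and the random inputs are illustrative assumptions.

```python
import numpy as np

def hist_intersection(A, B):
    """K[i, j] = sum_d min(A[i, d], B[j, d]) for dense arrays A and B."""
    K = np.zeros((A.shape[0], B.shape[0]))
    for d in range(A.shape[1]):
        # column d of A against column d of B, broadcast to an (n_A, n_B) grid
        K += np.minimum(A[:, d][:, None], B[:, d][None, :])
    return K

rng = np.random.default_rng(0)
A_demo = rng.random((4, 6))
B_demo = rng.random((3, 6))

K_loop = hist_intersection(A_demo, B_demo)

# Reference: evaluate the definition in one broadcasted expression
K_ref = np.minimum(A_demo[:, None, :], B_demo[None, :, :]).sum(axis=2)
```

The per-column loop only ever holds one (n_A, n_B) array in memory, whereas the broadcasted form materialises an (n_A, n_B, n_features) array, which is a plausible reason to prefer the loop for thousands of reviews and tens of thousands of features.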


### Task 2

In this task you need to implement a boosting-based classifier that can be used to classify the movie reviews.

Details regarding the marking for the coursework are provided in the coursework specification file. Please ensure that your code will work with a different test file than the one provided with the coursework.

Note that the boosting classifier you implement can include decision trees from scikit-learn or your own decision trees. Use the same sentiment analysis dataset for evaluation.
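
To make the boosting loop concrete, here is a minimal sketch of discrete AdaBoost with scikit-learn decision stumps on a made-up one-dimensional dataset. It assumes labels in {-1, +1}; the variable names and the three-round budget are illustrative choices, not requirements of the coursework.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data that no single threshold can separate
X_1d = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_1d = np.array([1, 1, -1, -1, 1, 1])

n = len(y_1d)
w = np.full(n, 1.0 / n)          # start from uniform sample weights
stumps, alphas = [], []

for _ in range(3):
    stump = DecisionTreeClassifier(max_depth=1).fit(X_1d, y_1d, sample_weight=w)
    pred = stump.predict(X_1d)
    miss = pred != y_1d
    err = np.sum(w * miss) / np.sum(w)            # weighted error rate
    alpha = np.log((1.0 - err) / (err + 1e-10))   # vote of this stump
    w = w * np.exp(alpha * miss)                  # up-weight the mistakes
    stumps.append(stump)
    alphas.append(alpha)

# Final prediction: sign of the alpha-weighted vote
scores = sum(a * s.predict(X_1d) for a, s in zip(alphas, stumps))
ensemble_pred = np.sign(scores).astype(int)
```

Each round concentrates weight on the samples the previous stumps got wrong, so later stumps pick different thresholds even though the data never changes.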

In [171]:
from sklearn.tree import DecisionTreeClassifier
import numpy as np

class BoostingClassifier:
    # You need to implement this classifier. 
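    # Defaults must be valid DecisionTreeClassifier settings, because the test function calls BoostingClassifier() without arguments
    def __init__(self,n_clf=100, max_depth=1, criterion='gini', splitter='best'):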
        
        # Hyperparameter for AdaBoost
        self.n_clf=n_clf
        
        # Hyperparameters for decision tree
        self.max_depth = max_depth
        self.criterion = criterion
        self.splitter = splitter
        
        # TF-IDF vectorizer to convert the feature vectors
        self.tf_idf = TfidfVectorizer(min_df = 2, # remove words that appear too rarely
                                      max_df = 0.7, # remove words that appear too often
                                      sublinear_tf = True,
                                      ngram_range=(1,5),
                                      max_features=30000,
                                      smooth_idf = True
                                      )
        


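    def update_w(self, w, al, y, pred):
        # scale up the weights of misclassified samples by exp(alpha)
        return w * np.exp(al * (np.not_equal(y, pred)))
    
    def calc_err(self, y, pred, w):
        # weighted error rate of the weak classifier
        return np.sum(w * np.not_equal(y, pred)) / np.sum(w)
    
    def calc_alph(self, err):
        # classifier weight; eps avoids division by zero when err == 0
        eps = 1e-10
        return np.log((1 - err) / (err + eps))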
    
    def fit(self,X,y):
        n_samples = len(X)
        self.clfs=[]
        self.alpha=[]

        X = self.tf_idf.fit_transform(X)
        
        for m in tqdm(range(self.n_clf)):
            
            if m == 0:
                # init weights
                w = np.full(n_samples,(1/n_samples))
            else:
                # update weights
                w = self.update_w(w, alph, y, pred)
            
            clf = DecisionTreeClassifier(max_depth = self.max_depth, 
                                         criterion = self.criterion,
                                         splitter = self.splitter
                                        )

            clf = clf.fit(X, y, sample_weight = w)
            
            pred = clf.predict(X) # predictions made by the weak classifier
            
            # save classifier
            self.clfs.append(clf)
            
            # calculate error
            err = self.calc_err(y, pred, w)
            
            # cal alph 
            alph = self.calc_alph(err)
            self.alpha.append(alph)
            
    def predict(self, X):
        
        X = self.tf_idf.transform(X)
        
        # running sum of the alpha-weighted votes of the weak classifiers (decision trees)
        scores = np.zeros(X.shape[0])
        
        for clf, alpha in tqdm(zip(self.clfs, self.alpha), total=len(self.clfs)):
            scores += alpha * clf.predict(X)

        # Calculate final predictions: sign of the weighted vote
        return np.sign(scores).astype(int)

In [172]:
from datetime import datetime
from sklearn.metrics import accuracy_score, f1_score, classification_report

start = datetime.now()
clf=BoostingClassifier(n_clf=10000,  max_depth=1, criterion='gini', splitter='best')
clf.fit(X_train,Y_train)
y_pred= clf.predict(X_test)

acc=accuracy_score(Y_test,y_pred)
print("Accuracy :",acc)
classificationReport = classification_report(Y_test, y_pred)
print(classificationReport)
end = datetime.now()


  0%|          | 0/10000 [00:00<?, ?it/s]

  0%|          | 0/10000 [00:00<?, ?it/s]

Accuracy : 0.848
              precision    recall  f1-score   support

          -1       0.84      0.84      0.84       731
           1       0.85      0.85      0.85       769

    accuracy                           0.85      1500
   macro avg       0.85      0.85      0.85      1500
weighted avg       0.85      0.85      0.85      1500
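
One way to sanity-check a hand-written boosting classifier is to compare it with scikit-learn's own `AdaBoostClassifier`, whose default weak learner is a depth-1 decision tree. The sketch below only shows the reference side on synthetic data; the data, the split, and `n_estimators=50` are assumptions for illustration, and no result on the movie review dataset is implied.

```python
import numpy as np
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

# Synthetic two-class problem with labels in {-1, +1}
rng = np.random.default_rng(0)
X_ref = rng.random((200, 5))
y_ref = np.where(X_ref[:, 0] + X_ref[:, 1] > 1.0, 1, -1)

# Default weak learner: DecisionTreeClassifier(max_depth=1)
ref = AdaBoostClassifier(n_estimators=50, random_state=0)
ref.fit(X_ref[:150], y_ref[:150])

ref_pred = ref.predict(X_ref[150:])
ref_acc = accuracy_score(y_ref[150:], ref_pred)
```

Running the custom classifier and the reference on the same features with the same number of stumps should give accuracies in the same range; a large gap usually points to a bug in the weight update or the final vote.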



### Test function that will be called to evaluate your code. Separate test dataset will be provided

Do not modify the code below. Please write your code above such that it can be evaluated by the function below. You can modify your code above such that you obtain the best performance through this function. 

In [None]:
def test_func_boosting(dataset_train, dataset_test):
    from sklearn.metrics import accuracy_score    
    (X_train, Y_train, X_test, Y_test) = extract_bag_of_words_train_test(dataset_train, dataset_test)
    bc = BoostingClassifier()
    bc.fit(X_train, Y_train)
    Y_Pred = bc.predict(X_test)    
    acc = accuracy_score(Y_test, Y_Pred)
    return acc



In [None]:
acc = test_func_boosting("movie_review_train.csv", "movie_review_test.csv")