# COMP47590: Advanced Machine Learning
# Assignment 1: Multi-label Classification

Name(s): Raphael Hetherington

Student Number(s): 18200573

## Import Packages Etc

In [300]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import math
import seaborn as sns
import sklearn


## Task 0: Load the Yeast Dataset

In [303]:
# Write your code here
data = pd.read_csv("yeast.csv")
data.describe()


| | Att1 | Att2 | Att3 | Att4 | Att5 | Att6 | Att7 | Att8 | Att9 | Att10 | ... | Class5 | Class6 | Class7 | Class8 | Class9 | Class10 | Class11 | Class12 | Class13 | Class14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | ... | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 | 2417.0 |
| mean | 0.001173 | -0.000436 | -0.000257 | 0.000265 | 0.001228 | 0.000475 | 0.001107 | 0.00042 | 0.001076 | -9e-06 | ... | 0.298717 | 0.247 | 0.177079 | 0.198593 | 0.073645 | 0.104675 | 0.11957 | 0.751345 | 0.744311 | 0.014067 |
| std | 0.097411 | 0.097885 | 0.097746 | 0.096969 | 0.096909 | 0.097306 | 0.09717 | 0.096803 | 0.096326 | 0.096805 | ... | 0.45779 | 0.431356 | 0.381815 | 0.399024 | 0.261246 | 0.306198 | 0.324525 | 0.432323 | 0.436338 | 0.117792 |
| min | -0.371146 | -0.472632 | -0.339195 | -0.467945 | -0.367044 | -0.509447 | -0.319928 | -0.594498 | -0.369712 | -0.767128 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 25% | -0.053655 | -0.058734 | -0.057526 | -0.057149 | -0.058461 | -0.060212 | -0.058445 | -0.062849 | -0.063472 | -0.06501 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 50% | 0.003649 | -0.003513 | 0.002892 | -0.000153 | 0.005565 | 0.000321 | 0.006179 | 0.001436 | 0.003515 | 0.002432 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| 75% | 0.057299 | 0.048047 | 0.061007 | 0.054522 | 0.066286 | 0.059908 | 0.068892 | 0.061418 | 0.064958 | 0.063096 | ... | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 |
| max | 0.520272 | 0.614114 | 0.353241 | 0.56896 | 0.307649 | 0.336971 | 0.351401 | 0.454591 | 0.419852 | 0.420876 | ... | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |


## Task 1: Implement the Binary Relevance Algorithm

In [304]:
'''
To use the BinaryRelevance class, pass in a classifier class (not an instance) and the list of
label columns that you want to predict. The algorithm trains one independent binary model per
label and then combines the per-label predictions into a single dataframe.
'''
class BinaryRelevance():
    def __init__(self, classifier, class_labels):
        # classifier: a scikit-learn classifier class; it is instantiated once per label
        # class_labels: the label column names, needed to separate the labels from the features
        self.classifier = classifier
        self.class_labels = class_labels # should be a list
        self.models = {}
        self.labels_series = pd.Series(self.class_labels)


    def train(self, data_to_train):
        '''
        data_to_train: a dataframe holding both the feature columns and the label columns.

        For each class label we strip away the other labels and fit a model for that label alone.
        We're ultimately going to be using an aggregation of these models, so we store the models
        that we create in the models dictionary, keyed by class label.
        '''
        # the features are all the columns that are not the class labels
        features = data_to_train[data_to_train.columns[~data_to_train.columns.isin(self.labels_series)]]
        
        # for each class label we create a new model
        for class_label in self.class_labels:
            # the model is stored in self.models
            self.models[class_label] = self.classifier()
            # we select the class label column as our y
            y = data_to_train[[class_label]]
            # we train the model
            self.models[class_label].fit(features, y.values.ravel())


    # Inputs: this method receives a dataframe WITHOUT class labels and returns a dataframe with the class labels predicted
    def predict(self, features):
        # reset the index once so that the prediction columns line up row by row
        return_frame = features.reset_index(drop=True)
        for class_label in self.class_labels:
            model = self.models[class_label] # select the appropriate model from the dictionary
            prediction = model.predict(features)
            prediction_frame = pd.DataFrame(data=prediction, columns=[class_label]) # create a new df with the prediction
            return_frame = pd.concat([return_frame, prediction_frame], axis=1) # concatenate the class label with the features
        return return_frame
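
As a sanity check on the idea of one independent model per label, the sketch below compares scikit-learn's `MultiOutputClassifier` against fitting each label by hand. The toy data and the variable names (`X`, `Y`, `br`, `manual`) are hypothetical and are not part of the assignment code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import MultiOutputClassifier

# Hypothetical toy multi-label data (not the yeast dataset): 5 features, 3 labels
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
Y = np.column_stack([
    (X[:, 0] > 0).astype(int),
    (X[:, 1] + X[:, 2] > 0).astype(int),
    (X[:, 3] < 0.5).astype(int),
])

# MultiOutputClassifier fits one independent copy of the estimator per label column
br = MultiOutputClassifier(LogisticRegression())
br.fit(X, Y)
Y_pred = br.predict(X)

# Fitting one model per label by hand, as binary relevance does
manual = np.column_stack([
    LogisticRegression().fit(X, Y[:, j]).predict(X) for j in range(Y.shape[1])
])

print(Y_pred.shape, np.array_equal(Y_pred, manual))
```

Because each label is modelled in isolation, any correlation between labels is ignored, which is the limitation that classifier chains try to address.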
            
                


## Task 2: Implement the Binary Relevance Algorithm with Under-Sampling
Our objective is to balance the class distribution for each label. We can do this by:  
1) Evaluating the class distribution for each class  
2) If the classes are imbalanced, we can remove a subset of the data to balance them. 

To enact undersampling, I'm going to create a subclass of the BinaryRelevance class, called BinaryRelevanceWithUnderSampling. This class will have a method, balance_classes_and_train, which for each label will:
- assess the distribution of classes for that label
- produce a balanced dataframe **if required**, by sampling the majority class down to the size of the minority class
- train a model on the balanced dataframe
- revert to the original "full" dataframe before moving on to the next label, so as to keep as much data available as possible.

In [267]:
class BinaryRelevanceWithUnderSampling(BinaryRelevance):
    def __init__(self, classifier, labels):
        super().__init__(classifier, labels)
    
    
    def balance_classes_and_train(self, data_to_train):
        for class_label in self.class_labels:
            # first check the distribution of classes
            with_label = data_to_train.loc[data_to_train[class_label] == 1]
            without_label = data_to_train.loc[data_to_train[class_label] == 0]
            # find the minority and majority class (on a tie the two are already balanced)
            if len(with_label) <= len(without_label):
                shorter, longer = with_label, without_label
            else:
                shorter, longer = without_label, with_label
            # take a random subset of the majority class that is the same length as the minority class
            sub_sample = longer.sample(n=len(shorter), random_state=42)
            # create a new data frame that combines the two
            sub_sampled_df = pd.concat([shorter, sub_sample]).reset_index(drop=True)
            # create a classifier for this label and train it on the balanced data
            self.models[class_label] = self.classifier()
            features = sub_sampled_df[sub_sampled_df.columns[~sub_sampled_df.columns.isin(self.labels_series)]]
            y = sub_sampled_df[[class_label]]
            self.models[class_label].fit(features, y.values.ravel())
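
To make the balancing step concrete, here is a minimal stand-alone sketch of under-sampling a single label. The helper `undersample` and the toy dataframe are hypothetical and only mirror the logic of the method above.

```python
import pandas as pd

def undersample(df, label, seed=42):
    """Return a frame with equal numbers of rows where label == 1 and label == 0."""
    pos = df[df[label] == 1]
    neg = df[df[label] == 0]
    # the smaller group is kept whole; the larger group is sampled down to match it
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    sampled = majority.sample(n=len(minority), random_state=seed)
    return pd.concat([minority, sampled]).reset_index(drop=True)

# Hypothetical toy frame: 2 positives and 8 negatives for Class1
toy = pd.DataFrame({"x": range(10), "Class1": [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]})
balanced = undersample(toy, "Class1")
print(balanced["Class1"].value_counts().to_dict())
```

Note how much data is discarded: only 4 of the 10 rows survive. For rare labels this shrinks the training set substantially.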


## Task 3: Compare the Performance of Different Binary Relevance Approaches

### Evaluation Strategies

We are going to assess the performance of our algorithms in the following ways:

1. Experimenting with different base classifiers. In our case we will use three common classification algorithms:
    - Naive Bayes
    - Logistic Regression
    - K-Nearest Neighbours
2. Using Hamming loss, macro-averaged accuracy and macro-averaged F1 score to measure the performance of each classifier.
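
The three measures can be illustrated on a tiny example. This is only a sketch: the label matrices `Y_true` and `Y_pred` are made up, and macro-averaging is taken here to mean scoring each label column separately and averaging the results without weighting.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, hamming_loss

# Hypothetical true and predicted label matrices: 4 instances, 3 labels
Y_true = np.array([[1, 0, 1], [0, 1, 1], [1, 1, 0], [0, 0, 1]])
Y_pred = np.array([[1, 0, 0], [0, 1, 1], [0, 1, 0], [0, 1, 1]])

n_labels = Y_true.shape[1]

# Macro-averaging: score each label column on its own, then take the unweighted mean
macro_accuracy = np.mean([accuracy_score(Y_true[:, j], Y_pred[:, j]) for j in range(n_labels)])
macro_f1 = np.mean([f1_score(Y_true[:, j], Y_pred[:, j]) for j in range(n_labels)])

# Hamming loss: the fraction of individual label slots that were predicted incorrectly
h_loss = hamming_loss(Y_true, Y_pred)

print(macro_accuracy, macro_f1, h_loss)
```

The per-label loop for F1 gives the same value as `f1_score(Y_true, Y_pred, average='macro')`.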

In [324]:
from sklearn.metrics import f1_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import hamming_loss

    

# Create training and testing sets: the first 70% of the rows for training, the remaining 30% for testing
split_at = int(.7 * len(data))
train = data.iloc[:split_at] # 1691 rows
test = data.iloc[split_at:] # 726 rows


# True class labels for the test set, re-indexed from 0 so that they line up with the predictions
test_class_label_results = test.loc[:, 'Class1':].reset_index(drop=True)

# classifiers:

from sklearn.naive_bayes import GaussianNB # our test classifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
class_labels = ['Class1', 'Class2', 'Class3', 'Class4', 'Class5', 'Class6', 'Class7', 'Class8', 'Class9', 'Class10', 'Class11', 'Class12', 'Class13', 'Class14']


# Binary Relevance classifiers
br_reg_classifier = BinaryRelevance(LogisticRegression, class_labels) 
br_reg_classifier.train(train)
br_nb_classifier = BinaryRelevance(GaussianNB, class_labels) 
br_nb_classifier.train(train)
br_knn_classifier = BinaryRelevance(KNeighborsClassifier, class_labels)
br_knn_classifier.train(train)


# undersampling classifiers
us_reg_classifier = BinaryRelevanceWithUnderSampling(LogisticRegression, class_labels) 
us_reg_classifier.balance_classes_and_train(train)
us_nb_classifier = BinaryRelevanceWithUnderSampling(GaussianNB, class_labels) 
us_nb_classifier.balance_classes_and_train(train)
us_knn_classifier = BinaryRelevanceWithUnderSampling(KNeighborsClassifier, class_labels)
us_knn_classifier.balance_classes_and_train(train)






## Predictions

In [325]:
# Testing set without class labels
test_features = test[test.columns[~test.columns.isin(class_labels)]]


In [326]:
br_reg_predictions = br_reg_classifier.predict(test_features)
br_nb_predictions = br_nb_classifier.predict(test_features)
br_knn_predictions = br_knn_classifier.predict(test_features)
us_reg_predictions = us_reg_classifier.predict(test_features)
us_nb_predictions = us_nb_classifier.predict(test_features)
us_knn_predictions = us_knn_classifier.predict(test_features)


### Macro-averaged Accuracy

In [327]:
# macro-averaged accuracy for every binary relevance model, with and without undersampling

classifiersToEvaluate = {
    "binary_relevance_logistic_regression": br_reg_predictions,
    "binary_relevance_naive_bayes": br_nb_predictions,
    "binary_relevance_knn": br_knn_predictions,
    "undersampled_logistic_regression": us_reg_predictions,
    "undersampled_naive_bayes": us_nb_predictions,
    "undersampled_knn": us_knn_predictions
}

for key, value in classifiersToEvaluate.items():
    cumulative_accuracy = 0
    for class_label in class_labels:
        val_to_add = accuracy_score(test_class_label_results[class_label], value[class_label])
        cumulative_accuracy += val_to_add
    averaged_accuracy = cumulative_accuracy / len(class_labels)
    print("accuracy for ", key, " is ", averaged_accuracy)





accuracy for  binary_relevance_logistic_regression  is  0.7971271153089335
accuracy for  binary_relevance_naive_bayes  is  0.698150334513971
accuracy for  binary_relevance_knn  is  0.7888626524990162
accuracy for  undersampled_logistic_regression  is  0.6250491932310114
accuracy for  undersampled_naive_bayes  is  0.6201298701298701
accuracy for  undersampled_knn  is  0.6294765840220385


#### Results

Accuracy without undersampling (Logistic Regression, Naive Bayes, KNN): 80%, 70% and 79%   
Accuracy with undersampling (same order): 63%, 62% and 63%

### Macro-averaged F1 Measure

In [328]:
for key, value in classifiersToEvaluate.items():
    cumulative_f1 = 0
    for class_label in class_labels:
        val_to_add = f1_score(test_class_label_results[class_label], value[class_label])
        cumulative_f1 += val_to_add
    averaged_f1 = cumulative_f1 / len(class_labels)
    print("f1 measure for ", key, " is ", averaged_f1)

f1 measure for  binary_relevance_logistic_regression  is  0.3470882117611737
f1 measure for  binary_relevance_naive_bayes  is  0.44663364040197173
f1 measure for  binary_relevance_knn  is  0.40798443921699407
f1 measure for  undersampled_logistic_regression  is  0.45071613455204557
f1 measure for  undersampled_naive_bayes  is  0.4415479688600657
f1 measure for  undersampled_knn  is  0.4545034543133587




#### Results

F1 without undersampling (Logistic Regression, Naive Bayes, KNN): 35%, 45% and 41%   
F1 with undersampling (same order): 45%, 44% and 45%

### Hamming Loss

In [329]:
for key, value in classifiersToEvaluate.items():
    comparison_cols = value.loc[:, "Class1":]
    hamming_score = hamming_loss(test_class_label_results, comparison_cols)
    print("hamming loss for ", key, " is ", hamming_score)


hamming loss for  binary_relevance_logistic_regression  is  0.2028728846910665
hamming loss for  binary_relevance_naive_bayes  is  0.3018496654860291
hamming loss for  binary_relevance_knn  is  0.21113734750098387
hamming loss for  undersampled_logistic_regression  is  0.37495080676898856
hamming loss for  undersampled_naive_bayes  is  0.37987012987012986
hamming loss for  undersampled_knn  is  0.37052341597796146


#### Results

Hamming loss (lower is better) for Logistic Regression, Naive Bayes and KNN respectively:
- without undersampling: 20%, 30%, 21% - avg = 24%
- with undersampling: 37%, 38%, 37% - avg = 37%
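
The Hamming loss figures are closely tied to the macro-averaged accuracies reported earlier. As a sketch, assuming $N$ test instances, $L$ labels, true labels $y_{ij}$ and predictions $\hat{y}_{ij}$ (notation introduced here purely for illustration):

$$
\mathrm{HL} \;=\; \frac{1}{NL}\sum_{i=1}^{N}\sum_{j=1}^{L}\mathbb{1}\left[y_{ij}\neq\hat{y}_{ij}\right]
\;=\; 1 - \frac{1}{L}\sum_{j=1}^{L}\underbrace{\frac{1}{N}\sum_{i=1}^{N}\mathbb{1}\left[y_{ij}=\hat{y}_{ij}\right]}_{\text{accuracy for label } j}
$$

In other words, the Hamming loss is one minus the macro-averaged accuracy. For example, logistic regression without undersampling has a macro-averaged accuracy of 0.7971 and a Hamming loss of 0.2029, which sum to 1.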

## Task 4: Implement the Classifier Chains Algorithm

Classifier chains work by feeding the predictions for earlier labels into the models for later labels, so that each model in the chain can exploit correlations between labels. They differ from binary relevance in that the class labels are no longer treated as independent.

In order to implement a classifier chain algorithm, we need to:

Train:
- Train a model for the first class label with the feature set.
- Append the true values of that class label to the feature set.
- Train the next model with the new, extended feature set.
- Repeat for each remaining label.

Predict:
- Make a prediction for the first label with the test set instances.
- Append that prediction to the test set features and then move on to the next label.
- Repeat for each remaining label.


In [330]:
class ClassifierChains():
    def __init__(self, classifier, class_labels):
        # classifier: a scikit-learn classifier class; it is instantiated once per label
        # class_labels: the label column names, in the order in which they are chained
        self.classifier = classifier
        self.class_labels = class_labels # should be a list
        self.models = {}
        self.labels_series = pd.Series(self.class_labels)
         
    # in the training phase, we train the model for Class1, then 
    # concat the features with the true values for class1, then
    # use the extended feature set to train the model for Class2 etc. 
    def train(self, data_to_train):
        features = data_to_train[data_to_train.columns[~data_to_train.columns.isin(self.labels_series)]]
        # for each class label we create a new model
        for class_label in self.class_labels:
            # the model is stored in self.models
            self.models[class_label] = self.classifier()
            # we select the class label column as our y
            y = data_to_train[[class_label]]
            # we train the model
            self.models[class_label].fit(features, y.values.ravel())
            # chain on the class label
            features = pd.concat([features, y], axis=1)

    # for predicting, we have to chain on our predictions before we pass to the next model.
    def predict(self, features):
        # return_frame holds the original features plus every prediction made so far
        return_frame = features.reset_index(drop=True)
        for class_label in self.class_labels:
            model = self.models[class_label] # select the appropriate model from the dictionary
            prediction = model.predict(return_frame)
            prediction_frame = pd.DataFrame(data=prediction, columns=[class_label]) # create a new df with the prediction
            # Chaining! This is where we concatenate the prediction onto the feature space,
            # so that the next model in the chain receives it as an extra input column
            return_frame = pd.concat([return_frame, prediction_frame], axis=1)
        return return_frame
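
For comparison, scikit-learn ships its own `ClassifierChain`. The sketch below runs it on hypothetical toy data (not the yeast dataset) in which the second label depends on the first, and shows that the second model in the chain is trained on one extra input column.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.multioutput import ClassifierChain

# Hypothetical toy data: label 2 can only be positive when label 1 is positive
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 4))
y1 = (X[:, 0] > 0).astype(int)
y2 = ((X[:, 1] > 0) & (y1 == 1)).astype(int)
Y = np.column_stack([y1, y2])

# order=[0, 1] fixes the chain order: label 0 first, then label 1
chain = ClassifierChain(LogisticRegression(), order=[0, 1])
chain.fit(X, Y)
Y_pred = chain.predict(X)

# The second model sees the 4 original features plus the first label
print(Y_pred.shape, chain.estimators_[1].coef_.shape)
```

As in the class above, the true label values are appended during training and the predicted values during prediction, so errors made early in the chain can propagate to later labels.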


## Task 5: Evaluate the Performance of the Classifier Chains Algorithm

In [331]:
cc_reg_classifier = ClassifierChains(LogisticRegression, class_labels) 
cc_reg_classifier.train(train)
cc_nb_classifier = ClassifierChains(GaussianNB, class_labels) 
cc_nb_classifier.train(train)
cc_knn_classifier = ClassifierChains(KNeighborsClassifier, class_labels)
cc_knn_classifier.train(train)



In [332]:
cc_reg_predictions = cc_reg_classifier.predict(test_features)
cc_nb_predictions = cc_nb_classifier.predict(test_features)
cc_knn_predictions = cc_knn_classifier.predict(test_features)

cc_evaluate = {
    "cc_logistic_regression": cc_reg_predictions,
    "cc_naive_bayes": cc_nb_predictions,
    "cc_knn": cc_knn_predictions,
}

### Accuracy

In [333]:
for key, value in cc_evaluate.items():
    cumulative_accuracy = 0
    for class_label in class_labels:
        val_to_add = accuracy_score(test_class_label_results[class_label], value[class_label])
        cumulative_accuracy += val_to_add
    averaged_accuracy = cumulative_accuracy / len(class_labels)
    print("accuracy for ", key, " is ", averaged_accuracy)


accuracy for  cc_logistic_regression  is  0.7856158992522628
accuracy for  cc_naive_bayes  is  0.6860487996851632
accuracy for  cc_knn  is  0.778728846910665


### F1 Score

In [334]:
for key, value in cc_evaluate.items():
    cumulative_f1 = 0
    for class_label in class_labels:
        val_to_add = f1_score(test_class_label_results[class_label], value[class_label])
        cumulative_f1 += val_to_add
    averaged_f1 = cumulative_f1 / len(class_labels)
    print("f1 score for ", key, " is ", averaged_f1)

f1 score for  cc_logistic_regression  is  0.38729133518247766
f1 score for  cc_naive_bayes  is  0.4455026823600304
f1 score for  cc_knn  is  0.41666830999808824




### Hamming Loss

In [338]:
for key, value in cc_evaluate.items():
    comparison_cols = value.loc[:, "Class1":]
    hamming_score = hamming_loss(test_class_label_results, comparison_cols)
    print("hamming loss for ", key, " is ", hamming_score)


hamming loss for  cc_logistic_regression  is  0.21438410074773712
hamming loss for  cc_naive_bayes  is  0.3139512003148367
hamming loss for  cc_knn  is  0.2212711530893349


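Hamming loss for BR (Logistic Regression, Naive Bayes, KNN): 20%, 30%, 21% - avg = 24%  
Hamming loss for CC (same order): 21%, 31%, 22% - avg = 25%

Since a lower Hamming loss is better, classifier chains are in fact slightly worse than binary relevance here, though only by about one percentage point.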

## Task 6: Reflect on the Performance of the Different Models Evaluated

Interestingly, the binary relevance classifier without undersampling was more accurate than the undersampled one. The effect of undersampling is clearly visible in how closely the different base classifiers perform once the data is balanced (although balancing also drastically reduces the volume of training data in some cases).  
In the case of the F1 score, we can see that undersampling benefitted the models overall, with an average F1 of 40% for the models that weren't undersampled, against 45% in the case of the undersampled ones. The gain comes from Logistic Regression and KNN; Naive Bayes was essentially unchanged (45% against 44%).  
Hamming loss, however, is a loss, so lower is better: undersampling made it considerably worse, with an average HL of 24% without undersampling and 37% with it. This is consistent with the drop in accuracy.
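
One plausible reading of the accuracy and F1 results moving in opposite directions is that accuracy rewards predicting the majority class on rare labels, whereas F1 does not. The sketch below illustrates this on made-up predictions for a single hypothetical rare label; it is not derived from the yeast results.

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

# Hypothetical rare label: 10 positives out of 100 instances
y_true = np.array([1] * 10 + [0] * 90)

# A model that always predicts the majority (negative) class
always_negative = np.zeros(100, dtype=int)

# A model that finds 8 of the 10 positives but also raises 30 false alarms
eager = np.array([1] * 8 + [0] * 2 + [1] * 30 + [0] * 60)

for name, y_pred in [("always_negative", always_negative), ("eager", eager)]:
    acc = accuracy_score(y_true, y_pred)
    # zero_division=0 scores F1 as 0 when no positives are predicted
    f1 = f1_score(y_true, y_pred, zero_division=0)
    print(name, round(acc, 3), round(f1, 3))
```

The first model is more accurate but has an F1 of zero, while the second gives up accuracy in exchange for a higher F1, which resembles the trade-off seen with undersampling.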

In the case of classifier chains, I used a non-undersampled approach, so we ought to compare these scores against our non-undersampled binary relevance algorithm.

The BR algorithm's accuracy scores (Logistic Regression, Naive Bayes, KNN) were: 80%, 70% and 79%  
The CC algorithm's accuracy scores (same order) were: 79%, 69% and 78%

We can see that the CC algorithm is marginally worse here. It may well be the case that incorrect predictions in prior models negatively affect predictions made by later models, i.e. the chaining is actually having a detrimental effect.

For the non-undersampled BR the averaged F1 scores were 35%, 45% and 41%, while in CC the averaged F1 scores were 39%, 45% and 42%, which, encouragingly, shows an improvement for Logistic Regression and KNN. However, the Hamming loss did not improve with CC: at an average of 25% it was slightly worse than BR's 24%.