# Deep Learning Homework 1

### Wilbert Aristo Guntoro - 1003742

### Thought Process / Steps:

1. **Filter** only **Spring, Summer, and Autumn Images** from the whole dataset
2. Keep the labels of each of these images in a dictionary called **ssa_dict** where the **key** is the *filename* and the **value** is the array of labels for each season (e.g. *[1,0,0] for Spring image, [0,1,0] for Summer image, [0,0,1] for Autumn image*)

In [1]:
import os
import math
from shutil import copyfile, move, rmtree

# Initialise folder
spring_summer_autumn = os.path.join(os.getcwd(), "spring_summer_autumn")
try:
    os.makedirs(spring_summer_autumn)
except:
    rmtree(spring_summer_autumn)
    os.makedirs(spring_summer_autumn)

ssa_dict = {}

with open("gtlabels.txt") as label_file:
    for line in label_file:
        line_array = line.split()
        # If Spring, Summer, or Autumn, then copy that file from collection to a special directory
        if int(line_array[10]) or int(line_array[11]) or int(line_array[12]):
            filename = line_array[0]
            ssa_dict[filename + "_ft.npy"] = list(map(int, line_array[10:13]))
            copyfile("imagefeatures/{}_ft.npy".format(filename), "spring_summer_autumn/{}_ft.npy".format(filename))

3. Split spring, summer, and autumn images in **alphabetical order** into:
- Training Set (First 65% Images)
- Validation Set (Next 15% Images)
- Test Set (Last 20% Images)

I did a **manual splitting** here instead of using sklearn's train_test_split() because manual splitting is easier. There were also **no requirement for me to use train_test_split() or randomise the splitting process** in the homework brief.

In [2]:
spring_summer_autumn_files = os.listdir(spring_summer_autumn)
total_files = len(spring_summer_autumn_files)
training_number = math.floor(0.65 * total_files)
validation_number = math.floor(0.15 * total_files)

# Initialise Folders
training_folder = os.path.join(spring_summer_autumn, "train")
validation_folder = os.path.join(spring_summer_autumn, "validate")
test_folder = os.path.join(spring_summer_autumn, "test")

os.makedirs(training_folder)
os.makedirs(validation_folder)
os.makedirs(test_folder)

# Do the splitting
os.chdir("spring_summer_autumn")

for filename in spring_summer_autumn_files[:training_number]:
    move(filename, "train/" + filename)

for filename in spring_summer_autumn_files[training_number:training_number+validation_number]:
    move(filename, "validate/" + filename)
    
for filename in spring_summer_autumn_files[training_number+validation_number:]:
    move(filename, "test/" + filename)

os.chdir("..")

4. Create BinarySVM class that can train, validate and test data using specified regularization paramater (*reg_param*) and kernel type (*linear*, *rbf*, etc.) <br> We can run the BinarySVM class by calling *execute_on_validation()* method <br> When we are ready to run the SVM on test set, we supply the best *reg_param* and run *execute_on_test()* method instead

In [3]:
from sklearn import svm
from statistics import mean
import numpy as np

class BinarySVM:
    
    def __init__(self, reg_param, kernel_type):
        self.reg_param = reg_param
        self.kernel_type = kernel_type
        self.spring_classifier = None
        self.summer_classifier = None
        self.autumn_classifier = None
        self.vanilla_accuracy = 0
        self.classwise_averaged_accuracy = 0
    
    def init_classifiers(self):
        self.spring_classifier = svm.SVC(C=self.reg_param, kernel=self.kernel_type, probability=True)
        self.summer_classifier = svm.SVC(C=self.reg_param, kernel=self.kernel_type, probability=True)
        self.autumn_classifier = svm.SVC(C=self.reg_param, kernel=self.kernel_type, probability=True)
        
    def train(self, on_dataset = "train"):
        os.chdir("spring_summer_autumn/{}".format(on_dataset))

        training_X = []
        training_spring_y = []
        training_summer_y = []
        training_autumn_y = []

        # Get X values and corresponding labels for spring, summer, autumn
        for training_filename in os.listdir(os.getcwd()):
            training_X.append(np.load(training_filename))
            training_spring_y.append(ssa_dict[training_filename][0])
            training_summer_y.append(ssa_dict[training_filename][1])
            training_autumn_y.append(ssa_dict[training_filename][2])

        # Train each classifier using X values and its respective array of labels (y)
        self.spring_classifier.fit(training_X, training_spring_y)
        self.summer_classifier.fit(training_X, training_summer_y)
        self.autumn_classifier.fit(training_X, training_autumn_y)

        os.chdir("../..")
    
    def test(self, on_dataset = "validate"):
        os.chdir("spring_summer_autumn/{}".format(on_dataset))

        validation_X = []
        validation_Y = []
        total_spring = 0
        total_summer = 0
        total_autumn = 0

        # Get X values and corresponding labels for spring, summer, autumn
        for validation_filename in os.listdir(os.getcwd()):
            validation_X.append(np.load(validation_filename))
            validation_Y.append(ssa_dict[validation_filename])
            if ssa_dict[validation_filename][0]:
                total_spring += 1
            elif ssa_dict[validation_filename][1]:
                total_summer += 1
            else:
                total_autumn += 1

        # Get Prediction
        spring_probas = self.spring_classifier.predict_proba(validation_X)
        summer_probas = self.summer_classifier.predict_proba(validation_X)
        autumn_probas = self.autumn_classifier.predict_proba(validation_X)
        
        predicted_indexes = []                      
        for spring_proba, summer_proba, autumn_proba in zip(spring_probas, summer_probas, autumn_probas):
            highest_proba = max([spring_proba[1], summer_proba[1], autumn_proba[1]])
            # Store index = 0 if Spring has highest probability
            if highest_proba == spring_proba[1]:
                predicted_indexes.append(0)
            # Store index = 1 if Summer has highest probability
            elif highest_proba == summer_proba[1]:
                predicted_indexes.append(1)
            # Store index = 2 if Autumn has highest probability
            else:
                predicted_indexes.append(2)
        
        # Get Vanilla Accuracy
        correct_prediction = 0
        for index, answer_key in zip(predicted_indexes, validation_Y):
            if answer_key[index] == 1:
                correct_prediction += 1
        
        self.vanilla_accuracy = round((correct_prediction / len(validation_Y)) * 100, 3)
        
        ## ===== Get Class-Wise Averaged Accuracy ====
        # We can make use of predicted_indexes to count how many Spring/Summer/Autumn images that we PREDICTED
        # We have achieved respective total number of spring, summer and autumn images from above
        spring_accuracy = predicted_indexes.count(0) / total_spring
        summer_accuracy = predicted_indexes.count(1) / total_summer
        autumn_accuracy = predicted_indexes.count(2) / total_autumn
        self.classwise_averaged_accuracy = round(mean([spring_accuracy, summer_accuracy, autumn_accuracy]) * 100, 3)
        
        os.chdir("../..")
    
    def execute_on_validation(self):
        self.init_classifiers()
        self.train()
        self.test()
        print("-------- VALIDATING Regularization Parameter = {} --------".format(self.reg_param))
        print("Vanilla Accuracy = {}%".format(self.vanilla_accuracy))
        print("Class-Wise Averaged Accuracy = {}%\n".format(self.classwise_averaged_accuracy))
        return self.classwise_averaged_accuracy
    
    def execute_on_test(self):
        self.init_classifiers()
        # Train on both training & validation dataset
        self.train()
        self.train(on_dataset="validate")
        # Test on test dataset
        self.test(on_dataset="test")
        print("-------- TEST SET WITH Regularization Parameter = {} and Kernel Type = {} --------".format(self.reg_param, self.kernel_type))
        print("Vanilla Accuracy = {}%".format(self.vanilla_accuracy))
        print("Class-Wise Averaged Accuracy = {}%\n".format(self.classwise_averaged_accuracy))
        

reg_params = [0.01, 0.1, 0.1 ** 0.5, 1.0, 10 ** 0.5, 10, 100]
best_score = 0
best_reg_param = 0
for reg_param in reg_params:
    binary_svm = BinarySVM(reg_param, "linear")
    current_score = binary_svm.execute_on_validation()
    if current_score > best_score:
        best_score = current_score
        best_reg_param = reg_param

print("\033[1m From validation above, the best regularization paramater is {}\033[0m\n".format(best_reg_param))

best_svm_linear = BinarySVM(best_reg_param, "linear")
best_svm_linear.execute_on_test()

best_svm_linear = BinarySVM(best_reg_param, "rbf")
best_svm_linear.execute_on_test()


-------- VALIDATING Regularization Parameter = 0.01 --------
Vanilla Accuracy = 81.871%
Class-Wise Averaged Accuracy = 75.355%

-------- VALIDATING Regularization Parameter = 0.1 --------
Vanilla Accuracy = 77.193%
Class-Wise Averaged Accuracy = 56.118%

-------- VALIDATING Regularization Parameter = 0.31622776601683794 --------
Vanilla Accuracy = 78.363%
Class-Wise Averaged Accuracy = 59.226%

-------- VALIDATING Regularization Parameter = 1.0 --------
Vanilla Accuracy = 77.778%
Class-Wise Averaged Accuracy = 56.895%

-------- VALIDATING Regularization Parameter = 3.1622776601683795 --------
Vanilla Accuracy = 77.778%
Class-Wise Averaged Accuracy = 58.449%

-------- VALIDATING Regularization Parameter = 10 --------
Vanilla Accuracy = 77.778%
Class-Wise Averaged Accuracy = 55.341%

-------- VALIDATING Regularization Parameter = 100 --------
Vanilla Accuracy = 77.778%
Class-Wise Averaged Accuracy = 56.895%

[1m From validation above, the best regularization paramater is 0.01[0m

-----

## Analysis of Two Different Accuracy Measurements

### 1) Vanilla Accuracy

$$\frac{1}{n}\sum_{i=1}^{n}1[f(x_{i})==y_{i}]$$


Vanilla accuracy is used to calculate the plain **accuracy score**. We use vanilla accuracy as a **qualitative measure** of SVM, which means a SVM with **higher vanilla accuracy returns MORE correct results relative to the incorrect ones.**

### 2) Class-wise Averaged Accuracy

$$A = \frac{1}{C}\sum_{c=1}^{C}a_{c}$$

$$a_{c} = \frac{1}{\sum_{i=1}^{n}1[y_{i}==c]}\sum_{i=1}^{n}1[y_{i}==c]1[f(x_{i})==c]$$

$$ = \frac{1}{\sum_{(x_{i},y_{i}):y_{i}=c}1}\sum_{(x_{i},y_{i}):y_{i}=c}1[f(x_{i})==c]$$

Meanwhile, class-wise averaged accuracy is used to calculate the **recall score**. We use class-wise average accuracy as a **quantitative measure** of correctness of the SVM, which means a SVM with **higher class-wise averaged accuracy returns most of the correct predictions (regardless of whether the incorrect ones are returned as well)**