### Assignment 1

#### Image classification (CIFAR-10 dataset)

##### Data preparation

The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. There are 50000 training images and 10000 test images.

The dataset is divided into five training batches and one test batch, each with 10000 images. The test batch contains exactly 1000 randomly-selected images from each class. The training batches contain the remaining images in random order, but some training batches may contain more images from one class than another. Between them, the training batches contain exactly 5000 images from each class.

The archive contains the files data_batch_1, data_batch_2, ..., data_batch_5, as well as test_batch. Each of these files is a Python "pickled" object produced with cPickle.
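
`unpickle_file` is imported from `data_preprocessing_utils`, whose source is not shown in this notebook. A minimal sketch of what such a helper might look like, assuming it follows the usual loading pattern for these batch files (the name `unpickle_file_sketch` is hypothetical):

```python
import pickle


def unpickle_file_sketch(path):
    """Load one pickled CIFAR-10 batch file and return it as a dict."""
    with open(path, "rb") as f:
        # encoding="bytes" is needed because the batches were pickled with
        # Python 2; it is also why the dictionary keys come back as bytes.
        return pickle.load(f, encoding="bytes")
```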

In [None]:
from data_preprocessing_utils import unpickle_file

# specify the file path of a specific batch
file = "cifar-10-python/data_batch_1"

batch_data = unpickle_file(file)

In [None]:
# The keys in the dictionary are byte strings (bytes) rather than standard strings. In Python, a byte string is prefixed with b, like b'batch_label'
print(list(batch_data.keys()))

# Convert the byte strings to standard strings
keys = [key.decode("utf-8") for key in batch_data.keys()]
print(keys)

# Update the keys in the dictionary with the standard strings
batch_data = {key.decode("utf-8"): value for key, value in batch_data.items()}

Data is a 10000x3072 numpy array of uint8s. 
Each row of the array stores a 32x32 colour image. 
The first 1024 entries contain the red channel values, the next 1024 the green, and the final 1024 the blue.
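
To view a row as an image, the flat 3072-vector has to be rearranged into height x width x channel order. A small sketch with a synthetic row (the variable names are illustrative and not taken from `data_preprocessing_utils`):

```python
import numpy as np

# Synthetic stand-in for one row of batch_data["data"]:
# 1024 red values, then 1024 green, then 1024 blue.
row = np.concatenate([
    np.full(1024, 255, dtype=np.uint8),  # red channel
    np.full(1024, 128, dtype=np.uint8),  # green channel
    np.zeros(1024, dtype=np.uint8),      # blue channel
])

# (3072,) -> (3, 32, 32) channels-first -> (32, 32, 3) channels-last,
# the layout that plotting functions such as matplotlib's imshow expect.
image = row.reshape(3, 32, 32).transpose(1, 2, 0)

print(image.shape)            # (32, 32, 3)
print(image[0, 0].tolist())   # [255, 128, 0]
```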

In [None]:
print(batch_data["data"].shape)
print(type(batch_data["data"]))

print(batch_data["data"][0].shape)
print(type(batch_data["data"][0][0]))

print(batch_data["labels"][0:10])
print(batch_data["filenames"][0:10])

In [None]:
images = [img_path.decode("utf-8") for img_path in batch_data["filenames"]]
print(images[0:10])

Below we preview 5 random images from the batch.

In [None]:
from data_preprocessing_utils import display_images
from data_preprocessing_utils import preprocess_images

# Extract data and labels
images = batch_data["data"]
labels = batch_data["labels"]

# TODO: Analyze this function
images_normalized = preprocess_images(images)

# CIFAR-10 class names
class_names = [
    "airplane",
    "automobile",
    "bird",
    "cat",
    "deer",
    "dog",
    "frog",
    "horse",
    "ship",
    "truck",
]

# Display images
display_images(images_normalized, labels, class_names, num_images=5)

Load the full dataset

In [None]:
from data_preprocessing_utils import load_cifar10_data

data_dir = "./cifar-10-python"

x_train, y_train, x_test, y_test = load_cifar10_data(data_dir)

print("x_train.shape =", x_train.shape)
print(type(x_train[0]))
print(type(x_train[0][0]))

print("y_train.shape =", y_train.shape)
print(type(y_train[0]))

print("x_test.shape =", x_test.shape)
print(type(x_test[0]))
print(type(x_test[0][0]))


print("y_test.shape =", y_test.shape)
print(type(y_test[0]))


Preprocess the training & testing data

In [None]:
from data_preprocessing_utils import normalize_data

# Normalize the data without reshaping
x_train_normalized = normalize_data(x_train)
x_test_normalized = normalize_data(x_test)

print("Normalized data shape:", x_train_normalized.shape)
print("First pixel value before normalization:", x_train[0, 0])
print("First pixel value after normalization:", x_train_normalized[0, 0])
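
`normalize_data` lives in `data_preprocessing_utils` and is not shown here. One common choice for uint8 image data, assumed below purely for illustration, is scaling the pixel values to the range [0, 1] (the name `normalize_data_sketch` is hypothetical):

```python
import numpy as np


def normalize_data_sketch(x):
    """Scale uint8 pixel values from [0, 255] to floats in [0, 1]."""
    return x.astype(np.float32) / 255.0


pixels = np.array([[0, 51, 255]], dtype=np.uint8)
scaled = normalize_data_sketch(pixels)
print(scaled.dtype, scaled.shape)
```

The actual helper may instead standardize each feature (zero mean, unit variance); the printed "before" and "after" pixel values above show which one is in use.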

Class distribution of the training set

In [None]:
from data_preprocessing_utils import display_training_class_distribution

display_training_class_distribution(y_train)

TODO: A 10-class classification problem requires 10 one-vs-all SVMs with my custom implementation, whereas sklearn handles this automatically. So, for my implementation I am going to build a binary classifier for ONE specific class only and compare the results with sklearn's SVM.

TODO: So, in essence, these are 2 different problems, and I need to create 2 different datasets.
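
The one-vs-all relabelling can be sketched with plain NumPy: for each class, the targets become +1 for that class and -1 for all the others. The toy labels below are illustrative only.

```python
import numpy as np

y = np.array([0, 3, 1, 0, 2, 3])  # toy multi-class labels
classes = np.unique(y)

# One row of +1/-1 targets per class; a full one-vs-all scheme would train
# one binary SVM on each row and predict the class whose SVM scores highest.
y_one_vs_all = np.where(y[None, :] == classes[:, None], 1, -1)

print(y_one_vs_all[0].tolist())  # [1, -1, -1, 1, -1, -1]
```

The binary problem studied in this notebook corresponds to a single row of such a matrix.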

TODO: Analyze why you took a smaller sample of the original dataset

This notebook analyzes the binary classification problem

In [None]:
# Select the first 1000 samples of the target class (0) and
# 100 random samples from each of the other nine classes
import numpy as np

train_selected_indices = []
test_selected_indices = []

train_target_class_indices = np.where(y_train == 0)[0][0:1000]
test_target_class_indices = np.where(y_test == 0)[0][0:1000]


# Loop over each class label (1 to 9)
for label in range(1, 10):
    # Find the indices of all samples with this label
    train_label_indices = np.where(y_train == label)[0]
    test_label_indices = np.where(y_test == label)[0]

    # Randomly select 100 samples from these indices
    train_selected_label_indices = np.random.choice(
        train_label_indices, size=100, replace=False
    )
    test_selected_label_indices = np.random.choice(
        test_label_indices, size=100, replace=False
    )

    # Add these indices to the list
    train_selected_indices.extend(train_selected_label_indices)
    test_selected_indices.extend(test_selected_label_indices)

In [None]:
# Convert Selected Indices to a Numpy Array and Shuffle

# Convert the list to a numpy array
train_selected_indices = np.array(train_selected_indices)
test_selected_indices = np.array(test_selected_indices)

# Shuffle the indices to mix samples from different classes
np.random.shuffle(train_selected_indices)
np.random.shuffle(test_selected_indices)

In [None]:
# Subset the data

# Subset 1 holds the target class, subset 2 the samples from the other classes.
# The test subsets must be drawn from the test arrays, not the training arrays.
x_train_subset_1 = x_train_normalized[train_target_class_indices]
x_train_subset_2 = x_train_normalized[train_selected_indices]
x_test_subset_1 = x_test_normalized[test_target_class_indices]
x_test_subset_2 = x_test_normalized[test_selected_indices]

y_train_subset_1 = y_train[train_target_class_indices]
y_train_subset_2 = y_train[train_selected_indices]
y_test_subset_1 = y_test[test_target_class_indices]
y_test_subset_2 = y_test[test_selected_indices]

print("x_train_subset_1.shape:", x_train_subset_1.shape)
print("x_train_subset_2.shape:", x_train_subset_2.shape)
print("x_test_subset_1.shape:", x_test_subset_1.shape)
print("x_test_subset_2.shape:", x_test_subset_2.shape)

print("y_train_subset_1.shape:", y_train_subset_1.shape)
print("y_train_subset_2.shape:", y_train_subset_2.shape)
print("y_test_subset_1.shape:", y_test_subset_1.shape)
print("y_test_subset_2.shape:", y_test_subset_2.shape)


In [None]:
x_train_final = np.concatenate((x_train_subset_1, x_train_subset_2))
y_train_final = np.concatenate((y_train_subset_1, y_train_subset_2))

x_test_final = np.concatenate((x_test_subset_1, x_test_subset_2))
y_test_final = np.concatenate((y_test_subset_1, y_test_subset_2))


print("x_train_final.shape:", x_train_final.shape)
print("y_train_final.shape:", y_train_final.shape)
print("x_test_final.shape:", x_test_final.shape)
print("y_test_final.shape:", y_test_final.shape)

In [None]:
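# Relabel for binary classification: the target class (0) becomes +1,
# every other class becomes -1
y_train_final = np.where(y_train_final == 0, 1, -1)
y_test_final = np.where(y_test_final == 0, 1, -1)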

#### Dimensionality reduction

TODO: (Analyze this): PCA is used to reduce the dimensionality while retaining 90% of the variance. The dimension of the data is reduced from 3072 to x.

In [None]:
from sklearn import decomposition
import numpy as np

# TODO: Analyze how PCA works with sklearn (the arguments and the return values)
pca = decomposition.PCA(n_components=0.9, svd_solver="full", random_state=0)
x_train_final = pca.fit_transform(x_train_final)
x_test_final = pca.transform(x_test_final)

print("x_train_final.shape =", x_train_final.shape)
print("x_test_final.shape =", x_test_final.shape)
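
Passing a float between 0 and 1 as `n_components` asks scikit-learn to keep the smallest number of components whose cumulative explained variance exceeds that fraction. The same selection rule can be sketched with NumPy on synthetic data (the data and variable names below are illustrative, not part of the notebook):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: 200 samples, 10 features with very different variances
x = rng.normal(size=(200, 10)) * np.array([10, 5, 3, 1, 1, 1, 1, 1, 1, 1])

# PCA via SVD of the centred data matrix
x_centered = x - x.mean(axis=0)
_, singular_values, components = np.linalg.svd(x_centered, full_matrices=False)
explained_variance_ratio = singular_values**2 / np.sum(singular_values**2)

# Smallest k whose cumulative explained variance exceeds 90%
cumulative = np.cumsum(explained_variance_ratio)
k = int(np.searchsorted(cumulative, 0.9, side="right") + 1)

# Project onto the first k principal directions
x_reduced = x_centered @ components[:k].T
print("components kept:", k, "of", x.shape[1])
```

With the fitted scikit-learn object, the number of components actually kept is available as `pca.n_components_`.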

TODO: Maybe visualize some stuff on the dimensions of the first and second eigenvector

TODO: Analyze MoschosSVM

#### Model evaluation

`grid_search` performs K-fold cross validation & evaluates for various parameter values. 

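Accuracy is chosen as the evaluation metric because the two classes are roughly balanced (1000 target-class samples versus 900 samples from the other classes in the training subset).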

`plot_grid_search` generates plots for accuracy and training time. 

`evaluate_model` retrains the best model on the entire training set and evaluates it on the test set.
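
The source of `grid_search` is not shown in this notebook. As a rough sketch of the idea, assuming a plain shuffled K-fold split and mean validation accuracy as the selection criterion, the search loop could look like this (all function names and the toy classifier are hypothetical):

```python
import itertools
import numpy as np


def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, val_idx) pairs for shuffled K-fold cross validation."""
    folds = np.array_split(np.random.default_rng(seed).permutation(n_samples), k)
    for i in range(k):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        yield train_idx, folds[i]


def grid_search_sketch(fit_predict, param_dict, x, y, k=5):
    """Return the parameter combination with the best mean validation accuracy."""
    best_params, best_score = None, -np.inf
    names = list(param_dict)
    for values in itertools.product(*param_dict.values()):
        params = dict(zip(names, values))
        scores = [
            np.mean(fit_predict(x[tr], y[tr], x[va], **params) == y[va])
            for tr, va in k_fold_indices(len(x), k)
        ]
        if np.mean(scores) > best_score:
            best_params, best_score = params, float(np.mean(scores))
    return {"best_params": best_params, "best_score": best_score}


# Toy usage: a one-parameter threshold "classifier" on 1-D data
x = np.linspace(-1, 1, 100).reshape(-1, 1)
y = np.where(x[:, 0] > 0, 1, -1)


def threshold_classifier(x_train, y_train, x_val, threshold):
    # Ignores the training data; it only illustrates the search loop
    return np.where(x_val[:, 0] > threshold, 1, -1)


results = grid_search_sketch(threshold_classifier, {"threshold": (-0.5, 0.0, 0.5)}, x, y)
print(results)
```

The real helper also records training times for `plot_grid_search`, which this sketch leaves out.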

In [None]:
final_results = []

##### MoschosSVM (Linear Kernel)

$ \text{Loss} = \mathbf{w}^T\mathbf{w} + C\sum_{k=1}^{R}\varepsilon_{k} $

Kernel: $ K(\mathbf{x}, \mathbf{x}') = \langle\mathbf{x},\mathbf{x}'\rangle $
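
As a numerical illustration of the two formulas above, the sketch below evaluates the loss for a fixed weight vector. It assumes the usual soft-margin definition of the slack, $\varepsilon_k = \max(0, 1 - y_k(\mathbf{w}^T\mathbf{x}_k + b))$, and a bias term `b`; neither is stated in this notebook, and the function names are hypothetical rather than part of `MoschosSVM`.

```python
import numpy as np


def linear_kernel(x1, x2):
    """K(x, x') = <x, x'> for every pair of rows in x1 and x2."""
    return x1 @ x2.T


def soft_margin_loss(w, b, x, y, C):
    """w^T w + C * sum of slacks, with slack_k = max(0, 1 - y_k (w.x_k + b))."""
    slack = np.maximum(0.0, 1.0 - y * (x @ w + b))
    return float(w @ w + C * slack.sum())


x = np.array([[2.0, 0.0], [-2.0, 0.0], [0.5, 0.0]])
y = np.array([1, -1, -1])
w, b = np.array([1.0, 0.0]), 0.0

print(linear_kernel(x, x).shape)          # (3, 3)
print(soft_margin_loss(w, b, x, y, 1.0))  # 2.5
```

Only the third sample violates the margin, so it alone contributes slack; a larger `C` penalizes that violation more heavily.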

In [None]:
from svm import MoschosSVM
from model_training_utils import grid_search

param_dict = {"C": (0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0)}

model = MoschosSVM(kernel="linear")

results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
from model_training_utils import plot_grid_search

plot_grid_search(results, "C", None, "log")

In [None]:
from model_training_utils import evaluate_model


res = evaluate_model(
    "My Linear SVM",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)

final_results.append(res)

##### Sklearn (Linear Kernel)

In [None]:
from sklearn import svm

param_dict = {"C": (0.001, 0.005, 0.01, 0.05, 0.1, 0.5, 1.0)}

model = svm.SVC(kernel="linear")
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "C", None, "log")

In [None]:
res = evaluate_model(
    "Linear SVM",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

##### MoschosSVM (Polynomial Kernel)

Kernel: $ K(\mathbf{x}, \mathbf{x}') = (\gamma\langle\mathbf{x},\mathbf{x}'\rangle+r)^d $
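
A direct NumPy transcription of this kernel, as a sketch (the function name is hypothetical; in scikit-learn's `SVC` the constant $r$ is the `coef0` parameter and $d$ is `degree`):

```python
import numpy as np


def polynomial_kernel(x1, x2, gamma=1.0, r=0.0, degree=3):
    """K(x, x') = (gamma * <x, x'> + r) ** degree for all row pairs."""
    return (gamma * (x1 @ x2.T) + r) ** degree


a = np.array([[1.0, 2.0]])
b = np.array([[3.0, 4.0]])
print(polynomial_kernel(a, b, gamma=0.5, r=1.0, degree=2))  # [[42.25]]
```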

In [None]:
from model_training_utils import grid_search
from svm import MoschosSVM

param_dict = {"C": (0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0), "degree": (2, 3, 4, 5)}

model = MoschosSVM(kernel="poly")
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "C", "degree", "log")

In [None]:
res = evaluate_model(
    "My Polynomial SVM",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

##### Sklearn (Polynomial Kernel)

Optimization of sklearn.svm.SVC with a polynomial kernel on the small training set.

In [None]:
from sklearn import svm

param_dict = {"C": (0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0), "degree": (2, 3, 4, 5)}

model = svm.SVC(kernel="poly")
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "C", "degree", "log")

In [None]:
res = evaluate_model(
    "Polynomial SVM",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

##### MoschosSVM (RBF Kernel)

MoschosSVM with the RBF kernel is optimized with respect to the parameters C and gamma. Gamma indicates how far the influence of a single training example reaches.

Kernel: $ K(\mathbf{x}, \mathbf{x}') = e^{-\gamma||\mathbf{x}-\mathbf{x}'||^2} $
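
The effect of gamma can be seen by evaluating the kernel between one point and neighbours at increasing distance. This is a sketch with a hypothetical function name, not the `MoschosSVM` implementation:

```python
import numpy as np


def rbf_kernel(x1, x2, gamma=1.0):
    """K(x, x') = exp(-gamma * ||x - x'||^2) for all row pairs."""
    # ||x - x'||^2 = ||x||^2 + ||x'||^2 - 2 <x, x'>
    sq_dists = (
        np.sum(x1**2, axis=1)[:, None]
        + np.sum(x2**2, axis=1)[None, :]
        - 2 * x1 @ x2.T
    )
    # Clip tiny negative values caused by floating-point round-off
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))


points = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0]])
for gamma in (0.01, 1.0):
    # Similarity of the first point to itself and to points at distance 1 and 3
    print(gamma, np.round(rbf_kernel(points[:1], points, gamma), 3))
```

With a small gamma the similarity decays slowly, so distant examples still influence the decision; with a large gamma it drops to nearly zero within a short distance.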

In [None]:
param_dict = {"C": (0.001, 0.01, 0.1, 1.0, 10.0, 100.0), "gamma": (0.01, 0.1, 1.0)}

model = MoschosSVM(kernel="rbf")
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "C", "gamma", "log")

In [None]:
res = evaluate_model(
    "My RBF SVM",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

##### Sklearn (RBF Kernel)

In [None]:
param_dict = {"C": (0.001, 0.01, 0.1, 1.0, 10.0, 100.0), "gamma": (0.01, 0.1, 1.0)}

model = svm.SVC(kernel="rbf")
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "C", "gamma", "log")

In [None]:
res = evaluate_model(
    "RBF SVM",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

##### MoschosSVM (MLP Kernel)

In [None]:
param_dict = {"C": (0.001, 0.01, 0.1, 1.0, 10.0, 100.0), "gamma": (0.001, 0.01, 0.1)}

model = MoschosSVM(kernel="sigmoid")
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "C", "gamma", "log")

In [None]:
res = evaluate_model(
    "My MLP SVM",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

##### Sklearn (MLP Kernel)

Kernel: $ K(\mathbf{x}, \mathbf{x}') = \tanh(\gamma\langle\mathbf{x},\mathbf{x}'\rangle+r) $
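
A short NumPy sketch of this kernel (the function name is hypothetical; in scikit-learn's `SVC` the constant $r$ is the `coef0` parameter):

```python
import numpy as np


def sigmoid_kernel(x1, x2, gamma=1.0, r=0.0):
    """K(x, x') = tanh(gamma * <x, x'> + r) for all row pairs."""
    return np.tanh(gamma * (x1 @ x2.T) + r)


a = np.array([[1.0, 2.0], [0.0, 0.0]])
k = sigmoid_kernel(a, a, gamma=0.1)
print(np.round(k, 3))
```

Unlike the RBF kernel, the values lie in (-1, 1) and the diagonal is not fixed at 1.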

In [None]:
param_dict = {"C": (0.001, 0.01, 0.1, 1.0, 10.0, 100.0), "gamma": (0.001, 0.01, 0.1)}

model = svm.SVC(kernel="sigmoid")
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "C", "gamma", "log")

In [None]:
res = evaluate_model(
    "MLP SVM",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

##### Nearest Neighbors

$d_p(\mathbf{x}, \mathbf{y}) = \sqrt[p]{\sum_{i}|x_i-y_i|^p}$
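
With the default Minkowski metric, the `p` parameter of `KNeighborsClassifier` is the exponent in this distance. A small sketch of the formula (the function name is illustrative):

```python
import numpy as np


def minkowski_distance(x, y, p):
    """d_p(x, y) = (sum_i |x_i - y_i|^p)^(1/p)."""
    return float(np.sum(np.abs(x - y) ** p) ** (1.0 / p))


x = np.array([0.0, 0.0])
y = np.array([3.0, 4.0])
print(minkowski_distance(x, y, 1))  # 7.0 (Manhattan distance)
print(minkowski_distance(x, y, 2))  # 5.0 (Euclidean distance)
```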

In [None]:
from sklearn import neighbors

param_dict = {"n_neighbors": (1, 2, 5, 10), "p": (1, 2, 3)}

model = neighbors.KNeighborsClassifier()
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "n_neighbors", "p", "log")

In [None]:
res = evaluate_model(
    "Nearest Neighbors",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

##### Nearest Class Centroid

In [None]:
param_dict = {"shrink_threshold": np.arange(0, 1.1, 0.1)}

model = neighbors.NearestCentroid()
results = grid_search(model, param_dict, x_train_final, y_train_final)

In [None]:
plot_grid_search(results, "shrink_threshold", None, "log")

In [None]:
res = evaluate_model(
    "Nearest Class Centroid",
    model,
    results["best_params"],
    x_train_final,
    y_train_final,
    x_test_final,
    y_test_final,
)
final_results.append(res)

#### Results summary

In [None]:
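import pandas as pd

final_results_df = pd.DataFrame(final_results)

# Left-align the headers and all cells, and hide the index column.
# set_properties is applied to every column; the original subset=["text-align"]
# referred to a column name rather than a CSS property.
final_results_df = (
    final_results_df.style.set_table_styles(
        [{"selector": "th", "props": [("text-align", "left")]}]
    )
    .set_properties(**{"text-align": "left"})
    .hide(axis="index")
)

# Display the styled table as the cell output
final_results_df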