# Machine Learning Model Training and Evaluation
This notebook solves part D of the assignment by training a neural network (NN) with PyTorch.
It includes data preprocessing, model definition, training, cross-validation, and test set prediction.

## Import Required Libraries
The following libraries are imported to facilitate data processing, model creation, training, and evaluation.

In [3]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split, KFold
from sklearn.metrics import accuracy_score
from torch.utils.data import DataLoader, TensorDataset
import matplotlib.pyplot as plt
from itertools import product
from collections import Counter

## Data Loading and Preprocessing
The training and test datasets are loaded and preprocessed. Features are scaled, and the labels are converted to a format suitable for PyTorch.

In [4]:
"""Load and preprocess train and test datasets."""
    
DEVICE = torch.device("cuda" if torch.cuda.is_available() else "cpu")


train_path="datasetTV.csv"
test_path="datasetTest.csv"
train_data = pd.read_csv(train_path)
test_data = pd.read_csv(test_path)

X = train_data.iloc[:, :-1].values
y = train_data.iloc[:, -1].values
X_test = test_data.values

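# Fit the scaler on the training features only, then reuse it for the test set
scaler = StandardScaler()
x = scaler.fit_transform(X)
X_test = scaler.transform(X_test)

# Build the training tensor from the scaled features `x`, not from the raw `X`
X = torch.tensor(x, dtype=torch.float32).to(DEVICE)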
y = torch.tensor(y - 1, dtype=torch.long).to(DEVICE) # 0-indexed
X_test = torch.tensor(X_test, dtype=torch.float32).to(DEVICE)
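
As a rough illustration of what the scaling step computes, the sketch below standardizes one made-up feature column in plain Python. The helper names `fit_standardizer` and `standardize` and the toy values are assumptions for illustration only; the notebook itself relies on `StandardScaler`, which, like this sketch, divides by the population standard deviation.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return (mean, std) of one feature column, using the population std."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std with statistics learned elsewhere."""
    return [(value - mean) / std for value in column]


# Toy single-feature data (made-up values, not taken from datasetTV.csv).
train_feature = [2.0, 4.0, 6.0, 8.0]
test_feature = [5.0, 11.0]

# The statistics come from the training data only.
mean, std = fit_standardizer(train_feature)
train_scaled = standardize(train_feature, mean, std)
# The test data reuses the training statistics instead of computing its own.
test_scaled = standardize(test_feature, mean, std)

print(train_scaled)
print(test_scaled)
```

After scaling, the training column has mean 0 and standard deviation 1, while the test column generally does not, because it is expressed in the training set's units.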

### Check for class imbalance in the dataset
We check the class distribution of the dataset, so that we know whether imbalance is something we have to handle during training.

In [5]:
"""Check for class imbalance"""

# Count the number of samples for each class
class_counts = Counter(y.cpu().numpy())
print("Class Counts:", class_counts)

Class Counts: Counter({np.int64(4): 1784, np.int64(0): 1768, np.int64(2): 1754, np.int64(1): 1720, np.int64(3): 1716})


The classes are fairly balanced in the dataset, so we do not have to take any measures regarding class imbalance.

### Definition of the NN architecture and the functions needed for training and plotting
* A neural network classifier with 2 hidden layers is chosen. ReLU is our activation function, batch normalization is applied before each activation, and dropout layers are added to reduce overfitting and improve generalization.
* The `plot_losses` function plots the training and validation loss curves over the epochs.
* The `train_model` function trains a model for a specified number of epochs, given the optimizer, the loss function (`criteria`), and the training and validation data loaders. It also accepts a `patience` argument, which enables early stopping when the validation loss starts to plateau. *Note: Implementing this function reduces code duplication, since we train several models in the rest of the notebook.*

In [None]:
class Model(nn.Module):
    def __init__(self, in_features, h1, h2, num_classes):
        """Fully connected neural network with batch normalization and dropout."""
        super(Model, self).__init__()
        self.fc1 = nn.Linear(in_features, h1)
        self.bn1 = nn.BatchNorm1d(h1)
        self.dropout1 = nn.Dropout(0.3)

        self.fc2 = nn.Linear(h1, h2)
        self.bn2 = nn.BatchNorm1d(h2)
        self.dropout2 = nn.Dropout(0.2)

        self.out = nn.Linear(h2, num_classes)

    def forward(self, x):
        x = F.relu(self.bn1(self.fc1(x)))
        x = self.dropout1(x)
        x = F.relu(self.bn2(self.fc2(x)))
        x = self.dropout2(x)
        return self.out(x)


def plot_losses(train_losses, val_losses):
    """Plot the training and validation loss curves."""
    plt.figure(figsize=(10, 6))
    plt.plot(train_losses, label="Training Loss")
    plt.plot(val_losses, label="Validation Loss")
    plt.xlabel("Epoch")
    plt.ylabel("Loss")
    plt.title("Training and Validation Loss Over Epochs")
    plt.legend()
    plt.show()


def train_model(model, train_loader, val_loader, optimizer, criteria, epochs, device=DEVICE, patience=None):
    """Train and evaluate the model for the given number of epochs.

    Args:
        model: PyTorch model.
        train_loader: PyTorch DataLoader for training set.
        val_loader: PyTorch DataLoader for validation set.
        optimizer: PyTorch optimizer.
        criteria: Loss function.
        epochs: Number of epochs to train the model.
        device: Device to run the model on. Defaults to `DEVICE`.
        patience: Number of epochs to wait before early stopping if validation loss does not decrease. Default is None.
    
    Returns:
        train_losses: List of training losses for each epoch
        val_losses: List of validation losses for each epoch
    
    """
    train_losses, val_losses = [], []
    best_val_loss = float("inf")
    patience_counter = 0

    for epoch in range(epochs):
        model.train()
        batch_losses = []

        # Training step
        for batch_X, batch_y in train_loader:
            batch_X, batch_y = batch_X.to(device), batch_y.to(device)

            optimizer.zero_grad()
            outputs = model(batch_X)
            loss = criteria(outputs, batch_y)
            loss.backward()
            optimizer.step()
            batch_losses.append(loss.item())

        train_losses.append(sum(batch_losses) / len(batch_losses))

        # Validation step
        model.eval()
        val_loss, num_samples, correct_predicted = 0, 0, 0
        with torch.no_grad():
            for val_X, val_y in val_loader:
                val_X, val_y = val_X.to(device), val_y.to(device)
                outputs = model(val_X)
                val_loss += criteria(outputs, val_y).item()
                # accuracy_score expects (y_true, y_pred); normalize=False returns a count
                correct_predicted += accuracy_score(
                    val_y.cpu().numpy(), outputs.argmax(dim=1).cpu().numpy(), normalize=False
                )
                num_samples += val_y.size(0)

        val_loss /= len(val_loader)
        val_losses.append(val_loss)

        if (epoch + 1) % 10 == 0:
            print(
                f"Epoch [{epoch+1}/{epochs}], Train Loss: {train_losses[-1]:.4f}, Val Loss: {val_losses[-1]:.4f}, Accuracy: {correct_predicted / num_samples:.4f}"
            )

        # Early stopping
        if patience:
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                patience_counter = 0
            else:
                patience_counter += 1
                if patience_counter >= patience:
                    print(f"Early stopping triggered at epoch {epoch+1}.")
                    break

    return train_losses, val_losses
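
To make the `patience` rule easier to follow in isolation, here is a minimal sketch that applies the same counter logic to a plain list of validation losses. The function name `early_stopping_epoch` and the loss values are made up for illustration; the logic mirrors the early-stopping branch of `train_model` but involves no model or data.

```python
def early_stopping_epoch(val_losses, patience):
    """Return the 1-based epoch at which training would stop, or None if it never does."""
    best_val_loss = float("inf")
    patience_counter = 0
    for epoch, val_loss in enumerate(val_losses, start=1):
        if val_loss < best_val_loss:
            # Strict improvement: remember it and reset the counter.
            best_val_loss = val_loss
            patience_counter = 0
        else:
            # No improvement (a tie counts as no improvement).
            patience_counter += 1
            if patience_counter >= patience:
                return epoch
    return None


# Made-up validation losses: improvement stops after the third epoch.
losses = [0.90, 0.70, 0.60, 0.61, 0.62, 0.60, 0.65]
stop_with_3 = early_stopping_epoch(losses, patience=3)
stop_with_5 = early_stopping_epoch(losses, patience=5)
print(stop_with_3, stop_with_5)
```

With `patience=3` the loop stops at epoch 6, because matching the best loss of 0.60 is not a strict improvement. With `patience=5` the list ends before the counter reaches the limit. Note that `train_model` as written keeps the weights of the last epoch rather than restoring those of the best one.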

### K-fold cross-validation
We train with the **k-fold cross-validation** technique, meaning that

* the provided training set is split into *k* folds, and in each of the *k* rounds a different fold serves as the *validation set* while the remaining folds form the *training set*,
* and *k* models are trained, one per round.

That way we can track the performance of our approach (especially overfitting) with as little bias as possible, since we see the **validation vs. training loss** for different validation sets and the results do not depend on a single validation set. If we used a single validation set without cross-validation, that set might happen to be very similar to the training set, and our estimate of the model's generalization ability would be biased.
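
The splitting scheme can be sketched with plain index lists. The helper `kfold_indices` below is hypothetical and only illustrates the idea with contiguous, unshuffled folds; the notebook itself uses scikit-learn's `KFold` with `shuffle=True`, so its folds are not contiguous.

```python
def kfold_indices(n_samples, n_splits):
    """Yield (train_idx, val_idx) pairs for contiguous, unshuffled folds."""
    indices = list(range(n_samples))
    base, extra = divmod(n_samples, n_splits)
    start = 0
    for fold in range(n_splits):
        # Spread any remainder over the first `extra` folds.
        size = base + (1 if fold < extra else 0)
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


folds = list(kfold_indices(n_samples=10, n_splits=5))
for fold, (train_idx, val_idx) in enumerate(folds, start=1):
    print(f"Fold {fold}: train={train_idx} val={val_idx}")
```

Every sample appears in exactly one validation fold, so each of the *k* models is evaluated on data it never saw during training.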

### Hyperparameter Grid Search
The code supports grid search over the hyperparameters. After performing the grid search we saw that almost all parameter combinations give similar results, but reach their minimum loss after a different number of epochs.

Our chosen final parameters are:
* learning rate = 0.0001
* batch size = 64
* hidden layer size = 128 (for both hidden layers)

If you wish to perform the grid search, uncomment the multi-value lines in `train_params` and comment out the corresponding single-value lines above them.
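
To get a feel for the cost of the full grid, the sketch below enumerates the combinations with `itertools.product`, using the candidate values from the commented-out lines of `train_params` and the 5 folds used in this notebook. It only counts training runs and trains nothing.

```python
from itertools import product

# Candidate values copied from the commented-out lines of `train_params`.
grid = {
    "learning_rate": [0.0001, 0.001, 0.01],
    "batch_size": [32, 64, 128],
    "hidden_layer_size": [32, 64, 128],
}
n_folds = 5

# One tuple (lr, bs, hls) per hyperparameter setting.
combinations = list(product(*grid.values()))
total_models = len(combinations) * n_folds

print(len(combinations))  # number of hyperparameter settings
print(total_models)       # number of models trained across all folds
print(combinations[0])    # first setting that would be tested
```

The full grid has 27 settings, which means 135 training runs of up to 100 epochs each, so keeping the single-value lists is much faster when only the final configuration is needed.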

### Choices
* k = 5 (each fold holds 1/5 of the data, so 20% of our data is the validation set in each round)
* Cross-entropy loss (well suited to classification)
* Adam Optimizer
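
As a small worked example of the loss, the sketch below computes the cross-entropy of a single sample from raw logits in plain Python. The function `cross_entropy` and the logit values are made up for illustration. `nn.CrossEntropyLoss` likewise takes raw logits and integer class indices, which is why `Model.forward` returns `self.out(x)` without a softmax, and by default it averages this per-sample value over the batch.

```python
import math


def cross_entropy(logits, target):
    """Cross-entropy of one sample: -log(softmax(logits)[target])."""
    max_logit = max(logits)  # subtract the max for numerical stability
    log_sum_exp = max_logit + math.log(sum(math.exp(z - max_logit) for z in logits))
    return log_sum_exp - logits[target]


# Made-up logits for a 5-class problem; index 2 has the highest score.
logits = [0.5, -1.0, 3.0, 0.0, 1.5]
loss_correct = cross_entropy(logits, target=2)  # true class is the top-scoring one
loss_wrong = cross_entropy(logits, target=1)    # true class has the lowest score
uniform = cross_entropy([0.0] * 5, target=0)    # no preference between 5 classes

print(loss_correct, loss_wrong, uniform)
```

The loss is small when the true class has the highest logit and grows as its logit falls behind the others. With equal logits for all 5 classes it equals ln(5), roughly 1.61, which is a useful baseline when reading the loss curves.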

In [None]:
"""Perform K-fold cross-validation for hyperparameter tuning."""

train_params = {
    "learning_rate": [0.0001],
    "batch_size": [64],
    "hidden_layer_size": [128],
    "epochs": 100,
    # Uncomment to test multiple hyperparameters, and comment out the corresponding lines above
    # "learning_rate": [0.0001, 0.001, 0.01],
    # "batch_size": [32, 64, 128],
    # "hidden_layer_size": [32, 64, 128],
}

n_folds = 5

kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)
criteria = nn.CrossEntropyLoss()

for lr, bs, hls in product(
    train_params["learning_rate"], train_params["batch_size"], train_params["hidden_layer_size"]
):
    fold_val_losses = []
    print(f"Testing Hyperparameters: lr: {lr}, bs: {bs}, hls: {hls}")

    for fold, (train_idx, val_idx) in enumerate(kf.split(x)):
        print(f"Fold {fold + 1}/{n_folds}")
        
        X_train, X_val = X[train_idx], X[val_idx]
        y_train, y_val = y[train_idx], y[val_idx]

        train_dataset = TensorDataset(X_train, y_train)
        val_dataset = TensorDataset(X_val, y_val)

        train_loader = DataLoader(train_dataset, batch_size=bs, shuffle=True)
        val_loader = DataLoader(val_dataset, batch_size=bs, shuffle=False)

        model = Model(in_features=224, h1=hls, h2=hls, num_classes=5).to(DEVICE)
        optimizer = optim.Adam(model.parameters(), lr=lr)

        train_losses, val_losses = train_model(
            model, train_loader, val_loader, optimizer, criteria, train_params["epochs"]
        )

        fold_val_losses.append(val_losses[-1])

    avg_val_loss = sum(fold_val_losses) / n_folds
    print(f"Average Validation Loss for lr={lr}, bs={bs}, hls={hls}: {avg_val_loss:.4f}")

### Train the final model
After performing cross-validation to examine the behaviour of our system more closely, we decided that we do not want to set aside 20% of the data for validation, so we chose to train our final model on 90% of the data, using only 10% for validation.

That way we still get good information on the progress of overfitting without losing a large amount of our training data.

Using 20% for validation during cross-validation seems a good value for getting an unbiased view of our hyperparameters and the architecture of our model.

So we train the final model without k-fold cross-validation, and with 10% of the training data as the validation set.

In [None]:
"""Train the final model with a small validation set."""

train_params = {
    "learning_rate": [0.0001],
    "batch_size": [64],
    "hidden_layer_size": [128],
    "epochs": 100,
}

# Split into train and pseudo-validation sets
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.1, random_state=42
)

# Create DataLoaders
batch_size = train_params["batch_size"][0]
train_dataset = TensorDataset(X_train, y_train)
val_dataset = TensorDataset(X_val, y_val)

train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=batch_size, shuffle=False)

model = Model(
    in_features=224,
    h1=train_params["hidden_layer_size"][0],
    h2=train_params["hidden_layer_size"][0],
    num_classes=5
).to(DEVICE)
optimizer = optim.Adam(model.parameters(), lr=train_params["learning_rate"][0])
criteria = nn.CrossEntropyLoss()

train_losses, val_losses = train_model(
    model, train_loader, val_loader, optimizer, criteria, train_params["epochs"], patience=10
)

plot_losses(train_losses, val_losses)

## Predicting on Test Data
The trained model is used to predict the labels for the test dataset, and the results are saved.

In [None]:
"""Predict test data."""

model.eval()
with torch.no_grad():
    test_outputs = model(X_test)
    _, predicted_labels = torch.max(test_outputs, 1)
    predicted_labels += 1  # Convert back to 1-based indexing

np.save("labelsX.npy", predicted_labels.cpu().numpy())
print("Predictions saved as labelsX.npy")
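
The index shift in the prediction step can be illustrated without PyTorch. The sketch below uses a hypothetical helper `predict_labels` and made-up logits: it picks the highest-scoring class per sample (a 0-based index, as produced by `torch.max(test_outputs, 1)`) and adds 1 to return to the 1-based labels of the original dataset.

```python
def predict_labels(logit_rows):
    """Pick the highest-scoring class per row and shift back to 1-based labels."""
    labels = []
    for row in logit_rows:
        class_index = max(range(len(row)), key=row.__getitem__)  # 0-based argmax
        labels.append(class_index + 1)                            # 1-based label
    return labels


# Made-up logits for three samples and five classes.
logit_rows = [
    [0.1, 0.2, 0.3, 2.0, 0.0],
    [0.0, 0.1, 0.2, 0.3, 1.5],
    [0.4, 3.0, 0.1, 0.0, 0.2],
]
predicted = predict_labels(logit_rows)
print(predicted)
```

This mirrors the `y - 1` conversion applied when the training labels were loaded: the model works with classes 0 to 4, while the saved file contains labels 1 to 5.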

### Making sure labels can be loaded

In [7]:
"""Make sure that the labelsX.npy file can be read."""

labels = np.load("labelsX.npy")
print(labels, type(labels))

[4 5 2 ... 3 2 1] <class 'numpy.ndarray'>
