# EE 508 HW 1 Part 2: Classification

Your task in this Colab notebook is to fill out the sections marked with **TODO** (search for the keyword `TODO` to make sure you do not miss any).

## Cross Validation, Bias-Variance trade-off, Overfitting

In this section, we will demonstrate data splitting and the validation process in machine learning paradigms. We will use the Iris dataset from the `sklearn` library.

Objective:
- Train a Fully-Connected Network (FCN) for classification.  
- Partition the data using three-fold cross-validation and report the training, validation, and testing accuracy.  
- Train the model using cross-entropy loss and evaluate it with 0/1 loss.  

In [28]:
# import required libraries and dataset
import numpy as np
# load sklearn for ML functions
from sklearn.datasets import load_iris
# load torch for training NNs
import torch
import torch.nn as nn
import torch.optim as optim
# plotting library
import matplotlib.pyplot as plt
%matplotlib inline
plt.style.use(['ggplot'])

### **TODO 1**: Implement the cross validation function
In this function, the dataset is first shuffled. Then, we need to implement a loop that iterates through each fold, selecting a subset of samples as the validation set while assigning the remaining samples to the training set, and stores these partitions in the `folds` list.
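As a minimal sketch of how the fold boundaries can be derived, the toy example below uses `np.linspace` to compute the split points and then slices out each validation range. The sizes (10 samples, 3 folds) and the names `bounds` and `splits` are illustrative assumptions, and the indices are left unshuffled so the slices are easy to read.

```python
import numpy as np

n_data, n_folds = 10, 3          # toy sizes chosen for illustration
index = np.arange(n_data)        # unshuffled here so the slices are easy to read

# n_folds + 1 evenly spaced boundaries from 0 to n_data, truncated to integers
bounds = np.linspace(0, n_data, num=n_folds + 1, endpoint=True).astype(int)

splits = []
for i in range(n_folds):
    # samples between two consecutive boundaries form the validation set
    valid_idx = index[bounds[i]:bounds[i + 1]]
    # everything before and after that range forms the training set
    train_idx = np.concatenate((index[:bounds[i]], index[bounds[i + 1]:]))
    splits.append((train_idx, valid_idx))
    print(f"fold {i + 1}: valid={valid_idx.tolist()} train={train_idx.tolist()}")
```

When `n_data` is not divisible by `n_folds`, truncating the boundaries to integers makes the folds differ in size by at most one sample. Every sample still appears in exactly one validation set.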

In [29]:
def cross_validation(x: np.ndarray, y: np.ndarray, n_folds: int=3):
    """
    Split the dataset into the given number of folds
    Parameters:
    - x: Features of the dataset, with shape (n_samples, n_features)
    - y: Class labels of the dataset, with shape (n_samples,)
    - n_folds: the given number of partitions
        For instance, 5-fold CV over 100% of the data:
        fold_1: training on 20~99, validation on 0~19(%)
        fold_2: training on 0~19 and 40~99, validation on 20~39(%)
        fold_3: training on 0~39 and 60~99, validation on 40~59(%)
        fold_4: training on 0~59 and 80~99, validation on 60~79(%)
        fold_5: training on 0~79, validation on 80~99(%)

    Returns:
    - folds (list): a list with len(folds) == n_folds, in the format
        [
            (x_train_fold1, y_train_fold1, x_valid_fold1, y_valid_fold1),
            (x_train_fold2, y_train_fold2, x_valid_fold2, y_valid_fold2),
            (x_train_fold3, y_train_fold3, x_valid_fold3, y_valid_fold3),
            ...
        ]
    """

    folds = []
    n_data = x.shape[0]
    index = np.arange(n_data)
    # shuffle the indices with np.random.shuffle
    np.random.shuffle(index)
    # find the partition boundaries with np.linspace
    partitions = np.linspace(0, n_data, num=n_folds+1, endpoint=True)
    partitions = partitions.astype(int)

    # Finish the code here
    # Implementing cross-validation splits
    for i in range(n_folds):
        valid_idx = index[partitions[i]:partitions[i+1]]
        train_idx = np.concatenate((index[:partitions[i]], index[partitions[i+1]:]))

        x_train, y_train = x[train_idx], y[train_idx]
        x_valid, y_valid = x[valid_idx], y[valid_idx]

        folds.append((x_train, y_train, x_valid, y_valid))


    from collections import Counter

    print("The Partitions:")
    for idx, (_, train_y, _, valid_y) in enumerate(folds):
        print(f"[Fold-{idx+1}] #Training: {train_y.shape[0]:d}; #Validation: {valid_y.shape[0]:d}")
        # you can check the label distribution
        print(Counter(train_y))
        print(Counter(valid_y))

    return folds

In [30]:
# fix the random seed
np.random.seed(42)
# Load Iris dataset
iris = load_iris()
x, y = iris.data, iris.target
# Split into training and validation folds
three_folds = cross_validation(x, y)

The Partitions:
[Fold-1] #Training: 100; #Validation: 50
Counter({1: 35, 2: 34, 0: 31})
Counter({0: 19, 2: 16, 1: 15})
[Fold-2] #Training: 100; #Validation: 50
Counter({2: 35, 1: 33, 0: 32})
Counter({0: 18, 1: 17, 2: 15})
[Fold-3] #Training: 100; #Validation: 50
Counter({0: 37, 1: 32, 2: 31})
Counter({2: 19, 1: 18, 0: 13})


### **TODO 2**: Build a Fully-Connected Network with PyTorch
In this section, we build simple FCN models with different numbers of hidden units for the classification task.

- **Training:** Use cross-entropy for optimization.  
- **Inference:** Evaluate with 0/1 loss.  
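To illustrate why the two losses play different roles, the sketch below computes both on a few hypothetical logit vectors in plain Python (the helper names `cross_entropy` and `zero_one` and the toy `batch` are assumptions made for this example, not part of the assignment code). Cross-entropy changes smoothly with the logits, so it can drive gradient-based training. The 0/1 loss only records whether the arg-max class is right, which makes it easy to interpret but piecewise constant.

```python
import math

def cross_entropy(logits, label):
    """Negative log-probability of the true class under a softmax."""
    m = max(logits)                                   # subtract max for stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[label]

def zero_one(logits, label):
    """1 if the arg-max class is wrong, 0 otherwise."""
    pred = max(range(len(logits)), key=lambda k: logits[k])
    return int(pred != label)

# Hypothetical logits for three samples over three classes
batch = [([2.0, 0.5, 0.1], 0),   # confident and correct
         ([0.6, 0.5, 0.1], 0),   # correct, but barely
         ([0.2, 1.5, 0.3], 2)]   # wrong

ce = [cross_entropy(z, y) for z, y in batch]
zo = [zero_one(z, y) for z, y in batch]
print("cross-entropy:", [round(v, 3) for v in ce])
print("0/1 loss:     ", zo, "mean =", sum(zo) / len(zo))
```

The first two samples have the same 0/1 loss but different cross-entropy values, which is the extra signal the optimizer uses during training.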

In [31]:
# define the FCN model
class FCN_model(nn.Module):
    # take the argument for the number of hidden units
    def __init__(self, n_hidden=32):
        # Finish the code here

        super(FCN_model, self).__init__()

        # Define input and output sizes
        n_input = 4  # Number of features in the Iris dataset: sepal length, sepal width, petal length, petal width (cm)
        n_output = 3  # Number of classes in the Iris dataset

        # Fully connected layers
        self.fc1 = nn.Linear(n_input, n_hidden)  # Input layer to hidden layer
        self.relu = nn.ReLU()  # Activation function
        self.fc2 = nn.Linear(n_hidden, n_output)  # Hidden layer to output layer

    def forward(self, x):
        # Finish the code here
        x = self.fc1(x)  # First fully connected layer
        x = self.relu(x)  # Activation function
        x = self.fc2(x)  # Second fully connected layer (output)

        return x

Set up the evaluation and training functions for the FCN models.

In [32]:
def eval(model:nn.Module,
         x:torch.Tensor,
         y:torch.Tensor) -> float:
    """Evaluate the model with the 0/1 loss at inference time.
    The predicted label is the class with the largest logit from the model.

    Parameters:
    - model: the FCN model
    - x: input features
    - y: ground truth labels, dtype=long

    Returns:
    - loss: the average 0/1 loss value
    """
    # Evaluate the model
    model.eval()
    with torch.no_grad():
        preds = torch.argmax(model(x), dim=1)

    loss = 0
    # Finish the code here
    loss = torch.sum(preds != y).item()

    print(f"Averaging 0/1 loss: {loss/preds.shape[0]:.4f}")
    return loss/preds.shape[0]

In [33]:
def train(model:nn.Module,
          x_train:torch.Tensor,
          y_train:torch.Tensor,
          x_valid:torch.Tensor,
          y_valid:torch.Tensor,
          epochs:int=300):
    """Training process
    Parameters:
    - model: the FCN model
    - x_train, y_train: training features and labels (dtype=long)
    - x_valid, y_valid: validation features and labels (dtype=long)
    - epochs: number of epochs for training
    """
    # To simplify the process
    # we do not take batches but use all the training samples
    # set up the objective function and the optimizer
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(model.parameters(), lr=1e-2)
    # training loop
    for epoch in range(epochs):
        model.train()
        # Forward pass
        outputs = model(x_train)
        loss = criterion(outputs, y_train)

        # Backward pass and optimization
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        if (epoch + 1) % 100 == 0:
            print(f"Epoch [{epoch + 1}/{epochs}], Cross Entropy Loss: {loss.item():.4f}")
            print(f"[Train] ", end="")
            eval(model, x_train, y_train)
            print(f"[Valid] ", end="")
            eval(model, x_valid, y_valid)


### **TODO 3**: Conduct the training/validation process in each fold
We will use three-fold cross-validation, meaning you need to train three models and report the training and validation loss for all three folds.  

First, instantiate an FCN model with 32 hidden units.  
Then, call the `train` function, which takes the training and validation folds created by the `cross_validation()` function, along with the model, as input. Set `epochs` to `500`.  


In [34]:
train_losses, valid_losses = [], []

for idx, (x_train, y_train, x_valid, y_valid) in enumerate(three_folds):
    print(f"===== Training Fold {idx} =====")
    x_train = torch.Tensor(x_train)
    y_train = torch.tensor(y_train, dtype=torch.long)
    x_valid = torch.Tensor(x_valid)
    y_valid = torch.tensor(y_valid, dtype=torch.long)

    # Finish the code here
    # Instantiate a new model for each fold
    model = FCN_model(n_hidden=32)

    # Train the model (train() sets up its own cross-entropy criterion and SGD optimizer)
    train(model, x_train, y_train, x_valid, y_valid, epochs=500)

    train_losses.append(eval(model, x_train, y_train))
    valid_losses.append(eval(model, x_valid, y_valid))

===== Training Fold 0 =====
Epoch [100/500], Cross Entropy Loss: 0.6390
[Train] Averaging 0/1 loss: 0.3300
[Valid] Averaging 0/1 loss: 0.3000
Epoch [200/500], Cross Entropy Loss: 0.4884
[Train] Averaging 0/1 loss: 0.0900
[Valid] Averaging 0/1 loss: 0.0600
Epoch [300/500], Cross Entropy Loss: 0.4100
[Train] Averaging 0/1 loss: 0.0400
[Valid] Averaging 0/1 loss: 0.0200
Epoch [400/500], Cross Entropy Loss: 0.3528
[Train] Averaging 0/1 loss: 0.0400
[Valid] Averaging 0/1 loss: 0.0200
Epoch [500/500], Cross Entropy Loss: 0.3066
[Train] Averaging 0/1 loss: 0.0300
[Valid] Averaging 0/1 loss: 0.0200
Averaging 0/1 loss: 0.0300
Averaging 0/1 loss: 0.0200
===== Training Fold 1 =====
Epoch [100/500], Cross Entropy Loss: 0.6014
[Train] Averaging 0/1 loss: 0.3100
[Valid] Averaging 0/1 loss: 0.3200
Epoch [200/500], Cross Entropy Loss: 0.4617
[Train] Averaging 0/1 loss: 0.1100
[Valid] Averaging 0/1 loss: 0.1600
Epoch [300/500], Cross Entropy Loss: 0.3889
[Train] Averaging 0/1 loss: 0.0600
[Valid] Averaging

In [35]:
print(f"#Fold, training loss, validation loss")
for idx, (train_loss, valid_loss) in enumerate(zip(train_losses, valid_losses)):
    print(f"{idx:>5d},          {train_loss:.2f},            {valid_loss:.2f}")

#Fold, training loss, validation loss
    0,          0.03,            0.02
    1,          0.02,            0.08
    2,          0.04,            0.02


### **TODO 4**: Check overfitting with a complex model
We can follow the same procedure with a more complex FCN model.  
Now, set the number of hidden units (`n_hidden`) to `2048` and repeat the three-fold cross-validation process with `epochs = 500`.  
The gap between the training and validation performance should increase.  
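To make "more complex" concrete, the sketch below counts the trainable parameters of a one-hidden-layer network with 4 inputs and 3 outputs, which is the shape `FCN_model` uses in this notebook. The helper name `fcn_param_count` is an assumption made for this example.

```python
def fcn_param_count(n_hidden: int, n_input: int = 4, n_output: int = 3) -> int:
    """Weights plus biases of a one-hidden-layer fully connected network."""
    hidden_layer = n_input * n_hidden + n_hidden      # fc1 weights + biases
    output_layer = n_hidden * n_output + n_output     # fc2 weights + biases
    return hidden_layer + output_layer

for h in (32, 2048):
    print(f"n_hidden={h:>4d}: {fcn_param_count(h)} parameters")
```

The count grows as `8 * n_hidden + 3`, so the 2048-unit model has 16387 parameters against 259 for the 32-unit model, while each fold provides only 100 training samples.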

In [36]:
train_overfit, valid_overfit = [], []

for idx, (x_train, y_train, x_valid, y_valid) in enumerate(three_folds):
    print(f"===== Training Fold {idx} =====")
    x_train = torch.Tensor(x_train)
    y_train = torch.tensor(y_train, dtype=torch.long)
    x_valid = torch.Tensor(x_valid)
    y_valid = torch.tensor(y_valid, dtype=torch.long)

    # Finish the code here
    # Define the complex model with 2048 hidden units
    model = FCN_model(n_hidden=2048)
    criterion = nn.CrossEntropyLoss()  # Loss function
    # Note: this loop uses Adam (lr=1e-3), unlike train(), which uses SGD (lr=1e-2)
    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)  # Optimizer

    # Training the model
    epochs = 500
    for epoch in range(epochs):
        optimizer.zero_grad()  # Reset gradients
        outputs = model(x_train)  # Forward pass
        loss = criterion(outputs, y_train)  # Compute loss
        loss.backward()  # Backpropagation
        optimizer.step()  # Update weights


    train_overfit.append(eval(model, x_train, y_train))
    valid_overfit.append(eval(model, x_valid, y_valid))

===== Training Fold 0 =====
Averaging 0/1 loss: 0.0100
Averaging 0/1 loss: 0.0200
===== Training Fold 1 =====
Averaging 0/1 loss: 0.0000
Averaging 0/1 loss: 0.0600
===== Training Fold 2 =====
Averaging 0/1 loss: 0.0200
Averaging 0/1 loss: 0.0000


In [37]:
print(f"#Fold, training loss, validation loss")
for idx, (train_loss, valid_loss) in enumerate(zip(train_overfit, valid_overfit)):
    print(f"{idx:>5d},          {train_loss:.2f},            {valid_loss:.2f}")

#Fold, training loss, validation loss
    0,          0.01,            0.02
    1,          0.00,            0.06
    2,          0.02,            0.00
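One way to quantify the train/validation gap is to average the per-fold 0/1 losses and take the difference. The sketch below does this with the values printed in this notebook for the 32-unit and 2048-unit models; the dictionary layout and the name `gaps` are assumptions made for this example, and the inputs are the two-decimal values as reported.

```python
from statistics import mean

# Per-fold 0/1 losses copied from the tables printed in this notebook
results = {
    "FCN, 32 hidden units":   {"train": [0.03, 0.02, 0.04], "valid": [0.02, 0.08, 0.02]},
    "FCN, 2048 hidden units": {"train": [0.01, 0.00, 0.02], "valid": [0.02, 0.06, 0.00]},
}

gaps = {}
for name, r in results.items():
    train_mean, valid_mean = mean(r["train"]), mean(r["valid"])
    gaps[name] = valid_mean - train_mean   # positive gap: worse on validation data
    print(f"{name}: train={train_mean:.3f}, valid={valid_mean:.3f}, gap={gaps[name]:.3f}")
```

Each validation fold holds 50 samples, so a single misclassification shifts a fold's validation loss by 0.02. Differences of this size between the two models should therefore be read with caution.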


### **TODO 5**: Compare the FCN with statistical ML models
Here, we will use the Naive Bayes model from the `sklearn` library and perform three-fold validation.  

In [38]:
# Load the Naive Bayes classifier from the library
from sklearn.naive_bayes import GaussianNB

train_nb, valid_nb = [], []
for idx, (x_train, y_train, x_valid, y_valid) in enumerate(three_folds):

    # Finish the code here
    # Initialize and train the Naïve Bayes model
    model = GaussianNB()
    model.fit(x_train, y_train)

    # Make predictions
    y_train_pred = model.predict(x_train)
    y_valid_pred = model.predict(x_valid)

    # Calculate accuracy
    train_acc = np.mean(y_train_pred == y_train)
    valid_acc = np.mean(y_valid_pred == y_valid)

    train_nb.append(1 - train_acc)
    valid_nb.append(1 - valid_acc)

In [39]:
print(f"#Fold, training loss, validation loss")
for idx, (train_loss, valid_loss) in enumerate(zip(train_nb, valid_nb)):
    print(f"{idx:>5d},          {train_loss:.2f},            {valid_loss:.2f}")

#Fold, training loss, validation loss
    0,          0.05,            0.04
    1,          0.02,            0.06
    2,          0.04,            0.04


### **TODO 6**:
Answer the following questions in the next cell.  
1. What is the bias-variance trade-off in machine learning?
2. How can overfitting and underfitting be reduced?
3. How do the training and inference processes differ between the Naive Bayes model and a fully connected neural network?

Your answer:

"""
1. What is the bias-variance trade-off in machine learning?
   - The bias-variance trade-off refers to the balance between two sources of error in a machine learning model:
     - **Bias**: Error due to overly simplistic assumptions in the model, leading to underfitting.
     - **Variance**: Error due to the model's sensitivity to the particular training sample, typically caused by excessive complexity, leading to overfitting.
   - A model with high bias makes strong assumptions and fails to capture underlying patterns, while a model with high variance also fits noise in the training data.
   - Reducing one usually increases the other, so the goal is to choose the model complexity that minimizes the total expected error on unseen data rather than either term alone.

2. How can overfitting and underfitting be reduced?
   - **To reduce overfitting (high variance)**:
     - Use more training data
     - Apply regularization (L1, L2)
     - Use dropout (for neural networks)
     - Simplify the model architecture
     - Use data augmentation
     - Use cross-validation to detect overfitting and to select hyperparameters (it does not reduce overfitting by itself)
   - **To reduce underfitting (high bias)**:
     - Increase model complexity (add more layers/nodes in a neural network)
     - Use better feature engineering
     - Train for more epochs
     - Reduce regularization strength

3. How do the training and inference processes differ between the Naive Bayes model and a fully connected neural network?
   - **Naive Bayes**:
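     - **Training**: Estimates class priors and per-class feature likelihoods in closed form in a single pass over the data (for `GaussianNB`, a mean and a variance per feature and class), with no iterative optimization.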
     - **Inference**: Uses Bayes’ theorem to compute posterior probabilities for each class and selects the class with the highest probability.
     - **Computation**: Fast and requires only counting and simple arithmetic operations.
   - **Fully Connected Neural Network (FCN)**:
     - **Training**: Uses backpropagation and gradient descent to adjust weights iteratively over many epochs, based on a loss function.
     - **Inference**: Passes input data through multiple layers of neurons, applying learned weights and activation functions.
     - **Computation**: More expensive: inference requires matrix multiplications, and training additionally requires backpropagation at every epoch.
"""
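The contrast in question 3 can be sketched with a from-scratch Gaussian Naive Bayes in plain Python. This is a simplified illustration, not `sklearn`'s implementation: the function names, the toy data, and the constant `1e-9` added to each variance are assumptions made for this example. Training is a single pass that computes per-class statistics, and inference compares log posteriors. The FCN instead needs an iterative gradient-descent loop like the one in `train`.

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Closed-form 'training': per-class prior, feature means and variances."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    params = {}
    for label, rows in by_class.items():
        n = len(rows)
        means = [sum(col) / n for col in zip(*rows)]
        # small constant keeps the variance strictly positive (illustrative choice)
        variances = [sum((v - m) ** 2 for v in col) / n + 1e-9
                     for col, m in zip(zip(*rows), means)]
        params[label] = (n / len(X), means, variances)
    return params

def predict_gaussian_nb(params, row):
    """Pick the class with the highest log prior + summed log likelihoods."""
    def log_posterior(label):
        prior, means, variances = params[label]
        score = math.log(prior)
        for v, m, s2 in zip(row, means, variances):
            score += -0.5 * math.log(2 * math.pi * s2) - (v - m) ** 2 / (2 * s2)
        return score
    return max(params, key=log_posterior)

# Hypothetical two-feature toy data with two well-separated classes
X = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [4.0, 5.0], [4.2, 5.1], [3.8, 4.9]]
y = [0, 0, 0, 1, 1, 1]
params = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(params, [1.1, 2.1]), predict_gaussian_nb(params, [4.1, 5.0]))
```

There is no learning rate, epoch count, or random initialization here, so fitting the same data twice gives identical parameters.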

