# **Deep Learning Course**

## **Loss Functions and Multilayer Perceptrons (MLP)**

---

### **Student Information:**

- **Name:** *Tina Halimi*
- **Student Number:** *400101078*

---

### **Assignment Overview**

In this notebook, we will explore various loss functions used in neural networks, with a specific focus on their role in training **Multilayer Perceptrons (MLPs)**. By the end of this notebook, you will have a deeper understanding of:
- Types of loss functions
- How loss functions affect the training process
- The relationship between loss functions and model optimization in MLPs

---

### **Table of Contents**

1. Introduction to Loss Functions
2. Types of Loss Functions
3. Multilayer Perceptrons (MLP)
4. Implementing Loss Functions in MLP
5. Conclusion

---



# 1. Introduction to Loss Functions

In deep learning, **loss functions** play a crucial role in training models by quantifying the difference between the predicted outputs and the actual targets. Selecting the appropriate loss function is essential for the success of your model. In this assignment, we will explore several loss functions available in PyTorch, review their theoretical background, and use a scaffolded trainer class to experiment with them.

Before beginning, let's train a simple MLP model using the **L1Loss** function. We'll return to this setup later to experiment with different loss functions. We'll start by importing the necessary libraries and defining the model architecture.

First things first, let's talk about **L1Loss**.

### 1. L1Loss (`torch.nn.L1Loss`)
- **Description:** Also known as Mean Absolute Error (MAE), L1Loss computes the average absolute difference between the predicted values and the target values.
- **Use Case:** Suitable for regression tasks where robustness to outliers is desired.

Here is the mathematical formulation of L1Loss:
\begin{equation}
\text{L1Loss} = \frac{1}{n} \sum_{i=1}^{n} |y_{\text{pred}_i} - y_{\text{true}_i}|
\end{equation}
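
As a quick sanity check of the formula, here is a minimal, dependency-free sketch that computes the mean absolute error by hand. The function name `l1_loss` and the sample values are illustrative assumptions, not part of the assignment; with its default `reduction='mean'`, `torch.nn.L1Loss` should give the same number for inputs of equal shape.

```python
def l1_loss(y_pred, y_true):
    """Mean absolute error over two equal-length sequences of numbers."""
    if len(y_pred) != len(y_true):
        raise ValueError("y_pred and y_true must have the same length")
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_pred)


# Hypothetical predictions and binary targets (1 = survived, 0 = did not survive).
y_pred = [0.9, 0.2, 0.6, 0.1]
y_true = [1.0, 0.0, 1.0, 0.0]

# Absolute errors are 0.1, 0.2, 0.4 and 0.1, so the mean is about 0.2.
print(l1_loss(y_pred, y_true))
```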

Let's implement a simple MLP model using the L1Loss function.

In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import pandas as pd
from torch.utils.data import TensorDataset, DataLoader, random_split
from sklearn.model_selection import train_test_split
from torch.optim import Adam
from tqdm import tqdm
from sklearn.preprocessing import StandardScaler
import requests
from io import StringIO


# Don't be curious about Adam for now; it's just a fancy name for a fancy optimization algorithm

Here, we'll define a class called `SimpleMLP` that inherits from `nn.Module`. This class can have multiple layers, and we'll use the `nn.Sequential` module to define the layers of the model. The model will have the following architecture:

In [2]:
class SimpleMLP(nn.Module):
    def __init__(self, input_dim, hidden_dim, output_dim, num_hidden_layers=1, last_layer_activation_fn=nn.ReLU):
        super(SimpleMLP, self).__init__()

        layers = [nn.Linear(input_dim, hidden_dim), nn.ReLU()]
        
        for _ in range(num_hidden_layers - 1):
            layers.extend([nn.Linear(hidden_dim, hidden_dim), nn.ReLU()])

        layers.append(nn.Linear(hidden_dim, output_dim))

        if last_layer_activation_fn is not None:
            layers.append(last_layer_activation_fn())

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)
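
To make the layer layout concrete, the sketch below mirrors the list-building logic of `SimpleMLP.__init__` without PyTorch and reports the shape of each linear layer. The helper names `linear_shapes` and `count_parameters` are hypothetical, and the parameter count assumes every linear layer has a weight matrix plus a bias vector, which is the `nn.Linear` default.

```python
def linear_shapes(input_dim, hidden_dim, output_dim, num_hidden_layers=1):
    """Return (in_features, out_features) for each linear layer, in order."""
    shapes = [(input_dim, hidden_dim)]                              # first hidden layer
    shapes += [(hidden_dim, hidden_dim)] * (num_hidden_layers - 1)  # extra hidden layers
    shapes.append((hidden_dim, output_dim))                         # output layer
    return shapes


def count_parameters(shapes):
    """Weights (in * out) plus biases (out), summed over all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in shapes)


# Example configuration: 4 input features, 16 hidden units, 1 output, 2 hidden layers.
shapes = linear_shapes(input_dim=4, hidden_dim=16, output_dim=1, num_hidden_layers=2)
print(shapes)                    # [(4, 16), (16, 16), (16, 1)]
print(count_parameters(shapes))  # 369
```

Each hidden linear layer is followed by a ReLU; the output layer is followed by `last_layer_activation_fn` only when that argument is not `None`.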

Now, let's define a class called `SimpleMLPTrainer` that wraps the training and evaluation loops for a model, a loss function (`criterion`), and an optimizer:

In [3]:
class SimpleMLPTrainer:
    def __init__(self, model, criterion, optimizer):
        self.model = model
        self.criterion = criterion
        self.optimizer = optimizer

    def train(self, train_loader, num_epochs):
        self.model.train()
        train_losses = []

        for epoch in range(num_epochs):
            epoch_loss = 0.0
            for inputs, targets in tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}"):

                self.optimizer.zero_grad()
                
                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)
                
                loss.backward()
                self.optimizer.step()
                
                epoch_loss += loss.item()

            avg_epoch_loss = epoch_loss / len(train_loader)
            train_losses.append(avg_epoch_loss)
            
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {avg_epoch_loss:.4f}")
        
        return train_losses

    def evaluate(self, val_loader):
        self.model.eval()
        val_loss = 0.0
        correct = 0
        total = 0

        with torch.no_grad():
            for inputs, targets in val_loader:
                outputs = self.model(inputs)
                loss = self.criterion(outputs, targets)
                val_loss += loss.item()

                # Check the output shape to determine if it's binary or multi-class
                if outputs.shape[1] == 2:  # Multi-class prediction
                    predicted = torch.argmax(outputs, dim=1)
                    # Convert targets to 1D if they are one-hot encoded or multi-dimensional
                    targets = torch.argmax(targets, dim=1) if targets.dim() > 1 else targets
                else:  # Binary classification
                    predicted = (outputs >= 0.5).float().squeeze()
                    targets = targets.squeeze()

                correct += (predicted == targets).sum().item()
                total += targets.size(0)

        avg_val_loss = val_loss / len(val_loader)
        accuracy = 100 * correct / total if total > 0 else 0

        print(f"Validation Loss: {avg_val_loss:.4f}, Accuracy: {accuracy:.2f}%")

        return avg_val_loss, accuracy
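
The accuracy logic in `evaluate` boils down to two prediction rules: threshold a single score at 0.5, or take the index of the largest of two scores. The sketch below restates them on plain Python lists so the behavior is easy to inspect. The function names and sample scores are hypothetical, and the code is a simplified stand-in for the tensor operations above rather than a replacement for them.

```python
def predict_from_scores(scores, threshold=0.5):
    """Single output per sample: predict class 1 when the score reaches the threshold."""
    return [1 if s >= threshold else 0 for s in scores]


def predict_from_rows(rows):
    """One row of class scores per sample: predict the index of the largest score."""
    return [max(range(len(row)), key=lambda k: row[k]) for row in rows]


def accuracy(predicted, targets):
    """Percentage of positions where prediction and target agree."""
    return 100 * sum(p == t for p, t in zip(predicted, targets)) / len(targets)


# Single-output case (as with L1Loss or MSELoss on a 0/1 target).
single_scores = [0.7, 0.2, 0.5, 0.4]
print(accuracy(predict_from_scores(single_scores), [1, 0, 0, 0]))  # 75.0

# Two-output case (as with NLLLoss or CrossEntropyLoss).
two_scores = [[-0.1, -2.3], [-1.2, -0.4]]
print(accuracy(predict_from_rows(two_scores), [0, 0]))             # 50.0
```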


Next, let's test our model using the L1Loss function. You'll use the <span style="color:red">*Titanic Dataset*</span> to train the model.


In [4]:
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
response = requests.get(train_url, verify=False)
data = pd.read_csv(StringIO(response.text))

data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_tensor = torch.tensor(X, dtype=torch.float32)
y_tensor = torch.tensor(y, dtype=torch.float32).view(-1, 1)

dataset = TensorDataset(X_tensor, y_tensor)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)



In [5]:
print(y.shape)

(714,)


In [6]:
print("Training dataset shape:", len(train_dataset))
print("Validation dataset shape:", len(val_dataset))

Training dataset shape: 571
Validation dataset shape: 143
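
These sizes follow directly from the split arithmetic. The sketch below reproduces it in plain Python; the total of 714 rows is inferred from 571 + 143 above, and `math.ceil` models the `DataLoader` default of keeping the last, smaller batch (`drop_last=False`).

```python
import math

total_rows = 571 + 143             # rows left after dropna(), inferred from the output above
train_size = int(0.8 * total_rows)
val_size = total_rows - train_size
batch_size = 32

# With drop_last=False (the DataLoader default), a final partial batch is kept.
train_batches = math.ceil(train_size / batch_size)
val_batches = math.ceil(val_size / batch_size)

print(train_size, val_size)        # 571 143
print(train_batches, val_batches)  # 18 5
```

The 18 training batches are what the progress bar counts in each epoch.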


<div style="text-align: center;"> <span style="color:red; font-size: 26px; font-weight: bold;">Let's train!</span> </div>

In [7]:
# Using L1 Loss
input_dim = X.shape[1]
hidden_dim = 16
output_dim = 1
num_hidden_layers = 2

model = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=None)
criterion = nn.L1Loss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

trainer = SimpleMLPTrainer(model, criterion, optimizer)
num_epochs = 20
train_losses = trainer.train(train_loader, num_epochs)
# trainer.evaluate(val_loader)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 201.97it/s]


Epoch [1/20], Loss: 0.4172


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 1982.60it/s]


Epoch [2/20], Loss: 0.3984


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 1652.17it/s]


Epoch [3/20], Loss: 0.3878


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 1380.29it/s]


Epoch [4/20], Loss: 0.3758


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 1807.46it/s]


Epoch [5/20], Loss: 0.3586


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 1717.10it/s]


Epoch [6/20], Loss: 0.3355


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 353.24it/s]


Epoch [7/20], Loss: 0.2980


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 2006.47it/s]


Epoch [8/20], Loss: 0.2610


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 1496.01it/s]


Epoch [9/20], Loss: 0.2411


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 1914.72it/s]


Epoch [10/20], Loss: 0.2306


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 1746.90it/s]


Epoch [11/20], Loss: 0.2253


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 1569.14it/s]


Epoch [12/20], Loss: 0.2216


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 1639.00it/s]


Epoch [13/20], Loss: 0.2196


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 1466.48it/s]


Epoch [14/20], Loss: 0.2202


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 1938.22it/s]


Epoch [15/20], Loss: 0.2180


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 1433.92it/s]


Epoch [16/20], Loss: 0.2173


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 1810.14it/s]


Epoch [17/20], Loss: 0.2177


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 1745.36it/s]


Epoch [18/20], Loss: 0.2154


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 360.48it/s]


Epoch [19/20], Loss: 0.2155


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 1970.03it/s]

Epoch [20/20], Loss: 0.2149





In [8]:
val_loss, accuracy = trainer.evaluate(val_loader)

Validation Loss: 0.2296, Accuracy: 76.92%


---
# 2. Types of Loss Functions

PyTorch offers a variety of built-in loss functions tailored for different types of problems, such as regression, classification, and more. Below, we discuss several commonly used loss functions, their theoretical foundations, and typical use cases.

### 2. MSELoss (`torch.nn.MSELoss`)
- **Description:** Mean Squared Error (MSE) calculates the average of the squares of the differences between predicted and target values.
- **Use Case:** Commonly used in regression problems where larger errors are significantly penalized.

Here is boring math stuff for MSE:
\begin{equation}
\text{MSE} = \frac{1}{n} \sum_{i=1}^{n} (y_{i} - \hat{y}_{i})^{2}
\end{equation}
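
To see what "larger errors are significantly penalized" means in practice, this sketch compares mean absolute error and mean squared error on the same targets, with and without one large error. The helper names and the sample values are illustrative assumptions.

```python
def mae(y_pred, y_true):
    """Mean absolute error (the quantity L1Loss averages)."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)


def mse(y_pred, y_true):
    """Mean squared error (the quantity MSELoss averages)."""
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)


y_true = [0.0, 0.0, 0.0, 0.0]
small_errors = [1.0, 1.0, 1.0, 1.0]  # every prediction is off by 1
one_outlier = [1.0, 1.0, 1.0, 9.0]   # one prediction is off by 9

print(mae(small_errors, y_true), mse(small_errors, y_true))  # 1.0 1.0
print(mae(one_outlier, y_true), mse(one_outlier, y_true))    # 3.0 21.0
```

The single outlier triples the MAE but multiplies the MSE by 21, which is why MSE reacts much more strongly to a few large errors.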

<span style="color:red; font-size: 18px; font-weight: bold;">Warning:</span> Don't forget to reinitialize the model before experimenting with different loss functions.

In [9]:
model_MSE = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=None)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model_MSE.parameters(), lr=0.001)
trainer = SimpleMLPTrainer(model_MSE, criterion, optimizer)
train_losses = trainer.train(train_loader, num_epochs)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 682.26it/s]


Epoch [1/20], Loss: 0.2991


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 339.96it/s]


Epoch [2/20], Loss: 0.2260


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 2082.92it/s]


Epoch [3/20], Loss: 0.1835


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 1490.31it/s]


Epoch [4/20], Loss: 0.1633


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 1911.86it/s]


Epoch [5/20], Loss: 0.1527


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 1744.72it/s]


Epoch [6/20], Loss: 0.1467


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 1702.27it/s]


Epoch [7/20], Loss: 0.1430


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 1599.90it/s]


Epoch [8/20], Loss: 0.1417


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 1845.18it/s]


Epoch [9/20], Loss: 0.1402


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 1439.45it/s]


Epoch [10/20], Loss: 0.1391


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 1933.21it/s]


Epoch [11/20], Loss: 0.1391


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 1450.90it/s]


Epoch [12/20], Loss: 0.1378


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 2100.07it/s]


Epoch [13/20], Loss: 0.1375


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 360.34it/s]


Epoch [14/20], Loss: 0.1371


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 1812.71it/s]


Epoch [15/20], Loss: 0.1368


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 1601.97it/s]


Epoch [16/20], Loss: 0.1372


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 1766.60it/s]


Epoch [17/20], Loss: 0.1372


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 1612.30it/s]


Epoch [18/20], Loss: 0.1357


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 1643.50it/s]


Epoch [19/20], Loss: 0.1351


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 1585.78it/s]

Epoch [20/20], Loss: 0.1353





In [10]:
val_loss, accuracy = trainer.evaluate(val_loader)


Validation Loss: 0.1398, Accuracy: 79.02%


### 3. NLLLoss (`torch.nn.NLLLoss`)
- **Description:** Negative Log-Likelihood Loss is the negative log of the probability the model assigns to the target class, averaged over the batch. It is small when the model puts high probability on the correct class.
- **Use Case:** Typically used in multi-class classification tasks, especially when combined with `log_softmax` activation.

Here is the mathematical formulation of NLLLoss:
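\begin{equation}
\text{NLLLoss} = -\frac{1}{n} \sum_{i=1}^{n} \log\left(\hat{p}_{i, y_i}\right)
\end{equation}
where $ \hat{p}_{i, y_i} $ is the probability the model assigns to the true class $ y_i $ of sample $ i $. Note that `torch.nn.NLLLoss` does not take the logarithm itself; it expects its input to already contain log-probabilities.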
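
One way to build intuition for the logarithm: the sketch below evaluates the per-sample penalty $-\log p$ for several probabilities assigned to the correct class. The probabilities and the helper name `nll_from_probs` are made up for illustration.

```python
import math


def nll_from_probs(true_class_probs):
    """Average negative log of the probability given to each sample's true class."""
    return -sum(math.log(p) for p in true_class_probs) / len(true_class_probs)


# The per-sample penalty grows without bound as the probability approaches 0.
for p in (0.99, 0.5, 0.1, 0.001):
    print(f"p = {p:<5} -> -log(p) = {-math.log(p):.4f}")

confident_and_right = [0.9, 0.8, 0.95]
one_confident_mistake = [0.9, 0.8, 0.001]
print(nll_from_probs(confident_and_right))
print(nll_from_probs(one_confident_mistake))
```

A single confidently wrong prediction dominates the average, which is the behavior the logarithm is there to produce.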

I hope you noticed the logarithm in the formula. It's important!

Why?

In this part, run your training with ReLU as the last layer's activation. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [11]:
# Adapt data for NLL loss
import requests
from io import StringIO
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
response = requests.get(train_url, verify=False)
data = pd.read_csv(StringIO(response.text))

data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

scaler = StandardScaler()
X = scaler.fit_transform(X)

X_tensor = torch.tensor(X, dtype=torch.float32)
# y_tensor = torch.tensor(y, dtype=torch.float32).view(-1, 1)
y_tensor = torch.tensor(y, dtype=torch.long).squeeze()

dataset = TensorDataset(X_tensor, y_tensor)

train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])


train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)





In [12]:
# Model with ReLU as the last layer activation function
output_dim_NLL = 2
model_NLL_relu = SimpleMLP(
    input_dim=input_dim, 
    hidden_dim=hidden_dim, 
    output_dim=output_dim_NLL, 
    num_hidden_layers=num_hidden_layers, 
    last_layer_activation_fn=nn.ReLU
)

criterion_relu = nn.NLLLoss()
optimizer_relu = torch.optim.Adam(model_NLL_relu.parameters(), lr=0.001)
trainer_relu = SimpleMLPTrainer(model_NLL_relu, criterion_relu, optimizer_relu)

train_losses_relu = trainer_relu.train(train_loader, num_epochs)


Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 690.50it/s]


Epoch [1/20], Loss: -0.1995


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 1196.47it/s]


Epoch [2/20], Loss: -0.2969


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 1566.76it/s]


Epoch [3/20], Loss: -0.4379


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 1367.46it/s]


Epoch [4/20], Loss: -0.6478


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 1703.62it/s]


Epoch [5/20], Loss: -0.9684


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 1632.34it/s]


Epoch [6/20], Loss: -1.4387


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 1360.02it/s]


Epoch [7/20], Loss: -2.1365


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 1593.35it/s]


Epoch [8/20], Loss: -3.1372


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 389.30it/s]


Epoch [9/20], Loss: -4.4995


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 947.46it/s]


Epoch [10/20], Loss: -6.3015


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 1758.00it/s]


Epoch [11/20], Loss: -8.6242


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 1370.59it/s]


Epoch [12/20], Loss: -11.5096


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 1894.92it/s]


Epoch [13/20], Loss: -15.1290


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 1359.19it/s]


Epoch [14/20], Loss: -19.6202


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 1595.60it/s]


Epoch [15/20], Loss: -25.2060


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 1467.48it/s]


Epoch [16/20], Loss: -31.9297


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 1476.17it/s]


Epoch [17/20], Loss: -39.9339


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 1655.03it/s]


Epoch [18/20], Loss: -49.1431


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 1663.27it/s]


Epoch [19/20], Loss: -59.9061


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 1343.59it/s]

Epoch [20/20], Loss: -72.1965





In [13]:
val_loss_relu, accuracy_relu = trainer_relu.evaluate(val_loader)


Validation Loss: -78.0510, Accuracy: 62.24%


In [14]:
# Model with LogSoftmax as the last layer activation function
output_dim_NLL = 2
model_NLL_logsoftmax = SimpleMLP(
    input_dim=input_dim, 
    hidden_dim=hidden_dim, 
    output_dim=output_dim_NLL, 
    num_hidden_layers=num_hidden_layers, 
    last_layer_activation_fn=lambda: nn.LogSoftmax(dim=1)  # explicit class dimension avoids the implicit-dim deprecation warning
)

criterion_logsoftmax = nn.NLLLoss()
optimizer_logsoftmax = torch.optim.Adam(model_NLL_logsoftmax.parameters(), lr=0.001)
trainer_logsoftmax = SimpleMLPTrainer(model_NLL_logsoftmax, criterion_logsoftmax, optimizer_logsoftmax)

train_losses_logsoftmax = trainer_logsoftmax.train(train_loader, num_epochs)


Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 629.40it/s]


Epoch [1/20], Loss: 0.6493


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 961.08it/s]


Epoch [2/20], Loss: 0.6247


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 1660.53it/s]


Epoch [3/20], Loss: 0.5999


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 358.30it/s]


Epoch [4/20], Loss: 0.5733


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 1848.98it/s]


Epoch [5/20], Loss: 0.5429


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 1233.96it/s]


Epoch [6/20], Loss: 0.5177


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 1646.08it/s]


Epoch [7/20], Loss: 0.4936


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 1338.49it/s]


Epoch [8/20], Loss: 0.4780


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 1840.50it/s]


Epoch [9/20], Loss: 0.4650


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 1206.92it/s]


Epoch [10/20], Loss: 0.4572


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 1826.30it/s]


Epoch [11/20], Loss: 0.4521


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 1298.61it/s]


Epoch [12/20], Loss: 0.4457


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 1635.91it/s]


Epoch [13/20], Loss: 0.4412


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 1338.61it/s]


Epoch [14/20], Loss: 0.4365


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 1836.74it/s]


Epoch [15/20], Loss: 0.4320


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 347.62it/s]


Epoch [16/20], Loss: 0.4291


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 1873.06it/s]


Epoch [17/20], Loss: 0.4270


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 1360.41it/s]


Epoch [18/20], Loss: 0.4251


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 1664.04it/s]


Epoch [19/20], Loss: 0.4260


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 1420.91it/s]

Epoch [20/20], Loss: 0.4239





In [15]:
val_loss_logsoftmax, accuracy_logsoftmax = trainer_logsoftmax.evaluate(val_loader)

Validation Loss: 0.4309, Accuracy: 77.62%


Your reason for your choice:


The results highlight a significant difference in performance between using `ReLU` and `LogSoftmax` as the last layer activation function with `NLLLoss`.

### Observed Results:
1. **ReLU Activation**:
   - The negative, steadily decreasing loss and the low accuracy (62.24%) show that the model is not learning a useful classifier. `NLLLoss` expects log-probabilities (values in `(-inf, 0]`) and simply negates the entry at the target index, but `ReLU` outputs values in the range `[0, ∞)`, which are not valid log-probabilities. The optimizer can therefore push the loss toward `-inf` just by inflating the outputs, which is what the training log shows (down to -72.20 by epoch 20).

2. **LogSoftmax Activation**:
   - Using `LogSoftmax` provides valid log-probabilities as outputs, allowing `NLLLoss` to calculate the loss correctly. The loss is positive and decreases during training, and validation accuracy is higher (77.62% versus 62.24%), indicating effective training.

### Explanation of Differences:
- **LogSoftmax and NLLLoss Compatibility**: `NLLLoss` is designed to work with log-probabilities, which are provided by `LogSoftmax`. This combination allows the loss function to measure the likelihood of the target class under the predicted probability distribution accurately.
- **Incompatibility of ReLU with NLLLoss**: `ReLU` does not constrain outputs to the range required by `NLLLoss`, leading to invalid loss calculations and poor learning.

### Solution and Reason for Choice:
**Use `LogSoftmax` as the Last Layer Activation** when using `NLLLoss`. This choice ensures that:
- The output values represent valid log-probabilities, allowing `NLLLoss` to compute the loss accurately.
- The model can effectively learn from the data, as evidenced by the improved validation loss and accuracy.

In summary, the proper solution is to pair `NLLLoss` with `LogSoftmax` as the activation function in the last layer. This ensures compatibility, leading to accurate loss calculations and better training performance.
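
The runaway negative loss can be reproduced without a network. The sketch below assumes the documented behavior of `NLLLoss` with mean reduction and no class weights: it averages the negated input entry at each target index and does not check that the inputs are log-probabilities. The function names and scores are hypothetical.

```python
import math


def nll_loss(inputs, targets):
    """Mean of -inputs[i][targets[i]]; inputs are *assumed* to be log-probabilities."""
    return -sum(row[t] for row, t in zip(inputs, targets)) / len(targets)


def relu(row):
    return [max(0.0, v) for v in row]


def log_softmax(row):
    """Numerically stable log-softmax of one row of scores."""
    m = max(row)
    log_sum = m + math.log(sum(math.exp(v - m) for v in row))
    return [v - log_sum for v in row]


targets = [0, 1]
scores = [[2.0, -1.0], [0.5, 3.0]]
scaled = [[10 * v for v in row] for row in scores]  # same ranking, 10x larger

# ReLU outputs: the "loss" is negative and drops further as the scores grow.
print(nll_loss([relu(r) for r in scores], targets))  # -2.5
print(nll_loss([relu(r) for r in scaled], targets))  # -25.0

# LogSoftmax outputs: the loss stays non-negative and approaches 0.
print(nll_loss([log_softmax(r) for r in scores], targets))
print(nll_loss([log_softmax(r) for r in scaled], targets))
```

With ReLU there is no lower bound to reach, so minimizing the loss only rewards larger outputs; with log-softmax the minimum is 0 and is reached only when the true class gets probability 1.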


### 4. CrossEntropyLoss (`torch.nn.CrossEntropyLoss`)
- **Description:** Combines `LogSoftmax` and `NLLLoss` in one single class. It computes the cross-entropy loss between the target and the output logits.
- **Use Case:** Widely used for multi-class classification problems.

The mathematical formulation of CrossEntropyLoss is as follows:
\begin{equation}
  \text{CrossEntropy}(y, \hat{y}) = - \sum_{i=1}^{C} y_i \log\left(\frac{e^{\hat{y}_i}}{\sum_{j=1}^{C} e^{\hat{y}_j}}\right)
\end{equation}
  where:
  - $ C $ is the number of classes,
  - $ y_i $ is the $ i $-th entry of the one-hot encoded target vector (PyTorch also accepts a scalar class index instead),
  - $ \hat{y}_i $ is the logit (unnormalized model output) for class $ i $.
  
  In practice, `torch.nn.CrossEntropyLoss` expects raw logits as input and internally applies log-softmax to turn the logits into log-probabilities, followed by the negative log-likelihood computation.

- **Background:** Cross-entropy measures the difference between the true distribution $ y $ and the predicted distribution $ \text{softmax}(\hat{y}) $. Minimizing it minimizes the negative log-probability assigned to the correct class, which penalizes predictions that put little probability on the true class and makes it a standard choice for classification tasks in deep learning.
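
The statement that `CrossEntropyLoss` combines `LogSoftmax` and `NLLLoss` can be checked on small numbers. This dependency-free sketch evaluates the formula above for one sample in two ways: directly from a one-hot target, and as the negated log-softmax entry of the true class. The logits and helper names are illustrative assumptions.

```python
import math


def log_softmax(logits):
    """log(exp(z_i) / sum_j exp(z_j)), computed stably by subtracting the max."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]


def cross_entropy_one_hot(one_hot, logits):
    """-sum_i y_i * log softmax(logits)_i for a single sample."""
    return -sum(y * lp for y, lp in zip(one_hot, log_softmax(logits)))


def cross_entropy_index(class_index, logits):
    """The same quantity when the target is given as a class index."""
    return -log_softmax(logits)[class_index]


logits = [1.5, -0.5]  # raw, unnormalized scores for C = 2 classes
print(cross_entropy_one_hot([1.0, 0.0], logits))
print(cross_entropy_index(0, logits))

# Equal logits mean a 50/50 prediction, so the loss is log(2).
print(cross_entropy_index(1, [0.0, 0.0]), math.log(2))
```

The value log(2) ≈ 0.693 is a useful reference point: an untrained two-class model that predicts roughly 50/50 should start near it, which is about where the two-class training runs in this notebook begin.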

Now, let's train the same MLP with `CrossEntropyLoss`, passing raw logits (no activation on the last layer):


In [16]:
from torch.nn import CrossEntropyLoss

# Define the model with output_dim equal to the number of classes (2 for binary classification)
output_dim_CE = 2

model_CE = SimpleMLP(
    input_dim=input_dim, 
    hidden_dim=hidden_dim, 
    output_dim=output_dim_CE, 
    num_hidden_layers=num_hidden_layers, 
    last_layer_activation_fn=None  # No activation function in the last layer for CrossEntropyLoss
)

criterion_CE = CrossEntropyLoss()
optimizer_CE = torch.optim.Adam(model_CE.parameters(), lr=0.001)
trainer_CE = SimpleMLPTrainer(model_CE, criterion_CE, optimizer_CE)

train_losses_CE = trainer_CE.train(train_loader, num_epochs)


Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 740.28it/s]


Epoch [1/20], Loss: 0.7332


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 1284.74it/s]


Epoch [2/20], Loss: 0.7093


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 1680.34it/s]


Epoch [3/20], Loss: 0.6813


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 1587.28it/s]


Epoch [4/20], Loss: 0.6480


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 1516.04it/s]


Epoch [5/20], Loss: 0.6108


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 1478.57it/s]


Epoch [6/20], Loss: 0.5719


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 1568.62it/s]


Epoch [7/20], Loss: 0.5334


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 1575.36it/s]


Epoch [8/20], Loss: 0.5031


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 1734.26it/s]


Epoch [9/20], Loss: 0.4824


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 462.25it/s]


Epoch [10/20], Loss: 0.4706


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 1868.98it/s]


Epoch [11/20], Loss: 0.4623


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 1345.60it/s]


Epoch [12/20], Loss: 0.4559


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 1614.33it/s]


Epoch [13/20], Loss: 0.4499


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 1202.48it/s]


Epoch [14/20], Loss: 0.4462


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 1848.98it/s]


Epoch [15/20], Loss: 0.4459


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 1746.70it/s]


Epoch [16/20], Loss: 0.4410


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 1264.40it/s]


Epoch [17/20], Loss: 0.4404


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 1673.93it/s]


Epoch [18/20], Loss: 0.4354


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 1564.55it/s]


Epoch [19/20], Loss: 0.4354


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 1427.12it/s]

Epoch [20/20], Loss: 0.4316





In [17]:
val_loss_CE, accuracy_CE = trainer_CE.evaluate(val_loader)

Validation Loss: 0.4533, Accuracy: 74.83%



### 5. KLDivLoss (`torch.nn.KLDivLoss`)
- **Description:** Kullback-Leibler Divergence Loss measures how one probability distribution diverges from a second, reference distribution. Unlike the losses above, which compare predictions against class labels, KL divergence (also called relative entropy) compares two full distributions. It quantifies the information lost when the predicted distribution is used to approximate the true distribution.

- **Mathematical Function:**
\begin{equation}
  \text{KL}(P \parallel Q) = \sum_{i=1}^{C} P(i) \left( \log P(i) - \log Q(i) \right)
\end{equation}
  where:
  - $ P $ is the target (true) probability distribution,
  - $ Q $ is the predicted distribution (in PyTorch, $ \log Q $ is what the model outputs, typically via `log_softmax`),
  - $ C $ is the number of classes.

  KL divergence is always non-negative, and it equals zero if the two distributions are identical. The loss function expects the model's output to be in the form of log-probabilities (using `log_softmax`) and compares this against a target probability distribution, which is typically a normalized distribution (using softmax).

- **Use Case:** KLDivLoss is frequently used in:
  - **Variational Autoencoders (VAEs):** In VAEs, KL divergence is used to measure how much the learned latent space distribution deviates from a prior distribution (often Gaussian).
  - **Knowledge Distillation:** In teacher-student models, KL divergence is used to transfer the "soft" knowledge from a teacher model to a student model by comparing their output probability distributions.
  - **Reinforcement Learning:** It can be used to update policies while minimizing the divergence from a previous policy.

- **Background:** Kullback-Leibler divergence, a core concept in information theory, measures the inefficiency of assuming the predicted distribution $ Q $ when the true distribution is $ P $. It is asymmetric, meaning that in general $ KL(P \parallel Q) \neq KL(Q \parallel P) $, so the direction of the comparison matters.
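
The properties listed above (non-negativity, zero for identical distributions, asymmetry) can be checked numerically. The sketch implements the formula directly for discrete distributions; the function name and the example distributions are assumptions, and terms with $ P(i) = 0 $ are treated as contributing zero, which is the usual convention.

```python
import math


def kl_divergence(p, q):
    """KL(P || Q) = sum_i P(i) * (log P(i) - log Q(i)), skipping terms with P(i) = 0."""
    return sum(pi * (math.log(pi) - math.log(qi)) for pi, qi in zip(p, q) if pi > 0)


p = [0.9, 0.1]
q = [0.6, 0.4]

print(kl_divergence(p, p))  # 0.0
print(kl_divergence(p, q))  # positive
print(kl_divergence(q, p))  # positive, but a different value

# With a one-hot target, KL reduces to -log Q(true class), i.e. the cross-entropy.
one_hot = [0.0, 1.0]
print(kl_divergence(one_hot, q), -math.log(q[1]))
```

The last line is why, with one-hot targets, a `KLDivLoss` run on log-softmax outputs behaves much like a `CrossEntropyLoss` run on the same logits.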

Again, in this part, run your training with ReLU as the last layer's activation. <span style="color:red; font-weight: bold;">Discuss </span> and explain the difference between the results of the two models. Find a proper solution to the problem.


In [18]:
import torch.nn.functional as F
train_url = "https://raw.githubusercontent.com/datasciencedojo/datasets/master/titanic.csv"
response = requests.get(train_url, verify=False)
data = pd.read_csv(StringIO(response.text))

data = data[['Pclass', 'Sex', 'Age', 'Fare', 'Survived']].dropna()
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

X = data[['Pclass', 'Sex', 'Age', 'Fare']].values
y = data['Survived'].values

scaler = StandardScaler()
X = scaler.fit_transform(X)

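X_tensor = torch.tensor(X, dtype=torch.float32)  # rebuild from the freshly scaled X instead of relying on an earlier cell
y_tensor = torch.tensor(y, dtype=torch.long)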
y_tensor = F.one_hot(y_tensor, num_classes=2).float()

dataset = TensorDataset(X_tensor, y_tensor)
train_size = int(0.8 * len(dataset))
val_size = len(dataset) - train_size
train_dataset, val_dataset = random_split(dataset, [train_size, val_size])

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
val_loader = DataLoader(val_dataset, batch_size=32)




In [19]:
# Run with relu activation function
from torch.nn import KLDivLoss

output_dim_KL = 2

# Run with ReLU activation (Incorrect setup for KLDivLoss)
model_relu = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim_KL, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=torch.nn.ReLU)
criterion_relu = KLDivLoss(reduction="batchmean")
optimizer_relu = torch.optim.Adam(model_relu.parameters(), lr=0.001)
trainer_relu = SimpleMLPTrainer(model_relu, criterion_relu, optimizer_relu)

train_losses_relu = trainer_relu.train(train_loader, num_epochs)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 705.78it/s]


Epoch [1/20], Loss: -0.0149


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 1004.80it/s]


Epoch [2/20], Loss: -0.0740


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 1486.11it/s]


Epoch [3/20], Loss: -0.1683


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 1529.06it/s]


Epoch [4/20], Loss: -0.2982


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 1358.31it/s]


Epoch [5/20], Loss: -0.4608


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 2016.33it/s]


Epoch [6/20], Loss: -0.6992


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 1346.29it/s]


Epoch [7/20], Loss: -1.0127


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 1764.70it/s]


Epoch [8/20], Loss: -1.4359


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 1451.29it/s]


Epoch [9/20], Loss: -1.9751


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 687.27it/s]


Epoch [10/20], Loss: -2.6972


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 1958.02it/s]


Epoch [11/20], Loss: -3.5951


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 1556.81it/s]


Epoch [12/20], Loss: -4.6373


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 1362.28it/s]


Epoch [13/20], Loss: -5.9790


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 1937.17it/s]


Epoch [14/20], Loss: -7.5695


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 1598.17it/s]


Epoch [15/20], Loss: -9.4563


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 1232.79it/s]


Epoch [16/20], Loss: -11.6419


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 1757.64it/s]


Epoch [17/20], Loss: -14.2130


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 1661.70it/s]


Epoch [18/20], Loss: -17.0643


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 1669.30it/s]


Epoch [19/20], Loss: -20.5354


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 1403.19it/s]


Epoch [20/20], Loss: -24.1783


In [20]:
print('relu')
val_loss_relu, accuracy_relu = trainer_relu.evaluate(val_loader)

relu
Validation Loss: -31.7334, Accuracy: 44.76%


In [21]:
# Using LogSoftmax (explicit dim=1 avoids the implicit-dimension deprecation warning)
model_logsoftmax = SimpleMLP(input_dim=input_dim, hidden_dim=hidden_dim, output_dim=output_dim_KL, num_hidden_layers=num_hidden_layers, last_layer_activation_fn=lambda: torch.nn.LogSoftmax(dim=1))

criterion_logsoftmax = KLDivLoss(reduction="batchmean")
optimizer_logsoftmax = torch.optim.Adam(model_logsoftmax.parameters(), lr=0.001)
trainer_logsoftmax = SimpleMLPTrainer(model_logsoftmax, criterion_logsoftmax, optimizer_logsoftmax)

train_losses_logsoftmax = trainer_logsoftmax.train(train_loader, num_epochs)

Epoch 1/20: 100%|██████████| 18/18 [00:00<00:00, 712.16it/s]


Epoch [1/20], Loss: 0.7194


Epoch 2/20: 100%|██████████| 18/18 [00:00<00:00, 1039.14it/s]


Epoch [2/20], Loss: 0.6775


Epoch 3/20: 100%|██████████| 18/18 [00:00<00:00, 1837.68it/s]


Epoch [3/20], Loss: 0.6372


Epoch 4/20: 100%|██████████| 18/18 [00:00<00:00, 340.67it/s]


Epoch [4/20], Loss: 0.5995


Epoch 5/20: 100%|██████████| 18/18 [00:00<00:00, 1856.62it/s]


Epoch [5/20], Loss: 0.5653


Epoch 6/20: 100%|██████████| 18/18 [00:00<00:00, 1497.37it/s]


Epoch [6/20], Loss: 0.5319


Epoch 7/20: 100%|██████████| 18/18 [00:00<00:00, 1704.23it/s]


Epoch [7/20], Loss: 0.5069


Epoch 8/20: 100%|██████████| 18/18 [00:00<00:00, 1373.96it/s]


Epoch [8/20], Loss: 0.4902


Epoch 9/20: 100%|██████████| 18/18 [00:00<00:00, 1579.08it/s]


Epoch [9/20], Loss: 0.4802


Epoch 10/20: 100%|██████████| 18/18 [00:00<00:00, 1692.50it/s]


Epoch [10/20], Loss: 0.4742


Epoch 11/20: 100%|██████████| 18/18 [00:00<00:00, 1226.50it/s]


Epoch [11/20], Loss: 0.4668


Epoch 12/20: 100%|██████████| 18/18 [00:00<00:00, 1612.75it/s]


Epoch [12/20], Loss: 0.4643


Epoch 13/20: 100%|██████████| 18/18 [00:00<00:00, 1314.83it/s]


Epoch [13/20], Loss: 0.4581


Epoch 14/20: 100%|██████████| 18/18 [00:00<00:00, 1680.34it/s]


Epoch [14/20], Loss: 0.4538


Epoch 15/20: 100%|██████████| 18/18 [00:00<00:00, 1539.92it/s]


Epoch [15/20], Loss: 0.4501


Epoch 16/20: 100%|██████████| 18/18 [00:00<00:00, 366.66it/s]


Epoch [16/20], Loss: 0.4471


Epoch 17/20: 100%|██████████| 18/18 [00:00<00:00, 1850.75it/s]


Epoch [17/20], Loss: 0.4421


Epoch 18/20: 100%|██████████| 18/18 [00:00<00:00, 1238.92it/s]


Epoch [18/20], Loss: 0.4400


Epoch 19/20: 100%|██████████| 18/18 [00:00<00:00, 1526.43it/s]


Epoch [19/20], Loss: 0.4383


Epoch 20/20: 100%|██████████| 18/18 [00:00<00:00, 1397.97it/s]

Epoch [20/20], Loss: 0.4360





In [22]:
val_loss_logsoftmax, accuracy_logsoftmax = trainer_logsoftmax.evaluate(val_loader)

Validation Loss: 0.4568, Accuracy: 74.83%


Your reason for your choice:


The **log_softmax** activation is the better choice for the final layer with **KLDivLoss** for several reasons:

1. **Compatibility**: KLDivLoss expects its input to be log-probabilities, which log_softmax provides. ReLU outputs are non-negative raw scores that do not form a valid log-probability distribution, so the computed loss no longer measures a KL divergence.

2. **Interpretability**: LogSoftmax outputs log-probabilities, which can be read directly as class log-likelihoods, as a classification task requires. ReLU's raw scores have no such interpretation.

3. **Stability**: With LogSoftmax and valid target distributions, the loss is bounded below by zero and decreases smoothly (from 0.7194 to 0.4360 over 20 epochs). With ReLU the loss is unbounded below: the optimizer can keep lowering it simply by inflating the outputs, which is why the training loss fell to -24.1783 without the predictions improving.

In summary, log_softmax matches what KLDivLoss expects, giving higher validation accuracy (74.83% vs. 44.76%) and a meaningful, non-negative validation loss (0.4568). The ReLU model's validation loss of -31.7334 is numerically lower but is not a valid KL divergence.
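
The effect seen in the training logs above can be reproduced by hand. The following is a minimal sketch in plain Python (no PyTorch needed) of the per-sample quantity that KLDivLoss sums, `target * (log(target) - input)`. The helper names, the target distribution, and the raw scores are hypothetical values chosen for illustration, not taken from the notebook's model.

```python
import math

def log_softmax(scores):
    """Convert raw scores to log-probabilities (numerically stable)."""
    m = max(scores)
    log_sum = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - log_sum for s in scores]

def relu(scores):
    """Clamp negative scores to zero; the result is NOT a log-probability."""
    return [max(0.0, s) for s in scores]

def kl_div(model_output, target_probs):
    """Sum of target * (log(target) - output) over classes.

    This is only a true KL divergence when `model_output`
    holds log-probabilities.
    """
    return sum(
        t * (math.log(t) - o)
        for o, t in zip(model_output, target_probs)
        if t > 0
    )

target = [0.9, 0.1]   # hypothetical target distribution
scores = [2.0, -1.0]  # hypothetical raw outputs of the last linear layer

loss_logsoftmax = kl_div(log_softmax(scores), target)
loss_relu = kl_div(relu(scores), target)
# Scaling the raw scores up makes the ReLU "loss" even more negative.
loss_relu_scaled = kl_div(relu([10 * s for s in scores]), target)

print(loss_logsoftmax, loss_relu, loss_relu_scaled)
```

With log-softmax inputs the value stays non-negative, while with ReLU inputs it is negative and keeps falling as the scores grow, which matches the steadily decreasing negative loss in the ReLU run.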

### 6. CosineEmbeddingLoss (`torch.nn.CosineEmbeddingLoss`)
- **Description:** Measures the cosine similarity between two input tensors, `x1` and `x2`, and computes the loss based on a label `y` that indicates whether the tensors should be similar (`y = 1`) or dissimilar (`y = -1`). Cosine similarity focuses on the angle between vectors, disregarding their magnitude.

- **Mathematical Function:** 
\begin{equation}
  \text{CosineEmbeddingLoss}(x_1, x_2, y) = 
  \begin{cases} 
  1 - \cos(x_1, x_2), & \text{if } y = 1 \\
  \max(0, \cos(x_1, x_2) - \text{margin}), & \text{if } y = -1
  \end{cases}
\end{equation}
  where $\cos(x_1, x_2)$ is the cosine similarity between the two vectors, and `margin` is a threshold on that similarity: a dissimilar pair (`y = -1`) contributes to the loss only when its cosine similarity exceeds `margin`.

- **Use Case:** Commonly used in tasks like face verification, image similarity, and other scenarios where the relative orientation of vectors (angle) is more important than their length, such as in embeddings and metric learning.

- **Background:** Cosine similarity compares the directional alignment of vectors, making it ideal for high-dimensional data where the magnitude may not be as informative. This loss is particularly useful when training models to learn meaningful embeddings that capture semantic similarity.

You'll become more familiar with this loss function in the future.
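
As a rough illustration of the piecewise formula above, here is a minimal plain-Python sketch for a single pair of vectors. The function names and the example vectors are hypothetical, and the sketch omits the small epsilon that library implementations typically add to avoid division by zero.

```python
import math

def cosine_similarity(x1, x2):
    """Cosine of the angle between two vectors (assumes non-zero norms)."""
    dot = sum(a * b for a, b in zip(x1, x2))
    norm1 = math.sqrt(sum(a * a for a in x1))
    norm2 = math.sqrt(sum(b * b for b in x2))
    return dot / (norm1 * norm2)

def cosine_embedding_loss(x1, x2, y, margin=0.0):
    """Loss for one pair: y = 1 means similar, y = -1 means dissimilar."""
    cos = cosine_similarity(x1, x2)
    if y == 1:
        return 1.0 - cos
    return max(0.0, cos - margin)

a = [1.0, 0.0]
b = [2.0, 0.0]  # same direction as a, different length
c = [0.0, 3.0]  # orthogonal to a

similar_aligned = cosine_embedding_loss(a, b, y=1)
similar_orthogonal = cosine_embedding_loss(a, c, y=1)
dissimilar_aligned = cosine_embedding_loss(a, b, y=-1)
dissimilar_with_margin = cosine_embedding_loss(a, b, y=-1, margin=0.5)

print(similar_aligned, similar_orthogonal,
      dissimilar_aligned, dissimilar_with_margin)
```

Because `a` and `b` differ only in length, the loss treats them as identical, which is the sense in which cosine similarity disregards magnitude.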

---

# Regularization in Machine Learning

## Introduction

Regularization is a fundamental technique in machine learning that helps prevent overfitting by adding a penalty to the loss function. This penalty discourages the model from becoming too complex, ensuring better generalization to unseen data. In this notebook, you will explore the concepts of regularization, understand different types of regularization techniques, and apply them using Python's popular libraries.

## What is Regularization?

Regularization involves adding a regularization term to the loss function used to train machine learning models. This term imposes a constraint on the model's coefficients, effectively reducing their magnitude. By doing so, regularization helps in:

- **Preventing Overfitting:** Ensures the model does not become too tailored to the training data.
- **Improving Generalization:** Enhances the model's performance on new, unseen data.
- **Feature Selection:** Especially in L1 regularization, it can drive some coefficients to zero, effectively selecting important features.
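
In symbols, the idea can be sketched as follows. The notation is assumed here for illustration and does not come from the notebook: $w$ denotes the model's coefficients, $\lambda \ge 0$ the regularization strength, and $\Omega$ the penalty term.

\begin{equation}
\mathcal{L}_{\text{reg}}(w) = \mathcal{L}_{\text{data}}(w) + \lambda \, \Omega(w)
\end{equation}

Setting $\lambda = 0$ recovers the unregularized loss, and larger values of $\lambda$ penalize large coefficients more strongly.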

## Types of Regularization

There are several types of regularization techniques, each imposing different constraints on the model's parameters:

### 1. L1 Regularization (Lasso)

L1 regularization adds the sum of the absolute values of the coefficients as a penalty term to the loss function. It can lead to sparse models where some feature coefficients are exactly zero.

### 2. L2 Regularization (Ridge)

L2 regularization adds the sum of the squared coefficients as a penalty term to the loss function. It shrinks all coefficients toward zero, penalizing large ones most heavily, but generally does not set them exactly to zero.

### 3. Elastic Net

Elastic Net combines both L1 and L2 regularization penalties. It balances the benefits of both Lasso and Ridge methods, allowing for feature selection and coefficient shrinkage.
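
The three penalties can be compared on a small weight vector. This is a minimal sketch with hypothetical weights; `l1_ratio` is an assumed mixing parameter for the Elastic Net term, and libraries differ in how they scale each part.

```python
def l1_penalty(weights):
    """Lasso penalty: sum of absolute values."""
    return sum(abs(w) for w in weights)

def l2_penalty(weights):
    """Ridge penalty: sum of squares."""
    return sum(w * w for w in weights)

def elastic_net_penalty(weights, l1_ratio=0.5):
    """Weighted mix of the L1 and L2 penalties (assumed convention)."""
    return (l1_ratio * l1_penalty(weights)
            + (1.0 - l1_ratio) * l2_penalty(weights))

weights = [0.5, -2.0, 0.0, 3.0]  # hypothetical coefficients

l1 = l1_penalty(weights)
l2 = l2_penalty(weights)
en = elastic_net_penalty(weights, l1_ratio=0.5)

print(l1, l2, en)
```

The largest weight dominates the L2 penalty far more than the L1 penalty, which is why L2 mainly discourages large coefficients.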

## Homework Time!
Import the Iris dataset from `sklearn.datasets` and apply ridge regression with different alpha values. Then, create a GIF that shows how the classification boundary changes with respect to the alpha values.

Import the libs that you need and start coding!

In [23]:
!pip install imageio





In [24]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split
from PIL import Image
from io import BytesIO
import imageio
import warnings


# Suppress warnings for cleaner output
warnings.filterwarnings("ignore")

Load the Iris dataset and select Setosa and Versicolor classes

In [25]:
# Load Iris dataset and filter for Setosa and Versicolor classes
iris = load_iris()
X = iris.data
y = iris.target
binary_class_indices = np.where((y == 0) | (y == 1))
X = X[binary_class_indices]
y = y[binary_class_indices]

# Select Sepal Length and Petal Length features
X = X[:, [0, 2]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)


Define Function to Plot Decision Boundary

In [26]:
def plot_decision_boundary(model, X, y, alpha):
    # Define the grid (use meshgrid)
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, 0.01),
                         np.arange(y_min, y_max, 0.01))

    # Predict over the grid
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)

    # Create a figure
    fig, ax = plt.subplots(figsize=(6, 5))

    # Plot the decision boundary
    ax.contourf(xx, yy, Z, alpha=0.3, levels=[-0.1, 0.1, 1.1], colors=['blue', 'red'])

    # Scatter plot of the training data
    scatter = ax.scatter(
        X[:, 0], X[:, 1], c=y, cmap='bwr', edgecolor='k', s=50
    )

    # Title and labels
    ax.set_title(f'MLP Decision Boundary (alpha={alpha:.4f})')
    ax.set_xlabel('Sepal Length (standardized)')
    ax.set_ylabel('Petal Length (standardized)')

    # Remove axis ticks for clarity
    ax.set_xticks([])
    ax.set_yticks([])

    # Tight layout
    plt.tight_layout()

    # Save the figure to an in-memory PNG buffer and return it as a PIL image
    buf = BytesIO()
    fig.savefig(buf, format='png')
    plt.close(fig)
    buf.seek(0)
    return Image.open(buf)

Train MLP with Varying Alpha Values and Collect Images

In [27]:
from sklearn.neural_network import MLPClassifier

def create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons):

    # List to store images
    images = []

    for idx, alpha in enumerate(alpha_values):
        print(f"Processing alpha={alpha:.4f} ({idx + 1}/{len(alpha_values)})")

        # Create and train the MLP
        mlp = MLPClassifier(hidden_layer_sizes=(n_neurons,), alpha=alpha, max_iter=1000, random_state=42)
        mlp.fit(X_train, y_train)

        # Plot decision boundary and get the image
        img = plot_decision_boundary(mlp, X_train, y_train, alpha)
        images.append(img)

    # Save the images as a GIF
    gif_filename = 'mlp_classification_boundaries.gif'
    images[0].save(
        gif_filename,
        save_all=True,
        append_images=images[1:],
        duration=500,
        loop=0
    )

    print(f"GIF saved as '{gif_filename}'")

    # Return the GIF filename
    return gif_filename

## RUN

In [28]:
# Use np.logspace to generate alpha values, with at least 20 values
alpha_values = np.logspace(-2, 2, 20)
# Define the number of neurons in the hidden layer
n_neurons = 10

# Create the decision boundary GIF
gif_dir = create_decision_boundary_gif(alpha_values, X_train, y_train, n_neurons)

Processing alpha=0.0100 (1/20)
Processing alpha=0.0162 (2/20)
Processing alpha=0.0264 (3/20)
Processing alpha=0.0428 (4/20)
Processing alpha=0.0695 (5/20)
Processing alpha=0.1129 (6/20)
Processing alpha=0.1833 (7/20)
Processing alpha=0.2976 (8/20)
Processing alpha=0.4833 (9/20)
Processing alpha=0.7848 (10/20)
Processing alpha=1.2743 (11/20)
Processing alpha=2.0691 (12/20)
Processing alpha=3.3598 (13/20)
Processing alpha=5.4556 (14/20)
Processing alpha=8.8587 (15/20)
Processing alpha=14.3845 (16/20)
Processing alpha=23.3572 (17/20)
Processing alpha=37.9269 (18/20)
Processing alpha=61.5848 (19/20)
Processing alpha=100.0000 (20/20)
GIF saved as 'mlp_classification_boundaries.gif'
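
The alpha values printed above are spaced evenly on a logarithmic scale rather than a linear one. As a rough sketch of what `np.logspace(-2, 2, 20)` computes, the hypothetical helper below reproduces the same grid in plain Python.

```python
def logspace(start_exp, stop_exp, num):
    """Return `num` powers of 10 with evenly spaced exponents."""
    step = (stop_exp - start_exp) / (num - 1)
    return [10 ** (start_exp + i * step) for i in range(num)]

alphas = logspace(-2, 2, 20)

# Each value is a constant multiple of the previous one.
ratios = [b / a for a, b in zip(alphas, alphas[1:])]

print([round(a, 4) for a in alphas[:3]])
print(round(ratios[0], 4))
```

A logarithmic grid covers weak and strong regularization with the same number of samples per decade, which a linear grid over the same range would not.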


<div style="text-align: center;">

### **Multilayer Perceptron Classification Boundaries**

![Classification Boundaries](mlp_classification_boundaries.gif)

*Figure 1: Demonstration of classification boundaries created by a Multilayer Perceptron (MLP) model.*

</div>

Your GIF should look like this:

<div style="text-align: center;">

### **Multilayer Perceptron Classification Boundaries**

![Classification Boundaries](mlp_classification_boundaries_example.gif)

*Figure 2: Example of the expected classification boundaries created by a Multilayer Perceptron (MLP) model.*

</div>

