<a href="https://colab.research.google.com/github/sonalshreya25/DeepLearning/blob/main/Assignment2.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **Exploring Weight Initialization Methods and Cost Functions**



Neural networks are a class of machine learning models inspired by the structure and function of the human brain. They consist of interconnected layers of nodes, called neurons, that process input data to produce an output. The primary components of a neural network include:

**Input Layer**: Receives the input data.

**Hidden Layers**: Intermediate layers that transform the input into something the output layer can use.

**Output Layer**: Produces the final output.

The performance of a neural network depends on several key factors, including:

1.   Weight Initialization
2.   Cost Function Selection


### **Weight Initialization**
Weight initialization is the process of setting the initial values of the network's weights before training begins. It plays a crucial role in stable and efficient training by helping to prevent vanishing or exploding gradients, so a suitable scheme helps the network train effectively and converge quickly.

Common weight initialization techniques include:

1.  **Zero Initialization**: Setting all weights to zero. Every neuron in a layer then receives the same update, so the neurons cannot learn different features.
2.  **Random Initialization**: Assigning small random values to the weights.
3.  **Kaiming (He) Initialization**: Designed for layers with ReLU activation functions. It scales the weights based on the number of input units to maintain the variance of activations throughout the network.
4.  **Xavier/Glorot Initialization**: Scales the weights to maintain the variance of activations across layers. It is suitable for layers with sigmoid or tanh activation functions.

In this experiment, we compare He and Glorot initialization to investigate how the weight initialization strategy affects the training performance of a neural network.
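
To illustrate why the scale of the initial weights matters, the sketch below pushes random inputs through a stack of ReLU layers and compares a fixed small standard deviation against He scaling. The depth of 10, the width of 256, and the helper name `final_activation_std` are assumptions chosen for illustration; this is not the network used in the experiment.

```python
import torch

torch.manual_seed(0)  # arbitrary seed so the comparison is repeatable

def final_activation_std(weight_std_fn, depth: int = 10, width: int = 256) -> float:
    """Push random inputs through `depth` ReLU layers and return the output std."""
    x = torch.randn(1024, width)
    for _ in range(depth):
        # Draw a fresh weight matrix with the requested standard deviation
        w = torch.randn(width, width) * weight_std_fn(width)
        x = torch.relu(x @ w)
    return x.std().item()

small_random = final_activation_std(lambda n_in: 0.01)              # fixed small std
he_scaled = final_activation_std(lambda n_in: (2.0 / n_in) ** 0.5)  # He: std = sqrt(2 / n_in)

print(f"small random init: {small_random:.3e}")
print(f"He init:           {he_scaled:.3e}")
```

With the fixed small standard deviation the signal shrinks by a constant factor at every layer and is close to zero after 10 layers, whereas He scaling keeps it at roughly the same order of magnitude as the input.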

### **Cost Function Selection**
The cost function (or loss function) measures how well the neural network's predictions match the actual data. It guides the optimization process by providing a metric to minimize during training. The choice of cost function determines how well a model learns from the data and influences its convergence behavior.

Common cost functions include:

1.  **Cross-Entropy Loss**: Used for classification tasks, it measures the difference between the predicted probability distribution and the actual distribution.
2.  **Mean Squared Error (MSE)**: Used for regression tasks, it calculates the average squared difference between predicted and actual values.
3.  **Kullback-Leibler (KL) Divergence Loss**: Measures how one probability distribution differs from a second, reference distribution.
4.  **Huber Loss**: Combines the properties of MSE and mean absolute error (MAE). It is quadratic for small errors and linear for large ones, making it less sensitive to outliers than MSE while remaining differentiable at zero, unlike MAE.
5.  **Log-Cosh Loss**: Used in regression tasks, it combines the benefits of MSE and MAE. It is defined as the logarithm of the hyperbolic cosine of the prediction error, which makes it smooth everywhere and less sensitive to outliers than MSE.

In this experiment, we use the Huber and Log-Cosh loss functions and analyze convergence behavior, training stability, and final performance metrics. We thereby seek insight into the interplay between initialization methods and cost function choices in deep learning.

### **Experiment Setup**
To evaluate different weight initialization strategies and cost functions, we implemented a simple feedforward neural network for classifying images from the MNIST dataset. The network architecture includes:

Input layer: 784 neurons (flattened 28x28 images)

Hidden layer: 128 neurons with ReLU activation

Output layer: 10 neurons, one per digit class (the network defined below returns raw scores; no softmax is applied in `forward`)
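
As a quick sanity check on this architecture, the number of trainable parameters follows directly from the layer sizes. The helper name `linear_params` is introduced here only for illustration; it assumes each fully connected layer has one weight per input-output pair plus one bias per output.

```python
def linear_params(n_in: int, n_out: int) -> int:
    """Parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out

hidden = linear_params(28 * 28, 128)  # 784 * 128 weights + 128 biases
output = linear_params(128, 10)       # 128 * 10 weights + 10 biases
total = hidden + output

print(hidden, output, total)  # 100480 1290 101770
```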

We load the MNIST training set and create a `DataLoader` with a batch size of 64.


In [1]:
import torch
import torch.nn as nn
import torch.optim as optim
import torchvision.transforms as transforms
import torchvision.datasets as datasets
transform = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5,), (0.5,))])
trainset = datasets.MNIST(root='./data', train=True, download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=64, shuffle=True)

### **Defining the Weight Initialization Function**

*    He initialization is a weight initialization technique designed for ReLU activation functions. It draws the weights from a normal distribution with a mean of 0 and a variance of $\frac{2}{n_{\text{in}}}$, where $n_{\text{in}}$ is the number of input units to the neuron.
$$
W \sim \mathcal{N}\left( 0, \frac{2}{n_{\text{in}}} \right)
$$

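* Xavier initialization (also known as Glorot initialization) is designed for activation functions like sigmoid or tanh. The weights are drawn from a distribution with a mean of 0 and a variance of $\frac{2}{n_{\text{in}} + n_{\text{out}}}$, where $n_{\text{in}}$ and $n_{\text{out}}$ are the numbers of input and output units of the layer. A common simplification that considers only the forward pass uses $\frac{1}{n_{\text{in}}}$ instead.

$$
W \sim \mathcal{N}\left( 0, \frac{2}{n_{\text{in}} + n_{\text{out}}} \right)
$$

The code below uses the uniform variants of both schemes (`nn.init.xavier_uniform_` and `nn.init.kaiming_uniform_`), which sample from a zero-mean uniform distribution with the same variance as the corresponding normal version.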
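
As a quick check of the variance formulas, the sketch below initializes one layer with the two PyTorch routines used in this notebook and compares the empirical weight variance with the theoretical value. Note that PyTorch's `xavier_uniform_` takes both the input and output size into account, giving a variance of $\frac{2}{n_{\text{in}} + n_{\text{out}}}$. The layer size (784 to 128) mirrors the hidden layer, and the seed is an arbitrary choice.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)  # fixed seed so the check is repeatable

fan_in, fan_out = 784, 128
layer = nn.Linear(fan_in, fan_out)

# He (Kaiming) uniform with ReLU gain: expected variance 2 / fan_in
nn.init.kaiming_uniform_(layer.weight, nonlinearity='relu')
he_var = layer.weight.var().item()
he_expected = 2.0 / fan_in

# Xavier (Glorot) uniform: expected variance 2 / (fan_in + fan_out)
nn.init.xavier_uniform_(layer.weight)
xavier_var = layer.weight.var().item()
xavier_expected = 2.0 / (fan_in + fan_out)

print(f"He:     empirical {he_var:.6f}, expected {he_expected:.6f}")
print(f"Xavier: empirical {xavier_var:.6f}, expected {xavier_expected:.6f}")
```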


In [3]:
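def initialize_weights(model, method="xavier"):
    """Re-initialize the weights of every nn.Linear layer in `model` in place."""
    for m in model.modules():
        if isinstance(m, nn.Linear):
            if method == "xavier":
                # Uniform Glorot init: variance 2 / (fan_in + fan_out)
                nn.init.xavier_uniform_(m.weight)
            elif method == "he":
                # Uniform He init: variance 2 / fan_in, suited to ReLU
                nn.init.kaiming_uniform_(m.weight, nonlinearity='relu')
            else:
                raise ValueError(f"Unknown initialization method: {method!r}")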


In [5]:
# Defining the neural network
class NeuralNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Hidden layer (784 inputs to 128 neurons)
        self.fc1 = nn.Linear(28*28, 128)
        # Activation function
        self.relu = nn.ReLU()
        # Output layer (128 inputs to 10 class scores)
        self.fc2 = nn.Linear(128, 10)

    def forward(self, x):
        x = x.view(-1, 28*28)  # Flatten input image
        x = self.fc1(x)
        x = self.relu(x)
        x = self.fc2(x)
        return x
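
A quick shape check can confirm that the network maps a batch of images to ten scores per image. The sketch below uses an `nn.Sequential` stand-in with the same layers as `NeuralNet` so that it runs on its own, and random tensors in place of MNIST images; both are assumptions made for illustration. It also shows how a softmax could be applied to the raw scores if class probabilities are needed.

```python
import torch
import torch.nn as nn

# Stand-in with the same layers as NeuralNet (flatten, 784 -> 128, ReLU, 128 -> 10)
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 128),
    nn.ReLU(),
    nn.Linear(128, 10),
)

batch = torch.randn(4, 1, 28, 28)     # random stand-in for four MNIST images
scores = model(batch)                 # raw scores, one row per image
probs = torch.softmax(scores, dim=1)  # optional: convert scores to probabilities

print(scores.shape)      # torch.Size([4, 10])
print(probs.sum(dim=1))  # each row sums to 1, up to rounding
```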

The cost functions used in this experiment are:


*   **Huber Loss**: Huber loss combines MSE and mean absolute error (MAE). It is quadratic for small errors and linear for large ones, which makes it less sensitive to outliers than MSE. The formula for Huber loss is:

$$
L_{\delta}(y, \hat{y}) =
\begin{cases}
\frac{1}{2}(y - \hat{y})^2 & \text{if } |y - \hat{y}| \leq \delta \\
\delta |y - \hat{y}| - \frac{1}{2} \delta^2 & \text{otherwise}
\end{cases}
$$
Where:

- $y$ is the true value.
- $\hat{y}$ is the predicted value.
- $\delta$ is a hyperparameter that controls the threshold between quadratic and linear loss.

**Advantages**:

*   Robust to outliers: Less sensitive to large errors than MSE.
*   Smooth gradient: The gradient is continuous and bounded in magnitude by $\delta$, which balances the behavior of MSE and absolute error and can aid stable convergence.
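
The gradient behavior can be made concrete by differentiating the piecewise definition of the Huber loss with respect to the prediction $\hat{y}$ (a short derivation from the formula above, added for illustration):

$$
\frac{\partial L_{\delta}}{\partial \hat{y}} =
\begin{cases}
-(y - \hat{y}) & \text{if } |y - \hat{y}| \leq \delta \\
-\delta \, \text{sign}(y - \hat{y}) & \text{otherwise}
\end{cases}
$$

The two branches agree at $|y - \hat{y}| = \delta$, so the gradient is continuous, and its magnitude never exceeds $\delta$, unlike the MSE gradient, which grows linearly with the error.
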

*   **Log-Cosh Loss**: Log-Cosh loss behaves like MSE for small errors and like absolute error for large ones, so it is less sensitive to large errors than MSE while remaining smooth everywhere. The formula for Log-Cosh loss is:

$$
L_{\text{log-cosh}}(y, \hat{y}) = \frac{1}{n} \sum_{i=1}^{n} \log\left( \cosh(y_i - \hat{y}_i) \right)
$$
Where:

- $\cosh(x) = \frac{e^x + e^{-x}}{2}$ is the hyperbolic cosine function.
- $y_i$ and $\hat{y}_i$ are the true and predicted values for the $i$-th data point.

**Advantages**:


*   Less sensitive to large errors: Penalizes large errors less than MSE
*   Smoother convergence: Leads to more stable training with a smoother gradient.
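
For reference, the derivative of the per-sample Log-Cosh term with respect to the prediction, together with its limiting behavior, can be written as follows (standard identities, added for illustration):

$$
\frac{\partial}{\partial \hat{y}_i} \log\left( \cosh(y_i - \hat{y}_i) \right) = -\tanh(y_i - \hat{y}_i), \qquad
\log(\cosh(x)) \approx
\begin{cases}
\frac{1}{2} x^2 & \text{for small } |x| \\
|x| - \log 2 & \text{for large } |x|
\end{cases}
$$

Because $|\tanh(x)| < 1$, the contribution of any single sample to the gradient is bounded, and the gradient changes smoothly everywhere.
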

Huber loss and Log-Cosh loss offer more robust alternatives to Mean Squared Error, especially when handling noisy data or outliers.
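
The difference in sensitivity to large errors can be seen by evaluating the per-sample losses on a few error values. This is a minimal sketch in plain Python; the function names and the chosen error values are illustrative, and $\delta = 1$ is assumed for the Huber loss.

```python
import math

def squared_error(e: float) -> float:
    """Per-sample squared error, the term averaged by MSE."""
    return e ** 2

def huber(e: float, delta: float = 1.0) -> float:
    """Per-sample Huber loss: quadratic up to delta, linear beyond it."""
    if abs(e) <= delta:
        return 0.5 * e ** 2
    return delta * abs(e) - 0.5 * delta ** 2

def log_cosh(e: float) -> float:
    """Per-sample log-cosh loss."""
    return math.log(math.cosh(e))

for e in (0.1, 1.0, 10.0):
    print(f"error={e:5.1f}  squared={squared_error(e):8.3f}  "
          f"huber={huber(e):6.3f}  log_cosh={log_cosh(e):6.3f}")
```

For an error of 10, the squared error is 100, while the Huber loss is 9.5 and the Log-Cosh loss is roughly 9.31, so a single outlier dominates the average far less.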



In [6]:
# Implementing the cost functions
def huber_loss(output, target, delta=1.0):
    error = target - output
    # Quadratic for |error| <= delta, linear beyond it
    loss = torch.where(torch.abs(error) <= delta, 0.5 * error**2, delta * (torch.abs(error) - 0.5 * delta))
    return loss.mean()

def log_cosh_loss(output, target):
    # Mean of log(cosh(error)) over all elements
    return torch.mean(torch.log(torch.cosh(output - target)))
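
One caveat with computing `torch.log(torch.cosh(x))` directly is that `cosh` overflows in float32 for large `|x|`. The errors in this experiment stay small, so the version above is adequate here, but a numerically stable variant can be sketched using the identity $\log(\cosh(x)) = |x| + \log(1 + e^{-2|x|}) - \log 2$. The function name `log_cosh_loss_stable` and the test values are assumptions for illustration, not part of the original notebook.

```python
import math
import torch
import torch.nn.functional as F

def log_cosh_loss_stable(output: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    """Mean log-cosh loss computed without evaluating cosh directly."""
    abs_error = torch.abs(output - target)
    # softplus(z) = log(1 + exp(z)), so this is |x| + log(1 + exp(-2|x|)) - log(2)
    return torch.mean(abs_error + F.softplus(-2.0 * abs_error) - math.log(2.0))

# On moderate errors the two formulations agree
small = torch.tensor([0.0, 0.5, -2.0])
zeros = torch.zeros(3)
naive = torch.mean(torch.log(torch.cosh(small - zeros)))
stable = log_cosh_loss_stable(small, zeros)
print(naive.item(), stable.item())

# On a large error the direct formula overflows, the stable one does not
large = torch.tensor([100.0])
naive_large = torch.log(torch.cosh(large))
stable_large = log_cosh_loss_stable(large, torch.zeros(1))
print(naive_large.item())   # inf: cosh(100) exceeds the float32 range
print(stable_large.item())  # finite, close to 100 - log(2)
```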

In [12]:
# Training loop
def train_model(init_method, loss_fn):
    """Train a fresh NeuralNet for 5 epochs and print the loss of the last mini-batch."""
    model = NeuralNet()
    initialize_weights(model, method=init_method)
    optimizer = optim.Adam(model.parameters(), lr=0.001)
    # Any value other than "huber" selects the Log-Cosh loss
    loss_function = huber_loss if loss_fn == "huber" else log_cosh_loss

    for epoch in range(5):
        for images, labels in trainloader:
            optimizer.zero_grad()
            outputs = model(images)
            # Both losses are regression losses, so the labels are one-hot encoded
            labels_one_hot = torch.nn.functional.one_hot(labels, num_classes=10).float()
            loss = loss_function(outputs, labels_one_hot)
            loss.backward()
            optimizer.step()
    # `loss` holds the value from the final mini-batch only
    print(f"Final loss with {init_method} and {loss_fn}: {loss.item()}")
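
The final mini-batch loss is only one view of performance. Classification accuracy could be measured with a helper along the following lines. This is a sketch under assumptions: the name `accuracy` is hypothetical, `train_model` would need to be changed to return `model` (the original does not), and the loader would typically wrap the MNIST test split (`train=False`). The smoke test below uses one-hot inputs with an identity model so that it runs without downloading data.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset

@torch.no_grad()
def accuracy(model: nn.Module, loader) -> float:
    """Fraction of samples whose highest-scoring output matches the label."""
    model.eval()
    correct, total = 0, 0
    for images, labels in loader:
        predictions = model(images).argmax(dim=1)
        correct += (predictions == labels).sum().item()
        total += labels.size(0)
    return correct / total

# Smoke test: one-hot "inputs" passed through an identity model are always correct
labels = torch.arange(10)
one_hot_inputs = torch.nn.functional.one_hot(labels, num_classes=10).float()
check_loader = DataLoader(TensorDataset(one_hot_inputs, labels), batch_size=5)
check_accuracy = accuracy(nn.Identity(), check_loader)
print(check_accuracy)  # 1.0
```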



In [13]:
train_model("he", "huber")

Final loss with he and huber: 0.005685228854417801


In [14]:
train_model("xavier", "huber")

Final loss with xavier and huber: 0.0058444500900805


In [15]:
train_model("xavier", "log_cosh")

Final loss with xavier and log_cosh: 0.008070656098425388


In [16]:
train_model("he", "log_cosh")

Final loss with he and log_cosh: 0.005941941402852535


## **Conclusion**
In this experiment, we tested a neural network using two weight initialization methods (He and Xavier) along with two cost functions (Huber loss and Log-Cosh loss).

He initialization with Huber loss gave the lowest final loss (0.0057), while Xavier initialization with Log-Cosh loss gave the highest (0.0081). The other two combinations, Xavier with Huber (0.0058) and He with Log-Cosh (0.0059), were close to the best result. These values are the loss of the last mini-batch from a single run of each configuration, so small differences should be interpreted with caution. The result is consistent with He initialization being well suited to ReLU activations and with Huber loss handling errors without being too sensitive to outliers.

In conclusion, He initialization with Huber loss was the best combination for this neural network in our runs, as it led to the lowest final loss.

Kaiming (He) initialization likely performed well because it is specifically designed for ReLU activation functions. ReLU zeroes out negative pre-activations, which roughly halves the variance of the signal at each layer. With initialization schemes that do not account for this, activations and gradients can vanish or explode in deep networks. He initialization compensates by scaling the weight variance to $\frac{2}{n_{\text{in}}}$, which helps maintain stable gradients during backpropagation and allows the network to learn effectively.

Huber loss may also have contributed to the better performance because it combines the strengths of mean squared error (MSE) and absolute error, and it is less sensitive to outliers than MSE. This makes it effective in situations where one wants to balance error sensitivity and robustness, which is consistent with the lower final loss observed here.