# Residual Networks !!!

> Why do we need residual connections?

- As the depth of a CNN model increases, its performance keeps improving (up to a certain number of layers).

- Ex. VGG (16 to 19 layers) beats AlexNet (8 layers).

- But when even more layers are added to a VGG-style network, its performance starts to degrade.

- Residual connections make larger models easier to train, so depth can be increased without losing performance.

- But residual connections can cause the magnitude of the activations to grow exponentially with depth at initialization.

- To control and re-center the magnitude of the activations, batch-norm is used in each residual block.

- **Additional note:** ResNets provide a direct path for the gradient, mitigating the problem of vanishing gradients and allowing the network to train effectively even as it becomes deeper.

> Why does a deeper model start to lose performance?

- A model with no skip or residual connections is known as a sequential model.

- Ex. with layers l1, l2, l3, and l4, the forward pass of a sequential model is y = l4(l3(l2(l1(x)))).

- In a deep neural network, even a small change in the input can cause a large change in the gradient. This is known as the shattered gradient phenomenon.

- The shattered gradient phenomenon causes the gradient to become increasingly complex and less smooth as it propagates through many layers.

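- Due to this, the gradient descent algorithm (GDA) doesn't perform well: it relies on the loss surface being smooth enough that the gradient computed before an update is still a good guide after it.

- With shattered gradients, the gradient before and after a step can be almost unrelated. By the chain rule, the gradient of an early layer is a product over all later layers, so it changes quickly and unpredictably in a deep network.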

- Check out the backprop file to understand this more.
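
A minimal sketch of the sequential forward pass described above. The toy layers (`make_layer`, an affine map followed by ReLU on a single number) and their weights are assumptions chosen only for illustration; they are not taken from any real model.

```python
from functools import reduce


def make_layer(scale, shift):
    """Hypothetical toy layer: affine map followed by ReLU on one number."""
    def layer(x):
        return max(0.0, scale * x + shift)
    return layer


def sequential_forward(layers, x):
    """Apply the layers in order, i.e. y = l4(l3(l2(l1(x))))."""
    return reduce(lambda h, layer: layer(h), layers, x)


# Illustrative weights only.
l1, l2, l3, l4 = [make_layer(s, b) for s, b in
                  [(2.0, 1.0), (0.5, 0.0), (1.0, -1.0), (3.0, 0.5)]]

y = sequential_forward([l1, l2, l3, l4], 1.0)
assert y == l4(l3(l2(l1(1.0))))  # same as the nested form
```

Every layer sits on the only path from x to y, so the output (and its gradient) depends on all of them at once.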

> Residual connections

- A residual block adds its input back to the output of a linear transformation layer and an activation function.

- In practice, the skip connection usually spans several layers of the network rather than just one.

- If a res-block starts with ReLU, its residual branch does nothing when the input is negative, because ReLU clips negative values to zero.

- Hence, the network starts with a linear transformation and the res-blocks come after it, so that negative input values are not simply ignored.

- We cannot keep increasing the depth arbitrarily even though residual connections help: they roughly double the depth that can be trained before performance degrades, but each res-block also adds to the variance of the activations during the forward pass, which in turn can cause exploding gradients in the backward pass.

- The main purpose of residual connections is to allow the network to learn identity functions easily, effectively skipping layers that are not beneficial.
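
The points above can be sketched in plain Python. This is only an illustration under assumptions: the block order (ReLU first, then a linear layer), the helper names (`relu`, `linear`, `residual_block`), and all weight values are made up for the example.

```python
def relu(v):
    return [max(0.0, a) for a in v]


def linear(W, b, v):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * a for w, a in zip(row, v)) + bi for row, bi in zip(W, b)]


def residual_block(W, b, x):
    """Residual block that starts with ReLU: y = x + linear(relu(x))."""
    fx = linear(W, b, relu(x))
    return [xi + fi for xi, fi in zip(x, fx)]


# Illustrative weights (assumptions, not from any trained model).
W0, b0 = [[1.0, -1.0], [0.5, 2.0]], [0.0, 0.0]   # initial linear layer
W1, b1 = [[0.1, 0.2], [-0.3, 0.4]], [0.0, 0.0]   # residual branch
zero_W, zero_b = [[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0]

x = [-1.0, -2.0]                     # all-negative input

# Block applied directly to x: ReLU clips everything, branch adds nothing.
unchanged = residual_block(W1, b1, x)

# Linear layer first, then the block: the branch now contributes.
h = linear(W0, b0, x)
y = residual_block(W1, b1, h)

# With a zero residual branch the block is exactly the identity.
identity = residual_block(zero_W, zero_b, h)
```

Here `unchanged` equals `x`, which is the "lazy for negative values" case, and `identity` equals `h`, which is the skipped-layer case.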

> How do we solve the variance problem that causes exploding gradients?

- To handle the exponentially growing variance, we need to normalize the inputs.

- This can be solved with the help of batch-norm.
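
A rough sketch of where the exponential growth comes from. It rests on two assumptions that are not stated in these notes: the residual branch \( f_k \) is initialized so that it preserves the variance of its input (for example He initialization), and its output is roughly uncorrelated with its input. The symbols \( x_k \), \( f_k \), and \( K \) (number of blocks) are introduced here only for illustration.

\[
\mathrm{Var}[x_{k+1}] = \mathrm{Var}[x_k + f_k(x_k)] \approx \mathrm{Var}[x_k] + \mathrm{Var}[f_k(x_k)] \approx 2\,\mathrm{Var}[x_k]
\quad\Rightarrow\quad
\mathrm{Var}[x_K] \approx 2^{K}\,\mathrm{Var}[x_0]
\]

Under these assumptions the variance doubles at every block. Normalizing the input of each branch is meant to stop this doubling, which is what batch-norm is used for.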

> What is batch-norm?

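- To use batch-norm, we must train on mini-batches of data.

- As the name suggests, it uses the mini-batch to calculate the mean and standard deviation (std) of the activations.

- During training, each activation is normalized with its batch mean and std. These are statistics computed from the data, not learnable parameters; running averages of them are kept for use at inference time.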

- Batch-norm is very effective because it does more than normalize: it then scales and shifts the result using gamma and beta (learnable parameters), so the model can undo the normalization if it doesn't need it.

- **Additional note:** Batch normalization helps stabilize and accelerate the training process by normalizing the inputs of each layer. This is especially beneficial when combined with residual networks, further enhancing their ability to train deep architectures efficiently.

- For more on batch-norm, check the link.
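
A minimal sketch of the training-time batch-norm computation for a mini-batch of single numbers. The function name `batch_norm`, the `eps` value, and the example batch are assumptions for illustration; the running statistics used at inference time are left out.

```python
import math


def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a mini-batch, then scale by gamma and shift by beta."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((v - mean) ** 2 for v in batch) / n   # batch variance
    normed = [(v - mean) / math.sqrt(var + eps) for v in batch]
    return [gamma * z + beta for z in normed], mean, var


batch = [2.0, 4.0, 6.0, 8.0]

# Default gamma=1, beta=0: output has mean ~0 and variance ~1.
out, mean, var = batch_norm(batch)

# gamma and beta chosen to undo the normalization: the batch comes back.
restored, _, _ = batch_norm(batch, gamma=math.sqrt(var + 1e-5), beta=mean)
```

The second call shows why gamma and beta matter: with suitable values the layer can reproduce its original input, so normalization never removes anything the model needs.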

> **Formula for ResNet:**

- A basic residual block can be represented as:

  \( y = F(x, \{W_i\}) + x \)

  where \( x \) is the input, \( F(x, \{W_i\}) \) represents the residual function (e.g., a stack of convolutional layers), and \( y \) is the output.

- The key idea is that instead of expecting each layer to directly learn a desired underlying mapping, it learns the residual mapping which is easier to optimize.

- This formulation allows the network to easily learn identity mappings by pushing the residual to zero, which helps in training very deep networks.
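
Differentiating the block formula shows the direct gradient path. Here \( L \) denotes the training loss and \( I \) the identity matrix; both symbols are introduced only for this sketch.

\[
\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(I + \frac{\partial F(x, \{W_i\})}{\partial x}\right)
\]

The \( I \) term passes \( \partial L / \partial y \) back to \( x \) unchanged, so some gradient reaches earlier layers even when \( \partial F / \partial x \) is very small.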

> **Internal Covariate Shift:**

- Internal covariate shift refers to the change in the distribution of network activations due to the updating of network parameters during training.

- Batch normalization helps to reduce this internal covariate shift by normalizing the output of each layer to have a mean of zero and a variance of one.

- By normalizing the inputs of each layer, batch normalization ensures that the distribution of the activations remains consistent during training, which helps in stabilizing and speeding up the training process.

- Example (high level): if we train a rose / not-rose image classifier only on images of red roses and then test it on roses of other colours, the model might not recognize them, because the input distribution has changed. Internal covariate shift is the same effect seen by a hidden layer, whose inputs change as the earlier layers are updated.

> Additional points from the paper:

- **Residual Block Configuration:** Residual blocks can be constructed with multiple layers, not just one. Typically, each residual block consists of two or three convolutional layers, and batch normalization is applied before each nonlinearity.

- **Training Efficiency:** Residual networks, thanks to their architecture, are able to train much deeper networks without the problems of vanishing or exploding gradients. This results in more efficient training and better performance on deeper models.

- **Performance:** Results show that residual networks achieve lower error rates on benchmarks such as ImageNet, which made them the state-of-the-art (SOTA) models at the time.

> Uses

- Almost everywhere (e.g. transformers, Stable Diffusion).