# In this snippet we will learn about the different components of a CNN

In [None]:
from google.colab import drive
drive.mount('/content/drive')

In [None]:
import torch
import torch.nn as nn
import numpy as np
seed = 7777
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)
device = 'cuda' if torch.cuda.is_available() else 'cpu'

## Linear layer

<div style="text-align: justify">
A Linear layer (also known as a fully connected or dense layer) is the simplest layer in a CNN. It applies an affine transformation to its input: a set of weights (coefficients) connects every neuron in one layer to every neuron in the next layer, and a bias is added to each output. The layer therefore changes the size of the last dimension of the input, from the number of input features to the number of output features. For example, in image classification, if the input is a flattened 28x28 image (784 values) and you want to classify it into 10 categories, you would use a Linear layer with 784 inputs and 10 outputs. Each output is the confidence score for one category. During training, the network adjusts the weights to improve accuracy.
</div>

![Linear layer](images/Linear.png)
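To make the "set of weights" concrete, the sketch below reproduces the output of `nn.Linear` by hand as `x @ W.T + b`. The layer sizes, the seed, and the variable names (`layer`, `manual`, `builtin`, `n_params`) are illustrative assumptions, not part of the original notebook.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# A small layer: 5 input features -> 8 output features (sizes are illustrative).
layer = nn.Linear(5, 8)
x = torch.rand(4, 5)  # batch of 4 samples, 5 features each

# nn.Linear stores its weight with shape (out_features, in_features),
# so the forward pass is y = x W^T + b.
manual = x @ layer.weight.T + layer.bias
builtin = layer(x)

print(layer.weight.shape, layer.bias.shape)
print('Manual result matches nn.Linear:', torch.allclose(manual, builtin, atol=1e-6))

# Parameter count: 5*8 weights + 8 biases
n_params = sum(p.numel() for p in layer.parameters())
print('Number of parameters:', n_params)
```

Every output depends on every input, which is why the parameter count grows as the product of the input and output sizes.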

In [None]:
# Linear layer mapping 5 input features to 8 output features
linear = nn.Linear(5,8).to(device)
# Single sample with 5 features
x = torch.tensor([5.,2,3,1,2]).to(device)
y = linear(x)
x = x.cpu().detach().numpy()
y = y.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n',end='\n\n')

# Batch of 64 samples: only the last dimension changes (5 -> 8)
x = torch.rand(64,5).to(device)
y = linear(x)
print(f'Input Shape: {list(x.shape)}',f'Output Shape: {list(y.shape)}',sep='\n')

### Stacked Linear layer

<div style="text-align: justify">
In a neural network, stacked linear layers transform the input data step by step into the desired output. Each layer takes the output of the preceding layer, applies a linear operation (matrix multiplication plus a bias), and passes the result through a non-linear activation function. Stacking several layers, each with its own set of parameters, lets the network learn increasingly abstract, hierarchical features, with dimensions typically decreasing towards the output layer. The non-linear activations between the layers are essential: without them, a stack of linear layers is equivalent to a single linear layer, so the cells below illustrate only how the shapes change from layer to layer. With them, the network can model nonlinear relationships that a single linear transformation cannot capture, which makes stacked layers useful when the data is nonlinear or needs several levels of abstraction.
</div>
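The role of the activation function described above can be sketched as follows. This is a minimal example that reuses the layer sizes 256, 80, 40, and 20; `nn.ReLU` is an assumed choice of activation (the text does not name one), and the names `mlp` and `hidden` are illustrative.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Linear layers with a ReLU after each hidden layer.
# No activation after the last layer, so the outputs stay unconstrained.
mlp = nn.Sequential(
    nn.Linear(256, 80), nn.ReLU(),
    nn.Linear(80, 40), nn.ReLU(),
    nn.Linear(40, 20),
)

x = torch.rand(64, 256)
y = mlp(x)

# Output of the first Linear + ReLU pair: negative values are clipped to zero.
hidden = mlp[1](mlp[0](x))

print(list(x.shape), '---->', list(y.shape))
print('All hidden activations are non-negative:', bool((hidden >= 0).all()))
```

The shapes are the same as for a plain stack of linear layers; only the values change, because each hidden output is passed through the non-linearity.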

In [None]:
fc1 = nn.Linear(256,80).to(device)
fc2 = nn.Linear(80,40).to(device)
fc3 = nn.Linear(40,20).to(device)
x = torch.rand(64,256).to(device)

print(list(x.shape))
y1 = fc1(x)
print(list(x.shape),'---->',list(y1.shape))
y2 = fc2(y1)
print(list(y1.shape),'---->',list(y2.shape))
y3 = fc3(y2)
print(list(y2.shape),'---->',list(y3.shape))

In [None]:
fc = nn.Sequential(nn.Linear(256,80), nn.Linear(80,40), nn.Linear(40,20)).to(device)
x = torch.rand(64,256).to(device)
y = fc(x)
print(list(x.shape),'---->',list(y.shape))

### Linear layer after addition of different features

<div style="text-align: justify">
Combining distinct features before passing them through a Linear layer lets the model capture relationships across those features. Concatenating or adding the features gives the network a richer representation of the input, which can improve its predictive power. Element-wise addition requires the features to have the same shape, while concatenation increases the number of input features the Linear layer must accept. The Linear layer then applies a linear transformation to the combined representation, adjusting the weight associated with each feature to generate the output. A typical example is adding an image feature to the output of an LSTM to predict the next word in image captioning.
</div>
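The two ways of combining features mentioned above can be sketched side by side. The feature sizes (256 and 128) and the names `img_feat`, `txt_feat`, and `proj` are hypothetical and chosen only for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

img_feat = torch.rand(64, 256)  # hypothetical image features
txt_feat = torch.rand(64, 128)  # hypothetical text features

# Concatenation: feature sizes may differ; the Linear layer sees 256 + 128 inputs.
fused = torch.cat([img_feat, txt_feat], dim=1)
fc_cat = nn.Linear(256 + 128, 256)
out_cat = fc_cat(fused)
print(list(img_feat.shape), 'cat', list(txt_feat.shape), '---->', list(fused.shape))

# Addition: shapes must match, so the text features are first projected to 256.
proj = nn.Linear(128, 256)
summed = img_feat + proj(txt_feat)
fc_add = nn.Linear(256, 256)
out_add = fc_add(summed)
print(list(img_feat.shape), '+', list(proj(txt_feat).shape), '---->', list(summed.shape))
```

Addition keeps the feature size fixed, while concatenation keeps the two sources separate and leaves it to the Linear layer to mix them.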

In [None]:
fc = nn.Linear(256,256).to(device)
x = torch.rand(64,256).to(device)
y = torch.rand(64,256).to(device)
print(list(x.shape),list(y.shape))
z = x+y
print(f'{list(x.shape)} + {list(y.shape)} ----> {list(z.shape)}')
o = fc(z)
print(list(z.shape),list(o.shape))

## Pooling layer

<div style="text-align: justify">
Pooling layers in neural networks reduce the spatial dimensions (width and height) of the input volume, effectively downsampling it. By aggregating information from neighboring pixels or units, pooling layers help in capturing the most important features while reducing computational complexity. This downsampling also aids in making the network more robust to variations in input, providing a form of translation invariance. Typically, pooling is performed with a sliding window over the input, where each operation (like taking the maximum or average value) is applied over a fixed-size region, resulting in a smaller output volume.
</div>

![Pooling layer](images/Pooling.png)

## Components of Pooling layer

<div style="text-align: justify">
Pooling operations have three key components:<br>
<b>1. Window Size/Kernel Size:</b> This refers to the size of the sliding window that moves across the input volume during pooling. It determines the region over which the pooling operation (e.g., max or average) is applied.<br>
<b>2. Padding:</b> Padding is an optional component in pooling layers. It adds extra pixels around the border of the input feature map, which can be used to control the spatial dimensions of the output feature map. A common padding technique is <strong>zero padding</strong>, where all the added pixels are zeros.<br>
<b>3. Stride:</b> Stride refers to the number of pixels the pooling window moves across the input feature map in each step. A larger stride reduces the spatial dimensions of the output feature map.<br>
The formula for obtaining the output size <strong>f</strong> when applying pooling with a window size of <strong>w</strong>, a stride size of <strong>s</strong>, and padding size of <strong>p</strong> to an input of size <strong>i</strong> is expressed as:<br>
$$
  f = \left\lfloor\frac{i + 2 \times p - w}{s}\right\rfloor + 1
$$
</div>
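The output-size formula above can be checked against PyTorch. The sketch below applies it to each spatial dimension of a 4x6 input; the helper name `pool_output_size` and the three (w, s, p) settings are illustrative assumptions.

```python
import torch
import torch.nn as nn

def pool_output_size(i, w, s, p=0):
    """Output size f = floor((i + 2p - w) / s) + 1 for one spatial dimension."""
    return (i + 2 * p - w) // s + 1

x = torch.rand(1, 1, 4, 6)  # (batch, channels, height, width)

# (window w, stride s, padding p) settings to compare
settings = [(2, 2, 0), (3, 1, 0), (3, 2, 1)]
results = []
for w, s, p in settings:
    pool = nn.MaxPool2d(kernel_size=w, stride=s, padding=p)
    actual = tuple(pool(x).shape[-2:])
    predicted = (pool_output_size(4, w, s, p), pool_output_size(6, w, s, p))
    results.append((predicted, actual))
    print(f'w={w}, s={s}, p={p}: formula {predicted}, PyTorch {actual}')
```

For example, with w=2, s=2, p=0 the height becomes floor((4 - 2) / 2) + 1 = 2 and the width becomes floor((6 - 2) / 2) + 1 = 3.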

## Types of Pooling layer

<div style="text-align: justify">
Some common types of pooling are as follows:<br>
<b>Maxpool:</b> Pool the maximum element in the window<br>
</div>

![Maxpool](images/Maxpool.png)

<div style="text-align: justify">
<b>Avgpool:</b> Pool the average of all the elements in the window<br>
</div>

![Avgpool](images/Avgpool.png)

In [None]:
# Define max pooling layer
max_pool = nn.MaxPool2d(kernel_size=2, stride=2).to(device)

# Example input tensor (channels, height, width)
x = torch.tensor([[[1., 2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]]]).to(device)
# Apply max pooling
y = max_pool(x)
x = x.cpu().detach().numpy()
y = y.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')
print('Shape of input tensor is: ',list(x.squeeze().shape))
print('Shape of output tensor is: ',list(y.squeeze().shape))

In [None]:
# Define max pooling layer
max_pool = nn.MaxPool2d(kernel_size=3, stride=1).to(device)

# Example input tensor (channels, height, width)
x = torch.tensor([[[1., 2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]]]).to(device)
# Apply max pooling
y = max_pool(x)
x = x.cpu().detach().numpy()
y = y.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')
print('Shape of input tensor is: ',list(x.squeeze().shape))
print('Shape of output tensor is: ',list(y.squeeze().shape))

In [None]:
# Define average pooling layer
avg_pool = nn.AvgPool2d(kernel_size=2, stride=2).to(device)

# Example input tensor (channels, height, width)
x = torch.tensor([[[1.,2,3,4,5,6],[7,8,9,10,11,12],[13,14,15,16,17,18],[19,20,21,22,23,24]]]).to(device)
# Apply average pooling
y = avg_pool(x)
x = x.cpu().detach().numpy()
y = y.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')
print('Shape of input tensor is: ',list(x.squeeze().shape))
print('Shape of output tensor is: ',list(y.squeeze().shape))

## Convolution layer

<div style="text-align: justify">
In Convolutional Neural Networks (CNNs), the convolution operation is fundamental for feature extraction. It applies a filter (also known as a kernel) over an input image or feature map to compute a weighted sum of the pixel values. The filter slides over the input, multiplies its values element-wise with the overlapping region of the input, and sums the results to produce a single output value per position. This detects local patterns, such as edges and textures, at different locations. By stacking convolutions with multiple filters, CNNs learn hierarchical representations, from simple features like edges in the lower layers to complex patterns like object parts in the deeper layers. Convolution offers two main advantages: parameter sharing and translation equivariance. Parameter sharing means that the same filter is applied across the entire input, which reduces the number of parameters to learn. Translation equivariance means that a pattern is detected wherever it appears in the input (a shifted input gives a correspondingly shifted output), which makes CNNs effective for tasks like image recognition, object detection, and semantic segmentation.
</div>

![Convolution layer](images/Convolution.png)
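Parameter sharing can be made concrete by counting parameters. The sketch below compares a small convolution with a hypothetical fully connected layer that maps the same input to the same number of outputs; the 32x32 input size and the names `conv_params` and `dense_params` are assumptions made for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

conv = nn.Conv2d(in_channels=3, out_channels=8, kernel_size=5)
x = torch.rand(2, 3, 32, 32)  # (batch, channels, height, width)
y = conv(x)                   # spatial size: 32 - 5 + 1 = 28

# Each of the 8 filters has 3*5*5 weights plus one bias, shared over all positions.
conv_params = sum(p.numel() for p in conv.parameters())

# A dense layer with one weight per (input value, output value) pair, plus biases.
n_in = 3 * 32 * 32
n_out = 8 * 28 * 28
dense_params = n_in * n_out + n_out

print(list(x.shape), '---->', list(y.shape))
print('Conv2d parameters:', conv_params)
print('Equivalent dense parameters:', dense_params)
```

The convolution needs 8 * (3 * 5 * 5) + 8 = 608 parameters regardless of the image size, whereas the dense count grows with both the input and the output size.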

In [None]:
# Define convolutional layer
conv = nn.Conv2d(in_channels=1,out_channels=1,kernel_size=(2,2), stride=(1,1)).to(device)
x = torch.tensor([[[[1.,2,2,1],[2,1,1,2],[3,4,4,3],[4,3,3,4]]]]).to(device)
conv.weight = torch.nn.Parameter(torch.tensor([[[[1,1],[1,1]]]],dtype=torch.float).to(device))
conv.bias = torch.nn.Parameter(torch.zeros(1).to(device))
y= conv(x)
x = x.cpu().detach().numpy()
y = y.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')

In [None]:
# Define convolutional layer
conv = nn.Conv2d(in_channels=3,out_channels=8,kernel_size=5)
x = torch.rand(64,3,224,224).to(torch.float)
y = conv(x)

print(list(x.shape),list(y.shape),sep='\n')

## Batch Normalization layer

<div style="text-align: justify">
Batch normalization is a widely used operation in Convolutional Neural Networks (CNNs) that was introduced to address the internal covariate shift problem during training. It normalizes the activations of a layer by subtracting the batch mean and dividing by the batch standard deviation, and then applies a learnable scale and shift. This keeps the input distribution of each layer more stable throughout training, which accelerates convergence and allows higher learning rates without instability. It also gives more stable gradients during backpropagation, mitigating issues like vanishing or exploding gradients. Because the batch statistics introduce some noise during training, batch normalization also has a slight regularization effect, which helps prevent overfitting and reduces the reliance on dropout. The core normalization, before the learnable scale and shift, is given as follows:
</div>
$$
O = \frac{{I - \mu(I)}}{{\sigma(I)}}
$$
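The formula above is the core of the operation. In training mode, `nn.BatchNorm2d` additionally computes the statistics per channel, uses the biased (population) variance, adds a small `eps` inside the square root, and applies a learnable scale (`weight`) and shift (`bias`). The following sketch reproduces that by hand; the input shape and the variable names are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

x = torch.rand(8, 3, 4, 4)  # (batch, channels, height, width)
bn = nn.BatchNorm2d(3)      # a new module is in training mode by default
y = bn(x)

# Per-channel statistics, taken over the batch and both spatial dimensions.
mean = x.mean(dim=(0, 2, 3), keepdim=True)
var = x.var(dim=(0, 2, 3), unbiased=False, keepdim=True)  # biased variance

# Normalize, then apply the learnable scale and shift (initialized to 1 and 0).
manual = (x - mean) / torch.sqrt(var + bn.eps)
manual = manual * bn.weight.view(1, -1, 1, 1) + bn.bias.view(1, -1, 1, 1)

print('Manual result matches nn.BatchNorm2d:', torch.allclose(y, manual, atol=1e-5))
print('Per-channel mean of the output:', y.mean(dim=(0, 2, 3)).detach())
```

In evaluation mode (`bn.eval()`), the layer uses running estimates of the mean and variance instead of the statistics of the current batch.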


In [None]:
torch.set_printoptions(precision=4)
np.set_printoptions(suppress=True,precision=4)

# Define BatchNorm layer
batch = nn.BatchNorm2d(1).to(device)
x = torch.tensor([[[[1.,2,3],[4,5,6],[7,8,9]]]]).to(device)
y = torch.round(batch(x),decimals=4)
x = x.detach().cpu().numpy()
y = y.cpu().detach().numpy()
print(f'Input: {x}',f'Original Output: {y}',sep='\n\n',end='\n\n')
q = np.round((x-x.mean())/x.std(),decimals=4)
print('Calculated Output: ',q,end='\n\n')
print(f'Details: mean, manual std, NumPy std: {x.mean():0.4f}, {(sum([(i.item()-x.mean())**2 for i in x.reshape(-1)])/9)**0.5:0.4f}, {x.std().item():0.4f}')
print(f'Output[2,2] manual calculation: {(9-5)/2.582:0.4f}')

## Dropout layer

<div style="text-align: justify">
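Dropout is a regularization technique in neural networks that randomly deactivates a proportion p of the neurons during training to prevent overfitting. It improves generalization by reducing the dependency on specific neurons, which makes the model more robust and prevents co-adaptation of neurons. In PyTorch, the remaining activations are scaled by 1/(1-p) during training so that their expected value is unchanged, and dropout is disabled in evaluation mode.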
</div>
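The difference between training and evaluation mode can be sketched as follows. The input of 1000 ones and the names `y_train` and `y_eval` are illustrative assumptions; which elements are zeroed depends on the random seed.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(1000)

# Training mode: each element is zeroed with probability p,
# and the survivors are scaled by 1 / (1 - p) = 2.
drop.train()
y_train = drop(x)

# Evaluation mode: dropout is the identity.
drop.eval()
y_eval = drop(x)

print('Distinct values in training mode:', y_train.unique().tolist())
print('Mean in training mode:', round(y_train.mean().item(), 3))
print('Evaluation output equals input:', torch.equal(y_eval, x))
```

The scaling keeps the mean of the training output close to the mean of the input, so no rescaling is needed at inference time.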


In [None]:
torch.set_printoptions(precision=4)
np.set_printoptions(suppress=True,precision=4)

# Define Dropout layer
drop = nn.Dropout(0.5).to(device)
x = torch.tensor([[1.,2],[3,4]]).to(device)
y = drop(x)
x = x.cpu().detach().numpy()
y = y.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n')

# Activation Function

<div style="text-align: justify">
In a Convolutional Neural Network (CNN), an activation function is like a gatekeeper that decides whether a neuron should be activated or not based on the input it receives. Imagine a switch: if the input is strong enough, the switch flips on and lets the information pass through; if not, it stays off and blocks the information.<br>
The purpose of an activation function is to introduce non-linearity into the network. Without it, the CNN would be limited to linear transformations, making it less capable of learning complex patterns in data like images, sounds, or texts. Non-linear activation functions enable the network to learn and represent more intricate relationships between inputs and outputs.<br>
Overall, activation functions play a crucial role in empowering CNNs to understand and interpret complex data, making them essential components in deep learning architectures.
</div>

## Some common Activation Functions

<div style="text-align: justify">
<b>ReLU:</b> ReLU (Rectified Linear Unit) turns on for positive inputs, outputting the same value, and turns off for negative inputs, outputting zero.<br>
$$
f(x) = \begin{cases}
x, & \text{if } x \geq 0 \\
0, & \text{elsewhere}
\end{cases}
$$
</div>

In [None]:
torch.set_printoptions(precision=4)

relu = nn.ReLU().to(device)
x = torch.tensor([-0.25,0,0.25,-2,2]).to(device)
y = torch.round(relu(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')
x = torch.tensor([[-0.25,0,0.25,-2,2],[-3,3,-4,4,-5]]).to(device)
y = torch.round(relu(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n')

<div style="text-align: justify">
<b>LeakyReLU:</b> Sometimes in ReLU, neurons can get stuck in a state where they always output zero for any input, which is called the "dying ReLU" problem. Leaky ReLU solves this by allowing a small, non-zero gradient for negative inputs, ensuring that even if a neuron's output is consistently negative, it can still learn and adjust.<br>
$$
f(x) = \begin{cases}
x, & \text{if } x \geq 0 \\
\alpha x, & \text{elsewhere}
\end{cases}
$$
where $\alpha$ is a very small positive constant.
</div>

In [None]:
torch.set_printoptions(precision=4)
np.set_printoptions(suppress=True,precision=4)

lrelu = nn.LeakyReLU(0.001).to(device)
x = torch.tensor([-0.25,0,0.25,-2,2]).to(device)
y = torch.round(lrelu(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')
x = torch.tensor([[-0.25,0,0.25,-2,2],[-3,3,-4,4,-5]]).to(device)
y = torch.round(lrelu(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n')

<div style="text-align: justify">
<b>ELU:</b> Exponential Linear Unit (ELU) is the identity for positive inputs and saturates smoothly towards $-\alpha$ for large negative inputs. Allowing negative outputs pushes the mean activation closer to zero, which can help the learning dynamics.<br>
$$
f(x) = \begin{cases}
x, & \text{if } x \geq 0 \\
\alpha \left(e^x-1\right), & \text{elsewhere}
\end{cases}
$$
where $\alpha$ is a constant that is usually set to 1.
</div>

In [None]:
torch.set_printoptions(precision=4)
np.set_printoptions(suppress=True,precision=4)

elu = nn.ELU().to(device)
x = torch.tensor([-0.25,0,0.25,-2,2]).to(device)
y = torch.round(elu(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')
x = torch.tensor([[-0.25,0,0.25,-2,2],[-3,3,-4,4,-5]]).to(device)
y = torch.round(elu(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n')

<div style="text-align: justify">
<b>GELU:</b> Gaussian Error Linear Unit (GELU) is a smooth alternative to ReLU that weights the input by the cumulative distribution function of the standard Gaussian. It is reported to perform well in deep learning tasks.<br>
$$
f(x) = 0.5x\left(1+\operatorname{erf}\left(\frac{x}{\sqrt{2}}\right)\right)
$$
where $\operatorname{erf}$ is the Gauss error function.
</div>
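The GELU formula above can be compared with PyTorch's implementation. This is a minimal sketch that assumes the default (exact, erf-based) mode of `nn.GELU` and reuses the input values from the other activation examples.

```python
import math
import torch
import torch.nn as nn

torch.set_printoptions(precision=4)

gelu = nn.GELU()  # default mode uses the exact erf-based formula
x = torch.tensor([-0.25, 0, 0.25, -2, 2])
y = gelu(x)

# Manual evaluation of f(x) = 0.5 * x * (1 + erf(x / sqrt(2)))
manual = 0.5 * x * (1 + torch.erf(x / math.sqrt(2)))

print(f'Input: {x}', f'Output: {y}', sep='\n\n', end='\n\n')
print('Manual formula matches nn.GELU:', torch.allclose(y, manual, atol=1e-6))
```

Unlike ReLU, small negative inputs give small negative outputs instead of exactly zero.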

<div style="text-align: justify">
<b>Linear:</b> The linear activation function is one of the simplest activation functions used in neural networks. It's often used in the output layer of a neural network for tasks where the output is unconstrained, such as regression problems where the target variable can take on any real value.<br>
$$
f(x) = x
$$
In this function, the output $f(x)$ is simply equal to the input x. This means that the output of the neuron is directly proportional to its input.<br>
One of the key properties of the linear activation function is that it preserves the scale of the input. In other words, if you increase the input by a certain amount, the output will also increase by the same amount. This property makes it suitable for regression tasks where the goal is to predict continuous values.<br>
However, the linear activation function is not suitable as the activation between the hidden layers of a deeper network. A composition of linear functions is itself linear, so a network that uses only linear activations is equivalent to a single linear layer, no matter how many layers it has. Such a model cannot capture non-linear relationships in the data, which limits its ability to learn complex patterns.
</div>
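The drawback described above can be verified numerically: two linear layers with the identity activation between them give the same result as one linear layer with merged weights. The layer sizes and the names `W`, `b`, `stacked`, and `single` are illustrative assumptions.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

fc1 = nn.Linear(6, 4)
fc2 = nn.Linear(4, 3)
x = torch.rand(5, 6)

# Two layers with the linear (identity) activation in between
stacked = fc2(fc1(x))

# Equivalent single layer: W = W2 W1 and b = W2 b1 + b2
W = fc2.weight @ fc1.weight
b = fc2.weight @ fc1.bias + fc2.bias
single = x @ W.T + b

# With a non-linear activation in between, the merge no longer applies in general
relu_stacked = fc2(torch.relu(fc1(x)))

print('Linear stack equals a single layer:', torch.allclose(stacked, single, atol=1e-6))
print('ReLU stack equals that single layer:', torch.allclose(relu_stacked, single, atol=1e-6))
```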

<div style="text-align: justify">
<b>Sigmoid:</b> Sigmoid squashes the input to the range $\left(0,1\right)$. It's commonly used in the output layer of a binary classification problem.<br>
$$
f(x) = \frac{1}{1+e^{-x}}
$$
</div>

In [None]:
torch.set_printoptions(precision=4)
np.set_printoptions(suppress=True,precision=4)

sigmoid = nn.Sigmoid().to(device)
x = torch.tensor([-0.25,0,0.25,-2,2]).to(device)
y = torch.round(sigmoid(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')
x = torch.tensor([[-0.25,0,0.25,-2,2],[-3,3,-4,4,-5]]).to(device)
y = torch.round(sigmoid(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n')

<div style="text-align: justify">
<b>Softmax:</b> Softmax is a non-linear activation function commonly used in the output layer of neural networks for multi-class classification tasks. It converts the outputs into a probability for each class, and it can be applied along any dimension.<br>
<i>N.B.</i> Do not add Softmax as the last layer of a PyTorch nn module if the categorical cross-entropy loss (nn.CrossEntropyLoss in PyTorch) is used. This loss function applies a log-Softmax internally before the calculation, so adding Softmax yourself applies it twice and distorts the loss.
$$
f(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N}e^{x_j}}
$$
</div>
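The note about applying Softmax twice can be checked with a small example. The logits and targets below are made-up values, and the names `loss_logits`, `loss_manual`, and `loss_double` are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Raw scores (logits) for 2 samples and 3 classes, with the true class indices
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [0.1, 0.2, 3.0]])
target = torch.tensor([0, 2])

ce = nn.CrossEntropyLoss()

# Correct usage: pass the raw logits
loss_logits = ce(logits, target)

# What the loss does internally: log-softmax followed by negative log-likelihood
loss_manual = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Incorrect usage: Softmax is applied before the loss, so it is applied twice
probs = F.softmax(logits, dim=1)
loss_double = ce(probs, target)

print(f'Loss on logits:         {loss_logits.item():0.4f}')
print(f'Manual log-softmax+NLL: {loss_manual.item():0.4f}')
print(f'Loss on softmax output: {loss_double.item():0.4f}')
```

Here the doubly normalized version reports a larger loss for the same predictions, because the second Softmax flattens the probabilities.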

In [None]:
torch.set_printoptions(precision=4)
np.set_printoptions(suppress=True,precision=4)

softmax = nn.Softmax(dim=0).to(device)
x = torch.tensor([-0.25,0,0.25,-2,2]).to(device)
y = torch.round(softmax(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output: {y}',sep='\n\n',end='\n\n\n\n')
x = torch.tensor([[-0.25,0,0.25,-2,2],[-3,3,-4,4,-5]]).to(device)
y = torch.round(softmax(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output (dim=0): {y}',sep='\n\n',end='\n\n\n\n')
softmax = nn.Softmax(dim=1).to(device)
x = torch.tensor([[-0.25,0,0.25,-2,2],[-3,3,-4,4,-5]]).to(device)
y = torch.round(softmax(x),decimals=4).cpu().detach().numpy()
x = x.cpu().detach().numpy()
print(f'Input: {x}',f'Output (dim=1): {y}',sep='\n\n')