
In [None]:
# AlexNet was a breakthrough in deep learning: it won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) 2012 and set the template for modern CNN architectures.

# Architecture: 8 layers (5 convolutional + 3 fully connected).

# Input Size: 227 × 227 × 3 (RGB images).

# Convolutional Layers:
#### Conv1: 96 filters, 11×11 kernel, stride 4, ReLU.
#### Conv2: 256 filters, 5×5 kernel, stride 1, ReLU.
#### Conv3: 384 filters, 3×3 kernel, stride 1, ReLU.
#### Conv4: 384 filters, 3×3 kernel, stride 1, ReLU.
#### Conv5: 256 filters, 3×3 kernel, stride 1, ReLU.

# Max Pooling: After Conv1, Conv2, and Conv5 (3×3 kernel, stride 2).

# Fully Connected Layers:
#### FC6: 4096 neurons, ReLU.
#### FC7: 4096 neurons, ReLU.
#### FC8 (Output): 1000 neurons (ImageNet classes), Softmax.
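
# The layer list above can be checked with the usual output-size formula, floor((n + 2p - k) / s) + 1.
# This is a minimal sketch, not the original implementation. The padding values (2 for Conv2, 1 for Conv3 to Conv5)
# are assumptions that the notes above do not state; the function names are hypothetical.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def alexnet_shapes(input_size=227):
    """Trace the feature-map side length through the conv/pool stack."""
    shapes = {}
    s = conv_out(input_size, kernel=11, stride=4)        # Conv1
    shapes["conv1"] = s
    s = conv_out(s, kernel=3, stride=2)                  # max pool after Conv1
    shapes["pool1"] = s
    s = conv_out(s, kernel=5, stride=1, padding=2)       # Conv2 (padding assumed)
    shapes["conv2"] = s
    s = conv_out(s, kernel=3, stride=2)                  # max pool after Conv2
    shapes["pool2"] = s
    for name in ("conv3", "conv4", "conv5"):
        s = conv_out(s, kernel=3, stride=1, padding=1)   # padding assumed
        shapes[name] = s
    s = conv_out(s, kernel=3, stride=2)                  # max pool after Conv5
    shapes["pool5"] = s
    return shapes


shapes = alexnet_shapes()
fc6_inputs = shapes["pool5"] ** 2 * 256  # flattened input to FC6 (Conv5 has 256 filters)
print(shapes)
print(fc6_inputs)
```

# Under these assumptions the side length goes 227 -> 55 -> 27 -> 13 -> 6, so FC6 receives 6 * 6 * 256 = 9216 values.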

# Activation Function: ReLU (used instead of tanh/sigmoid because it trains faster).
# Normalization: Local Response Normalization (LRN) after Conv1 and Conv2.
# Regularization: Dropout (0.5) in FC6 and FC7.
# Optimization: Stochastic Gradient Descent (SGD) with momentum (0.9).
# Batch Size: 128.
# Weight Initialization: Zero-mean Gaussian distribution (standard deviation 0.01).
# Data Augmentation: Cropping, flipping, and color jittering.
# Training Dataset: ImageNet (1.2 million images, 1000 classes).
# Parallel Training: The network was split across two GPUs, mainly because a single GPU of the time did not have enough memory to hold it.
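
# The optimizer setting above (SGD with momentum 0.9) can be sketched in plain Python.
# This is an illustrative sketch only: the function name, the learning rate of 0.01, and the toy
# objective f(w) = w^2 are assumptions made for the example, and weight decay is left out.

```python
def sgd_momentum_step(weights, grads, velocity, lr=0.01, momentum=0.9):
    """One SGD-with-momentum update; returns the new (weights, velocity)."""
    # The velocity is a decaying sum of past gradient steps.
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    # The weights move along the velocity, not along the raw gradient.
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


# Toy problem (hypothetical): minimise f(w) = w^2, whose gradient is 2w.
w, v = [1.0], [0.0]
for _ in range(200):
    w, v = sgd_momentum_step(w, [2 * w[0]], v)
print(w[0])  # very close to 0 after 200 steps
```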

In [None]:
## VGG Model
# Key Features:
# Deep Network: 16 (VGG16) or 19 (VGG19) weight layers.
# Uniform Kernel Size: Only 3×3 convolution layers to maintain consistency.
# Increased Depth: More layers compared to AlexNet for hierarchical feature learning.
# Regularization: Dropout (0.5) in fully connected layers.
# Optimization: SGD with momentum (0.9), batch size = 256.
# Pretrained Weights: ImageNet-pretrained weights are widely available, which makes VGG a common backbone for transfer learning.
# Data Augmentation: Cropping, flipping, and color jittering.
# Variants: VGG16 and VGG19 are the most common. They differ in the number of convolutional layers: VGG16 has 13 and VGG19 has 16, and both end with 3 fully connected layers.
# Stacked small convolutional filters (3×3 kernel, stride 1, padding 1) for deeper representations.
# Uses 2×2 max pooling (stride 2) after every block for downsampling.
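
# Why stack small 3×3 filters? A short arithmetic sketch: stacked stride-1 3×3 convolutions cover the same
# receptive field as one larger kernel while using fewer weights. The channel width C = 256 is an arbitrary
# value chosen for illustration, and biases are ignored.

```python
def stacked_receptive_field(num_layers, kernel=3):
    """Receptive field of num_layers stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel - 1)


def conv_weights(kernel, channels, num_layers=1):
    """Weight count (no biases) when every layer has `channels` inputs and outputs."""
    return num_layers * kernel * kernel * channels * channels


C = 256  # hypothetical channel width
# Two 3×3 layers see a 5×5 region; three 3×3 layers see a 7×7 region.
print(stacked_receptive_field(2), conv_weights(3, C, 2), conv_weights(5, C))
print(stacked_receptive_field(3), conv_weights(3, C, 3), conv_weights(7, C))
```

# In units of C^2, that is 18 vs 25 weights for the 5×5 case and 27 vs 49 for the 7×7 case, with an extra ReLU between the stacked layers.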


# Layers - VGG16
# Input Size: 224 × 224 × 3 (RGB images).

# Conv Layers:
#### Block 1: 2 × (64 filters, 3×3, ReLU) → Max Pooling
#### Block 2: 2 × (128 filters, 3×3, ReLU) → Max Pooling
#### Block 3: 3 × (256 filters, 3×3, ReLU) → Max Pooling
#### Block 4: 3 × (512 filters, 3×3, ReLU) → Max Pooling
#### Block 5: 3 × (512 filters, 3×3, ReLU) → Max Pooling

# Fully Connected Layers:
#### FC6: 4096 neurons, ReLU
#### FC7: 4096 neurons, ReLU
#### FC8 (Output): 1000 neurons (Softmax for classification)
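
# The VGG16 layer table above is enough to count the trainable parameters. The sketch below assumes every
# convolution and fully connected layer has a bias term and that each block's max pooling halves the spatial size;
# the names VGG16_BLOCKS and vgg16_param_count are hypothetical.

```python
# (number of 3×3 conv layers, filters) for blocks 1 to 5
VGG16_BLOCKS = [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]


def vgg16_param_count(in_channels=3, input_size=224, num_classes=1000):
    """Return (conv_params, fc_params) for the VGG16 layout listed above."""
    conv_params = 0
    channels, size = in_channels, input_size
    for num_convs, filters in VGG16_BLOCKS:
        for _ in range(num_convs):
            conv_params += 3 * 3 * channels * filters + filters  # weights + biases
            channels = filters
        size //= 2  # 2×2 max pooling with stride 2

    fc_params = 0
    features = size * size * channels  # 7 * 7 * 512 for a 224×224 input
    for units in (4096, 4096, num_classes):
        fc_params += features * units + units
        features = units
    return conv_params, fc_params


conv_params, fc_params = vgg16_param_count()
print(conv_params, fc_params, conv_params + fc_params)
```

# Most of the roughly 138 million parameters sit in the fully connected layers, not in the convolutions.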



In [None]:
## ResNet Model

# ResNet introduced residual learning to make very deep networks trainable: skip connections ease optimization and mitigate the vanishing gradient problem.

# Key Features:
#### Deep Architecture: Common variants are ResNet-18, ResNet-34, ResNet-50, ResNet-101, and ResNet-152.
#### Residual Connections (Skip Connections):
####### Instead of directly learning the target mapping H(x), each block learns the residual F(x) = H(x) - x and outputs F(x) + x, which makes optimization easier.
####### The identity shortcut helps gradients flow smoothly during backpropagation.
#### Batch Normalization: Used after every convolution to stabilize training.
#### ReLU Activation: Applied after batch normalization; the last ReLU of each block comes after the shortcut addition.
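
# The residual idea y = F(x) + x can be sketched on plain vectors. This is a simplified illustration, not a real
# ResNet block: small dense layers stand in for the 3×3 convolutions, batch normalization is omitted, and all
# function names are hypothetical.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def linear(values, weights):
    """Multiply a vector by a square weight matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


def residual_block(x, w1, w2):
    """y = ReLU(F(x) + x), with F(x) = W2 · ReLU(W1 · x)."""
    fx = linear(relu(linear(x, w1)), w2)           # the residual branch F(x)
    return relu([f + xi for f, xi in zip(fx, x)])  # add the identity shortcut


x = [1.0, 2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]

print(residual_block(x, zeros, zeros))        # F(x) = 0, so the block passes x through
print(residual_block(x, identity, identity))  # F(x) = x, so the output is 2x
```

# With zero weights the block is an identity mapping for non-negative inputs, which is why adding such blocks should not make a network harder to fit.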

# ResNet-18 Layer-by-Layer Breakdown

# Conv1 (Initial Convolutional Layer):
#### Operation: 7×7 convolution, 64 filters, stride 2
#### Output Size: 112 × 112 × 64

# MaxPool:
#### Operation: 3×3 max pooling, stride 2
#### Output Size: 56 × 56 × 64

# Conv2_x (Residual Blocks 1 and 2):
#### Operation: 2 × basic residual blocks (each with 2 × 3×3 convolutions, 64 filters)
#### Output Size: 56 × 56 × 64

# Conv3_x (Residual Blocks 3 and 4):
#### Operation: 2 × basic residual blocks (each with 2 × 3×3 convolutions, 128 filters); the first block uses stride 2
#### Output Size: 28 × 28 × 128

# Conv4_x (Residual Blocks 5 and 6):
#### Operation: 2 × basic residual blocks (each with 2 × 3×3 convolutions, 256 filters); the first block uses stride 2
#### Output Size: 14 × 14 × 256

# Conv5_x (Residual Blocks 7 and 8):
#### Operation: 2 × basic residual blocks (each with 2 × 3×3 convolutions, 512 filters); the first block uses stride 2
#### Output Size: 7 × 7 × 512

# AvgPool (Global Average Pooling):
#### Operation: Global average pooling
#### Output Size: 1 × 1 × 512

# Fully Connected (FC):
#### Operation: Fully connected layer (512 → 1000 classes)
#### Output Size: 1 × 1 × 1000 (classification result)
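
# The breakdown above can be traced in a few lines, which also shows where the "18" in ResNet-18 comes from.
# This is a sketch under assumptions the notes do not state: a 224 × 224 input, padding 3 for the 7×7 convolution,
# and padding 1 for the max pooling and the 3×3 convolutions. The names are hypothetical.

```python
def conv_out(size, kernel, stride, padding):
    """Spatial output size of a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


# (stage name, filters, stride of the first block); each stage has 2 basic blocks
RESNET18_STAGES = [
    ("conv2_x", 64, 1),
    ("conv3_x", 128, 2),
    ("conv4_x", 256, 2),
    ("conv5_x", 512, 2),
]


def resnet18_trace(input_size=224):
    """Return ([(layer, side length, channels), ...], number of weighted layers)."""
    trace = []
    size = conv_out(input_size, kernel=7, stride=2, padding=3)
    trace.append(("conv1", size, 64))
    size = conv_out(size, kernel=3, stride=2, padding=1)
    trace.append(("maxpool", size, 64))

    weighted_layers = 1  # conv1
    for name, filters, stride in RESNET18_STAGES:
        # Only the first 3×3 convolution of a stage may downsample.
        size = conv_out(size, kernel=3, stride=stride, padding=1)
        weighted_layers += 2 * 2  # 2 blocks × 2 convolutions
        trace.append((name, size, filters))

    trace.append(("avgpool", 1, 512))
    weighted_layers += 1  # final fully connected layer
    return trace, weighted_layers


trace, depth = resnet18_trace()
for layer in trace:
    print(layer)
print(depth)
```

# The count covers conv1, the 16 convolutions in the 8 basic blocks, and the FC layer; 1×1 shortcut projections are not counted.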


## ResNet can therefore be viewed as a collection of shallower networks: when a block's residual F(x) is close to zero, the block reduces to the identity and is effectively skipped.
## First, the identity shortcuts help the model avoid vanishing gradients.
## Second, it performs well because it behaves like an ensemble of paths of different depths, with training determining how much each block contributes.
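
## One way to make the "collection of shallow networks" view concrete: every residual block can either be passed
## through or bypassed via its shortcut, so n blocks give 2^n possible paths. The sketch below only counts those
## paths for the 8 blocks of ResNet-18; it is an illustration of this interpretation, not a property measured here,
## and the function name is hypothetical.

```python
from math import comb


def path_length_counts(num_blocks):
    """counts[k] = number of paths that pass through exactly k residual branches."""
    return [comb(num_blocks, k) for k in range(num_blocks + 1)]


counts = path_length_counts(8)  # ResNet-18 has 8 basic residual blocks
total_paths = sum(counts)
print(counts)
print(total_paths)
```

## Most paths use about half of the blocks, so the bulk of the "ensemble" is noticeably shallower than the full network.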



In [None]:
# Covariate Shift and Batch Normalization

# Covariate Shift
# Covariate shift is the situation where the distribution of the input data changes between training and testing, while the conditional distribution of the output given the input stays the same. In simpler terms, the model is trained on data from one distribution but encounters data from a different distribution once deployed, which can hurt performance.

# Batch Normalization (BatchNorm)
# Batch normalization was introduced to reduce internal covariate shift during the training of deep neural networks. It normalizes each layer's activations using the mean and variance of the current mini-batch, then applies a learnable scale and shift, so that the distribution of inputs to each layer stays stable throughout training.
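
# The training-time computation can be sketched for a single feature across one mini-batch. This is a minimal
# sketch: gamma and beta are fixed numbers here (in a real layer they are learned), eps = 1e-5 is an assumed
# default, and the running statistics used at inference time are omitted.

```python
def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalise one feature across a mini-batch, then scale and shift it."""
    n = len(batch)
    mean = sum(batch) / n
    var = sum((x - mean) ** 2 for x in batch) / n  # biased (population) variance
    return [gamma * (x - mean) / (var + eps) ** 0.5 + beta for x in batch]


activations = [2.0, 4.0, 6.0, 8.0]
normalised = batch_norm(activations)
print(normalised)                                    # roughly zero mean, unit variance
print(batch_norm(activations, gamma=2.0, beta=3.0))  # mean shifted to about 3
```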

# Global Average Pooling
# 2D global average pooling averages over all spatial positions (height and width) of each feature map (channel), producing one value per channel.
# Unlike max pooling, which keeps the maximum value in each local window, it computes the mean of each feature map over its entire spatial area.
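
# A small sketch of the operation on nested lists. The channels-first layout (a list of channels, each a list of
# rows) and the function name are assumptions chosen for the example.

```python
def global_average_pool(feature_maps):
    """Reduce each H×W feature map to its mean, giving one value per channel."""
    pooled = []
    for channel in feature_maps:
        values = [v for row in channel for v in row]  # flatten height and width
        pooled.append(sum(values) / len(values))
    return pooled


# Two 2×2 feature maps (hypothetical values)
x = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[0.0, 0.0], [0.0, 8.0]],
]
pooled = global_average_pool(x)
print(pooled)  # [2.5, 2.0]
```

# A global max pooling over the same maps would return 4.0 and 8.0 instead, keeping only the strongest activation per channel.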
