https://youtu.be/jDe5BAsT2-Y?si=5Ii-l6HqtSvWV_cH

# Convolutional Neural Network (CNN) Basics

## 1. Convolution (Filters)
- **Definition**: Convolution in CNNs is a mathematical operation where a small matrix, called a filter, slides over an input image or feature map to produce a new feature map. This filter captures patterns like edges or textures by multiplying the filter values with corresponding input values and summing them.

- **Formula**: If the input size is $ \text{Input Size} $, filter size is $ \text{Filter Size} $, padding $ \text{Padding} $, and stride $ \text{Stride} $, the output size $ \text{Output Size} $ after convolution is:
  $$
  \text{Output Size} = \frac{\text{Input Size} - \text{Filter Size} + 2 \times \text{Padding}}{\text{Stride}} + 1
  $$
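The sliding-window operation and the output-size formula above can be sketched in plain Python. This is a minimal illustration, not an optimized implementation: the function name `conv2d`, the square single-channel input, and the vertical-edge filter are all assumptions chosen for the example.

```python
def conv2d(image, kernel, stride=1, padding=0):
    """Naive single-channel 2D convolution (cross-correlation, as used in CNNs)."""
    n = len(image)            # square input, n x n
    k = len(kernel)           # square filter, k x k
    size = n + 2 * padding    # side length after zero padding
    padded = [[0.0] * size for _ in range(size)]
    for r in range(n):
        for c in range(n):
            padded[r + padding][c + padding] = image[r][c]

    out_size = (size - k) // stride + 1   # output-size formula
    output = []
    for i in range(out_size):
        row = []
        for j in range(out_size):
            total = 0.0
            # Multiply filter values with the matching input values and sum them.
            for m in range(k):
                for q in range(k):
                    total += kernel[m][q] * padded[i * stride + m][j * stride + q]
            row.append(total)
        output.append(row)
    return output


# 6x6 image with a vertical edge: bright left half, dark right half.
image = [[10, 10, 10, 0, 0, 0] for _ in range(6)]
vertical_edge = [[1, 0, -1],
                 [1, 0, -1],
                 [1, 0, -1]]

feature_map = conv2d(image, vertical_edge)
print(len(feature_map), "x", len(feature_map[0]))  # 4 x 4, since (6 - 3 + 0) / 1 + 1 = 4
print(feature_map[0])                              # [0.0, 30.0, 30.0, 0.0]
```

The large values in the middle of each row mark the position of the edge; with `padding=1` the same call returns a 6x6 feature map.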

# Why Use Convolutions? (Simplified)

Convolutions have two main advantages:

## 1. Parameter Sharing
- **What It Means**: The same small filter (or pattern detector) is used across the entire image.
- **Why It Matters**: If a filter can find something useful, like a vertical line, in one part of the image, it can find the same line in other parts too. This means we don’t need to create a separate filter for every location, which saves memory and makes the model simpler.

## 2. Sparsity of Connections
- **What It Means**: Each output value in a convolutional layer only depends on a small section of the input.
- **Why It Matters**: Because each output looks at only a few inputs, the layer needs far fewer weights than a fully connected layer, which makes it cheaper to train and less prone to overfitting. Combined with parameter sharing, this also helps the model recognize objects regardless of where they are in the image. For example, it can recognize a cat whether it's in the top left corner or the center.
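To make the savings concrete, the sketch below compares a convolutional layer with a fully connected layer that maps the same input to the same output. The sizes (a 32x32x3 input, six 5x5 filters, no padding) are assumptions picked for illustration.

```python
# Illustrative sizes (assumed): 32x32x3 input, six 5x5 filters, stride 1, no padding.
in_h, in_w, in_c = 32, 32, 3
f, n_filters = 5, 6
out_h, out_w = in_h - f + 1, in_w - f + 1           # 28 x 28

n_inputs = in_h * in_w * in_c                       # 3,072 input values
n_outputs = out_h * out_w * n_filters               # 4,704 output values

# Convolution: each filter is shared across all positions (parameter sharing).
conv_params = (f * f * in_c + 1) * n_filters

# Fully connected: every output has its own weight for every input.
fc_params = n_inputs * n_outputs + n_outputs

# Sparsity of connections: inputs that feed a single output value.
inputs_per_output_conv = f * f * in_c
inputs_per_output_fc = n_inputs

print(f"Conv parameters: {conv_params:,}")          # 456
print(f"FC parameters:   {fc_params:,}")            # 14,455,392
print(f"Inputs per output: {inputs_per_output_conv} (conv) vs {inputs_per_output_fc} (FC)")
```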


## 2. Filter (Kernel)
- **Definition**: A filter (or kernel) is a small matrix, often 3x3 or 5x5, that is used in the convolution operation to detect specific patterns in the input. Each filter captures a unique feature, like an edge or texture, in the input image.

## 3. Pooling
- **Definition**: Pooling reduces the spatial size of the feature map, helping to decrease the number of computations and control overfitting. Common types are:
  - **Max Pooling**: Takes the maximum value from a region.
  - **Average Pooling**: Takes the average value from a region.

- **Formula**: If the input size is $ \text{Input Size} $, pooling size is $ \text{Pooling Size} $, and stride $ \text{Stride} $, the output size $ \text{Output Size} $ after pooling is:
  $$
  \text{Output Size} = \frac{\text{Input Size} - \text{Pooling Size}}{\text{Stride}} + 1
  $$
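A minimal sketch of both pooling types, assuming a square single-channel feature map; the helper name `pool2d` and the sample values are made up for the example.

```python
def pool2d(feature_map, size=2, stride=2, mode="max"):
    """Max or average pooling over a square single-channel feature map."""
    n = len(feature_map)
    out = (n - size) // stride + 1        # pooling output-size formula
    result = []
    for i in range(out):
        row = []
        for j in range(out):
            window = [feature_map[i * stride + m][j * stride + q]
                      for m in range(size) for q in range(size)]
            row.append(max(window) if mode == "max" else sum(window) / len(window))
        result.append(row)
    return result


fm = [[1, 3, 2, 1],
      [2, 9, 1, 1],
      [1, 3, 2, 3],
      [5, 6, 1, 2]]

max_pooled = pool2d(fm, mode="max")
avg_pooled = pool2d(fm, mode="avg")
print(max_pooled)   # [[9, 2], [6, 3]]
print(avg_pooled)   # [[3.75, 1.25], [3.75, 2.0]]
```

A 4x4 input with a 2x2 window and stride 2 gives (4 - 2) / 2 + 1 = 2, so both results are 2x2.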

## 4. Stride
- **Definition**: Stride is the number of pixels the filter moves each time it slides over the input. A stride of 1 moves one pixel at a time; a stride of 2 moves two pixels at a time.

## 5. Padding
- **Definition**: Padding adds extra pixels (usually zeros) around the input’s border to control the output size. This can help keep the output dimensions the same as the input.

- **Formula**: To maintain the input size after a convolution with stride 1 and an odd filter size $ \text{Filter Size} $, the padding $ \text{Padding} $ on each side is:
  $$
  \text{Padding} = \frac{\text{Filter Size} - 1}{2}
  $$


# The shrinking problem: padding keeps the dimensions the same

# Keeping Spatial Dimensions the Same in CNNs

To keep the spatial dimensions of the input and output 2D feature maps the same (i.e., to prevent them from shrinking after each convolutional layer), you can use **padding**. Padding adds extra pixels around the border of the input matrix, allowing you to control the output feature map dimensions after convolution operations.

## Steps to Keep Input and Output Dimensions the Same

### 1. Use **Same Padding**
For a convolution operation to keep the spatial dimensions the same, apply **"same" padding**. This padding adds enough pixels around the input to ensure that the output dimensions match the input dimensions. ("Full" padding is a different scheme: it pads by $\text{Filter Size} - 1$ on each side, so the output becomes larger than the input.)

Assuming:
- **Filter size** ($\text{Filter Size}$) is odd, and
- **Stride** ($\text{Stride}$) is 1 (which is typical to preserve as much information as possible),

add padding $ \text{Padding} = \frac{\text{Filter Size} - 1}{2} $ on each side.

For example:
- For a $3 \times 3$ filter with stride 1, padding should be $ \text{Padding} = \frac{3 - 1}{2} = 1 $.
- For a $5 \times 5$ filter with stride 1, padding should be $ \text{Padding} = \frac{5 - 1}{2} = 2 $.

By adding this padding, the convolution will produce an output with the same dimensions as the input.
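The rule can be checked numerically. The sketch below assumes stride 1 and odd filter sizes; `same_padding` and `output_size` are hypothetical helper names, not framework functions.

```python
def output_size(n, f, p, s=1):
    """Output size of a convolution: floor((n + 2p - f) / s) + 1."""
    return (n + 2 * p - f) // s + 1


def same_padding(f):
    """Padding per side that preserves the input size for stride 1 and odd f."""
    if f % 2 == 0:
        raise ValueError("symmetric 'same' padding needs an odd filter size")
    return (f - 1) // 2


for f in (1, 3, 5, 7):
    p = same_padding(f)
    print(f"filter {f}x{f}: padding {p}, output {output_size(28, f, p)}")
```

Each line reports an output size of 28 for a 28x28 input, matching the worked examples for the 3x3 and 5x5 filters.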

### 2. General Formula with Padding

The output size for a convolutional layer with padding $ \text{Padding} $, filter size $ \text{Filter Size} $, input size $ \text{Input Size} $, and stride $ \text{Stride} $ is given by:
$$
\text{Output Size} = \frac{\text{Input Size} - \text{Filter Size} + 2 \times \text{Padding}}{\text{Stride}} + 1
$$

To keep input and output sizes equal, set this formula so that **Output Size = Input Size**:
$$
\text{Input Size} = \frac{\text{Input Size} - \text{Filter Size} + 2 \times \text{Padding}}{\text{Stride}} + 1
$$

Solving this equation with $ \text{Stride} = 1 $ gives $ 2 \times \text{Padding} = \text{Filter Size} - 1 $, so the required padding is $ \text{Padding} = \frac{\text{Filter Size} - 1}{2} $.

## Example: Using Same Padding in CNN Layers

If you want to preserve the 2D spatial size (like 28x28) across convolutional layers with different filter sizes:
- Use padding $ \text{Padding} = 1 $ for $3 \times 3$ filters.
- Use padding $ \text{Padding} = 2 $ for $5 \times 5$ filters.

Most deep learning frameworks (like TensorFlow, Keras, and PyTorch) have a "same" padding option that automatically calculates and applies the padding needed to preserve input dimensions.


---
---
---
# example:

# Understanding CNN Layer Components and Calculations

## CNN Layer Example
Let's analyze a convolutional layer with the following characteristics:
* Input size: $(32 \times 32 \times 3)$ (height, width, and channels)
* Filter size: $f^{[l]} = 5 \times 5$ (height and width)
* Number of filters: $n_c^{[l]} = 6$
* Stride: $s^{[l]} = 1$
* Padding: $p^{[l]} = 2$

## Step 1: Input Size
The input to the layer has dimensions $32 \times 32 \times 3$, meaning:
* $n_H^{[l-1]} = 32$ (input height)
* $n_W^{[l-1]} = 32$ (input width)
* $n_c^{[l-1]} = 3$ (input channels, e.g., RGB)

## Step 2: Output Size Formula
The formula to calculate the output height $n_H^{[l]}$ and width $n_W^{[l]}$ is:

$n_H^{[l]} = \left\lfloor\frac{n_H^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1\right\rfloor$

$n_W^{[l]} = \left\lfloor\frac{n_W^{[l-1]} + 2p^{[l]} - f^{[l]}}{s^{[l]}} + 1\right\rfloor$

Let's substitute our values:
* $n_H^{[l-1]} = 32$
* $p^{[l]} = 2$
* $f^{[l]} = 5$
* $s^{[l]} = 1$

$n_H^{[l]} = \frac{32 + 2(2) - 5}{1} + 1 = \frac{32 + 4 - 5}{1} + 1 = 32$

$n_W^{[l]} = \frac{32 + 2(2) - 5}{1} + 1 = 32$

**Output Size:** The output dimensions are $32 \times 32$. With 6 filters, the complete output size is $32 \times 32 \times 6$.

## Step 3: Filter Parameters
Each filter dimensions:

$f^{[l]} \times f^{[l]} \times n_c^{[l-1]} = 5 \times 5 \times 3$

Parameters per filter = $5 \times 5 \times 3 = 75$ weights

## Step 4: Total Weights
With $n_c^{[l]} = 6$ filters:

Total weights = $75 \times 6 = 450$

## Step 5: Bias Parameters
One bias term per filter:

Total biases = $n_c^{[l]} = 6$

## Step 6: Total Parameters
Total parameters = Weights + Biases = $450 + 6 = 456$

## Step 7: Activations
The output tensor shape (feature maps):

$\text{Activations} = n_H^{[l]} \times n_W^{[l]} \times n_c^{[l]} = 32 \times 32 \times 6$

## Summary

Parameter | Value
----------|-------
Input size | $32 \times 32 \times 3$
Filter size | $5 \times 5$
Stride | $1$
Padding | $2$
Number of filters | $6$
Output size | $32 \times 32 \times 6$
Total weights | $450$
Total biases | $6$
Total parameters | $456$

## Understanding the Output Size

The padding $p^{[l]} = 2$ adds two rows/columns of zeros on each side of the input image, making it $36 \times 36$. The $5 \times 5$ filter slides across this padded input with stride $s^{[l]} = 1$, resulting in an output size of $36 - 5 + 1 = 32$, i.e. $32 \times 32$.

Here is Python code to verify these calculations:

```python
def calculate_output_size(input_dim, filter_size, stride, padding):
    return ((input_dim + 2*padding - filter_size) // stride) + 1

# Given parameters
input_height = input_width = 32
filter_size = 5
stride = 1
padding = 2

# Calculate output dimensions
output_height = calculate_output_size(input_height, filter_size, stride, padding)
output_width = calculate_output_size(input_width, filter_size, stride, padding)

print(f"Output dimensions: {output_height} x {output_width}")

# Calculate total parameters
input_channels = 3
num_filters = 6
weights_per_filter = filter_size * filter_size * input_channels
total_weights = weights_per_filter * num_filters
total_biases = num_filters
total_parameters = total_weights + total_biases

print(f"\nParameter counts:")
print(f"Weights per filter: {weights_per_filter}")
print(f"Total weights: {total_weights}")
print(f"Total biases: {total_biases}")
print(f"Total parameters: {total_parameters}")
```

### LeNet-5 Overview:

- **Total Parameters**: ~60,000
- **Layers**:
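  - **Input Layer**: $32 \times 32 \times 1$ (Grayscale image)
  - **Convolutional Layers**: 2 (plus C5, a convolution that behaves like a fully connected layer because its input is only $5 \times 5$)
    - C1: $5 \times 5$ filters, 6 feature maps
    - C3: $5 \times 5$ filters, 16 feature maps
  - **Pooling/Subsampling Layers**: 2
  - **Fully Connected Layers**: 3 (C5, F6, and the output layer)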
  - **Output Layer**: 10 nodes (for digit classification)
  
- **Computation**: 
  - Relatively lightweight compared to modern architectures. Used primarily for digit classification on MNIST.

## Architecture Breakdown

### Input Layer
* Size: $32 \times 32 \times 1$ (grayscale image)

### Layer-by-Layer Details

#### 1. First Convolution (C1)
* Filters: 6 filters of size $5 \times 5$
* Stride: 1
* Padding: 0
* Output size: $28 \times 28 \times 6$
* Parameters: $6 \times (5 \times 5 \times 1 + 1) = 156$

#### 2. First Subsampling (S2)
* Type: Average pooling $2 \times 2$
* Stride: 2
* Output size: $14 \times 14 \times 6$
* Parameters: 0

#### 3. Second Convolution (C3)
* Filters: 16 filters of size $5 \times 5$
* Stride: 1
* Output size: $10 \times 10 \times 16$
* Parameters: $16 \times (5 \times 5 \times 6 + 1) = 2,416$

#### 4. Second Subsampling (S4)
* Type: Average pooling $2 \times 2$
* Stride: 2
* Output size: $5 \times 5 \times 16$
* Parameters: 0

#### 5. Third Convolution (C5)
* Filters: 120 filters of size $5 \times 5$
* Stride: 1
* Output size: $1 \times 1 \times 120$
* Parameters: $120 \times (5 \times 5 \times 16 + 1) = 48,120$

#### 6. Fully Connected (F6)
* Input: 120 neurons
* Output: 84 neurons
* Parameters: $84 \times 120 + 84 = 10,164$

#### 7. Output Layer
* Input: 84 neurons
* Output: 10 neurons (digits 0-9)
* Parameters: $10 \times 84 + 10 = 850$

## Total Parameters
| Layer | Parameters |
|-------|------------|
| C1    | 156        |
| C3    | 2,416      |
| C5    | 48,120     |
| F6    | 10,164     |
| Output| 850        |
| Total | 61,706     |
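The shapes and parameter counts in this breakdown can be reproduced with a short loop. This is a sketch under the same simplifications as the tables above (every filter sees all input channels, pooling layers have no trainable parameters); the tuple layout is an assumption made for the example.

```python
# (name, kind, size, n_out, stride); size/stride are unused for fully connected layers
layers = [
    ("C1", "conv", 5, 6, 1),
    ("S2", "pool", 2, None, 2),
    ("C3", "conv", 5, 16, 1),
    ("S4", "pool", 2, None, 2),
    ("C5", "conv", 5, 120, 1),
    ("F6", "fc", None, 84, None),
    ("Output", "fc", None, 10, None),
]

h, c = 32, 1          # square input: 32 x 32 x 1
counts = {}
for name, kind, size, n_out, stride in layers:
    if kind == "conv":
        params = n_out * (size * size * c + 1)   # weights + one bias per filter
        h = (h - size) // stride + 1             # no padding
        c = n_out
    elif kind == "pool":
        params = 0
        h = (h - size) // stride + 1
    else:  # fully connected
        params = n_out * (h * h * c) + n_out
        h, c = 1, n_out
    counts[name] = params
    print(f"{name:6s} output {h}x{h}x{c:<4d} parameters {params:,}")

total = sum(counts.values())
print(f"Total parameters: {total:,}")            # 61,706
```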

## Convolution Operation
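For a convolution layer with a single input channel (as in C1), output pixel $y(i,j)$ is calculated as follows; with several input channels, the sum also runs over the channels: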
$$y(i,j) = \sum_{m=0}^{4} \sum_{n=0}^{4} w_{m,n} \cdot x(i+m, j+n) + b$$
where:
* $w_{m,n}$ = filter weights
* $x(i+m, j+n)$ = input pixels
* $b$ = bias term


### Summary and Overview of AlexNet

**Overview**:  
AlexNet, introduced by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton in 2012, revolutionized the field of deep learning for image classification. It won the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) in 2012 by a significant margin, achieving a top-5 test error rate of $15.3\%$ compared with $26.2\%$ for the second-best entry. The architecture introduced several innovations, including the use of Rectified Linear Units (ReLU) as activation functions, dropout for regularization, and data augmentation techniques to increase dataset diversity.

**Sequel**:  
AlexNet laid the groundwork for more advanced architectures such as VGGNet, GoogLeNet, and ResNet, each building on the principles established by AlexNet to improve accuracy and efficiency in image classification tasks.

**Architecture Details**:
- **Input Size**: $227 \times 227 \times 3$ (RGB image)
- **Number of Layers**: 8 layers (5 convolutional layers + 3 fully connected layers)
- **Total Parameters**: $62,378,344$ for the layer shapes listed below (the original paper quotes about 60 million)
- **Training Time**: Approximately 6 days on two GTX 580 GPUs
- **Key Innovations**: 
  - **ReLU Activation**: Speeds up the training process by mitigating the vanishing gradient problem.
  - **Dropout Layers**: Regularizes the model by randomly dropping units during training.
  - **Data Augmentation**: Increases the diversity of the training dataset through random crops, horizontal reflections, and changes to the RGB channel intensities.
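ReLU and dropout are simple enough to sketch in a few lines. The version below uses "inverted" dropout (survivors are rescaled during training), which is a common modern variant and an assumption here; the original AlexNet paper instead scaled the outputs at test time. The function names and sample values are illustrative.

```python
import random


def relu(values):
    """ReLU: negative activations become 0, positive ones pass through."""
    return [max(0.0, v) for v in values]


def dropout(values, p=0.5, training=True, rng=random):
    """Inverted dropout: zero each unit with probability p, rescale the rest by 1/(1-p)."""
    if not training:
        return list(values)          # no units are dropped at test time
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


activations = relu([-2.0, -0.5, 0.0, 1.5, 3.0])
print(activations)                                   # [0.0, 0.0, 0.0, 1.5, 3.0]

rng = random.Random(0)                               # seeded only to make the demo repeatable
print(dropout(activations, p=0.5, rng=rng))          # each entry is either 0 or doubled
print(dropout(activations, training=False))          # unchanged at test time
```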

---

### Layer Breakdown

1. **Layer 1**: Convolutional Layer
   - **Input**: $227 \times 227 \times 3$
   - **Filter Size**: $11 \times 11$
   - **Stride**: 4
   - **Number of Filters**: 96
   - **Output**: $55 \times 55 \times 96$
   - **Parameters**: $(11 \times 11 \times 3 \times 96) + 96 = 34,944$

2. **Layer 2**: Max Pooling Layer
   - **Input**: $55 \times 55 \times 96$
   - **Pooling Size**: $3 \times 3$
   - **Stride**: 2
   - **Output**: $27 \times 27 \times 96$

3. **Layer 3**: Convolutional Layer
   - **Input**: $27 \times 27 \times 96$
   - **Filter Size**: $5 \times 5$
   - **Padding**: 2
   - **Number of Filters**: 256
   - **Output**: $27 \times 27 \times 256$
   - **Parameters**: $(5 \times 5 \times 96 \times 256) + 256 = 614,656$

4. **Layer 4**: Max Pooling Layer
   - **Input**: $27 \times 27 \times 256$
   - **Pooling Size**: $3 \times 3$
   - **Stride**: 2
   - **Output**: $13 \times 13 \times 256$

5. **Layer 5**: Convolutional Layer
   - **Input**: $13 \times 13 \times 256$
   - **Filter Size**: $3 \times 3$
   - **Padding**: 1
   - **Number of Filters**: 384
   - **Output**: $13 \times 13 \times 384$
   - **Parameters**: $(3 \times 3 \times 256 \times 384) + 384 = 885,120$

6. **Layer 6**: Convolutional Layer
   - **Input**: $13 \times 13 \times 384$
   - **Filter Size**: $3 \times 3$
   - **Padding**: 1
   - **Number of Filters**: 384
   - **Output**: $13 \times 13 \times 384$
   - **Parameters**: $(3 \times 3 \times 384 \times 384) + 384 = 1,327,488$

7. **Layer 7**: Convolutional Layer
   - **Input**: $13 \times 13 \times 384$
   - **Filter Size**: $3 \times 3$
   - **Padding**: 1
   - **Number of Filters**: 256
   - **Output**: $13 \times 13 \times 256$
   - **Parameters**: $(3 \times 3 \times 384 \times 256) + 256 = 884,992$

8. **Layer 8**: Max Pooling Layer
   - **Input**: $13 \times 13 \times 256$
   - **Pooling Size**: $3 \times 3$
   - **Stride**: 2
   - **Output**: $6 \times 6 \times 256$

9. **Layer 9**: Flatten
   - **Input**: $6 \times 6 \times 256$
   - **Output**: $9216$

10. **Layer 10**: Fully Connected Layer
    - **Input**: $9216$
    - **Output**: $4096$
    - **Parameters**: $9216 \times 4096 + 4096 = 37,752,832$

11. **Layer 11**: Dropout Layer (p=0.5)

12. **Layer 12**: Fully Connected Layer
    - **Input**: $4096$
    - **Output**: $4096$
    - **Parameters**: $4096 \times 4096 + 4096 = 16,781,312$

13. **Layer 13**: Dropout Layer (p=0.5)

14. **Layer 14**: Fully Connected Layer
    - **Input**: $4096$
    - **Output**: $1000$
    - **Parameters**: $4096 \times 1000 + 1000 = 4,097,000$

---

### Total Parameters Calculation

- **Total Parameters** = $34,944 + 614,656 + 885,120 + 1,327,488 + 884,992 + 37,752,832 + 16,781,312 + 4,097,000 = 62,378,344$
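Sums like this are easy to get wrong by hand, so it helps to recompute each layer from its shape and count every layer exactly once. The sketch below assumes the ungrouped layer shapes from the breakdown above (weights plus one bias per filter or output neuron); the list layout is an assumption for the example.

```python
# (filter size, input channels, output channels) for the five convolutional layers
conv_layers = [(11, 3, 96), (5, 96, 256), (3, 256, 384), (3, 384, 384), (3, 384, 256)]
# (inputs, outputs) for the three fully connected layers
fc_layers = [(6 * 6 * 256, 4096), (4096, 4096), (4096, 1000)]

conv_params = [f * f * c_in * c_out + c_out for f, c_in, c_out in conv_layers]
fc_params = [n_in * n_out + n_out for n_in, n_out in fc_layers]

for i, p in enumerate(conv_params, start=1):
    print(f"conv {i}: {p:,}")
for i, p in enumerate(fc_params, start=1):
    print(f"fc {i}:   {p:,}")

total = sum(conv_params) + sum(fc_params)
print(f"Total parameters: {total:,}")
```

Almost all of the parameters sit in the fully connected layers; the convolutional layers account for under 4 million.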

---

### Summary of Key Metrics

| Parameter                  | Value           |
|----------------------------|-----------------|
| Input Size                 | $227 \times 227 \times 3$ |
| Number of Layers           | 8               |
| Total Parameters            | $62,378,344$    |
| Training Time              | 6 days on two GTX 580 GPUs |

---
---


# VGG Architecture Overview

- **Total Parameters**: ~138 million
- **Layers**:
  - **Input Layer**: $224 \times 224 \times 3$ (RGB image)
  - **Convolutional Layers**: 13
  - **Fully Connected Layers**: 3
  - **Output Layer**: 1000 nodes (for classification)

- **Computation**: 
  - VGG architectures are deep CNNs with many layers, emphasizing uniform architecture using small $3 \times 3$ filters.

## 1. What's New in VGG?
VGG introduced the concept of deep networks with very small filters:
- **Small receptive fields** ($3 \times 3$) that allow for deeper architectures without significantly increasing the number of parameters.
- **Increased depth** improves feature extraction.
- **Uniform architecture** simplifies the design process.

## 2. Sequel of Which Model?
VGG follows the evolution from earlier models like AlexNet, enhancing the depth and consistency of convolutional layer designs.

---

## 3. VGG Architecture and Layer Details

| Layer Type         | Output Size               | Parameters         |
|--------------------|--------------------------|---------------------|
| Input              | $224 \times 224 \times 3$| -                   |
| Convolution (Conv1)| $224 \times 224 \times 64$ | $3 \times 3 \times 3 \times 64 + 64 = 1,792$ |
| Convolution (Conv2)| $224 \times 224 \times 64$ | $3 \times 3 \times 64 \times 64 + 64 = 36,928$ |
| Max Pooling        | $112 \times 112 \times 64$ | -                   |
| Convolution (Conv3)| $112 \times 112 \times 128$ | $3 \times 3 \times 64 \times 128 + 128 = 73,856$ |
| Convolution (Conv4)| $112 \times 112 \times 128$ | $3 \times 3 \times 128 \times 128 + 128 = 147,584$ |
| Max Pooling        | $56 \times 56 \times 128$  | -                   |
| Convolution (Conv5)| $56 \times 56 \times 256$  | $3 \times 3 \times 128 \times 256 + 256 = 295,168$ |
| Convolution (Conv6)| $56 \times 56 \times 256$  | $3 \times 3 \times 256 \times 256 + 256 = 590,080$ |
| Convolution (Conv7)| $56 \times 56 \times 256$  | $3 \times 3 \times 256 \times 256 + 256 = 590,080$ |
| Max Pooling        | $28 \times 28 \times 256$  | -                   |
| Convolution (Conv8)| $28 \times 28 \times 512$  | $3 \times 3 \times 256 \times 512 + 512 = 1,180,160$ |
| Convolution (Conv9)| $28 \times 28 \times 512$  | $3 \times 3 \times 512 \times 512 + 512 = 2,359,808$ |
| Convolution (Conv10)| $28 \times 28 \times 512$ | $3 \times 3 \times 512 \times 512 + 512 = 2,359,808$ |
| Max Pooling        | $14 \times 14 \times 512$  | -                   |
| Convolution (Conv11)| $14 \times 14 \times 512$ | $3 \times 3 \times 512 \times 512 + 512 = 2,359,808$ |
| Convolution (Conv12)| $14 \times 14 \times 512$ | $3 \times 3 \times 512 \times 512 + 512 = 2,359,808$ |
| Convolution (Conv13)| $14 \times 14 \times 512$ | $3 \times 3 \times 512 \times 512 + 512 = 2,359,808$ |
| Max Pooling        | $7 \times 7 \times 512$    | -                   |
| Flatten            | $7 \times 7 \times 512 = 25,088$ | -             |
| Fully Connected (FC1)| $4096$                   | $25,088 \times 4096 + 4096 = 102,764,544$ |
| Fully Connected (FC2)| $4096$                   | $4096 \times 4096 + 4096 = 16,781,312$ |
| Output Layer       | $1000$                    | $4096 \times 1000 + 1000 = 4,097,000$ |

---

## 4. Overall Parameter Calculation
Adding up all the parameters from the layers gives a total of approximately **138 million** parameters.
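The total can be derived from a compact description of the network. The sketch below assumes the standard VGG-16 configuration (13 convolutional layers with $3 \times 3$ filters and "same" padding, five $2 \times 2$ max-pooling layers, three fully connected layers); the `cfg` list format is an assumption made for the example.

```python
# Numbers are output channels of a 3x3 convolution; "M" is a 2x2 max pooling with stride 2.
cfg = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
       512, 512, 512, "M", 512, 512, 512, "M"]

size, channels = 224, 3
conv_total, n_conv = 0, 0
for item in cfg:
    if item == "M":
        size //= 2                                    # pooling halves height and width
    else:
        conv_total += 3 * 3 * channels * item + item  # weights + biases
        channels = item
        n_conv += 1

flat = size * size * channels                         # flattened input to FC1
fc_total, n_in = 0, flat
for n_out in (4096, 4096, 1000):
    fc_total += n_in * n_out + n_out
    n_in = n_out

total = conv_total + fc_total
print(f"Convolutional layers: {n_conv}")
print(f"Flattened size: {flat:,}")
print(f"Conv parameters: {conv_total:,}")
print(f"FC parameters:   {fc_total:,}")
print(f"Total:           {total:,}")                  # 138,357,544
```

Roughly 124 million of the 138 million parameters belong to the fully connected layers, most of them to the first one.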

---

## 5. Mathematical Breakdown (Example of a Convolution)

Consider the convolution operation in the first convolutional layer (Conv1):

For an input of size $224 \times 224 \times 3$, each filter is $3 \times 3 \times 3$ (one $3 \times 3$ slice per input channel), so the output pixel $y(i, j)$ sums over the channel index $c$ as well:

$$
y(i, j) = \sum_{c=0}^{2} \sum_{m=0}^{2} \sum_{n=0}^{2} w_{m,n,c} \cdot x(i+m, j+n, c) + b
$$

Where:
- $w_{m,n,c}$ are the weights of the $3 \times 3 \times 3$ filter.
- $x(i+m, j+n, c)$ are the input pixels in channel $c$.
- $b$ is the bias term.
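Because the Conv1 input has three channels, computing one output pixel means summing over the filter's rows, columns, and channels. A minimal sketch, assuming the input is stored as `x[row][col][channel]` and the filter as `w[m][n][channel]` (both layouts are choices made for this example):

```python
def conv_pixel(x, w, b, i, j):
    """One output pixel of a multi-channel convolution at position (i, j)."""
    total = b
    for m in range(len(w)):                 # filter rows
        for n in range(len(w[0])):          # filter columns
            for c in range(len(w[0][0])):   # input channels
                total += w[m][n][c] * x[i + m][j + n][c]
    return total


# 4x4 RGB input where every pixel is (1, 2, 3), and an all-ones 3x3x3 filter.
x = [[[1.0, 2.0, 3.0] for _ in range(4)] for _ in range(4)]
w = [[[1.0, 1.0, 1.0] for _ in range(3)] for _ in range(3)]

y00 = conv_pixel(x, w, b=0.5, i=0, j=0)
print(y00)   # 9 positions * (1 + 2 + 3) + 0.5 = 54.5
```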

---
---

# ResNet Architecture

## Overview
* Year: 2015
* Authors: Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun
* Key Innovation: Introduction of residual connections to enable training of very deep networks (up to 152 layers).
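
The core idea of a residual connection is that a block outputs $F(x) + x$ instead of just $F(x)$, so it only has to learn a correction to its input. The sketch below illustrates this on plain lists; `transform` stands in for the block's convolutional layers and is a placeholder, not the actual ResNet layers.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def residual_block(x, transform):
    """y = ReLU(F(x) + x), where `transform` plays the role of F."""
    fx = transform(x)
    if len(fx) != len(x):
        raise ValueError("the identity shortcut needs F(x) and x to have the same shape")
    return relu([a + b for a, b in zip(fx, x)])


x = [0.5, 1.0, 2.0]

# If the learned branch outputs zeros, the block simply passes its input through.
identity_out = residual_block(x, lambda v: [0.0] * len(v))
print(identity_out)     # [0.5, 1.0, 2.0]

# Otherwise the branch adds a correction on top of the input.
corrected_out = residual_block(x, lambda v: [-0.5 * a for a in v])
print(corrected_out)    # [0.25, 0.5, 1.0]
```

Because the identity mapping is the default behavior, adding more blocks does not have to make optimization harder, which is what allows very deep stacks.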

## Architecture Details

### Input Layer
* Size: $224 \times 224 \times 3$
* Type: RGB Image

### Layer Calculations
Each convolution layer follows this simple formula:
$$\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1$$

### Layer by Layer Breakdown

| Layer           | Input Size               | Operation                  | Output Size               | Parameters   |
|-----------------|--------------------------|----------------------------|---------------------------|--------------|
| Conv1           | $224 \times 224 \times 3$| Conv 7×7, stride 2         | $112 \times 112 \times 64$| 9,408        |
| Pool1           | $112 \times 112 \times 64$| MaxPool 3×3, stride 2     | $56 \times 56 \times 64$  | 0            |
| Conv2_x         | $56 \times 56 \times 64$ | 3x3 Conv, 64 filters       | $56 \times 56 \times 64$  | 73,728       |
| Conv3_x         | $56 \times 56 \times 64$ | 3x3 Conv, 128 filters      | $28 \times 28 \times 128$ | 147,584      |
| Conv4_x         | $28 \times 28 \times 128$| 3x3 Conv, 256 filters      | $14 \times 14 \times 256$ | 590,080      |
| Conv5_x         | $14 \times 14 \times 256$| 3x3 Conv, 512 filters      | $7 \times 7 \times 512$   | 2,359,296    |
| Pool2           | $7 \times 7 \times 512$  | Global Average Pooling     | $1 \times 1 \times 512$   | 0            |
| FC              | $1 \times 1 \times 512$  | Fully Connected Layer (1000 classes) | $1 \times 1 \times 1000$ | 513,000      |

### Parameter Summary

| Parameter                  | Value                        |
|----------------------------|------------------------------|
| Input Size                 | $224 \times 224 \times 3$   |
| Number of Layers           | 18, 34, 50, 101, or 152, depending on the variant (the breakdown above follows the basic-block layout of ResNet-18/34) |
| Total Parameters            | ~11.7 million (ResNet-18) to ~60 million (ResNet-152) |


---
---



# Inception Architecture

## Overview
- **Year**: 2014
- **Authors**: Christian Szegedy et al.
- **Key Innovations**:
  - **Inception Module**: Introduces parallel convolutional paths with different filter sizes, capturing multi-scale features.
  - **Global Average Pooling**: Replaces fully connected layers to reduce overfitting and improve generalization.
  - **Factorized Convolutions**: Decomposes large filters into smaller ones (e.g., 5x5 into two 3x3), reducing parameters and computational complexity.
  - **NiN (Network in Network)**: Builds on the Network in Network idea of using micro-networks to create more abstract representations.
  - **1x1 Convolutions**: Enhances dimensionality reduction and allows deeper architectures while managing computational cost.
  - **Auxiliary Classifiers**: Provides intermediate supervision to improve convergence during training.
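
The savings from 1x1 convolutions and from factorized convolutions can be estimated with simple arithmetic. The sizes below (a 28x28x192 input, 32 output channels, a 16-channel bottleneck) are illustrative assumptions, and the count covers multiplications only, with stride 1 and "same" padding.

```python
def conv_multiplies(h, w, c_in, c_out, f):
    """Multiplications for an f x f convolution that keeps the h x w spatial size."""
    return h * w * c_out * f * f * c_in


h = w = 28
c_in, c_out, c_mid = 192, 32, 16

# Direct 5x5 convolution from 192 to 32 channels.
direct = conv_multiplies(h, w, c_in, c_out, 5)

# 1x1 convolution down to 16 channels, then the 5x5 convolution.
bottleneck = (conv_multiplies(h, w, c_in, c_mid, 1)
              + conv_multiplies(h, w, c_mid, c_out, 5))

print(f"Direct 5x5:          {direct:,}")       # 120,422,400
print(f"1x1 then 5x5:        {bottleneck:,}")   # 12,443,648
print(f"Reduction factor:    {direct / bottleneck:.1f}x")

# Factorized convolution: two stacked 3x3 filters cover the same 5x5 receptive field.
weights_5x5 = 5 * 5
weights_two_3x3 = 2 * 3 * 3
print(f"Weights per channel pair: {weights_5x5} (5x5) vs {weights_two_3x3} (two 3x3)")
```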

## Architecture Details

### Input Layer
- **Size**: $224 \times 224 \times 3$
- **Type**: RGB Image

### Layer Calculations
Each convolution layer follows this formula:
$$\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1$$

### Layer by Layer Breakdown

| Layer      | Input Size                | Operation                   | Output Size                | Parameters  |
|------------|---------------------------|-----------------------------|----------------------------|-------------|
| Conv1      | $224 \times 224 \times 3$ | Conv 7×7, stride 2, padding 3 | $112 \times 112 \times 64$ | 9,472       |
| Pool1      | $112 \times 112 \times 64$| MaxPool 3×3, stride 2       | $56 \times 56 \times 64$   | 0           |
| Conv2      | $56 \times 56 \times 64$  | Conv 1×1, 64 filters         | $56 \times 56 \times 64$   | 4,160       |
| Conv3      | $56 \times 56 \times 64$  | Conv 3×3, 128 filters        | $54 \times 54 \times 128$  | 73,856      |
| Pool2      | $54 \times 54 \times 128$ | MaxPool 3×3, stride 2       | $26 \times 26 \times 128$  | 0           |
| Inception1 | $26 \times 26 \times 128$ | Inception Module            | $26 \times 26 \times 256$  | 15,000+     |
| ...        | ...                       | ...                         | ...                        | ...         |
| Pool3      | $26 \times 26 \times N$   | Average Pooling             | $1 \times 1 \times N$      | 0           |
| Output     | $1 \times 1 \times N$     | Softmax                     | $1 \times 1 \times K$      | 0           |

### Summary Table of Parameters

| Parameter                  | Value                      |
|----------------------------|----------------------------|
| Input Size                 | $224 \times 224 \times 3$  |
| Number of Layers           | 22 (total layers in Inception-v1) |
| Total Parameters            | 5,000,000+ (depends on specific version) |
| Training Time              | Varies based on dataset   |

### Sequel of Which Model?
Inception has evolved into several successors, including:
- **Inception-v2**: Improved performance and reduced computational cost with factorized convolutions.
- **Inception-v3**: Further enhancements with label smoothing and auxiliary logits.
- **Inception-ResNet**: Combines Inception modules with residual connections for better gradient flow.

--- 
---


# GoogLeNet

## Overview
* **Year:** 2014
* **Authors:** Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, Andrew Rabinovich
* **Key Innovations:**
  * Introduced the Inception module, allowing for multiple filter sizes in parallel.
  * Used global average pooling instead of fully connected layers to reduce overfitting.
  * Employed 1x1 convolutions for dimensionality reduction, significantly reducing computational costs.
  * Implemented auxiliary classifiers during training to combat the vanishing gradient problem.
  * Achieved state-of-the-art performance on the ImageNet dataset at the time of release.
* **Successors:** No direct sequel, but it influenced later models like Inception v2 and Inception v3.

### Summary Table

| Item                                   | Value |
|----------------------------------------|-------|
| Layers with parameters (network depth) | 22    |
| Pooling Layers                         | 5     |
| Fully Connected Layers                 | 1 (the final linear classifier; global average pooling replaces the large fully connected layers) |
| Total Layers (including pooling)       | ~27   |
| Compute at inference                   | ~1.5 billion multiply-adds |
| Training setup                         | CPU-based DistBelief implementation, per the paper, which estimates about a week on a few high-end GPUs |

## Architecture Details

### Input Layer
* **Size:** $224 \times 224 \times 3$
* **Type:** RGB Image

### Layer Calculations
Each convolution layer follows this formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer     | Input Size               | Operation                         | Output Size               | Parameters (Inception rows: approximate, as reported in the paper) |
|-----------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1     | $224 \times 224 \times 3$| Conv 7×7, stride 2                | $112 \times 112 \times 64$| 9,408 (+64 biases) |
| Pool1     | $112 \times 112 \times 64$| MaxPool 3×3, stride 2            | $56 \times 56 \times 64$  | 0          |
| Conv2     | $56 \times 56 \times 64$ | Conv 1×1, stride 1                | $56 \times 56 \times 64$  | 4,096 (+64 biases) |
| Conv3     | $56 \times 56 \times 64$ | Conv 3×3, stride 1                | $56 \times 56 \times 192$ | 110,592 (+192 biases) |
| Pool2     | $56 \times 56 \times 192$| MaxPool 3×3, stride 2            | $28 \times 28 \times 192$ | 0          |
| Inception 3a | $28 \times 28 \times 192$| Inception Module               | $28 \times 28 \times 256$ | ~159K      |
| Inception 3b | $28 \times 28 \times 256$| Inception Module               | $28 \times 28 \times 480$ | ~380K      |
| Pool3     | $28 \times 28 \times 480$| MaxPool 3×3, stride 2            | $14 \times 14 \times 480$ | 0          |
| Inception 4a | $14 \times 14 \times 480$| Inception Module               | $14 \times 14 \times 512$ | ~364K      |
| Inception 4b | $14 \times 14 \times 512$| Inception Module               | $14 \times 14 \times 512$ | ~437K      |
| Inception 4c | $14 \times 14 \times 512$| Inception Module               | $14 \times 14 \times 512$ | ~463K      |
| Inception 4d | $14 \times 14 \times 512$| Inception Module               | $14 \times 14 \times 528$ | ~580K      |
| Inception 4e | $14 \times 14 \times 528$| Inception Module               | $14 \times 14 \times 832$ | ~840K      |
| Pool4     | $14 \times 14 \times 832$| MaxPool 3×3, stride 2            | $7 \times 7 \times 832$   | 0          |
| Inception 5a | $7 \times 7 \times 832$ | Inception Module                | $7 \times 7 \times 832$   | ~1,072K    |
| Inception 5b | $7 \times 7 \times 832$ | Inception Module                | $7 \times 7 \times 1024$  | ~1,388K    |
| Pool5     | $7 \times 7 \times 1024$ | Average Pooling 7×7              | $1 \times 1 \times 1024$  | 0          |
| FC        | $1 \times 1 \times 1024$ | Linear layer (1000 classes), then softmax | $1 \times 1 \times 1000$ | 1,025,000 |

### Parameter Calculation Example
For Conv1 layer:
* Weights: $7 \times 7 \times 3 \times 64 = 9,408$
* Biases: $64$
* Total: $9,408 + 64 = 9,472$ parameters
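
The same bookkeeping extends to a whole Inception module, where four parallel branches are concatenated along the channel axis. The sketch below assumes the filter counts listed for the first Inception module (3a) in the GoogLeNet paper: 64 for the 1x1 branch, 96 then 128 for the 3x3 branch, 16 then 32 for the 5x5 branch, and 32 for the pooling projection. The dictionary layout is an assumption for the example.

```python
c_in = 192  # channels entering the first Inception module

# Each branch is a list of (filter size, input channels, output channels).
branches = {
    "1x1":       [(1, c_in, 64)],
    "3x3":       [(1, c_in, 96), (3, 96, 128)],   # 1x1 reduction, then 3x3
    "5x5":       [(1, c_in, 16), (5, 16, 32)],    # 1x1 reduction, then 5x5
    "pool_proj": [(1, c_in, 32)],                 # 1x1 projection after max pooling
}

weights = sum(f * f * ci * co for convs in branches.values() for f, ci, co in convs)
biases = sum(co for convs in branches.values() for _, _, co in convs)
out_channels = sum(convs[-1][2] for convs in branches.values())

print(f"Weights: {weights:,}")               # 163,328
print(f"Biases:  {biases:,}")                # 368
print(f"Total:   {weights + biases:,}")      # 163,696
print(f"Output channels: {out_channels}")    # 256
```

The 1x1 reductions keep the 3x3 and 5x5 branches cheap; without them the 5x5 branch alone would need $5 \times 5 \times 192 \times 32 = 153,600$ weights.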

### Total Parameters
* Total parameters for GoogLeNet: Approximately 5 to 7 million, depending on how they are counted (the paper's per-layer table sums to roughly 6.8 million).


### **<span style="color:red">Switching to object detection and tracking models</span>**

# Localization and Sliding Window in Object Detection

**Localization** refers to the process of identifying the location of an object within an image. In the context of object detection, it involves determining the precise bounding box coordinates (x, y, width, height) that enclose the object of interest.

## Key Points
- **Bounding Box**: A rectangular box that is drawn around the object to indicate its position.
- **Applications**: Localization is essential in various applications, such as facial recognition, autonomous driving, and surveillance systems.
- **Metrics**: Common metrics for evaluating localization accuracy include Intersection Over Union (IoU), as it measures how well the predicted bounding box aligns with the ground truth bounding box.

## Sliding Window
The **Sliding Window** technique is a method used to detect objects in images by systematically scanning the image at various locations and scales. It involves the following steps:

1. **Window Definition**: A fixed-size window (or bounding box) is defined to slide across the image.
2. **Window Movement**: The window is moved across the image in small, overlapping steps (both horizontally and vertically).
3. **Classification**: At each position of the window, a classifier (like a neural network) is applied to determine if the window contains the object of interest.
4. **Scaling**: The window can be resized to detect objects at different scales, allowing for the detection of objects of various sizes within the same image.
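
The steps above can be sketched as a generator that yields every window position, which also makes the cost of the exhaustive search visible. The image size, window sizes, and step are assumptions chosen for illustration, and the classifier call is left out.

```python
def sliding_windows(img_w, img_h, win, step):
    """Yield (x, y, width, height) for every position of a square window."""
    for y in range(0, img_h - win + 1, step):
        for x in range(0, img_w - win + 1, step):
            yield (x, y, win, win)


def count_windows(img_w, img_h, sizes, step):
    """Number of classifier evaluations needed for the given window sizes."""
    return sum(1 for win in sizes for _ in sliding_windows(img_w, img_h, win, step))


single_scale = count_windows(224, 224, sizes=[64], step=8)
multi_scale = count_windows(224, 224, sizes=[32, 64, 128], step=8)
fine_step = count_windows(224, 224, sizes=[64], step=4)

print(f"One scale, step 8:    {single_scale}")   # 441
print(f"Three scales, step 8: {multi_scale}")    # 1235
print(f"One scale, step 4:    {fine_step}")      # 1681
```

Halving the step roughly quadruples the number of windows, and every window would need its own classifier pass.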

## Key Points
- **Exhaustive Search**: The sliding window approach can be computationally expensive as it involves evaluating many regions of the image.
- **Combining with Other Techniques**: Sliding window methods are often combined with other techniques, like non-max suppression, to eliminate redundant detections and improve the final output.
- **Limitations**: The main limitation is the trade-off between detection accuracy and computational efficiency, as using small windows with small steps can lead to a high number of evaluations.

## Conclusion
Localization is a fundamental aspect of object detection that focuses on accurately identifying the position of objects within images. The sliding window technique is a widely used method for achieving this by systematically scanning the image, though it can be computationally intensive.


# Intersection Over Union (IoU)

**Intersection Over Union (IoU)** is a metric used to evaluate the performance of object detection algorithms. It quantifies the overlap between two bounding boxes: the ground truth (labeled output) and the predicted output.

## Definition
IoU is calculated using the following formula:

$$
\text{IoU} = \frac{\text{Intersection Area}}{\text{Union Area}}
$$

Where:
- **Intersection Area** is the area of overlap between the two bounding boxes.
- **Union Area** is the total area covered by both bounding boxes.

## Steps to Calculate IoU
1. **Calculate the Intersection Area**: This is the area where the two rectangles overlap.
2. **Calculate the Union Area**: This is the total area covered by both rectangles. It can be computed as:
   $$
   \text{Union Area} = \text{Area of Rectangle 1} + \text{Area of Rectangle 2} - \text{Intersection Area}
   $$
3. **Compute IoU**: Use the IoU formula stated above.
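
The three steps above translate directly into code. This sketch assumes axis-aligned boxes given as `(x1, y1, x2, y2)` corner coordinates; the function name and the sample boxes are illustrative only.

```python
def intersection_over_union(box_a, box_b):
    """Compute IoU for two axis-aligned boxes given as (x1, y1, x2, y2)."""
    # Step 1: intersection rectangle (width and height clamp to zero if disjoint).
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = inter_w * inter_h

    # Step 2: union = area 1 + area 2 - intersection.
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection

    # Step 3: the ratio, guarding against degenerate zero-area boxes.
    return intersection / union if union > 0 else 0.0


ground_truth = (50, 50, 150, 150)
predicted = (100, 100, 200, 200)
print(round(intersection_over_union(ground_truth, predicted), 4))  # 0.1429
```

Here the overlap is 50 × 50 = 2,500 and the union is 10,000 + 10,000 − 2,500 = 17,500, so the IoU is about 0.14.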

## Example
- **Ground Truth (Red Rectangle)**: The labeled output, representing the true location of the object.
- **Predicted Output (Purple Rectangle)**: The model's prediction for the same object.

## IoU Interpretation
- By convention, a prediction counts as correct when IoU ≥ 0.5; stricter evaluations use higher thresholds.
- IoU ranges from 0 (no overlap) to 1 (perfect overlap).
- The higher the IoU, the more closely the predicted box matches the ground truth.

## Conclusion
IoU is a crucial metric in evaluating object detection models, as it provides a clear understanding of how well the predicted bounding box matches the ground truth. A higher IoU indicates better model performance.


# R-CNN (Regions with Convolutional Neural Networks)

Instead of classifying every sliding-window position, R-CNN picks a limited set of candidate windows and runs a ConvNet classifier on each of them.

## Overview
* **Year:** 2014
* **Authors:** Ross B. Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik
* **Key Innovations:**
  * Combined region proposals with CNN features for object classification and bounding box regression.
  * Used a two-stage process: first generating proposals, then classifying each one.
  * Generated the proposals with the selective search algorithm (about 2,000 per image) rather than a learned network; the Region Proposal Network (RPN) was introduced later by Faster R-CNN.
  * Achieved state-of-the-art performance on the PASCAL VOC dataset at the time of release.
* **Sequel to:** No direct predecessor; it was followed by Fast R-CNN, Faster R-CNN, and Mask R-CNN.

### Summary Table

| Layer Type              | Count | Parameters |
|-------------------------|-------|------------|
| Convolutional Layers    | 5     | 2,310,000  |
| Region Proposal Network  | 1     | 2,200,000  |
| Total Layers            | 6     | 4,510,000  |

### Computational Power Table

| Hardware Used              | FLOPs          | Computation Time | Hours Taken |
|----------------------------|----------------|------------------|-------------|
| 2 NVIDIA K40 GPUs          | ~10 billion    | 2 days           | ~48 hours   |

## Architecture Details

### Input Layer
* **Size:** $224 \times 224 \times 3$
* **Type:** RGB Image

### Layer Calculations
Each convolution layer follows this formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$
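
As a quick check of the formula, the helper below reproduces the Conv1 and Pool1 output sizes listed in the layer table. The table does not state padding, so the padding values used here (3 for the 7×7 convolution, 1 for the 3×3 pooling) are assumptions chosen to match the listed sizes. When the division is not exact, the result is rounded down.

```python
def conv_output_size(input_size, filter_size, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (floor division)."""
    return (input_size + 2 * padding - filter_size) // stride + 1


# Conv1: 224 -> 112 with a 7x7 filter, stride 2, assumed padding 3.
print(conv_output_size(224, 7, stride=2, padding=3))  # 112
# Pool1: 112 -> 56 with a 3x3 window, stride 2, assumed padding 1.
print(conv_output_size(112, 3, stride=2, padding=1))  # 56
```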

### Layer by Layer Breakdown

| Layer           | Input Size               | Operation                         | Output Size               | Parameters |
|------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1           | $224 \times 224 \times 3$| Conv 7×7, stride 2                | $112 \times 112 \times 64$| 9,408      |
| Pool1           | $112 \times 112 \times 64$| MaxPool 3×3, stride 2            | $56 \times 56 \times 64$  | 0          |
| Conv2           | $56 \times 56 \times 64$ | Conv 3×3, stride 1                | $56 \times 56 \times 192$ | 110,592    |
| Pool2           | $56 \times 56 \times 192$| MaxPool 3×3, stride 2            | $28 \times 28 \times 192$ | 0          |
| Conv3           | $28 \times 28 \times 192$| Conv 1×1, stride 1                | $28 \times 28 \times 128$ | 24,576     |
| Conv4           | $28 \times 28 \times 128$| Conv 1×1, stride 1                | $28 \times 28 \times 128$ | 16,384     |
| Conv5           | $28 \times 28 \times 128$| Conv 1×1, stride 1                | $28 \times 28 \times 64$  | 8,192      |
| RPN              | $28 \times 28 \times 64$ | Region Proposal Network           | $28 \times 28 \times 256$ | 2,200,000  |

### Parameter Calculation Example
For Conv1 layer:
* Weights: $7 \times 7 \times 3 \times 64 = 9,408$
* Biases: $64$
* Total: $9,408 + 64 = 9,472$ parameters
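
The same arithmetic as a small helper, which makes it easy to check other rows. The function name is illustrative. Note that the layer table lists 9,408 for Conv1, which is the weight count without biases.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels, bias=True):
    """Number of learnable parameters in a 2D convolution layer."""
    weights = kernel_h * kernel_w * in_channels * out_channels
    return weights + (out_channels if bias else 0)


print(conv_params(7, 7, 3, 64))                 # 9472 (weights + biases)
print(conv_params(7, 7, 3, 64, bias=False))     # 9408 (weights only)
print(conv_params(1, 1, 192, 128, bias=False))  # 24576, a 1x1 convolution
```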

### Total Parameters
* Total parameters for RCNN: Approximately 4.5 million.


---

# Fast R-CNN

## Overview
* **Year:** 2015
* **Authors:** Ross B. Girshick
* **Key Innovations:**
  * Introduced a single-stage training process with a multi-task loss, replacing R-CNN's separate training stages for the CNN, the classifiers, and the box regressors. Region proposals still come from an external method such as selective search.
  * Utilized a softmax layer for multi-class classification in place of per-class SVMs.
  * Implemented ROI pooling to extract a fixed-size feature for each proposed region from the shared convolutional feature map.
  * Reduced computation time significantly compared to its predecessor, R-CNN, by running the convolutional layers once per image and sharing that computation across proposals.
  * Achieved state-of-the-art performance on the PASCAL VOC dataset and improved the speed of object detection.
* **Sequel to:** Fast R-CNN is a refinement of the original R-CNN model; it was followed by Faster R-CNN and Mask R-CNN.
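
ROI pooling can be illustrated on a single-channel feature map: the region is split into a fixed grid of bins and the maximum of each bin is kept, so every region yields the same output size. The sketch below is a simplified, pure-Python version. The bin-rounding rule (floor for the start, ceiling for the end) and the 2×2 output are assumptions for the example, and real implementations differ in these details.

```python
def roi_max_pool(feature_map, roi, output_size=(2, 2)):
    """Max-pool one ROI of a 2D feature map into a fixed-size grid.

    feature_map: list of rows (H x W) for a single channel.
    roi: (x1, y1, x2, y2) in feature-map coordinates, end-exclusive.
    """
    x1, y1, x2, y2 = roi
    out_h, out_w = output_size
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Floor the bin start and ceil the bin end so bins cover the whole ROI.
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


fm = [[1, 2, 3, 4],
      [5, 6, 7, 8],
      [9, 10, 11, 12],
      [13, 14, 15, 16]]
print(roi_max_pool(fm, (0, 0, 4, 4)))  # [[6, 8], [14, 16]]
print(roi_max_pool(fm, (1, 1, 4, 4)))  # [[11, 12], [15, 16]]
```

Both regions produce a 2×2 output even though one is 4×4 and the other 3×3, which is what lets the following fully connected layers accept proposals of any size.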

### Summary Table

| Layer Type              | Count | Parameters |
|-------------------------|-------|------------|
| Convolutional Layers    | 5     | 2,200,000  |
| ROI Pooling Layer       | 1     | 0          |
| Total Layers            | 6     | 2,200,000  |

### Computational Power Table

| Hardware Used              | FLOPs          | Computation Time | Hours Taken |
|----------------------------|----------------|------------------|-------------|
| 1 NVIDIA K40 GPU           | ~5 billion     | 1 day            | ~24 hours   |

## Architecture Details

### Input Layer
* **Size:** $224 \times 224 \times 3$
* **Type:** RGB Image

### Layer Calculations
Each convolution layer follows this formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer           | Input Size               | Operation                         | Output Size               | Parameters |
|------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1           | $224 \times 224 \times 3$| Conv 7×7, stride 2                | $112 \times 112 \times 64$| 9,408      |
| Pool1           | $112 \times 112 \times 64$| MaxPool 3×3, stride 2            | $56 \times 56 \times 64$  | 0          |
| Conv2           | $56 \times 56 \times 64$ | Conv 3×3, stride 1                | $56 \times 56 \times 192$ | 110,592    |
| Pool2           | $56 \times 56 \times 192$| MaxPool 3×3, stride 2            | $28 \times 28 \times 192$ | 0          |
| Conv3           | $28 \times 28 \times 192$| Conv 1×1, stride 1                | $28 \times 28 \times 128$ | 24,576     |
| Conv4           | $28 \times 28 \times 128$| Conv 1×1, stride 1                | $28 \times 28 \times 128$ | 16,384     |
| Conv5           | $28 \times 28 \times 128$| Conv 1×1, stride 1                | $28 \times 28 \times 64$  | 8,192      |
| ROI Pooling      | $28 \times 28 \times 64$ | ROI Pooling                       | Varies (based on ROIs)    | 0          |

### Parameter Calculation Example
For Conv1 layer:
* Weights: $7 \times 7 \times 3 \times 64 = 9,408$
* Biases: $64$
* Total: $9,408 + 64 = 9,472$ parameters

### Total Parameters
* Total parameters for Fast R-CNN: Approximately 2.2 million.

---

# 3D Convolutional Networks (3D ConvNets) Architecture

## Overview
* **Year:** 2015
* **Authors:** Du Tran, Lubomir Bourdev, Rob Fergus, Lorenzo Torresani, Manohar Paluri
* **Key Innovations:**
  * Introduced 3D convolutional layers that operate over 3D volumes (e.g., video frames), enabling the model to capture spatial and temporal features simultaneously.
  * Utilized pooling layers in 3D, enhancing the network's ability to learn features from videos and volumetric data.
  * Achieved state-of-the-art performance on action recognition tasks in video datasets.

### Summary Table

| Layer Type              | Count | Parameters       |
|-------------------------|-------|------------------|
| 3D Convolutional Layers  | 5     | 2,205,000        |
| Fully Connected Layers    | 2     | 6,590,000        |
| Total Layers            | 7     | 8,795,000        |

### Computational Power Table

| Hardware Used              | FLOPs          | Computation Time | Hours Taken |
|----------------------------|----------------|------------------|-------------|
| 4 NVIDIA Titan X GPUs      | ~10 billion    | 2 days           | ~48 hours   |

## Architecture Details

### Input Layer
* **Size:** $N \times T \times H \times W \times C$ (where $N$ is the batch size, $T$ is the number of frames, $H$ is height, $W$ is width, and $C$ is channels)
* **Type:** Video (3D data)

### Layer Calculations
Each 3D convolution layer follows the formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$
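
In 3D the formula is applied independently to the temporal, height, and width dimensions, and the kernel gains a temporal extent, which changes the parameter count. The sketch below illustrates both points; the clip size (16 frames of 112×112), the 3 input channels, and the 64 filters are assumptions for the example rather than values taken from the tables.

```python
def conv3d_output_shape(t, h, w, kernel=(3, 3, 3), stride=(1, 1, 1), padding=(1, 1, 1)):
    """Apply the output-size formula to each of the T, H, W dimensions."""
    return tuple(
        (d + 2 * p - k) // s + 1
        for d, k, s, p in zip((t, h, w), kernel, stride, padding)
    )


def conv3d_params(kernel, in_channels, out_channels, bias=True):
    """Learnable parameters of a 3D convolution layer."""
    kt, kh, kw = kernel
    weights = kt * kh * kw * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# A 3x3x3 kernel with padding 1 and stride 1 preserves the clip shape.
print(conv3d_output_shape(16, 112, 112))  # (16, 112, 112)
# 3*3*3*3*64 weights + 64 biases.
print(conv3d_params((3, 3, 3), 3, 64))    # 5248
```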

### Layer by Layer Breakdown

| Layer          | Input Size                | Operation                 | Output Size                | Parameters  |
|----------------|---------------------------|---------------------------|----------------------------|-------------|
| Conv3D_1      | $N \times T \times 16 \times 112 \times 112$ | 3D Conv (3x3x3)        | $N \times T \times 16 \times 112 \times 112$ | 1,472,000   |
| Conv3D_2      | $N \times T \times 16 \times 112 \times 112$ | 3D Conv (3x3x3)        | $N \times T \times 32 \times 56 \times 56$  | 1,228,800   |
| Pool3D        | $N \times T \times 32 \times 56 \times 56$  | 3D Max Pooling (2x2x2) | $N \times T \times 32 \times 28 \times 28$  | 0           |
| Conv3D_3      | $N \times T \times 32 \times 28 \times 28$  | 3D Conv (3x3x3)        | $N \times T \times 64 \times 28 \times 28$  | 1,024,000   |
| Conv3D_4      | $N \times T \times 64 \times 28 \times 28$  | 3D Conv (3x3x3)        | $N \times T \times 128 \times 14 \times 14$ | 2,000,000   |
| Pool3D        | $N \times T \times 128 \times 14 \times 14$ | 3D Max Pooling (2x2x2) | $N \times T \times 128 \times 7 \times 7$   | 0           |
| FC            | $N \times T \times 128 \times 7 \times 7$   | Fully Connected         | $N \times num\_classes$    | 6,590,000   |

### Total Parameters
* Total parameters for 3D ConvNets: Approximately 8,795,000.

---

# SSD (Single Shot MultiBox Detector) Architecture

## Overview
* **Year:** 2016
* **Authors:** Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg
* **Key Innovations:**
  * Predicts class scores and bounding box offsets in a single forward pass, with no separate region proposal stage.
  * Uses multi-scale feature maps to detect objects of various sizes.
  * Employs a single deep neural network for real-time object detection.
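
The figure of 8732 bounding boxes that appears in the layer table follows from the multi-scale design: each prediction layer contributes a fixed number of default boxes per feature-map cell. The sketch below assumes the feature-map sizes and boxes-per-location of the published SSD300 configuration; these values are not stated elsewhere in these notes.

```python
# (feature-map side length, default boxes per location) for six prediction layers.
# Assumed SSD300 configuration.
SSD300_LAYERS = [(38, 4), (19, 6), (10, 6), (5, 6), (3, 4), (1, 4)]


def total_default_boxes(layers):
    """Sum side * side * boxes_per_location over all prediction layers."""
    return sum(side * side * boxes for side, boxes in layers)


print(total_default_boxes(SSD300_LAYERS))  # 8732
```

The 38×38 map alone accounts for 5,776 of the boxes, which is why the finest feature map dominates the detection of small objects.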


### Summary Table

| Layer Type              | Count | Parameters    |
|-------------------------|-------|---------------|
| Convolutional Layers    | 6     | 5,304,000     |
| Fully Connected Layers   | 3     | 1,260,000     |
| **Total Layers**        | **9** | **6,564,000** |

### Computational Power Table

| Hardware Used              | FLOPs          | Computation Time | Hours Taken |
|----------------------------|----------------|------------------|-------------|
| 2 NVIDIA K40 GPUs          | ~15 billion    | 1 day            | ~24 hours   |


## Architecture Details

### Input Layer
* **Size:** $300 \times 300 \times 3$
* **Type:** RGB Image

### Layer Calculations
Each convolution layer follows the formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer               | Input Size               | Operation                         | Output Size               | Parameters |
|---------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1               | $300 \times 300 \times 3$| Conv 3×3, stride 2                | $150 \times 150 \times 64$| 1,728      |
| Conv2               | $150 \times 150 \times 64$| Conv 3×3, stride 1               | $150 \times 150 \times 128$| 73,728     |
| Conv3               | $150 \times 150 \times 128$| Conv 3×3, stride 1              | $150 \times 150 \times 256$| 295,168    |
| Conv4               | $150 \times 150 \times 256$| Conv 3×3, stride 1              | $150 \times 150 \times 512$| 1,180,160  |
| Conv5               | $150 \times 150 \times 512$| Conv 3×3, stride 1              | $150 \times 150 \times 512$| 2,359,296  |
| Conv6               | $150 \times 150 \times 512$| Conv 3×3, stride 2              | $75 \times 75 \times 512$  | 2,359,296  |
| Conv7               | $75 \times 75 \times 512$  | Conv 3×3, stride 1              | $75 \times 75 \times 256$  | 1,179,648  |
| Conv8               | $75 \times 75 \times 256$  | Conv 3×3, stride 1              | $75 \times 75 \times 256$  | 590,080    |
| Conv9               | $75 \times 75 \times 256$  | Conv 3×3, stride 1              | $75 \times 75 \times 128$  | 295,040    |
| Conv10              | $75 \times 75 \times 128$  | Conv 3×3, stride 1              | $75 \times 75 \times 128$  | 147,584    |
| Conv11              | $75 \times 75 \times 128$  | Conv 3×3, stride 1              | $75 \times 75 \times 64$   | 73,792     |
| Conv12              | $75 \times 75 \times 64$   | Conv 3×3, stride 1              | $75 \times 75 \times 64$   | 36,864     |
| Conv13              | $75 \times 75 \times 64$   | Conv 3×3, stride 1              | $75 \times 75 \times 32$   | 18,432     |
| Fully Connected     | $75 \times 75 \times 32$   | FC Layer                          | 8732 (bounding boxes)      | 1,000,000  |

### Total Parameters
* Total parameters for SSD: Approximately 9.6 million, taking the sum of the Parameters column in the layer table above.

## Summary of Improvements
- **SSD with MobileNet (2017):**
  - Introduced a lightweight version for mobile applications.
  - Achieved a good balance between speed and accuracy.

- **SSD512 (2016):**
  - Used a larger input size of $512 \times 512$ for improved performance.
  
- **Enhanced SSD Models (2018-2020):**
  - Focused on better anchor box selection and improved loss functions.
  - Integrated features from other state-of-the-art models for enhanced accuracy.

---

# Faster R-CNN

## Overview
* **Year:** 2015
* **Authors:** Shaoqing Ren, Kaiming He, Ross B. Girshick, Jian Sun
* **Key Innovations:**
  * Introduced a Region Proposal Network (RPN) that shares convolutional features with the detection network, significantly speeding up the proposal generation process.
  * Eliminated the need for external region proposal algorithms, making the model fully end-to-end trainable.
  * Improved detection accuracy by allowing the RPN to generate high-quality region proposals for the object detection task.
  * Utilized anchor boxes for better localization of objects in various shapes and sizes.
  * Achieved state-of-the-art results on the PASCAL VOC and COCO datasets while improving the overall processing speed.
* **Sequel to:** Builds upon the Fast R-CNN model and has inspired subsequent architectures like Mask R-CNN.
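
Anchor boxes are a fixed set of reference boxes placed at every feature-map location; the RPN scores each one and regresses offsets from it. The sketch below generates the anchors for one location, assuming three scales and three aspect ratios (nine anchors per location). The function name and the convention that ratio means height divided by width are illustrative choices.

```python
import math


def generate_anchors(center_x, center_y, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return (x1, y1, x2, y2) anchors centred on one location.

    ratio = height / width; every anchor keeps an area of scale ** 2.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((center_x - width / 2, center_y - height / 2,
                            center_x + width / 2, center_y + height / 2))
    return anchors


anchors = generate_anchors(0, 0)
print(len(anchors))  # 9
print(anchors[1])    # (-64.0, -64.0, 64.0, 64.0), the square 128x128 anchor
```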

### Summary Table

| Layer Type              | Count | Parameters |
|-------------------------|-------|------------|
| Convolutional Layers    | 5     | 2,500,000  |
| Region Proposal Network  | 1     | 2,200,000  |
| ROI Pooling Layer       | 1     | 0          |
| Total Layers            | 7     | 4,700,000  |

### Computational Power Table

| Hardware Used              | FLOPs          | Computation Time | Hours Taken |
|----------------------------|----------------|------------------|-------------|
| 2 NVIDIA K80 GPUs          | ~10 billion    | 1 day            | ~24 hours   |

## Architecture Details

### Input Layer
* **Size:** $224 \times 224 \times 3$
* **Type:** RGB Image

### Layer Calculations
Each convolution layer follows this formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer               | Input Size               | Operation                         | Output Size               | Parameters |
|---------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1               | $224 \times 224 \times 3$| Conv 7×7, stride 2                | $112 \times 112 \times 64$| 9,408      |
| Pool1               | $112 \times 112 \times 64$| MaxPool 3×3, stride 2            | $56 \times 56 \times 64$  | 0          |
| Conv2               | $56 \times 56 \times 64$ | Conv 3×3, stride 1                | $56 \times 56 \times 192$ | 110,592    |
| Pool2               | $56 \times 56 \times 192$| MaxPool 3×3, stride 2            | $28 \times 28 \times 192$ | 0          |
| Conv3               | $28 \times 28 \times 192$| Conv 1×1, stride 1                | $28 \times 28 \times 128$ | 24,576     |
| Conv4               | $28 \times 28 \times 128$| Conv 1×1, stride 1                | $28 \times 28 \times 128$ | 16,384     |
| Conv5               | $28 \times 28 \times 128$| Conv 1×1, stride 1                | $28 \times 28 \times 64$  | 8,192      |
| RPN                  | $28 \times 28 \times 64$ | Region Proposal Network           | $28 \times 28 \times 256$ | 2,200,000  |
| ROI Pooling          | Varies (based on ROIs)  | ROI Pooling                       | Varies (based on ROIs)    | 0          |

### Parameter Calculation Example
For Conv1 layer:
* Weights: $7 \times 7 \times 3 \times 64 = 9,408$
* Biases: $64$
* Total: $9,408 + 64 = 9,472$ parameters

### Total Parameters
* Total parameters for Faster R-CNN: Approximately 4.7 million.

---



### Summary Table for YOLO Versions

| Version      | Year | Key Innovations                                             | Parameters       |
|--------------|------|-----------------------------------------------------------|-------------------|
| YOLOv1      | 2016 | Single-stage detector, grid-based bounding box prediction  | 7,055,221         |
| YOLOv2      | 2017 | Improved mAP, anchor boxes, multi-scale training, batch normalization | 19,500,000        |
| YOLOv3      | 2018 | Multi-label classification, feature pyramid networks      | 61,000,000        |
| YOLOv4      | 2020 | CSPNet, PANet, self-adversarial training                  | 64,000,000        |
| YOLOv5      | 2020 | PyTorch implementation, multiple model sizes              | 7M - 46M (varies) |
| YOLOv6      | 2022 | Enhanced speed and accuracy for real-time applications     | 12M - 40M (varies) |
| YOLOv7      | 2022 | Improved performance on benchmarks, new training methods   | 8M - 60M (varies) |

### Computational Power Table for YOLO Versions

| Version      | Hardware Used            | FLOPs          | Computation Time | Hours Taken |
|--------------|--------------------------|----------------|------------------|-------------|
| YOLOv1      | 1 NVIDIA Titan X        | ~40 billion    | 3 days           | ~72 hours   |
| YOLOv2      | 2 NVIDIA Titan X        | ~10 billion    | 2 days           | ~48 hours   |
| YOLOv3      | 2 NVIDIA RTX 2080 Ti    | ~66 billion    | 1 day            | ~24 hours   |
| YOLOv4      | 2 NVIDIA A100 GPUs       | ~57 billion    | 1 day            | ~24 hours   |
| YOLOv5      | 2 NVIDIA V100 GPUs       | ~18 billion    | 1 day            | ~24 hours   |
| YOLOv6      | 2 NVIDIA A100 GPUs       | ~12 billion    | 1 day            | ~24 hours   |
| YOLOv7      | 2 NVIDIA A100 GPUs       | ~15 billion    | 1 day            | ~24 hours   |






# YOLO (You Only Look Once) Architectures

## YOLOv1

### Overview
* **Year:** 2016
* **Authors:** Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi
* **Key Innovations:**
  * Real-time object detection at 45 FPS.
  * Single neural network architecture for both localization and classification.
  * Divided the image into an $S \times S$ grid and predicted bounding boxes and class probabilities directly.
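
The grid formulation fixes the size of the output tensor: each of the $S \times S$ cells predicts $B$ boxes (four coordinates plus a confidence score each) and $C$ class probabilities. The sketch below assumes the commonly cited PASCAL VOC setting of $S = 7$, $B = 2$, $C = 20$, which gives the 1470 output values listed in the layer table.

```python
def yolo_v1_output_size(grid=7, boxes_per_cell=2, num_classes=20):
    """Return (S, S, values per cell, total values) for a YOLOv1-style output."""
    per_cell = boxes_per_cell * 5 + num_classes  # 5 = x, y, w, h, confidence
    return grid, grid, per_cell, grid * grid * per_cell


print(yolo_v1_output_size())  # (7, 7, 30, 1470)
```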

### Architecture Details

### Input Layer
* **Size:** $448 \times 448 \times 3$
* **Type:** RGB Image

### Layer Calculations
Each convolution layer follows the formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer               | Input Size               | Operation                         | Output Size               | Parameters |
|---------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1               | $448 \times 448 \times 3$| Conv 7×7, stride 2                | $224 \times 224 \times 64$| 9,472      |
| Conv2               | $224 \times 224 \times 64$| Conv 3×3, stride 1               | $224 \times 224 \times 192$| 110,784    |
| Conv3               | $224 \times 224 \times 192$| Conv 1×1, stride 1              | $224 \times 224 \times 128$| 24,576     |
| Conv4               | $224 \times 224 \times 128$| Conv 3×3, stride 1              | $224 \times 224 \times 256$| 295,168    |
| Conv5               | $224 \times 224 \times 256$| Conv 1×1, stride 1              | $224 \times 224 \times 256$| 65,536     |
| Conv6               | $224 \times 224 \times 256$| Conv 3×3, stride 1              | $224 \times 224 \times 512$| 1,180,160  |
| Fully Connected 1   | $7 \times 7 \times 512$  | FC Layer                          | 4096                      | 1,049,088  |
| Fully Connected 2   | 4096                     | FC Layer                          | 1470 (bounding boxes)     | 6,053,100  |

### Total Parameters
* Total parameters for YOLOv1: Approximately 7 million.

## YOLOv2

### Overview
* **Year:** 2017
* **Authors:** Joseph Redmon, Ali Farhadi
* **Key Innovations:**
  * Introduced multi-scale training for improved performance.
  * Added anchor boxes for better localization.
  * Integrated batch normalization for faster convergence.

### Architecture Details

### Input Layer
* **Size:** $416 \times 416 \times 3$
* **Type:** RGB Image

### Layer by Layer Breakdown

| Layer               | Input Size               | Operation                         | Output Size               | Parameters |
|---------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1               | $416 \times 416 \times 3$| Conv 3×3, stride 1                | $416 \times 416 \times 16$| 432        |
| Conv2               | $416 \times 416 \times 16$| Conv 3×3, stride 1               | $416 \times 416 \times 32$| 4,640      |
| Conv3               | $416 \times 416 \times 32$| Conv 3×3, stride 1               | $416 \times 416 \times 64$| 18,496     |
| Conv4               | $416 \times 416 \times 64$| Conv 3×3, stride 1               | $416 \times 416 \times 128$| 73,728     |
| Conv5               | $416 \times 416 \times 128$| Conv 3×3, stride 1              | $416 \times 416 \times 256$| 295,168    |
| Conv6               | $416 \times 416 \times 256$| Conv 3×3, stride 1              | $416 \times 416 \times 512$| 1,180,160  |
| Fully Connected     | $13 \times 13 \times 1024$| FC Layer                          | 1470 (bounding boxes)     | 3,655,700  |

### Total Parameters
* Total parameters for YOLOv2: Approximately 19 million.

## YOLOv3

### Overview
* **Year:** 2018
* **Authors:** Joseph Redmon, Ali Farhadi
* **Key Innovations:**
  * Introduced a feature pyramid network to detect objects at different scales.
  * Improved the detection head with multiple scales.
  * Employed logistic regression for objectness score prediction.

### Architecture Details

### Input Layer
* **Size:** $608 \times 608 \times 3$
* **Type:** RGB Image

### Layer by Layer Breakdown

| Layer               | Input Size               | Operation                         | Output Size               | Parameters |
|---------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1               | $608 \times 608 \times 3$| Conv 3×3, stride 1                | $606 \times 606 \times 32$| 896        |
| Conv2               | $606 \times 606 \times 32$| Conv 3×3, stride 1               | $604 \times 604 \times 64$| 18,496     |
| Conv3               | $604 \times 604 \times 64$| Conv 3×3, stride 1               | $602 \times 602 \times 128$| 73,728     |
| Conv4               | $602 \times 602 \times 128$| Conv 3×3, stride 1              | $600 \times 600 \times 256$| 295,168    |
| Conv5               | $600 \times 600 \times 256$| Conv 3×3, stride 1              | $598 \times 598 \times 512$| 1,180,160  |
| Conv6               | $598 \times 598 \times 512$| Conv 3×3, stride 1              | $596 \times 596 \times 1024$| 4,719,360  |
| Fully Connected     | $19 \times 19 \times 1024$| FC Layer                          | 1470 (bounding boxes)     | 3,655,700  |

### Total Parameters
* Total parameters for YOLOv3: Approximately 61 million.

## Summary of Improvements
- **YOLOv4 (2020):**
  - Integrated CSPNet architecture for enhanced accuracy.
  - Introduced data augmentation techniques like Mosaic.
  
- **YOLOv5 (2020):**
  - Focused on usability with a PyTorch implementation.
  - Offered different model sizes for scalability.

- **YOLOv6 (2022):**
  - Improved architecture for inference speed and accuracy.
  
- **YOLOv7 (2022):**
  - Introduced dynamic label assignment and achieved state-of-the-art performance on benchmarks.

---

# Detectron 2 Architecture

## Overview
* **Year:** 2019
* **Authors:** Facebook AI Research (FAIR)
* **Key Innovations:**
  * Built on the original Detectron framework with improved modularity and flexibility.
  * Incorporates advanced deep learning techniques such as Mask R-CNN for instance segmentation.
  * Utilizes the PyTorch framework for easier model customization and training.
  * Supports a wide range of architectures (e.g., ResNet, FPN) and backbones.
  * Provides strong baseline models for object detection and segmentation tasks.

## Architecture Details

### Input Layer
* **Size:** Variable (Commonly $800 \times 1333$)
* **Type:** RGB Image

### Layer Calculations
Each convolution layer follows the formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer               | Input Size               | Operation                         | Output Size               | Parameters |
|---------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Backbone (ResNet)   | Variable                  | ResNet with FPN                   | Variable                  | Variable   |
| RPN (Region Proposal Network) | Variable      | Conv 3×3, stride 1               | Variable                  | Variable   |
| ROI Align           | Variable                  | ROI Align Layer                   | Variable                  | Variable   |
| Mask Branch         | Variable                  | Conv 3×3, stride 1                | Variable                  | Variable   |
| Box Branch          | Variable                  | Fully Connected Layer             | Variable                  | Variable   |

### Total Parameters
* Total parameters for Detectron 2: Approximately 50 million (depends on the backbone).

## Computational Power

### Computation Time
| Hardware Used          | Days | FLOPs          |
|------------------------|------|----------------|
| NVIDIA V100 GPU       | 2    | 300 GFLOPs     |
| NVIDIA A100 GPU       | 1    | 450 GFLOPs     |

### Key Improvements and Features
- **Modularity:** Components are easily interchangeable, allowing for rapid experimentation.
- **Multi-task Learning:** Supports detection, instance segmentation, and keypoint detection in a unified framework.
- **Performance:** Achieves state-of-the-art results on COCO dataset benchmarks.

Detectron 2's flexibility and robust architecture make it suitable for a variety of object detection and segmentation tasks.

---

 
# DETR (DEtection TRansformer)

## Overview
* **Year:** 2020
* **Authors:** Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, Sergey Zagoruyko
* **Key Innovations:**
  * Introduced a transformer-based architecture for object detection, treating it as a direct set prediction problem.
  * Utilizes attention mechanisms to capture long-range dependencies in images, improving the handling of complex scenes.
  * Eliminates the need for anchor boxes, simplifying the model architecture and reducing hyperparameter tuning.
  * Incorporates a bipartite matching loss to match predicted bounding boxes with ground truth.
  * Achieves state-of-the-art performance on the COCO dataset while simplifying the object detection pipeline.
* **Sequel to:** Introduces a new paradigm for object detection, diverging from traditional CNN-based approaches.
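
The bipartite matching step assigns each ground-truth box to exactly one prediction so that the total matching cost is minimal. The sketch below is a deliberately simplified illustration: it searches all assignments by brute force, which is only feasible for a handful of boxes, and it uses a plain L1 distance between boxes as the cost. Both choices are assumptions for the example; practical implementations use the Hungarian algorithm and a richer cost that also accounts for class scores.

```python
from itertools import permutations


def l1_cost(pred, target):
    """L1 distance between two boxes given as equal-length tuples."""
    return sum(abs(p - t) for p, t in zip(pred, target))


def match_predictions(pred_boxes, gt_boxes):
    """Return (assignment, cost); assignment[g] is the prediction matched to gt g.

    Assumes there are at least as many predictions as ground-truth boxes.
    """
    best_cost, best_assignment = float("inf"), None
    for perm in permutations(range(len(pred_boxes)), len(gt_boxes)):
        cost = sum(l1_cost(pred_boxes[p], gt_boxes[g]) for g, p in enumerate(perm))
        if cost < best_cost:
            best_cost, best_assignment = cost, perm
    return best_assignment, best_cost


preds = [(0.5, 0.5, 0.2, 0.2), (0.1, 0.1, 0.1, 0.1), (0.8, 0.8, 0.3, 0.3)]
gts = [(0.8, 0.8, 0.3, 0.3), (0.5, 0.5, 0.2, 0.2)]
assignment, cost = match_predictions(preds, gts)
print(assignment)  # (2, 0)
```

Prediction 1 is left unmatched; in a set-prediction loss, unmatched predictions are trained to output a "no object" class.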

### Summary Table

| Layer Type              | Count | Parameters |
|-------------------------|-------|------------|
| Convolutional Layers    | 5     | 7,300,000  |
| Transformer Encoder      | 6     | 35,000,000 |
| Transformer Decoder      | 6     | 25,000,000 |
| Total Layers            | 17    | 67,300,000 |

### Computational Power Table

| Hardware Used              | FLOPs          | Computation Time | Hours Taken |
|----------------------------|----------------|------------------|-------------|
| 4 NVIDIA V100 GPUs         | ~22 billion    | 1 day            | ~24 hours   |

## Architecture Details

### Input Layer
* **Size:** $800 \times 800 \times 3$
* **Type:** RGB Image

### Layer Calculations
Each convolution layer follows this formula:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer               | Input Size               | Operation                         | Output Size               | Parameters |
|---------------------|--------------------------|-----------------------------------|---------------------------|------------|
| Conv1               | $800 \times 800 \times 3$| Conv 3×3, stride 2                | $400 \times 400 \times 64$| 1,792      |
| Conv2               | $400 \times 400 \times 64$| Conv 3×3, stride 2               | $200 \times 200 \times 128$| 73,728    |
| Conv3               | $200 \times 200 \times 128$| Conv 3×3, stride 2              | $100 \times 100 \times 256$| 294,912   |
| Conv4               | $100 \times 100 \times 256$| Conv 3×3, stride 2              | $50 \times 50 \times 512$ | 1,179,648  |
| Conv5               | $50 \times 50 \times 512$| Conv 3×3, stride 2                | $25 \times 25 \times 512$ | 2,359,296  |
| Transformer Encoder  | Varies                  | Transformer Encoding               | Varies                    | 35,000,000 |
| Transformer Decoder  | Varies                  | Transformer Decoding               | Varies                    | 25,000,000 |

### Parameter Calculation Example
For Conv1 layer:
* Weights: $3 \times 3 \times 3 \times 64 = 1,728$
* Biases: $64$
* Total: $1,728 + 64 = 1,792$ parameters

The Conv1 entry in the table includes biases; the Conv2 to Conv5 entries count weights only.
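
The same arithmetic can be written as a small helper, shown here as a sketch. The function name `conv2d_params` is hypothetical, and the channel list is read off the layer-by-layer table.

```python
def conv2d_params(kernel: int, in_channels: int, out_channels: int, bias: bool = True) -> int:
    """Parameter count of a 2D convolution with a square kernel."""
    weights = kernel * kernel * in_channels * out_channels
    return weights + (out_channels if bias else 0)


# Channel widths taken from the table: 3 -> 64 -> 128 -> 256 -> 512 -> 512.
channels = [3, 64, 128, 256, 512, 512]
for i, (c_in, c_out) in enumerate(zip(channels, channels[1:]), start=1):
    weights_only = conv2d_params(3, c_in, c_out, bias=False)
    with_bias = conv2d_params(3, c_in, c_out)
    print(f"Conv{i}: weights={weights_only:,} weights+bias={with_bias:,}")
```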

### Total Parameters
* Total parameters for DETR: Approximately 67.3 million.



---
---


# Swin Transformer Architecture

## Overview
* **Year:** 2021
* **Authors:** Ze Liu, Yutong Lin, Yue Cao, et al.
* **Key Innovations:**
  * Introduces a hierarchical structure with shifted windows for local and global attention mechanisms.
  * Achieves state-of-the-art performance on various vision tasks, including image classification, object detection, and semantic segmentation.
  * Combines the advantages of convolutional neural networks (CNNs) and transformers for improved efficiency and flexibility.
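
The window mechanism can be illustrated with a minimal NumPy sketch. It assumes a window size of 7 and a shift of half a window, applied to a $56 \times 56 \times 96$ feature map; the attention computation and the masking of wrapped-around positions are left out. The helper name `window_partition` is chosen for this example.

```python
import numpy as np


def window_partition(x: np.ndarray, window: int) -> np.ndarray:
    """Split an (H, W, C) map into (num_windows, window*window, C) token groups."""
    h, w, c = x.shape
    x = x.reshape(h // window, window, w // window, window, c)
    x = x.transpose(0, 2, 1, 3, 4)  # bring the two window-grid axes together
    return x.reshape(-1, window * window, c)


feature_map = np.zeros((56, 56, 96), dtype=np.float32)
window = 7  # assumption: window size used for illustration

# Regular layer: attention would run inside each non-overlapping window.
regular = window_partition(feature_map, window)

# Shifted layer: cyclically roll the map by half a window before partitioning,
# so the new windows straddle the borders of the previous ones.
shift = window // 2
shifted_map = np.roll(feature_map, shift=(-shift, -shift), axis=(0, 1))
shifted = window_partition(shifted_map, window)

print(regular.shape, shifted.shape)  # (64, 49, 96) (64, 49, 96)
```

Alternating the two partitions lets information cross window borders while each attention call still sees only 49 tokens.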

### Summary Table

| Layer Type           | Count | Parameters |
|----------------------|-------|------------|
| Patch Embedding      | 1     | 4,704      |
| Swin Transformer Block (Stage 1) | 4     | 1,568,512  |
| Swin Transformer Block (Stage 2) | 4     | 4,474,752  |
| Swin Transformer Block (Stage 3) | 6     | 10,348,032 |
| Swin Transformer Block (Stage 4) | 3     | 23,592,960 |
| MLP Head             | 1     | 769,000    |
| **Total**            | **N/A**| **~88 million** |

### Computational Power Table

| Hardware Used              | FLOPs          | Computation Time | Hours Taken |
|----------------------------|----------------|------------------|-------------|
| 2 NVIDIA A100 GPUs         | ~4 billion     | 2 days           | ~48 hours   |

## Architecture Details

### Input Layer
* **Size:** $224 \times 224 \times 3$
* **Type:** RGB Image

### Layer Calculations
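Swin has no conventional convolution stack. The formula below applies only to the patch embedding, which behaves like a convolution with filter size 4, stride 4, and no padding, giving $(224 - 4)/4 + 1 = 56$: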
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer                     | Input Size               | Operation                                 | Output Size               | Parameters |
|---------------------------|--------------------------|-------------------------------------------|---------------------------|------------|
| Patch Embedding           | $224 \times 224 \times 3$| Linear Projection (patch size $4 \times 4$)| $56 \times 56 \times 96$  | 4,704      |
| Stage 1                   | $56 \times 56 \times 96$ | Swin Transformer Block (4 layers)       | $56 \times 56 \times 96$  | 1,568,512  |
| Stage 2                   | $56 \times 56 \times 96$ | Swin Transformer Block (4 layers)       | $28 \times 28 \times 192$ | 4,474,752  |
| Stage 3                   | $28 \times 28 \times 192$| Swin Transformer Block (6 layers)       | $14 \times 14 \times 384$ | 10,348,032 |
| Stage 4                   | $14 \times 14 \times 384$| Swin Transformer Block (3 layers)       | $7 \times 7 \times 768$   | 23,592,960 |
| Global Average Pooling    | $7 \times 7 \times 768$  | Average over spatial positions            | $1 \times 768$            | 0          |
| Final Layer               | $1 \times 768$           | MLP Head (Classification)                | 1000 (classes)            | 769,000    |

### Total Parameters
* Total parameters for Swin Transformer: Approximately 88 million.

## Updates in Subsequent Versions

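### Swin Transformer V2 (2022)
* **Key Improvements:**
  - Improved training stability at large scale using residual post-normalization and scaled cosine attention.
  - Introduced a log-spaced continuous position bias so that models pre-trained at low resolution transfer better to larger images and windows.
  - Scaled the architecture up to about 3 billion parameters and input resolutions up to $1536 \times 1536$.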

### Swin Transformer V2.0 (2023)
* **Key Improvements:**
  - Extended capabilities for dense prediction tasks like segmentation and detection.
  - Introduced a new approach for incorporating multi-scale features.
  - Increased flexibility in adapting to different input sizes and aspect ratios.

### General Updates
- **Enhanced Performance:** Continuous updates in model training methodologies have led to better accuracy on benchmarks like ImageNet and COCO.
- **Real-time Applications:** Adaptations for real-time applications in mobile and edge devices, maintaining efficiency while improving speed.
- **Broader Adoption:** Gained popularity in various vision tasks, solidifying its place among state-of-the-art architectures.


### **<span style="color:red"> changed to Generative models of images and videos </span>**

https://youtu.be/rtx03_iC46U?si=3BpD7b0saxR4c1Yb

Many contemporary generative models, particularly in image synthesis, trace their conceptual roots to probabilistic frameworks, including **probabilistic graphical models (PGMs)**. The breakdown below shows how this foundation influences various generative models:

### Foundation of Generative Models

1. **Probabilistic Graphical Models (PGMs)**:
   - PGMs, which include Bayesian networks and Markov random fields, provide a structured way to represent the joint distribution of random variables. 
   - They allow for the modeling of complex relationships and dependencies between variables, making them a powerful framework for generative tasks.

2. **Generative Models**:
   - Generative models aim to learn the underlying distribution of data to generate new samples that resemble the training data. This can include images, text, or other types of data.
   - Examples include:
     - **Variational Autoencoders (VAEs)**: VAEs use latent variables and encode input data into a lower-dimensional space, from which they can sample to generate new data. The latent space is structured to approximate the data distribution, relying heavily on the principles of PGMs.
     - **Generative Adversarial Networks (GANs)**: While GANs do not explicitly use PGMs, their adversarial training framework indirectly relates to probabilistic modeling, as the generator and discriminator learn to capture the data distribution through competitive optimization.
     - **Diffusion Models**: These models, like Denoising Diffusion Probabilistic Models (DDPM), explicitly incorporate probabilistic frameworks to model data generation through a diffusion process. They can be viewed as a form of PGM that progressively transforms noise into data.

3. **Evolution and Impact**:
   - The understanding gained from PGMs has influenced how newer models are structured and trained, emphasizing the importance of distributional representations and stochastic processes.
   - Diffusion models, in particular, have drawn inspiration from the idea of incrementally refining noise into structured data, akin to how PGMs might represent gradual transformations between states.


### Key Connections Between PGMs and Generative AI

1. **Variational Autoencoders (VAEs)**:
   - **Connection**: VAEs combine neural networks with PGMs. They use an encoder to approximate the posterior distribution of latent variables and a decoder to generate data.
   - **Paper**: [Auto-Encoding Variational Bayes](https://arxiv.org/abs/1312.6114) by D. P. Kingma and M. Welling (2013) introduced this concept, showing how PGMs could be integrated with deep learning to create generative models.

2. **Generative Adversarial Networks (GANs)**:
   - **Connection**: While GANs don't explicitly use PGMs, the underlying principles of probabilistic reasoning and distribution modeling are present. The generator and discriminator can be thought of as two competing distributions, where the generator aims to produce samples from a target distribution.
   - **Paper**: [Generative Adversarial Nets](https://arxiv.org/abs/1406.2661) by I. Goodfellow et al. (2014) established GANs as a powerful generative modeling framework.

3. **Diffusion Models**:
   - **Connection**: Diffusion models can be seen as a probabilistic framework where data is generated through a process of gradual denoising. They employ a forward process (adding noise) and a reverse process (denoising) that can be analyzed through PGMs.
   - **Paper**: [Denoising Diffusion Probabilistic Models](https://arxiv.org/abs/2006.11239) by J. Ho et al. (2020) demonstrated this approach, showing its effectiveness for image generation.

4. **Bayesian Neural Networks**:
   - **Connection**: These models integrate PGMs into neural networks by treating weights as distributions rather than fixed values, allowing for uncertainty quantification in predictions.
   - **Paper**: [Dropout as a Bayesian Approximation: Representing Model Uncertainty in Deep Learning](https://arxiv.org/abs/1506.02142) by Y. Gal and Z. Ghahramani (2016) explored this concept.

These connections highlight how the principles of PGMs have influenced the development of generative models, creating a rich interplay between probabilistic reasoning and deep learning techniques.


## VAEs and CNNs
- **CNN backbones:** VAEs used CNN architectures to process and generate images, building on the deep learning advances of the early 2010s.
- **Competition with GANs:** The introduction of GANs influenced the later development of VAEs, as researchers explored ways to improve generative models.

### Diffusion Models as a New Paradigm
The introduction of diffusion models represented a shift in generative modeling, drawing from both VAEs and GANs but employing a different methodology focused on noise and denoising processes.


# Variational Autoencoder (VAE) Architecture

## Overview
* **Year**: 2013
* **Authors**: D. P. Kingma, M. Welling
* **Key Innovations**:
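  - Introduced amortized variational inference for latent variable models: an encoder network approximates the posterior over latent variables.
  - Combined neural networks with probabilistic modeling.
  - Used the reparameterization trick so that gradients can pass through the sampling step, enabling training by gradient descent.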
  - Enabled the generation of new data points from learned distributions.
  - Allowed for efficient training of complex generative models.

* **Sequel to**: Traditional Autoencoders
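
The reparameterization trick and the KL term of the VAE objective can be sketched in NumPy as follows. The batch size of 4 and latent width of 128 are illustrative assumptions, and the function names are chosen for this example; a real model would compute `mean` and `log_var` with the encoder.

```python
import numpy as np

rng = np.random.default_rng(0)


def reparameterize(mean: np.ndarray, log_var: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    """Sample z = mean + std * eps with eps ~ N(0, I).

    The randomness lives in eps, so z is a deterministic, differentiable
    function of mean and log_var.
    """
    eps = rng.standard_normal(mean.shape)
    return mean + np.exp(0.5 * log_var) * eps


def kl_to_standard_normal(mean: np.ndarray, log_var: np.ndarray) -> np.ndarray:
    """KL( N(mean, diag(exp(log_var))) || N(0, I) ) for each batch element."""
    return -0.5 * np.sum(1.0 + log_var - mean**2 - np.exp(log_var), axis=-1)


# Assumed shapes: batch of 4, 128 latent dimensions.
mean = np.zeros((4, 128))
log_var = np.zeros((4, 128))
z = reparameterize(mean, log_var, rng)
kl = kl_to_standard_normal(mean, log_var)
print(z.shape)  # (4, 128)
```

When the encoder output already matches the standard normal prior, as in this toy input, the KL term is zero; it grows as the posterior moves away from the prior.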

### Summary Table

| Layer Type              | Count | Parameters  |
|-------------------------|-------|-------------|
| Encoder (Convolutional) | 4     | 1,200,000   |
| Decoder (Convolutional) | 4     | 1,200,000   |
| Total Layers            | 8     | 2,400,000   |

### Computational Power Table

| Hardware Used              | FLOPs         | Computation Time | Hours Taken |
|----------------------------|---------------|------------------|-------------|
| 1 NVIDIA GTX 1080 Ti       | ~10 billion   | 1 day            | ~24 hours   |

## Architecture Details

### Input Layer
* **Size**: $64 \times 64 \times 3$
* **Type**: RGB Image

### Layer Calculations
Each convolution layer follows this simple formula:
$$\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1$$

### Layer by Layer Breakdown

| Layer | Input Size         | Operation                | Output Size          | Parameters   |
|-------|--------------------|-------------------------|----------------------|--------------|
| Conv1 | $64 \times 64 \times 3$  | Conv 3×3, stride 2    | $32 \times 32 \times 32$ | 896          |
| Conv2 | $32 \times 32 \times 32$ | Conv 3×3, stride 2    | $16 \times 16 \times 64$ | 18,496       |
| Conv3 | $16 \times 16 \times 64$ | Conv 3×3, stride 2    | $8 \times 8 \times 128$   | 73,856       |
| Conv4 | $8 \times 8 \times 128$   | Conv 3×3, stride 2    | $4 \times 4 \times 256$   | 295,168      |
| Flatten | $4 \times 4 \times 256$ | Flatten                | $4096$               | 0            |
| Dense (Mean) | $4096$             | Dense                  | $128$                | 524,416      |
| Dense (Log Variance) | $4096$       | Dense                  | $128$                | 524,416      |
| Dense (Latent) | $128$             | Dense                  | $2$                  | 258          |
| Dense (Latent Decoder) | $2$        | Dense                  | $128$                | 384          |
| Dense (Decoder 1) | $128$          | Dense                  | $4096$               | 528,384      |
| Reshape | $4096$             | Reshape                | $4 \times 4 \times 256$   | 0            |
| ConvTranspose1 | $4 \times 4 \times 256$ | ConvTranspose 3×3, stride 2 | $8 \times 8 \times 128$   | 295,040      |
| ConvTranspose2 | $8 \times 8 \times 128$ | ConvTranspose 3×3, stride 2 | $16 \times 16 \times 64$   | 73,792       |
| ConvTranspose3 | $16 \times 16 \times 64$ | ConvTranspose 3×3, stride 2 | $32 \times 32 \times 32$ | 18,464       |
| ConvTranspose4 | $32 \times 32 \times 32$ | ConvTranspose 3×3, stride 2 | $64 \times 64 \times 3$   | 867          |
| **Total** | -                  | -                       | -                    | **2,400,000** |

### Summary of VAE Architecture
VAEs leverage convolutional layers to encode input images into a latent space while maintaining probabilistic characteristics. The decoder then reconstructs images from this latent representation, making use of the learned distributions for effective image generation.
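
The decoder's transposed convolutions grow the feature map instead of shrinking it, so they follow a different size formula from the one given for convolutions. The sketch below assumes a padding of 1 and an output padding of 1 for each 3×3, stride-2 layer, which is one common way to get an exact doubling; neither value is stated in the table. The helper name is hypothetical.

```python
def conv_transpose_output_size(input_size: int, filter_size: int, stride: int,
                               padding: int = 0, output_padding: int = 0) -> int:
    """Spatial output size of a transposed convolution (dilation 1)."""
    return (input_size - 1) * stride - 2 * padding + filter_size + output_padding


# Assumption: padding=1, output_padding=1 for every decoder layer.
side = 4
decoder_sides = [side]
for _ in range(4):
    side = conv_transpose_output_size(side, filter_size=3, stride=2,
                                      padding=1, output_padding=1)
    decoder_sides.append(side)

print(decoder_sides)  # [4, 8, 16, 32, 64]
```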


 
# Generative Adversarial Networks (GANs) Architecture

## Overview
* **Year:** 2014
* **Authors:** Ian Goodfellow et al.
* **Key Innovations:**
  * Introduced the concept of adversarial training using two neural networks: a generator and a discriminator.
  * The generator creates fake data, while the discriminator attempts to distinguish between real and fake data.
  * Pioneered the use of GANs for various applications, including image generation, video generation, and more.

### Summary Table of Innovations

| Model       | Year | Key Innovations                                        | Parameters       |
|-------------|------|------------------------------------------------------|-------------------|
| GAN         | 2014 | Basic architecture with generator and discriminator   | Varies            |
| CycleGAN    | 2017 | Cycle consistency loss for unpaired image-to-image translation | Varies       |
| BigGAN      | 2018 | Large batch training, class-conditional generation    | 350 million       |

### Computational Power Table

| Model       | Hardware Used             | FLOPs          | Computation Time | Hours Taken |
|-------------|---------------------------|----------------|------------------|-------------|
| GAN         | 1 NVIDIA Titan X         | ~100 million   | 1 day            | ~24 hours   |
| CycleGAN    | 2 NVIDIA Titan X         | ~50 billion    | 2 days           | ~48 hours   |
| BigGAN      | 4 NVIDIA V100 GPUs       | ~200 billion   | 3 days           | ~72 hours   |

## Architecture Details

### GAN Architecture Overview

1. **Generator Network**
   - **Input:** Random noise vector $z$ (e.g., sampled from a uniform or normal distribution).
   - **Output:** Fake data (e.g., generated images).
   - **Structure:** Typically uses transposed convolutional layers to upsample the input noise into high-dimensional data.

2. **Discriminator Network**
   - **Input:** Real or fake data (e.g., images).
   - **Output:** Probability of input being real (1) or fake (0).
   - **Structure:** Usually consists of convolutional layers that downsample the input data.
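
The two networks are trained against each other with binary cross-entropy on the discriminator's outputs. The sketch below shows only the loss computation in NumPy, with hand-written probabilities standing in for real discriminator outputs. The generator uses the non-saturating form (label fake samples as real), which is a common choice and an assumption here.

```python
import numpy as np


def bce(prob: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Binary cross-entropy between predicted probabilities and 0/1 targets."""
    prob = np.clip(prob, eps, 1.0 - eps)
    return float(-np.mean(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob)))


def discriminator_loss(d_real: np.ndarray, d_fake: np.ndarray) -> float:
    """Real samples should score 1, generated samples should score 0."""
    return bce(d_real, np.ones_like(d_real)) + bce(d_fake, np.zeros_like(d_fake))


def generator_loss(d_fake: np.ndarray) -> float:
    """Non-saturating generator loss: push D's score on fakes toward 1."""
    return bce(d_fake, np.ones_like(d_fake))


# Placeholder discriminator outputs, not results from a trained model.
d_real = np.array([0.9, 0.8])
d_fake = np.array([0.2, 0.1])
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```

A discriminator that outputs 0.5 everywhere gives a discriminator loss of $2 \ln 2$, the value at which it cannot tell real from fake.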

### Layer Calculations
Both the generator and discriminator can have their layers structured similarly to CNNs, with layers that include:

* Convolutional layers (Conv)
* Batch normalization layers (BatchNorm)
* Activation functions (e.g., ReLU, Leaky ReLU)
* Fully connected layers (FC)

### Layer by Layer Breakdown (Example for Generator)

| Layer       | Input Size        | Operation                    | Output Size        | Parameters  |
|-------------|-------------------|------------------------------|---------------------|-------------|
| Dense       | $z$ (noise vector)| Fully Connected               | $N \times 128$      | 1,024       |
| Reshape     | $N \times 128$    | Reshape to (N, 2, 8, 8)      | $N \times 2 \times 8 \times 8$ | 0           |
| ConvTranspose| $N \times 2 \times 8 \times 8$| Transposed Conv (4x4, stride 2)| $N \times 2 \times 16 \times 16$| 66 (64 weights + 2 biases) |
| ConvTranspose| $N \times 2 \times 16 \times 16$| Transposed Conv (4x4, stride 2)| $N \times 2 \times 32 \times 32$| 66 (64 weights + 2 biases) |
| Conv        | $N \times 2 \times 32 \times 32$| Conv (3x3)| $N \times 3 \times 32 \times 32$| 57 (54 weights + 3 biases) |

### Total Parameters for GAN
* Total parameters for a simple GAN architecture can vary, typically ranging from a few thousand to several million depending on the architecture depth and complexity.

 

---
---


# Pix2Pix Architecture

## Overview
* **Year:** 2016
* **Authors:** Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, Alexei A. Efros
* **Key Innovations:**
  * Introduced a conditional GAN (cGAN) framework that enables paired image-to-image translation.
  * Utilizes a generator that translates input images to output images, while a discriminator assesses the realism of the generated images in relation to the input images.
  * Uses a loss function that combines adversarial loss and L1 loss to enhance the quality of generated images.
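
The combined generator objective can be sketched as an adversarial term plus a weighted L1 term. The weight `l1_weight=100.0` is an assumed default for illustration, the adversarial term uses the non-saturating form, and the arrays are placeholders for discriminator scores and images.

```python
import numpy as np


def generator_objective(d_fake: np.ndarray, generated: np.ndarray, target: np.ndarray,
                        l1_weight: float = 100.0, eps: float = 1e-7) -> float:
    """Adversarial loss plus weighted L1 distance to the paired target image."""
    d_fake = np.clip(d_fake, eps, 1.0 - eps)
    adversarial = -np.mean(np.log(d_fake))        # wants D(input, generated) close to 1
    l1 = np.mean(np.abs(target - generated))      # pixel-wise reconstruction error
    return float(adversarial + l1_weight * l1)


# Placeholder data: a constant target and a generated image that is off by 0.5.
target = np.zeros((256, 256, 3))
generated = np.full((256, 256, 3), 0.5)
d_fake = np.array([0.5])
print(generator_objective(d_fake, generated, target))
```

The L1 term ties the output to the paired target, while the adversarial term pushes it toward outputs the discriminator accepts as realistic.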

### Summary Table

| Layer Type              | Count | Parameters       |
|-------------------------|-------|------------------|
| Convolutional Layers    | 5     | 7,350,000        |
| Transposed Convolutional Layers | 5 | 3,500,000      |
| Total Layers            | 10    | 10,850,000       |

### Computational Power Table

| Hardware Used              | FLOPs          | Computation Time | Hours Taken |
|----------------------------|----------------|------------------|-------------|
| 1 NVIDIA Titan X           | ~30 billion    | 1 day            | ~24 hours   |

## Architecture Details

### Input Layer
* **Size:** $256 \times 256 \times 3$
* **Type:** Paired images (input and target)

### Layer Calculations
The Pix2Pix model uses convolutional layers and transposed convolutional layers to encode and decode images, respectively. The basic operation for each convolution layer is given by:
$$
\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1
$$

### Layer by Layer Breakdown

| Layer                | Input Size                  | Operation                     | Output Size                  | Parameters  |
|----------------------|-----------------------------|-------------------------------|------------------------------|-------------|
| Conv2D_1             | $256 \times 256 \times 3$   | Conv (4x4, stride 2)         | $128 \times 128 \times 64$  | 3,136       |
| Conv2D_2             | $128 \times 128 \times 64$ | Conv (4x4, stride 2)         | $64 \times 64 \times 128$   | 131,200     |
| Conv2D_3             | $64 \times 64 \times 128$  | Conv (4x4, stride 2)         | $32 \times 32 \times 256$   | 524,288     |
| Conv2D_4             | $32 \times 32 \times 256$  | Conv (4x4, stride 2)         | $16 \times 16 \times 512$   | 2,097,152   |
| Conv2D_5             | $16 \times 16 \times 512$  | Conv (4x4, stride 2)         | $8 \times 8 \times 512$     | 4,194,304   |
| ConvTranspose_1       | $8 \times 8 \times 512$     | Transposed Conv (4x4, stride 2)| $16 \times 16 \times 256$   | 2,097,152   |
| ConvTranspose_2       | $16 \times 16 \times 256$   | Transposed Conv (4x4, stride 2)| $32 \times 32 \times 128$   | 524,288     |
| ConvTranspose_3       | $32 \times 32 \times 128$   | Transposed Conv (4x4, stride 2)| $64 \times 64 \times 64$    | 131,136     |
| ConvTranspose_4       | $64 \times 64 \times 64$    | Transposed Conv (4x4, stride 2)| $128 \times 128 \times 3$   | 3,075       |

### Total Parameters
* Total parameters for Pix2Pix: Approximately 10,850,000.


---
---



# Video GAN Architecture

## Overview
* **Year:** 2016
* **Authors:** Carl Vondrick, Hamed Pirsiavash, Antonio Torralba
* **Key Innovations:**
  * Extension of GANs to video generation from random noise; the model can also be conditioned on a static input frame to predict plausible future frames.
  * Utilizes a spatio-temporal generative model to capture both spatial and temporal dependencies in video data.
  
### Architecture Details

#### Input Layer
* **Size:** $T \times H \times W \times C$ (where $T$ is the number of frames, $H$ and $W$ are height and width, and $C$ is the number of channels)

#### Layer by Layer Breakdown

| Layer                | Input Size                  | Operation                     | Output Size                  | Parameters  |
|----------------------|-----------------------------|-------------------------------|------------------------------|-------------|
| Conv3D_1             | $T \times H \times W \times C$ | Conv3D (4x4x4, stride 2)   | $T \times \frac{H}{2} \times \frac{W}{2} \times 64$ | 3,200       |
| Conv3D_2             | $T \times \frac{H}{2} \times \frac{W}{2} \times 64$ | Conv3D (4x4x4, stride 2)   | $T \times \frac{H}{4} \times \frac{W}{4} \times 128$ | 131,200     |
| Conv3D_3             | $T \times \frac{H}{4} \times \frac{W}{4} \times 128$ | Conv3D (4x4x4, stride 2)   | $T \times \frac{H}{8} \times \frac{W}{8} \times 256$ | 524,288     |
| Conv3D_4             | $T \times \frac{H}{8} \times \frac{W}{8} \times 256$ | Conv3D (4x4x4, stride 2)   | $T \times \frac{H}{16} \times \frac{W}{16} \times 512$ | 2,097,152   |
| ConvTranspose3D_1    | $T \times \frac{H}{16} \times \frac{W}{16} \times 512$ | Transposed Conv3D (4x4x4, stride 2) | $T \times \frac{H}{8} \times \frac{W}{8} \times 256$ | 2,097,152   |
| ConvTranspose3D_2    | $T \times \frac{H}{8} \times \frac{W}{8} \times 256$ | Transposed Conv3D (4x4x4, stride 2) | $T \times \frac{H}{4} \times \frac{W}{4} \times 128$ | 524,288     |

### Total Parameters for Video GAN
* Approximately 5,000,000.

---

# MoCo GAN Architecture

## Overview
* **Year:** 2018
* **Authors:** Sergey Tulyakov, Ming-Yu Liu, Xiaodong Yang, Jan Kautz
* **Key Innovations:**
  * Decomposes the latent space into a content part, held fixed for a whole clip, and a motion part, produced per frame by a recurrent network.
  * Uses separate image and video discriminators so that both single-frame quality and motion are judged.

### Architecture Details

#### Input Layer
* **Size:** $256 \times 256 \times 3$
* **Type:** Random content and motion noise vectors

#### Layer by Layer Breakdown

| Layer                | Input Size                  | Operation                     | Output Size                  | Parameters  |
|----------------------|-----------------------------|-------------------------------|------------------------------|-------------|
| Conv_1               | $256 \times 256 \times 3$   | Conv (4x4, stride 2)         | $128 \times 128 \times 64$  | 3,136       |
| Conv_2               | $128 \times 128 \times 64$ | Conv (4x4, stride 2)         | $64 \times 64 \times 128$   | 131,200     |
| Conv_3               | $64 \times 64 \times 128$  | Conv (4x4, stride 2)         | $32 \times 32 \times 256$   | 524,288     |
| Conv_4               | $32 \times 32 \times 256$  | Conv (4x4, stride 2)         | $16 \times 16 \times 512$   | 2,097,152   |
| ConvTranspose_1      | $16 \times 16 \times 512$  | Transposed Conv (4x4, stride 2)| $32 \times 32 \times 256$   | 2,097,152   |
| ConvTranspose_2      | $32 \times 32 \times 256$  | Transposed Conv (4x4, stride 2)| $64 \times 64 \times 128$    | 524,288     |

### Total Parameters for MoCo GAN
* Approximately 10,000,000.

---

# TGAN Architecture

## Overview
* **Year:** 2017
* **Authors:** Masaki Saito, Eiichi Matsumoto, Shunta Saito
* **Key Innovations:**
  * Generates video sequences from random noise, with a focus on temporal coherence across generated frames.
  * Uses a temporal generator that turns one latent vector into a sequence of per-frame latent vectors, which an image generator then renders into frames.

### Architecture Details

#### Input Layer
* **Size:** $T \times 64 \times 64 \times 3$
* **Type:** Random noise vector for T frames

#### Layer by Layer Breakdown

| Layer                | Input Size                  | Operation                     | Output Size                  | Parameters  |
|----------------------|-----------------------------|-------------------------------|------------------------------|-------------|
| Conv3D_1             | $T \times 64 \times 64 \times 3$ | Conv3D (4x4x4, stride 2)   | $T \times 32 \times 32 \times 64$ | 3,200       |
| Conv3D_2             | $T \times 32 \times 32 \times 64$ | Conv3D (4x4x4, stride 2)   | $T \times 16 \times 16 \times 128$ | 131,200     |
| Conv3D_3             | $T \times 16 \times 16 \times 128$ | Conv3D (4x4x4, stride 2)   | $T \times 8 \times 8 \times 256$ | 524,288     |
| ConvTranspose3D_1    | $T \times 8 \times 8 \times 256$ | Transposed Conv3D (4x4x4, stride 2) | $T \times 16 \times 16 \times 128$ | 1,048,576   |
| ConvTranspose3D_2    | $T \times 16 \times 16 \times 128$ | Transposed Conv3D (4x4x4, stride 2) | $T \times 32 \times 32 \times 64$  | 262,144     |

### Total Parameters for TGAN
* Approximately 5,000,000.

---

### Summary Table

| Model      | Year | Key Innovations                                        | Parameters       |
|------------|------|------------------------------------------------------|-------------------|
| Video GAN  | 2016 | Spatio-temporal model for video generation           | ~5,000,000        |
| MoCo GAN   | 2018 | Separate content and motion latent codes              | ~10,000,000       |
| TGAN       | 2017 | Temporal generator feeding an image generator         | ~5,000,000        |

### Computational Power Table

| Model      | Hardware Used             | FLOPs          | Computation Time | Hours Taken |
|------------|---------------------------|----------------|------------------|-------------|
| Video GAN  | 2 NVIDIA K40 GPUs         | ~30 billion    | 2 days           | ~48 hours   |
| MoCo GAN   | 1 NVIDIA V100 GPU         | ~50 billion    | 1 day            | ~24 hours   |
| TGAN       | 2 NVIDIA Titan X          | ~25 billion    | 1 day            | ~24 hours   |



---
---

The evolution of diffusion models represents a significant shift in how images and other complex data are generated. This section gives a brief overview of that evolution, the connections to previous architectures, and the reasons diffusion models have become so popular:

### Early Development of Diffusion Models

1. **Origins**:
   - Diffusion models can trace their origins to ideas in physics and thermodynamics, specifically the concept of diffusion processes, which describe how particles spread over time. This concept was later adapted into a probabilistic framework for generating data.
   - Early work in generative models primarily focused on Generative Adversarial Networks (GANs) and Variational Autoencoders (VAEs). These models laid the groundwork for generative tasks but faced challenges related to mode collapse and lack of diversity in generated samples.

2. **Connection to VAEs**:
   - Variational Autoencoders (introduced in 2013) became popular for their ability to model complex distributions and generate new samples. However, they had limitations in capturing high-quality details in generated images.
   - Researchers began exploring diffusion processes as a way to improve upon VAEs and GANs, seeking to leverage the gradual denoising process that mimics how images can be generated from noise.

### Rise of Diffusion Models

3. **Key Innovations**:
   - **Denoising Diffusion Probabilistic Models (DDPM)**: Introduced by Ho et al. in 2020, these models demonstrated how to generate images by progressively denoising a random Gaussian noise sample. The model learns to reverse the diffusion process through training.
   - **Generative Properties**: Diffusion models produce high-quality images with better diversity and fewer artifacts compared to GANs and VAEs. The gradual denoising process allows for more stable training and improved sample quality.

4. **Modeling Techniques**:
   - Diffusion models rely on two complementary processes:
     1. **Forward Process**: Gradually add Gaussian noise to the data over several steps, effectively creating a Markov chain.
     2. **Reverse Process**: Train a neural network to learn the reverse of the noise addition, gradually transforming noise back into data.
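
The forward process has a closed form, so a noisy sample at any step can be drawn directly from the clean data without looping over earlier steps. The sketch below assumes 1000 steps and a linear noise schedule from $10^{-4}$ to $0.02$; the random array stands in for a real image, and the names `betas`, `alpha_bars` and `q_sample` are chosen for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed schedule: 1000 steps, noise variance rising linearly.
num_steps = 1000
betas = np.linspace(1e-4, 0.02, num_steps)
alpha_bars = np.cumprod(1.0 - betas)  # fraction of signal variance kept up to step t


def q_sample(x0: np.ndarray, t: int, noise: np.ndarray) -> np.ndarray:
    """Draw x_t given x_0: sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    return np.sqrt(alpha_bars[t]) * x0 + np.sqrt(1.0 - alpha_bars[t]) * noise


x0 = rng.uniform(-1.0, 1.0, size=(64, 64, 3))  # placeholder for a normalized image
noise = rng.standard_normal(x0.shape)
x_early = q_sample(x0, 0, noise)               # almost identical to x0
x_late = q_sample(x0, num_steps - 1, noise)    # almost pure Gaussian noise
```

The reverse process is what the network has to learn: given `x_late`-style inputs and the step index, undo a little of the noise at a time.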

### Popularity and Applications

5. **State-of-the-Art Performance**:
   - The performance of diffusion models on benchmark datasets (like ImageNet) has led to their adoption in various applications, including image synthesis, inpainting, and super-resolution.
   - They have shown superior results in generating high-fidelity images compared to traditional GANs and VAEs.

6. **Recent Advancements**:
   - **Stable Diffusion and DALL-E 2**: Models like Stable Diffusion and DALL-E 2 have built upon diffusion processes to create sophisticated generative models capable of producing high-resolution images from textual descriptions.
   - The success of these models has sparked a growing interest in diffusion-based approaches, leading to new architectures and improvements in training techniques.

### Conclusion

Diffusion models have evolved significantly since their inception, providing a robust alternative to earlier generative models like VAEs and GANs. Their unique approach to image generation, combined with recent advancements and applications, has made them a central focus of research in the field of machine learning and computer vision.


# Denoising Diffusion Probabilistic Models (DDPM) Architecture

## Overview
* **Year**: 2020
* **Authors**: Jonathan Ho, Ajay Jain, Pieter Abbeel
* **Key Innovations**:
  - Introduced a novel framework for generative modeling through diffusion processes.
  - Proposed a two-step process: a forward process for noise addition and a reverse process for denoising.
  - Achieved state-of-the-art performance in image generation tasks.
  - Utilized a neural network to approximate the reverse diffusion process effectively.
  - Enabled sampling from complex data distributions through iterative refinement.

* **Sequel to**: Previous generative models (GANs, VAEs)

### Summary Table

| Layer Type              | Count | Parameters  |
|-------------------------|-------|-------------|
| Time Embedding          | 1     | 128         |
| U-Net Architecture      | 1     | 20,000,000  |
| Total Layers            | 2     | 20,000,128  |

### Computational Power Table

| Hardware Used              | FLOPs         | Computation Time | Hours Taken |
|----------------------------|---------------|------------------|-------------|
| 4 NVIDIA A100 GPUs         | ~200 billion  | 1.5 days         | ~36 hours   |

## Architecture Details

### Input Layer
* **Size**: $64 \times 64 \times 3$
* **Type**: RGB Image

### Layer Calculations
Each convolution layer follows this simple formula:
$$\text{Output Size} = \frac{\text{Input Size} + 2 \times \text{Padding} - \text{Filter Size}}{\text{Stride}} + 1$$

### Layer by Layer Breakdown

| Layer                   | Input Size         | Operation                     | Output Size          | Parameters    |
|-------------------------|--------------------|-------------------------------|----------------------|---------------|
| Time Embedding          | $t$                | Positional Encoding           | $128$                | 128           |
| Conv1                   | $64 \times 64 \times 3$  | Conv 3×3, stride 1          | $64 \times 64 \times 64$ | 1,792         |
| Conv2                   | $64 \times 64 \times 64$ | Conv 3×3, stride 1          | $64 \times 64 \times 128$ | 73,856        |
| Conv3                   | $64 \times 64 \times 128$ | Conv 3×3, stride 1         | $64 \times 64 \times 256$ | 295,168       |
| Conv4                   | $64 \times 64 \times 256$ | Conv 3×3, stride 1         | $64 \times 64 \times 512$ | 1,180,160     |
| Upconv1                 | $64 \times 64 \times 512$ | Upconv 4×4, stride 2       | $128 \times 128 \times 256$ | 2,621,440     |
| Upconv2                 | $128 \times 128 \times 256$ | Upconv 4×4, stride 2       | $256 \times 256 \times 128$ | 1,180,160     |
| Upconv3                 | $256 \times 256 \times 128$ | Upconv 4×4, stride 2       | $512 \times 512 \times 64$ | 73,856        |
| Upconv4                 | $512 \times 512 \times 64$  | Upconv 4×4, stride 2       | $1024 \times 1024 \times 3$ | 896           |
| **Total**               | -                  | -                             | -                    | **20,000,128** |
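
As a rough check on entries like those in the table, parameter counts and transposed-convolution output sizes can be computed directly. This is a sketch with illustrative helper names. It assumes square kernels, one bias per output channel, and a padding of 1 for the 4×4 up-convolutions (a value the table does not state, chosen because it gives an exact doubling).

```python
def conv_params(kernel, c_in, c_out, bias=True):
    """Trainable parameters of a convolution (or transposed convolution) with a square kernel."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

def upconv_output_size(input_size, kernel, stride, padding=0):
    """Spatial output size of a transposed convolution (no output padding, dilation 1)."""
    return (input_size - 1) * stride - 2 * padding + kernel

print(conv_params(3, 64, 128))                  # 73856
print(conv_params(3, 256, 512))                 # 1180160
print(upconv_output_size(64, 4, 2, padding=1))  # 128
```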

### Summary of DDPM Architecture
Denoising Diffusion Probabilistic Models use a U-Net to progressively denoise an image, starting from pure Gaussian noise. The model is trained to predict the noise that was added to an image at a randomly chosen timestep, and sampling applies that prediction repeatedly to turn noise into a coherent image.
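
The training idea can be sketched in plain Python. The schedule values ($T = 1000$, $\beta$ rising linearly from $10^{-4}$ to $0.02$) are those reported in the DDPM paper. The function names, the tiny list standing in for an image, and the zero-predicting "model" are assumptions made purely for illustration.

```python
import math
import random

T = 1000  # number of diffusion steps
# Linear noise schedule from 1e-4 to 0.02
betas = [1e-4 + (0.02 - 1e-4) * t / (T - 1) for t in range(T)]

# alpha_bars[t] is the running product of (1 - beta_s) for s <= t
alpha_bars = []
running = 1.0
for beta in betas:
    running *= 1.0 - beta
    alpha_bars.append(running)

def q_sample(x0, t, rng=random):
    """Draw x_t from q(x_t | x_0) in closed form; returns (x_t, noise)."""
    noise = [rng.gauss(0.0, 1.0) for _ in x0]
    signal_scale = math.sqrt(alpha_bars[t])
    noise_scale = math.sqrt(1.0 - alpha_bars[t])
    x_t = [signal_scale * x + noise_scale * e for x, e in zip(x0, noise)]
    return x_t, noise

def noise_prediction_loss(predicted_noise, true_noise):
    """Mean squared error between predicted and true noise (the simplified objective)."""
    return sum((p - e) ** 2 for p, e in zip(predicted_noise, true_noise)) / len(true_noise)

x0 = [0.5, -0.25, 1.0, 0.0]  # tiny stand-in for flattened pixel values
x_t, eps = q_sample(x0, t=500)
# Hypothetical model that always predicts zero noise, just to exercise the loss
print(noise_prediction_loss([0.0] * len(eps), eps))
```

In a real training loop the U-Net would receive `x_t` and `t` and its output would replace the zero prediction.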

---

# DALL·E Architecture

## Overview
* **Year:** 2021
* **Authors:** Aditya Ramesh et al.
* **Key Innovations:**
  * A transformer-based model that generates images from text descriptions.
  * Uses a discrete VAE (Variational Autoencoder) to compress each image into a grid of discrete tokens, and a transformer whose attention spans both text and image tokens to generate images conditioned on the text.
  * Can combine concepts in novel ways, for example rendering objects or styles that rarely appear together in training data.
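
A minimal sketch of how text and image tokens can be placed in a single stream for a transformer to model. The sizes (up to 256 text tokens from a 16,384-entry vocabulary, and a $32 \times 32$ grid of image tokens from an 8,192-entry codebook) are the values reported in the DALL·E paper. The function name and the single `PAD_ID` are hypothetical simplifications.

```python
TEXT_LEN = 256      # maximum number of text tokens
TEXT_VOCAB = 16384  # text vocabulary size
IMAGE_GRID = 32     # the dVAE maps an image to a 32x32 grid of tokens
IMAGE_VOCAB = 8192  # dVAE codebook size
PAD_ID = 0          # hypothetical padding id, for illustration only

def build_sequence(text_tokens, image_tokens):
    """Concatenate padded text tokens and offset image tokens into one stream."""
    if len(text_tokens) > TEXT_LEN:
        raise ValueError("too many text tokens")
    if len(image_tokens) != IMAGE_GRID * IMAGE_GRID:
        raise ValueError("expected one token per grid cell")
    padded_text = list(text_tokens) + [PAD_ID] * (TEXT_LEN - len(text_tokens))
    # Shift image ids past the text vocabulary so the two id ranges do not collide
    shifted_image = [TEXT_VOCAB + tok for tok in image_tokens]
    return padded_text + shifted_image

seq = build_sequence([17, 942, 5], [7] * (IMAGE_GRID * IMAGE_GRID))
print(len(seq))  # 1280
```

At generation time only the text part is given, and the model predicts the 1,024 image tokens one at a time before the dVAE decoder turns them back into pixels.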

### Architecture Details

#### Input Layer
* **Size:** Text input (variable length) and image input (variable size)
* **Type:** Text embeddings and image tokens

#### Layer by Layer Breakdown

| Layer                  | Input Size                     | Operation                       | Output Size                    | Parameters   |
|------------------------|--------------------------------|---------------------------------|--------------------------------|--------------|
| Text Encoder           | Variable length                | Transformer Encoder             | $N \times D$ (text tokens)    | 12 million    |
| Image Encoder          | Variable size                  | VAE Encoder                     | $M \times K$ (image tokens)    | 16 million    |
| Cross-Attention Layer  | $N \times D$ and $M \times K$ | Attention Mechanism             | $N \times K$                   | 30 million    |
| Image Decoder          | $N \times K$                  | VAE Decoder                     | Variable size                   | 20 million    |
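
The shape bookkeeping in the Cross-Attention row can be illustrated with plain scaled dot-product attention. This is a sketch under assumptions: learned projection matrices are omitted, queries and keys are assumed to already share a dimension $d$, and the values have width $K$, so $N$ queries attending over $M$ items give an $N \times K$ output. It illustrates the operation named in the table, not the published model's exact layout.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product attention: queries N x d, keys M x d, values M x K -> N x K."""
    d = len(queries[0])
    width = len(values[0])
    output = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d)
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        # Weighted average of the value rows
        output.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(width)])
    return output

queries = [[1.0, 0.0], [0.0, 1.0]]                           # N = 2, d = 2
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # M = 3, d = 2
values = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]  # M = 3, K = 4
out = cross_attention(queries, keys, values)
print(len(out), len(out[0]))  # 2 4
```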

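### Total Parameters for DALL·E
* The components listed above sum to approximately 78 million parameters ($12 + 16 + 30 + 20$ million). The published DALL·E model is far larger: a 12-billion-parameter autoregressive transformer.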

---

### DALL·E Versions Summary Table

| Version  | Year | Key Innovations                                               | Parameters             |
|----------|------|---------------------------------------------------------------|------------------------|
| DALL·E 1 | 2021 | Image generation from text prompts using transformers         | ~12,000,000,000        |
| DALL·E 2 | 2022 | Improved image quality, greater detail, and inpainting        | ~3,500,000,000         |
| DALL·E 3 | 2023 | Enhanced understanding of text, better composition, and style | Not publicly disclosed |

### Computational Power Table

| Version   | Hardware Used           | FLOPs          | Computation Time | Hours Taken |
|-----------|-------------------------|----------------|------------------|-------------|
| DALL·E 1 | 8 NVIDIA V100 GPUs      | ~10 trillion   | 2 days           | ~48 hours   |
| DALL·E 2 | 16 NVIDIA A100 GPUs     | ~30 trillion   | 3 days           | ~72 hours   |
| DALL·E 3 | 32 NVIDIA A100 GPUs     | ~50 trillion   | 4 days           | ~96 hours   |



The following is an overview of related generative media models, with links to relevant papers or resources where available:

### 1. Video Generation
- **Runway Gen3**: A generative model for creating and editing videos.
  - **Resource**: [Runway Gen3](https://runwayml.com/)
- **OpenAI Sora**: Focuses on generating video content from text descriptions or other input modalities.
  - **Resource**: [Sora: Generating Videos from Text](https://openai.com/research/sora) (check OpenAI's research page for a technical report or updates)
- **Kling 1.5**: An updated version of a video generation model emphasizing efficiency and quality.
  - **Resource**: [Kling](https://www.kling.ai/) (Check for official documentation or research papers)

### 2. Personalized Video Generation
- **ID-Animator**: Specializes in creating personalized video content by animating static images or text.
  - **Paper**: [ID-Animator: Personalized Video Animation](https://arxiv.org/abs/2104.08878) (Check for the most relevant paper or resource)

### 3. Video Editing
- **Runway Gen3 Style**: Tailored for video editing tasks, offering style transfer and editing capabilities.
  - **Resource**: [Runway Gen3 Style](https://runwayml.com/)
  
### 4. Audio Generation
- **PikaLabs Sound Gen**: Produces soundtracks or sound effects to accompany visual content.
  - **Resource**: [PikaLabs](https://pikavideo.com/)
- **External Music Gen. API**: Enables users to generate music programmatically, integrating various styles.
  - **Paper**: [Music Generation with Neural Networks](https://arxiv.org/abs/1904.05858) (not tied to a specific API, but useful background on music generation)

### Note
Some of these models, especially the commercial products, may not have academic papers available. Their official websites often provide documentation or blog posts that describe their functionality and underlying technologies.