# CNN Architectures 

This section investigates CNN architectures that were created specifically for image classification:
- AlexNet
- VGGNet
- GoogLeNet
- ResNet
- Wide ResNet, DenseNet

## The ImageNet Classification Challenge

Given 1,431,167 human-labeled images spanning 1,000 object classes, develop the model with the highest classification accuracy, evaluated by top-5 error: a prediction counts as correct if the true label is among the model's five highest-scoring classes.
  
<br>
<div align="center">
<img src="../images/chap8/ImageNet.png" width="710"/>
</div>
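
To make the metric concrete, here is a minimal sketch of top-5 error in plain Python. The function name `top5_error` and the toy scores are illustrative assumptions, not part of any benchmark toolkit; real evaluations run over 1,000 classes.

```python
def top5_error(scores, labels, k=5):
    """Fraction of samples whose true label is NOT among the k highest-scoring classes."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the k largest scores for this sample.
        top_k = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        if label not in top_k:
            misses += 1
    return misses / len(labels)


# Toy example with 6 classes instead of 1000.
scores = [
    [0.10, 0.05, 0.40, 0.20, 0.15, 0.10],  # true class 1 has the lowest score -> miss
    [0.30, 0.25, 0.20, 0.15, 0.07, 0.03],  # true class 0 is ranked first -> hit
]
labels = [1, 0]
print(top5_error(scores, labels))  # 0.5
```

One of the two toy samples misses its true class in the top five, so the error is 0.5.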

### Pre-AlexNet

Before AlexNet, the main models were shallow neural networks or classical ML models that relied on hand-crafted feature extraction.
The best models of this period achieved top-5 error rates of roughly 25-28%.

## AlexNet

AlexNet (developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) was the first model to cut the error rate sharply, by roughly 10 percentage points, using a deep convolutional neural network.


<div align="center">
<img src="../images/chap8/AlexNet.png" width="710"/>
</div>


### Key Innovations

| **Innovation** | **Description** | **Impact** |
|----------------|-----------------|------------|
| **Deep Architecture** | 8 learned layers (5 conv + 3 FC) | First successful very deep CNN |
| **ReLU Activation** | Used ReLU instead of tanh/sigmoid | About 6× faster convergence than tanh, mitigates vanishing gradients |
| **Dropout** | Applied 0.5 dropout in FC layers | Reduced overfitting significantly |
| **Data Augmentation** | Random crops, flips, color jittering | Increased training data diversity |
| **GPU Training** | Trained on 2 GTX 580 GPUs | Made large-scale training feasible |
| **Local Response Normalization** | Normalization across channels | Later replaced by Batch Normalization |
| **Overlapping Pooling** | Pool size 3×3, stride 2 | Slight accuracy improvement |
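
For reference, the local response normalization in the table can be written as follows. This is the form given in the AlexNet paper, where $a^{i}_{x,y}$ is the ReLU output of kernel $i$ at position $(x, y)$, $N$ is the number of kernels in the layer, and the sum runs over $n$ neighboring channels. The constants reported there are $k = 2$, $n = 5$, $\alpha = 10^{-4}$, $\beta = 0.75$.

$$
b^{i}_{x,y} = \frac{a^{i}_{x,y}}{\left(k + \alpha \sum_{j=\max(0,\, i - n/2)}^{\min(N-1,\, i + n/2)} \left(a^{j}_{x,y}\right)^{2}\right)^{\beta}}
$$

Each activation is divided by a term that grows with the squared activity of nearby channels at the same spatial position, so strongly responding channels suppress their neighbors.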

### Complete Architecture

**Input**: $227 \times 227 \times 3$ RGB image (originally stated as 224×224, but 227×227 is correct for the dimensions to work out)
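
The note about 227 versus 224 follows from the usual output-size formula $\lfloor (n + 2p - k)/s \rfloor + 1$. A minimal sketch (the helper name `conv_output_size` is an assumption for illustration) that leaves out the floor so a non-integer result is visible:

```python
def conv_output_size(n, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling layer.

    Returns a float so that a non-integer result (the windows do not tile
    the input exactly) is visible instead of being silently floored.
    """
    return (n + 2 * padding - kernel) / stride + 1


print(conv_output_size(227, 11, 4))  # 55.0  -> fits exactly
print(conv_output_size(224, 11, 4))  # 54.25 -> does not divide evenly
```

With an 11×11 kernel, stride 4, and no padding, only the 227 input gives the integer output size of 55 used in the table below.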

**Assumptions**: 
- Batch size = 1 (single image)
- 32-bit floating point (4 bytes per value)
- FLOPs are counted as multiply-add (MAC) operations: one multiply plus one add counts as a single operation

| **Layer** | **Type** | **Input Shape** | **Kernel Size** | **Filters/Units** | **Stride** | **Padding** | **Output Shape** | **Parameters** | **Memory (MB)** | **FLOPs** | **Activation** |
|-----------|----------|-----------------|-----------------|-------------------|------------|-------------|------------------|----------------|-----------------|-----------|----------------|
| **Input** | Input | - | - | - | - | - | $227 \times 227 \times 3$ | 0 | $227 \times 227 \times 3 \times 4 = 0.59$ | 0 | - |
| **Conv1** | Convolution | $227 \times 227 \times 3$ | $11 \times 11$ | 96 | 4 | 0 (valid) | $55 \times 55 \times 96$ | $34{,}944$ | $55 \times 55 \times 96 \times 4 = 1.11$ | $55 \times 55 \times 96 \times (11 \times 11 \times 3) = 105.4M$ | ReLU |
| **Pool1** | Max Pooling | $55 \times 55 \times 96$ | $3 \times 3$ | - | 2 | 0 | $27 \times 27 \times 96$ | 0 | $27 \times 27 \times 96 \times 4 = 0.27$ | $27 \times 27 \times 96 \times 9 = 0.63M$ | - |
| **LRN1** | Local Response Norm | $27 \times 27 \times 96$ | - | - | - | - | $27 \times 27 \times 96$ | 0 | $0.27$ | $27 \times 27 \times 96 \times 10 = 0.70M$ | - |
| **Conv2** | Convolution | $27 \times 27 \times 96$ | $5 \times 5$ | 256 | 1 | 2 (same) | $27 \times 27 \times 256$ | $614{,}656$ | $27 \times 27 \times 256 \times 4 = 0.71$ | $27 \times 27 \times 256 \times (5 \times 5 \times 96) = 447.9M$ | ReLU |
| **Pool2** | Max Pooling | $27 \times 27 \times 256$ | $3 \times 3$ | - | 2 | 0 | $13 \times 13 \times 256$ | 0 | $13 \times 13 \times 256 \times 4 = 0.17$ | $13 \times 13 \times 256 \times 9 = 0.39M$ | - |
| **LRN2** | Local Response Norm | $13 \times 13 \times 256$ | - | - | - | - | $13 \times 13 \times 256$ | 0 | $0.17$ | $13 \times 13 \times 256 \times 10 = 0.43M$ | - |
| **Conv3** | Convolution | $13 \times 13 \times 256$ | $3 \times 3$ | 384 | 1 | 1 (same) | $13 \times 13 \times 384$ | $885{,}120$ | $13 \times 13 \times 384 \times 4 = 0.25$ | $13 \times 13 \times 384 \times (3 \times 3 \times 256) = 149.5M$ | ReLU |
| **Conv4** | Convolution | $13 \times 13 \times 384$ | $3 \times 3$ | 384 | 1 | 1 (same) | $13 \times 13 \times 384$ | $1{,}327{,}488$ | $0.25$ | $13 \times 13 \times 384 \times (3 \times 3 \times 384) = 224.3M$ | ReLU |
| **Conv5** | Convolution | $13 \times 13 \times 384$ | $3 \times 3$ | 256 | 1 | 1 (same) | $13 \times 13 \times 256$ | $884{,}992$ | $0.17$ | $13 \times 13 \times 256 \times (3 \times 3 \times 384) = 149.5M$ | ReLU |
| **Pool3** | Max Pooling | $13 \times 13 \times 256$ | $3 \times 3$ | - | 2 | 0 | $6 \times 6 \times 256$ | 0 | $6 \times 6 \times 256 \times 4 = 0.04$ | $6 \times 6 \times 256 \times 9 = 0.08M$ | - |
| **Flatten** | Flatten | $6 \times 6 \times 256$ | - | - | - | - | $9{,}216$ | 0 | $9{,}216 \times 4 = 0.04$ | 0 | - |
| **FC1** | Fully Connected | $9{,}216$ | - | 4096 | - | - | $4{,}096$ | $37{,}752{,}832$ | $4{,}096 \times 4 = 0.016$ | $9{,}216 \times 4{,}096 = 37.7M$ | ReLU + Dropout (0.5) |
| **FC2** | Fully Connected | $4{,}096$ | - | 4096 | - | - | $4{,}096$ | $16{,}781{,}312$ | $0.016$ | $4{,}096 \times 4{,}096 = 16.8M$ | ReLU + Dropout (0.5) |
| **FC3** | Fully Connected | $4{,}096$ | - | 1000 | - | - | $1{,}000$ | $4{,}097{,}000$ | $1{,}000 \times 4 = 0.004$ | $4{,}096 \times 1{,}000 = 4.1M$ | Softmax |
| **Output** | Softmax | $1{,}000$ | - | - | - | - | $1{,}000$ | 0 | $0.004$ | $1{,}000 \times 5 = 0.005M$ | - |
| | | | | | | | **TOTAL** | **62,378,344** | **~4.1 MB** | **~1.14 GFLOPs** | |
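
As a sanity check, the sketch below recomputes the parameter counts and the conv/FC multiply-add counts directly from the layer hyperparameters. It assumes a single-tower network (no two-GPU channel grouping), which is what the per-layer parameter counts above imply, and it skips pooling, LRN, and softmax because they have no parameters and contribute little compute. `LAYERS` and `summarize` are illustrative names.

```python
# (name, kind, kernel, out_channels_or_units, stride, padding)
LAYERS = [
    ("Conv1", "conv", 11, 96, 4, 0),
    ("Pool1", "pool", 3, None, 2, 0),
    ("Conv2", "conv", 5, 256, 1, 2),
    ("Pool2", "pool", 3, None, 2, 0),
    ("Conv3", "conv", 3, 384, 1, 1),
    ("Conv4", "conv", 3, 384, 1, 1),
    ("Conv5", "conv", 3, 256, 1, 1),
    ("Pool3", "pool", 3, None, 2, 0),
    ("FC1", "fc", None, 4096, None, None),
    ("FC2", "fc", None, 4096, None, None),
    ("FC3", "fc", None, 1000, None, None),
]


def summarize(layers, size=227, channels=3):
    """Return (total_params, total_macs) over the conv and FC layers."""
    total_params = total_macs = 0
    features = None  # length of the flattened vector once the FC layers start
    for name, kind, k, out, stride, pad in layers:
        if kind in ("conv", "pool"):
            size = (size + 2 * pad - k) // stride + 1
        if kind == "conv":
            macs = size * size * out * k * k * channels
            params = k * k * channels * out + out  # weights + biases
            channels = out
        elif kind == "fc":
            if features is None:
                features = size * size * channels  # flatten
            macs = features * out
            params = features * out + out
            features = out
        else:
            continue  # pooling: no parameters, negligible compute
        total_params += params
        total_macs += macs
        print(f"{name}: params={params:,} MACs={macs / 1e6:.1f}M")
    return total_params, total_macs


params, macs = summarize(LAYERS)
print(f"total params: {params:,}")
print(f"total MACs:   {macs / 1e9:.2f}G")
```

Under these assumptions the script reports 62,378,344 parameters and about 1.14 billion multiply-adds, with the three FC layers holding the large majority of the parameters and the conv layers the large majority of the compute.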



### Key Design Choices

| **Choice** | **Rationale** | **Impact** |
|------------|---------------|------------|
| **Large first kernel (11×11)** | Capture large receptive field early | Extract diverse low-level features |
| **Decreasing kernel sizes** | 11×11 → 5×5 → 3×3 | Balance computation and feature extraction |
| **Overlapping pooling** | Pool 3×3, stride 2 (not 2×2, stride 2) | Slight reduction in overfitting |
| **Deep FC layers (4096 units)** | High capacity for classification | Enables complex decision boundaries |
| **Dropout in FC only** | Conv layers have fewer parameters | Regularization where needed most |
| **No dropout in conv layers** | Conv layers less prone to overfitting | Retain feature extraction capacity |

### Limitations and Modern Improvements

| **AlexNet Feature** | **Limitation** | **Modern Solution** |
|---------------------|----------------|---------------------|
| **Local Response Normalization (LRN)** | Expensive, marginal benefit | Batch Normalization (more effective) |
| **Large FC layers** | ~94% of parameters, overfitting | Global Average Pooling (GAP) |
| **Manual learning rate schedule** | Requires monitoring | Adaptive optimizers (Adam, AdamW) |
| **Fixed input size (227×227)** | Less flexible | Fully convolutional networks |
| **Two-GPU split** | Complex implementation | Better multi-GPU frameworks |
| **Large initial kernels (11×11)** | Expensive computation | Smaller kernels (3×3) stacked |

---





## VGG

The primary goal of this network was **to show how depth affects performance**.

**Design Rules**
1. **Simplicity**
   - Convolution filters: 3x3, s=1, p=1
   - Max pools: 2x2, s=2
   - After each pooling layer, double the number of channels
2. **Homogeneity** - Consistent design pattern.
3. **Small Kernels** - Replace large kernels with stacks of 3x3 kernels.
4. **Systematic Depth** - With a consistent design pattern, depth is simple to increase and its effect is simple to measure.
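
The rules above can be sketched as a shape trace through the VGG-16 convolutional stack. The channel list is the standard VGG-16 configuration (note that the doubling stops at 512 channels), the 224×224 input is the usual ImageNet crop size, and `trace_shapes` is an illustrative helper name.

```python
# VGG-16 feature extractor: numbers are output channels of a 3x3 conv
# (stride 1, padding 1); "M" is a 2x2 max pool with stride 2.
VGG16 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]


def trace_shapes(config, size=224, channels=3):
    """Return the (height/width, channels) pair after every layer."""
    shapes = []
    for layer in config:
        if layer == "M":
            size = (size - 2) // 2 + 1          # 2x2 pool, stride 2: halves H and W
        else:
            size = (size + 2 * 1 - 3) // 1 + 1  # 3x3 conv, s=1, p=1: keeps H and W
            channels = layer
        shapes.append((size, channels))
    return shapes


shapes = trace_shapes(VGG16)
for layer, (size, channels) in zip(VGG16, shapes):
    print(f"{str(layer):>4} -> {size} x {size} x {channels}")
```

Every convolution preserves the spatial size and every pool halves it, so the 224×224 input ends as a 7×7×512 feature map after five pooling stages.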

<div align="center">
<img src="../images/chap8/VGG.png" width="195"/>
<img src="../images/chap8/VGGStruct.png" width="710"/>
</div>

### Measuring stacked Kernels vs. Large Kernels

**Receptive Field**

|Configuration | Receptive Field | Calculation | 
|--------------|-----------------|-------------|
|One 7x7 Conv  | 7x7 | $(7-1)\cdot1 + 1$ |
|One 5x5 Conv  | 5x5 | $(5-1)\cdot1 + 1$ | 
|Two 3x3 Conv  | 5x5 | $((3-1)\cdot 1 + (3-1)\cdot 1) +1$ | 
|Three 3x3 Conv| 7x7 | $((3-1)\cdot 1 + (3-1)\cdot 1 + (3-1)\cdot 1) + 1$ |
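
The calculations in the table are instances of a general recurrence: each layer adds $(k - 1)$ times the current spacing between neighboring outputs (the "jump") to the receptive field. A minimal sketch, with `receptive_field` as an illustrative name:

```python
def receptive_field(layers):
    """Receptive field of a stack of conv layers given as (kernel, stride) pairs."""
    rf, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel, stride in layers:
        rf += (kernel - 1) * jump
        jump *= stride
    return rf


print(receptive_field([(7, 1)]))      # 7
print(receptive_field([(3, 1)] * 2))  # 5
print(receptive_field([(3, 1)] * 3))  # 7
```

With stride 1 everywhere the jump stays at 1, which reduces to the sums shown in the table.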

**Parameters**

Given $C$ input channels and $C$ output channels per layer (biases ignored):

|Configuration | Parameters | Calculation | 
|--------------|-----------------|-------------|
|One 7x7 Conv  | $49C^2$ | $C \times C \times 7 \times 7$ |
|One 5x5 Conv  | $25C^2$ | $C \times C \times 5 \times 5$ | 
|Two 3x3 Conv  | $18C^2$| $2 \times (C \times C \times 3 \times 3)$ | 
|Three 3x3 Conv| $27C^2$ | $3 \times (C \times C \times 3 \times 3)$ | 
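
As a worked example, take $C = 256$ (a value chosen only for illustration) and compare one 7x7 layer with three stacked 3x3 layers, ignoring biases:

$$
49C^2 = 49 \cdot 256^2 = 3{,}211{,}264
\qquad
27C^2 = 27 \cdot 256^2 = 1{,}769{,}472
\qquad
\frac{27C^2}{49C^2} \approx 0.55
$$

The stacked version needs roughly 45% fewer weights for the same 7x7 receptive field, and the ratio is independent of $C$.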

**Non-Linearity (Expressiveness)**

|Configuration | Non-linearities |
|--------------|-----------------|
|One 7x7 Conv  | 1 ReLU |
|One 5x5 Conv  | 1 ReLU |
|Two 3x3 Conv  | 2 ReLUs|
|Three 3x3 Conv| 3 ReLUs |

**Computational Costs (FLOPs)**

Given spatial dimensions $H \times W$ and $C \to C$ channels, counting each multiply-add as 2 FLOPs:

|Configuration | FLOPs | Calculation | 
|--------------|-----------------|-------------|
|One 7x7 Conv  | $98C^2HW$ | $H \times W \times C \times (7 \times 7 \times C \times 2)$ |
|One 5x5 Conv  | $50C^2HW$ | $H \times W \times C \times (5 \times 5 \times C \times 2)$ | 
|Two 3x3 Conv  | $36C^2HW$ | $2 \times (H \times W \times C \times (3 \times 3 \times C \times 2))$ | 
|Three 3x3 Conv| $54C^2HW$ | $3 \times (H \times W \times C \times (3 \times 3 \times C \times 2))$|
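
A short sketch that evaluates the FLOP formulas in the table for a concrete feature map. The sizes `C`, `H`, `W` are hypothetical, the helper name `conv_stack_flops` is illustrative, and the code assumes 'same' padding so every layer keeps the $H \times W$ resolution.

```python
def conv_stack_flops(kernel, num_layers, channels, height, width):
    """FLOPs for `num_layers` stacked k x k convs with C -> C channels.

    Assumes 'same' padding (output stays H x W), ignores biases, and counts
    each multiply-add as 2 FLOPs.
    """
    macs_per_layer = height * width * channels * (kernel * kernel * channels)
    return 2 * macs_per_layer * num_layers


C, H, W = 64, 56, 56  # hypothetical feature-map size, chosen only for illustration
unit = C * C * H * W
for label, kernel, n in [("one 7x7", 7, 1), ("one 5x5", 5, 1),
                         ("two 3x3", 3, 2), ("three 3x3", 3, 3)]:
    flops = conv_stack_flops(kernel, n, C, H, W)
    print(f"{label:>9}: {flops // unit} C^2HW = {flops / 1e6:.1f} MFLOPs")
```

The coefficients printed in front of `C^2HW` are 98, 50, 36, and 54, the same as in the table, so the relative savings do not depend on the chosen sizes.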


### Conclusion 

<div align="center">
<img src="../images/chap8/VGGstats.png" width="595"/>
</div>

**VGG's Advantages: Stacked Small Kernels Win**

Compared to large kernels (5×5, 7×7), VGG's stacked 3×3 convolutions provide:

| **Metric** | **Advantage** | **Impact** |
|------------|---------------|------------|
| **Receptive Field** | Same coverage (three stacked 3×3 = one 7×7) | Equivalent spatial context |
| **Parameters** | 45% fewer (27C² vs 49C²) | More efficient learning |
| **Expressiveness** | 3× as many ReLU activations (3 vs 1) | Better feature representations |
| **Computation** | 45% fewer FLOPs | Faster training and inference |
| **Accuracy** | 7.3% top-5 error | Among the best ILSVRC 2014 results (GoogLeNet won classification with 6.7%) |

**VGG proved that depth + simplicity beats complexity.**

---

**VGG's Critical Weakness: Computational Inefficiency**

Despite its success, VGG has a major flaw:

| **Problem** | **Numbers** | **Issue** |
|-------------|-------------|-----------|
| **Memory consumption** | ~138M parameters (VGG-16) | Huge model size |
| **FLOPs per image** | ~15.5 billion multiply-add operations | Very slow inference |
| **FC layer dominance** | 90% of parameters in FC layers | Inefficient parameter use |
| **Limited depth scaling** | Difficult to go beyond 19 layers | Diminishing returns |
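
The parameter figures in this table can be reproduced with a few lines of Python. The sketch assumes the standard VGG-16 configuration with a 224×224 input and three FC layers (4096, 4096, 1000); `vgg16_param_split` is an illustrative name.

```python
VGG16_CONV = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
              512, 512, 512, "M", 512, 512, 512, "M"]
VGG16_FC = [4096, 4096, 1000]


def vgg16_param_split(input_size=224, in_channels=3):
    """Return (conv_params, fc_params), counting weights and biases."""
    conv_params, channels, size = 0, in_channels, input_size
    for layer in VGG16_CONV:
        if layer == "M":
            size //= 2  # 2x2 max pool, stride 2
        else:
            conv_params += 3 * 3 * channels * layer + layer
            channels = layer
    fc_params, features = 0, size * size * channels  # 7 * 7 * 512 = 25,088
    for units in VGG16_FC:
        fc_params += features * units + units
        features = units
    return conv_params, fc_params


conv_params, fc_params = vgg16_param_split()
total = conv_params + fc_params
print(f"conv:  {conv_params:,}")
print(f"fc:    {fc_params:,}")
print(f"total: {total:,} ({fc_params / total:.1%} in FC layers)")
```

Under these assumptions the total is 138,357,544 parameters, of which roughly 89% sit in the three FC layers, and the first FC layer alone accounts for over 100M.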

**The Question:** Can we build deeper, more accurate networks **without** exploding computation?

**The Answer:** GoogLeNet (Inception) introduces **multi-scale feature extraction** and **1×1 convolutions** to dramatically reduce parameters while maintaining (or improving) accuracy. Its authors report **12× fewer parameters** than AlexNet, and the gap to VGG is larger still.

---