# CNN Architectures 

In this section, we investigate CNN architectures that were created specifically for image classification:
- AlexNet
- VGGNet 
- GoogleNet

## The ImageNet Classification Challenge

Given 1,431,167 human-labeled images spanning 1,000 object classes, develop a model that achieves the highest classification accuracy, evaluated by top-5 error: a prediction counts as correct if the true label is among the model's five highest-scoring classes.
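
As a minimal sketch of the top-5 error metric, the function below counts an example as a miss only when its true label is absent from the five highest-scoring classes. The function name `top5_error` and the toy scores (6 classes instead of 1,000) are illustrative assumptions, not part of the challenge tooling.

```python
def top5_error(scores, labels):
    """Fraction of examples whose true label is not among the 5 highest scores."""
    misses = 0
    for row, label in zip(scores, labels):
        # Indices of the five largest scores for this example.
        top5 = sorted(range(len(row)), key=lambda i: row[i], reverse=True)[:5]
        if label not in top5:
            misses += 1
    return misses / len(labels)


# Toy example with 6 classes (hypothetical scores).
scores = [
    [0.10, 0.50, 0.20, 0.05, 0.05, 0.10],  # true class 1 ranks first: hit
    [0.30, 0.25, 0.20, 0.15, 0.09, 0.01],  # true class 5 ranks last: miss
]
labels = [1, 5]
print(top5_error(scores, labels))  # 0.5
```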
  
<br>
<div align="center">
<img src="../images/chap8/ImageNet.png" width="710"/>
</div>

### Pre-AlexNet

Before AlexNet, the leading entries were shallow neural networks or classical ML models built on hand-engineered features.
The best models of this period reached top-5 error rates of roughly 25-28%.

## AlexNet

AlexNet (developed by Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton) was the first model to cut the error rate substantially, by roughly 10 percentage points, using a deep neural network.


<div align="center">
<img src="../images/chap8/AlexNet.png" width="710"/>
</div>


### Key Innovations

| **Innovation** | **Description** | **Impact** |
|----------------|-----------------|------------|
| **Deep Architecture** | 8 learned layers (5 conv + 3 FC) | First successful very deep CNN |
| **ReLU Activation** | Used ReLU instead of tanh/sigmoid | 6× faster training, mitigates vanishing gradients |
| **Dropout** | Applied 0.5 dropout in FC layers | Reduced overfitting significantly |
| **Data Augmentation** | Random crops, flips, color jittering | Increased training data diversity |
| **GPU Training** | Trained on 2 GTX 580 GPUs | Made large-scale training feasible |
| **Local Response Normalization** | Normalization across channels | Later replaced by Batch Normalization |
| **Overlapping Pooling** | Pool size 3×3, stride 2 | Slight accuracy improvement |

### Complete Architecture

**Input**: $227 \times 227 \times 3$ RGB image (originally stated as 224×224, but 227×227 is needed for the dimensions to work out)

**Assumptions**: 
- Batch size = 1 (single image)
- 32-bit floating point (4 bytes per value)
- FLOPs calculated as multiply-add operations

| **Layer** | **Type** | **Input Shape** | **Kernel Size** | **Filters/Units** | **Stride** | **Padding** | **Output Shape** | **Parameters** | **Activation Memory (MB)** | **FLOPs (multiply-adds)** | **Activation** |
|-----------|----------|-----------------|-----------------|-------------------|------------|-------------|------------------|----------------|-----------------|-----------|----------------|
| **Input** | Input | - | - | - | - | - | $227 \times 227 \times 3$ | 0 | $227 \times 227 \times 3 \times 4 = 0.59$ | 0 | - |
| **Conv1** | Convolution | $227 \times 227 \times 3$ | $11 \times 11$ | 96 | 4 | 0 (valid) | $55 \times 55 \times 96$ | $34{,}944$ | $55 \times 55 \times 96 \times 4 = 1.11$ | $55 \times 55 \times 96 \times (11 \times 11 \times 3) = 105.4M$ | ReLU |
| **Pool1** | Max Pooling | $55 \times 55 \times 96$ | $3 \times 3$ | - | 2 | 0 | $27 \times 27 \times 96$ | 0 | $27 \times 27 \times 96 \times 4 = 0.27$ | $27 \times 27 \times 96 \times 9 = 0.63M$ | - |
| **LRN1** | Local Response Norm | $27 \times 27 \times 96$ | - | - | - | - | $27 \times 27 \times 96$ | 0 | $0.27$ | $27 \times 27 \times 96 \times 10 = 0.70M$ | - |
| **Conv2** | Convolution | $27 \times 27 \times 96$ | $5 \times 5$ | 256 | 1 | 2 (same) | $27 \times 27 \times 256$ | $614{,}656$ | $27 \times 27 \times 256 \times 4 = 0.71$ | $27 \times 27 \times 256 \times (5 \times 5 \times 96) = 447.9M$ | ReLU |
| **Pool2** | Max Pooling | $27 \times 27 \times 256$ | $3 \times 3$ | - | 2 | 0 | $13 \times 13 \times 256$ | 0 | $13 \times 13 \times 256 \times 4 = 0.17$ | $13 \times 13 \times 256 \times 9 = 0.39M$ | - |
| **LRN2** | Local Response Norm | $13 \times 13 \times 256$ | - | - | - | - | $13 \times 13 \times 256$ | 0 | $0.17$ | $13 \times 13 \times 256 \times 10 = 0.43M$ | - |
| **Conv3** | Convolution | $13 \times 13 \times 256$ | $3 \times 3$ | 384 | 1 | 1 (same) | $13 \times 13 \times 384$ | $885{,}120$ | $13 \times 13 \times 384 \times 4 = 0.25$ | $13 \times 13 \times 384 \times (3 \times 3 \times 256) = 149.5M$ | ReLU |
| **Conv4** | Convolution | $13 \times 13 \times 384$ | $3 \times 3$ | 384 | 1 | 1 (same) | $13 \times 13 \times 384$ | $1{,}327{,}488$ | $0.25$ | $13 \times 13 \times 384 \times (3 \times 3 \times 384) = 224.3M$ | ReLU |
| **Conv5** | Convolution | $13 \times 13 \times 384$ | $3 \times 3$ | 256 | 1 | 1 (same) | $13 \times 13 \times 256$ | $884{,}992$ | $0.17$ | $13 \times 13 \times 256 \times (3 \times 3 \times 384) = 149.5M$ | ReLU |
| **Pool3** | Max Pooling | $13 \times 13 \times 256$ | $3 \times 3$ | - | 2 | 0 | $6 \times 6 \times 256$ | 0 | $6 \times 6 \times 256 \times 4 = 0.04$ | $6 \times 6 \times 256 \times 9 = 0.08M$ | - |
| **Flatten** | Flatten | $6 \times 6 \times 256$ | - | - | - | - | $9{,}216$ | 0 | $9{,}216 \times 4 = 0.04$ | 0 | - |
| **FC1** | Fully Connected | $9{,}216$ | - | 4096 | - | - | $4{,}096$ | $37{,}752{,}832$ | $4{,}096 \times 4 = 0.016$ | $9{,}216 \times 4{,}096 = 37.7M$ | ReLU + Dropout (0.5) |
| **FC2** | Fully Connected | $4{,}096$ | - | 4096 | - | - | $4{,}096$ | $16{,}781{,}312$ | $0.016$ | $4{,}096 \times 4{,}096 = 16.8M$ | ReLU + Dropout (0.5) |
| **FC3** | Fully Connected | $4{,}096$ | - | 1000 | - | - | $1{,}000$ | $4{,}097{,}000$ | $1{,}000 \times 4 = 0.004$ | $4{,}096 \times 1{,}000 = 4.1M$ | Softmax |
| **Output** | Softmax | $1{,}000$ | - | - | - | - | $1{,}000$ | 0 | $0.004$ | $1{,}000 \times 5 = 0.005M$ | - |
| | | | | | | | **TOTAL** | **62,378,344** | **~4.0 MB** | **~1.14 GFLOPs** | |
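
The shapes, parameter counts, and multiply-add counts in the table follow from three small formulas. The sketch below recomputes them for the convolutional and fully connected layers only (pooling, LRN, and softmax costs are left out), assuming the single-GPU, ungrouped layout used in the table. The helper names are hypothetical.

```python
def conv_out(size, kernel, stride, pad):
    """Spatial output size of a convolution or pooling window."""
    return (size + 2 * pad - kernel) // stride + 1


def conv_params(kernel, c_in, c_out):
    """Weights plus one bias per output channel."""
    return kernel * kernel * c_in * c_out + c_out


def conv_macs(out_size, kernel, c_in, c_out):
    """Multiply-adds: one per weight per output position."""
    return out_size * out_size * c_out * kernel * kernel * c_in


# (name, kernel, c_out, stride, pad, followed_by_pool)
convs = [
    ("Conv1", 11, 96, 4, 0, True),
    ("Conv2", 5, 256, 1, 2, True),
    ("Conv3", 3, 384, 1, 1, False),
    ("Conv4", 3, 384, 1, 1, False),
    ("Conv5", 3, 256, 1, 1, True),
]

size, channels = 227, 3
total_params = total_macs = 0
for name, k, c_out, stride, pad, pooled in convs:
    size = conv_out(size, k, stride, pad)
    total_params += conv_params(k, channels, c_out)
    total_macs += conv_macs(size, k, channels, c_out)
    channels = c_out
    if pooled:
        size = conv_out(size, 3, 2, 0)  # overlapping 3x3 max pool, stride 2

# Fully connected layers: in_features * out_features weights plus biases.
features = size * size * channels
for out_features in (4096, 4096, 1000):
    total_params += features * out_features + out_features
    total_macs += features * out_features
    features = out_features

print(size, channels)  # 6 256
print(total_params)    # 62378344
print(total_macs)      # 1135256096 (conv + FC layers only)
```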



### Key Design Choices

| **Choice** | **Rationale** | **Impact** |
|------------|---------------|------------|
| **Large first kernel (11×11)** | Capture large receptive field early | Extract diverse low-level features |
| **Decreasing kernel sizes** | 11×11 → 5×5 → 3×3 | Balance computation and feature extraction |
| **Overlapping pooling** | Pool 3×3, stride 2 (not 2×2, stride 2) | Slight reduction in overfitting |
| **Deep FC layers (4096 units)** | High capacity for classification | Enables complex decision boundaries |
| **Dropout in FC only** | Conv layers have fewer parameters | Regularization where needed most |
| **No dropout in conv layers** | Conv layers less prone to overfitting | Retain feature extraction capacity |

### Limitations and Modern Improvements

| **AlexNet Feature** | **Limitation** | **Modern Solution** |
|---------------------|----------------|---------------------|
| **Local Response Normalization (LRN)** | Expensive, marginal benefit | Batch Normalization (more effective) |
| **Large FC layers** | 95% of parameters, overfitting | Global Average Pooling (GAP) |
| **Manual learning rate schedule** | Requires monitoring | Adaptive optimizers (Adam, AdamW) |
| **Fixed input size (227×227)** | Less flexible | Fully convolutional networks |
| **Two-GPU split** | Complex implementation | Better multi-GPU frameworks |
| **Large initial kernels (11×11)** | Expensive computation | Smaller kernels (3×3) stacked |

---





## VGG

The primary goal of this network was **to show how depth affects performance**.

**Design Rules**
1. **Simplicity** 
   - Convolution filters: 3x3, s=1, p=1
   - Max pools: 2x2, s=2
   - After each pooling layer, double the number of channels (up to 512)
2. **Homogeneity** - A consistent design pattern throughout the network.
3. **Small Kernels** - Replace large kernels with stacks of 3x3 kernels.
4. **Systematic Depth** - With a consistent design pattern, depth is simple to increase and measure.
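
The design rules are regular enough to generate a layer plan mechanically. The sketch below is one possible way to do that. The per-stage conv counts `[2, 2, 3, 3, 3]` and the 512-channel cap correspond to the VGG-16 configuration and are assumptions brought in from the VGG paper, and the function names are hypothetical.

```python
def vgg_plan(convs_per_stage, base_channels=64, max_channels=512):
    """Expand per-stage conv counts into a flat layer plan.

    Integers are 3x3 conv output channels; "M" marks a 2x2 max pool.
    """
    plan, channels = [], base_channels
    for n_convs in convs_per_stage:
        plan += [channels] * n_convs                # 3x3, s=1, p=1: size unchanged
        plan.append("M")                            # 2x2, s=2: size halves
        channels = min(channels * 2, max_channels)  # double after each pool
    return plan


def output_shape(plan, size=224):
    """Trace the feature-map shape through the plan."""
    channels = 3
    for layer in plan:
        if layer == "M":
            size //= 2
        else:
            channels = layer
    return size, size, channels


vgg16 = vgg_plan([2, 2, 3, 3, 3])
print(vgg16)
print(output_shape(vgg16))  # (7, 7, 512)
```

Because every stage follows the same pattern, changing the list of per-stage counts is all it takes to produce a shallower or deeper variant.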

<div align="center">
<img src="../images/chap8/VGG.png" width="195"/>
<img src="../images/chap8/VGGStruct.png" width="710"/>
</div>

### Measuring stacked Kernels vs. Large Kernels

**Receptive Field**

|Configuration | Receptive Field | Calculation | 
|--------------|-----------------|-------------|
|One 7x7 Conv  | 7x7 | $(7-1)\cdot1 + 1$ |
|One 5x5 Conv  | 5x5 | $(5-1)\cdot1 + 1$ | 
|Two 3x3 Conv  | 5x5 | $((3-1)\cdot 1 + (3-1)\cdot 1) +1$ | 
|Three 3x3 Conv| 7x7 | $((3-1)\cdot 1 + (3-1)\cdot 1 + (3-1)\cdot 1) + 1$ |
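
The calculations in the table are instances of a general rule: each layer widens the receptive field by $(k-1)$ times the product of the strides of the layers before it. A minimal sketch, with a hypothetical helper name:

```python
def receptive_field(layers):
    """Receptive field (in input pixels) of a stack of (kernel, stride) layers."""
    rf, jump = 1, 1
    for kernel, stride in layers:
        rf += (kernel - 1) * jump  # widen by (k-1) steps of the current spacing
        jump *= stride             # spacing between adjacent outputs, in input pixels
    return rf


print(receptive_field([(7, 1)]))      # 7
print(receptive_field([(5, 1)]))      # 5
print(receptive_field([(3, 1)] * 2))  # 5
print(receptive_field([(3, 1)] * 3))  # 7
```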

**Parameters**

Given $C$ input channels and $C$ output channels ($C \to C$), ignoring biases:

|Configuration | Parameters | Calculation | 
|--------------|-----------------|-------------|
|One 7x7 Conv  | $49C^2$ | $C \times C \times 7 \times 7$ |
|One 5x5 Conv  | $25C^2$ | $C \times C \times 5 \times 5$ | 
|Two 3x3 Conv  | $18C^2$| $2 \times (C \times C \times 3 \times 3)$ | 
|Three 3x3 Conv| $27C^2$ | $3 \times (C \times C \times 3 \times 3)$ | 

**Non-Linearity (Expressiveness)**

|Configuration | Non-linearities |
|--------------|-----------------|
|One 7x7 Conv  | 1 ReLU |
|One 5x5 Conv  | 1 ReLU |
|Two 3x3 Conv  | 2 ReLUs|
|Three 3x3 Conv| 3 ReLUs |

**Computational Costs (FLOPs)**

Given spatial dimensions $H \times W$ and $C \to C$ channels, counting each multiply-add as 2 FLOPs:

|Configuration | FLOPs | Calculation | 
|--------------|-----------------|-------------|
|One 7x7 Conv  | $98C^2HW$ | $H \times W \times C \times (7 \times 7 \times C \times 2)$ |
|One 5x5 Conv  | $50C^2HW$ | $H \times W \times C \times (5 \times 5 \times C \times 2)$ | 
|Two 3x3 Conv  | $36C^2HW$ | $2 \times (H \times W \times C \times (3 \times 3 \times C \times 2))$ | 
|Three 3x3 Conv| $54C^2HW$ | $3 \times (H \times W \times C \times (3 \times 3 \times C \times 2))$|
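
To see the parameter and FLOP formulas with concrete numbers, the sketch below compares one 7x7 conv against three stacked 3x3 convs. The sizes `C = 64` and `H = W = 56` are arbitrary illustrative choices, biases are ignored as in the tables, and the helper name is hypothetical.

```python
def conv_stack_cost(kernel, n_layers, channels, height, width):
    """Weights, FLOPs (2 per multiply-add), and ReLU count for n stacked C->C convs."""
    weights = n_layers * kernel * kernel * channels * channels
    flops = 2 * weights * height * width
    return {"weights": weights, "flops": flops, "relus": n_layers}


C, H, W = 64, 56, 56  # hypothetical sizes
one_7x7 = conv_stack_cost(7, 1, C, H, W)
three_3x3 = conv_stack_cost(3, 3, C, H, W)

saving = 1 - three_3x3["weights"] / one_7x7["weights"]
print(one_7x7["weights"], three_3x3["weights"])  # 200704 110592
print(f"{saving:.1%} fewer weights and FLOPs")   # 44.9% fewer weights and FLOPs
print(one_7x7["relus"], three_3x3["relus"])      # 1 3
```

The saving is the same for weights and FLOPs because both scale with the number of weights when the spatial size is fixed.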


### Conclusion 

<div align="center">
<img src="../images/chap8/VGGstats.png" width="795"/>
</div>

**VGG's Advantages: Stacked Small Kernels Win**

Compared to large kernels (5×5, 7×7), VGG's stacked 3×3 convolutions provide:

| **Metric** | **Advantage** | **Impact** |
|------------|---------------|------------|
| **Receptive Field** | Same coverage (three 3×3 = one 7×7) | Equivalent spatial context |
| **Parameters** | 45% fewer (27C² vs 49C²) | More efficient learning |
| **Expressiveness** | 3× as many ReLU activations | Better feature representations |
| **Computation** | 45% fewer FLOPs | Faster training and inference |
| **Accuracy** | 7.3% top-5 error | Among the best results of ILSVRC 2014 |

**VGG proved that depth + simplicity beats complexity.**

---

**VGG's Critical Weakness: Computational Inefficiency**

Despite its success, VGG has a major flaw:

| **Problem** | **Numbers** | **Issue** |
|-------------|-------------|-----------|
| **Memory consumption** | ~140M parameters (VGG-16) | Huge model size |
| **FLOPs per image** | ~15.5 billion operations | Very slow inference |
| **FC layer dominance** | 90% of parameters in FC layers | Inefficient parameter use |
| **Limited depth scaling** | Difficult to go beyond 19 layers | Diminishing returns |
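
The parameter figures can be checked by counting weights and biases layer by layer. The sketch below assumes the standard VGG-16 layout (thirteen 3x3 convs, five 2x2 pools, and FC layers of 4096, 4096, and 1000 units on a 224×224 input), which is taken from the VGG paper rather than from these notes. The function name is hypothetical.

```python
# VGG-16 layout: integers are 3x3 conv output channels, "M" is a 2x2 max pool.
VGG16 = [64, 64, "M", 128, 128, "M", 256, 256, 256, "M",
         512, 512, 512, "M", 512, 512, 512, "M"]


def count_params(plan, in_channels=3, size=224, fc_units=(4096, 4096, 1000)):
    """Return (conv parameters, FC parameters), biases included."""
    conv_params = 0
    for layer in plan:
        if layer == "M":
            size //= 2
        else:
            conv_params += 3 * 3 * in_channels * layer + layer
            in_channels = layer
    fc_params, features = 0, size * size * in_channels
    for units in fc_units:
        fc_params += features * units + units
        features = units
    return conv_params, fc_params


conv_p, fc_p = count_params(VGG16)
print(conv_p, fc_p)                     # 14714688 123642856
print(conv_p + fc_p)                    # 138357544
print(f"{fc_p / (conv_p + fc_p):.1%}")  # 89.4%
```

Most of the FC cost comes from the first layer, which connects the flattened 7×7×512 feature map to 4096 units.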

**The Question:** Can we build deeper, more accurate networks **without** exploding computation?

**The Answer:** GoogleNet (Inception) introduces **multi-scale feature extraction** and **1×1 convolutions** to dramatically reduce parameters while maintaining (or improving) accuracy, achieving similar performance with roughly **20× fewer parameters** than VGG-16 (~7M vs. ~138M).

---

## GoogleNet

GoogleNet (Inception v1, 2014; spelled "GoogLeNet" by its authors) was designed to address the inefficiency seen in VGG-style networks, with three core objectives: 
1. **Reduce Computational Cost**
    - VGG-16 has ~140M parameters and 15.5B FLOPs
    - GoogleNet has ~7M parameters
2. **Multi-Scale Feature Extraction**
    - A single kernel size captures only one scale
    - Process features at multiple scales **simultaneously**
3. **Deeper Networks without Exploding Parameters**
    - VGG's depth is limited by parameter growth
    - GoogleNet's bottleneck architecture enables deeper networks

### The Inception Module: Core Building Block

Before examining the full GoogleNet architecture, let's understand the **Inception module**, the fundamental innovation that makes GoogleNet efficient.

---

#### **The Problem: Which Kernel Size to Choose?**

Traditional CNNs force you to choose **one** kernel size per layer:
- **Large kernels (5×5, 7×7)**: Capture global patterns but expensive
- **Small kernels (3×3)**: Efficient but may miss larger patterns
- **1×1 kernels**: Fast but limited receptive field

**Question:** Why choose when you can use **all of them simultaneously**?

---

#### **Naive Inception Module**

<div align="center">
<img src="../images/chap8/NaiveIncept.png" width="600"/>
</div>

**Idea:** Apply multiple operations in parallel, then concatenate:

| **Operation** | **Purpose** | **Output Channels** |
|---------------|-------------|---------------------|
| **1×1 conv** | Capture point-wise features | 64 |
| **3×3 conv** | Capture local patterns | 128 |
| **5×5 conv** | Capture larger patterns | 32 |
| **3×3 max pool** | Preserve spatial information | (same as input) |

**Problem:** This is **computationally expensive**!

For input $28 \times 28 \times 256$:
- 5×5 conv alone: $28 \times 28 \times 32 \times (5 \times 5 \times 256 \times 2) \approx 321M$ FLOPs

---

#### **Inception Module with Dimensionality Reduction**

<div align="center">
<img src="../images/chap8/Inception.png" width="600"/>
</div>

**Key Innovation:** Use **1×1 convolutions as bottlenecks** before expensive operations.

**Architecture:**

- Branch 1: 1×1 conv
- Branch 2: 1×1 conv (reduce) → 3×3 conv
- Branch 3: 1×1 conv (reduce) → 5×5 conv
- Branch 4: 3×3 max pool → 1×1 conv
- The four branch outputs are concatenated along the channel dimension

---

#### **Why 1×1 Convolutions Work**

**1×1 convolutions** perform **dimensionality reduction**:

| **Step** | **Dimensions** | **Purpose** |
|----------|---------------|-------------|
| Input | $28 \times 28 \times 256$ | High-dimensional feature map |
| 1×1 conv | $28 \times 28 \times 16$ | **Compress** channels (256 → 16) |
| 5×5 conv | $28 \times 28 \times 32$ | Process with reduced input depth |

**Computational Savings:**

| **Method** | **FLOPs** | **Calculation** |
|------------|-----------|-----------------|
| **Direct 5×5 conv** | 321M | $28 \times 28 \times 32 \times (5 \times 5 \times 256 \times 2)$ |
| **With 1×1 bottleneck** | 26.5M | $(28^2 \times 16 \times 256 \times 2) + (28^2 \times 32 \times 25 \times 16 \times 2)$ |
| **Reduction** | **~92% fewer FLOPs** | ~12× more efficient |
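
The two formulas in the Computational Savings table can be evaluated directly. The sketch below does so for the 5×5 branch on a $28 \times 28 \times 256$ input with a 16-channel bottleneck, counting each multiply-add as 2 FLOPs and ignoring biases. The helper name is hypothetical.

```python
def conv_flops(height, width, kernel, c_in, c_out):
    """FLOPs for one conv layer, counting each multiply-add as 2 operations."""
    return height * width * c_out * (kernel * kernel * c_in * 2)


H = W = 28
direct = conv_flops(H, W, 5, 256, 32)
bottleneck = conv_flops(H, W, 1, 256, 16) + conv_flops(H, W, 5, 16, 32)

print(direct)      # 321126400
print(bottleneck)  # 26492928
print(f"{1 - bottleneck / direct:.2%} fewer FLOPs")
print(f"{direct / bottleneck:.1f}x cheaper")  # 12.1x cheaper
```

The saving depends on how aggressively the 1×1 layer compresses the channels, so other branches and modules see different ratios.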

---

#### **Multi-Scale Feature Extraction**

Each branch captures **different scales**:

| **Branch** | **Receptive Field** | **What It Captures** |
|------------|---------------------|----------------------|
| 1×1 conv | 1×1 | Point-wise cross-channel patterns |
| 3×3 conv | 3×3 | Local spatial patterns |
| 5×5 conv | 5×5 | Larger spatial patterns |
| Max pool + 1×1 | Preserves spatial info | Maintains strong activations |

**Result:** The network learns **which scale is most useful** for each task through training.
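
To make the concatenation concrete, the sketch below tallies output channels and weights for one module with bottlenecks. The branch widths are those of the "inception (3a)" row in the GoogLeNet paper (input $28 \times 28 \times 192$); they are an assumption brought in for illustration and do not appear in these notes. Biases are ignored and the function name is hypothetical.

```python
# Each branch is a list of (kernel, out_channels) convs applied in sequence.
# Widths follow the "inception (3a)" configuration (assumed for illustration).
C_IN = 192
BRANCHES = {
    "1x1": [(1, 64)],
    "1x1 -> 3x3": [(1, 96), (3, 128)],
    "1x1 -> 5x5": [(1, 16), (5, 32)],
    "pool -> 1x1": [(1, 32)],  # the 3x3 max pool itself has no weights
}


def module_summary(c_in, branches):
    """Return (concatenated output channels, total conv weights)."""
    out_channels, weights = 0, 0
    for convs in branches.values():
        channels = c_in
        for kernel, c_out in convs:
            weights += kernel * kernel * channels * c_out
            channels = c_out
        out_channels += channels  # concatenation adds up the branch widths
    return out_channels, weights


print(module_summary(C_IN, BRANCHES))  # (256, 163328)
```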

---

#### **Key Advantages**

| **Advantage** | **How It's Achieved** |
|---------------|----------------------|
| **Multi-scale processing** | Parallel branches with different kernel sizes |
| **Computational efficiency** | 1×1 bottleneck convolutions before the 3×3 and 5×5 convolutions |
| **Feature diversity** | Concatenate outputs from all branches |
| **Network depth** | Can stack many modules without exploding cost |

---

**Next:** See how GoogleNet stacks 9 of these Inception modules to create a 22-layer network with only 7M parameters! 🚀



### GoogleNet Full Architecture

<div align="center">
<img src="../images/chap8/FullGNet.png" width="1000"/>
</div>

---

#### **🔵 Stem: Initial Layers (Blue)**

**Purpose:** Aggressive downsampling and feature extraction from raw images.

**Structure:**
- Conv 7×7, stride=2 → Pool 3×3, stride=2
- Conv 1×1 → Conv 3×3 → Pool 3×3, stride=2
- Reduces $224 \times 224 \times 3$ → $28 \times 28 \times 192$

**Why?**
- **Large kernels (7×7)**: Capture low-level features (edges, textures)
- **Aggressive pooling**: Reduce spatial dimensions quickly (224→28)
- **Traditional design**: Similar to AlexNet/VGG before Inception modules begin
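
The 224→28 reduction can be traced layer by layer. The sketch below assumes padding of 3 for the 7×7 conv, padding of 1 for the 3×3 conv, and pooling that rounds the output size up; these choices are common in implementations but are not stated in these notes. The helper name is hypothetical.

```python
import math


def out_size(size, kernel, stride, pad=0, ceil_mode=False):
    """Spatial output size of a conv or pooling layer."""
    span = (size + 2 * pad - kernel) / stride
    return (math.ceil(span) if ceil_mode else math.floor(span)) + 1


size = 224
size = out_size(size, 7, 2, pad=3)           # conv 7x7, stride 2 -> 112
size = out_size(size, 3, 2, ceil_mode=True)  # pool 3x3, stride 2 -> 56
size = out_size(size, 1, 1)                  # conv 1x1           -> 56
size = out_size(size, 3, 1, pad=1)           # conv 3x3           -> 56
size = out_size(size, 3, 2, ceil_mode=True)  # pool 3x3, stride 2 -> 28
print(size)  # 28
```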

---

#### **🔴 Inception Body: 9 Inception Modules (Maroon/Red)**

**Purpose:** Multi-scale feature extraction with computational efficiency.

**Structure:**
- **9 stacked Inception modules** organized in groups
- Spatial dimensions: $28 \times 28 \to 14 \times 14 \to 7 \times 7$
- Max pooling between module groups for downsampling

**Why?**
- **Multi-scale processing**: Each module captures 1×1, 3×3, 5×5 features simultaneously
- **Bottleneck architecture**: 1×1 convs reduce computation by ~85%
- **Deep without explosion**: 22 layers total, only 7M parameters

---

#### **🟣 Auxiliary Classifiers (Purple)**

**Purpose:** Combat vanishing gradients during training in deep networks.

**Structure:**
- Two auxiliary branches attached to **intermediate layers**
- Each contains: AvgPool → 1×1 Conv → FC → Softmax
- Weighted loss (0.3×) added to main loss during training

**Why?**
- **Gradient injection**: Provide direct gradient signal to middle layers
- **Regularization**: Act as implicit regularization (similar to dropout)
- **Discarded at inference**: Only used during training, removed for deployment
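
The weighted loss can be written as a one-line combination. This is a minimal sketch with a hypothetical function name and made-up loss values; at inference time the auxiliary branches are dropped, so only the main output is used.

```python
def training_loss(main_loss, aux_losses, aux_weight=0.3):
    """Main loss plus down-weighted auxiliary losses (used during training only)."""
    return main_loss + aux_weight * sum(aux_losses)


# Hypothetical loss values for the main head and the two auxiliary heads.
total = training_loss(2.0, [2.5, 3.0])
print(round(total, 2))  # 3.65
```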

---

#### **🟢 Final Classifier (Green)**

**Purpose:** Global aggregation and classification.

**Structure:**
- Global Average Pooling (GAP): $7 \times 7 \times 1024 \to 1 \times 1 \times 1024$
- Dropout (0.4)
- Fully Connected: $1024 \to 1000$ classes
- Softmax activation

**Why?**
- **GAP replaces large FC layers**: Reduces parameters dramatically (no 4096-unit layers like VGG)
- **Translation invariance**: GAP makes network robust to input shifts
- **Dropout**: Prevents overfitting in final classification layer
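
Global average pooling itself is a per-channel mean over all spatial positions. The sketch below shows it on a tiny feature map with made-up values, then compares the classifier size after GAP with a hypothetical head that flattens the $7 \times 7 \times 1024$ map instead. The function and variable names are illustrative.

```python
def global_average_pool(feature_map):
    """Average each channel over all spatial positions.

    feature_map is indexed [row][col][channel]; returns one value per channel.
    """
    height, width = len(feature_map), len(feature_map[0])
    channels = len(feature_map[0][0])
    sums = [0.0] * channels
    for row in feature_map:
        for pixel in row:
            for c, value in enumerate(pixel):
                sums[c] += value
    return [s / (height * width) for s in sums]


# Tiny 2x2 map with 3 channels (hypothetical values).
fmap = [[[1, 2, 3], [3, 2, 1]],
        [[5, 2, 0], [7, 2, 4]]]
print(global_average_pool(fmap))  # [4.0, 2.0, 2.0]

# Classifier parameters (weights + biases) on a 7x7x1024 feature map:
gap_head = 1024 * 1000 + 1000           # GAP, then FC 1024 -> 1000
flat_head = 7 * 7 * 1024 * 1000 + 1000  # flatten, then FC 50176 -> 1000
print(gap_head, flat_head)  # 1025000 50177000
```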

---

### **Key Architectural Innovations**

| **Section** | **Innovation** | **Benefit** |
|-------------|---------------|-------------|
| **Stem** | Traditional conv layers | Efficient initial downsampling |
| **Inception Body** | Parallel multi-scale convolutions | Rich features, low cost |
| **Auxiliary Classifiers** | Mid-network supervision | Better gradient flow |
| **Final Classifier** | Global Average Pooling | Minimal parameters, strong generalization |

### **Key Results**

- **22 layers deep** with only **7M parameters** (VGG-16: 138M)
- **~1.5B FLOPs** per image (VGG-16: 15.5B)
- **6.7% top-5 error** on ImageNet (better than VGG's 7.3%)
- **~20× fewer parameters** than VGG-16 with superior accuracy
