# Advanced CNN Architectures: A Deep Dive

**From LeNet to Vision Transformers -- The Complete Evolution of Visual Recognition**

---

This notebook provides a comprehensive, implementation-focused tour of the most influential
convolutional neural network architectures. Each section includes:

- The core **insight** that made the architecture important
- A **from-scratch implementation** in TensorFlow / Keras
- Practical **design guidelines** for your own projects

**Audience:** Intermediate deep-learning practitioners through production ML engineers.

**Prerequisites:** Basic understanding of convolutions, pooling, and backpropagation.

## 1. The Evolution Timeline

```
1998        2012        2014        2014        2015        2017        2019        2022        2020+
 |           |           |           |           |           |           |           |           |
LeNet --> AlexNet --> VGG ----> GoogLeNet --> ResNet --> DenseNet --> EfficientNet --> ConvNeXt --> ViT / Swin
 |           |           |           |           |           |           |           |           |
First     GPU +       Deeper     Multi-      Skip       Dense      Compound    Modernized   Self-
practical Dropout +   uniform    scale       connections feature    scaling     CNNs with    attention
CNN       ReLU +      3x3        feature     solve      reuse      balances    Transformer  replaces
          Data Aug    stacks     extraction  degradation            W x D x R  ideas        convolution
```

### Key Innovations at Each Stage

| Architecture | Year | Key Innovation | Parameters | ImageNet Top-1 Acc |
|:------------|:----:|:---------------|:----------:|:------------------:|
| **LeNet-5** | 1998 | First practical CNN (digit recognition) | 60 K | -- |
| **AlexNet** | 2012 | GPU training, ReLU, Dropout, Data Augmentation | 61 M | 63.3 % |
| **VGG-16** | 2014 | Uniform 3x3 convolutions, deeper networks | 138 M | 74.4 % |
| **GoogLeNet / Inception v1** | 2014 | Multi-scale (Inception) modules, 1x1 conv | 6.8 M | 74.8 % |
| **ResNet-50** | 2015 | Residual (skip) connections | 25.6 M | 76.0 % |
| **DenseNet-121** | 2017 | Dense connections -- within a dense block, every layer feeds every later layer | 8 M | 74.9 % |
| **EfficientNet-B0** | 2019 | Compound scaling of width, depth, resolution | 5.3 M | 77.1 % |
| **EfficientNet-B7** | 2019 | Scaled-up compound model | 66 M | 84.3 % |
| **ConvNeXt-T** | 2022 | Transformer design principles applied to pure CNNs | 29 M | 82.1 % |
| **ViT-B/16** | 2020 | Pure self-attention on image patches | 86 M | 77.9 % (ImageNet-1K only) |
| **Swin-T** | 2021 | Shifted-window local attention + hierarchical features | 29 M | 81.3 % |

> **Takeaway:** More parameters do not automatically mean higher accuracy. Architectural
> innovations -- skip connections, multi-scale processing, attention, and principled scaling --
> matter more than raw size.

## 2. Environment Setup

In [None]:
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Model
import numpy as np
import matplotlib.pyplot as plt
import time

print(f"TensorFlow : {tf.__version__}")
print(f"Keras      : {keras.__version__}")
print(f"GPU available: {len(tf.config.list_physical_devices('GPU')) > 0}")

# Reproducibility
tf.random.set_seed(42)
np.random.seed(42)

## 3. VGG -- "Deeper Is Better" with Uniform 3x3 Convolutions

### Core Insight

The VGG team (Simonyan & Zisserman, 2014) showed that **stacking many small (3x3)
convolution filters** is more effective than using fewer large filters:

- Two stacked 3x3 convolutions have the **same receptive field** as one 5x5
  convolution, but use fewer parameters (2 x 3x3x C^2 = 18C^2 vs. 25C^2) and
  introduce an extra non-linearity.
- Three stacked 3x3 convolutions match a 7x7 receptive field.
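
The arithmetic behind these two bullets can be checked with a short sketch. The helper names (`stacked_receptive_field`, `conv_weights`) and the channel count `C = 256` are illustrative choices, not taken from the VGG paper; biases are ignored and every layer is assumed to have `C` input and `C` output channels.

```python
def stacked_receptive_field(num_layers, kernel_size=3):
    """Receptive field of `num_layers` stacked stride-1 convolutions."""
    return 1 + num_layers * (kernel_size - 1)


def conv_weights(kernel_size, channels):
    """Weights in one conv layer with `channels` inputs and outputs (bias ignored)."""
    return kernel_size * kernel_size * channels * channels


C = 256  # illustrative channel count (assumption)

for num_layers, big_kernel in [(2, 5), (3, 7)]:
    rf = stacked_receptive_field(num_layers)
    stacked = num_layers * conv_weights(3, C)
    single = conv_weights(big_kernel, C)
    print(f"{num_layers} x 3x3: receptive field {rf}x{rf}, "
          f"{stacked:,} weights vs. {single:,} for one {big_kernel}x{big_kernel}")
```

For the three-layer case the saving is larger still: 27C^2 weights for the stack against 49C^2 for a single 7x7 layer.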

### Architecture Pattern

```
[Conv3x3 -> Conv3x3 -> MaxPool] x 2   (64, 128 filters)
[Conv3x3 -> Conv3x3 -> Conv3x3 -> MaxPool] x 3   (256, 512, 512 filters)
Flatten -> FC-4096 -> FC-4096 -> FC-1000 -> Softmax
```

### Limitations

- **138 M parameters** -- most in the fully-connected layers
- No skip connections, which makes plain stacks like this hard to optimize much beyond ~20 layers
- No batch normalization (published before BN was introduced)

In [None]:
def build_vgg16(input_shape=(224, 224, 3), num_classes=1000):
    """Build VGG-16 from scratch using the Keras Functional API."""
    inputs = keras.Input(shape=input_shape)

    # Block 1
    x = layers.Conv2D(64, 3, padding='same', activation='relu', name='block1_conv1')(inputs)
    x = layers.Conv2D(64, 3, padding='same', activation='relu', name='block1_conv2')(x)
    x = layers.MaxPooling2D(2, strides=2, name='block1_pool')(x)

    # Block 2
    x = layers.Conv2D(128, 3, padding='same', activation='relu', name='block2_conv1')(x)
    x = layers.Conv2D(128, 3, padding='same', activation='relu', name='block2_conv2')(x)
    x = layers.MaxPooling2D(2, strides=2, name='block2_pool')(x)

    # Block 3
    x = layers.Conv2D(256, 3, padding='same', activation='relu', name='block3_conv1')(x)
    x = layers.Conv2D(256, 3, padding='same', activation='relu', name='block3_conv2')(x)
    x = layers.Conv2D(256, 3, padding='same', activation='relu', name='block3_conv3')(x)
    x = layers.MaxPooling2D(2, strides=2, name='block3_pool')(x)

    # Block 4
    x = layers.Conv2D(512, 3, padding='same', activation='relu', name='block4_conv1')(x)
    x = layers.Conv2D(512, 3, padding='same', activation='relu', name='block4_conv2')(x)
    x = layers.Conv2D(512, 3, padding='same', activation='relu', name='block4_conv3')(x)
    x = layers.MaxPooling2D(2, strides=2, name='block4_pool')(x)

    # Block 5
    x = layers.Conv2D(512, 3, padding='same', activation='relu', name='block5_conv1')(x)
    x = layers.Conv2D(512, 3, padding='same', activation='relu', name='block5_conv2')(x)
    x = layers.Conv2D(512, 3, padding='same', activation='relu', name='block5_conv3')(x)
    x = layers.MaxPooling2D(2, strides=2, name='block5_pool')(x)

    # Classifier head
    x = layers.Flatten(name='flatten')(x)
    x = layers.Dense(4096, activation='relu', name='fc1')(x)
    x = layers.Dropout(0.5)(x)
    x = layers.Dense(4096, activation='relu', name='fc2')(x)
    x = layers.Dropout(0.5)(x)
    outputs = layers.Dense(num_classes, activation='softmax', name='predictions')(x)

    return Model(inputs, outputs, name='VGG16')


vgg16 = build_vgg16()
vgg16.summary(show_trainable=True, expand_nested=False)

In [None]:
# Parameter distribution analysis
conv_params = sum(l.count_params() for l in vgg16.layers if 'conv' in l.name)
fc_params   = sum(l.count_params() for l in vgg16.layers if 'fc' in l.name or 'predictions' in l.name)
total       = vgg16.count_params()

print(f"Convolutional parameters : {conv_params:>12,}  ({100*conv_params/total:.1f}%)")
print(f"Fully-connected parameters: {fc_params:>12,}  ({100*fc_params/total:.1f}%)")
print(f"Total parameters          : {total:>12,}")
print()
print("Note: ~90% of VGG-16 parameters live in the FC layers.")
print("This is one reason modern architectures replaced FC layers with Global Average Pooling.")

## 4. ResNet -- The Skip Connection Revolution

### The Degradation Problem

Before ResNet, researchers observed a surprising phenomenon: **adding more layers to
a deep network eventually _increased_ the training error** (not just test error).
This is not caused by overfitting -- it is a fundamental optimization difficulty.

If a shallower network can reach a certain accuracy, a deeper network should be able
to do at least as well by learning identity mappings in the extra layers. In practice,
optimizers struggle to learn these identity mappings through stacked non-linear layers.

### The Residual Learning Solution

He et al. (2015) introduced **skip connections** (also called shortcut or residual
connections):

```
x ----+----> Conv -> BN -> ReLU -> Conv -> BN ---(+)---> ReLU -> output
      |                                           |
      +-------------- identity --------------------+
```

Instead of learning a mapping `H(x)` directly, the network learns the **residual**
`F(x) = H(x) - x`. If the identity mapping is optimal, the network only needs to
drive `F(x)` to zero, which is much easier.
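
A toy NumPy sketch of this argument follows. It is not the notebook's ResNet code: dense matrices stand in for convolutions, batch normalization is omitted, and the function names are hypothetical. With all weights set to zero, the residual block passes its (non-negative) input through unchanged, while the plain block collapses to zero.

```python
import numpy as np


def relu(z):
    return np.maximum(z, 0.0)


def plain_block(x, w1, w2):
    """Two stacked layers with no shortcut."""
    return relu(relu(x @ w1) @ w2)


def residual_block(x, w1, w2):
    """Same two layers, but the input is added back before the final ReLU."""
    return relu(x + relu(x @ w1) @ w2)


rng = np.random.default_rng(0)
x = relu(rng.normal(size=(4, 8)))   # non-negative, like post-ReLU activations
zeros = np.zeros((8, 8))            # F(x) = 0 everywhere

plain_out = plain_block(x, zeros, zeros)
res_out = residual_block(x, zeros, zeros)

print("plain block reproduces input   :", np.allclose(plain_out, x))
print("residual block reproduces input:", np.allclose(res_out, x))
```

The plain block would have to learn weights that reproduce `x` exactly; the residual block gets the identity for free and only has to learn the correction.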

### Two Types of Shortcuts

1. **Identity shortcut** -- used when input and output dimensions match.
2. **Projection shortcut** -- a 1x1 convolution with stride to match dimensions
   when spatial size or channel count changes.

### ResNet Bottleneck Block (used in ResNet-50/101/152)

```
1x1 Conv (reduce channels) -> 3x3 Conv -> 1x1 Conv (expand channels)
```

At a given block width this is cheaper than stacking two 3x3 convolutions, because
the expensive 3x3 convolution runs on the reduced channel count.
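
A rough weight count makes the saving concrete. The sketch below ignores biases and batch-norm parameters; the helper names are hypothetical, and the widths (256 channels with a 64-channel bottleneck) are those of the first bottleneck stage in `build_resnet50`.

```python
def conv2d_weights(kernel_size, in_channels, out_channels):
    """Weights in one conv layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


def bottleneck_weights(width, reduced):
    """1x1 reduce -> 3x3 -> 1x1 expand."""
    return (conv2d_weights(1, width, reduced)
            + conv2d_weights(3, reduced, reduced)
            + conv2d_weights(1, reduced, width))


def basic_block_weights(width):
    """Two 3x3 convolutions at full width."""
    return 2 * conv2d_weights(3, width, width)


bottleneck_256 = bottleneck_weights(256, 64)
basic_256 = basic_block_weights(256)
basic_64 = basic_block_weights(64)

print(f"Bottleneck, 256 channels : {bottleneck_256:>10,}")
print(f"Two 3x3 convs, 256 ch.   : {basic_256:>10,}")
print(f"Two 3x3 convs, 64 ch.    : {basic_64:>10,}")
```

Under these assumptions the 256-channel bottleneck costs about as much as a two-layer 3x3 block at only 64 channels, and roughly 17x less than the same block at 256 channels.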

In [None]:
class ResidualBlock(layers.Layer):
    """Bottleneck residual block for ResNet-50/101/152.
    
    Structure: 1x1 -> 3x3 -> 1x1 with optional projection shortcut.
    """

    def __init__(self, filters, stride=1, use_projection=False, **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.stride = stride

        # Bottleneck layers
        self.conv1 = layers.Conv2D(filters, 1, strides=stride, padding='same', use_bias=False)
        self.bn1   = layers.BatchNormalization()
        self.conv2 = layers.Conv2D(filters, 3, padding='same', use_bias=False)
        self.bn2   = layers.BatchNormalization()
        self.conv3 = layers.Conv2D(filters * 4, 1, padding='same', use_bias=False)
        self.bn3   = layers.BatchNormalization()

        # Projection shortcut (when dimensions change)
        self.projection = None
        if use_projection:
            self.projection = keras.Sequential([
                layers.Conv2D(filters * 4, 1, strides=stride, use_bias=False),
                layers.BatchNormalization()
            ])

    def call(self, x, training=False):
        shortcut = x

        x = tf.nn.relu(self.bn1(self.conv1(x), training=training))
        x = tf.nn.relu(self.bn2(self.conv2(x), training=training))
        x = self.bn3(self.conv3(x), training=training)  # No ReLU before addition

        if self.projection is not None:
            shortcut = self.projection(shortcut, training=training)

        return tf.nn.relu(x + shortcut)


def build_resnet50(input_shape=(224, 224, 3), num_classes=1000):
    """Build ResNet-50 from scratch."""
    inputs = keras.Input(shape=input_shape)

    # Stem
    x = layers.Conv2D(64, 7, strides=2, padding='same', use_bias=False, name='stem_conv')(inputs)
    x = layers.BatchNormalization(name='stem_bn')(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D(3, strides=2, padding='same')(x)

    # Residual stages: (filters, num_blocks, first_block_stride)
    stage_configs = [
        (64,  3, 1),   # Stage 2  -- no downsampling (already pooled)
        (128, 4, 2),   # Stage 3
        (256, 6, 2),   # Stage 4
        (512, 3, 2),   # Stage 5
    ]

    for stage_idx, (filters, num_blocks, stride) in enumerate(stage_configs):
        # First block may change dimensions
        x = ResidualBlock(filters, stride=stride, use_projection=True,
                          name=f'stage{stage_idx+2}_block1')(x)
        for block_idx in range(1, num_blocks):
            x = ResidualBlock(filters, name=f'stage{stage_idx+2}_block{block_idx+1}')(x)

    # Head
    x = layers.GlobalAveragePooling2D(name='avg_pool')(x)
    outputs = layers.Dense(num_classes, activation='softmax', name='predictions')(x)

    return Model(inputs, outputs, name='ResNet50')


resnet50 = build_resnet50()
print(f"ResNet-50 total parameters: {resnet50.count_params():,}")

### Visualizing Gradient Flow: With vs. Without Skip Connections

The following simulation illustrates how gradients propagate through a deep network
**with** and **without** skip connections.

In [None]:
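def simulate_gradient_flow(depth, use_skip=False, weight_scale=0.5, seed=0):
    """Simulate gradient magnitudes through a chain of scalar 'layers'.

    Toy model: each layer's Jacobian is a single random scalar, so this
    illustrates the trend only, not the behavior of a real network.
    """
    rng = np.random.default_rng(seed)  # local RNG; leaves the global seed untouched
    gradient = 1.0
    gradients = [gradient]
    for _ in range(depth):
        layer_jacobian = rng.standard_normal() * weight_scale
        if use_skip:
            # Identity path adds 1 to the layer Jacobian: d(x + F(x))/dx = 1 + F'(x)
            gradient = gradient * (1.0 + layer_jacobian)
        else:
            gradient = gradient * layer_jacobian
        gradients.append(abs(gradient))
    return gradients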

depth = 50
grads_plain = simulate_gradient_flow(depth, use_skip=False)
grads_skip  = simulate_gradient_flow(depth, use_skip=True)

fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].semilogy(grads_plain, 'r-', linewidth=2)
axes[0].set_title('Plain Network (no skip connections)', fontsize=13)
axes[0].set_xlabel('Layer depth')
axes[0].set_ylabel('Gradient magnitude (log scale)')
axes[0].axhline(1.0, color='gray', linestyle='--', alpha=0.5)
axes[0].set_ylim([1e-20, 1e20])

axes[1].semilogy(grads_skip, 'b-', linewidth=2)
axes[1].set_title('Residual Network (with skip connections)', fontsize=13)
axes[1].set_xlabel('Layer depth')
axes[1].set_ylabel('Gradient magnitude (log scale)')
axes[1].axhline(1.0, color='gray', linestyle='--', alpha=0.5)
axes[1].set_ylim([1e-20, 1e20])

plt.suptitle('Gradient Flow Simulation (50 layers)', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print(f"Plain network  -- final gradient magnitude: {grads_plain[-1]:.2e}")
print(f"ResNet         -- final gradient magnitude: {grads_skip[-1]:.2e}")

## 5. Inception / GoogLeNet -- Multi-Scale Feature Extraction

### Core Insight

Szegedy et al. (2014) asked: _why choose a single filter size when you can use
multiple sizes in parallel?_

The **Inception module** applies 1x1, 3x3, and 5x5 convolutions **plus** max
pooling simultaneously, then concatenates the results along the channel axis.

### The 1x1 Convolution Trick

Naive concatenation of multi-scale outputs explodes the channel count. The key
innovation is using **1x1 convolutions as bottlenecks** to reduce dimensionality
_before_ the expensive 3x3 and 5x5 operations:

```
Input (28x28x192)
  |         |              |              |
1x1(64)   1x1(96)        1x1(16)       MaxPool3x3
  |       3x3(128)       5x5(32)        1x1(32)
  |         |              |              |
  +-------- Concat along channels --------+
Output (28x28x256)  <-- controlled growth
```
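
The effect of the bottleneck is easy to quantify. The following sketch counts weights (biases ignored) for the 3x3 and 5x5 branches with and without the 1x1 reduction, assuming the 192-channel input and the branch widths of the `inception_3a` configuration; the helper name `conv2d_weights` is hypothetical.

```python
def conv2d_weights(kernel_size, in_channels, out_channels):
    """Weights in one conv layer (bias ignored)."""
    return kernel_size * kernel_size * in_channels * out_channels


IN_CHANNELS = 192  # inception_3a input depth (assumption for this example)

# 5x5 branch: 32 output filters, optionally preceded by a 1x1 reduction to 16
direct_5x5 = conv2d_weights(5, IN_CHANNELS, 32)
reduced_5x5 = conv2d_weights(1, IN_CHANNELS, 16) + conv2d_weights(5, 16, 32)

# 3x3 branch: 128 output filters, optionally preceded by a 1x1 reduction to 96
direct_3x3 = conv2d_weights(3, IN_CHANNELS, 128)
reduced_3x3 = conv2d_weights(1, IN_CHANNELS, 96) + conv2d_weights(3, 96, 128)

print(f"5x5 branch: {direct_5x5:,} weights direct, {reduced_5x5:,} with 1x1 reduction")
print(f"3x3 branch: {direct_3x3:,} weights direct, {reduced_3x3:,} with 1x1 reduction")
```

Multiply-accumulate counts scale the same way, since each weight is applied once per spatial position (28 x 28 here).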

### Why It Matters

- GoogLeNet achieved **higher accuracy than VGG with roughly 20x fewer parameters** (6.8M vs 138M)
- The multi-scale philosophy influenced many later architectures
- 1x1 convolutions became a standard building block

In [None]:
class InceptionModule(layers.Layer):
    """Inception module with dimension reduction (Inception v1).
    
    Args:
        f1:    filters for the 1x1 branch
        f3_r:  filters for the 1x1 reduction before 3x3
        f3:    filters for the 3x3 branch
        f5_r:  filters for the 1x1 reduction before 5x5
        f5:    filters for the 5x5 branch
        fpool: filters for the 1x1 after max-pool
    """

    def __init__(self, f1, f3_r, f3, f5_r, f5, fpool, **kwargs):
        super().__init__(**kwargs)

        # Branch 1: 1x1
        self.branch1 = layers.Conv2D(f1, 1, activation='relu', padding='same')

        # Branch 2: 1x1 -> 3x3
        self.branch2_reduce = layers.Conv2D(f3_r, 1, activation='relu', padding='same')
        self.branch2_conv   = layers.Conv2D(f3,   3, activation='relu', padding='same')

        # Branch 3: 1x1 -> 5x5
        self.branch3_reduce = layers.Conv2D(f5_r, 1, activation='relu', padding='same')
        self.branch3_conv   = layers.Conv2D(f5,   5, activation='relu', padding='same')

        # Branch 4: MaxPool -> 1x1
        self.branch4_pool = layers.MaxPooling2D(3, strides=1, padding='same')
        self.branch4_conv = layers.Conv2D(fpool, 1, activation='relu', padding='same')

    def call(self, x):
        b1 = self.branch1(x)
        b2 = self.branch2_conv(self.branch2_reduce(x))
        b3 = self.branch3_conv(self.branch3_reduce(x))
        b4 = self.branch4_conv(self.branch4_pool(x))
        return tf.concat([b1, b2, b3, b4], axis=-1)


# Quick test -- Inception module with GoogLeNet inception_3a config
inp = keras.Input(shape=(28, 28, 192))
out = InceptionModule(64, 96, 128, 16, 32, 32, name='inception_3a')(inp)
test_model = Model(inp, out)
print(f"Input shape : {inp.shape}")
print(f"Output shape: {test_model.output_shape}")
print(f"Parameters  : {test_model.count_params():,}")
print(f"Output channels = 64 + 128 + 32 + 32 = {64+128+32+32}")

## 6. DenseNet -- Dense Connections and Feature Reuse

### Core Insight

Huang et al. (2017) took skip connections to their logical extreme: **every layer
receives feature maps from _all_ preceding layers** in its dense block.

```
x0 ----+---------+---------+---------->
       |         |         |
       v         |         |
     [BN-ReLU-Conv]        |         |
       = x1      |         |
       |  +------+         |
       v  v                |
     [BN-ReLU-Conv]        |
       = x2                |
       |  +---+---+--------+
       v  v   v   v
     [BN-ReLU-Conv]
       = x3
```

Each layer concatenates (not adds) all previous feature maps.

### Growth Rate (k)

Each layer produces `k` new feature maps (the **growth rate**). After `L` layers,
the block outputs `k0 + L * k` channels, where `k0` is the number of channels
entering the block. A typical value is `k = 32`.

### Bottleneck Layer (DenseNet-B)

To control computation: `BN -> ReLU -> 1x1 Conv(4k) -> BN -> ReLU -> 3x3 Conv(k)`

### Transition Layer

Between dense blocks: `BN -> ReLU -> 1x1 Conv (compression) -> 2x2 AvgPool`
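
Combining the growth-rate formula with transition compression gives the channel count at every stage. The sketch below assumes the DenseNet-121 settings from the paper (block sizes 6, 12, 24, 16; 64 channels after the stem; `k = 32`; compression factor 0.5); the function name and the trace format are illustrative.

```python
def densenet_channel_trace(block_layers=(6, 12, 24, 16), k0=64,
                           growth_rate=32, compression=0.5):
    """Track channel counts through dense blocks and transition layers."""
    channels = k0
    trace = []
    for i, num_layers in enumerate(block_layers):
        channels += num_layers * growth_rate          # k0 + L * k
        trace.append((f"dense_block_{i + 1}", channels))
        if i < len(block_layers) - 1:                 # no transition after the last block
            channels = int(channels * compression)
            trace.append((f"transition_{i + 1}", channels))
    return trace


trace = densenet_channel_trace()
for name, channels in trace:
    print(f"{name:<14} -> {channels:>5} channels")
```

Without the transition layers the channel count would grow monotonically through the whole network; compression keeps the input to each block manageable.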

### Benefits

- Strong gradient flow (every layer has direct access to the loss gradient)
- Feature reuse reduces parameter count
- Implicit deep supervision
- DenseNet-121 achieves comparable accuracy to ResNet with **fewer parameters**

In [None]:
class DenseLayer(layers.Layer):
    """Single layer within a DenseNet dense block (BN-ReLU-1x1-BN-ReLU-3x3)."""

    def __init__(self, growth_rate, **kwargs):
        super().__init__(**kwargs)
        self.bn1   = layers.BatchNormalization()
        self.conv1 = layers.Conv2D(4 * growth_rate, 1, use_bias=False, padding='same')
        self.bn2   = layers.BatchNormalization()
        self.conv2 = layers.Conv2D(growth_rate, 3, use_bias=False, padding='same')

    def call(self, x, training=False):
        out = self.conv1(tf.nn.relu(self.bn1(x, training=training)))
        out = self.conv2(tf.nn.relu(self.bn2(out, training=training)))
        return tf.concat([x, out], axis=-1)  # Dense concatenation


class DenseBlock(layers.Layer):
    """A dense block consisting of `num_layers` dense layers."""

    def __init__(self, num_layers, growth_rate, **kwargs):
        super().__init__(**kwargs)
        self.dense_layers = [
            DenseLayer(growth_rate, name=f'dense_layer_{i}')
            for i in range(num_layers)
        ]

    def call(self, x, training=False):
        for layer in self.dense_layers:
            x = layer(x, training=training)
        return x


class TransitionLayer(layers.Layer):
    """Transition between dense blocks: 1x1 conv + 2x2 avg pool."""

    def __init__(self, out_channels, **kwargs):
        super().__init__(**kwargs)
        self.bn   = layers.BatchNormalization()
        self.conv = layers.Conv2D(out_channels, 1, use_bias=False, padding='same')
        self.pool = layers.AveragePooling2D(2, strides=2)

    def call(self, x, training=False):
        x = self.conv(tf.nn.relu(self.bn(x, training=training)))
        return self.pool(x)


# Demonstrate a dense block
inp = keras.Input(shape=(32, 32, 64))
out = DenseBlock(num_layers=6, growth_rate=32, name='dense_block_1')(inp)
demo = Model(inp, out)
print(f"Input channels : 64")
print(f"Growth rate    : 32")
print(f"Layers         : 6")
print(f"Output channels: {demo.output_shape[-1]}  (64 + 6*32 = {64 + 6*32})")
print(f"Parameters     : {demo.count_params():,}")

## 7. EfficientNet -- Compound Scaling

### Core Insight

Tan & Le (2019) observed that network **width** (channels), **depth** (layers),
and input **resolution** should be scaled together, not independently.

### Compound Scaling Rule

Given a compound coefficient `phi`:

```
depth multiplier       d = alpha ^ phi
width multiplier       w = beta  ^ phi
resolution multiplier  r = gamma ^ phi

subject to: alpha * beta^2 * gamma^2 ~ 2,  with alpha, beta, gamma >= 1
```

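For EfficientNet: alpha=1.2, beta=1.1, gamma=1.15. FLOPs grow roughly in proportion to
depth x width^2 x resolution^2, so under this constraint each unit increase in `phi`
approximately doubles the compute.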
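
The rule can be evaluated directly. This is a minimal sketch of the formula only: the function name and the returned dictionary are illustrative, and the published B1-B7 variants use hand-adjusted depth, width, and resolution settings, so these raw multipliers should not be read as the official configurations.

```python
ALPHA, BETA, GAMMA = 1.2, 1.1, 1.15  # coefficients quoted in the text


def compound_scale(phi):
    """Depth/width/resolution multipliers for compound coefficient `phi`."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
        # FLOPs scale roughly with depth * width^2 * resolution^2
        "flops": (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi,
    }


for phi in range(0, 4):
    s = compound_scale(phi)
    print(f"phi={phi}: depth x{s['depth']:.2f}, width x{s['width']:.2f}, "
          f"resolution x{s['resolution']:.2f}, FLOPs x{s['flops']:.2f}")
```

Because `alpha * beta^2 * gamma^2` is about 1.92 rather than exactly 2, the FLOPs multiplier is slightly below `2^phi`.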

### Building Blocks

EfficientNet uses **MBConv** (Mobile Inverted Bottleneck) blocks with:

1. **Expansion** -- 1x1 conv to expand channels (expansion ratio, e.g., 6x)
2. **Depthwise convolution** -- 3x3 or 5x5 depthwise conv (one filter per channel)
3. **Squeeze-and-Excitation (SE)** -- channel attention mechanism
4. **Projection** -- 1x1 conv to reduce channels back
5. **Skip connection** -- if input and output shapes match

```
Input -> 1x1 Expand -> DWConv -> SE -> 1x1 Project -> (+) -> Output
  |                                                     |
  +---------------- skip connection --------------------+
```
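
A weight count shows why the depthwise step makes the 6x expansion affordable. The sketch assumes a 32-channel block with expansion ratio 6 and a 3x3 kernel, and it leaves out batch norm, biases, and the SE stage; the variable names are illustrative.

```python
in_channels, out_channels = 32, 32   # illustrative block width (assumption)
expand_ratio, kernel_size = 6, 3
expanded = in_channels * expand_ratio

expand_weights = in_channels * expanded                    # 1x1 expansion
depthwise_weights = kernel_size * kernel_size * expanded   # one k x k filter per channel
project_weights = expanded * out_channels                  # 1x1 projection
mbconv_weights = expand_weights + depthwise_weights + project_weights

# For comparison: a standard 3x3 convolution at the expanded width
standard_conv_weights = kernel_size * kernel_size * expanded * expanded

print(f"1x1 expand      : {expand_weights:>8,}")
print(f"3x3 depthwise   : {depthwise_weights:>8,}")
print(f"1x1 project     : {project_weights:>8,}")
print(f"MBConv total    : {mbconv_weights:>8,}")
print(f"Standard 3x3 at {expanded} channels: {standard_conv_weights:,}")
```

The spatial filtering at 192 channels costs only 1,728 weights; almost all parameters sit in the two pointwise convolutions.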

In [None]:
class SqueezeExcitation(layers.Layer):
    """Squeeze-and-Excitation block: learns per-channel attention weights.
    
    1. Squeeze: Global Average Pooling reduces HxW to 1x1
    2. Excitation: Two FC layers learn channel importance
    3. Scale: Reweight original channels
    """

    def __init__(self, filters, ratio=16, **kwargs):
        super().__init__(**kwargs)
        self.filters = filters
        self.gap    = layers.GlobalAveragePooling2D()
        self.dense1 = layers.Dense(filters // ratio, activation='relu')
        self.dense2 = layers.Dense(filters, activation='sigmoid')

    def call(self, x):
        # Squeeze
        scale = self.gap(x)                      # (B, C)
        # Excitation
        scale = self.dense1(scale)                # (B, C//r)
        scale = self.dense2(scale)                # (B, C)
        # Scale
        scale = tf.reshape(scale, [-1, 1, 1, self.filters])  # (B, 1, 1, C)
        return x * scale


class MBConvBlock(layers.Layer):
    """Mobile Inverted Bottleneck Convolution block (MBConv).
    
    Used in EfficientNet. The inverted-bottleneck design comes from MobileNetV2,
    which does not include the Squeeze-and-Excitation stage.
    """

    def __init__(self, in_channels, out_channels, expand_ratio=6,
                 kernel_size=3, stride=1, se_ratio=4, **kwargs):
        super().__init__(**kwargs)
        self.use_residual = (stride == 1 and in_channels == out_channels)
        expanded = in_channels * expand_ratio

        block_layers = []

        # Expansion phase (skip if expand_ratio == 1)
        if expand_ratio != 1:
            block_layers += [
                layers.Conv2D(expanded, 1, use_bias=False, padding='same'),
                layers.BatchNormalization(),
                layers.Activation('swish'),
            ]

        # Depthwise convolution
        block_layers += [
            layers.DepthwiseConv2D(kernel_size, strides=stride,
                                   padding='same', use_bias=False),
            layers.BatchNormalization(),
            layers.Activation('swish'),
        ]

        self.pre_se = keras.Sequential(block_layers)

        # Squeeze-and-Excitation
        self.se = SqueezeExcitation(expanded, ratio=se_ratio)

        # Projection phase
        self.project = keras.Sequential([
            layers.Conv2D(out_channels, 1, use_bias=False, padding='same'),
            layers.BatchNormalization(),
        ])

    def call(self, x, training=False):
        residual = x
        x = self.pre_se(x, training=training)
        x = self.se(x)
        x = self.project(x, training=training)
        if self.use_residual:
            x = x + residual
        return x


# Test the MBConv block
inp = keras.Input(shape=(32, 32, 32))
out = MBConvBlock(32, 32, expand_ratio=6, kernel_size=3, stride=1, name='mbconv_test')(inp)
mbconv_model = Model(inp, out)
print(f"MBConv block -- Input: {inp.shape}, Output: {mbconv_model.output_shape}")
print(f"Parameters: {mbconv_model.count_params():,}")

## 8. ConvNeXt -- Modernizing CNNs with Transformer Ideas

### Core Insight

Liu et al. (2022) asked: _can a pure CNN match Vision Transformer performance if we
adopt the design choices that made transformers successful?_

Starting from a standard ResNet, they applied a series of modernizations:

| Modification | Inspiration | Effect |
|:------------|:------------|:-------|
| Patchify stem (4x4 stride-4 conv) | ViT patch embedding | Better initial downsampling |
| Inverted bottleneck (expand 4x) | Transformer FFN | More computation in high-dim space |
| Large kernel (7x7 depthwise) | ViT global attention / Swin 7x7 windows | Larger receptive field |
| LayerNorm instead of BatchNorm | Transformer convention | Better training stability |
| GELU activation | Transformer convention | Smoother non-linearity |
| Fewer activation functions | Transformer block design | Cleaner gradient flow |

The result: **ConvNeXt matches or exceeds Swin Transformer** at comparable model sizes
on ImageNet classification, while remaining a pure convolutional network.

### ConvNeXt Block Structure

```
Input -> 7x7 DepthwiseConv -> LayerNorm -> 1x1 Conv (4x expand) -> GELU -> 1x1 Conv (project) -> (+) -> Output
  |                                                                                                |
  +------------------------------- skip connection ------------------------------------------------+
```
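
Counting parameters by hand shows where the capacity of this block sits. The sketch below assumes biases on the depthwise and pointwise layers, a LayerNorm with scale and offset, and one learnable layer-scale value per channel; the function name is hypothetical.

```python
def convnext_block_params(dim, kernel_size=7, expansion=4):
    """Per-component parameter counts for one ConvNeXt-style block."""
    hidden = expansion * dim
    return {
        "depthwise_conv": kernel_size * kernel_size * dim + dim,  # weights + bias
        "layer_norm": 2 * dim,                                    # scale + offset
        "pointwise_expand": dim * hidden + hidden,
        "pointwise_project": hidden * dim + dim,
        "layer_scale": dim,
    }


parts = convnext_block_params(96)
total_params = sum(parts.values())
for name, count in parts.items():
    print(f"{name:<18}: {count:>7,}  ({100 * count / total_params:.1f}%)")
print(f"{'total':<18}: {total_params:>7,}")
```

Even with a 7x7 kernel, the depthwise convolution accounts for only a small share of the block; the two pointwise layers dominate, much like the MLP in a Transformer block.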

In [None]:
class ConvNeXtBlock(layers.Layer):
    """A single ConvNeXt block.
    
    Combines depthwise convolution, inverted bottleneck,
    LayerNorm, and GELU -- design principles borrowed from Transformers.
    """

    def __init__(self, dim, drop_rate=0.0, layer_scale_init=1e-6, **kwargs):
        super().__init__(**kwargs)
        self.dim = dim

        # Depthwise convolution with large 7x7 kernel
        self.dwconv = layers.DepthwiseConv2D(7, padding='same')
        self.norm   = layers.LayerNormalization(epsilon=1e-6)

        # Inverted bottleneck: expand -> GELU -> project
        self.pwconv1 = layers.Dense(4 * dim)       # Expand
        self.act     = layers.Activation('gelu')
        self.pwconv2 = layers.Dense(dim)            # Project

        # Layer Scale (learnable per-channel scaling)
        self.layer_scale_init = layer_scale_init
        self.drop_rate = drop_rate

    def build(self, input_shape):
        self.gamma = self.add_weight(
            name='layer_scale',
            shape=(self.dim,),
            initializer=tf.keras.initializers.Constant(self.layer_scale_init),
            trainable=True
        )
        super().build(input_shape)

    def call(self, x, training=False):
        shortcut = x

        x = self.dwconv(x)
        x = self.norm(x)
        x = self.pwconv1(x)
        x = self.act(x)
        x = self.pwconv2(x)
        x = self.gamma * x  # Layer scale

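        if self.drop_rate > 0.0 and training:
            # Stochastic depth (as in ConvNeXt): drop the whole residual branch
            # per sample rather than individual activations. noise_shape makes
            # one keep/drop decision per example and broadcasts it over H, W, C.
            x = tf.nn.dropout(x, rate=self.drop_rate,
                              noise_shape=[tf.shape(x)[0], 1, 1, 1])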

        return x + shortcut


# Test ConvNeXt block
inp = keras.Input(shape=(56, 56, 96))
out = ConvNeXtBlock(96, name='convnext_block')(inp)
demo = Model(inp, out)
print(f"ConvNeXt block -- Input: {inp.shape}, Output: {demo.output_shape}")
print(f"Parameters: {demo.count_params():,}")
print("\nNote: Compared to ResNet, ConvNeXt uses LayerNorm, GELU, and 7x7 depthwise conv.")

## 9. Vision Transformer (ViT) -- Self-Attention for Images

### Core Insight

Dosovitskiy et al. (2020) showed that a **pure Transformer** (no convolutions at all)
can achieve state-of-the-art image classification when pre-trained on large datasets.

### How ViT Works

1. **Patch Embedding**: Split the image into fixed-size patches (e.g., 16x16) and
   linearly project each patch to an embedding vector.
   - A 224x224 image with 16x16 patches yields 196 patch tokens.

2. **[CLS] Token**: Prepend a learnable classification token.

3. **Position Embedding**: Add learnable position embeddings (1D) to retain
   spatial information.

4. **Transformer Encoder**: Apply `L` standard Transformer blocks:
   ```
   LayerNorm -> Multi-Head Self-Attention -> Residual
   LayerNorm -> MLP (GELU) -> Residual
   ```

5. **Classification Head**: The final [CLS] token representation is passed through
   a linear classifier.
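
Steps 1 to 3 can be sketched in plain NumPy to make the tensor shapes explicit. This is an illustration rather than the notebook's Keras implementation: the weights are random, and the names `patchify`, `W_proj`, and `embed_dim = 192` are assumptions chosen for the example.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    H, W, C = image.shape
    assert H % patch_size == 0 and W % patch_size == 0
    gh, gw = H // patch_size, W // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, C)
    patches = patches.transpose(0, 2, 1, 3, 4)          # (gh, gw, P, P, C)
    return patches.reshape(gh * gw, patch_size * patch_size * C)


rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
embed_dim = 192  # illustrative embedding size (assumption)

tokens = patchify(image, 16)                             # step 1: (196, 768)
W_proj = rng.normal(scale=0.02, size=(tokens.shape[1], embed_dim))
cls_token = np.zeros((1, embed_dim))                     # step 2: [CLS] token
pos_embed = rng.normal(scale=0.02, size=(tokens.shape[0] + 1, embed_dim))

sequence = np.concatenate([cls_token, tokens @ W_proj], axis=0) + pos_embed  # step 3

print("flattened patches:", tokens.shape)
print("encoder input    :", sequence.shape)
```

The `Conv2D` with `kernel_size = strides = patch_size` used in `PatchEmbedding` computes the same patch-wise linear projection in a single operation.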

### Key Differences from CNNs

| Property | CNN | ViT |
|:---------|:----|:----|
| Inductive bias | Strong (locality, translation equiv.) | Weak (learns from data) |
| Receptive field | Grows with depth | Global from layer 1 |
| Data efficiency | Better with small data | Needs large pre-training data |
| Compute per layer | O(H*W*C^2*K^2), linear in pixels | O((H*W/P^2)^2 * D), quadratic in tokens |
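
To make the quadratic term concrete, the sketch below counts tokens and attention-matrix entries (per head, per layer) as the input resolution grows. The helper names are illustrative, and the [CLS] token is ignored for simplicity.

```python
def num_tokens(image_size, patch_size=16):
    """Number of patch tokens for a square image."""
    return (image_size // patch_size) ** 2


def attention_score_entries(image_size, patch_size=16):
    """Entries in the N x N attention matrix for one head in one layer."""
    n = num_tokens(image_size, patch_size)
    return n * n


for size in (224, 448, 896):
    print(f"{size}x{size}: {num_tokens(size):>5,} tokens, "
          f"{attention_score_entries(size):>12,} attention entries")
```

Doubling the side length quadruples the token count and multiplies the attention matrix by 16, whereas a convolution's cost grows only with the pixel count.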

In [None]:
class PatchEmbedding(layers.Layer):
    """Convert image to a sequence of patch embeddings.
    
    Uses a Conv2D with kernel_size=stride=patch_size for efficient extraction.
    """

    def __init__(self, patch_size, embed_dim, **kwargs):
        super().__init__(**kwargs)
        self.patch_size = patch_size
        self.embed_dim  = embed_dim
        self.projection = layers.Conv2D(
            embed_dim, kernel_size=patch_size, strides=patch_size
        )

    def call(self, x):
        # x: (B, H, W, C)
        x = self.projection(x)         # (B, H/P, W/P, embed_dim)
        B = tf.shape(x)[0]
        H, W, C = x.shape[1], x.shape[2], x.shape[3]
        x = tf.reshape(x, [B, H * W, C])  # (B, num_patches, embed_dim)
        return x

In [None]:
class TransformerBlock(layers.Layer):
    """Standard Transformer encoder block: MHSA + MLP with pre-norm."""

    def __init__(self, embed_dim, num_heads, mlp_dim, dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        self.norm1 = layers.LayerNormalization(epsilon=1e-6)
        self.attn  = layers.MultiHeadAttention(
            num_heads=num_heads,
            key_dim=embed_dim // num_heads,
            dropout=dropout
        )
        self.norm2 = layers.LayerNormalization(epsilon=1e-6)
        self.mlp = keras.Sequential([
            layers.Dense(mlp_dim, activation='gelu'),
            layers.Dropout(dropout),
            layers.Dense(embed_dim),
            layers.Dropout(dropout),
        ])

    def call(self, x, training=False):
        # Multi-Head Self-Attention with pre-norm and residual
        x_norm = self.norm1(x)
        attn_out = self.attn(x_norm, x_norm, training=training)
        x = x + attn_out

        # Feed-forward MLP with pre-norm and residual
        x_norm = self.norm2(x)
        mlp_out = self.mlp(x_norm, training=training)
        x = x + mlp_out

        return x

In [None]:
class VisionTransformer(Model):
    """Complete Vision Transformer (ViT) implementation.
    
    Args:
        image_size:  Input image dimension (assumes square).
        patch_size:  Size of each image patch.
        num_classes: Number of output classes.
        embed_dim:   Embedding dimension.
        depth:       Number of Transformer blocks.
        num_heads:   Number of attention heads.
        mlp_dim:     Hidden dim in the MLP.
        dropout:     Dropout rate.
    """

    def __init__(self, image_size=224, patch_size=16, num_classes=1000,
                 embed_dim=768, depth=12, num_heads=12, mlp_dim=3072,
                 dropout=0.1, **kwargs):
        super().__init__(**kwargs)
        num_patches = (image_size // patch_size) ** 2

        # Patch embedding
        self.patch_embed = PatchEmbedding(patch_size, embed_dim)

        # Learnable [CLS] token
        self.cls_token = self.add_weight(
            name='cls_token',
            shape=(1, 1, embed_dim),
            initializer='zeros',
            trainable=True
        )

        # Learnable position embeddings (for num_patches + 1 CLS token)
        self.pos_embed = self.add_weight(
            name='pos_embed',
            shape=(1, num_patches + 1, embed_dim),
            initializer='random_normal',
            trainable=True
        )

        self.pos_drop = layers.Dropout(dropout)

        # Transformer encoder blocks
        self.blocks = [
            TransformerBlock(embed_dim, num_heads, mlp_dim, dropout,
                             name=f'transformer_block_{i}')
            for i in range(depth)
        ]

        self.norm = layers.LayerNormalization(epsilon=1e-6)

        # Classification head
        self.classifier = layers.Dense(num_classes, name='head')

    def call(self, x, training=False):
        B = tf.shape(x)[0]

        # Create patch embeddings
        x = self.patch_embed(x)                        # (B, N, D)

        # Prepend [CLS] token
        cls_tokens = tf.broadcast_to(self.cls_token, [B, 1, x.shape[-1]])
        x = tf.concat([cls_tokens, x], axis=1)         # (B, N+1, D)

        # Add position embeddings
        x = x + self.pos_embed
        x = self.pos_drop(x, training=training)

        # Transformer encoder
        for block in self.blocks:
            x = block(x, training=training)

        # Classification on [CLS] token
        x = self.norm(x)
        cls_output = x[:, 0]                            # (B, D)
        return self.classifier(cls_output)


# Build ViT-Tiny for demonstration (smaller than ViT-B for faster testing)
vit_tiny = VisionTransformer(
    image_size=224, patch_size=16, num_classes=1000,
    embed_dim=192, depth=12, num_heads=3, mlp_dim=768,
    dropout=0.1, name='ViT_Tiny'
)

# Build the model by passing a dummy input
dummy_input = tf.random.normal((1, 224, 224, 3))
dummy_output = vit_tiny(dummy_input)

print(f"ViT-Tiny")
print(f"  Input shape     : (B, 224, 224, 3)")
print(f"  Num patches     : {(224//16)**2}")
print(f"  Embed dim       : 192")
print(f"  Transformer depth: 12")
print(f"  Output shape    : {dummy_output.shape}")
print(f"  Total parameters: {vit_tiny.count_params():,}")
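As a sanity check on the printed total, the parameter count of a standard ViT can be worked out by hand. The sketch below assumes the usual layout: a linear patch projection with bias, Q/K/V and output projections with biases, two LayerNorms per block, a two-layer MLP, and a final LayerNorm before the head. `vit_param_count` is a hypothetical helper; the notebook's own `PatchEmbedding` and `TransformerBlock` may deviate slightly from these assumptions, in which case the totals will differ.

```python
def vit_param_count(image_size=224, patch_size=16, channels=3, num_classes=1000,
                    embed_dim=768, depth=12, mlp_dim=3072):
    """Parameter count of a standard ViT classifier (assumed layout, see text)."""
    num_patches = (image_size // patch_size) ** 2

    # Linear projection of each flattened patch, plus bias
    patch_embed = patch_size * patch_size * channels * embed_dim + embed_dim
    cls_token = embed_dim
    pos_embed = (num_patches + 1) * embed_dim

    layer_norm = 2 * embed_dim                             # scale + shift
    attention = 4 * (embed_dim * embed_dim + embed_dim)    # Q, K, V and output projections
    mlp = (embed_dim * mlp_dim + mlp_dim) + (mlp_dim * embed_dim + embed_dim)
    block = 2 * layer_norm + attention + mlp

    # Final LayerNorm followed by the classification layer
    head = layer_norm + (embed_dim * num_classes + num_classes)
    return patch_embed + cls_token + pos_embed + depth * block + head


vit_tiny_params = vit_param_count(embed_dim=192, mlp_dim=768)
vit_base_params = vit_param_count(embed_dim=768, mlp_dim=3072)
print(f"ViT-Tiny: {vit_tiny_params:,}")
print(f"ViT-Base: {vit_base_params:,}")
```

The number of heads does not appear in the formula: splitting `embed_dim` across heads leaves the projection sizes unchanged.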

## 10. Swin Transformer -- Shifted Windows for Efficient Attention

### Problem with Standard ViT

Standard ViT computes **global** self-attention over all patches, so compute and
memory grow as **O(N^2)** in the number of patches N. Because N itself grows with the
square of the image side length, high-resolution inputs quickly become prohibitively expensive.
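To make the growth concrete, the following sketch counts tokens and attention-matrix entries for a ViT with 16x16 patches at a few input resolutions. `global_attention_cost` is an illustrative helper (it ignores the extra [CLS] token), not part of the notebook's model code.

```python
def global_attention_cost(image_size, patch_size=16):
    """Return (num_tokens, attention_entries) for one head of global self-attention."""
    num_tokens = (image_size // patch_size) ** 2
    return num_tokens, num_tokens ** 2   # the attention matrix is N x N


for size in (224, 448, 896):
    n, entries = global_attention_cost(size)
    print(f"{size}x{size}: N = {n:5d} tokens, N^2 = {entries:,} attention entries per head")
```

Doubling the side length quadruples N and multiplies the size of the attention matrix by 16.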

### Swin Transformer Solution (Liu et al., 2021)

1. **Window-based attention**: Divide the feature map into non-overlapping
   local windows (e.g., 7x7 patches) and compute attention **within** each window.
   With a fixed window size M, the cost drops from O(N^2) to O(M^2 N), which is
   linear in the number of patches N.

2. **Shifted windows**: In alternating layers, shift the window partition by
   half the window size. This creates cross-window connections without global attention.

```
Layer L:     Regular windows         Layer L+1:   Shifted windows
+---+---+---+---+                    +--+----+--+
|   |   |   |   |                    |  |    |  |
+---+---+---+---+     shift by       +--+----+--+
|   |   |   |   |  -> (M/2, M/2) -> |  |    |  |
+---+---+---+---+                    +--+----+--+
|   |   |   |   |                    |  |    |  |
+---+---+---+---+                    +--+----+--+
```

3. **Hierarchical feature maps**: Unlike ViT (which maintains a single resolution),
   Swin uses **patch merging** layers to progressively reduce spatial resolution
   and increase channels -- similar to a CNN feature pyramid.
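The first two ideas can be sketched with plain array operations. The example below uses NumPy on a toy 8x8 map with 4x4 windows; `window_partition` and `cyclic_shift` are illustrative names. It is a simplified sketch: the original Swin implements the shift as a cyclic roll and additionally masks attention between tokens that were wrapped around from opposite edges, which is omitted here.

```python
import numpy as np


def window_partition(x, window_size):
    """Split a (B, H, W, C) map into windows of shape (B * num_windows, M * M, C)."""
    B, H, W, C = x.shape
    M = window_size
    x = x.reshape(B, H // M, M, W // M, M, C)
    x = x.transpose(0, 1, 3, 2, 4, 5)            # (B, H/M, W/M, M, M, C)
    return x.reshape(-1, M * M, C)


def cyclic_shift(x, window_size):
    """Roll the map up and left by half a window before partitioning."""
    s = window_size // 2
    return np.roll(x, shift=(-s, -s), axis=(1, 2))


# An 8x8 map whose single channel stores each token's flat index
H = W = 8
M = 4
tokens = np.arange(H * W).reshape(1, H, W, 1)

regular = window_partition(tokens, M)                     # layer L
shifted = window_partition(cyclic_shift(tokens, M), M)    # layer L+1

# Which regular window did each token of the first shifted window come from?
flat = shifted[0, :, 0]
window_ids = ((flat // W) // M) * (W // M) + (flat % W) // M
source_windows = sorted(set(window_ids.tolist()))
print("windows:", regular.shape)
print("first shifted window mixes regular windows", source_windows)
```

The first shifted window contains tokens from all four regular windows, which is how information crosses window boundaries without global attention.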

### Architecture Overview

```
Image -> Patch Partition (4x4) -> Stage 1: Swin Blocks
      -> Patch Merging (2x down) -> Stage 2: Swin Blocks
      -> Patch Merging (2x down) -> Stage 3: Swin Blocks
      -> Patch Merging (2x down) -> Stage 4: Swin Blocks
      -> Global Average Pool -> Classifier
```
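The patch-merging step in this pipeline can be illustrated with a few lines of NumPy. This is a minimal sketch under stated assumptions: `patch_merging` is an illustrative name, a random matrix stands in for the learned 4C -> 2C linear layer, and the LayerNorm that the original applies before that layer is omitted. The starting shape (56x56 tokens, C = 96) corresponds to a Swin-T-sized model on a 224x224 input with 4x4 patches.

```python
import numpy as np


def patch_merging(x, weight):
    """Merge each 2x2 neighbourhood: (B, H, W, C) -> (B, H/2, W/2, 2C).

    `weight` has shape (4C, 2C) and stands in for the learned linear reduction.
    """
    x0 = x[:, 0::2, 0::2, :]   # top-left token of every 2x2 group
    x1 = x[:, 1::2, 0::2, :]   # bottom-left
    x2 = x[:, 0::2, 1::2, :]   # top-right
    x3 = x[:, 1::2, 1::2, :]   # bottom-right
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)   # (B, H/2, W/2, 4C)
    return merged @ weight


rng = np.random.default_rng(0)
x = rng.standard_normal((1, 56, 56, 96))    # stage 1: 224 / 4 = 56 tokens per side
stage_shapes = [x.shape]
for _ in range(3):                           # three merging layers between the four stages
    c = x.shape[-1]
    x = patch_merging(x, rng.standard_normal((4 * c, 2 * c)) * 0.02)
    stage_shapes.append(x.shape)

for i, shape in enumerate(stage_shapes, start=1):
    print(f"Stage {i}: {shape}")
```

Each merge halves the spatial resolution and doubles the channel count, giving the same 4x / 8x / 16x / 32x stride pyramid that CNN backbones expose to detection and segmentation heads.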

### Comparison with Standard ViT

| Property | ViT | Swin Transformer |
|:---------|:----|:-----------------|
| Attention scope | Global (all patches) | Local windows + shifted windows |
| Complexity | O(N^2) | O(N) -- linear |
| Feature hierarchy | Single-scale | Multi-scale (like FPN) |
| Dense prediction | Difficult (needs adaptation) | Natural fit |
| Pre-training data needed | Very large (JFT-300M) | ImageNet-1K sufficient |
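The complexity row can be quantified with the per-layer cost formulas given in the Swin paper (softmax cost omitted): global attention costs 4hwC^2 + 2(hw)^2 C, while window attention with M x M windows costs 4hwC^2 + 2 M^2 hw C. The helper names below are illustrative, and the example values (56x56 tokens, C = 96, M = 7) correspond to the first stage of a Swin-T-sized model.

```python
def msa_flops(h, w, c):
    """Global multi-head self-attention on an h x w token map with c channels."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def wmsa_flops(h, w, c, window_size):
    """Window attention: the quadratic term (hw)^2 is replaced by M^2 * hw."""
    return 4 * h * w * c ** 2 + 2 * window_size ** 2 * h * w * c


h = w = 56
c = 96
M = 7
global_cost = msa_flops(h, w, c)
window_cost = wmsa_flops(h, w, c, M)
print(f"global attention : {global_cost / 1e9:.2f} GFLOPs")
print(f"window attention : {window_cost / 1e9:.2f} GFLOPs")
print(f"ratio            : {global_cost / window_cost:.1f}x")
```

The projection term 4hwC^2 is identical in both cases; only the token-to-token term changes, which is why the saving grows with resolution.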

### Why Swin Matters

- First Transformer backbone that works well for **both** classification and
  dense tasks (detection, segmentation)
- Linear complexity makes it practical for high-resolution inputs
- Swin-T (29M params) achieves 81.3% Top-1 on ImageNet

In [None]:
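class WindowAttention(layers.Layer):
    """Window-based multi-head self-attention (W-MSA).
    
    Computes self-attention within local windows for efficiency.
    Simplified for illustration: omits the relative position bias and the
    attention mask that the original Swin uses for shifted windows.
    """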

    def __init__(self, dim, window_size, num_heads, **kwargs):
        super().__init__(**kwargs)
        self.dim = dim
        self.window_size = window_size
        self.num_heads = num_heads
        head_dim = dim // num_heads
        self.scale = head_dim ** -0.5

        self.qkv = layers.Dense(dim * 3, use_bias=True)
        self.proj = layers.Dense(dim)

    def call(self, x, training=False):
        B_windows, N, C = tf.shape(x)[0], x.shape[1], x.shape[2]

        # Compute Q, K, V
        qkv = self.qkv(x)                                          # (B*nW, N, 3*C)
        qkv = tf.reshape(qkv, [B_windows, N, 3, self.num_heads, C // self.num_heads])
        qkv = tf.transpose(qkv, perm=[2, 0, 3, 1, 4])             # (3, B*nW, nH, N, head_dim)
        q, k, v = qkv[0], qkv[1], qkv[2]

        # Scaled dot-product attention
        attn = tf.matmul(q, k, transpose_b=True) * self.scale      # (B*nW, nH, N, N)
        attn = tf.nn.softmax(attn, axis=-1)

        x = tf.matmul(attn, v)                                     # (B*nW, nH, N, head_dim)
        x = tf.transpose(x, perm=[0, 2, 1, 3])                    # (B*nW, N, nH, head_dim)
        x = tf.reshape(x, [B_windows, N, C])                      # (B*nW, N, C)
        x = self.proj(x)
        return x


# Demonstrate window attention
# Simulating 4 windows, each 7x7 = 49 tokens, with embed_dim=96
test_input = tf.random.normal((4, 49, 96))
window_attn = WindowAttention(dim=96, window_size=7, num_heads=3)
test_output = window_attn(test_input)
print(f"Window Attention")
print(f"  Input  : {test_input.shape}  (4 windows, 49 tokens each, dim=96)")
print(f"  Output : {test_output.shape}")
print(f"  Each window attends only to its own 49 tokens -- O(49^2) per window, not O(196^2) globally.")

## 11. Architecture Comparison and Selection Guide

### Comprehensive Comparison

| Architecture | Params | FLOPs (G) | Top-1 (%) | Inference (ms)* | Key Strength |
|:------------|:------:|:---------:|:---------:|:---------------:|:-------------|
| MobileNetV2 | 3.4 M | 0.3 | 72.0 | 5 | Mobile / edge deployment |
| EfficientNet-B0 | 5.3 M | 0.4 | 77.1 | 8 | Best accuracy/param trade-off |
| ResNet-50 | 25.6 M | 4.1 | 76.0 | 12 | Reliable baseline, well-studied |
| DenseNet-121 | 8.0 M | 2.9 | 74.9 | 15 | Parameter efficiency |
| ConvNeXt-T | 29 M | 4.5 | 82.1 | 14 | Modern CNN, strong all-around |
| ViT-B/16 | 86 M | 17.6 | 77.9** | 18 | Scales well with data |
| Swin-T | 29 M | 4.5 | 81.3 | 16 | Multi-scale, detection-friendly |
| EfficientNet-B7 | 66 M | 37 | 84.3 | 45 | Maximum accuracy (single model) |

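\* Approximate GPU inference time for 224x224 input (batch=1, V100). EfficientNet-B7's FLOPs and accuracy are reported at its native 600x600 resolution.  
\** ViT-B/16 reaches roughly 84% when pre-trained on larger datasets (ImageNet-21k or JFT-300M).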

### Decision Matrix: When to Use Which Architecture

| Scenario | Recommended | Rationale |
|:---------|:------------|:----------|
| **Mobile / Edge** (< 5M params) | MobileNetV2, EfficientNet-B0 | Low latency, small model size |
| **General baseline** | ResNet-50 | Well-understood, enormous ecosystem |
| **Maximum accuracy (medium budget)** | ConvNeXt-T/S or Swin-T/S | Best accuracy at ~30M params |
| **Maximum accuracy (large budget)** | EfficientNet-B5-B7 or Swin-L | Push the accuracy frontier |
| **Object detection / Segmentation** | Swin-T or ConvNeXt-T as backbone | Hierarchical features essential |
| **Transfer learning (small dataset)** | ResNet-50 or EfficientNet-B0 | Strong inductive bias helps |
| **Transfer learning (large dataset)** | ViT-B or Swin-B | Self-attention excels with data |
| **Research / Custom tasks** | Start with ResNet-50, then ConvNeXt | Easy to modify, well-documented |

### Cost-Benefit Analysis for Production

```
Accuracy
  ^      *EfficientNet-B7
  |                    *Swin-L
  |            *ConvNeXt-T   *Swin-T
  |     *EfficientNet-B0
  |          *ResNet-50
  |   *MobileNetV2
  +--------------------------------------> Cost (FLOPs / Latency)
```

**Key takeaway:** The "sweet spot" for most production applications is
**EfficientNet-B0 to B3** or **ConvNeXt-Tiny/Small**. Going beyond B3 / Small
gives diminishing returns for the added compute cost.
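One way to make the cost-benefit picture more precise is to compute which models are Pareto-optimal, meaning no other model is both cheaper and more accurate. The sketch below uses the FLOPs and Top-1 figures copied from the comparison table in this section; `pareto_front` is an illustrative helper, and the result only reflects these two axes (it ignores latency, memory, and ecosystem support).

```python
# (name, GFLOPs, ImageNet top-1 %) copied from the comparison table above
models = [
    ("MobileNetV2", 0.3, 72.0),
    ("EfficientNet-B0", 0.4, 77.1),
    ("ResNet-50", 4.1, 76.0),
    ("DenseNet-121", 2.9, 74.9),
    ("ConvNeXt-T", 4.5, 82.1),
    ("ViT-B/16", 17.6, 77.9),
    ("Swin-T", 4.5, 81.3),
    ("EfficientNet-B7", 37.0, 84.3),
]


def pareto_front(entries):
    """Return the names of models not dominated on (lower FLOPs, higher accuracy)."""
    front = []
    best_acc = float("-inf")
    # Walk from cheapest to most expensive; at equal cost, the more accurate model first
    for name, flops, acc in sorted(entries, key=lambda e: (e[1], -e[2])):
        if acc > best_acc:
            front.append(name)
            best_acc = acc
    return front


front = pareto_front(models)
print("Pareto-optimal on (FLOPs, Top-1):", front)
```

Models that drop off the frontier can still be the right choice for other reasons, such as tooling maturity (ResNet-50) or hierarchical features for dense prediction (Swin-T).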

## 12. Practical: Benchmark Pre-Trained Architectures

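We use `tf.keras.applications` to instantiate standard architectures and compare their
parameter counts, output shapes, and inference time on the same input. The cells below
use randomly initialized weights (`weights=None`), which does not affect timing; pass
`weights='imagenet'` to load the pre-trained weights.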

In [None]:
# Instantiate the architectures as feature extractors (random weights, see note below)
models_to_compare = {
    'MobileNetV2': tf.keras.applications.MobileNetV2(
        weights=None, include_top=False, input_shape=(224, 224, 3)
    ),
    'ResNet50': tf.keras.applications.ResNet50(
        weights=None, include_top=False, input_shape=(224, 224, 3)
    ),
    'EfficientNetB0': tf.keras.applications.EfficientNetB0(
        weights=None, include_top=False, input_shape=(224, 224, 3)
    ),
}

print("Models loaded (without pre-trained weights for environment compatibility).")
print("In production, set weights='imagenet' for pre-trained weights.\n")

for name, model in models_to_compare.items():
    print(f"{name:20s} -- Params: {model.count_params():>10,}   "
          f"Output shape: {model.output_shape}")

In [None]:
# Benchmark inference time
dummy_batch = tf.random.normal((8, 224, 224, 3))

results = {}
for name, model in models_to_compare.items():
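    # Warm-up run (one-time setup cost, excluded from timing)
    _ = model(dummy_batch, training=False).numpy()
    
    # Timed runs; .numpy() forces any asynchronous GPU work to finish
    times = []
    for _ in range(5):
        start = time.perf_counter()
        _ = model(dummy_batch, training=False).numpy()
        times.append(time.perf_counter() - start)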
    
    avg_ms = np.mean(times) * 1000
    std_ms = np.std(times) * 1000
    results[name] = {'avg_ms': avg_ms, 'std_ms': std_ms,
                     'params': model.count_params()}
    print(f"{name:20s} -- Avg: {avg_ms:7.1f} ms (+/- {std_ms:.1f} ms) for batch of 8")

print("\nNote: Times depend heavily on hardware. GPU vs CPU makes a large difference.")
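For more stable numbers, a small reusable timing helper can separate warm-up from measurement and report the median rather than the mean, since the median is less sensitive to occasional slow runs. The sketch below is framework-agnostic; `benchmark` is a hypothetical helper, and the workload is a stand-in so the snippet runs anywhere.

```python
import statistics
import time


def benchmark(fn, warmup=2, repeats=10):
    """Time `fn()` and return (median_ms, min_ms) over `repeats` calls after `warmup` calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples), min(samples)


# Stand-in workload. With a Keras model you might pass something like
# `lambda: model(dummy_batch, training=False).numpy()` so that the call
# has finished before the timer stops.
median_ms, min_ms = benchmark(lambda: sum(i * i for i in range(10_000)))
print(f"median {median_ms:.3f} ms, min {min_ms:.3f} ms")
```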

In [None]:
# Visualization: Parameters vs Inference Time
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

names = list(results.keys())
params = [results[n]['params'] / 1e6 for n in names]
times  = [results[n]['avg_ms'] for n in names]

colors = ['#2ecc71', '#3498db', '#e74c3c']

# Bar chart: parameter counts
bars1 = axes[0].bar(names, params, color=colors, edgecolor='black', linewidth=0.5)
axes[0].set_ylabel('Parameters (Millions)', fontsize=12)
axes[0].set_title('Model Size Comparison', fontsize=13, fontweight='bold')
for bar, p in zip(bars1, params):
    axes[0].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.3,
                 f'{p:.1f}M', ha='center', fontsize=11, fontweight='bold')

# Bar chart: inference time
bars2 = axes[1].bar(names, times, color=colors, edgecolor='black', linewidth=0.5)
axes[1].set_ylabel('Inference Time (ms, batch=8)', fontsize=12)
axes[1].set_title('Inference Speed Comparison', fontsize=13, fontweight='bold')
for bar, t in zip(bars2, times):
    axes[1].text(bar.get_x() + bar.get_width()/2., bar.get_height() + 0.5,
                 f'{t:.1f}ms', ha='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.show()

print("\nTypical observations (check them against the numbers measured above):")
print("- MobileNetV2 has the fewest parameters and is usually the fastest")
print("- ResNet50 is the largest of the three but benefits from highly optimized implementations")
print("- EfficientNetB0 reaches higher ImageNet accuracy than ResNet50 with far fewer params (see Section 11)")

## 13. Exercises

### Exercise 1: Design a Lightweight Architecture (< 1M Parameters)

**Goal:** Build a custom CNN for CIFAR-10 that achieves > 90% test accuracy
with fewer than 1 million parameters.

**Hints:**
- Use depthwise separable convolutions (MobileNet-style) to reduce params
- Use Global Average Pooling instead of Flatten + Dense
- Consider using SE blocks for better feature selection
- Batch Normalization + data augmentation are essential
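To see why the first hint matters for the parameter budget, compare the weight count of a standard convolution with that of a depthwise separable one. The helper names below are illustrative, and biases are left out (matching `use_bias=False` followed by Batch Normalization).

```python
def conv_params(k, c_in, c_out):
    """Weights in a standard k x k convolution."""
    return k * k * c_in * c_out


def separable_conv_params(k, c_in, c_out):
    """Weights in a depthwise k x k convolution followed by a 1x1 pointwise convolution."""
    depthwise = k * k * c_in        # one k x k filter per input channel
    pointwise = c_in * c_out        # 1x1 conv that mixes channels
    return depthwise + pointwise


for c_in, c_out in [(32, 64), (64, 128), (128, 256)]:
    std = conv_params(3, c_in, c_out)
    sep = separable_conv_params(3, c_in, c_out)
    print(f"{c_in:3d} -> {c_out:3d}: standard {std:7,}  separable {sep:6,}  ({std / sep:.1f}x fewer)")
```

For 3x3 kernels the saving approaches 9x as the channel count grows, which is what makes a sub-1M-parameter budget realistic.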

In [None]:
# Exercise 1 - Starter code
def build_lightweight_model(input_shape=(32, 32, 3), num_classes=10):
    """Design your lightweight architecture here.
    
    Target: < 1M parameters, > 90% accuracy on CIFAR-10.
    """
    inputs = keras.Input(shape=input_shape)

    # ------ YOUR ARCHITECTURE HERE ------
    # Suggestion: Start with a small conv stem, then stack depthwise separable
    # blocks with increasing channels and occasional stride-2 downsampling.

    # Example stem
    x = layers.Conv2D(32, 3, padding='same', use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)

    # Depthwise separable block (repeat and modify)
    for filters in [64, 128, 256]:
        x = layers.SeparableConv2D(filters, 3, padding='same', use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.MaxPooling2D(2)(x)

    # Head
    x = layers.GlobalAveragePooling2D()(x)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation='softmax')(x)

    return Model(inputs, outputs, name='LightweightCNN')


light_model = build_lightweight_model()
param_count = light_model.count_params()
print(f"Parameters: {param_count:,}")
print(f"Under 1M?  {'YES' if param_count < 1_000_000 else 'NO -- reduce model size!'}")

### Exercise 2: Hybrid CNN-Transformer

**Goal:** Build a model that uses a CNN backbone for local feature extraction
followed by a Transformer encoder for global reasoning.

**Architecture:**
```
Image -> CNN Stem (downsample to 7x7 feature map)
      -> Reshape to sequence of 49 tokens
      -> Add position embeddings
      -> Transformer Encoder (2-4 blocks)
      -> Global Average Pool over tokens
      -> Classification Head
```

**Hints:**
- Use a small ResNet or ConvNeXt as the CNN stem
- The CNN handles local patterns; the Transformer handles global relationships
- Related hybrid designs include the ResNet + ViT hybrid from the original ViT paper, CoAtNet, and CvT
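Before writing the stem, it helps to check that the chosen strides really produce the 7x7 map (49 tokens) shown in the diagram. The sketch below assumes `'same'` padding throughout, for which each layer outputs `ceil(size / stride)`; the stride list is an assumed layout (a stride-2 conv, a stride-2 pool, then three stride-2 stages) and `output_size` is an illustrative helper.

```python
import math


def output_size(size, strides):
    """Spatial size after a chain of 'same'-padded layers with the given strides."""
    for s in strides:
        size = math.ceil(size / s)
    return size


# Assumed stem layout: stride-2 conv, stride-2 max-pool, three stride-2 conv stages
stem_strides = [2, 2, 2, 2, 2]
side = output_size(224, stem_strides)
num_tokens = side * side
print(f"224x224 input -> {side}x{side} feature map -> {num_tokens} tokens")
```

Removing one stride-2 stage would give a 14x14 map and 196 tokens, which quadruples the sequence length seen by the Transformer blocks.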

In [None]:
# Exercise 2 - Starter code
class AddPositionEmbedding(layers.Layer):
    """Adds a learnable position embedding to a (B, N, D) token sequence."""

    def build(self, input_shape):
        # Created via add_weight so Keras tracks and trains it with the model
        self.pos_embed = self.add_weight(
            name='pos_embed',
            shape=(1, input_shape[1], input_shape[2]),
            initializer=keras.initializers.RandomNormal(stddev=0.02),
            trainable=True
        )

    def call(self, x):
        return x + self.pos_embed


def build_hybrid_model(input_shape=(224, 224, 3), num_classes=1000,
                       embed_dim=256, num_heads=8, num_blocks=4):
    """Build a hybrid CNN-Transformer model."""
    inputs = keras.Input(shape=input_shape)

    # CNN Stem -- downsample aggressively
    x = layers.Conv2D(64, 7, strides=2, padding='same', use_bias=False)(inputs)
    x = layers.BatchNormalization()(x)
    x = layers.Activation('relu')(x)
    x = layers.MaxPooling2D(3, strides=2, padding='same')(x)

    # Plain conv blocks, each halving the resolution (add skip connections
    # here if you want true residual blocks)
    for filters in [64, 128, embed_dim]:
        x = layers.Conv2D(filters, 3, padding='same', use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)
        x = layers.Conv2D(filters, 3, strides=2, padding='same', use_bias=False)(x)
        x = layers.BatchNormalization()(x)
        x = layers.Activation('relu')(x)

    # Reshape to sequence: (B, H*W, C)
    H, W, C = x.shape[1], x.shape[2], x.shape[3]
    x = layers.Reshape((H * W, C))(x)   # (B, num_tokens, embed_dim)

    # Add learnable position embeddings (a layer, so the weight is part of the model)
    x = AddPositionEmbedding(name='hybrid_pos_embed')(x)

    # Transformer Encoder blocks
    for i in range(num_blocks):
        x = TransformerBlock(
            embed_dim, num_heads, mlp_dim=embed_dim * 4,
            dropout=0.1, name=f'hybrid_block_{i}'
        )(x)

    # Global average pool over tokens
    x = layers.LayerNormalization(epsilon=1e-6)(x)
    x = layers.GlobalAveragePooling1D()(x)   # (B, embed_dim)

    outputs = layers.Dense(num_classes, activation='softmax')(x)

    return Model(inputs, outputs, name='Hybrid_CNN_Transformer')


hybrid = build_hybrid_model(input_shape=(224, 224, 3), num_classes=100,
                            embed_dim=256, num_heads=8, num_blocks=4)
dummy = tf.random.normal((1, 224, 224, 3))
out = hybrid(dummy)
print(f"Hybrid CNN-Transformer")
print(f"  Output shape: {out.shape}")
print(f"  Parameters  : {hybrid.count_params():,}")

### Exercise 3: Architecture Ablation Study

**Goal:** Systematically study the effect of individual design choices.

Pick a baseline architecture (e.g., a simple ResNet-18) and measure the effect
of each modification independently:

| Experiment | Modification |
|:-----------|:-------------|
| Baseline | Plain ResNet-18 |
| + SE blocks | Add Squeeze-Excitation to each residual block |
| + Stochastic depth | Randomly drop entire residual blocks during training |
| + Large kernel (7x7) | Replace 3x3 conv with 7x7 depthwise conv |
| + GELU | Replace ReLU with GELU |
| + LayerNorm | Replace BatchNorm with LayerNorm |
| All combined | Apply all modifications (moves toward a ConvNeXt-style design) |
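Of the modifications in the table, stochastic depth is the least self-explanatory, so here is a minimal NumPy sketch of the idea. It assumes the common variant that rescales the surviving branch by `1 / survival_prob` during training (the original paper instead scales the branch at test time); `stochastic_depth` is an illustrative function, not a Keras API.

```python
import numpy as np


def stochastic_depth(shortcut, residual, survival_prob, training, rng):
    """Add a residual branch to its shortcut, randomly dropping the branch per sample."""
    if not training or survival_prob >= 1.0:
        return shortcut + residual
    # One Bernoulli draw per sample, broadcast over the remaining axes
    mask_shape = (shortcut.shape[0],) + (1,) * (shortcut.ndim - 1)
    keep = rng.random(mask_shape) < survival_prob
    # Rescale kept branches so the expected output matches inference
    return shortcut + residual * keep / survival_prob


rng = np.random.default_rng(0)
shortcut = np.zeros((8, 4, 4, 16))
residual = np.ones((8, 4, 4, 16))

train_out = stochastic_depth(shortcut, residual, survival_prob=0.8, training=True, rng=rng)
eval_out = stochastic_depth(shortcut, residual, survival_prob=0.8, training=False, rng=rng)
print("per-sample branch scale in training:", train_out[:, 0, 0, 0])
print("branch scale at inference          :", eval_out[0, 0, 0, 0])
```

During training each sample either skips the block entirely (scale 0) or sees the branch scaled by 1 / 0.8 = 1.25; at inference every block is always active.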

**Suggested dataset:** CIFAR-10 or CIFAR-100 (fast iteration).

**What to measure:**
- Final test accuracy
- Training convergence speed (epochs to reach X% accuracy)
- Parameter count and inference time
- Training stability (variance across 3 seeds)

In [None]:
# Exercise 3 - Ablation study template

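# Define your experiments as a dictionary (one entry per row of the table above)
experiments = {
    'baseline':           {'se': False, 'stochastic_depth': False, 'large_kernel': False, 'activation': 'relu', 'norm': 'batch'},
    '+ SE blocks':        {'se': True,  'stochastic_depth': False, 'large_kernel': False, 'activation': 'relu', 'norm': 'batch'},
    '+ Stochastic depth': {'se': False, 'stochastic_depth': True,  'large_kernel': False, 'activation': 'relu', 'norm': 'batch'},
    '+ Large kernel':     {'se': False, 'stochastic_depth': False, 'large_kernel': True,  'activation': 'relu', 'norm': 'batch'},
    '+ GELU':             {'se': False, 'stochastic_depth': False, 'large_kernel': False, 'activation': 'gelu', 'norm': 'batch'},
    '+ LayerNorm':        {'se': False, 'stochastic_depth': False, 'large_kernel': False, 'activation': 'relu', 'norm': 'layer'},
    'All combined':       {'se': True,  'stochastic_depth': True,  'large_kernel': True,  'activation': 'gelu', 'norm': 'layer'},
}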

print("Ablation Study Experiments:")
print("=" * 80)
for name, config in experiments.items():
    print(f"  {name:20s} -> {config}")

print("\nTo run the study:")
print("  1. Implement a configurable build_model(config) function")
print("  2. Train each variant for N epochs on CIFAR-10/100")
print("  3. Log accuracy, loss, and timing metrics")
print("  4. Plot comparative learning curves")
print("  5. Repeat with 3 different seeds to estimate run-to-run variance")

## Summary and Key Takeaways

### What We Covered

1. **VGG** -- Proved that uniform 3x3 convolutions scale to deep networks, but
   hit the parameter wall with fully-connected layers.

2. **ResNet** -- Solved the degradation problem with skip connections, enabling
   training of 100+ layer networks and becoming one of the most widely cited and
   reused architectures in deep learning.

3. **Inception / GoogLeNet** -- Introduced multi-scale parallel processing and
   1x1 convolution bottlenecks, matching VGG-16's accuracy with roughly 20x fewer
   parameters (6.8 M vs 138 M).

4. **DenseNet** -- Maximized feature reuse through dense connections, achieving
   strong performance with compact models.

5. **EfficientNet** -- Showed that principled compound scaling of width, depth,
   and resolution is better than ad-hoc scaling.

6. **ConvNeXt** -- Demonstrated that pure CNNs can match Transformers when
   modernized with Transformer design principles.

7. **Vision Transformer (ViT)** -- Proved that pure self-attention works for
   vision, especially at scale.

8. **Swin Transformer** -- Made Transformers practical for vision with linear
   complexity and hierarchical features.

### Design Principles That Stood the Test of Time

| Principle | First Appeared | Still Used |
|:----------|:--------------|:-----------|
| Skip / residual connections | ResNet (2015) | Everywhere |
| Batch Normalization | Inception v2 (2015) | Most CNNs |
| 1x1 convolution bottlenecks | GoogLeNet (2014) | All modern CNNs |
| Global Average Pooling | NIN (2013) | Replaced FC layers |
| Depthwise separable convolutions | Xception / MobileNet (2016-2017) | Efficient models |
| Squeeze-and-Excitation | SENet (2017) | EfficientNet, MobileNetV3 |
| Multi-head self-attention | Transformer (2017) | ViT, Swin, hybrids |
| Layer Scale | CaiT (2021) | ConvNeXt, modern ViTs |

### Recommended Reading

- Simonyan & Zisserman, "Very Deep Convolutional Networks" (VGG), 2014
- He et al., "Deep Residual Learning for Image Recognition" (ResNet), 2015
- Szegedy et al., "Going Deeper with Convolutions" (GoogLeNet), 2014
- Huang et al., "Densely Connected Convolutional Networks" (DenseNet), 2017
- Tan & Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks" (EfficientNet), 2019
- Liu et al., "A ConvNet for the 2020s" (ConvNeXt), 2022
- Dosovitskiy et al., "An Image is Worth 16x16 Words" (ViT), 2020
- Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows" (Swin), 2021

---

*End of notebook.*