# 5. Network in Network (NiN)

In [2]:
import torch
import torch.nn as nn

import numpy as np
import pandas as pd

LeNet, AlexNet, and VGG all share a common design pattern: **extract features exploiting spatial structure** via a sequence of convolutional and pooling layers, then **post-process the representations** via fully connected layers.

The **improvements** upon LeNet by AlexNet and VGG mainly lie in how these later networks **widen and deepen** these two modules.

However, these networks pose two major **challenges**:

1. The fully connected layers at the end consume a **tremendous number of parameters**.
2. Fully connected layers **cannot be added earlier** in the network to increase nonlinearity, because flattening the feature maps would destroy the spatial structure.
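To put the first challenge in numbers, the sketch below counts the learnable parameters of a single fully connected layer and of a single convolutional layer. The sizes are assumptions chosen to resemble an AlexNet-style head (a 256-channel $6 \times 6$ feature map flattened into 4096 units); `count_params` is a helper introduced here for illustration.

```python
import torch.nn as nn

def count_params(module):
    """Total number of learnable parameters in a module."""
    return sum(p.numel() for p in module.parameters())

# Assumed AlexNet-style sizes: a 256 x 6 x 6 feature map flattened into 4096 units.
fc = nn.Linear(256 * 6 * 6, 4096)
# For comparison: a 3x3 convolution mapping 256 channels to 256 channels.
conv = nn.Conv2d(256, 256, kernel_size=3, padding=1)

fc_params = count_params(fc)      # 9216 * 4096 weights + 4096 biases
conv_params = count_params(conv)  # 256 * 256 * 9 weights + 256 biases
print(fc_params)    # 37752832
print(conv_params)  # 590080
```

Under these assumed sizes, the single fully connected layer holds roughly 64 times as many parameters as the convolutional layer.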

## NiN Blocks

The idea behind **`NiN`** is to apply a **fully connected layer** at each **pixel location**.

NiN adopts independent **$1 \times 1$ convolution layers** to act as fully connected layers (with ReLU activations) connecting the **pixels** at the **same location** across all the **channels**.
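The equivalence between a $1 \times 1$ convolution and a per-pixel fully connected layer can be checked numerically. The following is a minimal sketch with arbitrary, assumed sizes (3 input channels, 5 output channels, $4 \times 4$ images): it copies the weights of a `nn.Conv2d` into a `nn.Linear` and applies the linear layer to the channel vector at each pixel.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
X = torch.rand(2, 3, 4, 4)  # (batch, channels, height, width), sizes are arbitrary

conv = nn.Conv2d(3, 5, kernel_size=1)
fc = nn.Linear(3, 5)
with torch.no_grad():
    # The conv weight has shape (5, 3, 1, 1); the linear weight has shape (5, 3).
    fc.weight.copy_(conv.weight.reshape(5, 3))
    fc.bias.copy_(conv.bias)

Y_conv = conv(X)  # shape (2, 5, 4, 4)
# Move channels last so that Linear acts on each pixel's channel vector, then move them back.
Y_fc = fc(X.permute(0, 2, 3, 1)).permute(0, 3, 1, 2)

same = torch.allclose(Y_conv, Y_fc, atol=1e-5)
print(same)
```

Both paths compute the same linear map over channels at every pixel, so the outputs should agree up to floating-point error.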

A **NiN block** starts with a normal convolutional layer followed by two $1 \times 1$ convolution layers.

![](http://d2l.ai/_images/nin.svg)

In [3]:
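def nin_block(in_channels, out_channels, kernel_size, strides, padding):
    """One spatial convolution followed by two 1x1 convolutions, each with a ReLU."""
    return nn.Sequential(
        # Ordinary convolution: mixes information across neighbouring pixels.
        nn.Conv2d(in_channels, out_channels, kernel_size, strides, padding),
        nn.ReLU(),
        # 1x1 convolutions: act as per-pixel fully connected layers over the channels.
        nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU(),
        nn.Conv2d(out_channels, out_channels, kernel_size=1),
        nn.ReLU())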

## NiN Model

The **NiN model** consists of **four NiN blocks** whose first convolutional layers use window shapes of $11 \times 11$, $5 \times 5$, and $3 \times 3$ (the last two blocks both use $3 \times 3$).

The **number of channels** in each block is similar to that in AlexNet.

A $3 \times 3$ **max-pooling** layer with stride 2 follows each of the first three NiN blocks.

Finally, the NiN model ends with a **global average pooling** layer.

Since there are no **fully connected layers** at the end of the model, the **number of output channels** of the last NiN block equals the number of classes, and after global average pooling each channel's **output** has shape $1 \times 1$.
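To see what the global average pooling head does, here is a small sketch on a hypothetical input (a batch of 4 examples with 10 class channels and $5 \times 5$ feature maps). It checks that the pooled output is simply the mean of each channel over its spatial positions and that this head has no learnable parameters.

```python
import torch
import torch.nn as nn

X = torch.rand(4, 10, 5, 5)  # hypothetical: batch of 4, 10 class channels, 5x5 maps

gap = nn.Sequential(nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten())
scores = gap(X)              # one score per class: shape (4, 10)

# Global average pooling is the mean over the height and width dimensions.
manual = X.mean(dim=(2, 3))
matches = torch.allclose(scores, manual, atol=1e-6)

# Unlike a fully connected head, this head has nothing to learn.
n_params = sum(p.numel() for p in gap.parameters())
print(scores.shape, matches, n_params)
```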

In [11]:
net = nn.Sequential(nin_block(in_channels=1, out_channels=96, kernel_size=11, strides=4, padding=0),
                    nn.MaxPool2d(kernel_size=3, stride=2),
                    nin_block(in_channels=96, out_channels=256, kernel_size=5, strides=1, padding=2),
                    nn.MaxPool2d(kernel_size=3, stride=2),
                    nin_block(in_channels=256, out_channels=384, kernel_size=3, strides=1, padding=1),
                    nn.MaxPool2d(kernel_size=3, stride=2),
                    nn.Dropout(0.5),
                    nin_block(in_channels=384, out_channels=10, kernel_size=3, strides=1, padding=1),
                    nn.AdaptiveAvgPool2d(output_size=(1,1)),
                    nn.Flatten())

In [12]:
X = torch.rand(size=(1, 1, 224, 224))

for layer in net:
    X = layer(X)
    print(layer.__class__.__name__,'output shape:\t', X.shape)

Sequential output shape:	 torch.Size([1, 96, 54, 54])
MaxPool2d output shape:	 torch.Size([1, 96, 26, 26])
Sequential output shape:	 torch.Size([1, 256, 26, 26])
MaxPool2d output shape:	 torch.Size([1, 256, 12, 12])
Sequential output shape:	 torch.Size([1, 384, 12, 12])
MaxPool2d output shape:	 torch.Size([1, 384, 5, 5])
Dropout output shape:	 torch.Size([1, 384, 5, 5])
Sequential output shape:	 torch.Size([1, 10, 5, 5])
AdaptiveAvgPool2d output shape:	 torch.Size([1, 10, 1, 1])
Flatten output shape:	 torch.Size([1, 10])
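The spatial sizes printed above can be reproduced by hand. The sketch below applies the usual output-size formula for convolution and pooling layers, $\lfloor (n + 2p - k) / s \rfloor + 1$, assuming no dilation and the default floor rounding. The helper `out_size` is introduced here for illustration; the $1 \times 1$ convolutions are omitted because they leave the spatial size unchanged.

```python
def out_size(n, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation, floor rounding)."""
    return (n + 2 * padding - kernel) // stride + 1

# (kernel, stride, padding) of each size-changing layer, in the order used by the model.
layers = [
    (11, 4, 0),  # first NiN block
    (3, 2, 0),   # max-pooling
    (5, 1, 2),   # second NiN block
    (3, 2, 0),   # max-pooling
    (3, 1, 1),   # third NiN block
    (3, 2, 0),   # max-pooling
    (3, 1, 1),   # fourth NiN block
]

size = 224
sizes = []
for kernel, stride, padding in layers:
    size = out_size(size, kernel, stride, padding)
    sizes.append(size)

print(sizes)  # [54, 26, 26, 12, 12, 5, 5]
```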
