In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import torchvision
import torchvision.datasets as datasets
import torchvision.transforms as transforms

# Convolution
A Convolutional Neural Network (CNN) is a neural network that contains at least one convolutional layer. Convolution is a widely used operator in signal processing. In digital image processing, convolution applies `filters` to the original image in order to detect features, sharpen or blur the image, or even restore it. In a CNN, the filters are learned automatically. Formally, a convolution in a CNN can be written as:
$$ S(i, j) = (K* I)(i, j) = \sum_m\sum_nI(i-m, j-n)K(m, n)$$
where $I$ is the input image, $K$ is the kernel (i.e., the filter), and $S$ is the output (a.k.a. the feature map). The following example (from the book [Deep Learning](https://www.deeplearningbook.org/contents/convnets.html)) shows how a convolution is computed:
![Convolution](./convolution.jpg)
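
To make the formula concrete, here is a minimal sketch that evaluates the double sum directly and compares it with `F.conv2d`. The helper name `conv2d_direct` and the toy `image` and `kernel` values are illustrative assumptions, and only the "valid" region (no padding) is computed. Note that `F.conv2d` computes a cross-correlation, i.e. it does not flip the kernel, so the kernel is flipped by hand before calling it.

```python
import torch
import torch.nn.functional as F


def conv2d_direct(image, kernel):
    """'Valid' 2-D convolution evaluated directly from the double sum (hypothetical helper)."""
    H, W = image.shape
    kh, kw = kernel.shape
    out = torch.zeros(H - kh + 1, W - kw + 1)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            for m in range(kh):
                for n in range(kw):
                    # Indices are shifted by (kh - 1, kw - 1) so that
                    # I(i - m, j - n) always falls inside the image.
                    out[i, j] += image[i + kh - 1 - m, j + kw - 1 - n] * kernel[m, n]
    return out


# Toy data, chosen only for illustration.
image = torch.arange(12.0).reshape(3, 4)
kernel = torch.tensor([[1.0, 2.0],
                       [3.0, 4.0]])

direct = conv2d_direct(image, kernel)

# F.conv2d expects (batch, channel, height, width) tensors and does not flip
# the kernel, so flip it along both axes to obtain a true convolution.
flipped = torch.flip(kernel, dims=[0, 1])
library = F.conv2d(image[None, None], flipped[None, None])[0, 0]

print(torch.allclose(direct, library))
```

Because the filters of a CNN are learned, the flip makes no practical difference during training: the network simply learns the flipped filter.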

## Advantages
Convolution has several advantages. First, it needs far fewer parameters than a fully connected network: the filters are typically small, and the same filters are shared across the whole image. Second, the convolution operation takes the neighborhood of each pixel into account, whereas an MLP treats each pixel as an independent input, so a CNN is better suited to the spatial structure of images. Last but not least, the filters do not have to change when the input image size changes, whereas for an MLP a different input size means a different number of parameters.
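
The parameter savings and the independence from input size can be checked with a rough sketch. The helper `count_parameters` and the layer sizes (a 28x28 single-channel input mapped to 16 feature maps of the same size) are assumptions picked for illustration.

```python
import torch
import torch.nn as nn


def count_parameters(module):
    """Total number of learnable values in a module (hypothetical helper)."""
    return sum(p.numel() for p in module.parameters())


# A convolutional layer: 16 filters of size 5x5 over 1 input channel.
conv = nn.Conv2d(1, 16, kernel_size=5, padding=2)

# A fully connected layer producing the same number of outputs
# (16 maps of 28x28) from a flattened 28x28 image.
dense = nn.Linear(28 * 28, 16 * 28 * 28)

conv_params = count_parameters(conv)    # 16 * (1 * 5 * 5) weights + 16 biases
dense_params = count_parameters(dense)  # (28*28) * (16*28*28) weights + 16*28*28 biases
print(conv_params, dense_params)

# The same filters apply to inputs of any spatial size.
small = conv(torch.randn(1, 1, 28, 28))
large = conv(torch.randn(1, 1, 64, 48))
print(small.shape, large.shape)
```

The dense layer, by contrast, would raise a shape error if it were given a flattened image of any size other than 28x28.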

## Back Propagation
The gradient of a convolution operation w.r.t. both data and weights is also a convolution operation.
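
This statement can be checked numerically against autograd. The sketch below assumes the simplest setting (one sample, one channel, stride 1, no padding); the tensor sizes are arbitrary. Under those assumptions, the gradient w.r.t. the data is a transposed convolution of the upstream gradient with the same weights, and the gradient w.r.t. the weights is a correlation of the input with the upstream gradient.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

x = torch.randn(1, 1, 6, 6, requires_grad=True)  # data
w = torch.randn(1, 1, 3, 3, requires_grad=True)  # weights

y = F.conv2d(x, w)       # forward pass, output shape (1, 1, 4, 4)
g = torch.randn_like(y)  # stand-in for the gradient coming from the next layer
y.backward(g)            # autograd fills x.grad and w.grad

# Gradient w.r.t. the data: transposed convolution of g with the same weights.
grad_x = F.conv_transpose2d(g, w.detach())

# Gradient w.r.t. the weights: slide g over the input as if g were the filter.
grad_w = F.conv2d(x.detach(), g)

print(torch.allclose(x.grad, grad_x, atol=1e-5))
print(torch.allclose(w.grad, grad_w, atol=1e-5))
```

With several channels, batches, strides, or padding the same idea holds, but the bookkeeping of dimensions becomes more involved than in this sketch.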

# Pooling
A convolutional layer is usually followed by a pooling layer. Pooling layers are mainly used for downsampling, which shrinks the feature maps and therefore the number of parameters needed by later layers. In practice, max pooling often outperforms average pooling.
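
As a small illustration of the downsampling, the sketch below applies 2x2 max pooling and 2x2 average pooling with stride 2 to a hand-written 4x4 feature map (the values are made up for the example).

```python
import torch
import torch.nn.functional as F

# A single 4x4 feature map in (batch, channel, height, width) layout.
x = torch.tensor([[1.0, 2.0, 5.0, 6.0],
                  [3.0, 4.0, 7.0, 8.0],
                  [9.0, 10.0, 13.0, 14.0],
                  [11.0, 12.0, 15.0, 16.0]]).reshape(1, 1, 4, 4)

max_pooled = F.max_pool2d(x, kernel_size=2, stride=2)  # keeps the largest value per window
avg_pooled = F.avg_pool2d(x, kernel_size=2, stride=2)  # keeps the mean value per window

print(max_pooled.shape)   # torch.Size([1, 1, 2, 2])
print(max_pooled[0, 0])   # tensor([[ 4.,  8.], [12., 16.]])
print(avg_pooled[0, 0])   # tensor([[ 2.5000,  6.5000], [10.5000, 14.5000]])
```

Each spatial dimension is halved, so the feature map holds a quarter of the original values.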

## Back propagation
For max pooling, the gradient of the output w.r.t. the data (a pooling layer has no weights) is 1 at the position of the maximum and 0 at every other position. In particular, there is no gradient with respect to the non-maximum values, since changing them slightly does not affect the output. Furthermore, the max is locally linear with slope 1 with respect to the input that actually achieves the max. Thus, the gradient from the next layer is passed back only to the neuron that achieved the max; all other neurons get zero gradient.
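
A tiny autograd check makes this routing visible. The input values and the upstream gradient of 5 are arbitrary choices for the example.

```python
import torch
import torch.nn.functional as F

# One 2x2 window whose maximum (3.0) sits in the top-right corner.
x = torch.tensor([[1.0, 3.0],
                  [2.0, 0.0]]).reshape(1, 1, 2, 2).requires_grad_()

y = F.max_pool2d(x, kernel_size=2)  # single output value: 3.0

# Pretend the next layer sends back a gradient of 5.0.
y.backward(torch.tensor([[[[5.0]]]]))

print(x.grad[0, 0])  # tensor([[0., 5.], [0., 0.]])
```

The position of the maximum receives the upstream gradient unchanged (slope 1), not the maximum value itself, and every other position receives zero.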

# Example 

In [2]:
class CNN(nn.Module):
    def __init__(self, num_classes=10):
        super(CNN, self).__init__()
        self.layer1 = nn.Sequential(nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))
        self.layer2 = nn.Sequential(nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
                                   nn.ReLU(),
                                   nn.MaxPool2d(kernel_size=2, stride=2))
        self.fc1 = nn.Linear(7*7*32, num_classes)
        
    def forward(self, x):
        out = self.layer1(x)
        out = self.layer2(out)
        out = out.reshape(out.size(0), -1)
        out = self.fc1(out)
        return out
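
The input size `7*7*32` of the final linear layer follows from the shapes of the intermediate feature maps. As a quick check, the sketch below rebuilds the two blocks as standalone modules (same settings as in the class above) and traces a dummy MNIST-sized input of shape 1x28x28 through them; the variable names are illustrative.

```python
import torch
import torch.nn as nn

# Standalone copies of the two convolutional blocks used in the model.
layer1 = nn.Sequential(nn.Conv2d(1, 16, kernel_size=5, stride=1, padding=2),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=2, stride=2))
layer2 = nn.Sequential(nn.Conv2d(16, 32, kernel_size=5, stride=1, padding=2),
                       nn.ReLU(),
                       nn.MaxPool2d(kernel_size=2, stride=2))

x = torch.zeros(1, 1, 28, 28)      # one dummy grayscale image

h1 = layer1(x)                     # padding=2 keeps 28x28, pooling halves it to 14x14
h2 = layer2(h1)                    # padding=2 keeps 14x14, pooling halves it to 7x7
flat = h2.reshape(h2.size(0), -1)  # flatten everything except the batch dimension

print(h1.shape)    # torch.Size([1, 16, 14, 14])
print(h2.shape)    # torch.Size([1, 32, 7, 7])
print(flat.shape)  # torch.Size([1, 1568])
```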

In [3]:
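# Use the GPU when one is available, otherwise fall back to the CPU
device = 'cuda' if torch.cuda.is_available() else 'cpu'
num_classes = 10
num_epoch = 10
model = CNN(num_classes).to(device)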
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)

train_dataset = datasets.MNIST(root='./data', train=True, download=True, transform=transforms.ToTensor())
test_dataset = datasets.MNIST(root='./data', train=False, download=True, transform=transforms.ToTensor())

train_data_loader = torch.utils.data.DataLoader(train_dataset, batch_size=4, shuffle=True)
test_data_loader = torch.utils.data.DataLoader(test_dataset, batch_size=4, shuffle=True)
print('==>>> total training batch number: {}'.format(len(train_data_loader)))
print('==>>> total testing batch number: {}'.format(len(test_data_loader)))

# train
for epoch in range(num_epoch):
    for i, (images, labels) in enumerate(train_data_loader):
        images = images.to(device)
        labels = labels.to(device)
        
        # Forward pass
        outputs = model(images)
        loss = criterion(outputs, labels)
        
        # Backward and optimize
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

==>>> total training batch number: 15000
==>>> total testing batch number: 2500


Dropout and BatchNorm (and possibly some custom modules) behave differently during training and evaluation, so you must tell the model when to switch to evaluation mode by calling `.eval()` on it (and `.train()` to switch back).
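
The difference is easy to see on a standalone dropout layer. The sketch below is only an illustration (the CNN defined above contains no dropout); the drop probability of 0.5 and the all-ones input are arbitrary choices.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()         # training mode: entries are randomly zeroed,
train_out = drop(x)  # and the survivors are scaled by 1 / (1 - p) = 2

drop.eval()          # evaluation mode: dropout becomes the identity
eval_out = drop(x)

print(train_out)
print(eval_out)
```

In training mode every entry is either 0 or 2, while in evaluation mode the input passes through unchanged.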

In [10]:
# evaluation
model.eval()

CNN(
  (layer1): Sequential(
    (0): Conv2d(1, 16, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (layer2): Sequential(
    (0): Conv2d(16, 32, kernel_size=(5, 5), stride=(1, 1), padding=(2, 2))
    (1): ReLU()
    (2): MaxPool2d(kernel_size=2, stride=2, padding=0, dilation=1, ceil_mode=False)
  )
  (fc1): Linear(in_features=1568, out_features=10, bias=True)
)

In [11]:
with torch.no_grad():
    correct = 0
    total = 0
    for images, labels in test_data_loader:
        images = images.to(device)
        labels = labels.to(device)
        outputs = model(images)
        # The predicted class is the index of the largest output score
        _, predicted = torch.max(outputs, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

    print('ACC: {}%'.format(100 * correct / total))

ACC: 99.05%
