# **COSE474-2024F: Deep Learning HW2**
### Student Name: 아이샤
### Student ID: 2022320119

## 0.1 Installation

In [None]:
pip install torch==2.0.0 torchvision==0.15.1

Collecting torch==2.0.0
  Downloading torch-2.0.0-cp310-cp310-manylinux1_x86_64.whl.metadata (24 kB)
Collecting torchvision==0.15.1
  Downloading torchvision-0.15.1-cp310-cp310-manylinux1_x86_64.whl.metadata (11 kB)
Collecting nvidia-cuda-nvrtc-cu11==11.7.99 (from torch==2.0.0)
  Downloading nvidia_cuda_nvrtc_cu11-11.7.99-2-py3-none-manylinux1_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu11==11.7.99 (from torch==2.0.0)
  Downloading nvidia_cuda_runtime_cu11-11.7.99-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cuda-cupti-cu11==11.7.101 (from torch==2.0.0)
  Downloading nvidia_cuda_cupti_cu11-11.7.101-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu11==8.5.0.96 (from torch==2.0.0)
  Downloading nvidia_cudnn_cu11-8.5.0.96-2-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu11==11.10.3.66 (from torch==2.0.0)
  Downloading nvidia_cublas_cu11-11.10.3.66-py3-none-manylinux1_x86_64.whl.metadata (1.6 kB)

In [None]:
pip install d2l==1.0.3

## 7.1 From Fully Connected Layers to Convolutions

This section introduces the limitations of using fully connected layers (MLPs) for high-dimensional perceptual data like images. It explains that convolutional neural networks (CNNs) can exploit the spatial structure of images, drastically reducing the number of parameters required compared to fully connected networks. CNNs take advantage of two key principles: translation invariance (ability to detect objects regardless of their position) and locality (focusing on local regions of the image for analysis). These properties make CNNs effective for tasks like image classification.








### 7.1.2 Constraining the MLP
$$
[\mathbf{H}]_{i,j} = [\mathbf{U}]_{i,j} + \sum_{k} \sum_{l} [\mathbf{W}]_{i,j,k,l} [\mathbf{X}]_{k,l}
$$

$$
= [\mathbf{U}]_{i,j} + \sum_{a} \sum_{b} [\mathbf{V}]_{i,j,a,b} [\mathbf{X}]_{i+a,j+b}.
$$

This section discusses transforming fully connected layers into convolutional layers. By recognizing the spatial structure in images, we reduce the number of parameters from a fully connected layer to a convolutional one, resulting in more efficient models.



#### 7.1.2.1 Translation Invariance
$$
[\mathbf{H}]_{i,j} = u + \sum_{a} \sum_{b} [\mathbf{V}]_{a,b} [\mathbf{X}]_{i+a,j+b}.
$$

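Translation invariance means that a shift in the input image $\mathbf{X}$ should simply lead to a shift in the hidden representation $\mathbf{H}$. This is only possible if $\mathbf{V}$ and $\mathbf{U}$ do not depend on the location $(i, j)$, which gives $[\mathbf{V}]_{i,j,a,b} = [\mathbf{V}]_{a,b}$ and a constant bias $u$. The same detector can then recognize an object regardless of where it appears.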




#### 7.1.2.2 Locality
$$
[\mathbf{H}]_{i,j} = u + \sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} [\mathbf{V}]_{a,b} [\mathbf{X}]_{i+a,j+b}.
$$

Locality means that only pixels within a distance $\Delta$ of location $(i, j)$ contribute to $[\mathbf{H}]_{i,j}$. Weights with $|a| > \Delta$ or $|b| > \Delta$ are set to zero, which shrinks the kernel to $(2\Delta + 1)^2$ parameters.
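
To make the parameter savings concrete, the sketch below counts weights at each stage of the derivation for a single-channel image. The function names are illustrative, and the 1000 × 1000 image size is an assumed example rather than something fixed by these notes.

```python
def dense_params(h, w):
    """Weights in [W]_{i,j,k,l}: one per (output pixel, input pixel) pair."""
    return (h * w) ** 2


def shared_params(h, w):
    """Weights in [V]_{a,b} once they no longer depend on (i, j).

    Offsets a and b range over -(h-1)..(h-1) and -(w-1)..(w-1).
    """
    return (2 * h - 1) * (2 * w - 1)


def local_params(delta):
    """Weights in [V]_{a,b} when |a| <= delta and |b| <= delta."""
    return (2 * delta + 1) ** 2


h = w = 1000  # assumed image size
print(dense_params(h, w))   # 1000000000000
print(shared_params(h, w))  # 3996001
print(local_params(2))      # 25
```

Each constraint removes several orders of magnitude: roughly $10^{12}$ weights become about $4 \times 10^6$ with weight sharing, and only a handful remain once the window is local.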



### 7.1.3 Convolutions

This section explains the mathematical concept of convolution. The convolution of two functions $f$ and $g$ measures their overlap when one of them is flipped and shifted by $\mathbf{x}$; in continuous form it is defined by an integral:

$$
(f * g)(\mathbf{x}) = \int f(\mathbf{z}) g(\mathbf{x} - \mathbf{z}) d\mathbf{z}
$$

For discrete objects, this becomes a sum:

$$
(f * g)(i) = \sum_{a} f(a) g(i - a)
$$

For two-dimensional tensors, the sum extends to both dimensions:

$$
(f * g)(i, j) = \sum_{a} \sum_{b} f(a, b) g(i - a, j - b)
$$
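
As a small worked example of the discrete formula, the sketch below evaluates $(f * g)(i) = \sum_a f(a) g(i - a)$ for two short sequences stored as Python lists. The helper name `conv1d_full` is hypothetical, and both sequences are assumed to be zero outside their index range.

```python
def conv1d_full(f, g):
    """Discrete convolution (f * g)(i) = sum_a f(a) g(i - a) for finite lists."""
    out = [0] * (len(f) + len(g) - 1)
    for i in range(len(out)):
        for a in range(len(f)):
            # g is treated as zero outside 0..len(g)-1
            if 0 <= i - a < len(g):
                out[i] += f[a] * g[i - a]
    return out


f = [1, 2, 3]
g = [0, 1, 2]
print(conv1d_full(f, g))                       # [0, 1, 4, 7, 6]
print(conv1d_full(f, g) == conv1d_full(g, f))  # True
```

The second line illustrates that convolution is commutative, which is not true of cross-correlation in general.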

### 7.1.4 Channels
$$
[\mathbf{H}]_{i,j,d} = \sum_{a=-\Delta}^{\Delta} \sum_{b=-\Delta}^{\Delta} \sum_{c} [\mathbf{V}]_{a,b,c,d} [\mathbf{X}]_{i+a,j+b,c}.
$$

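This subsection explains how convolutional layers process images with multiple channels (e.g., RGB). The input $\mathbf{X}$ and the hidden representation $\mathbf{H}$ become third-order tensors, and the kernel $\mathbf{V}$ becomes a fourth-order tensor: index $c$ runs over the input channels and index $d$ selects the output channel.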








## 7.2 Convolutions for Images
This section introduces how convolutional neural networks (CNNs) can efficiently process image data by leveraging convolutions. Instead of using fully connected layers, CNNs use convolutional layers that capture spatial structures, making them effective for image classification, object detection, and other vision-related tasks.



In [None]:
import torch
from torch import nn
from d2l import torch as d2l

### 7.2.1 The Cross-Correlation Operation
Cross-correlation is the operation typically used in convolutional layers, even though it is often referred to as "convolution." The operation slides a kernel (filter) over the input data and performs elementwise multiplication between the input window and the kernel, summing up the results to produce each output element. The output is smaller than the input because the kernel must fit entirely within the image: an $n_h \times n_w$ input and a $k_h \times k_w$ kernel give an output of shape $(n_h - k_h + 1) \times (n_w - k_w + 1)$.



In [None]:
def corr2d(X, K):
    """Compute 2D cross-correlation."""
    h, w = K.shape
    Y = torch.zeros((X.shape[0] - h + 1, X.shape[1] - w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = (X[i:i + h, j:j + w] * K).sum()
    return Y

In [None]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
K = torch.tensor([[0.0, 1.0], [2.0, 3.0]])
corr2d(X, K)
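
As a sanity check, the same example can be worked out with plain Python lists. This is a minimal sketch; the helper name `corr2d_lists` is hypothetical and not part of `d2l`.

```python
def corr2d_lists(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    out_h, out_w = len(X) - h + 1, len(X[0]) - w + 1
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(out_w)]
            for i in range(out_h)]


X_list = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K_list = [[0, 1], [2, 3]]
# Top-left element: 0*0 + 1*1 + 3*2 + 4*3 = 19
print(corr2d_lists(X_list, K_list))  # [[19, 25], [37, 43]]
```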

### 7.2.2 Convolutional Layers
A convolutional layer performs the cross-correlation between the input and kernel, then adds a scalar bias. The kernel and bias are learnable parameters that are updated during training. The kernels are typically initialized randomly, and the model learns them through backpropagation.



In [None]:
class Conv2D(nn.Module):
    def __init__(self, kernel_size):
        super().__init__()
        self.weight = nn.Parameter(torch.rand(kernel_size))
        self.bias = nn.Parameter(torch.zeros(1))

    def forward(self, x):
        return corr2d(x, self.weight) + self.bias

### 7.2.3 Object Edge Detection in Images
Convolutional layers can detect edges in images by applying specific kernels, such as [1, -1], which can approximate a first derivative. When this kernel is applied to an image, it highlights the areas where pixel values change rapidly (edges), making it useful for edge detection.



In [None]:
X = torch.ones((6, 8))
X[:, 2:6] = 0
X

In [None]:
K = torch.tensor([[1.0, -1.0]])

In [None]:
Y = corr2d(X, K)
Y
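
The effect of the kernel can be traced by hand on a single row of the image. The sketch below uses plain lists; the variable names are illustrative.

```python
# One row of X: ones, with zeros in columns 2..5
row = [1, 1, 0, 0, 0, 0, 1, 1]
kernel = [1, -1]

# Each output is x[j] - x[j + 1]: nonzero only where neighbors differ
edges = [row[j] * kernel[0] + row[j + 1] * kernel[1]
         for j in range(len(row) - len(kernel) + 1)]
print(edges)  # [0, 1, 0, 0, 0, -1, 0]
```

The output is 1 at the edge from white to black, -1 at the edge from black to white, and 0 everywhere else.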

In [None]:
corr2d(X.t(), K)

### 7.2.4 Learning a Kernel
Instead of manually designing kernels for tasks like edge detection, CNNs can learn the optimal kernels directly from data. The kernels are initialized randomly, and their values are adjusted during training by minimizing the loss function. This allows CNNs to automatically learn the filters that best capture features like edges, textures, and patterns from the data.



In [None]:
# Construct a two-dimensional convolutional layer with 1 output channel and a
# kernel of shape (1, 2). For the sake of simplicity, we ignore the bias here
conv2d = nn.LazyConv2d(1, kernel_size=(1, 2), bias=False)

# The two-dimensional convolutional layer uses four-dimensional input and
# output in the format of (example, channel, height, width), where the batch
# size (number of examples in the batch) and the number of channels are both 1
X = X.reshape((1, 1, 6, 8))
Y = Y.reshape((1, 1, 6, 7))
lr = 3e-2  # Learning rate

for i in range(10):
    Y_hat = conv2d(X)
    l = (Y_hat - Y) ** 2
    conv2d.zero_grad()
    l.sum().backward()
    # Update the kernel
    conv2d.weight.data[:] -= lr * conv2d.weight.grad
    if (i + 1) % 2 == 0:
        print(f'epoch {i + 1}, loss {l.sum():.3f}')

In [None]:
conv2d.weight.data.reshape((1, 2))

### 7.2.5 Cross-Correlation and Convolution
This section explains the difference between cross-correlation (used in most deep learning frameworks) and strict convolution. Strict convolution flips the kernel horizontally and vertically before sliding it over the input; cross-correlation does not. Because the kernel is learned from data, the layer's output is unaffected by this choice: a layer trained with strict convolution would simply learn the flipped version of the kernel that cross-correlation learns.
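
The relationship can be checked numerically. The sketch below, written with plain lists and hypothetical helper names, computes a strict convolution directly from its definition and compares it with the cross-correlation of the same input with the flipped kernel.

```python
def corr2d_lists(X, K):
    """2D cross-correlation on nested lists (no padding, stride 1)."""
    h, w = len(K), len(K[0])
    return [[sum(X[i + a][j + b] * K[a][b]
                 for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


def flip2d(K):
    """Flip a kernel both vertically and horizontally."""
    return [row[::-1] for row in K[::-1]]


def conv2d_strict(X, K):
    """Strict convolution, restricted to positions where K fully overlaps X."""
    h, w = len(K), len(K[0])
    return [[sum(K[a][b] * X[i + h - 1 - a][j + w - 1 - b]
                 for a in range(h) for b in range(w))
             for j in range(len(X[0]) - w + 1)]
            for i in range(len(X) - h + 1)]


X_list = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
K_list = [[0, 1], [2, 3]]
print(conv2d_strict(X_list, K_list))         # [[5, 11], [23, 29]]
print(corr2d_lists(X_list, flip2d(K_list)))  # [[5, 11], [23, 29]]
```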



### 7.2.6 Feature Map and Receptive Field
A feature map is the output of a convolutional layer, representing the learned features of the input data. The receptive field of an element in the feature map refers to the portion of the input that influences that element. As you stack convolutional layers in a deeper network, the receptive field of each element increases, allowing the network to capture more global information from the input image.
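
How quickly the receptive field grows can be estimated with a short recurrence. The sketch below is a simplified calculation along one spatial dimension; it assumes no dilation and ignores boundary effects from padding, and the function name is illustrative.

```python
def receptive_field(layers):
    """Receptive field size of one output element.

    layers: list of (kernel_size, stride) pairs, ordered from input to output.
    """
    size, jump = 1, 1  # jump = distance in input pixels between adjacent outputs
    for kernel_size, stride in layers:
        size += (kernel_size - 1) * jump
        jump *= stride
    return size


print(receptive_field([(2, 1)]))          # 2
print(receptive_field([(2, 1), (2, 1)]))  # 3
print(receptive_field([(3, 1)] * 3))      # 7
```

Two stacked 2 × 2 layers therefore see a 3 × 3 patch of the input, and three stacked 3 × 3 layers see a 7 × 7 patch.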



## 7.3 Padding and Stride

In [None]:
import torch
from torch import nn

### 7.3.1 Padding
Padding is a technique used in convolutional layers to limit the shrinking of the output after repeated convolutions. It adds extra rows and columns (usually filled with zeros) around the boundary of the input so that the kernel can also be applied at boundary pixels. With a total of $p_h$ padded rows and $p_w$ padded columns, the output shape is $(n_h - k_h + p_h + 1) \times (n_w - k_w + p_w + 1)$; choosing $p_h = k_h - 1$ and $p_w = k_w - 1$ keeps the height and width of the output equal to those of the input, which makes output dimensions easy to predict. Padding is especially useful when applying many convolutional layers in succession.



In [None]:
# We define a helper function to calculate convolutions. It initializes the
# convolutional layer weights and performs corresponding dimensionality
# elevations and reductions on the input and output
def comp_conv2d(conv2d, X):
    # (1, 1) indicates that batch size and the number of channels are both 1
    X = X.reshape((1, 1) + X.shape)
    Y = conv2d(X)
    # Strip the first two dimensions: examples and channels
    return Y.reshape(Y.shape[2:])

# 1 row and column is padded on either side, so a total of 2 rows or columns
# are added
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1)
X = torch.rand(size=(8, 8))
comp_conv2d(conv2d, X).shape

In [None]:
# We use a convolution kernel with height 5 and width 3. The padding on either
# side of the height and width are 2 and 1, respectively
conv2d = nn.LazyConv2d(1, kernel_size=(5, 3), padding=(2, 1))
comp_conv2d(conv2d, X).shape

### 7.3.2 Stride
Stride refers to the number of pixels by which the convolutional kernel is shifted across the input image. In the default case, the kernel slides over one pixel at a time, but by increasing the stride, we can reduce the size of the output. A stride greater than 1 skips over intermediate pixels, allowing for downsampling, which is useful for reducing the computational complexity of the model. A larger stride produces smaller outputs and can help in cases where the input size is too large to process efficiently.








In [None]:
conv2d = nn.LazyConv2d(1, kernel_size=3, padding=1, stride=2)
comp_conv2d(conv2d, X).shape

In [None]:
conv2d = nn.LazyConv2d(1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))
comp_conv2d(conv2d, X).shape
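
The output shapes in this section can be predicted without running a layer. The sketch below applies the usual floor formula along one dimension; it assumes no dilation and takes `padding` as the amount added on each side, which is how the `padding` argument of the layers above is interpreted. The function name is illustrative.

```python
def conv_output_size(n, k, padding=0, stride=1):
    """Output length along one dimension; `padding` is per side."""
    return (n + 2 * padding - k) // stride + 1


# kernel_size=3, padding=1 on an 8 x 8 input
print(conv_output_size(8, 3, padding=1))  # 8
# kernel_size=(5, 3), padding=(2, 1)
print(conv_output_size(8, 5, padding=2), conv_output_size(8, 3, padding=1))  # 8 8
# kernel_size=3, padding=1, stride=2
print(conv_output_size(8, 3, padding=1, stride=2))  # 4
# kernel_size=(3, 5), padding=(0, 1), stride=(3, 4)
print(conv_output_size(8, 3, padding=0, stride=3),
      conv_output_size(8, 5, padding=1, stride=4))  # 2 2
```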

## 7.4 Multiple Input and Multiple Output Channels

This section delves into handling multiple input and output channels in Convolutional Neural Networks (CNNs). While earlier sections simplified examples by using single-channel inputs and outputs, real-world applications like RGB images inherently involve multiple channels. Here, both the input data and the convolutional kernels become three-dimensional tensors, incorporating the channel dimension alongside height and width. The section emphasizes the importance of aligning the number of input channels in the convolutional kernels with those in the input data to ensure proper cross-correlation operations. Additionally, it introduces the concept of multiple output channels, allowing CNNs to learn a diverse set of features by producing multiple feature maps from a single input.



In [None]:
import torch
from d2l import torch as d2l

### 7.4.1 Multiple Input Channels
When dealing with multi-channel input data (e.g., RGB images), the convolutional kernel must have the same number of input channels as the data. Each channel of the input is cross-correlated with its corresponding channel in the kernel, and the per-channel results are summed to produce a single two-dimensional output. This helps capture features across multiple channels. For example, a convolution applied to an RGB image combines information from all three color channels.



In [None]:
def corr2d_multi_in(X, K):
    # Iterate through the 0th dimension (channel) of K first, then add them up
    return sum(d2l.corr2d(x, k) for x, k in zip(X, K))

In [None]:
X = torch.tensor([[[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]],
               [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]])
K = torch.tensor([[[0.0, 1.0], [2.0, 3.0]], [[1.0, 2.0], [3.0, 4.0]]])

corr2d_multi_in(X, K)

### 7.4.2 Multiple Output Channels
In practice, convolutional layers often have multiple output channels. Each output channel is the result of a separate convolution between the input and a unique set of filters. This increases the model's capacity to learn diverse features, with each output channel detecting different patterns in the input. As the network goes deeper, the number of output channels typically increases, allowing the model to learn more complex and hierarchical features.



In [None]:
def corr2d_multi_in_out(X, K):
    # Iterate through the 0th dimension of K, and each time, perform
    # cross-correlation operations with input X. All of the results are
    # stacked together
    return torch.stack([corr2d_multi_in(X, k) for k in K], 0)

In [None]:
K = torch.stack((K, K + 1, K + 2), 0)
K.shape

In [None]:
corr2d_multi_in_out(X, K)
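
Because the kernel has shape $(c_o, c_i, k_h, k_w)$, the number of learnable weights follows directly from the channel counts. The sketch below is a simple count under that assumption; the function name and the second example layer are illustrative.

```python
def conv_weight_count(c_out, c_in, k_h, k_w, bias=True):
    """Number of learnable parameters in a 2D convolutional layer."""
    return c_out * c_in * k_h * k_w + (c_out if bias else 0)


# The stacked kernel K above has shape (3, 2, 2, 2) and no bias
print(conv_weight_count(3, 2, 2, 2, bias=False))  # 24
# A hypothetical layer: 3 input channels, 64 output channels, 3 x 3 kernel
print(conv_weight_count(64, 3, 3, 3))             # 1792
```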

### 7.4.3 $1\times1$ Convolutional Layer
A 1 × 1 convolution may seem counterintuitive at first, but it has practical uses. It operates on the channel dimension only, meaning that it transforms the input across different channels at each pixel location, but doesn't combine spatial information (height and width). This technique is used in more complex architectures to efficiently reduce or expand the number of channels while preserving the spatial resolution of the image. Essentially, it can be seen as a way to perform fully connected operations at every pixel, sharing weights across spatial dimensions.




In [None]:
def corr2d_multi_in_out_1x1(X, K):
    c_i, h, w = X.shape
    c_o = K.shape[0]
    X = X.reshape((c_i, h * w))
    K = K.reshape((c_o, c_i))
    # Matrix multiplication in the fully connected layer
    Y = torch.matmul(K, X)
    return Y.reshape((c_o, h, w))

In [None]:
X = torch.normal(0, 1, (3, 3, 3))
K = torch.normal(0, 1, (2, 3, 1, 1))
Y1 = corr2d_multi_in_out_1x1(X, K)
Y2 = corr2d_multi_in_out(X, K)
assert float(torch.abs(Y1 - Y2).sum()) < 1e-6
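
The "fully connected layer at every pixel" view can also be written out with plain lists. In the sketch below, which uses hypothetical names and small made-up values, the kernel is reduced to a $c_o \times c_i$ weight matrix that is applied to the channel vector at each pixel.

```python
def conv1x1_lists(X, K):
    """1x1 convolution. X: c_i x h x w nested lists; K: c_o x c_i matrix."""
    c_i, h, w = len(X), len(X[0]), len(X[0][0])
    return [[[sum(K[o][c] * X[c][i][j] for c in range(c_i))
              for j in range(w)]
             for i in range(h)]
            for o in range(len(K))]


X_list = [[[1, 2], [3, 4]],       # input channel 0
          [[10, 20], [30, 40]]]   # input channel 1
K_mat = [[1, 1],                  # output channel 0: ch0 + ch1
         [2, -1]]                 # output channel 1: 2 * ch0 - ch1
print(conv1x1_lists(X_list, K_mat))
# [[[11, 22], [33, 44]], [[-8, -16], [-24, -32]]]
```

The height and width are unchanged; only the values across channels are mixed.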

## 7.5 Pooling
Pooling is used in convolutional neural networks (CNNs) to reduce the spatial dimensions (height and width) of feature maps while retaining important information. Pooling layers are placed between convolutional layers, typically helping to make the model more robust to small translations in the input image. The main types of pooling are max-pooling and average pooling, and pooling is especially useful for downsampling and reducing the computational load.



In [None]:
import torch
from torch import nn
from d2l import torch as d2l

### 7.5.1 Maximum Pooling and Average Pooling
Pooling layers slide a fixed-size window over the input data, aggregating the values in the window. In max-pooling, the maximum value within each window is taken, while in average pooling, the average value is taken. Max-pooling is the more common choice in practice, since it keeps the strongest activation in each window. Pooling reduces the spatial resolution of the input, making the network less sensitive to small changes in the input, such as shifts in object positions.



In [None]:
def pool2d(X, pool_size, mode='max'):
    p_h, p_w = pool_size
    Y = torch.zeros((X.shape[0] - p_h + 1, X.shape[1] - p_w + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            if mode == 'max':
                Y[i, j] = X[i: i + p_h, j: j + p_w].max()
            elif mode == 'avg':
                Y[i, j] = X[i: i + p_h, j: j + p_w].mean()
    return Y

In [None]:
X = torch.tensor([[0.0, 1.0, 2.0], [3.0, 4.0, 5.0], [6.0, 7.0, 8.0]])
pool2d(X, (2, 2))

In [None]:
pool2d(X, (2, 2), 'avg')
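
The expected results of the two calls above can be verified with a list-based version of the same sliding window. This is only a sketch for checking the arithmetic; `pool2d_lists` is a hypothetical helper.

```python
def pool2d_lists(X, pool_size, mode='max'):
    """Pooling on nested lists with stride 1 and no padding."""
    p_h, p_w = pool_size
    out = []
    for i in range(len(X) - p_h + 1):
        row = []
        for j in range(len(X[0]) - p_w + 1):
            window = [X[i + a][j + b] for a in range(p_h) for b in range(p_w)]
            if mode == 'max':
                row.append(max(window))
            else:
                row.append(sum(window) / len(window))
        out.append(row)
    return out


X_list = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]
print(pool2d_lists(X_list, (2, 2)))         # [[4, 5], [7, 8]]
print(pool2d_lists(X_list, (2, 2), 'avg'))  # [[2.0, 3.0], [5.0, 6.0]]
```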

### 7.5.2 Padding and Stride
Just like in convolution layers, padding and stride can be applied to pooling layers. Padding ensures that the pooling operation covers the borders of the input data. Stride controls how far the pooling window moves across the input, allowing for further downsampling. By default, the stride equals the size of the pooling window, but it can be manually adjusted to control the output size. This allows for more control over the pooling layer’s output resolution.



In [None]:
X = torch.arange(16, dtype=torch.float32).reshape((1, 1, 4, 4))
X

In [None]:
pool2d = nn.MaxPool2d(3)
# Pooling has no model parameters, hence it needs no initialization
pool2d(X)

In [None]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

In [None]:
pool2d = nn.MaxPool2d((2, 3), stride=(2, 3), padding=(0, 1))
pool2d(X)
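
To see where the values in these outputs come from, the sketch below reimplements max-pooling with stride and padding on nested lists. It assumes padded positions never win the maximum (equivalent to padding with negative infinity) and uses the floor formula for the output size; the helper name is hypothetical.

```python
def max_pool2d_lists(X, k, stride, padding):
    """Max-pooling on a 2D nested list; k, stride, padding are (h, w) pairs."""
    n_h, n_w = len(X), len(X[0])
    out_h = (n_h + 2 * padding[0] - k[0]) // stride[0] + 1
    out_w = (n_w + 2 * padding[1] - k[1]) // stride[1] + 1
    out = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            top = i * stride[0] - padding[0]
            left = j * stride[1] - padding[1]
            # Skip padded positions: they can never be the maximum
            window = [X[r][c]
                      for r in range(max(top, 0), min(top + k[0], n_h))
                      for c in range(max(left, 0), min(left + k[1], n_w))]
            row.append(max(window))
        out.append(row)
    return out


X_list = [[4 * r + c for c in range(4)] for r in range(4)]  # values 0..15
print(max_pool2d_lists(X_list, (3, 3), (3, 3), (0, 0)))  # [[10]]
print(max_pool2d_lists(X_list, (3, 3), (2, 2), (1, 1)))  # [[5, 7], [13, 15]]
print(max_pool2d_lists(X_list, (2, 3), (2, 3), (0, 1)))  # [[5, 7], [13, 15]]
```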

### 7.5.3 Multiple Channels
In multi-channel inputs (like RGB images), the pooling layer applies the pooling operation to each channel separately. This preserves the number of channels from the input to the output. Pooling does not merge channels; it simply reduces the spatial dimensions independently for each channel. For example, applying pooling on a two-channel input will result in a two-channel output, with reduced spatial size.



In [None]:
X = torch.cat((X, X + 1), 1)
X

In [None]:
pool2d = nn.MaxPool2d(3, padding=1, stride=2)
pool2d(X)

### Summary
Pooling is a simple yet powerful technique for downsampling, improving translation invariance, and reducing computational load in CNNs. Max-pooling is generally preferred over average pooling for feature extraction. Strides and padding in pooling layers function similarly to how they do in convolutional layers. Lastly, pooling is applied to each channel separately, so the number of output channels equals the number of input channels.








## 7.6 Convolutional Neural Networks (LeNet)
LeNet was one of the earliest and most influential convolutional neural networks (CNNs), introduced by Yann LeCun in the 1990s. It was designed primarily for handwritten digit recognition and played a significant role in the adoption of neural networks for computer vision tasks. LeNet consists of two key parts: a convolutional encoder (with two convolutional layers) and a dense block (three fully connected layers). This structure allows the network to effectively capture spatial information in images while reducing the number of parameters compared to fully connected layers.



In [None]:
import torch
from torch import nn
from d2l import torch as d2l

### 7.6.1 LeNet
LeNet is composed of:

1. Convolutional Layers: Two convolutional layers, each followed by a sigmoid activation function and an average pooling operation. These convolutional layers capture local spatial information while downsampling the input.
2. Pooling Operations: Pooling layers reduce the resolution of the image, making the network more invariant to small shifts in the input.
3. Fully Connected Layers: After the convolutional layers, the feature maps are flattened into a vector to be processed by fully connected layers, producing a final classification. The last layer outputs 10 scores (logits), one per class in the case of digit classification; the softmax that turns them into probabilities is applied as part of the cross-entropy loss.

The network's key innovation was reducing the size of the model by using convolution and pooling, making it feasible to apply neural networks to image data in a more efficient manner.

In [None]:
def init_cnn(module):
    """Initialize weights for CNNs."""
    if type(module) == nn.Linear or type(module) == nn.Conv2d:
        nn.init.xavier_uniform_(module.weight)

class LeNet(d2l.Classifier):
    """The LeNet-5 model."""
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        self.net = nn.Sequential(
            nn.LazyConv2d(6, kernel_size=5, padding=2), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.LazyConv2d(16, kernel_size=5), nn.Sigmoid(),
            nn.AvgPool2d(kernel_size=2, stride=2),
            nn.Flatten(),
            nn.LazyLinear(120), nn.Sigmoid(),
            nn.LazyLinear(84), nn.Sigmoid(),
            nn.LazyLinear(num_classes))

In [None]:
@d2l.add_to_class(d2l.Classifier)
def layer_summary(self, X_shape):
    X = torch.randn(*X_shape)
    for layer in self.net:
        X = layer(X)
        print(layer.__class__.__name__, 'output shape:\t', X.shape)

model = LeNet()
model.layer_summary((1, 1, 28, 28))
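
The shapes printed by `layer_summary` can be reproduced by hand, which also gives the parameter count of the model. The sketch below assumes a 1 × 28 × 28 input and the layer settings of the `LeNet` class above; `conv_out` is an illustrative helper.

```python
def conv_out(n, k, padding=0, stride=1):
    """Output length along one dimension (padding is per side)."""
    return (n + 2 * padding - k) // stride + 1


n = 28
n = conv_out(n, 5, padding=2)  # conv1: 28
n = conv_out(n, 2, stride=2)   # pool1: 14
n = conv_out(n, 5)             # conv2: 10
n = conv_out(n, 2, stride=2)   # pool2: 5
flat = 16 * n * n              # flattened features: 16 channels x 5 x 5

params = [
    6 * 1 * 5 * 5 + 6,     # conv1 weights + biases
    16 * 6 * 5 * 5 + 16,   # conv2 weights + biases
    flat * 120 + 120,      # first fully connected layer
    120 * 84 + 84,         # second fully connected layer
    84 * 10 + 10,          # output layer
]
print(n, flat)      # 5 400
print(sum(params))  # 61706
```

Most of the parameters sit in the first fully connected layer, not in the convolutional layers.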

### 7.6.2 Training
Here LeNet is trained on the Fashion-MNIST dataset, which consists of 28 × 28 grayscale images in 10 classes (the original LeNet targeted handwritten digits). The model is trained with cross-entropy loss and minibatch stochastic gradient descent. Although CNNs have fewer parameters than comparable fully connected networks, they can still be expensive to compute, because each kernel weight is reused at every spatial position and therefore takes part in many more multiplications.

Training involves initializing the model's parameters and then running the training process for several epochs. While LeNet is relatively simple by modern standards, it was groundbreaking at the time and remains a good starting point for understanding CNNs. The model’s success also showed how CNNs could outperform traditional machine learning models in visual tasks.

In [None]:
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128)
model = LeNet(lr=0.1)
model.apply_init([next(iter(data.get_dataloader(True)))[0]], init_cnn)
trainer.fit(model, data)

## 8.2 Networks Using Blocks (VGG)
The VGG network introduced the idea of using blocks of layers as reusable structures within a deep neural network. Instead of designing each layer individually, VGG uses repeated blocks of layers with convolutional and pooling operations, making the network deeper and more structured. This concept of modular blocks became a foundational principle in deep network design, enabling researchers to build deeper networks more systematically.



In [None]:
import torch
from torch import nn
from d2l import torch as d2l

### 8.2.1 VGG Blocks
A VGG block consists of a sequence of convolutional layers with small 3×3 kernels (with padding of 1 to maintain the resolution), each followed by a ReLU nonlinearity, and finally a 2×2 max-pooling layer with stride 2 that halves the height and width. Stacking several small kernels covers the same receptive field as one larger kernel while using fewer parameters and adding more nonlinearities. Each block in the VGG network downsamples the spatial dimensions, and the number of feature maps (channels) typically grows from block to block.

The function **vgg_block()** in the code defines such a block, allowing for easy customization of the number of convolutional layers and output channels.

In [None]:
def vgg_block(num_convs, out_channels):
    layers = []
    for _ in range(num_convs):
        layers.append(nn.LazyConv2d(out_channels, kernel_size=3, padding=1))
        layers.append(nn.ReLU())
    layers.append(nn.MaxPool2d(kernel_size=2,stride=2))
    return nn.Sequential(*layers)
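
The parameter argument for small kernels can be made concrete. The sketch below compares a stack of 3×3 layers with a single larger kernel covering the same receptive field (two 3×3 layers cover 5×5, three cover 7×7). It assumes the same number of input and output channels in every layer and ignores biases; the function names and the channel count of 64 are illustrative.

```python
def stacked_3x3_weights(num_layers, channels):
    """Weights in `num_layers` stacked 3x3 conv layers with `channels` in and out."""
    return num_layers * 3 * 3 * channels * channels


def single_kernel_weights(kernel_size, channels):
    """Weights in one conv layer with a square kernel and `channels` in and out."""
    return kernel_size * kernel_size * channels * channels


c = 64  # assumed channel count
print(stacked_3x3_weights(2, c), single_kernel_weights(5, c))  # 73728 102400
print(stacked_3x3_weights(3, c), single_kernel_weights(7, c))  # 110592 200704
```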

### 8.2.2 VGG Network
The VGG network was a major breakthrough because it showed that deep networks, with many layers of convolutions, performed better than shallower, wider ones. VGG-11, for instance, has 11 layers (8 convolutional layers and 3 fully connected layers). The VGG family of networks includes other configurations, such as VGG-16 and VGG-19, which use more convolutional layers for better accuracy.

In [None]:
class VGG(d2l.Classifier):
    def __init__(self, arch, lr=0.1, num_classes=10):
        super().__init__()
        self.save_hyperparameters()
        conv_blks = []
        for (num_convs, out_channels) in arch:
            conv_blks.append(vgg_block(num_convs, out_channels))
        self.net = nn.Sequential(
            *conv_blks, nn.Flatten(),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(4096), nn.ReLU(), nn.Dropout(0.5),
            nn.LazyLinear(num_classes))
        self.net.apply(d2l.init_cnn)

In [None]:
VGG(arch=((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))).layer_summary(
    (1, 1, 224, 224))
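
The layer count and the size of the flattened feature vector follow from the `arch` tuple alone. The sketch below traces them for a 224 × 224 input, assuming each block keeps the resolution in its 3×3 convolutions and halves it in its max-pooling layer.

```python
arch = ((1, 64), (1, 128), (2, 256), (2, 512), (2, 512))

size = 224
block_shapes = []
for num_convs, out_channels in arch:
    size //= 2  # the 2x2 max-pooling layer with stride 2 halves height and width
    block_shapes.append((out_channels, size, size))

num_conv_layers = sum(num_convs for num_convs, _ in arch)
flat_features = arch[-1][1] * size * size

print(block_shapes[-1])     # (512, 7, 7)
print(num_conv_layers + 3)  # 11 (8 convolutional + 3 fully connected layers)
print(flat_features)        # 25088
```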

### 8.2.3 Training
VGG is computationally more expensive than AlexNet due to the deeper architecture and larger number of parameters. Therefore, for practical purposes, a smaller VGG variant (with fewer output channels) can be used to train on datasets like Fashion-MNIST. The training process follows similar steps to AlexNet, using mini-batch stochastic gradient descent, but VGG requires more computational power due to the additional layers.

The key insight from VGG is that deeper networks can extract more complex patterns and lead to better performance in tasks like image recognition. However, the increase in depth comes at the cost of higher computation and memory usage.

In [None]:
model = VGG(arch=((1, 16), (1, 32), (2, 64), (2, 128), (2, 128)), lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(224, 224))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

## 8.6 Residual Networks (ResNet) and ResNeXt
As neural networks grow deeper, adding more layers doesn't always improve performance due to vanishing/exploding gradients and other training issues. Residual Networks (ResNet) were designed to address this by making it easier for networks to learn identity mappings. ResNet's key innovation is the "residual block," whose skip (shortcut) connection lets a block's input bypass its layers and be added directly to their output. This allows for much deeper networks to be trained effectively.



In [None]:
import torch
from torch import nn
from torch.nn import functional as F
from d2l import torch as d2l

### 8.6.1 Function Classes
$$ f^*_{\mathcal{F}} = \underset{f}{\mathrm{argmin}} \, L(\mathbf{X}, \mathbf{y}, f) \, \text{subject to} \, f \in \mathcal{F} $$


This equation states the optimization problem: find the function $f^*_{\mathcal{F}}$ within a class of functions $\mathcal{F}$ that minimizes the loss $L(\mathbf{X}, \mathbf{y}, f)$ for given features $\mathbf{X}$ and labels $\mathbf{y}$. The class $\mathcal{F}$ consists of all functions that a particular neural network architecture can reach through its parameters and hyperparameters. The goal is for $f^*_{\mathcal{F}}$ to approximate the true function $f^*$ as closely as possible. A larger class is guaranteed to do at least as well only if it contains $\mathcal{F}$ (nested function classes), which motivates letting added layers fall back to the identity mapping.

### 8.6.2 Residual Blocks
Residual blocks are the core building blocks of ResNet. Instead of directly learning the desired mapping $f(\mathbf{x})$, the layers in a residual block learn the residual $g(\mathbf{x}) = f(\mathbf{x}) - \mathbf{x}$, and the block adds the input back to produce $g(\mathbf{x}) + \mathbf{x}$. The layers therefore only model the difference between the output and the input, which makes training more stable.

In a regular block, the layers must learn the full transformation directly. In a residual block, they learn only the residual, which simplifies learning when the optimal transformation is close to the identity function. If the extra transformation is not needed, the weights in the block can be driven toward zero so that the residual vanishes and the input passes through the shortcut connection essentially unchanged.

In [None]:
class Residual(nn.Module):
    """The Residual block of ResNet models."""
    def __init__(self, num_channels, use_1x1conv=False, strides=1):
        super().__init__()
        self.conv1 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1,
                                   stride=strides)
        self.conv2 = nn.LazyConv2d(num_channels, kernel_size=3, padding=1)
        if use_1x1conv:
            self.conv3 = nn.LazyConv2d(num_channels, kernel_size=1,
                                       stride=strides)
        else:
            self.conv3 = None
        self.bn1 = nn.LazyBatchNorm2d()
        self.bn2 = nn.LazyBatchNorm2d()

    def forward(self, X):
        Y = F.relu(self.bn1(self.conv1(X)))
        Y = self.bn2(self.conv2(Y))
        if self.conv3:
            X = self.conv3(X)
        Y += X
        return F.relu(Y)

In [None]:
blk = Residual(3)
X = torch.randn(4, 3, 6, 6)
blk(X).shape

In [None]:
blk = Residual(6, use_1x1conv=True, strides=2)
blk(X).shape
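
The two shapes above show why the optional 1×1 convolution exists: the main path and the shortcut must produce tensors of the same shape before they can be added. The sketch below traces both paths using the layer settings of the `Residual` class; the helper names are hypothetical.

```python
def conv_out(n, k, padding=0, stride=1):
    """Output length along one dimension (padding is per side)."""
    return (n + 2 * padding - k) // stride + 1


def residual_output_shape(shape, num_channels, use_1x1conv=False, strides=1):
    """Trace shapes through the main path and the shortcut of a residual block."""
    batch, _, h, w = shape
    # Main path: 3x3 conv (padding 1, given stride), then 3x3 conv (padding 1)
    main = (batch, num_channels,
            conv_out(conv_out(h, 3, 1, strides), 3, 1),
            conv_out(conv_out(w, 3, 1, strides), 3, 1))
    # Shortcut: identity, or a 1x1 conv with the same stride
    if use_1x1conv:
        shortcut = (batch, num_channels,
                    conv_out(h, 1, 0, strides), conv_out(w, 1, 0, strides))
    else:
        shortcut = tuple(shape)
    if main != shortcut:
        raise ValueError(f'cannot add {main} and {shortcut}')
    return main


print(residual_output_shape((4, 3, 6, 6), 3))  # (4, 3, 6, 6)
print(residual_output_shape((4, 3, 6, 6), 6, use_1x1conv=True, strides=2))  # (4, 6, 3, 3)
```

Changing the channel count or the stride without the 1×1 convolution raises the `ValueError` in this sketch, mirroring the shape mismatch that the addition would hit.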

### 8.6.3 ResNet Model
ResNet models are built by stacking residual blocks. The first layers resemble those of earlier convolutional networks: a 7×7 convolution with 64 output channels and stride 2, here followed by batch normalization, and then a 3×3 max-pooling layer with stride 2. Then, residual blocks are added in groups, with each group potentially reducing the spatial resolution while increasing the number of channels.


In ResNet-18, for example, there are four groups of two residual blocks each. The first group keeps 64 channels and the resolution left by the max-pooling layer; each later group doubles the number of channels and halves the height and width in its first block. The final output is passed through a global average pooling layer and a fully connected layer for classification.


The architecture allows ResNet to handle very deep networks (e.g., ResNet-152) without suffering from the degradation problem that typically affects deeper networks.



In [None]:
class ResNet(d2l.Classifier):
    def b1(self):
        return nn.Sequential(
            nn.LazyConv2d(64, kernel_size=7, stride=2, padding=3),
            nn.LazyBatchNorm2d(), nn.ReLU(),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1))

In [None]:
@d2l.add_to_class(ResNet)
def block(self, num_residuals, num_channels, first_block=False):
    blk = []
    for i in range(num_residuals):
        if i == 0 and not first_block:
            blk.append(Residual(num_channels, use_1x1conv=True, strides=2))
        else:
            blk.append(Residual(num_channels))
    return nn.Sequential(*blk)

In [None]:
@d2l.add_to_class(ResNet)
def __init__(self, arch, lr=0.1, num_classes=10):
    super(ResNet, self).__init__()
    self.save_hyperparameters()
    self.net = nn.Sequential(self.b1())
    for i, b in enumerate(arch):
        self.net.add_module(f'b{i+2}', self.block(*b, first_block=(i==0)))
    self.net.add_module('last', nn.Sequential(
        nn.AdaptiveAvgPool2d((1, 1)), nn.Flatten(),
        nn.LazyLinear(num_classes)))
    self.net.apply(d2l.init_cnn)

In [None]:
class ResNet18(ResNet):
    def __init__(self, lr=0.1, num_classes=10):
        super().__init__(((2, 64), (2, 128), (2, 256), (2, 512)),
                       lr, num_classes)

ResNet18().layer_summary((1, 1, 96, 96))
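
The name "ResNet-18" and the shapes reported by `layer_summary` can both be derived from the architecture tuple. The sketch below assumes a 96 × 96 input and counts only the convolutions on the main path plus the final fully connected layer (the 1×1 shortcut convolutions are not counted); `conv_out` is an illustrative helper.

```python
def conv_out(n, k, padding=0, stride=1):
    """Output length along one dimension (padding is per side)."""
    return (n + 2 * padding - k) // stride + 1


arch = ((2, 64), (2, 128), (2, 256), (2, 512))

size = 96
size = conv_out(size, 7, padding=3, stride=2)  # 7x7 conv: 48
size = conv_out(size, 3, padding=1, stride=2)  # 3x3 max-pool: 24

shapes = []
for i, (num_residuals, channels) in enumerate(arch):
    if i > 0:
        # The first block of every group after the first uses stride 2
        size = conv_out(size, 3, padding=1, stride=2)
    shapes.append((channels, size, size))

# 1 stem conv + 2 convs per residual block + 1 fully connected layer
depth = 1 + sum(2 * num_residuals for num_residuals, _ in arch) + 1

print(shapes)  # [(64, 24, 24), (128, 12, 12), (256, 6, 6), (512, 3, 3)]
print(depth)   # 18
```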

### 8.6.4 Training
ResNet’s effectiveness lies in its ability to train very deep networks efficiently. The use of residual connections allows for easier gradient flow and better convergence during training, as the identity mapping can be learned if needed.

ResNet-18, for instance, can be trained on datasets like Fashion-MNIST with mini-batch stochastic gradient descent. Residual connections make optimization easier rather than acting as a regularizer, so a model of this capacity can still overfit a small dataset, and deeper versions of ResNet generally need larger datasets to reach their best performance.

In [None]:
model = ResNet18(lr=0.01)
trainer = d2l.Trainer(max_epochs=10, num_gpus=1)
data = d2l.FashionMNIST(batch_size=128, resize=(96, 96))
model.apply_init([next(iter(data.get_dataloader(True)))[0]], d2l.init_cnn)
trainer.fit(model, data)

## Exercises

### 7.1.6

In [None]:
import torch
import torch.nn.functional as F

# Create a 1D convolution for audio signal
audio_signal = torch.randn(1, 1, 100)  # Batch size 1, 1 channel, length 100
conv1d = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3, stride=1, padding=1)

# Apply convolution
output_signal = conv1d(audio_signal)
print("Audio signal after convolution: ", output_signal.shape)


### 7.2.8

In [None]:
import torch
import torch.nn.functional as F

# Define a directional edge-detection kernel that responds to horizontal
# intensity changes (a Prewitt-style kernel)
kernel = torch.tensor([[-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0], [-1.0, 0.0, 1.0]])
kernel = kernel.unsqueeze(0).unsqueeze(0)  # Shape (1, 1, 3, 3)

# Image with a diagonal edge: ones on and above the main diagonal, zeros below
image = torch.triu(torch.ones(5, 5)).reshape(1, 1, 5, 5)

# Apply convolution
edge_detected = F.conv2d(image, kernel, padding=1)
print("Edge-detected output shape: ", edge_detected.shape)


### 7.3.4

In [None]:
import torch

# Define a convolutional layer with specific kernel size, padding, and stride
conv2d = torch.nn.Conv2d(1, 1, kernel_size=(3, 5), padding=(0, 1), stride=(3, 4))

# Input tensor (random image)
X = torch.rand(size=(1, 1, 10, 10))

# Apply convolution
output = conv2d(X)
print("Output shape with kernel (3, 5), padding (0, 1), and stride (3, 4):", output.shape)


### 7.4.5

In [None]:
import torch

# Define two convolution kernels
conv1 = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)
conv2 = torch.nn.Conv2d(1, 1, kernel_size=3, padding=1)

# Input tensor
X = torch.rand(size=(1, 1, 10, 10))

# Apply convolutions
output1 = conv1(X)
output2 = conv2(output1)

print("Output shape after applying two convolutions: ", output2.shape)
