<img src="https://fsdl.me/logo-720-dark-horizontal">

# Lab 02b: Training a CNN on Synthetic Handwriting Data

### What You Will Learn

- Fundamental principles for building neural networks with convolutional components
- How to generate synthetic data in PyTorch
- How to use Lightning's training framework via a CLI

## Setup

If you're running this notebook on Google Colab,
the cell below will run full environment setup.

It should take about three minutes to run.

In [None]:
%env FSDL_REPO=fsdl-text-recognizer-2022

In [None]:
lab_idx = 2

if "bootstrap" not in locals() or bootstrap.run:
    # path management for Python
    pythonpath, = !echo $PYTHONPATH
    if "." not in pythonpath.split(":"):
        pythonpath = ".:" + pythonpath
        %env PYTHONPATH={pythonpath}
        !echo $PYTHONPATH

    # get both Colab and local notebooks into the same state
    !wget --quiet https://fsdl.me/gist-bootstrap -O bootstrap.py
    import bootstrap

    # change into the lab directory
    bootstrap.change_to_lab_dir(lab_idx=lab_idx)

    # allow "hot-reloading" of modules
    %load_ext autoreload
    %autoreload 2

    bootstrap.run = False  # change to True to re-run setup
    
!pwd
%ls

# Why convolutions?

The most basic neural networks,
multi-layer perceptrons,
are built by alternating
parameterized linear transformations
with non-linear transformations.

This combination is capable of expressing
[functions of arbitrary complexity](http://neuralnetworksanddeeplearning.com/chap4.html),
so long as those functions
take in fixed-size arrays and return fixed-size arrays.

But not all functions have that type signature.

For example, we might want to identify the content of images
that have different sizes.
Without gross hacks,
an MLP won't be able to solve this problem,
even though it seems simple enough.

In [None]:
import random

import IPython.display as display

randsize = 10 ** (random.random() * 2 + 1)

Url = "https://fsdl-public-assets.s3.us-west-2.amazonaws.com/emnist/U.png"

# run multiple times to display the same image at different sizes
#  the content of the image remains unambiguous
display.Image(url=Url, width=randsize, height=randsize)
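
To make the type-signature problem concrete, here is a minimal sketch that uses a one-dimensional signal in place of an image. A `torch.nn.Linear` layer bakes its input size into its weight matrix, while a `torch.nn.Conv1d` layer only fixes its kernel size. The layer sizes and the names `signal` and `results` are illustrative choices, not part of the lab code.

```python
import torch

dense = torch.nn.Linear(in_features=8, out_features=1)  # weight matrix has shape (1, 8)
conv = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3)  # kernel has shape (1, 1, 3)

results = {}
for length in (8, 12):
    signal = torch.randn(1, 1, length)  # (batch, channels, length)
    conv_shape = tuple(conv(signal).shape)  # the kernel slides over any length >= 3
    try:
        dense_shape = tuple(dense(signal).shape)
    except RuntimeError:  # raised when the last dimension is not 8
        dense_shape = None
    results[length] = (conv_shape, dense_shape)
    print(f"length {length}: conv -> {conv_shape}, dense -> {dense_shape}")
```

The dense layer only accepts the input length it was built for, while the convolution's output simply grows with its input.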

Relatedly, MLPs are too general to be efficient:
they use an unstructured matrix
to transform their inputs,
but most of the data we want to apply them to is highly structured.

Being able to represent arbitrary functions is nice, but
[most functions are monstrous outrages against common sense](https://en.wikipedia.org/wiki/Weierstrass_function#Density_of_nowhere-differentiable_functions).
It is useful to encode some of our assumptions
about the kinds of functions we might want to learn
from our data into our model's architecture.

One of the most common types of structure in data is "locality" --
the most relevant information for understanding or predicting a pixel
is a small number of pixels around it.

Locality is a fundamental feature of the physical world,
so it shows up in data drawn from physical observations,
like photographs and audio recordings.

Locality means most meaningful linear transformations of our input
only have large weights in a small number of entries that are close to one another,
rather than having weights in all entries.

In [None]:
import torch

generic_linear_transform = torch.randn(8, 1)
print("generic:", generic_linear_transform, sep="\n")

local_linear_transform = torch.tensor([
    [0], [0], [0], [random.random()], [random.random()], [random.random()], [0], [0]])
print("local:", local_linear_transform, sep="\n")

Another type of structure commonly observed is "translation invariance" --
the top-left pixel is not meaningfully different from the bottom-right pixel
or a pixel in the middle of the image.
Relative relationships matter more than absolute relationships.

Translation invariance arises in images because there is generally no privileged
vantage point for taking the image.
We could just as easily have taken the image while standing a few feet to the left or right.

Translation invariance means that a linear transformation that is meaningful at one position
in our input is likely to be meaningful at all other points.
We can learn something about a linear transformation from a datapoint where it is useful
in the bottom-left and then apply it to another datapoint where it's useful in the top-right.

In [None]:
generic_linear_transform = torch.arange(8)[:, None]
print("generic:", generic_linear_transform, sep="\n")

invariant_linear_transform = torch.stack([torch.roll(generic_linear_transform[:, 0], ii) for ii in range(8)], dim=1)
print("translation invariant:", invariant_linear_transform, sep="\n")

A linear transformation that is translation invariant is called a _convolution_.

If the weights of that linear transformation are mostly zero
except for a few that are close to one another,
that convolution is said to have a _kernel_.
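
Counting free parameters shows what each assumption buys us. The sketch below assumes a square linear map on a one-dimensional input; the function name and dictionary keys are made up for illustration.

```python
def count_free_parameters(input_length, kernel_size):
    """Free weights in a square linear map under increasingly strong assumptions."""
    return {
        # every output gets its own weight for every input
        "generic": input_length * input_length,
        # each row is a shifted copy of one shared row of weights
        "translation_invariant": input_length,
        # the shared row is zero outside a small window
        "translation_invariant_and_local": kernel_size,
    }


counts = count_free_parameters(input_length=8, kernel_size=3)
print(counts)
```

For an input of length 8 and a kernel of size 3, the count drops from 64 to 8 to 3.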

In [None]:
conv_layer = torch.nn.Conv1d(in_channels=1, out_channels=1, kernel_size=3)  # the equivalent of 

conv_layer.weight  # aka kernel

Instead of using normal matrix multiplication to apply the kernel to the input,
we repeatedly apply that kernel over and over again,
"sliding" it over the input to produce an output.

Every convolution kernel has an equivalent matrix form:

In [None]:
conv_kernel_as_vector = torch.hstack([conv_layer.weight[0][0], torch.zeros(5)])
conv_layer_as_matrix = torch.stack([torch.roll(conv_kernel_as_vector, ii) for ii in range(8)], dim=0)
print("convolution matrix:", conv_layer_as_matrix, sep="\n")
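
The equivalence between "sliding" and matrix multiplication can be checked by hand. Below is a minimal pure-Python sketch, assuming a one-dimensional input, a stride of one, and no padding. Like `torch.nn.Conv1d`, it does not flip the kernel. Unlike the `torch.roll` construction above, whose last two rows wrap around the edge of the input, it keeps only the rows where the kernel fits entirely inside the input. All function names are illustrative.

```python
def slide_kernel(inputs, kernel):
    """Apply `kernel` at every position where it fits entirely inside `inputs`."""
    num_positions = len(inputs) - len(kernel) + 1
    return [
        sum(weight * value for weight, value in zip(kernel, inputs[start:start + len(kernel)]))
        for start in range(num_positions)
    ]


def kernel_to_matrix(kernel, input_length):
    """Build the matrix whose rows are shifted, zero-padded copies of `kernel`."""
    num_positions = input_length - len(kernel) + 1
    return [
        [0] * shift + list(kernel) + [0] * (input_length - len(kernel) - shift)
        for shift in range(num_positions)
    ]


def matrix_vector_product(matrix, vector):
    return [sum(weight * value for weight, value in zip(row, vector)) for row in matrix]


signal = [0, 1, 2, 3, 4, 5, 6, 7]
difference_kernel = [-1, 0, 1]

by_sliding = slide_kernel(signal, difference_kernel)
by_matrix = matrix_vector_product(kernel_to_matrix(difference_kernel, len(signal)), signal)
print(by_sliding)  # [2, 2, 2, 2, 2, 2]
print(by_sliding == by_matrix)  # True
```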

> <small> Under the hood, the actual operation that implements the application of a convolutional kernel
need not look like either of these.
Common approaches include
[Winograd-type algorithms](https://arxiv.org/abs/1509.09308)
and [Fast Fourier Transform-based algorithms](https://arxiv.org/abs/1312.5851). </small>

Though they may seem somewhat arbitrary and technical,
convolutions are actually a deep and fundamental bit of mathematics and computer science.
Fundamental as in
[closely related to the multiplication algorithm we learn as children](https://charlesfrye.github.io/math/2019/02/20/multiplication-convoluted-part-one.html)
and deep as in
[closely related to the Fourier transform](https://math.stackexchange.com/questions/918345/fourier-transform-as-diagonalization-of-convolution).
They can show up wherever there is a sum over paths,
as is common in dynamic programming.

See Chris Olah's blog series
([1](https://colah.github.io/posts/2014-07-Conv-Nets-Modular/),
[2](https://colah.github.io/posts/2014-07-Understanding-Convolutions/),
[3](https://colah.github.io/posts/2014-12-Groups-Convolution/))
for a friendly introduction.

# Applying convolutions to handwritten characters: `CNN` and `EMNIST`

## First, data

In [None]:
import text_recognizer.models


text_recognizer.models.CNN??

To instantiate the model, we need a data config, so we first need data.

Let's bring in a slightly more fun dataset than MNIST: _EMNIST_,
which includes handwritten letters as well as digits.

We've built a PyTorch Lightning `DataModule`
to encapsulate all the code needed to get this dataset ready to go.

In [None]:
import text_recognizer.data

emnist = text_recognizer.data.EMNIST()  # configure
emnist.prepare_data()  # download, save to disk
emnist.setup()  # create torch.utils.data.Datasets, do train/val split

This class comes with pretty printing
for quick examination of metadata and basic descriptive statistics.

> <small> You can add pretty printing to your own Python classes by writing
`__str__` or `__repr__` methods for them.
The former is generally expected to be human-readable,
while the latter is generally expected to be machine-readable;
we've broken with that custom here. </small>

In [None]:
emnist
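
As a minimal sketch of the `__str__`/`__repr__` distinction, here is a toy class that defines both. `DatasetSummary` and its fields are hypothetical and are not part of `text_recognizer`.

```python
class DatasetSummary:
    """Hypothetical stand-in for a class that holds dataset metadata."""

    def __init__(self, name, num_classes, input_dims):
        self.name = name
        self.num_classes = num_classes
        self.input_dims = input_dims

    def __repr__(self):
        # unambiguous form: what a notebook shows for a bare expression
        return (
            f"DatasetSummary(name={self.name!r}, "
            f"num_classes={self.num_classes}, input_dims={self.input_dims})"
        )

    def __str__(self):
        # readable form: what print() and str() show
        return f"{self.name}: {self.num_classes} classes, inputs of shape {self.input_dims}"


summary = DatasetSummary("toy-dataset", num_classes=10, input_dims=(1, 28, 28))
print(repr(summary))
print(str(summary))
```

If a class defines only `__repr__`, then `print` falls back to it, which is why a single descriptive `__repr__` is often enough.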

Because we've run `.prepare_data` and `.setup`,
we can expect that this `DataModule` is ready to provide a `DataLoader`
if we invoke the right method --
sticking to the PyTorch Lightning API brings these kinds of convenient guarantees
even when we're not using the `Trainer`.

In [None]:
xs, ys = next(iter(emnist.train_dataloader()))

Run the cell below to inspect random elements of this batch.

In [None]:
import wandb

idx = random.randint(0, len(xs) - 1)  # randint includes both endpoints

print(emnist.mapping[ys[idx]])
wandb.Image(xs[idx]).image

## Putting convolutions in a `torch.nn.Module`

Because we have the data,
we have a `data_config`
and can instantiate the model.

In [None]:
data_config = emnist.config()

cnn = text_recognizer.models.CNN(data_config)
cnn  # reveals the sub-nn.Modules attached to our nn.Module

We can inspect the `.forward` method to see how these `nn.Module`s are used:

In [None]:
cnn.forward??

This is the classic "LeNet" architecture for labeling images:
convolutions followed by non-linearities,
with (maximum) pooling,
then `torch.flatten` to get vectors to pass into an MLP.

The bugbear of convolutional networks is shape inference:
working out the shape of each layer's output from the shape of its input.

The easiest approach is to keep the shape the same within a block
and then downsample between blocks.
Otherwise the
[arithmetic gets pretty hairy](https://arxiv.org/abs/1603.07285).

That's what we do here.
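
The shape arithmetic for a single spatial dimension can be sketched with the standard formula from the linked guide: output = floor((input + 2 * padding - kernel) / stride) + 1. The layer settings below (a 3-wide kernel with padding 1, then a 2-wide pooling window with stride 2) are assumptions chosen for illustration, not values read from `CNN`.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Output length along one spatial dimension of a convolution or pooling layer."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


same = conv_output_size(28, kernel_size=3, padding=1)  # padding preserves the size: 28
shrunk = conv_output_size(28, kernel_size=3)  # without padding we lose the border: 26
pooled = conv_output_size(same, kernel_size=2, stride=2)  # pooling halves the size: 14
print(same, shrunk, pooled)
```

Padding that preserves the size within a block means only the downsampling steps need to be tracked.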

As shapes change, so does the amount of GPU memory taken up by the tensors.
Keeping sizes fixed within a block removes one axis of variation
for an important resource.
After applying our pooling layer,
we can just increase the number of kernels by the same factor
to keep memory constant.
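
A quick sketch of that bookkeeping, counting activation elements per example. The sizes are assumptions for illustration: 64 channels on a 28 by 28 input, followed by pooling that halves each spatial dimension.

```python
def activation_elements(channels, height, width):
    """Number of values in one example's activation tensor."""
    return channels * height * width


before_pool = activation_elements(64, 28, 28)  # 50176
after_pool = activation_elements(64, 14, 14)  # 12544
pool_factor = before_pool // after_pool  # halving both dimensions shrinks the tensor 4x

# increasing the number of kernels by the same factor restores the original size
rebalanced = activation_elements(64 * pool_factor, 14, 14)  # 50176
print(before_pool, after_pool, pool_factor, rebalanced)
```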

See the "parameter efficiency" of convolutional layers:

In [None]:
[p.numel() for p in cnn.parameters()]  # conv weight + bias, conv weight + bias, fc weight + bias, fc weight + bias

The biggest layer is typically between the convolution component and the fc component:

In [None]:
biggest_layer = [p for p in cnn.parameters() if p.numel() == max(p.numel() for p in cnn.parameters())][0]
biggest_layer.shape, cnn.fc_input_dim

This layer dominates the cost of storing the network on disk.

But that doesn't mean it dominates compute costs.

Convolutions are parameter efficient, but those parameters are used many times,
so the total number of computations done by the layer can be higher.

In [None]:
# for a fully-connected layer, multiplications per input == number of weights
cnn.fc1.weight.numel()

In [None]:
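def count_conv_multiplications(kernel_shape, input_size=(64, 28, 28)):
    """Count multiplications per input, assuming stride 1 and no padding."""
    # kernel_shape is (num_kernels, input_channels, kernel_height, kernel_width)
    num_kernels = kernel_shape[0]
    # multiplications for one application of one kernel (already spans the input channels)
    num_kernel_elements = 1
    for dimension in kernel_shape[-3:]:
        num_kernel_elements *= dimension
    # number of positions at which the kernel fits inside the input
    num_spatial_applications = ((input_size[1] - kernel_shape[-2] + 1) * (input_size[2] - kernel_shape[-1] + 1))
    multiplications_per_kernel = num_spatial_applications * num_kernel_elements
    return multiplications_per_kernel * num_kernels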

In [None]:
count_conv_multiplications(cnn.conv2.conv.weight.shape)

In [None]:
count_conv_multiplications(cnn.conv2.conv.weight.shape) // cnn.fc1.weight.numel()

Depending on your compute hardware's characteristics --
are you memory constrained, like when transferring a model "over the wire" to a browser
or are you compute constrained, like when running a model on a low-power edge device? --
either of these components could become the critical bottleneck.

## Training the CNN on EMNIST

We run training through a command-line interface (CLI).
Passing `--help` to it lists the available arguments.

# Creating lines of text from handwritten characters: `EMNISTLines`

The problem we're solving so far is not obviously useful:
the model only classifies individual characters.

To read real text, we would need a separate component to first pull out the characters.

That is not the ethos of deep learning, which operates "end-to-end".

Let's kick the realism up a notch by building lines of text out of our characters.

The result is not yet fully realistic, even for single lines --
characters don't overlap, and they're all the same length.
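
The idea of building a line from character images can be sketched in a few lines of pure Python. This is not the `EMNISTLines` implementation: the function `make_line`, the tiny 2 by 2 "glyphs", and the blank image used for unknown characters are all hypothetical stand-ins.

```python
def make_line(text, glyphs, blank):
    """Concatenate fixed-size character images side by side, row by row."""
    images = [glyphs.get(char, blank) for char in text]
    height = len(blank)
    # join the matching row of every character image into one long row
    return [sum((image[row] for image in images), []) for row in range(height)]


glyphs = {"a": [[1, 0], [1, 1]], "b": [[0, 1], [1, 1]]}
blank = [[0, 0], [0, 0]]  # used for spaces and any character without a glyph

line = make_line("ab a", glyphs, blank)
for row in line:
    print(row)
```

Because every glyph has the same width and is placed edge to edge, the characters never overlap, which is the limitation noted above.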

# Applying CNNs to handwritten text: `LineCNNSimple`

We can read a line of text by repeatedly applying the CNN
to successive windows of the line image.

In [None]:
text_recognizer.models.LineCNNSimple??
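
A minimal sketch of that repeated application, assuming the line image is cut into fixed-width windows taken at a fixed stride. The names `slide_windows` and `classify_window` are hypothetical, and `classify_window` is a trivial stand-in for the CNN rather than the logic of `LineCNNSimple`.

```python
def slide_windows(line_image, window_width, window_stride):
    """Cut a line image (a list of rows) into fixed-width windows."""
    line_width = len(line_image[0])
    starts = range(0, line_width - window_width + 1, window_stride)
    return [[row[start:start + window_width] for row in line_image] for start in starts]


def classify_window(window):
    """Hypothetical stand-in for the CNN: report whether the window contains any ink."""
    return "ink" if any(any(row) for row in window) else "blank"


line_image = [
    [1, 0, 0, 0, 0, 1],
    [1, 1, 0, 0, 1, 1],
]
windows = slide_windows(line_image, window_width=2, window_stride=2)
predictions = [classify_window(window) for window in windows]
print(predictions)  # ['ink', 'blank', 'ink']
```

The output is one prediction per window, so the window width and stride together determine how many predictions a line produces.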