<img src="https://i.imgur.com/gb6B4ig.png" width="400" alt="Weights & Biases" />

# A Convolutional Network for MNIST

## Installing and Importing Libraries

In [None]:
%%capture
!pip install pytorch-lightning==1.3.8 torchviz wandb
!git clone https://github.com/wandb/lit_utils
!cd "/content/lit_utils" && git pull

import math

import pytorch_lightning as pl
import torch
import wandb

import lit_utils as lu

lu.utils.filter_warnings()

In [None]:
wandb.login()

## Defining the `Model`

In [None]:
class LitCNN(lu.nn.modules.LoggedImageClassifierModule):
  """A simple CNN Model, with under-the-hood wandb and pytorch-lightning features (logging, metrics, etc.)."""

  def __init__(self, config):  # make the model
    super().__init__()

    # first, convolutional component
    self.conv_layers = torch.nn.Sequential(*[  # specify our LEGOs. edit this by adding to the list!
      # hidden conv layer
      lu.nn.conv.Convolution2d(
        in_channels=1, kernel_size=config["kernel_size"],
        activation=config["activation"],
        out_channels=config["conv.channels"][0]),
      # hidden conv layer
      lu.nn.conv.Convolution2d(
        in_channels=config["conv.channels"][0], kernel_size=config["kernel_size"],
        activation=config["activation"],
        out_channels=config["conv.channels"][1]),
      # pooling often follows 2 convs
      torch.nn.MaxPool2d(config["pool_size"]),
    ])


    # need a fixed-size input for fully-connected component,
    #  so apply a "re-sizing" layer, to size set in config
    self.resize_layer = torch.nn.AdaptiveAvgPool2d(
      (config["final_height"], config["final_width"]))

    # now, we can apply our fully-connected component
    final_size = config["final_height"] * config["final_width"] * config["conv.channels"][-1]
    self.fc_layers = torch.nn.Sequential(*[ # specify our LEGOs. edit this by adding to the list!
      lu.nn.fc.FullyConnected(
        in_features=final_size, activation=config["activation"],
        out_features=config["fc.size"][0]),
      lu.nn.fc.FullyConnected(
        in_features=config["fc.size"][0], activation=config["activation"],
        out_features=config["fc.size"][1]),
      lu.nn.fc.FullyConnected(
        in_features=config["fc.size"][-1],  # "read-out" layer
        out_features=10),
    ])

    self.loss = config["loss_fn"]
    self.optimizer = config["optimizer"]
    self.optimizer_params = config["optimizer.params"]
    config.update({f"channels_{ii}": channels
                   for ii, channels in enumerate(config["conv.channels"])})

  def forward(self, x):  # produce outputs
    # first apply convolutional layers
    for layer in self.conv_layers: 
      x = layer(x)

    # then convert to a fixed-size vector
    x = self.resize_layer(x)
    x = torch.flatten(x, start_dim=1)

    # then apply the fully-connected layers
    for layer in self.fc_layers: # snap together the LEGOs
      x = layer(x)

    return x
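
The `resize_layer` step is what lets the fully-connected component take a fixed-size input. The sketch below isolates that idea with plain `torch` and made-up feature-map shapes (the batch size, channel count, and spatial sizes are illustrative assumptions, not values produced by `LitCNN`): whatever the incoming height and width, `AdaptiveAvgPool2d` followed by `flatten` yields vectors of the same length.

```python
import torch

# Same kind of layer as `resize_layer`, using the 8x8 target from the config
resize = torch.nn.AdaptiveAvgPool2d((8, 8))

# Two fake feature maps with different spatial sizes: (batch, channels, height, width)
small = torch.randn(2, 4, 11, 11)
large = torch.randn(2, 4, 30, 17)

# After pooling and flattening, both have channels * 8 * 8 features per example
flat_small = torch.flatten(resize(small), start_dim=1)
flat_large = torch.flatten(resize(large), start_dim=1)

print(flat_small.shape, flat_large.shape)
```

This is why `final_size` in `__init__` depends only on `final_height`, `final_width`, and the last entry of `conv.channels`, not on the kernel or pool sizes.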

## Choosing hyperparameters

In [None]:
config = {
  "batch_size": 256,
  "train_size": 1024,  # reducing to a small subset to observe overfitting; set to 50000 for full dataset
  "max_epochs": 15,
  "kernel_size": 7,
  "conv.channels": [256, 512],
  "pool_size": 2,
  "final_height": 8,
  "final_width": 8,
  "fc.size": [4096, 2048],
  "activation": torch.nn.ReLU(),
  "loss_fn": torch.nn.CrossEntropyLoss(),  # cross-entropy loss
  "optimizer": torch.optim.Adam,
  "optimizer.params": {"lr": 0.0001},
}


In [None]:
dmodule = lu.datamodules.MNISTDataModule(batch_size=config["batch_size"])
lcnn = LitCNN(config)
dmodule.prepare_data()
dmodule.setup()
dmodule.training_data = torch.utils.data.Subset(  
  dmodule.training_data, indices=range(config["train_size"]))

### Debugging Code

In [None]:
# for debugging purposes (checking shapes, etc.), make these available
dloader = dmodule.train_dataloader()  # set up the Loader

example_batch = next(iter(dloader))  # grab a batch from the Loader
device = "cuda" if torch.cuda.is_available() else "cpu"  # fall back to CPU if no GPU is available
example_x, example_y = example_batch[0].to(device), example_batch[1].to(device)

print(f"Input Shape: {example_x.shape}")
print(f"Target Shape: {example_y.shape}")

lcnn.to(device)
outputs = lcnn(example_x)  # calling the module runs .forward
print(f"Output Shape: {outputs.shape}")
print(f"Loss: {lcnn.loss(outputs, example_y)}")

### Running `.fit`

In [None]:
with wandb.init(project="lit-cnn", entity="wandb", config=config):
  # 🪵 configure logging
  cbs=[lu.callbacks.WandbCallback(),  # callbacks add extra features, like better logging
       lu.callbacks.FilterLogCallback(image_size=(28, 28), log_input=True),  # this one logs the weights as images
       lu.callbacks.ImagePredLogCallback(labels=dmodule.classes, on_train=True)  # and this one logs the inputs and outputs
       ]
  wandblogger = pl.loggers.WandbLogger(save_code=True)
  if hasattr(lcnn, "_wandb_watch_called") and not lcnn._wandb_watch_called:
    wandblogger.watch(lcnn)  # track gradients

  # 👟 configure Trainer 
  trainer = pl.Trainer(gpus=1,  # use the GPU for .forward
                       logger=wandblogger,  # log to Weights & Biases
                       callbacks=cbs,  # use callbacks to log lots of run data
                       max_epochs=config["max_epochs"], log_every_n_steps=1,
                       progress_bar_refresh_rate=50)

  # 🏃‍♀️ run the Trainer on the model
  trainer.fit(lcnn, datamodule=dmodule)

  # 🧪 test the model on unseen data
  trainer.test(lcnn, datamodule=dmodule)

## Exercises

#### **Exercise**: Compare the validation loss and the training loss. What do you see? Do you notice any overfitting? What can you do to reduce overfitting?
> _Hint:_ look at the `dropout` keyword argument of the `lu.nn.conv.Convolution2d` and `lu.nn.fc.FullyConnected` modules. Where do you think it will be most effective at reducing over-fitting?
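
As a reminder of what dropout does, here is a minimal sketch using plain `torch.nn.Dropout`. It is a stand-in, on the assumption that the `dropout` keyword of the `lu` modules applies something similar; check their source for the exact behavior. In training mode, entries are randomly zeroed and the survivors are rescaled; in evaluation mode, the layer is the identity.

```python
import torch

torch.manual_seed(0)
drop = torch.nn.Dropout(p=0.5)  # zero each entry with probability 0.5
x = torch.ones(1, 8)

drop.train()
train_out = drop(x)  # each entry is either 0 or rescaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)  # dropout is switched off at evaluation time

print(train_out)
print(eval_out)
```

Because dropout is only active during training, the training loss with dropout enabled is not directly comparable to the validation loss.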

#### **Exercise**: Notice your model's parameter count and compare it to the number of datapoints (`config["train_size"]`). Similarly, compare the total size of the network's parameters (`size_mb`) to the total size of the dataset (for a training set size of `1024`, it's about 1 MB). Can you make the parameter count and disk size smaller without reducing performance?
> _Hint:_ try reducing the size of the weight matrix for the fully-connected layer. What are the two ways to control the size of that matrix?
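
To see where the parameters live, it helps to do the arithmetic by hand. The sketch below assumes each `lu.nn.fc.FullyConnected` wraps a standard linear layer with a bias and no other parameters (an assumption, so compare against the count reported in your run); `linear_param_count` is a hypothetical helper written for this illustration.

```python
def linear_param_count(in_features, out_features):
    """Weights plus biases of a plain fully-connected layer."""
    return in_features * out_features + out_features


# Values taken from the config above
final_height, final_width, last_channels = 8, 8, 512
fc_sizes = [4096, 2048]

# Input size of the first fully-connected layer (`final_size` in the model)
in_features = final_height * final_width * last_channels

first_fc = linear_param_count(in_features, fc_sizes[0])
second_fc = linear_param_count(fc_sizes[0], fc_sizes[1])
readout = linear_param_count(fc_sizes[1], 10)

print(f"first FC layer:  {first_fc:,} parameters")
print(f"second FC layer: {second_fc:,} parameters")
print(f"read-out layer:  {readout:,} parameters")
print(f"first FC layer as float32: {first_fc * 4 / 1e6:.0f} MB")
```

Under these assumptions, the first fully-connected layer dominates the count, which is why the hint points there.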

#### **Exercise**: How would you make this network deeper? Add layers to the `conv`olutional component, the `f`ully-`c`onnected component, and both. Try to do so without increasing the parameter count (i.e., reduce the number of channels and the output sizes of the fully-connected layers when you add more layers). Does this impact performance on the training set? What about on the validation and test sets?

#### **Exercise**: After increasing the depth enough, you should start to see training performance degrade, possibly all the way down to chance. Deeper networks are often harder to optimize, but there are fixes. Look into the `batchnorm` argument of the `lu.nn.conv.Convolution2d` and `lu.nn.fc.FullyConnected` modules and the [Batch Norm layer](https://pytorch.org/docs/stable/generated/torch.nn.BatchNorm1d.html). Set `batchnorm` to `post` for a network that's deep enough to show optimization problems. Does this help?
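
For intuition about what the linked layer does, here is a minimal sketch with plain `torch.nn.BatchNorm1d`. It does not show how the `lu` modules wire in their `batchnorm` option, and the batch size and feature count are arbitrary. In training mode, each feature is normalized with the statistics of the current batch, so badly scaled activations come out with mean near 0 and standard deviation near 1.

```python
import torch

torch.manual_seed(0)
bn = torch.nn.BatchNorm1d(num_features=4)  # one mean/variance pair per feature

# A batch whose features are badly scaled and shifted
x = torch.randn(64, 4) * 50.0 + 10.0

bn.train()  # training mode normalizes with the current batch's statistics
out = bn(x)

print(x.mean(dim=0), x.std(dim=0))
print(out.mean(dim=0), out.std(dim=0))  # per-feature mean near 0, std near 1
```

Keeping activations in a consistent range from layer to layer is one reason batch normalization tends to make deep networks easier to optimize.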