# Why Modules

A typical training procedure for a neural net:

0. Define a dataset ($X$ and $Y$)
1. Define the neural network with some learnable weights
2. Iterate over the dataset
3. Pass inputs to the network (forward pass)
4. Compute the loss
5. Compute gradients w.r.t. network's weights (backward pass)
6. Update weights (e.g., weight = weight - lr * gradient)

PyTorch handles steps 1-6 for you via encapsulation, while still giving you the flexibility to change something in between if you want!

## Example: MNIST classifier

The MNIST dataset is composed of images of digits that must be classified with labels from 0 to 9. The inputs are 28x28 matrices containing the grayscale intensity in each pixel.

We will download the MNIST dataset for training a classifier. PyTorch provides a convenient function for that.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torchvision import datasets
import matplotlib.pyplot as plt
torch.manual_seed(0);

# Dataset
It's easy to create your own `Dataset`,
but PyTorch comes with several built-in datasets for [vision](https://pytorch.org/vision/stable/datasets.html), [audio](https://pytorch.org/audio/stable/datasets.html), and [text](https://pytorch.org/text/stable/datasets.html) modalities.

The class `Dataset` gives you information about the number of samples (implement `__len__`) and gives you the sample at a given index (implement `__getitem__`). It's a nice and simple abstraction to work with data. It has the following structure:

```python
class Dataset(object):
    def __getitem__(self, index):
        raise NotImplementedError

    def __len__(self):
        raise NotImplementedError

    def __add__(self, other):
        return ConcatDataset([self, other])
```

For now, let's use MNIST. But feel free to use another `Dataset` as an exercise.
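
For that exercise, a custom dataset only needs the two methods described above. The following is a minimal sketch, not part of the original tutorial: `SquaresDataset` and its contents are hypothetical and only illustrate the `__len__` / `__getitem__` protocol.

```python
import torch
from torch.utils.data import Dataset


class SquaresDataset(Dataset):
    """Hypothetical toy dataset: sample i is the pair (i, i**2)."""

    def __init__(self, n_samples):
        self.x = torch.arange(n_samples, dtype=torch.float32)
        self.y = self.x ** 2

    def __len__(self):
        # number of samples in the dataset
        return len(self.x)

    def __getitem__(self, index):
        # one (input, target) pair
        return self.x[index], self.y[index]


toy_data = SquaresDataset(5)
print(len(toy_data))   # number of samples
print(toy_data[3])     # (input, target) pair at index 3

# `+` goes through Dataset.__add__ and returns a ConcatDataset
combined = toy_data + SquaresDataset(3)
print(len(combined))
```

Any object written this way can be handed to the same PyTorch utilities that consume the built-in datasets.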

In [None]:
from torch.utils.data import Dataset

In [None]:
# download MNIST and store it in "../data"
# torchvision.datasets also handles caching for you, so you don't have to download the dataset twice
train_data = datasets.MNIST('../data', train=True, download=True)
test_data = datasets.MNIST('../data', train=False)

train_x = train_data.data
train_y = train_data.targets
test_x = test_data.data
test_y = test_data.targets

In [None]:
n_train_examples = train_x.shape[0]
n_test_examples = test_x.shape[0]
print('Training instances:', n_train_examples)
print('Test instances:', n_test_examples)

Check the shape of our training data to see how many input features we have:

In [None]:
train_x.shape, train_y.shape

And what the images look like:

In [None]:
C = 8
fig, axs = plt.subplots(3, C, figsize=(12, 4))
for i in range(3):
    for j in range(C):
        axs[i, j].imshow(train_x[i*C + j], cmap='gray')
        axs[i, j].set_axis_off()
print(train_y[:24].reshape(3, C))

### Formatting

Each sample is a 28x28 matrix. But we want to represent them as vectors, since our model (which will be a simple MLP) doesn't take any advantage of the 2D nature of the data.

So, we reshape the data:

In [None]:
num_features = 28 * 28
train_x_vectors = train_x.view(n_train_examples, num_features)
print(train_x_vectors.shape)

When we reshape an array (or torch tensor, for that matter), we don't need to specify all dimensions. We can leave one as -1, and it will be automatically determined from the size of the data. This is useful when we don't know a priori the shape of some array.

In [None]:
train_x_vectors = train_x.view(n_train_examples, -1)
test_x_vectors = test_x.view(n_test_examples, -1)

print(train_x_vectors.shape, test_x_vectors.shape)

Also, the values are integers in the range $[0, 255]$. It is better to work with float values in a smaller interval, such as $[0, 1]$ or $[-1, 1]$. There are some more elaborate normalization techniques, but for now let's just normalize the data into $[0, 1]$.

In [None]:
train_x_norm = train_x_vectors / 255.0
test_x_norm = test_x_vectors / 255.0
print(train_x_norm.max(), train_x_norm.min(), train_x_norm.mean(), train_x_norm.std())
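
For comparison, here is a small sketch of the other options mentioned above: rescaling to $[-1, 1]$ and standardizing to zero mean and unit variance. The `pixels` tensor is a made-up stand-in for flattened images, not MNIST data.

```python
import torch

# Stand-in for flattened images with integer pixel values in [0, 255]
pixels = torch.tensor([[0, 128, 255], [64, 32, 16]], dtype=torch.uint8)

unit = pixels.float() / 255.0    # values in [0, 1]
centered = unit * 2.0 - 1.0      # values in [-1, 1]

# Standardization: subtract the mean and divide by the standard deviation
standardized = (unit - unit.mean()) / unit.std()

print(unit.min().item(), unit.max().item())
print(centered.min().item(), centered.max().item())
print(standardized.mean().item(), standardized.std().item())
```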

Now, let's check all the available labels:

In [None]:
print(torch.unique(train_y))
num_classes = len(torch.unique(train_y))
print('Num classes:', num_classes)

# Modules and MLPs

We've seen how the internals of a simple linear classifier work. However, we still had to set a lot of things manually. It's much better to have a higher-level API that encapsulates the classifier.

We are going to see that now, with PyTorch `Module` objects. They will then allow us to build more complex models, like a multilayer perceptron.

We begin by loading, reshaping and normalizing the data again (this time more concisely):

In [None]:
from torchvision.transforms import ToTensor

train_dataset = datasets.MNIST('../data', train=True, download=True, transform=ToTensor())
test_dataset = datasets.MNIST('../data', train=False, transform=ToTensor())

train_x = train_dataset.data
train_y = train_dataset.targets
test_x = test_dataset.data
test_y = test_dataset.targets

num_features = 28 * 28
num_classes = len(torch.unique(train_y))
new_shape = [-1, num_features]
train_x_vectors = train_x.reshape(new_shape)
test_x_vectors = test_x.reshape(new_shape)

# shorten the names
train_x = train_x_vectors.float() / 255
test_x = test_x_vectors.float() / 255

## Using Modules

PyTorch provides some basic building blocks for neural nets under the `torch.nn` module. Here you can check the complete list of available blocks: https://pytorch.org/docs/stable/nn.html

For now, let's recreate a simple linear model using `nn.Linear` (see [doc](https://pytorch.org/docs/stable/generated/torch.nn.Linear.html#torch.nn.Linear)).

In [None]:
class LinearModel(nn.Module):
    def __init__(self, n_features, n_classes):
        super().__init__()
        self.linear_layer = nn.Linear(n_features, n_classes)
        
    def forward(self, X):
        # This is the same as doing:
        # return X @ self.linear_layer.weight.t() + self.linear_layer.bias
        # where weight and bias are instances of nn.Parameter
        return self.linear_layer(X)

linear_model = LinearModel(num_features, num_classes)

As before, the model can be called as a function in order to produce an output:

In [None]:
batch = train_x[:2]
outputs = linear_model(batch)
outputs

This is the same as computing the forward pass by hand: $$w^T x + b$$

In [None]:
batch @ linear_model.linear_layer.weight.t() + linear_model.linear_layer.bias
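
The weight and bias used above are registered automatically as parameters of the module. The sketch below shows how to list them; `TinyModel` and its sizes are hypothetical and only serve as an illustration.

```python
import torch.nn as nn


class TinyModel(nn.Module):  # hypothetical model, for illustration only
    def __init__(self):
        super().__init__()
        self.linear_layer = nn.Linear(4, 3)

    def forward(self, X):
        return self.linear_layer(X)


tiny = TinyModel()

# Every nn.Parameter assigned to the module (or to a submodule) is listed here
names = []
for name, param in tiny.named_parameters():
    names.append(name)
    print(name, tuple(param.shape), isinstance(param, nn.Parameter))

# 4 * 3 weights + 3 biases
n_params = sum(p.numel() for p in tiny.parameters())
print(n_params)
```

This is the same collection that `parameters()` hands to an optimizer.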

Now that we defined our model, we just have to: 
- define an iterator
- define and compute the loss
- compute gradients
- define the strategy to update the parameters of our model
- glue previous steps to form the training loop!

#### Batching

Batching can be boring to code. PyTorch provides the `DataLoader` class to help us! Dealing with data is one of the most important yet most time-consuming tasks. Take a look at the PyTorch `data` submodule to [learn more](https://pytorch.org/docs/stable/data.html).

In general, we just have to pass a torch `Dataset` object as input to the dataloader, and then set some hyperparams for the iterator: 

In [None]:
from torch.utils.data import DataLoader
print(type(train_dataset))

train_dataloader = DataLoader(train_dataset, batch_size=64, shuffle=True)
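
To see what a dataloader yields without downloading anything, here is a sketch on synthetic data. The tensors below are random stand-ins shaped like MNIST samples, and the batch size is an arbitrary choice.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

# Synthetic stand-in: 10 "images" of shape (1, 28, 28) with integer labels
images = torch.rand(10, 1, 28, 28)
labels = torch.randint(0, 10, (10,))
toy_dataset = TensorDataset(images, labels)

toy_loader = DataLoader(toy_dataset, batch_size=4, shuffle=True)

print(len(toy_loader))  # number of batches: ceil(10 / 4)

batch_sizes = []
for batch_x, batch_y in toy_loader:
    # each iteration yields one batch of inputs and one batch of targets
    batch_sizes.append(batch_x.shape[0])
    print(batch_x.shape, batch_y.shape)
```

Note that the last batch is smaller when the dataset size is not a multiple of the batch size.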

#### Loss

Here is the complete list of available [loss functions](https://pytorch.org/docs/stable/nn.html#loss-functions).
If the provided loss functions don't satisfy your constraints, it is easy to define your own loss function: just use torch operations (and be careful with differentiability issues). For example:

In [None]:
with torch.no_grad():  # disable gradient-tracking
    
    dummy_loss = nn.CrossEntropyLoss()
    
    # try other losses!
    # multi-class classification hinge loss (margin-based loss):
    # dummy_loss = nn.MultiMarginLoss()  
    batch = train_x[:2]
    targets = train_y[:2]
    predictions = linear_model(batch)
    
    print(predictions.shape, targets.shape)
    print(dummy_loss(predictions, targets))

We can also write our own function from the definition of the cross-entropy loss:

$$
CE(p,y) = - \log\frac{\exp(p_y)}{\sum_c \exp(p_c)}
$$

In [None]:
def dummy_loss(y_pred, y):
    one_hot = y.unsqueeze(1) == torch.arange(num_classes).unsqueeze(0)
    res = - torch.log(torch.exp(y_pred) / torch.exp(y_pred).sum(-1).unsqueeze(-1))[one_hot]
    return res.mean()  # average per sample

print(dummy_loss(predictions, targets))
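
One caveat of the hand-written version is that exponentiating large logits can overflow. The following sketch computes the same quantity through `torch.logsumexp` and compares it against `torch.nn.functional.cross_entropy`; the function name `stable_cross_entropy` and the example logits are made up for illustration.

```python
import torch
import torch.nn.functional as F


def stable_cross_entropy(logits, targets):
    # log-softmax as logits - logsumexp(logits), which avoids overflow in exp
    log_probs = logits - torch.logsumexp(logits, dim=-1, keepdim=True)
    # pick the log-probability of the correct class for each sample
    picked = log_probs.gather(1, targets.unsqueeze(1)).squeeze(1)
    return -picked.mean()  # average per sample


# The first row has logits large enough to overflow a plain exp()
logits = torch.tensor([[1000.0, 0.0, -1000.0], [0.5, 2.0, -1.0]])
targets = torch.tensor([0, 1])

ours = stable_cross_entropy(logits, targets)
reference = F.cross_entropy(logits, targets)
print(ours, reference)
```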

We will use `nn.CrossEntropyLoss` as our loss function.

In [None]:
loss_function = nn.CrossEntropyLoss()

#### Optimizer

The optimizer is the object which handles the update of the model's parameters. In the previous exercise, we were using the famous "delta" rule to update our weights:

$$\mathbf{w}_t = \mathbf{w}_{t-1} - \alpha \frac{\partial L}{\partial \mathbf{w}}.$$

But there are more elaborate ways of updating our parameters:

<!-- <img src="http://cs231n.github.io/assets/nn3/opt2.gif" width="45%" /> -->

<img src="http://cs231n.github.io/assets/nn3/opt1.gif" width="45%" />


PyTorch provides an extensive list of optimizers: https://pytorch.org/docs/stable/optim.html. Notice that, as with everything else, it should be easy to define your own optimization procedure.

We will use the simple yet powerful SGD optimizer. The optimizer needs to be told which parameters to optimize.

In [None]:
parameters = linear_model.parameters()  # we will optimize all model's parameters!
optimizer = torch.optim.SGD(parameters, lr=0.1)
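
To connect the update rule above with what the optimizer does, here is a sketch that applies one step by hand and one step through `torch.optim.SGD` on a toy parameter. The loss $L = \sum_i w_i^2$ and the starting values are arbitrary choices for illustration.

```python
import torch

# Two copies of the same parameter, updated in two different ways
w_manual = torch.tensor([1.0, -2.0], requires_grad=True)
w_optim = torch.tensor([1.0, -2.0], requires_grad=True)
lr = 0.1

# Manual update: w <- w - lr * dL/dw, with L = sum(w^2)
loss = (w_manual ** 2).sum()
loss.backward()
with torch.no_grad():
    w_manual -= lr * w_manual.grad

# The same update through the optimizer
sgd = torch.optim.SGD([w_optim], lr=lr)
sgd.zero_grad()
(w_optim ** 2).sum().backward()
sgd.step()

print(w_manual, w_optim)
```

Both tensors end up with the same values, since plain SGD without momentum is exactly this rule.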

#### Training loop

Now we write the main training loop. This is the basic skeleton for training PyTorch models.

In [None]:
def train_model(model, dataloader, optimizer, loss_function, num_epochs=1):
    # Tell PyTorch that we are in training mode.
    # This is useful for mechanisms that work differently during training and test time, like Dropout. 
    model.train()
    
    losses = []
    for epoch in range(1, num_epochs+1):
        print('Starting epoch %d' % epoch)
        total_loss = 0
        hits = 0

        for batch_x, batch_y in dataloader:
            # check shapes with:
            # import ipdb; ipdb.set_trace()
            # batch_x.shape is (batch_size, 1, 28, 28)
            # batch_y.shape is (batch_size, )
            
            # Step 1. Remember that PyTorch accumulates gradients.
            # We need to clear them out before each step
            optimizer.zero_grad()
            
            # Step 2. Preprocess the data
            # (batch_size, 1, 28, 28) -> (batch_size, 784 = 28 * 28)
            batch_x = batch_x.reshape(batch_x.shape[0], -1)
            # ToTensor() already converted the pixels to floats in [0, 1],
            # so we must not divide by 255 again
            batch_x = batch_x.to(torch.float)

            # Step 3. Run forward pass.
            logits = model(batch_x)

            # Step 4. Compute loss
            loss = loss_function(logits, batch_y)
            
            # Step 5. Compute gradients
            loss.backward()
            
            # Step 6. After determining the gradients, take a step in their negative direction
            optimizer.step()
            
            # Optional. Save statistics of your training
            loss_value = loss.item()
            total_loss += loss_value
            losses.append(loss_value)
            y_pred = logits.argmax(dim=1)
            hits += torch.sum(y_pred == batch_y).item()
        
        # loss.item() is already the mean over a batch, so we average over batches
        avg_loss = total_loss / len(dataloader)
        print('Epoch loss: %.4f' % avg_loss)
        acc = hits / len(dataloader.dataset)
        print('Epoch accuracy: %.4f' % acc)
    
    print('Done!')
    return losses

In [None]:
linear_losses = train_model(linear_model, train_dataloader, optimizer, loss_function, num_epochs=10)
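
Step 1 of the loop clears the gradients because `backward()` adds to whatever is already stored in `.grad`. The sketch below shows this on a single scalar parameter; the values are arbitrary.

```python
import torch

w = torch.tensor(3.0, requires_grad=True)

# First backward pass: d(w^2)/dw = 2 * w = 6
(w ** 2).backward()
first = w.grad.item()

# Second backward pass without clearing: the new gradient is added to the old one
(w ** 2).backward()
accumulated = w.grad.item()

# Resetting the gradient by hand, similar in effect to optimizer.zero_grad()
w.grad = None
(w ** 2).backward()
fresh = w.grad.item()

print(first, accumulated, fresh)
```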

Plots are a good way to understand the performance of a model. Let's plot the loss curve by training step:

In [None]:
fig, ax = plt.subplots()
ax.plot(linear_losses, "-")
ax.set_xlabel('Step')
ax.set_ylabel('Loss');

What can you conclude from this?

## Multilayer Perceptron

We can now proceed to a more sophisticated classifier: a multilayer perceptron. Let's build one using the Sequential API.

In [None]:
class MLP(nn.Module):
    def __init__(self, n_features, hidden_size, n_classes):
        super().__init__()
        linear_layer1 = nn.Linear(n_features, hidden_size)
        linear_layer2 = nn.Linear(hidden_size, hidden_size)
        linear_layer3 = nn.Linear(hidden_size, n_classes)
        self.feedforward = nn.Sequential(
            linear_layer1, 
            nn.Tanh(), 
            linear_layer2, 
            nn.Tanh(),
            linear_layer3
        )

    def forward(self, X):
        return self.feedforward(X)

hidden_size = 200
mlp = MLP(num_features, hidden_size, num_classes)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(mlp.parameters(), lr=0.1)

Now let's train the model.

In [None]:
mlp_losses = train_model(mlp, train_dataloader, optimizer, loss_function, num_epochs=5)

How do the loss and accuracy compare with the linear model?

You probably also noticed a difference in running time!

In [None]:
fig, ax = plt.subplots()
ax.plot(linear_losses, ".", label="linear")
ax.plot(mlp_losses, ".", label="mlp")
ax.legend()

Note how differently the dots are concentrated in the MLP and linear curves!

### Validation data

Evaluating performance on the training data tells us whether the model is learning at all, but to know whether the model is actually useful we should evaluate it on validation or test data.



In [None]:
def evaluate_model(model, test_x, test_y):
    # Tell PyTorch that we are in evaluation mode.
    model.eval()

    with torch.no_grad():
        loss_function = torch.nn.CrossEntropyLoss()
        logits = model(test_x)
        loss = loss_function(logits, test_y)

        y_pred = logits.argmax(dim=1)
        hits = torch.sum(y_pred == test_y).item()
    
    # CrossEntropyLoss already averages over the samples, so only the hits are divided
    return loss.item(), hits / len(test_x)

In [None]:
evaluate_model(mlp, train_x, train_y)

In [None]:
evaluate_model(mlp, test_x, test_y)

In [None]:
evaluate_model(linear_model, train_x, train_y)

In [None]:
evaluate_model(linear_model, test_x, test_y)

How can we make our model better? There are three things we can try:

1. **Hyperparameter search**. Do a grid search or random search on the hyperparameters (hidden size, learning rate, batch size, activation function, type of optimizer, ...)
2. **Generalize better**. This includes either finding a better feature representation or regularizing, i.e., adding some kind of penalty to the model weights that encourages it to find a more general solution. Examples: L2-norm weight regularization, dropout.
3. **Early stopping**. Evaluate the model on validation data after each epoch or some number of batches, and only save it when validation performance improves. This amounts to detecting when the model reached its performance peak.
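
As an illustration of the third item, here is a rough sketch of an early-stopping loop. Everything in it is an assumption made for the example: the random stand-in data, the small model, one full-batch update per epoch, and the `patience` value. It is not the procedure used elsewhere in this notebook.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# Synthetic stand-in data (assumption): 2 classes, 20 features
toy_train_x, toy_train_y = torch.randn(200, 20), torch.randint(0, 2, (200,))
toy_val_x, toy_val_y = torch.randn(100, 20), torch.randint(0, 2, (100,))

toy_model = nn.Sequential(nn.Linear(20, 50), nn.Tanh(), nn.Linear(50, 2))
toy_loss = nn.CrossEntropyLoss()
toy_optimizer = torch.optim.SGD(toy_model.parameters(), lr=0.1)

best_val_loss = float('inf')
best_state = None
patience, bad_epochs = 3, 0  # hypothetical settings

for epoch in range(50):
    # one full-batch training step per "epoch" to keep the sketch short
    toy_model.train()
    toy_optimizer.zero_grad()
    toy_loss(toy_model(toy_train_x), toy_train_y).backward()
    toy_optimizer.step()

    # evaluate on validation data
    toy_model.eval()
    with torch.no_grad():
        val_loss = toy_loss(toy_model(toy_val_x), toy_val_y).item()

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        # copy the weights: state_dict() returns references to the live tensors
        best_state = copy.deepcopy(toy_model.state_dict())
        bad_epochs = 0
    else:
        bad_epochs += 1
        if bad_epochs >= patience:
            break  # validation loss stopped improving

# restore the best weights seen during training
toy_model.load_state_dict(best_state)
toy_model.eval()
with torch.no_grad():
    final_val_loss = toy_loss(toy_model(toy_val_x), toy_val_y).item()
print(best_val_loss, final_val_loss)
```

The `deepcopy` matters here: without it, the stored state would keep changing as training continues.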

#### Dropout

We could try dropout. It effectively deactivates some units at random during training, forcing the network to avoid depending on specific inputs.

In [None]:
class MLPDropout(nn.Module):
    def __init__(self, n_features, hidden_size, n_classes, p_dropout):
        super().__init__()
        linear_layer1 = nn.Linear(n_features, hidden_size)
        linear_layer2 = nn.Linear(hidden_size, n_classes)
        self.feedforward = nn.Sequential(
            linear_layer1,
            nn.Tanh(),
            nn.Dropout(p_dropout),
            linear_layer2
        )

    def forward(self, X):
        return self.feedforward(X)

hidden_size = 200
p_dropout = 0.5
mlp_dropout = MLPDropout(num_features, hidden_size, num_classes, p_dropout)
loss_function = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(mlp_dropout.parameters(), lr=0.1)

In [None]:
losses = train_model(mlp_dropout, train_dataloader, optimizer, loss_function, num_epochs=5)

Training loss is a bit worse, as expected. After all, we are obstructing some connections.
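
Dropout is only active in training mode, which is why `train_model` calls `model.train()` and `evaluate_model` calls `model.eval()`. A small sketch of the difference on a vector of ones (the vector size and seed are arbitrary):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(1000)

dropout.train()          # training mode: elements are zeroed at random
train_out = dropout(x)   # survivors are scaled by 1 / (1 - p) = 2

dropout.eval()           # evaluation mode: dropout is the identity
eval_out = dropout(x)

dropped_fraction = (train_out == 0).float().mean().item()
print(train_out.unique())        # only the values 0 and 2 appear
print(dropped_fraction)          # close to p
print(torch.equal(eval_out, x))
```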

Now let's check validation performance:

In [None]:
evaluate_model(mlp, test_x, test_y)

In [None]:
evaluate_model(mlp_dropout, test_x, test_y)

No improvement. Ideally, we should retrain our model with different hyperparameters (learning rates, layer sizes, number of layers, dropout rate) as well as some changes in the structure (different optimizers, activation functions, losses). However, data representation plays a key role.

<br>
<center>
<i>Do you think representing the input as independent pixels is a good idea for recognizing digits?</i>
</center>

### Saving

Persisting the model after training is important so that we can reuse it later. In PyTorch, we can save the model by calling `torch.save()` and passing the model's `state_dict` (a Python dict that maps each parameter name to its tensor).

In [None]:
torch.save(mlp.state_dict(), 'mlp.model')

Later, recreate the model and load the saved parameters.

In [None]:
mlp2 = MLP(num_features, hidden_size, num_classes)
mlp2.load_state_dict(torch.load('mlp.model'))

Let's check the performance to see if it's the same!

In [None]:
evaluate_model(mlp2, test_x, test_y)
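
The same round trip can be checked directly on the outputs. The sketch below uses a small stand-in model and a temporary file; the architecture and the file name `toy.model` are made up for the example.

```python
import os
import tempfile
import torch
import torch.nn as nn

model_a = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))

# The state_dict maps parameter names to tensors
keys = list(model_a.state_dict().keys())
for name, tensor in model_a.state_dict().items():
    print(name, tuple(tensor.shape))

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'toy.model')  # hypothetical file name
    torch.save(model_a.state_dict(), path)

    # A freshly built model has its own random weights until we load the saved ones
    model_b = nn.Sequential(nn.Linear(4, 8), nn.Tanh(), nn.Linear(8, 3))
    model_b.load_state_dict(torch.load(path))

x = torch.randn(5, 4)
same_outputs = torch.equal(model_a(x), model_b(x))
print(same_outputs)
```

Note that only the parameters are stored, so the code that builds the architecture must still be available when loading.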

# The End

![https://twitter.com/karpathy/status/1013244313327681536](img/common_mistakes.png)
https://twitter.com/karpathy/status/1013244313327681536

### Exercises

- Try running the MLP example for more epochs
- Try using CNNs: https://pytorch.org/docs/stable/generated/torch.nn.Conv2d.html