# Neural Network Training from a Software Engineer's Perspective

In [3]:
from fastai.vision import *
from fastai import *
import gzip, pickle
import matplotlib.pyplot as plt
from time import sleep
import torch
from torch import nn, tensor
import torch.nn.functional as F

## 0. Prepare data and model

In [185]:
MNIST_URL='http://deeplearning.net/data/mnist/mnist.pkl'

def get_data():
    path = datasets.download_data(MNIST_URL, ext='.gz')
    with gzip.open(path, 'rb') as f:
        ((x_train, y_train), (x_valid, y_valid), _) = pickle.load(f, encoding='latin-1')
    return map(tensor, (x_train, y_train, x_valid, y_valid))

def normalize(x, m, s): return (x-m)/s

In [186]:
x_train, y_train, x_valid, y_valid = get_data()

In [187]:
train_mean, train_std = x_train.mean(), x_train.std()
x_train = normalize(x_train, train_mean, train_std)
x_valid = normalize(x_valid, train_mean, train_std)

In [188]:
# x_train = x_train.view(-1, 1, 28, 28)
# x_valid = x_valid.view(-1, 1, 28, 28)
x_train.shape, x_valid.shape

(torch.Size([50000, 784]), torch.Size([10000, 784]))

In [189]:
n, m = x_train.shape
c = y_train.max()+1
nh = 50

n, m, c, nh

(50000, 784, tensor(10), 50)

In [190]:
class SimpleModel(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.layers = [nn.Linear(n_in, nh), nn.ReLU(), nn.Linear(nh, n_out)]
    
    def __call__(self, x):
        for layer in self.layers: x = layer(x)
        return x

In [191]:
simple_model = SimpleModel(m, nh, 10)

In [192]:
pred = simple_model(x_train)

In [193]:
pred[0]

tensor([ 0.2419, -0.3725,  0.0997,  0.1837,  0.1618,  0.3392,  0.2548,  0.5938,
        -0.2189,  0.1613], grad_fn=<SelectBackward>)

## 1. Basic Training Loop

The training loop repeats the following steps:

<img src="./assets/basic_training_loop.png" alt="basic_training_loop" width="300" height="300">

1. get the output of the model on a batch of inputs.
2. compare the output to the labels we have and compute a loss.
3. calculate the gradients of the loss w.r.t. every parameter of the model.
4. update said parameters with those gradients to make them a little bit better.
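
The four steps above can be sketched as a single function. This is only a minimal illustration: the random batch, the layer sizes, the learning rate, and the name `train_step` are assumptions made for the example, not part of the notebook.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical stand-ins for one batch of MNIST-like inputs and labels.
xb = torch.randn(64, 784)
yb = torch.randint(0, 10, (64,))

model = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
lr = 0.1

def train_step(model, xb, yb, lr):
    pred = model(xb)                   # 1. output of the model on a batch
    loss = F.cross_entropy(pred, yb)   # 2. compare the output to the labels
    loss.backward()                    # 3. gradients w.r.t. every parameter
    with torch.no_grad():              # 4. update the parameters
        for p in model.parameters():
            p -= lr * p.grad
            p.grad.zero_()
    return loss.item()

loss_before = train_step(model, xb, yb, lr)
loss_after = F.cross_entropy(model(xb), yb).item()
```

Evaluating the loss on the same batch after the step should give a smaller value than before it, which is all a single update is meant to achieve.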

In [194]:
loss_func = F.cross_entropy

In [195]:
def accuracy(out, yb):
    return (torch.argmax(out, dim=1)==yb).float().mean()

In [196]:
x_train,y_train,x_valid,y_valid = get_data()

In [197]:
bs=64 # batch size
xb = x_train[0:bs]
preds = model(xb)
preds[0], preds.shape

(tensor([ -0.4276,   0.5412,   2.0155,  11.3686, -17.0490,  16.3552,  -7.2134,
          -4.1756,  -6.5640,   2.0268], grad_fn=<SelectBackward>),
 torch.Size([64, 10]))

In [198]:
yb = y_train[0:bs]
loss = loss_func(preds, yb)
loss

tensor(0.0507, grad_fn=<NllLossBackward>)

In [199]:
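# An untrained model would score about 0.1 here (chance level for 10 classes).
# The values shown in these outputs appear to come from a `model` that was already trained in an earlier run of this notebook.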
accuracy(preds, yb)

tensor(1.)

In [200]:
lr = 0.5
epochs = 1

### Naive and simple procedure for training

In [201]:
def fit(epochs, model, loss_func, x_train, y_train):
    for epoch in range(epochs):
        for i in range((n-1)//bs +1):
            start_i = i*bs
            end_i = start_i + bs
            xb = x_train[start_i:end_i]
            yb = y_train[start_i:end_i]
            loss = loss_func(model(xb), yb)

            loss.backward()
            with torch.no_grad():
                for l in model.layers:
                    if hasattr(l, 'weight'):
                        l.weight -= l.weight.grad * lr
                        l.bias -= l.bias.grad * lr
                        l.weight.grad.zero_()
                        l.bias.grad.zero_()

In [202]:
model = SimpleModel(m, nh, 10)
fit(1, model, loss_func, x_train, y_train)
loss_func(model(xb), yb), accuracy(model(xb), yb)

(tensor(0.1463, grad_fn=<NllLossBackward>), tensor(0.9375))

**torch.no_grad()**

> We need to use `torch.no_grad()` when we are updating the weights because we don't want the update itself to be tracked by autograd and affect the gradients. Inside this block we go through all our layers one by one and update the weights. `hasattr(l,'weight')` checks whether the layer has a weight parameter.
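
A small sketch of what `torch.no_grad()` changes, using a toy tensor `w` that is not part of the notebook: without the context manager, autograd rejects an in-place update of a leaf tensor that requires gradients; inside it, the update is applied and not recorded.

```python
import torch

w = torch.ones(3, requires_grad=True)
(w * 2).sum().backward()        # w.grad is now tensor([2., 2., 2.])

# Without no_grad, the in-place update of a leaf tensor raises an error.
try:
    w -= 0.1 * w.grad
    raised = False
except RuntimeError:
    raised = True

# Inside no_grad, the update is applied and is not tracked by autograd.
with torch.no_grad():
    w -= 0.1 * w.grad

# w is now 0.8 everywhere and still requires gradients for the next step.
```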

The following segment of the naive training procedure is awkward: we have to go through all layers of the model, check whether each layer has a `weight` attribute, and then update both the weight and the bias. We refactor this segment in the next section to make the code more coherent.

```python
with torch.no_grad():
    for l in model.layers:
        if hasattr(l, 'weight'):
            l.weight -= l.weight.grad * lr
            l.bias -= l.bias.grad * lr
            l.weight.grad.zero_()
            l.bias.grad.zero_()
```

## 2. Refactor parameters

PyTorch registers the submodules you assign to a model, and therefore their parameters, through the `__setattr__` method. The `DummyModule` class below registers submodules in a similar way.

`__setattr__` is called every time something is assigned to an attribute of `self`. Our `__setattr__` first checks that the attribute name does not start with an underscore. A leading underscore means the attribute should be treated as private, and skipping such names also keeps `_modules` itself from being registered.

In [203]:
class DummyModule():
    def __init__(self, n_in, nh, n_out):
        self._modules={}
        self.l1 = nn.Linear(n_in, nh)
        self.l2 = nn.Linear(nh, n_out)
    
    def __setattr__(self, k, v):
        if not k.startswith("_"): self._modules[k]=v
        super().__setattr__(k, v)
        
    def __repr__(self): return f'{self._modules}'
    
    def parameters(self):
        # collect all parameters of submodules
        for l in self._modules.values():
            for p in l.parameters(): yield p

In [204]:
mdl = DummyModule(m, nh, 10)
mdl

{'l1': Linear(in_features=784, out_features=50, bias=True), 'l2': Linear(in_features=50, out_features=10, bias=True)}

In [205]:
[o.shape for o in mdl.parameters()]

[torch.Size([50, 784]),
 torch.Size([50]),
 torch.Size([10, 50]),
 torch.Size([10])]

#### \_\_setattr\_\_

> Behind the scenes, PyTorch overrides the `__setattr__` function in `nn.Module` so that the submodules you define are properly registered, which makes their parameters part of the model.

Notice that `super().__init__()` is important: it runs the parent class's `__init__`, which creates the dictionaries that the registration relies on.
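
As a hedged illustration of both points, the sketch below uses two hypothetical classes, `Registered` and `MissingSuper`, with arbitrary layer sizes. Assigning a layer to an attribute registers it, while doing so before `nn.Module.__init__` has run fails.

```python
import torch
from torch import nn

class Registered(nn.Module):
    def __init__(self):
        super().__init__()           # creates the internal registries
        self.l1 = nn.Linear(4, 3)    # goes through nn.Module.__setattr__
        self.l2 = nn.Linear(3, 2)

class MissingSuper(nn.Module):
    def __init__(self):
        self.l1 = nn.Linear(4, 3)    # the registries do not exist yet

registered = Registered()
child_names = [name for name, _ in registered.named_children()]
n_params = len(list(registered.parameters()))   # 2 weights + 2 biases

try:
    MissingSuper()
    failed = False
except AttributeError:
    failed = True
```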

In [206]:
class Model(nn.Module):
    def __init__(self, n_in, nh, n_out):
        super().__init__()
        self.l1 = nn.Linear(n_in, nh)
        self.l2 = nn.Linear(nh, n_out)
    
    def __call__(self, x):
        x = self.l2(F.relu(self.l1(x)))
        return x

In [207]:
model = Model(m, nh, 10)
model

Model(
  (l1): Linear(in_features=784, out_features=50, bias=True)
  (l2): Linear(in_features=50, out_features=10, bias=True)
)

In [208]:
for name, l in model.named_children(): print(f"{name}:{l}")

l1:Linear(in_features=784, out_features=50, bias=True)
l2:Linear(in_features=50, out_features=10, bias=True)


We refactor the naive training procedure as follows:

In [211]:
def fit(epochs, model, loss_func, x_train, y_train):
    for epoch in range(epochs):
        for i in range((n-1)//bs +1):
            start_i = i*bs
            end_i = start_i + bs
            xb = x_train[start_i:end_i]
            yb = y_train[start_i:end_i]
#             print(xb.shape, yb.shape)
            loss = loss_func(model(xb), yb)

            loss.backward()
            with torch.no_grad():
                for p in model.parameters():
                    p -= p.grad*lr
                model.zero_grad()

In [212]:
model = Model(m, nh, 10)
fit(1, model, loss_func, x_train, y_train)
loss_func(model(xb), yb), accuracy(model(xb), yb)

(tensor(0.1287, grad_fn=<NllLossBackward>), tensor(0.9688))

## 3. Refactor registering modules

Instead of explicitly registering modules one by one, we can register a whole list of modules at once:

In [213]:
class Model(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = layers
        for i,l in enumerate(self.layers): 
            # Using add_module func of nn.Module class
            self.add_module(f'layer_{i}', l)
        
    def __call__(self, x):
        for l in self.layers: x = l(x)
        return x

In [214]:
layers = [nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, 10)]
model = Model(layers)
model

Model(
  (layer_0): Linear(in_features=784, out_features=50, bias=True)
  (layer_1): ReLU()
  (layer_2): Linear(in_features=50, out_features=10, bias=True)
)

Instead of looping through the layer list and calling the `add_module` method of `nn.Module` on each layer, we can simply use `nn.ModuleList`, which does the registration for us.
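
To see why the registration matters, the following sketch compares a plain Python list with `nn.ModuleList`. The class names `PlainList` and `WithModuleList` and the helper `make_layers` are made up for this example.

```python
import torch
from torch import nn

def make_layers():
    return [nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10)]

class PlainList(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = layers                  # plain list: nothing is registered

class WithModuleList(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)   # every layer is registered

n_plain = len(list(PlainList(make_layers()).parameters()))
n_registered = len(list(WithModuleList(make_layers()).parameters()))
# n_plain is 0, n_registered is 4 (two weights and two biases)
```

With the plain list, `model.parameters()` yields nothing, so an update loop over `model.parameters()` would silently train nothing.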

In [215]:
class SequentialModel(nn.Module):
    def __init__(self, layers):
        super().__init__()
        self.layers = nn.ModuleList(layers)
    
    def __call__(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

In [216]:
model = SequentialModel(layers)
model

SequentialModel(
  (layers): ModuleList(
    (0): Linear(in_features=784, out_features=50, bias=True)
    (1): ReLU()
    (2): Linear(in_features=50, out_features=10, bias=True)
  )
)

### nn.Sequential

`nn.Sequential` is a convenient class which does the same as the above:

In [217]:
model = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, 10))

In [219]:
fit(1, model, loss_func, x_train, y_train)
loss_func(model(xb), yb), accuracy(model(xb), yb)

(tensor(0.1912, grad_fn=<NllLossBackward>), tensor(0.9375))

In [36]:
nn.Sequential??

If you look at the `forward` function of `nn.Sequential`, it does basically the same thing as the `__call__` method of our `SequentialModel`:

```python
def forward(self, input):
    for module in self:
        input = module(input)
    return input
```
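
A quick check of this equivalence, sketched with a random input and the same layer sizes used in this notebook (the variable names are illustrative): iterating over the modules by hand gives the same output as calling the `nn.Sequential` object.

```python
import torch
from torch import nn

torch.manual_seed(0)
seq = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
x = torch.randn(8, 784)

# Apply the modules one after another, as the forward loop does.
out = x
for layer in seq:          # nn.Sequential can be iterated over its modules
    out = layer(out)

same = torch.allclose(out, seq(x))
```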

## 4. Refactor optimizer

Previously, we manually coded the optimization step:

```python
with torch.no_grad():
    for p in model.parameters(): p -= p.grad * lr
    model.zero_grad()
```

To make the code more loosely coupled, so that different optimizers can be used without changing the training code, we encapsulate the optimization step in an optimizer class and perform it as follows:
```python
opt.step()
opt.zero_grad()
```

In [220]:
class Optimizer():
    def __init__(self, params, learning_rate=0.5):
        self.params = list(params)
        self.learning_rate = learning_rate
        
    def step(self):
        with torch.no_grad():
            for p in self.params:
                p -= self.learning_rate * p.grad
    
    def zero_grad(self):
        for p in self.params: p.grad.data.zero_()

In [226]:
def fit(epochs, model, loss_func, opt, x_train, y_train):
    for epoch in range(epochs):
        for i in range((n-1)//bs + 1):
            start_i = i*bs
            end_i = start_i+bs
            xb = x_train[start_i:end_i]
            yb = y_train[start_i:end_i]
            pred = model(xb)
            loss = loss_func(pred, yb)

            loss.backward()
            opt.step()
            opt.zero_grad()

In [232]:
model = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, 10))
opt = Optimizer(model.parameters())
fit(1, model, loss_func, opt, x_train, y_train)
loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)
loss,acc

(tensor(0.1348, grad_fn=<NllLossBackward>), tensor(0.9688))

PyTorch already provides this functionality in `optim.SGD` (it also handles things like momentum, which we'll look at later - except we'll be doing it in a more flexible way!)
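
One way to convince yourself of this, sketched under the assumption of plain SGD (no momentum or weight decay) and a random batch: apply the hand-written update to one copy of a model and `optim.SGD` to an identical copy, then compare the parameters.

```python
import copy
import torch
from torch import nn, optim
import torch.nn.functional as F

torch.manual_seed(0)
xb = torch.randn(64, 784)                 # hypothetical batch
yb = torch.randint(0, 10, (64,))

model_a = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
model_b = copy.deepcopy(model_a)          # identical starting weights
lr = 0.5

# Hand-written update, as in the manual optimization step.
F.cross_entropy(model_a(xb), yb).backward()
with torch.no_grad():
    for p in model_a.parameters():
        p -= lr * p.grad
model_a.zero_grad()

# The same step with PyTorch's optimizer.
opt = optim.SGD(model_b.parameters(), lr=lr)
F.cross_entropy(model_b(xb), yb).backward()
opt.step()
opt.zero_grad()

max_diff = max((pa - pb).abs().max().item()
               for pa, pb in zip(model_a.parameters(), model_b.parameters()))
```

`max_diff` should be zero up to floating-point error.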

In [228]:
from torch import optim

In [44]:
optim.SGD.step??

```python
def step(self, closure=None):
        """Performs a single optimization step.

        Arguments:
            closure (callable, optional): A closure that reevaluates the model
                and returns the loss.
        """
        loss = None
        if closure is not None:
            loss = closure()

        for group in self.param_groups:
            weight_decay = group['weight_decay']
            momentum = group['momentum']
            dampening = group['dampening']
            nesterov = group['nesterov']

            for p in group['params']:
                if p.grad is None:
                    continue
                    
                # get gradient of current parameter
                # and update the gradient with weight decay and momentum.
                d_p = p.grad.data
                if weight_decay != 0:
                    d_p.add_(weight_decay, p.data)
                if momentum != 0:
                    param_state = self.state[p]
                    if 'momentum_buffer' not in param_state:
                        buf = param_state['momentum_buffer'] = torch.clone(d_p).detach()
                    else:
                        buf = param_state['momentum_buffer']
                        buf.mul_(momentum).add_(1 - dampening, d_p)
                    if nesterov:
                        d_p = d_p.add(momentum, buf)
                    else:
                        d_p = buf

                # update the parameter with the gradient
                p.data.add_(-group['lr'], d_p)

        return loss
```

In [233]:
def get_model(lr=0.5):
    model = nn.Sequential(nn.Linear(m, nh), nn.ReLU(), nn.Linear(nh, 10))
    return model, optim.SGD(model.parameters(), lr=lr)

In [238]:
model, opt = get_model(lr=0.5)
type(model), type(opt)

(torch.nn.modules.container.Sequential, torch.optim.sgd.SGD)

In [239]:
loss_func(model(xb), yb)

tensor(2.3216, grad_fn=<NllLossBackward>)

In [240]:
fit(1, model, loss_func, opt, x_train, y_train)
loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)
loss,acc

(tensor(0.1662, grad_fn=<NllLossBackward>), tensor(0.9531))

In [241]:
assert acc > 0.9

## 5. Dataset and DataLoader

### Dataset

It's clunky to iterate through mini-batches of x and y values separately:

```python
xb = x_train[start_i:end_i]
yb = y_train[start_i:end_i]
```
Instead, let's do these two steps together, by introducing a Dataset class:

```python
xb,yb = train_ds[i*bs : i*bs+bs]
```

In [242]:
class Dataset():
    def __init__(self, x, y):
        self.x = x
        self.y = y
        
    def __len__(self):
        return len(self.x)
    
    def __getitem__(self, idx):
        return self.x[idx], self.y[idx]

In [243]:
train_ds, valid_ds = Dataset(x_train, y_train), Dataset(x_valid, y_valid)
assert len(train_ds) == len(x_train)
assert len(valid_ds) == len(x_valid)

In [244]:
xb, yb = train_ds[0:5]
assert xb.shape == (5, 28*28)
assert yb.shape == (5,)

In [245]:
xb, yb

(tensor([[0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.],
         [0., 0., 0.,  ..., 0., 0., 0.]]),
 tensor([5, 0, 4, 1, 9]))

Then, we can refactor the training procedure using `Dataset` as follows:

In [262]:
def fit(epochs, model, loss_func, opt, train_ds):
    print("fit with dataset")
    for epoch in range(epochs):
        for i in range((n-1)//bs + 1):
            start_i = i*bs
            end_i = start_i+bs
            # here, we are using the training dataset
            xb, yb = train_ds[start_i:end_i]
            pred = model(xb)
            loss = loss_func(pred, yb)

            loss.backward()
            opt.step()
            opt.zero_grad()

In [263]:
model, opt = get_model(lr=0.1)
fit(1, model, loss_func, opt, train_ds)   
loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)
loss,acc

fit with dataset


(tensor(0.1002, grad_fn=<NllLossBackward>), tensor(1.))

### DataLoader

Previously, our loop iterated over batches (xb, yb) like this:

```python
for i in range((n-1)//bs + 1):
    start_i = i*bs
    end_i = start_i+bs
    xb, yb = train_ds[start_i:end_i]
```

Let's make our loop much cleaner, using a data loader:
```python
for xb, yb in train_dataloader:
    ...
```


In [283]:
class DataLoader():
    def __init__(self, dataset, batch_size):
        self.dataset = dataset
        self.batch_size = batch_size
    
    def __iter__(self):
        for i in range(0, len(self.dataset), self.batch_size):
            yield self.dataset[i:i+self.batch_size]

In [284]:
train_dataloader = DataLoader(train_ds, 32)
valid_dataloader = DataLoader(valid_ds, 32)

Then, we can refactor the previous training procedure using the data loader:

In [285]:
def fit(epochs, model, loss_func, opt, train_dataloader):
    print("fit with dataloader")
    for epoch in range(epochs):
        # here, we are using the training data loader
        for xb, yb in train_dataloader:
            pred = model(xb)
            loss = loss_func(pred, yb)
            loss.backward()
            opt.step()
            opt.zero_grad()

In [288]:
model, opt = get_model(lr=0.1)
fit(1, model, loss_func, opt, train_dataloader)   
loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)
loss,acc

fit with dataloader


(tensor(0.1795, grad_fn=<NllLossBackward>), tensor(0.9531))

### Random sampling

We want our training set to be in a random order, and that order should differ on each pass over the data (each epoch). But the validation set should not be randomized.

In [264]:
class Sampler():
    def __init__(self, dataset, batch_size, shuffle=False):
        self.num_samples = len(dataset)
        self.batch_size = batch_size
        self.shuffle = shuffle
    
    def __iter__(self):
        self.idxs = torch.randperm(self.num_samples) \
                    if self.shuffle \
                    else torch.arange(self.num_samples)
        for i in range(0, self.num_samples, self.batch_size):
            # Notice that we return the indices of samples of a batch
            yield self.idxs[i:i+self.batch_size]

In [265]:
small_dataset = Dataset(*train_ds[:10])

In [266]:
sampler = Sampler(dataset=small_dataset, batch_size=3, shuffle=False)
[o for o in sampler]

[tensor([0, 1, 2]), tensor([3, 4, 5]), tensor([6, 7, 8]), tensor([9])]

In [267]:
sampler = Sampler(dataset=small_dataset, batch_size=3, shuffle=True)
[o for o in sampler]

[tensor([9, 3, 6]), tensor([4, 5, 2]), tensor([0, 7, 1]), tensor([8])]

In [289]:
def collate(b):
    xs, ys = zip(*b)
    return torch.stack(xs), torch.stack(ys)

class DataLoader():
    def __init__(self, dataset, sampler, collate_fn=collate):
        self.dataset = dataset
        self.sampler = sampler
        self.collate_fn = collate_fn
    
    def __iter__(self):
        for idxs in self.sampler:
#             data_b = self.dataset[idxs]
#             yield data_b
            data_b = [self.dataset[i] for i in idxs]
            yield self.collate_fn(data_b)

In [290]:
train_sampler = Sampler(dataset=train_ds, batch_size=32, shuffle=True)
train_dataloader = DataLoader(train_ds, train_sampler, collate)

valid_sampler = Sampler(dataset=valid_ds, batch_size=32, shuffle=False)
valid_dataloader = DataLoader(valid_ds, valid_sampler, collate)

In [291]:
xb, yb = next(iter(train_dataloader))
xb.shape, yb.shape

(torch.Size([32, 784]), torch.Size([32]))

In [295]:
model, opt = get_model(lr=0.1)
fit(1, model, loss_func, opt, train_dataloader)     
loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)
loss,acc

fit with dataloader


(tensor(0.1255, grad_fn=<NllLossBackward>), tensor(0.9688))

### PyTorch DataLoader

PyTorch already provides two classes for representing and accessing a training set or validation set:

- `Dataset`: a collection which returns a tuple of your independent and dependent variable for a single item
- `DataLoader`: an iterator which provides a stream of mini-batches, where each mini-batch is a pair of a batch of independent variables and a batch of dependent variables
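
A minimal sketch of the two classes working together, using random tensors in place of MNIST and PyTorch's ready-made `TensorDataset` (this notebook uses its own `Dataset` class instead, so `TensorDataset` is a substitution made for the example):

```python
import torch
from torch.utils.data import TensorDataset, DataLoader

x = torch.randn(100, 784)                  # hypothetical inputs
y = torch.randint(0, 10, (100,))           # hypothetical labels

ds = TensorDataset(x, y)
x0, y0 = ds[0]                             # one item: (independent, dependent)

dl = DataLoader(ds, batch_size=32, shuffle=False)
batch_shapes = [(xb.shape[0], yb.shape[0]) for xb, yb in dl]
# 100 items in batches of 32 -> [(32, 32), (32, 32), (32, 32), (4, 4)]

# drop_last=True discards the final, smaller batch.
n_full = len(DataLoader(ds, batch_size=32, drop_last=True))   # 3
```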

In [296]:
from torch.utils.data import DataLoader, SequentialSampler, RandomSampler

In [297]:
train_dl = DataLoader(train_ds, bs, sampler=RandomSampler(train_ds), collate_fn=collate)
valid_dl = DataLoader(valid_ds, bs, sampler=SequentialSampler(valid_ds), collate_fn=collate)

In [298]:
model,opt = get_model(lr=0.1)
fit(1, model, loss_func, opt, train_dl)
loss_func(model(xb), yb), accuracy(model(xb), yb)

fit with dataloader


(tensor(0.2075, grad_fn=<NllLossBackward>), tensor(0.9688))

Most of the time we don't need our own sampler or collate function; PyTorch's defaults work fine for most things:

In [299]:
train_dl = DataLoader(train_ds, bs, shuffle=True, drop_last=True)
valid_dl = DataLoader(valid_ds, bs, shuffle=False)

One thing we haven't mentioned is `num_workers`. It sets how many worker processes load data in parallel, so that batch preparation can use several CPU cores.

In [304]:
model,opt = get_model(lr=0.1)
fit(1, model, loss_func, opt, train_dl)

loss,acc = loss_func(model(xb), yb), accuracy(model(xb), yb)
assert acc > 0.7
loss,acc

fit with dataloader


(tensor(0.1317, grad_fn=<NllLossBackward>), tensor(0.9688))

Note that PyTorch's `DataLoader`, if you pass `num_workers`, will use multiple worker processes (not threads) to call your `Dataset`.

## 6. Validation

You should **always** have a validation set as well, in order to identify whether you are overfitting.

We will calculate and print the validation loss at the end of each epoch.

> Note that we always call **model.train()** before training, and **model.eval()** before inference, because these are used by layers such as `nn.BatchNorm2d` and `nn.Dropout` to ensure appropriate behaviour for these different phases.

> We run validation inside the `torch.no_grad()` context because no gradients are needed there. Autograd then does not keep the intermediate activations for a backward pass, so validation uses less memory and we can afford a larger batch size (the code below uses twice the training batch size).
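
The effect of both switches can be seen on a toy model. In this sketch, the small `nn.Sequential` with a `Dropout` layer and the random input are assumptions for illustration only.

```python
import torch
from torch import nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 10), nn.Dropout(p=0.5))
x = torch.randn(4, 10)

model.eval()                       # dropout becomes the identity
with torch.no_grad():              # no graph is built for these outputs
    out1 = model(x)
    out2 = model(x)
eval_is_deterministic = torch.equal(out1, out2)
tracks_grad = out1.requires_grad   # False inside no_grad

model.train()                      # dropout is active again
train_out = model(x)
n_zeroed = (train_out == 0).sum().item()   # activations dropped in this pass
```

In evaluation mode two passes give identical outputs, while in training mode roughly half of the activations are zeroed at random.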

In [305]:
def fit(epochs, model, loss_func, opt, train_dl, valid_dl):
    for epoch in range(epochs):
        # Handle batchnorm / dropout
        model.train()
#         print(model.training)
        for xb,yb in train_dl:
            loss = loss_func(model(xb), yb)
            loss.backward()
            opt.step()
            opt.zero_grad()

        model.eval()
#         print(model.training)
        with torch.no_grad():
            tot_loss, tot_acc = 0.,0.
            for xb,yb in valid_dl:
                pred = model(xb)
                tot_loss += loss_func(pred, yb)
                tot_acc  += accuracy(pred,yb)
        nv = len(valid_dl)
        print(epoch, tot_loss/nv, tot_acc/nv)
    return tot_loss/nv, tot_acc/nv

> Question: Are these validation results correct if batch size varies?
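
One way to approach the question: averaging the per-batch means is exact only when every batch has the same size. With 10,000 validation samples and a batch size of 128, the last batch holds just 16 samples, so it is slightly over-weighted. The sketch below exaggerates the effect with made-up numbers for two batches of different sizes.

```python
import torch

# Hypothetical per-batch mean losses for a validation set of 100 samples,
# split into one batch of 64 and one batch of 36.
batch_sizes = torch.tensor([64., 36.])
batch_means = torch.tensor([0.10, 0.30])

# What the loop above computes: every batch counts equally.
mean_of_means = batch_means.mean()                        # about 0.200

# The true per-sample mean: weight each batch by its size.
weighted_mean = (batch_means * batch_sizes).sum() / batch_sizes.sum()   # about 0.172
```

Accumulating `loss * len(xb)` and dividing by the total number of samples would remove the bias.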

We can further wrap the construction of the `DataLoader` objects in a `get_dataloaders` function for coherence and reusability.

In [306]:
def get_dataloaders(train_ds, valid_ds, bs, **kwargs):
    return (DataLoader(train_ds, batch_size=bs, shuffle=True, **kwargs),
            DataLoader(valid_ds, batch_size=bs*2, **kwargs))

Now, our whole process of obtaining the data loaders and fitting the model can be run in 3 lines of code:

In [309]:
train_dl,valid_dl = get_dataloaders(train_ds, valid_ds, bs)
model,opt = get_model()
loss,acc = fit(5, model, loss_func, opt, train_dl, valid_dl)

0 tensor(0.1986) tensor(0.9405)
1 tensor(0.1237) tensor(0.9644)
2 tensor(0.1141) tensor(0.9647)
3 tensor(0.1221) tensor(0.9662)
4 tensor(0.1070) tensor(0.9687)


In [184]:
assert acc>0.9

- **Question**: Why do we need to zero the gradients in PyTorch?
- **Answer**: `loss.backward()` adds the new gradients to whatever is already stored in `.grad`, so if we don't zero them on every iteration, each update mixes in gradients from earlier batches. Then why doesn't PyTorch zero the gradients automatically? Because sometimes we want them to accumulate, for example when several modules or several losses contribute to the same parameters. One important application is gradient accumulation, which lets us train with a larger effective batch size than fits in memory. If your machine can only handle a batch size of 32, you can call `opt.step()` and `opt.zero_grad()` only every other iteration; the weights are then updated as they would be with a batch size of 64, up to a constant factor of 2 when the loss is averaged per batch (divide each loss by 2 to match exactly). It is not faster, but it gives the same update.
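
A hedged sketch of gradient accumulation on random data (the model, batch, and the name `accum_steps` are illustrative): two backward passes on halves of 32, each loss divided by the number of accumulation steps, leave the same gradients as one backward pass on the full batch of 64.

```python
import copy
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)
x = torch.randn(64, 784)
y = torch.randint(0, 10, (64,))

big = nn.Sequential(nn.Linear(784, 50), nn.ReLU(), nn.Linear(50, 10))
small = copy.deepcopy(big)               # identical weights

# One backward pass on the full batch of 64.
F.cross_entropy(big(x), y).backward()

# Two backward passes on halves of 32, without zeroing in between.
accum_steps = 2
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    loss = F.cross_entropy(small(xb), yb) / accum_steps
    loss.backward()                      # adds to the existing .grad

max_diff = max((pb.grad - ps.grad).abs().max().item()
               for pb, ps in zip(big.parameters(), small.parameters()))
```

`max_diff` should be zero up to floating-point error. Note that this holds for layers that treat samples independently; layers that use batch statistics, such as batch norm, would behave differently.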

Almost all of this material is taken from fast.ai Lesson 9:
- [Lesson video](https://course.fast.ai/videos/?lesson=9)
- [Lesson notebooks](https://github.com/fastai/course-v3/tree/master/nbs/dl2)
- [Lesson notes](https://medium.com/@lankinen/fast-ai-lesson-9-notes-part-2-v3-ca046a1a62ef)