<img src="img/saturn_logo.png" width="300" />

# Set Up Training

We don't need to run all of Notebook 5 again. Instead, we'll call `setup2.py` in the next chunk to get ourselves back to the right state.

In [1]:
%run -i setup2.py

[2020-12-02 15:13:09] INFO - dask-saturn | Cluster is ready
[2020-12-02 15:13:09] INFO - dask-saturn | Registering default plugins
[2020-12-02 15:13:09] INFO - dask-saturn | {'tcp://10.0.17.127:45357': {'status': 'repeat'}, 'tcp://10.0.31.106:44971': {'status': 'repeat'}, 'tcp://10.0.5.190:39921': {'status': 'repeat'}}


In [2]:
client

Client
  Scheduler: tcp://d-steph-resnet-article-50680500b0454380ace50ab2e594ca29.main-namespace:8786
  Dashboard: https://d-steph-resnet-article-50680500b0454380ace50ab2e594ca29.internal.saturnenterprise.io
Cluster
  Workers: 3
  Cores: 30
  Memory: 127.50 GB


We're ready to do some learning! 

## Regular Model Details

Aside from the Special Elements noted below, we can write this section essentially the same way we write any other PyTorch training loop. 
* Cross Entropy Loss for our loss function
* SGD (Stochastic Gradient Descent) for our optimizer

We're also using a learning rate scheduler called `ReduceLROnPlateau`, which leaves our base learning rate alone until the model hits a plateau (the loss is no longer decreasing) and then lowers it.
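
To make the plateau behavior concrete, here is a minimal sketch that feeds a scheduler a hand-written list of losses. The toy model, the starting learning rate of 0.1, and the loss values are all made up for illustration. Only the `mode` and `patience` settings match what this notebook uses.

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

# A tiny stand-in model; the notebook itself trains ResNet50.
toy_model = nn.Linear(4, 2)
optimizer = optim.SGD(toy_model.parameters(), lr=0.1, momentum=0.9)
scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode="min", patience=2)

# Made-up loss values: one improvement, then a plateau.
fake_losses = [1.0, 0.9, 0.9, 0.9, 0.9]

lr_history = []
for fake_loss in fake_losses:
    scheduler.step(fake_loss)  # tell the scheduler how we are doing
    lr_history.append(optimizer.param_groups[0]["lr"])

print(lr_history)
```

The learning rate stays at 0.1 while the loss improves and through the first two stalled steps. Once the loss has failed to improve for more than `patience` consecutive steps, the scheduler multiplies the rate by its default factor of 0.1.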

Each epoch has two stages: training and evaluation. We run through the entire training set in batches of 100 before we move to the evaluation step, where we run through the entire eval set, also in batches of 100.

***

## Special Elements

Most of the training workflow function shown below is pretty standard for users of PyTorch. However, there are a couple of elements that are different.

### DaskResultsHandler
To use the model output handler, we need to initialize the `DaskResultsHandler` class for our experiment.
This object has a few important methods, including one that lets us automatically record our model's performance at each iteration.  

In [3]:
import uuid
key = uuid.uuid4().hex

rh = results.DaskResultsHandler(key)


### Worker Rank
```
worker_rank = int(dist.get_rank())
```

This checks which worker in the cluster we're on, so that each results record can identify the worker whose performance it represents.
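
As a rough sketch of how the rank ends up in a results record, the snippet below builds the same kind of path and JSON payload that the training function later in this notebook submits. The helper name `build_result_record` is hypothetical, and the rank is hard-coded because `dist.get_rank()` only works inside an initialized process group.

```python
import datetime
import json


def build_result_record(worker_rank, loss, epoch, count, sample):
    """Return the (path, payload) pair for one iteration's result."""
    dt = datetime.datetime.now().isoformat()
    # The rank appears in the path, so each worker's files land in their own folder.
    path = f"worker/{worker_rank}/data-{dt}.json"
    payload = json.dumps(
        {"loss": loss, "epoch": epoch, "count": count,
         "worker": worker_rank, "sample": sample}
    )
    return path, payload


# Hard-coded rank for illustration; in training it comes from dist.get_rank().
path, payload = build_result_record(worker_rank=1, loss=0.42, epoch=0, count=7, sample="train")
print(path)
print(payload)
```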


### Model to Device

```
device = torch.device(0)
net = models.resnet50(pretrained=True)
model = net.to(device)
```

As you'll recall from Notebook 4, we need to make sure our model is placed on the worker's GPU. Here we do it one time, before the training loops begin. We will also pass each batch of images and labels to the same device within the training and evaluation loops. See if you can find this spot, because you need to fill in the blanks!
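
The pattern is the same for any model, so here is a small sketch using a toy linear layer instead of ResNet50. The CPU fallback is an assumption added so the snippet runs on a machine without a GPU. On the workers in this notebook, `torch.device(0)` is used directly.

```python
import torch
from torch import nn

# Assumption: fall back to the CPU when no GPU is visible, so the sketch
# runs anywhere. The notebook's workers use the GPU at index 0.
device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

toy_model = nn.Linear(8, 3).to(device)          # model moved once, up front
inputs = torch.randn(5, 8).to(device)           # each batch moved inside the loop
labels = torch.randint(0, 3, (5,)).to(device)

outputs = toy_model(inputs)                     # model and data share a device
print(outputs.device, outputs.shape)
```

If the model and a batch sit on different devices, the forward pass raises an error, which is why both moves are needed.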


### DDP Wrap
```
device_ids = [0]
model = DDP(model, device_ids=device_ids)
```

And finally, we need to enable the DistributedDataParallel framework. To do this, we wrap the model in `DDP()`, which is our import alias for the PyTorch class `torch.nn.parallel.DistributedDataParallel`. There is a lot to know about this, but for our purposes the important thing is to understand that it allows the model training to run in parallel on our cluster. See the [PyTorch DDP notes](https://pytorch.org/docs/stable/notes/ddp.html) for more detail.

***

## Train that Baby!
Our whole training process is going to be contained in one function, here named `run_transfer_learning`. There are some empty spots here referring to concepts we have discussed. Fill in the blanks in between `<<< >>>` marks to get the correct training function, or click the ellipsis below to check your work.

In [4]:
import torch
from torch import nn, optim
from torch.nn.parallel import DistributedDataParallel as DDP

from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
from torch.utils.data.sampler import SubsetRandomSampler

import torch.distributed as dist
from torch.optim import lr_scheduler

In [5]:
def run_transfer_learning(bucket, prefix, train_pct, batch_size, n_epochs, base_lr):
    '''Load basic Resnet50, run transfer learning over given epochs.
    Uses dataset from the path given as the pool from which to take the 
    training and evaluation samples.'''
    # --------- Format model and params --------- #
    worker_rank = int(dist.get_rank())
    
    device = torch.device(0)
    net = models.resnet50(pretrained=True)
    model = net.to(device)
    device_ids = [0]
    model = DDP(model, device_ids=device_ids)
    
    criterion = nn.CrossEntropyLoss().cuda()    
    lr = base_lr * dist.get_world_size()
    optimizer = optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    scheduler = lr_scheduler.ReduceLROnPlateau(optimizer, mode = 'min', patience = 2)
    
    # --------- Retrieve data for training and eval --------- #
    whole_dataset = prepro_batches(bucket, prefix)
    train, val = get_splits_parallel(train_pct, whole_dataset, batch_size=batch_size)
    dataloaders = {'train' : train, 'val': val}

    # --------- Start iterations --------- #
    count = 0
    t_count = 0
    for epoch in range(n_epochs):
        # --------- Training section --------- #
        model.train()  # Set model to training mode
        for inputs, labels in dataloaders["train"]:
            dt = datetime.datetime.now().isoformat()
            inputs = inputs.to(device)
            labels = labels.to(device)
            
            outputs = model(inputs)
            _, preds = torch.max(outputs, 1)
            loss = criterion(outputs, labels)
            correct = (preds == labels).sum().item()
            
            # Zero the parameter gradients, backpropagate, and update the weights
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            count += 1
            # statistics
            for param_group in optimizer.param_groups:
                current_lr = param_group['lr']
            # Record the results of this model iteration (training sample) for later review.
            rh.submit_result(
                f"worker/{worker_rank}/data-{dt}.json", 
                json.dumps({'loss': loss.item(),'learning_rate':current_lr, 'correct':correct, 'epoch': epoch, 'count': count, 'worker': worker_rank, 'sample': 'train'})
            )
        
            if (count % 100) == 0 and worker_rank == 0:
                # Grab a snapshot of the current state of the model, in case of interruption or need to review
                rh.submit_result(f"checkpoint-{dt}.pkl", pickle.dumps(model.state_dict()))

        # --------- Evaluation section --------- #
        with torch.no_grad():
            model.eval()  # Set model to evaluation mode
            for inputs_t, labels_t in dataloaders["val"]:
                dt = datetime.datetime.now().isoformat()
                inputs_t = inputs_t.to(device)
                labels_t = labels_t.to(device)
            
                outputs_t = model(inputs_t)
                _,pred_t = torch.max(outputs_t, dim=1)
                loss_t = criterion(outputs_t, labels_t)
                correct_t = (pred_t == labels_t).sum().item()
                t_count += 1

                # statistics
                for param_group in optimizer.param_groups:
                    current_lr = param_group['lr']
                # Record the results of this model iteration (evaluation sample) for later review.
                rh.submit_result(
                    f"worker/{worker_rank}/data-{dt}.json", 
                    json.dumps({'loss': loss_t.item(),'learning_rate':current_lr, 'correct':correct_t, 'epoch': epoch, 'count': t_count, 'worker': worker_rank, 'sample': 'eval'})
                )

        # Step the scheduler once per epoch, using the loss from the last training batch
        scheduler.step(loss)

Now we've done all the hard work, and just need to run our function! Using `dispatch.run` from `dask-pytorch-ddp`, we pass in the transfer learning function so that it gets distributed correctly across our cluster. This returns a list of futures, one per worker, whose results are not yet available. We then use `process_results` to wait on those futures and collect the results as the workers send them back.

### Define Model Parameters

As with any PyTorch model, you'll want to define the epochs of training you plan to do, the batch size if using batches, and the starting learning rate. We're also able to assign the train/test split here because of how the functions above are written.

(We're using only two epochs here to save time, but of course you can increase this.)
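
One detail worth spelling out is that `run_transfer_learning` multiplies `base_lr` by the number of workers before handing it to the optimizer. The arithmetic below is a small sketch using this notebook's three workers and the starting parameters in the next cell. The "samples per step" figure assumes each worker processes its own batch of 100 at every step.

```python
import math

# Values taken from this notebook: three workers and the starting parameters.
world_size = 3       # stand-in for dist.get_world_size()
base_lr = 0.01
batch_size = 100

scaled_lr = base_lr * world_size          # the rate the optimizer actually receives
global_batch = batch_size * world_size    # assumes one batch per worker per step

print(f"scaled lr: {scaled_lr:.3f}, samples per step: {global_batch}")
# prints "scaled lr: 0.030, samples per step: 300"
```

Keep this in mind when changing the cluster size: the same `base_lr` produces a different effective learning rate on a larger or smaller cluster.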

In [6]:
startparams = {'n_epochs': 2, 
               'batch_size': 100,
               'train_pct': .8,
               'base_lr': 0.01}

In [7]:
import math
import numpy as np
import multiprocessing as mp
import datetime
import json 
import pickle

num_workers = 64

### Send Tasks to Workers
 
We talked in Notebook 2 about how we distribute tasks to the workers in our cluster, and now you get to see it firsthand. Inside the `dispatch.run()` function in `dask-pytorch-ddp`, we are actually using the `client.submit()` method to pass tasks to our workers, and collecting the resulting futures in a list whose results we will gather later. We can confirm this by inspecting the returned list, here named `futures`, where we can see they are in fact all pending futures, one for each of the workers in our cluster.
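
If the submit-then-collect pattern is new to you, the sketch below shows it with Python's standard `concurrent.futures` module, whose interface Dask's futures extend. This is an analogy only: it runs threads on one machine rather than tasks on a cluster, and `fake_training_job` is a made-up stand-in for `run_transfer_learning`.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


def fake_training_job(worker_id, gate):
    """Stand-in for run_transfer_learning: waits, then reports back."""
    gate.wait()  # block until we allow the job to finish
    return f"worker {worker_id} done"


gate = threading.Event()
with ThreadPoolExecutor(max_workers=3) as pool:
    # One job per worker, collected in a list of futures.
    futures_demo = [pool.submit(fake_training_job, i, gate) for i in range(3)]
    still_running = [f.done() for f in futures_demo]    # nothing has finished yet
    gate.set()                                          # let the jobs complete
    results_demo = [f.result() for f in futures_demo]   # blocks until each job returns

print(still_running)
print(results_demo)
```

Submitting returns immediately with a placeholder for each job, and asking for `.result()` is what makes the caller wait.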

Remember, each worker is attacking the same problem - our transfer learning problem - and pursuing a solution simultaneously. We'll have one and only one job per worker.

In [8]:
%%time    
futures = dispatch.run(client, run_transfer_learning, bucket = "dask-datasets", prefix = "dogs/Images", **startparams)

CPU times: user 18.6 ms, sys: 322 µs, total: 19 ms
Wall time: 18.4 ms


In [9]:
futures

[<Future: pending, key: dispatch_with_ddp-acc5d577ea81c2af5ec1a48debf7ebab>,
 <Future: pending, key: dispatch_with_ddp-7fa0f08ee02823473f705fe2fce0bcfc>,
 <Future: pending, key: dispatch_with_ddp-ffeb696fdc2b0dc02ef3c9f5a0c7070b>]

### Run Computations

This step is where we wait on the workers and collect what they send back. The `process_results()` method on the `DaskResultsHandler` requests the result of each future, which is familiar from [Notebook 4](04-parallel-inference.ipynb), where we used `fut.result()`.

In [10]:
!rm -rf /home/jovyan/project/workshop-dask-pytorch/workshop_results

In [11]:
%%time

rh.process_results("/home/jovyan/project/workshop-dask-pytorch/workshop_results", futures, raise_errors=False)

CPU times: user 2.26 s, sys: 1.23 s, total: 3.49 s
Wall time: 13min 29s


<img src="https://media.giphy.com/media/VFDeGtRSHswfe/giphy.gif" alt="parallel" style="width: 200px;"/>

Now we let our workers run for a while. This step will take time, so you may not be able to see the full results during our workshop. (In tests, it took about 15 minutes to do two epochs.) To see partial results coming in, look for the `workshop_results` folder in the folder menu a few moments after you run the last chunk. This folder holds the results each worker is returning to us.
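
To peek at those partial results from code, something like the sketch below could work. It assumes the handler writes each submitted record under the results folder using the same `worker/<rank>/data-<timestamp>.json` path the training function passes to `rh.submit_result`. The helper names `load_worker_results` and `summarize` are hypothetical.

```python
import glob
import json
import os


def load_worker_results(results_dir):
    """Read every worker/<rank>/data-*.json file under results_dir."""
    pattern = os.path.join(results_dir, "worker", "*", "data-*.json")
    records = []
    for path in sorted(glob.glob(pattern)):
        with open(path) as f:
            records.append(json.load(f))
    return records


def summarize(records, sample="train"):
    """Return (mean loss, number of records) for one sample type."""
    losses = [r["loss"] for r in records if r["sample"] == sample]
    mean_loss = sum(losses) / len(losses) if losses else float("nan")
    return mean_loss, len(losses)
```

For example, `summarize(load_worker_results("/home/jovyan/project/workshop-dask-pytorch/workshop_results"))` would give the mean training loss over whatever records have arrived so far.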

If you want to stop the job early, just click the square "stop" button at the top of this notebook.


## Proof of Results

We don't have the time today to run an assortment of different cluster sizes to see what works best, but I happen to have the results of those runs saved and visualized, to demonstrate how well it works! [Follow me to Notebook 7!](07-learning-results.ipynb)