<img src="img/saturn_logo.png" width="300" />


# Transfer Learning

Now we know how to run a very fast inference job by parallelizing it with Dask. But what if we need to train a model? Let's do a transfer learning task to see how that might work.

We are still using the Stanford Dogs dataset and starting from ResNet50, and we will use transfer learning to make the model perform better at dog image identification.

In order to make this work, we have a few steps to carry out:
* Preprocessing our data appropriately
* Applying infrastructure for parallelizing the learning process
* Running the transfer learning workflow and generating evaluation data


To start, you know the drill by now: get our cluster connected. Fill in the blanks in between `<<< >>>` marks to get the correct code, or click the ellipsis below to check your work.

In [None]:
### FILL IN THE BLANKS ###

from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = <<< FILL IN >>>(
    n_workers = 3,
    scheduler_size = 'medium',
    worker_size = 'p32xlarge',
    nthreads = 8
)
client = <<< FILL IN >>>(cluster)
client.wait_for_workers(3)
client

In [1]:
from dask_saturn import SaturnCluster
from dask.distributed import Client

cluster = SaturnCluster(
    n_workers = 3,
    scheduler_size = 'medium',
    worker_size = 'p32xlarge',
    nthreads = 8
)

client = Client(cluster)
client.wait_for_workers(3)
client

[2020-12-04 21:24:20] INFO - dask-saturn | Cluster is ready
[2020-12-04 21:24:20] INFO - dask-saturn | Registering default plugins
[2020-12-04 21:24:20] INFO - dask-saturn | {'tcp://10.0.0.218:33245': {'status': 'repeat'}, 'tcp://10.0.10.215:33649': {'status': 'repeat'}, 'tcp://10.0.3.161:46117': {'status': 'repeat'}}


Client
Scheduler: tcp://d-steph-workshop-dask-pytorch-466b41db5a6b4fca8eb0c02a20d046a4.main-namespace:8786
Dashboard: https://d-steph-workshop-dask-pytorch-466b41db5a6b4fca8eb0c02a20d046a4.internal.saturnenterprise.io

Cluster
Workers: 3
Cores: 24
Memory: 181.50 GB


In [2]:
import torch
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

***

## Preprocessing Data

We are using `dask-pytorch-ddp` to handle much of the work involved in training across the entire cluster. It abstracts away many worker management tasks and sets up a tidy infrastructure for managing model output. If you're interested in learning more, we maintain the [codebase and documentation on GitHub](https://github.com/saturncloud/dask-pytorch).

Because we want to load our images directly from S3, without first downloading and storing a local copy (and wasting space/time!), we are going to use `S3ImageFolder`, a custom class from `dask-pytorch-ddp` that inherits from the PyTorch `Dataset` class.

The preprocessing steps are quite short: we want to load images using the class we discussed above and apply the transformations of our choosing. If you like, you can make the transformations an argument to this function and pass them in.


In [3]:
from dask_pytorch_ddp import results, data, dispatch
from torch.utils.data.sampler import SubsetRandomSampler

In [4]:
from torchvision import transforms

def prepro_batches(bucket, prefix):
    '''Initialize the S3ImageFolder dataset and apply transformations.'''
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(250),
        transforms.ToTensor(),
    ])
    whole_dataset = data.S3ImageFolder(bucket, prefix, transform=transform, anon=True)
    return whole_dataset
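
As noted earlier, the transformations could be passed in as an argument instead of being fixed inside the function. The following is only a sketch of that idea, not part of the workshop code: the name `prepro_batches_with_transform`, the `dataset_cls` parameter, and the `FakeImageFolder` stand-in are all hypothetical, added so the function can be tried without S3 access.

```python
def prepro_batches_with_transform(bucket, prefix, transform, dataset_cls=None):
    '''Build the dataset using a caller-supplied transform.

    dataset_cls is a hypothetical hook for trying the function offline;
    when it is omitted, S3ImageFolder from dask-pytorch-ddp is used.
    '''
    if dataset_cls is None:
        # Imported lazily so the sketch runs even without the library installed.
        from dask_pytorch_ddp import data
        dataset_cls = data.S3ImageFolder
    return dataset_cls(bucket, prefix, transform=transform, anon=True)


class FakeImageFolder:
    '''Hypothetical stand-in that only records the arguments it receives.'''

    def __init__(self, bucket, prefix, transform=None, anon=False):
        self.bucket = bucket
        self.prefix = prefix
        self.transform = transform
        self.anon = anon


def to_upper(x):
    '''Toy "transform" used only to show that the callable is passed through.'''
    return x.upper()


# Placeholder bucket and prefix names, not real S3 locations.
fake = prepro_batches_with_transform(
    "example-bucket", "example/prefix", to_upper, dataset_cls=FakeImageFolder
)
```

In the real workflow you would presumably pass a `transforms.Compose([...])` object as `transform` and leave `dataset_cls` unset.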

### Select Training and Evaluation Samples

To run our training, we'll split the data into training and evaluation samples. The function below returns one `DataLoader` object for each sample, which we can iterate over. We'll use both later to run and monitor our model's learning.

In [5]:
import math
import multiprocessing as mp

import numpy as np

def get_splits_parallel(train_pct, data, batch_size, num_workers=4):
    '''Select two disjoint samples of data for training and evaluation.

    num_workers sets the number of loader subprocesses; the default of 4
    is an assumed placeholder, so tune it to the CPUs on each worker.
    '''
    train_size = math.floor(len(data) * train_pct)
    indices = list(range(len(data)))
    np.random.shuffle(indices)
    train_idx = indices[:train_size]
    test_idx = indices[train_size:]

    train_sampler = SubsetRandomSampler(train_idx)
    test_sampler = SubsetRandomSampler(test_idx)

    # Both loaders read the same dataset; the samplers keep their indices separate.
    train_loader = torch.utils.data.DataLoader(data, sampler=train_sampler, batch_size=batch_size, num_workers=num_workers, multiprocessing_context=mp.get_context('fork'))
    test_loader = torch.utils.data.DataLoader(data, sampler=test_sampler, batch_size=batch_size, num_workers=num_workers, multiprocessing_context=mp.get_context('fork'))

    return train_loader, test_loader
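
To see what the index split inside `get_splits_parallel` is doing, independent of PyTorch, here is a minimal standard-library sketch. The helper name `split_indices` and the `seed` parameter are hypothetical, and it uses `random` rather than NumPy for shuffling, so it illustrates the idea rather than reproducing the function exactly.

```python
import math
import random


def split_indices(n_items, train_pct, seed=0):
    '''Shuffle the indices 0..n_items-1 and cut them into train/test lists.'''
    indices = list(range(n_items))
    # A seeded generator makes the shuffle repeatable between runs.
    random.Random(seed).shuffle(indices)
    train_size = math.floor(n_items * train_pct)
    return indices[:train_size], indices[train_size:]


train_idx, test_idx = split_indices(10, 0.8)

# The two lists never share an index, and together they cover every item.
overlap = set(train_idx) & set(test_idx)
covered = sorted(train_idx + test_idx)
```

Because the two index lists are disjoint, a sampler built on `test_idx` only draws images the model has not trained on.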

Aside from using our custom data object, `get_splits_parallel` should look very similar to other PyTorch workflows. While I am using the `S3ImageFolder` class here, you definitely don't have to in your own work. Any standard PyTorch `Dataset` should be compatible with the Dask work we're doing next.
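
As an illustration of that last point, the same sampler-and-loader pattern can be applied to an ordinary in-memory dataset. This is a sketch with made-up stand-in data (ten blank 3x8x8 "images" whose labels equal their indices), and it assumes `torch` is installed. It loads in the main process, so no `num_workers` or multiprocessing context is needed.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset
from torch.utils.data.sampler import SubsetRandomSampler

# Hypothetical stand-in data: 10 blank "images" with labels 0..9.
images = torch.zeros(10, 3, 8, 8)
labels = torch.arange(10)
dataset = TensorDataset(images, labels)

# Fixed split for illustration: first 8 items for training, last 2 for evaluation.
train_idx = list(range(0, 8))
test_idx = list(range(8, 10))

train_loader = DataLoader(dataset, sampler=SubsetRandomSampler(train_idx), batch_size=4)
test_loader = DataLoader(dataset, sampler=SubsetRandomSampler(test_idx), batch_size=4)

# Collect the labels each loader yields to confirm which items it draws from.
seen_train = sorted(int(y) for _, ys in train_loader for y in ys)
seen_test = sorted(int(y) for _, ys in test_loader for y in ys)
```

Each loader yields only the items its sampler was given, in a random order within that subset.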

Now, it's time for learning, in [Notebook 6](06-transfer-training.ipynb)!

<img src="https://media.giphy.com/media/mC7VjtF9sYofs9DUa5/giphy.gif" alt="learn" style="width: 300px;"/>