# Demo Weights and Biases
### Train Image Classifier with PyTorch on GPU Dask Cluster

In this project, we use the Stanford Dogs dataset. Starting with a pre-trained version of ResNet50, we apply transfer learning to make it perform better at dog image identification.



In [1]:
import numpy as np
import os
import math
import datetime
import json 
import torch
from torch import nn, optim
from torchvision import datasets, transforms, models
from torch.utils.data import DataLoader
from torch.utils.data.sampler import RandomSampler
from dask_pytorch_ddp import data, dispatch
import multiprocessing as mp

## Set up Weights and Biases

Import the Weights and Biases library, and confirm that you are logged in. 

>The Start Script in this project uses your Weights and Biases token to log in. If you have any trouble, make sure the token is saved in the Credentials section under the name `WANDB_LOGIN`. Set up this credential before the cluster is started, because every worker in your cluster needs the token.

In [2]:
import wandb
wandb.login()

[34m[1mwandb[0m: Currently logged in as: [33mmorgan[0m (use `wandb login --relogin` to force relogin)


True
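If the login fails, a quick check of the credential can save a cluster restart. The sketch below assumes the `WANDB_LOGIN` credential is exposed as an environment variable of the same name. That is an assumption about how the Credentials section works, and `missing_credentials` is a hypothetical helper, not part of any library.

```python
import os

def missing_credentials(required, env=None):
    """Return the names in `required` that are unset or empty in `env`."""
    # Assumption: credentials appear as environment variables.
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name)]

# Demonstration with explicit dictionaries instead of the real environment.
print(missing_credentials(["WANDB_LOGIN"], env={"WANDB_LOGIN": "xxxx"}))  # []
print(missing_credentials(["WANDB_LOGIN"], env={}))                       # ['WANDB_LOGIN']
```

Once a Dask client exists, `client.run(missing_credentials, ["WANDB_LOGIN"])` should evaluate the same check on every worker and return the results keyed by worker address.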

## Cluster Specific Elements

Because this task uses a Dask cluster, we need to load a few extra libraries, and ensure our cluster is running.

In [3]:
from concurrent.futures import ThreadPoolExecutor
from torch.nn.parallel import DistributedDataParallel as DDP
from dask_saturn import SaturnCluster
from dask.distributed import Client
import torch.distributed as dist

In [4]:
cluster = SaturnCluster()
client = Client(cluster)
client.wait_for_workers(2)
client

INFO:dask-saturn:Cluster is ready
INFO:dask-saturn:Registering default plugins
INFO:dask-saturn:{'tcp://192.168.174.132:37167': {'status': 'repeat'}, 'tcp://192.168.219.196:40323': {'status': 'repeat'}, 'tcp://192.168.35.132:35173': {'status': 'repeat'}}


Client: Scheduler: tcp://d-morga-wandb-demo-c8e449a4e54f4d5d84aadb1051667f53.main-namespace:8786, Dashboard: https://d-morga-wandb-demo-c8e449a4e54f4d5d84aadb1051667f53.community.saturnenterprise.io
Cluster: Workers: 3, Cores: 12, Memory: 46.50 GB


### Label Formatting 
These utilities ensure the training data labels correspond to the pretrained model's label expectations.

In [5]:
import re
import s3fs

##### Load label dataset
s3 = s3fs.S3FileSystem(anon=True)
with s3.open('s3://saturn-public-data/dogs/imagenet1000_clsidx_to_labels.txt') as f:
    imagenetclasses = [line.strip() for line in f.readlines()]
    
##### Format labels to match pretrained Resnet
def replace_label(dataset_label, model_labels):
    # Dataset folders look like 'n02085620-Chihuahua'; keep the part after the ID
    label_string = re.search(r'n[0-9]+-([^/]+)', dataset_label).group(1)
    
    for i in model_labels:
        # Each entry is a bytes line; str() keeps the b'...' wrapper the patterns expect
        i = str(i).replace('{', '').replace('}', '')
        model_label_str = re.search(r'''b["'][0-9]+: ["']([^\/]+)["'],["']''', str(i))
        model_label_idx = re.search(r'''b["']([0-9]+):''', str(i)).group(1)
        
        # Return the first model label that contains the dataset label
        if re.search(str(label_string).replace('_', ' '), str(model_label_str).replace('_', ' ')):
            return i, model_label_idx
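To make the two regular expressions in `replace_label` concrete, the sketch below runs them on one sample of each input. The sample strings are assumptions: they mirror the Stanford Dogs folder naming and one line of the ImageNet label file read in binary mode, but they are not read from S3.

```python
import re

# Hypothetical samples: a dataset class folder name and one stripped
# line of the label file as a bytes object.
dataset_label = "n02085620-Chihuahua"
model_line = b"151: 'Chihuahua',"

# Dataset side: drop the ID prefix to get the breed name.
breed = re.search(r"n[0-9]+-([^/]+)", dataset_label).group(1)

# Model side: str() of a bytes object keeps the b"..." wrapper, which is
# why the patterns in replace_label start with b["'].
line_text = str(model_line)
class_idx = re.search(r"""b["']([0-9]+):""", line_text).group(1)

print(breed)                                  # Chihuahua
print(class_idx)                              # 151
print(breed.replace("_", " ") in line_text)   # True
```

Note that the class index comes back as a string, which is why the training function wraps it in `int()`.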

## Set Model Specifications

Here you can assign your model hyperparameters and identify where the training data is stored on S3. All of these parameters, along with extra elements such as Notes and Tags, are tracked by Weights and Biases for you.

In [6]:
model_params = {'n_epochs': 6, 
    'batch_size': 64,
    'base_lr': 0.0003,
    'downsample_to': 0.5, # Fraction of the training data to use (0.5 = half)
    'bucket': "saturn-public-data",
    'prefix': "dogs/Images",
    'pretrained_classes':imagenetclasses} 
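These hyperparameters determine how many optimizer steps each worker runs. The sketch below works through that arithmetic. It assumes the S3 copy holds the 20,580 images of the published Stanford Dogs dataset, and `steps_per_run` is a hypothetical helper for illustration only.

```python
import math

def steps_per_run(n_images, downsample_to, batch_size, n_epochs):
    """Return (samples per epoch, batches per epoch, total optimizer steps)."""
    # Same expression the training function passes to RandomSampler.
    num_samples = math.floor(n_images * downsample_to)
    # Length of a DataLoader that keeps the final partial batch.
    batches = math.ceil(num_samples / batch_size)
    return num_samples, batches, batches * n_epochs

# 20,580 is the published dataset size; the S3 copy may differ.
print(steps_per_run(20580, 0.5, 64, 6))  # (10290, 161, 966)
```

Because every worker builds its own sampler and loader, these counts apply per worker rather than to the cluster as a whole.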

In [7]:
wbargs = {**model_params,
    'classes':120,
    'Notes':"baseline",
    'Tags': ['downsample', 'cluster', 'gpu', '6wk', 'subsample'],
    'Group': "DDP",
    'dataset':"StanfordDogs",
    'architecture':"ResNet"}

## Training Function

This function encompasses the training task:
* Load model and wrap it in PyTorch's `DistributedDataParallel` (DDP) wrapper
* Initialize Weights and Biases run
* Set up DataLoader to iterate over training data
* Perform training tasks
* Write model performance data to Weights and Biases
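One of the values written to Weights and Biases is a per-image confidence score: the softmax probability the model assigns to the class it predicted. The sketch below shows that calculation in plain Python for a single image. The logits are made up, and `top_prediction` is a hypothetical helper rather than code from the training function.

```python
import math

def top_prediction(logits):
    """Return (predicted class index, softmax probability of that class)."""
    peak = max(logits)
    # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    idx = max(range(len(logits)), key=lambda i: logits[i])
    return idx, exps[idx] / total

idx, score = top_prediction([2.0, 0.5, -1.0])  # made-up logits for 3 classes
print(idx, round(score, 3))
```

The training function does the same thing per batch with `torch.max` for the index and `torch.nn.functional.softmax` for the score.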

In [8]:
def simple_train_cluster(bucket, prefix, batch_size, downsample_to, n_epochs, base_lr, pretrained_classes):
#     os.environ["DASK_DISTRIBUTED__WORKER__DAEMON"] = "False"
    os.environ["WANDB_START_METHOD"] = "thread"
    
    worker_rank = int(dist.get_rank())
    
    # --------- Format params --------- #
    device = torch.device("cuda")
    net = models.resnet50(pretrained=True) # True means we start with the imagenet version
    model = net.to(device)
    model = DDP(model)

    # --------- Start wandb --------- #
    if worker_rank == 0:
        wandb.init(config=wbargs, entity='wandb', project = 'wandb_saturn_demo')
        wandb.watch(model)

    # --------- Set up eval --------- #
    criterion = nn.CrossEntropyLoss().cuda()    
    optimizer = optim.AdamW(model.parameters(), lr=base_lr, eps=1e-06)

    # --------- Retrieve data for training --------- #
    transform = transforms.Compose([
    transforms.Resize(256), 
    transforms.CenterCrop(250), 
    transforms.ToTensor()])
    
    # Because we want to load our images directly and lazily from S3,
    # we use a custom Dataset class called S3ImageFolder.
    whole_dataset = data.S3ImageFolder(
        bucket, 
        prefix, 
        transform=transform, 
        anon = True
    )
    
    # Format target labels
    new_class_to_idx = {x: int(replace_label(x, pretrained_classes)[1]) for x in whole_dataset.classes}
    whole_dataset.class_to_idx = new_class_to_idx

    # ------ Create dataloader ------- #
    train_loader = torch.utils.data.DataLoader(
        whole_dataset, 
        sampler=RandomSampler(
            whole_dataset, 
            replacement = True,
            num_samples = math.floor(len(whole_dataset)*downsample_to)), 
        batch_size=batch_size, 
        num_workers=0 
    )   
    
    # Using the OneCycleLR learning rate schedule
    scheduler = optim.lr_scheduler.OneCycleLR(optimizer, max_lr=base_lr, 
                                                steps_per_epoch=len(train_loader), 
                                                epochs=n_epochs)
    
    # ------ Prepare wandb Table for predictions ------- #
    if worker_rank == 0:
        columns=["image", "label", "prediction", "score"]
        preds_table = wandb.Table(columns=columns)

    # --------- Start Training ------- #
    for epoch in range(n_epochs):
        count = 0
        model.train()
        
        for inputs, labels in train_loader:
            # zero the parameter gradients
            optimizer.zero_grad()
            
            dt = datetime.datetime.now().isoformat()
            inputs, labels = inputs.to(device), labels.to(device)

            # Run model iteration
            outputs = model(inputs)

            # Format results: torch.max returns (values, indices); the indices are the predicted classes
            _, preds = torch.max(outputs, 1)
            perct = [torch.nn.functional.softmax(el, dim=0)[i].item() for i, el in zip(preds, outputs)]
            
            loss = criterion(outputs, labels)
            correct = (preds == labels).sum().item()
            
            loss.backward()
            optimizer.step()
            scheduler.step()
            
            # ✍️ Log your metrics to wandb ✍️
            if worker_rank == 0: 
                logs = {
                        'train/train_loss': loss.item(),
                        'train/learning_rate':scheduler.get_last_lr()[0], 
                        'train/correct':correct, 
                        'train/epoch': epoch + count/len(train_loader), 
                        'train/count': count,     
                    }

                # ✍️  Occasionally log some images to ensure the image data looks correct ✍️
                if count % 25 == 0:
                    logs['examples/example_images'] = wandb.Image(inputs[:5], caption=f'Step: {count}')

                # ✍️ Log some predictions to wandb during final epoch for analysis✍️ 
                if epoch == n_epochs - 1 and count % 4 == 0:
                    for i in range(len(labels)):
                        # .item() converts the GPU tensors to plain Python numbers for the table
                        preds_table.add_data(wandb.Image(inputs[i]), labels[i].item(), preds[i].item(), perct[i]) 

                # ✍️  Log metrics to wandb ✍️         
                wandb.log(logs)
            
            count += 1
    
    # ✍️  Upload your predictions table for analysis ✍️  
    if worker_rank == 0: 
        predictions_artifact = wandb.Artifact("train_predictions_" + str(wandb.run.id), type="train_predictions")
        predictions_artifact.add(preds_table, "train_predictions")
        wandb.run.log_artifact(predictions_artifact)  

        # ✍️ Close your wandb run ✍️ 
        wandb.run.finish()
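The training function passes `base_lr` to `OneCycleLR` as `max_lr`, so `base_lr` is the peak of the schedule rather than the starting rate. The sketch below is a simplified pure-Python approximation of the two-phase cosine schedule. It assumes PyTorch's default settings (`pct_start=0.3`, `div_factor=25`, `final_div_factor=1e4`) and a hypothetical total of 966 steps, which is 161 batches per epoch for 6 epochs. It is not PyTorch's implementation.

```python
import math

def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Approximate learning rate at a 0-based step of a one-cycle schedule."""
    initial_lr = max_lr / div_factor
    min_lr = initial_lr / final_div_factor
    warmup_end = pct_start * total_steps - 1  # end of the warm-up phase
    last_step = total_steps - 1

    def cos_anneal(start, end, pct):
        # Cosine interpolation from start (pct=0) to end (pct=1).
        return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1)

    if step <= warmup_end:
        return cos_anneal(initial_lr, max_lr, step / warmup_end)
    return cos_anneal(max_lr, min_lr,
                      (step - warmup_end) / (last_step - warmup_end))

total = 966  # assumed step count, see lead-in
lrs = [one_cycle_lr(s, total, max_lr=0.0003) for s in range(total)]
print(lrs[0], max(lrs), lrs[-1])
```

Under these assumptions the rate starts at `max_lr / 25`, climbs to roughly `max_lr` about 30% of the way through, then decays to a very small value. This is the shape to expect in the `train/learning_rate` chart.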

## Run Model

To run the model, we use the `dask-pytorch-ddp` function `dispatch.run()`. This takes our client, our training function, and our dictionary of model parameters. You can monitor the model run on all workers using the Dask dashboard, or monitor the performance of Worker 0 on Weights and Biases.

In [None]:
client.restart() # Clears memory on the cluster; optional but recommended.

In [None]:
%%time    
futures = dispatch.run(client, simple_train_cluster, **model_params)
futures

In [None]:
# If a worker job errors, calling result() on its future re-raises the error here
futures[0].result()
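Checking only the first future can hide a failure on another worker. The sketch below collects results and exceptions from every future in a list. `collect_results` is a hypothetical helper, and the demonstration uses stand-in futures from a local thread pool rather than a Dask cluster. It relies only on each future having a `.result()` method, which Dask futures also provide.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_results(futures):
    """Call .result() on every future; return (results, errors) by list position."""
    results, errors = {}, {}
    for idx, future in enumerate(futures):
        try:
            results[idx] = future.result()
        except Exception as exc:  # keep going so every future is inspected
            errors[idx] = exc
    return results, errors

# Stand-in job: the second one fails on purpose.
def fake_job(idx):
    if idx == 1:
        raise RuntimeError("simulated worker failure")
    return f"worker {idx} done"

with ThreadPoolExecutor(max_workers=3) as pool:
    demo_futures = [pool.submit(fake_job, i) for i in range(3)]

results, errors = collect_results(demo_futures)
print(results)
print(errors)
```

With the notebook's own list, the call would be `collect_results(futures)`. Like `futures[0].result()`, it blocks until each job finishes.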

At this point, you can view the Weights and Biases dashboard to see the model's performance and system resource utilization in real time!