# Benchmarking distributed training with PyTorch DataParallel

## Overview

This notebook allows to you measure the first and second epoch train time for training deep learning models on multiple GPUs using the PyTorch DataParallel strategy.

This experiment varies the following:

* Dataset size: MNIST digits repeated 1x, 4x, 8x = 60k, 240k, 480k training images
* Model size: Small with 402k trainable parameters, large with 2.6m trainable parameters. Both: adam optimizer, cross entropy loss
* Batch size: 128, 256, 512 images
* GPUs: I used: GCP n1-highmem-2 (2 vCPUs, 13 GB memory) with {1, 2, 4} NVIDIA Tesla K80 GPUs

And then it records:

* First epoch train time: incurs any startup costs
* Second epoch train time: representative of future epoch train times since training incurs same number of operations/epoch

### Import dependencies

In [1]:
import torch
import torchvision
from torchvision import datasets, transforms
from torch import nn, optim
import torch.nn.functional as F

from time import time
import os
import json

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

In [19]:
# pytorch transforms.ToTensor implicitly divides pixel values by 255
transform = transforms.Compose([transforms.ToTensor()])

train_set = datasets.MNIST('data/', download=True, train=True, transform=transform)
test_set = datasets.MNIST('data/', download=True, train=False, transform=transform)
test_loader = torch.utils.data.DataLoader(test_set, batch_size=64, shuffle=True)

In [4]:
small_model = nn.Sequential(nn.Conv2d(1, 32, 3, padding=1),
                              nn.ReLU(),
                              nn.MaxPool2d((2,2)),
                              nn.Flatten(),
                              nn.Linear(6272, 64),
                              nn.ReLU(),
                              nn.Linear(64, 10)
                             )

large_model = nn.Sequential(nn.Conv2d(1, 128, 3, padding=1),
                              nn.ReLU(),
                              nn.MaxPool2d((2,2)),
                              nn.Conv2d(128, 256, 3, padding=1),
                              nn.ReLU(),
                              nn.MaxPool2d((2,2)),
                              nn.Conv2d(256, 512, 3, padding=1),
                              nn.ReLU(),
                              nn.MaxPool2d((2,2)),                
                              nn.Flatten(),
                              nn.Linear(4608, 512),
                              nn.ReLU(),
                              nn.Linear(512, 512),
                              nn.ReLU(),
                              nn.Linear(512, 10)
                             )

model_fncs = {'small': small_model, 'large': large_model}

Sequential(
  (0): Conv2d(1, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
  (3): Flatten()
  (4): Linear(in_features=6272, out_features=64, bias=True)
  (5): ReLU()
  (6): Linear(in_features=64, out_features=10, bias=True)
)
Sequential(
  (0): Conv2d(1, 128, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (1): ReLU()
  (2): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
  (3): Conv2d(128, 256, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (4): ReLU()
  (5): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
  (6): Conv2d(256, 512, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1))
  (7): ReLU()
  (8): MaxPool2d(kernel_size=(2, 2), stride=(2, 2), padding=0, dilation=1, ceil_mode=False)
  (9): Flatten()
  (10): Linear(in_features=4608, out_features=512, bias=True)
  (11): ReLU()
  (12): Line

In [16]:
def test(model):
    with torch.no_grad():
        model.eval()
        test_loss = 0
        correct = 0
        for data, target in test_loader:
            data, target = data.to(device), target.to(device)
            output = model(data)

            pred = output.max(1, keepdim=True)[1]
            correct += pred.eq(target.view_as(pred)).sum().item()
        return 100. * correct / len(test_loader.dataset)

## Define experiment
The experiment function takes in the three variables we are interested in: number of times to repeat the MNIST dataset, batch_size (per replica), and the number of GPUs to use. It will save the time it takes to train the first and second epochs.

In [17]:
# using default params for adam optimizer from tf.keras.optim

def experiment(model_size, n_dataset_repeat, batch_size_per_replica, n_gpus, record_results=False):
    batch_size = batch_size_per_replica * n_gpus

    dupl_train_set = torch.utils.data.ConcatDataset([train_set for i in range(n_dataset_repeat)])
    dupl_train_loader = torch.utils.data.DataLoader(dupl_train_set, batch_size=batch_size, shuffle=True)

    model = model_fncs[model_size]
    model = nn.DataParallel(model, device_ids=[i for i in range(n_gpus)]).to(device)
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999), eps=1e-07, weight_decay=0, amsgrad=False)
    
    epoch_times = []
    epochs = 2
    for e in range(epochs):
        running_loss = 0
        start = time()
        for images, labels in dupl_train_loader:
            images, labels = images.to(device), labels.to(device)

            optimizer.zero_grad()    
            output = model(images)
            loss = criterion(output, labels)
            loss.backward()
            optimizer.step()
            running_loss += loss.item()
        else:
            end = time()
            epoch_times.append(end-start)
            print("Epoch {} - Training loss: {} - Test accuracy: {} - Time(s) {}".format(e, running_loss/len(dupl_train_loader), test(model), end-start))
    
    
    if record_results:
        with open('pytorch_results.txt', 'a') as f:
            f.write(json.dumps(
                        {'n_dataset_repeat': n_dataset_repeat,
                        'batch_size': batch_size_per_replica,
                        'n_gpus': n_gpus,
                        'model_size': model_size,
                        'first epoch time': epoch_times[0],
                        'second epoch time': epoch_times[1]}) + '\n')

    

In [22]:
MODELS = ['small', 'large']
DATASET_REPEATS = [1,4,8]
BATCH_SIZES = [128, 256, 512]
# GPU_NUMS = [i for i in range(torch.cuda.device_count()+1) if i in (1,2,4,8)]
GPU_NUMS = [2]

for d in DATASET_REPEATS:
    for b in BATCH_SIZES:
        for g in GPU_NUMS:
            for m in MODELS:
                print('\n' + '*'*80 + '\n')
                print('Now training: {} model on dataset repeated {}x with batch size {} on {} gpu(s)'.format(m, d, b, g))
                experiment(m,d,b,g, record_results=True)


********************************************************************************

Now training: small model on dataset repeated 1x with batch size 128 on 2 gpu(s)
Epoch 0 - Training loss: 0.0005689484065713143 - Test accuracy: 98.68 - Time(s) 10.922572374343872
Epoch 1 - Training loss: 2.435248299778013e-05 - Test accuracy: 98.71 - Time(s) 8.288250207901001

********************************************************************************

Now training: large model on dataset repeated 1x with batch size 128 on 2 gpu(s)
Epoch 0 - Training loss: 0.00335146024811733 - Test accuracy: 99.18 - Time(s) 20.00277328491211
Epoch 1 - Training loss: 0.0035170903741790754 - Test accuracy: 99.4 - Time(s) 19.96966004371643

********************************************************************************

Now training: small model on dataset repeated 1x with batch size 256 on 2 gpu(s)
Epoch 0 - Training loss: 0.0001330823421486164 - Test accuracy: 98.49 - Time(s) 7.674724817276001
Epoch 1 - Training l

In [23]:
! cat results.txt

{"n_dataset_repeat": 1, "batch_size": 128, "n_gpus": 1, "first epoch time": 9.389503240585327, "second epoch time": 9.451826333999634}
{"n_dataset_repeat": 1, "batch_size": 128, "n_gpus": 1, "first epoch time": 26.72149634361267, "second epoch time": 26.690542936325073}
{"n_dataset_repeat": 1, "batch_size": 256, "n_gpus": 1, "first epoch time": 8.530330896377563, "second epoch time": 8.714176416397095}
{"n_dataset_repeat": 1, "batch_size": 256, "n_gpus": 1, "first epoch time": 24.03084683418274, "second epoch time": 23.860966682434082}
{"n_dataset_repeat": 1, "batch_size": 512, "n_gpus": 1, "first epoch time": 8.179916381835938, "second epoch time": 8.380159378051758}
{"n_dataset_repeat": 1, "batch_size": 512, "n_gpus": 1, "first epoch time": 23.048163652420044, "second epoch time": 22.975470066070557}
{"n_dataset_repeat": 4, "batch_size": 128, "n_gpus": 1, "first epoch time": 37.70245623588562, "second epoch time": 37.85469889640808}
{"n_dataset_repeat": 4, "batch_size": 128, "