# Proof-of-concept to Production: How to scale your Deep Learning Models

## Neural Collaborative Filtering

Neural Collaborative Filtering (NCF) is a simple DNN-based model for recommendation.

<img src="img/ncf_diagram.png" width="600" title="Neural Collaborative Filtering Model Overview">

## Stochastic Gradient Descent

$$ \min_{x \in \mathbb{R}^n} f(x) := \frac{1}{M} \sum_{i=1}^M f_i(x) $$

The training problem can be expressed as the equation above, where $f_i$ is the loss function for data point $i \in \{1, 2, \dots, M\}$ and $x$ is the vector of weights being optimized. Stochastic Gradient Descent (SGD) is often used to iteratively minimize this function with the update shown below.

$$ x_{k+1} = x_k - \alpha_k \frac{1}{|B_k|} \sum_{i \in B_k} \delta f_i(x_k) $$

where $B_k \subset \{1, 2, \dots, M\}$ is a batch sampled from the dataset, $\alpha_k$ is the learning rate, and $\delta f_i(x_k)$ denotes the gradient of $f_i$ at $x_k$.
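
The update rule can be tried out on a toy problem. The following is a minimal sketch in plain Python, assuming a one-dimensional weight and the per-example loss $f_i(x) = \frac{1}{2}(x - d_i)^2$, whose gradient is $x - d_i$. The function name `sgd_minibatch` and the toy data are illustrative only and are not part of the notebook's `utils` module.

```python
import random

def sgd_minibatch(data, x0, lr, batch_size, steps, seed=0):
    """Minimize f(x) = (1/M) * sum_i 0.5 * (x - d_i)**2 with mini-batch SGD."""
    rng = random.Random(seed)
    x = x0
    for _ in range(steps):
        # B_k: a batch sampled (without replacement) from the dataset
        batch = rng.sample(data, batch_size)
        # (1/|B_k|) * sum of per-example gradients; the gradient of f_i is (x - d_i)
        grad = sum(x - d for d in batch) / len(batch)
        # x_{k+1} = x_k - alpha_k * averaged gradient
        x = x - lr * grad
    return x

# Toy dataset: the minimizer of f is the mean of the data, 3.5
data = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# Full batch: every step uses the exact gradient
x_full = sgd_minibatch(data, x0=0.0, lr=0.5, batch_size=len(data), steps=50)

# Small batch: each step is noisy, so x hovers around the minimizer
x_mini = sgd_minibatch(data, x0=0.0, lr=0.1, batch_size=2, steps=500)

print(f"full batch: {x_full:.4f}, mini batch: {x_mini:.4f}")
```

With the full batch the iterate settles on the minimizer, while the small batch leaves it fluctuating nearby. That noise is part of what changes when the batch size is scaled up.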

In [None]:
# Setup
from torch import optim
import torch.nn as nn
import utils
import neumf
import math

# Process Data

def process_data():
    processed_data = utils.process_data()

    return processed_data

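# Keep the full tuple: the utils.train() calls below take it as `processed_data`
processed_data = process_data()
train_label, train_users, train_items, test_users, test_items, \
    dup_mask, real_indices, all_test_users, \
    nb_users, nb_items, mat = processed_data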

print('Load data done. #user=%d, #item=%d, #train=%d, #test=%d'
      % (nb_users, nb_items, len(train_users),
         nb_users))

In [None]:
# Initialize NCF Model
model = neumf.initialize_model(nb_users, nb_items)

# Initialize SGD Optimizer
learning_rate = 0.005
momentum = 0.9
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

In [6]:
#Takes ~20 minutes to run. In the interest of time, output is saved.
#Run Training
# batch_size = 4096
# utils.train(model, optimizer, processed_data, batch_size, learning_rate, 
#             mode="train", warmup_fn=None, scale_loss_fn=None)


Opt Learning rate 0.0050
Input Batch Size 4096

Epoch = 0
Train Throughput = 736327.9173
Epoch 0: HR@10 = 0.8325, NDCG@10 = 0.5457, train_time = 134.87, val_time = 0.58
New best hr!

Epoch = 1
Train Throughput = 921067.1092
Epoch 1: HR@10 = 0.8331, NDCG@10 = 0.5460, train_time = 107.82, val_time = 0.58
New best hr!

Epoch = 2
Train Throughput = 739430.0484
Epoch 2: HR@10 = 0.8345, NDCG@10 = 0.5468, train_time = 134.30, val_time = 0.58
New best hr!

Epoch = 3
Train Throughput = 918804.9649
Epoch 3: HR@10 = 0.8392, NDCG@10 = 0.5490, train_time = 108.08, val_time = 0.58
New best hr!

Epoch = 4
Train Throughput = 737332.3955
Epoch 4: HR@10 = 0.8485, NDCG@10 = 0.5554, train_time = 134.69, val_time = 0.57
New best hr!

Epoch = 5
Train Throughput = 735914.3612
Epoch 5: HR@10 = 0.8632, NDCG@10 = 0.5724, train_time = 134.95, val_time = 0.57
New best hr!

Epoch = 6
Train Throughput = 737593.7288
Epoch 6: HR@10 = 0.8830, NDCG@10 = 0.5979, train_time = 134.64, val_time = 0.58
New best hr!

Epoch 

0.9003848569963825

### SIZE MATTERS

$$ x_{k+1} = x_k - \alpha_k \frac{1}{|B_k|} \sum_{i \in B_k} \delta f_i(x_k) $$

Increasing the batch size exposes more parallelism, for example by letting a GPU handle computationally intensive subroutines such as matrix multiplications, or by using multiple cores to perform SGD in parallel.

In [None]:
batch_size = 4096 * 16
utils.train(model, optimizer, processed_data, batch_size, learning_rate, mode="perf")

However, when training with large batch sizes, accuracy drops of up to 5% have been observed, even for small networks, because the model generalizes less well.

<img src="img/resnet_bs_error.png" width="600" title="ImageNet top-1 validation error vs minibatch size">

In [None]:
batch_size = 4096 * 16

model = neumf.initialize_model(nb_users, nb_items)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)
utils.train(model, optimizer, processed_data, batch_size, learning_rate)

## Convergence and Scaling Efficiency: Linear Scaling Rule

* Linear Scaling Rule \[5\]: when the minibatch size is multiplied by k, multiply the learning rate by k.

* The intuition is straightforward: for the same number of epochs, increasing the batch size by a factor of k means k times fewer steps are taken, so increasing the step size by a factor of k compensates.

<img src="img/lr_scaling.png" width="1500" title="Visual Analogy of Linear Scaling">
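
The arithmetic behind the rule can be sketched as follows. The base learning rate and batch size are the ones used earlier in this notebook. The helper names and the dataset size `num_examples` are hypothetical and chosen only for illustration.

```python
import math

def scale_learning_rate(base_lr, base_batch_size, new_batch_size):
    """Linear scaling rule: multiply the learning rate by k = new / base batch size."""
    k = new_batch_size / base_batch_size
    return base_lr * k

def steps_per_epoch(num_examples, batch_size):
    """Number of optimizer steps needed to see every example once."""
    return math.ceil(num_examples / batch_size)

base_lr = 0.005           # baseline learning rate from the first training run
base_batch_size = 4096    # baseline batch size from the first training run
num_examples = 1_000_000  # hypothetical dataset size, for illustration only

for k in (1, 2, 8):
    batch_size = base_batch_size * k
    lr = scale_learning_rate(base_lr, base_batch_size, batch_size)
    steps = steps_per_epoch(num_examples, batch_size)
    print(f"k={k}: batch_size={batch_size}, lr={lr:.3f}, steps/epoch={steps}")
```

The number of steps per epoch shrinks roughly by a factor of k while the learning rate grows by the same factor.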

In [None]:
# Scale Batch Size Arbitrarily by 16x

batch_size = 4096 * 16
momentum = 0.9
learning_rate = # FILL ME

model = neumf.initialize_model(nb_users, nb_items)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

utils.train(model, optimizer, processed_data, batch_size, learning_rate)

In [None]:
#Clear memory and reload data if running into OOM Errors
# %reset -f
# train_label, train_users, train_items, test_users, test_items, \
#     dup_mask, real_indices, all_test_users, \
#     nb_users, nb_items, mat = process_data()

# Scale batch size by 192x, as it still fits on our GPU

batch_size = 4096 * 192
momentum = 0.9
learning_rate = # FILL ME

model = neumf.initialize_model(nb_users, nb_items)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)

utils.train(model, optimizer, processed_data, batch_size, learning_rate)

## Convergence and Scaling Efficiency: Warmup

* For even larger minibatches, the linear scaling rule has been shown to break down when the network is changing rapidly, which makes training unstable. When $\alpha_k$ is large, the update $\alpha_k |\delta f_i(x_k)|$ can be larger than $|x_k|$, causing divergence. This makes training highly dependent on the weight initialization and the initial learning rate. We can use warmup to combat this.

* Warmup is a method that gradually increases the learning rate at the start of training.
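
To make "gradually" concrete, here is a small standalone sketch of one possible schedule: an exponential ramp from 1% of the target learning rate up to the full value. It mirrors the formula used in the `warmup` function in the next cell. The name `warmup_lr` and the `start_fraction` parameter are illustrative assumptions.

```python
import math

def warmup_lr(target_lr, iter_i, warmup_iters, start_fraction=0.01):
    """Learning rate at iteration iter_i under an exponential warmup.

    Ramps from start_fraction * target_lr at iteration 0 up to target_lr
    at iteration warmup_iters, then stays constant.
    """
    if iter_i >= warmup_iters:
        return target_lr
    remaining = (warmup_iters - iter_i) / warmup_iters  # 1.0 at start, 0.0 at end
    return target_lr * math.exp(math.log(start_fraction) * remaining)

# Learning rate over the first 6 iterations with a 4-iteration warmup
schedule = [warmup_lr(1.0, i, warmup_iters=4) for i in range(6)]
print([round(lr, 4) for lr in schedule])
```

Other shapes, such as a linear ramp, are also common. The exponential form is used here only because it matches the cell that follows.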

In [None]:
import math

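# Exponential learning-rate warmup over the first `warmup_epochs` epochs
warmup_epochs = 0.1

def warmup(optimizer, iter_i, batches_per_epoch):
    """Set the optimizer's learning rate for iteration `iter_i`.

    Ramps from 1% of the global `learning_rate` up to its full value.
    """
    warmup_iters = warmup_epochs * batches_per_epoch
    if iter_i >= warmup_iters:
        lr_current = learning_rate
    else:
        # Factor grows exponentially from 0.01 (iteration 0) to 1.0 (end of warmup)
        warmup_factor = math.exp(math.log(0.01) * (warmup_iters - iter_i) / warmup_iters)
        lr_current = learning_rate * warmup_factor
    for grp in optimizer.param_groups:
        grp['lr'] = lr_current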

batch_size = 4096 * 192
learning_rate = 0.005 * 192
momentum = 0.9
model = neumf.initialize_model(nb_users, nb_items)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=momentum)


utils.train(model, optimizer, processed_data, batch_size, learning_rate, warmup_fn=warmup)

## Convergence and Scaling Efficiency: LARS

* Layer-wise Adaptive Rate Scaling (LARS) \[3\] is another popular and effective method for combating the instability caused by high learning rates.

* Standard SGD uses the same learning rate for all layers. When $\alpha_k$ is large, the update might be larger than the weight itself, causing divergence. LARS introduces a local learning rate $\lambda^l$ for each layer $l$, controlled by a trust coefficient $\eta < 1$.

$$ x_{k+1}^{l} = x_k^{l} - \alpha_k \lambda^l \delta f_i(x_k^l) $$

$$ \lambda^l = \eta \frac{||x_k^l||}{||\delta f_i(x_k^l)||} $$
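
The effect of the local learning rate can be checked with a few lines of plain Python. This is a simplified sketch that takes $\lambda^l$ as $\eta$ times the ratio of the weight norm to the gradient norm for a single layer. It omits the weight decay and momentum terms that full LARS implementations include, and the function names and the zero-gradient fallback are assumptions made for illustration.

```python
import math

def l2_norm(values):
    """Euclidean norm of a flat list of floats."""
    return math.sqrt(sum(v * v for v in values))

def lars_local_lr(weights, grads, trust_coefficient=0.001):
    """Local rate for one layer: eta * ||weights|| / ||grads||."""
    grad_norm = l2_norm(grads)
    if grad_norm == 0.0:
        return 0.0  # assumption: skip the update when the gradient is zero
    return trust_coefficient * l2_norm(weights) / grad_norm

def lars_step(weights, grads, global_lr, trust_coefficient=0.001):
    """One simplified LARS update for a single layer."""
    local_lr = lars_local_lr(weights, grads, trust_coefficient)
    return [w - global_lr * local_lr * g for w, g in zip(weights, grads)]

weights = [3.0, 4.0]              # ||weights|| = 5
small_grads = [30.0, 40.0]        # ||grads|| = 50
large_grads = [3000.0, 4000.0]    # same direction, 100x larger

for grads in (small_grads, large_grads):
    new_weights = lars_step(weights, grads, global_lr=1.0)
    step_size = l2_norm([n - w for n, w in zip(new_weights, weights)])
    print(f"||grads||={l2_norm(grads):.0f} -> step size {step_size:.4f}")
```

In this sketch the step size comes out as $\alpha_k \eta ||x_k^l||$ regardless of the gradient magnitude, which is why a large gradient in one layer can no longer overwhelm that layer's weights.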


<img src="img/lars.png" width="600" title="LARS: Alexnet-BN with B=8k">

In [None]:
import lars

model = neumf.initialize_model(nb_users, nb_items)
optimizer = lars.LARS(model.parameters(), lr=learning_rate, momentum=momentum)


utils.train(model, optimizer, processed_data, batch_size, learning_rate, warmup_fn=warmup)

## Computational Efficiency: Mixed Precision Training

Mixed precision training uses half precision while maintaining the network accuracy achieved with single precision, resulting in:
* **Increased throughput**
* **Increased Capability** due to reduced memory footprint enabling larger batch sizes/models

<img src="img/mp_training.png" width="600" title="Mixed Precision Speedups">

In SSD, for example, 31% of gradient values become zero because they are not representable in FP16, causing the model to diverge \[2\]. How do we overcome this limitation? Loss scaling!

<img src="img/fp16_gradients.png" width="600" title="Histogram of activation gradient magnitudes throughout FP32 training of Multibox SSD network">

Enabling mixed precision involves two steps: 
* Porting the model to use the half-precision data type where appropriate
* Adding loss scaling to preserve small gradient values
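
The underflow problem and the effect of loss scaling can be reproduced without a GPU. The sketch below uses Python's `struct` module, whose `"e"` format rounds a float to IEEE 754 half precision. The gradient value and the scale factor are hypothetical numbers picked for illustration, not values taken from SSD or NCF.

```python
import struct

def to_fp16(value):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]

grad = 1e-8          # hypothetical gradient, below the smallest positive FP16 value
loss_scale = 1024.0  # hypothetical scale factor

# Without scaling, the gradient underflows to zero when stored in FP16
unscaled_fp16 = to_fp16(grad)

# Scaling the loss scales every gradient by the same factor (chain rule),
# which shifts small values into the representable FP16 range
scaled_fp16 = to_fp16(grad * loss_scale)

# Unscale in higher precision before the weight update
recovered = scaled_fp16 / loss_scale

print(f"FP16 without scaling: {unscaled_fp16}")
print(f"FP16 with scaling:    {scaled_fp16}")
print(f"recovered gradient:   {recovered}")
```

The recovered value is close to the original gradient instead of zero. Libraries that automate mixed precision typically also adjust the scale factor during training. This sketch keeps it fixed for simplicity.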

**As easy as adding three lines of code in PyTorch, TensorFlow and MXNet.**

In [None]:
batch_size = 4096 * 192
learning_rate = 0.005 * 192
warmup_epochs = 0.1
momentum = 0.9

model = neumf.initialize_model(nb_users, nb_items)
optimizer = lars.LARS(model.parameters(), lr=learning_rate, momentum=momentum, weight_decay=0.0001)

import apex.amp as amp
model, optimizer = amp.initialize(model, optimizer, opt_level="O2")

def scale_loss(optimizer, loss):
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    return
utils.train(model, optimizer, processed_data, batch_size, learning_rate, 
            warmup_fn=warmup, scale_loss_fn=scale_loss)

# What else can we do?


##### References

\[1\] https://github.com/NVIDIA/DeepLearningExamples/tree/master/PyTorch/Recommendation/NCF

\[2\] P. Micikevicius, S. Narang, J. Alben, G. F. Diamos, E. Elsen, D. Garcia, B. Ginsburg, M. Houston, O. Kuchaiev, G. Venkatesh, and H. Wu. Mixed precision training. CoRR, abs/1710.03740, 2017

\[3\] Yang You, Igor Gitman, Boris Ginsburg. Large Batch Training of Convolutional Networks. arXiv:1708.03888

\[4\] Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le. Don't Decay the Learning Rate, Increase the Batch Size. arXiv:1711.00489

\[5\] Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, and Kaiming He. Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour. arXiv preprint arXiv:1706.02677, 2017.

\[6\] https://github.com/noahgolmant/pytorch-lars