# Distributed Mixed-precision Training with PyTorch and NVIDIA `Apex`

## What is `Apex`?
A PyTorch extension with NVIDIA-maintained utilities that streamline mixed-precision and distributed training. It covers the features of PyTorch's built-in Distributed Data Parallel (DDP) package, integrates more tightly with NVIDIA GPUs, and accelerates training through mixed precision.

Most deep learning frameworks, including PyTorch, train using 32-bit floating point (FP32) arithmetic by default. However, many state-of-the-art deep neural networks (DNNs) do not need FP32 for every operation to reach full accuracy. In mixed-precision training, the majority of the network runs in FP16 arithmetic, while potentially unstable operations are automatically cast to FP32.

Key points:
- Weight updates are carried out in FP32.
- The loss is scaled to prevent gradients from underflowing.
- A few operations (e.g. large reductions) are left in FP32.
- Everything else (the majority of the network) is executed in FP16.
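
The first two key points can be checked numerically. The snippet below is a minimal sketch using only the Python standard library, not Apex itself: the helper `to_fp16` and the values `grad`, `loss_scale`, `weight`, and `update` are illustrative assumptions, and Python's 64-bit floats stand in for FP32.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest FP16 value (struct format 'e')."""
    return struct.unpack("e", struct.pack("e", x))[0]


# 1) Loss scaling: a tiny gradient underflows to zero when stored in FP16 ...
grad = 1e-8
unscaled_fp16 = to_fp16(grad)

# ... but survives if the loss (and therefore the gradient) is scaled up first.
loss_scale = 1024.0  # illustrative value, assumed for this example
scaled_fp16 = to_fp16(grad * loss_scale)
recovered = scaled_fp16 / loss_scale  # unscale in higher precision

# 2) FP32 weight updates: a small update vanishes when added to an FP16 weight.
weight, update = 1.0, 1e-4
fp16_result = to_fp16(to_fp16(weight) + to_fp16(update))  # rounds back to 1.0
high_precision_result = weight + update                   # keeps the update

print("FP16 gradient without scaling:", unscaled_fp16)
print("Gradient recovered after scaling:", recovered)
print("FP16 weight after update:", fp16_result)
print("Higher-precision weight after update:", high_precision_result)
```

The gradient is lost without scaling and recovered to within FP16 rounding error with it, and the weight update only takes effect when the accumulation happens in higher precision.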

## Why `Apex`?

- Comes with all the distributed training features of the built-in PyTorch DDP.
- Performs better than the built-in DDP.
- Reduces memory storage and bandwidth demands by up to 2x, since FP16 values take half the space of FP32 values.
- Allows larger batch sizes.
- Takes advantage of NVIDIA Tensor Cores for matrix multiplications and convolutions.
- Does not require explicitly converting your model or the input data with `half()`.

## How to use `Apex`?
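A typical setup might look like the sketch below. It is a hedged outline rather than the contents of the full training script: the function names `setup_apex` and `train_step`, the `make_optimizer` callable, and the choice of `opt_level="O1"` are assumptions made for illustration. It also assumes one process per GPU, started by a launcher that sets the environment variables needed by `init_method="env://"`. The Apex calls used are `amp.initialize`, `amp.scale_loss`, and `apex.parallel.DistributedDataParallel`.

```python
def setup_apex(model, make_optimizer, local_rank, opt_level="O1"):
    """Prepare a model/optimizer pair for distributed mixed-precision training.

    `make_optimizer` is a hypothetical callable that takes the model
    parameters and returns a torch optimizer.
    """
    # Imported lazily so this module can be loaded without a GPU or Apex.
    import torch
    from apex import amp
    from apex.parallel import DistributedDataParallel as ApexDDP

    # Bind this process to its GPU, then join the process group.
    torch.cuda.set_device(local_rank)
    torch.distributed.init_process_group(backend="nccl", init_method="env://")

    model = model.cuda()
    optimizer = make_optimizer(model.parameters())

    # Let amp patch the model and optimizer for mixed precision,
    # then wrap the model for gradient synchronisation across processes.
    model, optimizer = amp.initialize(model, optimizer, opt_level=opt_level)
    model = ApexDDP(model)
    return model, optimizer


def train_step(model, optimizer, criterion, images, targets):
    """Run one optimisation step with loss scaling."""
    from apex import amp

    optimizer.zero_grad()
    loss = criterion(model(images), targets)
    # Scale the loss so that small FP16 gradients do not underflow.
    with amp.scale_loss(loss, optimizer) as scaled_loss:
        scaled_loss.backward()
    optimizer.step()
    return loss.item()
```

In this sketch the model and input data are never converted with `half()`; with `opt_level="O1"` the casting is left to `amp`.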

## Bells and Whistles

### How to prevent race conditions when multiple devices try to log or print?
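One common approach is to let only a single process (usually rank 0) write output. The sketch below assumes the launcher exports a `RANK` environment variable, as `torchrun` does; the helper names `get_rank`, `is_main_process`, and `print_once` are hypothetical.

```python
import os


def get_rank() -> int:
    """Return this process's global rank, defaulting to 0 when not distributed."""
    return int(os.environ.get("RANK", "0"))


def is_main_process() -> bool:
    """True only for the process that should handle logging and printing."""
    return get_rank() == 0


def print_once(*args, **kwargs) -> None:
    """Print from the main process only, so output is not duplicated or interleaved."""
    if is_main_process():
        print(*args, **kwargs)


print_once("epoch 0 started")
```

The same guard can wrap checkpoint saving or any other side effect that should happen once per job rather than once per GPU.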

### How to use `TensorBoard` in a distributed context?
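A simple pattern is to create the writer on one process only, so that several processes do not write event files for the same run. The sketch below assumes a `RANK` environment variable set by the launcher and uses `torch.utils.tensorboard.SummaryWriter`; the helpers `make_summary_writer` and `log_scalar` and the log directory are hypothetical.

```python
import os


def make_summary_writer(log_dir):
    """Return a SummaryWriter on rank 0 and None on every other rank."""
    if int(os.environ.get("RANK", "0")) != 0:
        return None
    # Imported lazily so non-main ranks never touch TensorBoard.
    from torch.utils.tensorboard import SummaryWriter
    return SummaryWriter(log_dir=log_dir)


def log_scalar(writer, tag, value, step):
    """Write a scalar if this process owns a writer; otherwise do nothing."""
    if writer is not None:
        writer.add_scalar(tag, value, step)
```

Training code can then call `log_scalar(writer, "train/loss", loss_value, step)` on every rank without extra rank checks. Note that the value logged this way is the one seen by rank 0 unless it is first averaged across processes.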

## The full ImageNet training script
Please see: `imagenet_ddp_apex.py`