⚡ DeepSpeed ZeRO Stage 2 model parallel training #2
To prevent out-of-memory (OOM) errors when running the Transformer models, change from distributed data parallel (DDP) to a combined data + model parallel strategy.
Current State
As of 9793587, we have been using distributed data parallel (DDP) to split the data batch-wise across multiple GPUs. However, when running on a full-size Sentinel-2 image (batch_size=1) during the test phase (#1), this already causes out-of-memory issues for our Super-Resolution Segmentation task.
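For reference, a minimal sketch of the current DDP setup in PyTorch Lightning (the `SegmentationModel` and `Sentinel2DataModule` names are hypothetical placeholders for illustration, not the actual module names in this repo):

```python
import pytorch_lightning as pl

# Hypothetical LightningModule/LightningDataModule names, for illustration only.
model = SegmentationModel()
datamodule = Sentinel2DataModule(batch_size=1)

# Current approach: DDP replicates the full model on every GPU and only splits
# the batch, so a single full-size Sentinel-2 image can still exhaust one GPU.
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="ddp")
trainer.fit(model, datamodule=datamodule)
```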
Future State
One possible solution is to shard the neural network model itself across multiple GPUs. This reduces the per-GPU memory requirements and allows larger models and/or bigger datasets to be used for training/inference.
Specifically, we'll be switching to DeepSpeed (https://github.com/microsoft/DeepSpeed), which offers several 'stages' of model sharding. See https://devblog.pytorchlightning.ai/experiment-with-billion-parameter-models-faster-using-deepspeed-and-meta-tensors-2e9c255edd71 and https://huggingface.co/blog/zero-deepspeed-fairscale for good explainers.
Main DeepSpeed ZeRO stages (from https://pytorch-lightning.readthedocs.io/en/1.6.3/advanced/model_parallel.html#deepspeed):
- Stage 1: shards the optimizer states across GPUs
- Stage 2: additionally shards the gradients
- Stage 3: additionally shards the model parameters (weights)
💡 Suggest using Stage 2 instead of Stage 3, because while Stage 3 further reduces memory use, it comes with increased latency from the cost of extra distributed communication (see the sketch below).
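A minimal sketch of how the switch to ZeRO Stage 2 might look with PyTorch Lightning's built-in DeepSpeed strategy (the `deepspeed_stage_2` shorthand and `DeepSpeedStrategy` class are from the PyTorch Lightning 1.6 docs linked above; the model/datamodule names remain hypothetical placeholders):

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DeepSpeedStrategy

model = SegmentationModel()                  # hypothetical placeholder
datamodule = Sentinel2DataModule(batch_size=1)

# Shorthand form: ZeRO Stage 2 shards optimizer states + gradients across GPUs.
trainer = pl.Trainer(accelerator="gpu", devices=2, strategy="deepspeed_stage_2")

# Equivalent explicit form, which also exposes extra knobs such as CPU offloading.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=DeepSpeedStrategy(stage=2, offload_optimizer=False),
    precision=16,  # DeepSpeed is typically run with mixed precision
)
trainer.fit(model, datamodule=datamodule)
```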
Other benefits of using DeepSpeed:
Alternative strategies (and why they were not considered)
PyTorch Lightning offers several other advanced training strategies. These might work well for other cases, but probably not for our specific project.
TODO:
Use Meta Tensors, c.f. https://devblog.pytorchlightning.ai/experiment-with-billion-parameter-models-faster-using-deepspeed-and-meta-tensors-2e9c255edd71. Currently blocked by `NotImplementedError: Could not run 'aten::_local_scalar_dense' with arguments from the 'Meta' backend`. See also the general MPS op coverage tracking issue pytorch/pytorch#77764.
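For context on that error, a minimal standalone sketch of what meta tensors do and where the `aten::_local_scalar_dense` limitation shows up (plain PyTorch, no project code assumed):

```python
import torch

# A layer built on the "meta" device carries only shapes/dtypes; no memory is
# allocated, which is what lets very large models be instantiated cheaply.
layer = torch.nn.Linear(in_features=13, out_features=4, device="meta")
x = torch.empty(2, 13, device="meta")

y = layer(x)
print(y.shape)  # shape propagation works: torch.Size([2, 4])

# Anything that needs a concrete value fails, e.g. y[0, 0].item() raises:
# NotImplementedError: Could not run 'aten::_local_scalar_dense' with
# arguments from the 'Meta' backend ...
```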