
⚡ DeepSpeed ZeRO Stage 2 model parallel training #2

Merged

merged 2 commits into main from model/deepspeed on May 21, 2022
Conversation

@weiji14 weiji14 commented May 20, 2022

To prevent out-of-memory (OOM) errors when running the Transformer models, change from distributed data parallel (DDP) to a combined data and model parallel strategy.

Current State

As of 9793587, we have been using distributed data parallel (DDP) to split the data batch-wise across multiple GPUs. However, when running on a full-size Sentinel-2 image (batch_size=1) during the test phase (#1), this can already cause out-of-memory issues for our Super-Resolution Segmentation task.

Future State

One possible solution is to shard the neural network model itself across multiple GPUs. This reduces the GPU memory requirements and allows for larger models and/or bigger datasets to be used for training/inference.

Sharding model parameters, optimizers and gradients

Specifically, we'll be switching to DeepSpeed (https://github.com/microsoft/DeepSpeed), which offers several 'levels' of model sharding. See https://devblog.pytorchlightning.ai/experiment-with-billion-parameter-models-faster-using-deepspeed-and-meta-tensors-2e9c255edd71 and https://huggingface.co/blog/zero-deepspeed-fairscale for good explainers.

Main DeepSpeed stages (from https://pytorch-lightning.readthedocs.io/en/1.6.3/advanced/model_parallel.html#deepspeed):

  • DeepSpeed ZeRO Stage 1 - Shard optimizer states, remains at speed parity with DDP whilst providing memory improvement
  • DeepSpeed ZeRO Stage 2 - Shard optimizer states and gradients, remains at speed parity with DDP whilst providing even more memory improvement
  • DeepSpeed ZeRO Stage 3 - Shard optimizer states, gradients, parameters and optionally activations. Increases distributed communication volume, but provides even more memory improvement

💡 Suggestion: use Stage 2 instead of Stage 3, because while Stage 3 saves even more memory, it comes with increased latency from the extra distributed communication.

DeepSpeed Stage 1, 2, 3
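
As a rough sketch of what this looks like on the PyTorch Lightning side (the exact Trainer arguments in this repo may differ slightly), switching from DDP to ZeRO Stage 2 is mostly a one-line strategy change:

```python
from pytorch_lightning import Trainer

# Before: distributed data parallel, only the data batches are split across GPUs
# trainer = Trainer(accelerator="gpu", devices=2, strategy="ddp", precision=16)

# After: DeepSpeed ZeRO Stage 2, optimizer states + gradients sharded across GPUs
trainer = Trainer(
    accelerator="gpu",
    devices="auto",                # 1 GPU on the laptop or 2 GPUs on the HPC, no edits needed
    strategy="deepspeed_stage_2",  # built-in shorthand in recent pytorch-lightning releases
    precision=16,                  # mixed precision to further reduce memory use
)
```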

Other benefits of using DeepSpeed:

  • Stage 2 and Stage 3 also have an 'Offload' to CPU feature to save on memory, for cases when the GPU memory is simply not enough (see the sketch after this list)
  • Allows me to train the model on just 16GB of GPU RAM on my workstation 🤯
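
A hedged sketch of the offload variant (not what this PR uses, which sticks to plain Stage 2): PyTorch Lightning exposes it through another strategy shorthand.

```python
from pytorch_lightning import Trainer

# DeepSpeed ZeRO Stage 2 with optimizer states/gradients offloaded to CPU RAM,
# trading some speed for a smaller GPU memory footprint
trainer = Trainer(accelerator="gpu", devices="auto",
                  strategy="deepspeed_stage_2_offload", precision=16)
```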

Alternative strategies (and why they were not considered)

Pytorch-Lightning offers several other advanced training strategies. These might work well for other cases, but probably not for our specific project.

TODO:

DeepSpeed is a deep learning optimization library that makes distributed training and inference easy, efficient, and effective! Using PyPI source for now until conda-forge package is released.

Also need to install a newer gcc version to prevent the error `Your compiler (c++ 4.8.5) may be ABI-incompatible with PyTorch! Please use a compiler that is ABI-compatible with GCC 5.0 and above` on the HPC server.
Working towards conserving GPU memory for what matters (inference on full-size images). Using DeepSpeed ZeRO Stage 2, which shards optimizer states (Stage 1) and gradients (Stage 2) across multiple GPUs. Have set devices to auto instead of 2 so that I can run on 1 GPU on my laptop or 2 GPUs on the HPC server without changing values. Also needed to explicitly convert the input Sentinel-2 image tensor to float16 (if using 16-bit training) to avoid `RuntimeError: Input type (torch.cuda.ShortTensor) and weight type (torch.cuda.HalfTensor) should be the same`.
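
A minimal sketch of that dtype fix, using hypothetical names (`batch["image"]`, a LightningModule `test_step`) rather than the actual code in this repo:

```python
import torch

def test_step(self, batch: dict, batch_idx: int) -> torch.Tensor:
    # Sentinel-2 images load as int16 (torch.ShortTensor on GPU); under 16-bit
    # training the model weights are float16 (HalfTensor), so cast the input
    # to match and avoid the RuntimeError about mismatched input/weight types
    image: torch.Tensor = batch["image"]
    dtype = torch.float16 if self.trainer.precision == 16 else torch.float32
    image = image.to(dtype=dtype)

    y_hat: torch.Tensor = self(image)  # forward pass through the network
    return y_hat
```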
@weiji14 weiji14 added the enhancement New feature or request label May 20, 2022
@weiji14 weiji14 self-assigned this May 20, 2022
@@ -17,5 +19,6 @@ dependencies:
- conda-forge::rioxarray=0.10.1
- conda-forge::torchgeo=0.2.0
- pip:
  - deepspeed==0.6.4

TODO wait for conda-forge package at conda-forge/staged-recipes#19021 so this doesn't need to be installed from PyPI. Also check if using conda-forge package means gcc/gxx_linux-64 isn't needed.

@weiji14 weiji14 marked this pull request as ready for review May 21, 2022 19:45
@weiji14 weiji14 merged commit d7f391c into main May 21, 2022
@weiji14 weiji14 deleted the model/deepspeed branch May 21, 2022 19:45