# Deepspeed 2 & 3 Validation
The model being trained uses the same settings as the Raven 1B5 model:
- Layer count: 24
- Embed size: 2048

The goal is to validate the trainer across deepspeed 2 & 3 - with and without offload. All other training params remain constant.

Note: you will need a setup with two or more GPUs that can handle deepspeed 2 (minimum 24GB * 2).

> This project assumes you have the rwkv-infctx conda env set up, and that you are executing in that environment. See the main README.md for the conda env setup steps.
>
> It also assumes you have completed the `baseline-setup.ipynb` and the environment setup found in the project `README.md`.

## What do deepspeed 2 & 3 do (with/without CPU offload)?

Plain data parallelism (aka DDP / DeepSpeed 1) simply splits the dataset being trained and keeps a full copy of nearly everything on every GPU. Deepspeed 2 and 3 go further than that.

Deepspeed 2 keeps a full copy of the model weights on each GPU, but splits the gradient descent memory usage (gradients and optimizer states) across the GPUs, or offloads it into CPU memory (+ CPU offload option).

Deepspeed 3 takes it a step further and distributes the model weights across all the GPUs, drastically lowering the VRAM requirement while drastically increasing the amount of GPU-to-GPU traffic. Gradient descent memory is still split across the GPUs, with the option to offload it into CPU memory (same as deepspeed 2).

Finally, deepspeed 3 also introduces options to offload model weights and gradient descent memory further, into CPU memory or NVMe. However, these options were not enabled or explored in the following benchmarks.

See more here: https://huggingface.co/docs/transformers/main_classes/deepspeed
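
As a rough illustration of the split described above, the per-GPU memory for model states can be estimated with the accounting commonly used for ZeRO with mixed-precision Adam. This is a minimal sketch under stated assumptions, not a measurement: the byte counts (2 for weights, 2 for gradients, 12 for optimizer states per parameter) are assumed rather than taken from this project's config, `zero_model_state_gb` is a hypothetical helper name, and activations, buffers, and caches are ignored, so the result will not match the measured VRAM figures in the benchmark.

```python
def zero_model_state_gb(params, num_gpus, stage,
                        bytes_weights=2, bytes_grads=2, bytes_optim=12):
    """Estimate per-GPU memory (GB) for model states only.

    Assumed byte counts follow the usual mixed-precision Adam accounting;
    activations, temporary buffers and caches are NOT included.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")

    weights = params * bytes_weights
    grads = params * bytes_grads
    optim = params * bytes_optim

    if stage >= 1:   # optimizer states are sharded across GPUs
        optim /= num_gpus
    if stage >= 2:   # gradients are sharded as well
        grads /= num_gpus
    if stage >= 3:   # model weights are sharded as well
        weights /= num_gpus

    return (weights + grads + optim) / 1e9


if __name__ == "__main__":
    # 1.5e9 parameters is an approximation of a "1B5" model.
    for stage in (0, 2, 3):
        print(f"stage {stage}: ~{zero_model_state_gb(1.5e9, 2, stage):.1f} GB per GPU")
```

With only two GPUs the gap between stage 2 and stage 3 is small under this accounting, because the sharded weights are a minor share of the total; the gap widens as the GPU count grows.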

## Benchmark results

The benchmark was done on 11th July 2023, with Torch 2.0.1 and CUDA 11.8, on two separate machines via vast.ai:

- 2 x A5000 (AMD EPYC 7513 - per gpu specs: 24gb, pcie4x16, 34.4 TFlops, nvlinked)
- 2 x 3090 (AMD EPYC 7302 - per gpu specs: 24gb, pcie3x16, 44.1 TFlops)

| Deepspeed Strat       | Time (A5000)          | Time (3090)           | VRAM Usage       | RAM Usage | Validation Loss |
| --------------------- | --------------------- | --------------------- | ---------------- | --------- | --------------- |
| Stage 2               | 24 mins : 55 sec      | 35 mins : 04 sec      | ~22.3 + 23.8 GB  | ~85 GB    | 6.173           |
| Stage 2 + CPU offload | 43 mins : 08 sec      | 59 mins : 04 sec      | ~9.7 + 10.3 GB   | ~128 GB   | 6.124           |
| Stage 3               | 29 mins : 12 sec      | 50 mins : 04 sec      | ~23.0 + 23.2 GB^ | ~85 GB    | 5.665           |
| Stage 3 + CPU offload | 1hr : 42mins : 38 sec | 1hr : 29mins : 15 sec | ~7.0 + 7.3 GB    | ~145 GB   | 5.668           |

---

> ^ Note: in theory deepspeed 3 uses less VRAM than deepspeed 2. However, it will also try to use more memory than it needs for "cache" items if possible, maxing out at the same level as deepspeed 2 here.

> Torch.JIT was enabled for deepspeed 2, but was disabled for deepspeed 3 (not compatible). Torch.compile was disabled.

> Full report with charts and figures can be found at WANDB : https://wandb.ai/picocreator/RWKV-InfCtx-Validation/reports/-RWKV-infctx-DeepSpeed-2-3-comparisons--Vmlldzo0ODM5MjAy
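
To compare the strategies at a glance, the wall-clock times from the table can be converted to seconds and expressed relative to stage 2. The sketch below only re-uses the numbers reported in the table; `TIMES` and `to_seconds` are hypothetical names introduced for illustration.

```python
import re

# Wall-clock times taken from the benchmark table: (A5000, 3090).
TIMES = {
    "Stage 2":               ("24 mins : 55 sec", "35 mins : 04 sec"),
    "Stage 2 + CPU offload": ("43 mins : 08 sec", "59 mins : 04 sec"),
    "Stage 3":               ("29 mins : 12 sec", "50 mins : 04 sec"),
    "Stage 3 + CPU offload": ("1hr : 42mins : 38 sec", "1hr : 29mins : 15 sec"),
}

UNIT_SECONDS = {"hr": 3600, "mins": 60, "sec": 1}


def to_seconds(text):
    """Parse strings such as '1hr : 42mins : 38 sec' into seconds."""
    parts = re.findall(r"(\d+)\s*(hr|mins|sec)", text)
    return sum(int(number) * UNIT_SECONDS[unit] for number, unit in parts)


# Stage 2 is used as the baseline for both machines.
base_a5000, base_3090 = (to_seconds(t) for t in TIMES["Stage 2"])

for name, (a5000, rtx3090) in TIMES.items():
    ratio_a5000 = to_seconds(a5000) / base_a5000
    ratio_3090 = to_seconds(rtx3090) / base_3090
    print(f"{name:<22} A5000 x{ratio_a5000:.2f}   3090 x{ratio_3090:.2f}")
```

On these numbers, stage 3 with CPU offload takes roughly 4.1x as long as stage 2 on the A5000 pair and roughly 2.5x as long on the 3090 pair, while stage 3 without offload costs roughly 1.2x and 1.4x respectively.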

## Configure and apply your preferred settings

Adjust your desired deepspeed settings and GPU device count.

Enable or disable WANDB here as well (enabled by default, as we need the loss curve for this experiment).

(Note: you will need to rerun this cell if you restart your env.)

In [1]:
GPU_DEVICES="auto"
ENABLE_WANDB=True
WANDB_PREFIX="infctx-deepspeed-deterministic"

print("ENABLE_WANDB:", ENABLE_WANDB)
print("GPU_DEVICES:", GPU_DEVICES)

if ENABLE_WANDB:
    WANDB_MODE="online"
else:
    WANDB_MODE="disabled"

ENABLE_WANDB: True
GPU_DEVICES: auto
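
The four run cells in this notebook differ only in the `--trainer.strategy` value, the run name, and whether `RWKV_JIT_ON=0` is exported. A minimal sketch that assembles the same shell commands from one template is shown here; `RUNS` and `build_command` are hypothetical names introduced for illustration and are not part of the trainer, while the flags and paths are the ones used in the cells of this notebook.

```python
import shlex

# (strategy name, JIT enabled) pairs, matching the four run cells.
RUNS = [
    ("deepspeed_stage_2", True),
    ("deepspeed_stage_2_offload", True),
    ("deepspeed_stage_3", False),          # JIT is disabled for stage 3
    ("deepspeed_stage_3_offload", False),
]


def build_command(strategy, jit_on,
                  wandb_prefix="infctx-deepspeed-deterministic",
                  gpu_devices="auto", wandb_mode="online"):
    """Return the shell command line for one validation run."""
    run_name = f"{wandb_prefix} ({strategy}, train-ctx=1024, data-ctx=1024)"

    steps = ["cd ../../RWKV-v4neo"]
    if not jit_on:
        steps.append("export RWKV_JIT_ON=0")
    steps.append(f"export WANDB_MODE={shlex.quote(wandb_mode)}")

    args = [
        "python3", "lightning_trainer.py", "fit",
        "-c", "../notebook/trainer-validation/config/baseline-1024.yaml",
        f"--trainer.logger.init_args.name={run_name}",
        f"--trainer.strategy={strategy}",
        f"--trainer.devices={gpu_devices}",
    ]
    # shlex.join quotes the run name, which contains spaces and parentheses.
    steps.append(shlex.join(args))
    return " && ".join(steps)


if __name__ == "__main__":
    for strategy, jit_on in RUNS:
        print(build_command(strategy, jit_on))
        print()
```

Keeping the template in one place makes it easier to confirm that all other training params stay constant between runs.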


# Deepspeed 2
Perform a full 1-epoch training run with training context size = 1024, using deepspeed 2.

In [2]:
!cd ../../RWKV-v4neo && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c ../notebook/trainer-validation/config/baseline-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_2, train-ctx=1024, data-ctx=1024)" \
        --trainer.strategy="deepspeed_stage_2" \
        --trainer.devices="{GPU_DEVICES}"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
Global seed set to 3941088705
wandb: Currently logged in as: picocreator. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.5 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in ./wandb/run-20230710_201455-u3419fsk
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run infctx-deepspeed-deterministic (deepspeed_stage_2, train-ctx=1024, data-ctx=1024)
wandb: ⭐️ View project at https://wandb.ai/picocreator/RWKV-InfCtx-Validation
wandb: 🚀 View run at https://wandb.ai/picocreator/RWKV-InfCtx-Validation/runs/u3419fsk
Using /root/.cache/torch_extensio

# Deepspeed 2 + Offload
Perform a full 1-epoch training run with training context size = 1024, using deepspeed 2 + offload.

In [3]:
!cd ../../RWKV-v4neo && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c ../notebook/trainer-validation/config/baseline-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_2_offload, train-ctx=1024, data-ctx=1024)" \
        --trainer.strategy="deepspeed_stage_2_offload" \
        --trainer.devices="{GPU_DEVICES}"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-jit' with torch '2.0.1+cu118'
Global seed set to 3941088705
wandb: Currently logged in as: picocreator. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.5 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in ./wandb/run-20230710_205104-ral9a67a
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run infctx-deepspeed-deterministic (deepspeed_stage_2_offload, train-ctx=1024, data-ctx=1024)
wandb: ⭐️ View project at https://wandb.ai/picocreator/RWKV-InfCtx-Validation
wandb: 🚀 View run at https://wandb.ai/picocreator/RWKV-InfCtx-Validation/runs/ral9a67a
Using /root/.cache/torch_

# Deepspeed 3
Perform a full 1-epoch training run with training context size = 1024, using deepspeed 3.

In [4]:
!cd ../../RWKV-v4neo && \
    export RWKV_JIT_ON=0 && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c ../notebook/trainer-validation/config/baseline-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_3, train-ctx=1024, data-ctx=1024)" \
        --trainer.strategy="deepspeed_stage_3" \
        --trainer.devices="{GPU_DEVICES}"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-native' with torch '2.0.1+cu118'
Global seed set to 3941088705
wandb: Currently logged in as: picocreator. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.5 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in ./wandb/run-20230710_215131-7jwlhk35
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run infctx-deepspeed-deterministic (deepspeed_stage_3, train-ctx=1024, data-ctx=1024)
wandb: ⭐️ View project at https://wandb.ai/picocreator/RWKV-InfCtx-Validation
wandb: 🚀 View run at https://wandb.ai/picocreator/RWKV-InfCtx-Validation/runs/7jwlhk35
Using /root/.cache/torch_exten

# Deepspeed 3 + offload
Perform a full 1-epoch training run with training context size = 1024, using deepspeed 3 + offload.

In [5]:
!cd ../../RWKV-v4neo && \
    export RWKV_JIT_ON=0 && \
    export WANDB_MODE="{WANDB_MODE}" && \
    python3 lightning_trainer.py fit \
        -c ../notebook/trainer-validation/config/baseline-1024.yaml \
        --trainer.logger.init_args.name="{WANDB_PREFIX} (deepspeed_stage_3_offload, train-ctx=1024, data-ctx=1024)" \
        --trainer.strategy="deepspeed_stage_3_offload" \
        --trainer.devices="{GPU_DEVICES}"

Setting ds_accelerator to cuda (auto detect)
[RWKV.model] Running RWKV model using 'torch-native' with torch '2.0.1+cu118'
Global seed set to 3941088705
wandb: Currently logged in as: picocreator. Use `wandb login --relogin` to force relogin
wandb: wandb version 0.15.5 is available!  To upgrade, please run:
wandb:  $ pip install wandb --upgrade
wandb: Tracking run with wandb version 0.15.4
wandb: Run data is saved locally in ./wandb/run-20230710_224256-9hkg3yg1
wandb: Run `wandb offline` to turn off syncing.
wandb: Syncing run infctx-deepspeed-deterministic (deepspeed_stage_3_offload, train-ctx=1024, data-ctx=1024)
wandb: ⭐️ View project at https://wandb.ai/picocreator/RWKV-InfCtx-Validation
wandb: 🚀 View run at https://wandb.ai/picocreator/RWKV-InfCtx-Validation/runs/9hkg3yg1
Using /root/.cache/tor