# Comprehensive Distributed LLM Training Experiments

This notebook runs one training job per configuration to compare DeepSpeed ZeRO stages 1, 2, and 3 on 1, 2, 3, and 4 GPUs, alongside a single-GPU baseline.
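The experiment grid can be enumerated programmatically instead of being typed out cell by cell. The sketch below is illustrative only: `build_command` is a hypothetical helper, and the script paths and flag names are copied from the launch cells in this notebook (the ZeRO-3 cells pass the config with `--deepspeed`, the others with `--deepspeed_config`). The two executed ZeRO-2 cells use paths relative to a different working directory, so adjust the paths to wherever the notebook is actually run.

```python
from itertools import product

# Stages and GPU counts covered by the cells in this notebook.
STAGES = (1, 2, 3)
GPU_COUNTS = (1, 2, 3, 4)


def build_command(stage: int, num_gpus: int) -> str:
    """Return the deepspeed launch command for one (stage, GPU count) pair.

    Hypothetical helper: it only mirrors the command strings used in the
    notebook cells and does not validate that the files exist.
    """
    # The ZeRO-3 cells use --deepspeed; the ZeRO-1/2 cells use --deepspeed_config.
    config_flag = "--deepspeed" if stage == 3 else "--deepspeed_config"
    return (
        f"deepspeed --num_gpus={num_gpus} training/train_deepspeed_zero{stage}.py "
        f"--dataset_path ./data/glaive_code_full "
        f"{config_flag} configs/ds_config_zero{stage}.json "
        f"--num_train_epochs 1"
    )


# One command per (stage, GPU count) combination.
commands = [build_command(stage, gpus) for stage, gpus in product(STAGES, GPU_COUNTS)]

for command in commands:
    print(command)
```

Printing the commands first, rather than executing them, makes it easy to check the grid before committing GPU hours to it.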

## 1. Install Dependencies

In [None]:
!pip install -r requirements.txt

## 2. Prepare Dataset

In [1]:
!python ../scripts/prepare_dataset.py

PREPARING GLAIVE-CODE-ASSISTANT DATASET

[1/4] Loading dataset from HuggingFace...
Dataset: glaiveai/glaive-code-assistant
✓ Loaded 136,109 samples

[2/4] Using full dataset (136,109 samples)

[3/4] Formatting conversations for Llama 2...
Converting question-answer pairs to Llama 2 chat format...
Formatting conversations: 100%|█| 136109/136109 [00:05<00:00, 23943.43 examples/
✓ Formatted 136,109 conversations

----------------------------------------------------------------------
EXAMPLE FORMATTED CONVERSATION (first 500 chars):
----------------------------------------------------------------------
<s>[INST] How can I output bold text in Bash? I have a Bash script that prints some text to the screen using the `echo "Some Text"` command. Is there a way I can format the text to make it bold? [/INST] Yes, you can format the output text in Bash to make it bold. Bash allows you to use special escape sequences for text decoration. To make some text bold in bash, you would use the escape sequ

## 3. Baseline Training (1 GPU)

In [None]:
!python training/train_baseline.py \
    --dataset_path ./data/glaive_code_full \
    --num_train_epochs 1

## 4. DeepSpeed ZeRO Stage 1 Experiments

### ZeRO-1 with 1 GPU

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=1 training/train_deepspeed_zero1.py --dataset_path ./data/glaive_code_full --deepspeed_config configs/ds_config_zero1.json --num_train_epochs 1

### ZeRO-1 with 2 GPUs

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=2 training/train_deepspeed_zero1.py --dataset_path ./data/glaive_code_full --deepspeed_config configs/ds_config_zero1.json --num_train_epochs 1

### ZeRO-1 with 3 GPUs

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=3 training/train_deepspeed_zero1.py --dataset_path ./data/glaive_code_full --deepspeed_config configs/ds_config_zero1.json --num_train_epochs 1

### ZeRO-1 with 4 GPUs

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=4 training/train_deepspeed_zero1.py --dataset_path ./data/glaive_code_full --deepspeed_config configs/ds_config_zero1.json --num_train_epochs 1

## 5. DeepSpeed ZeRO Stage 2 Experiments

### ZeRO-2 with 1 GPU

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=1 train_deepspeed_zero2.py --dataset_path data/glaive_code_full --deepspeed_config ../configs/ds_config_zero2.json --num_train_epochs 1

Detected VISIBLE_DEVICES=0 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2025-11-13 23:34:48,147] [INFO] [runner.py:630:main] cmd = /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMF19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None --log_level=info train_deepspeed_zero2.py --dataset_path data/glaive_code_full --deepspeed_config ../configs/ds_config_zero2.json --num_train_epochs 1


[2025-11-13 23:35:01,327] [INFO] [launch.py:162:main] WORLD INFO DICT: {'localhost': [0]}
[2025-11-13 23:35:01,327] [INFO] [launch.py:168:main] nnodes=1, num_local_procs=1, node_rank=0
[2025-11-13 23:35:01,327] [INFO] [launch.py:179:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0]})
[2025-11-13 23:35:01,327] [INFO] [launch.py:180:main] dist_world_size=1
[2025-11-13 23:35:01,327] [INFO] [launch.py:184:main] Setting CUDA_VISIBLE_DEVICES=0
[2025-11-13 23:35:01,344] [INFO] [launch.py:272:main] process 2241266 spawned with command: ['/home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/python3', '-u', 'train_deepspeed_zero2.py', '--local_rank=0', '--dataset_path', 'data/glaive_code_full', '--deepspeed_config', '../configs/ds_config_zero2.json', '--num_train_epochs', '1']


DEEPSPEED ZeRO-2 TRAINING

Experiment: zero2_1gpu
GPUs: 1
ZeRO Stage: 2
Config: ../configs/ds_config_zero2.json
Output: ./checkpoints/zero2_1gpu

NOTE: Using DeepSpeed ZeRO-2 for optimizer + gradient state partitioning
Expected: Lower memory usage than ZeRO-1, similar speed on 1 GPU

Single GPU Training
  GPU: Tesla V100-SXM2-32GB
  Memory: 34.1 GB

[1/5] Loading tokenizer...
Tokenizer loaded

[2/5] Loading model...


`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 2/2 [00:29<00:00, 14.68s/it]


Model loaded

[3/5] Applying LoRA (r=16)...
trainable params: 16,777,216 || all params: 6,755,192,832 || trainable%: 0.2484
LoRA applied

[4/5] Loading dataset...
Dataset ready: 136,109 samples

[5/5] Configuring training with DeepSpeed...
✓ Trainer configured with DeepSpeed ZeRO-2

Effective batch size: 1
  = 1 (per_device) × 1 (grad_accum) × 1 (GPUs)

STARTING TRAINING: ZERO2_1GPU



  0%|          | 0/136109 [00:00<?, ?it/s]
`use_cache=True` is incompatible with gradient checkpointing. Setting `use_cache=False`.


{'loss': 0.9429, 'grad_norm': 0.5983909368515015, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.8647, 'grad_norm': 1.2069456577301025, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.7048, 'grad_norm': 0.969007670879364, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.71, 'grad_norm': 0.6146083474159241, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.7665, 'grad_norm': 0.5691312551498413, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.7654, 'grad_norm': 0.4435938000679016, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.9338, 'grad_norm': 1.056951880455017, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.7799, 'grad_norm': 0.6003472805023193, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.9445, 'grad_norm': 0.4042268991470337, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.6808, 'grad_norm': 0.7908876538276672, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.6573, 'grad_norm': 0.7943319082260132, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.8843, 'grad_no

  1%|          | 980/136109 [07:39<12:02:00,  3.12it/s] 

{'loss': 0.825, 'grad_norm': 0.7595116496086121, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7383, 'grad_norm': 0.8015678524971008, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.6939, 'grad_norm': 0.7927616238594055, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7404, 'grad_norm': 0.5571944117546082, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.8011, 'grad_norm': 0.6059478521347046, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7123, 'grad_norm': 0.6414658427238464, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7796, 'grad_norm': 0.6385908126831055, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.8706, 'grad_norm': 1.821088194847107, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.6274, 'grad_norm': 0.9049996137619019, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7857, 'grad_norm': 0.7886815071105957, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7431, 'grad_norm': 0.7908366322517395, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.6

  1%|▏         | 1930/136109 [15:15<13:34:27,  2.75it/s] 

{'loss': 0.7674, 'grad_norm': 0.7560083270072937, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.8152, 'grad_norm': 0.7469887137413025, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7372, 'grad_norm': 0.9750287532806396, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.6844, 'grad_norm': 0.6634159088134766, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7256, 'grad_norm': 0.6087028384208679, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7093, 'grad_norm': 0.596566379070282, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.594, 'grad_norm': 0.6430320143699646, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7352, 'grad_norm': 0.5753172636032104, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.8181, 'grad_norm': 0.5369955897331238, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.7185, 'grad_norm': 0.7640063762664795, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.6714, 'grad_norm': 0.7889512777328491, 'learning_rate': 0.0002, 'epoch': 0.01}
{'loss': 0.6

  2%|▏         | 2856/136109 [22:31<12:31:38,  2.95it/s] 

{'loss': 0.8802, 'grad_norm': 0.7793275117874146, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.6845, 'grad_norm': 1.0517269372940063, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.8086, 'grad_norm': 0.8578470945358276, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.6924, 'grad_norm': 0.6484431624412537, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.7156, 'grad_norm': 1.0211008787155151, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.7601, 'grad_norm': 0.7176656723022461, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.5683, 'grad_norm': 0.8353300094604492, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.7841, 'grad_norm': 1.085649847984314, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.7207, 'grad_norm': 0.7614701986312866, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.7413, 'grad_norm': 0.6306690573692322, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.6833, 'grad_norm': 0.6472669839859009, 'learning_rate': 0.0002, 'epoch': 0.02}
{'loss': 0.
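The log above reports 16,777,216 trainable parameters for LoRA with r=16. A LoRA adapter on a weight matrix of shape d_out × d_in adds r × (d_in + d_out) parameters. The notebook does not list the target modules, so the sketch below rests on an assumption: four 4096 × 4096 attention projections adapted in each of 32 layers (the usual Llama 2 7B dimensions). That configuration happens to reproduce the logged count, but other configurations could as well. `lora_param_count` is a hypothetical helper.

```python
def lora_param_count(rank: int, d_in: int, d_out: int,
                     modules_per_layer: int, num_layers: int) -> int:
    """Count the parameters added by LoRA adapters of a given rank.

    Each adapted matrix gets two low-rank factors: A (rank x d_in) and
    B (d_out x rank), so it contributes rank * (d_in + d_out) parameters.
    """
    per_module = rank * (d_in + d_out)
    return per_module * modules_per_layer * num_layers


# ASSUMPTION: 4 adapted 4096x4096 projections per layer, 32 layers.
# The rank (16) and the total parameter count are taken from the log above.
trainable = lora_param_count(rank=16, d_in=4096, d_out=4096,
                             modules_per_layer=4, num_layers=32)
all_params = 6_755_192_832  # "all params" as reported in the log

trainable_pct = f"{100 * trainable / all_params:.4f}"
print(f"trainable params: {trainable:,} || trainable%: {trainable_pct}")
```

Under these assumptions the result matches the `trainable params` and `trainable%` values printed by the training script.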

### ZeRO-2 with 2 GPUs

In [20]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=2 train_deepspeed_zero2.py --dataset_path data/glaive_code_full --deepspeed_config ../configs/ds_config_zero2.json --num_train_epochs 1

Detected VISIBLE_DEVICES=0 but ignoring it because one or several of --include/--exclude/--num_gpus/--num_nodes cl args were used. If you want to use CUDA_VISIBLE_DEVICES don't pass any of these arguments to deepspeed.
[2025-11-12 23:28:47,534] [INFO] [runner.py:630:main] cmd = /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/python3 -u -m deepspeed.launcher.launch --world_info=eyJsb2NhbGhvc3QiOiBbMCwgMV19 --master_addr=127.0.0.1 --master_port=29500 --enable_each_rank_log=None --log_level=info train_deepspeed_zero2.py --dataset_path data/glaive_code_full --deepspeed_config ../configs/ds_config_zero2.json --num_train_epochs 1


[2025-11-12 23:28:57,665] [INFO] [launch.py:162:main] WORLD INFO DICT: {'localhost': [0, 1]}
[2025-11-12 23:28:57,666] [INFO] [launch.py:168:main] nnodes=1, num_local_procs=2, node_rank=0
[2025-11-12 23:28:57,666] [INFO] [launch.py:179:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1]})
[2025-11-12 23:28:57,666] [INFO] [launch.py:180:main] dist_world_size=2
[2025-11-12 23:28:57,666] [INFO] [launch.py:184:main] Setting CUDA_VISIBLE_DEVICES=0,1
[2025-11-12 23:28:57,695] [INFO] [launch.py:272:main] process 4091255 spawned with command: ['/home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/python3', '-u', 'train_deepspeed_zero2.py', '--local_rank=0', '--dataset_path', 'data/glaive_code_full', '--deepspeed_config', '../configs/ds_config_zero2.json', '--num_train_epochs', '1']
[2025-11-12 23:28:57,719] [INFO] [launch.py:272:main] process 4091256 spawned with command: ['/home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/python3', '-u', 'trai


DEEPSPEED ZeRO-2 TRAINING

Experiment: zero2_2gpu
GPUs: 2
ZeRO Stage: 2
Config: ../configs/ds_config_zero2.json
Output: ./checkpoints/zero2_2gpu

NOTE: Using DeepSpeed ZeRO-2 for optimizer + gradient state partitioning
Expected: Greater memory savings + 2x speedup with 2 GPUs

Distributed Training: 2 GPUs
  GPU 0: Tesla V100-SXM2-32GB
  GPU 1: Tesla V100-SXM2-32GB

[1/5] Loading tokenizer...
Tokenizer loaded

[2/5] Loading model...


`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 33.01it/s]
Loading checkpoint shards: 100%|██████████| 2/2 [00:00<00:00, 33.05it/s]


Model loaded

[3/5] Applying LoRA (r=16)...
trainable params: 16,777,216 || all params: 6,755,192,832 || trainable%: 0.2484
LoRA applied

[4/5] Loading dataset...
Dataset ready: 136,109 samples

[5/5] Configuring training with DeepSpeed...


Traceback (most recent call last):
  File "/home/chaudhari.paw/distributed-llm-training-inference/training/train_deepspeed_zero2.py", line 393, in <module>
    main()
  File "/home/chaudhari.paw/distributed-llm-training-inference/training/train_deepspeed_zero2.py", line 261, in main
    training_args = TrainingArguments(
  File "<string>", line 135, in __init__
  File "/home/chaudhari.paw/distributed-llm-training-inference/myenv/lib64/python3.9/site-packages/transformers/training_args.py", line 1811, in __post_init__
    self.device
  File "/home/chaudhari.paw/distributed-llm-training-inference/myenv/lib64/python3.9/site-packages/transformers/training_args.py", line 2355, in device
    return self._setup_devices
  File "/usr/lib64/python3.9/functools.py", line 993, in __get__
    val = self.func(instance)
  File "/home/chaudhari.paw/distributed-llm-training-inference/myenv/lib64/python3.9/site-packages/transformers/training_args.py", line 2282, in _setup_devices
    self.distributed_st

[2025-11-12 23:29:33,756] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 4091255
[2025-11-12 23:29:33,959] [INFO] [launch.py:335:sigkill_handler] Killing subprocess 4091256
[2025-11-12 23:29:33,959] [ERROR] [launch.py:341:sigkill_handler] ['/home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/python3', '-u', 'train_deepspeed_zero2.py', '--local_rank=1', '--dataset_path', 'data/glaive_code_full', '--deepspeed_config', '../configs/ds_config_zero2.json', '--num_train_epochs', '1'] exits with return code = 1


CalledProcessError: Command 'b'source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate\ndeepspeed --num_gpus=2 train_deepspeed_zero2.py --dataset_path data/glaive_code_full --deepspeed_config ../configs/ds_config_zero2.json --num_train_epochs 1\n'' returned non-zero exit status 1.
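When a multi-GPU launch fails, it helps to first confirm which devices the launcher was asked to use. The `--world_info` argument in the `runner.py` log lines above is base64-encoded JSON, and decoding it should give the same mapping as the `WORLD INFO DICT` line. The sketch below decodes the two values copied from the logs in this section; `decode_world_info` is a hypothetical helper name. This only verifies the launcher's view of the GPUs and does not explain the truncated traceback.

```python
import base64
import json


def decode_world_info(encoded: str) -> dict:
    """Decode a base64 JSON world_info string into a host -> GPU-list mapping."""
    return json.loads(base64.urlsafe_b64decode(encoded))


# Values copied from the runner.py log lines in the ZeRO-2 outputs.
one_gpu = decode_world_info("eyJsb2NhbGhvc3QiOiBbMF19")
two_gpu = decode_world_info("eyJsb2NhbGhvc3QiOiBbMCwgMV19")

print(one_gpu)
print(two_gpu)
```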

### ZeRO-2 with 3 GPUs

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=3 training/train_deepspeed_zero2.py --dataset_path ./data/glaive_code_full --deepspeed_config configs/ds_config_zero2.json --num_train_epochs 1

### ZeRO-2 with 4 GPUs

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=4 training/train_deepspeed_zero2.py --dataset_path ./data/glaive_code_full --deepspeed_config configs/ds_config_zero2.json --num_train_epochs 1
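The ZeRO-2 1-GPU log in this section reports the effective batch size as per_device × grad_accum × GPUs, and its progress bar shows 136,109 steps for 136,109 samples. The sketch below applies the same formula to the other GPU counts. It assumes that the per-device batch size and gradient accumulation both stay at 1 in the multi-GPU runs, which the notebook does not confirm, and the exact step count also depends on how the trainer handles the final partial batch. Both function names are hypothetical.

```python
import math

DATASET_SIZE = 136_109  # samples reported by the dataset preparation step


def effective_batch_size(per_device: int, grad_accum: int, num_gpus: int) -> int:
    """Samples consumed per optimizer step, as in the training log formula."""
    return per_device * grad_accum * num_gpus


def steps_per_epoch(dataset_size: int, per_device: int,
                    grad_accum: int, num_gpus: int) -> int:
    """Approximate optimizer steps per epoch, rounding the last batch up."""
    batch = effective_batch_size(per_device, grad_accum, num_gpus)
    return math.ceil(dataset_size / batch)


# ASSUMPTION: per_device=1 and grad_accum=1 for every GPU count.
for num_gpus in (1, 2, 3, 4):
    batch = effective_batch_size(1, 1, num_gpus)
    steps = steps_per_epoch(DATASET_SIZE, 1, 1, num_gpus)
    print(f"{num_gpus} GPU(s): effective batch {batch}, about {steps:,} steps per epoch")
```

Because the effective batch size grows with the GPU count while the learning rate in the logs is fixed at 0.0002, runs at different GPU counts are not strictly like-for-like in optimization terms, which is worth keeping in mind when comparing loss curves.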

## 6. DeepSpeed ZeRO Stage 3 Experiments

### ZeRO-3 with 1 GPU

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=1 training/train_deepspeed_zero3.py --dataset_path ./data/glaive_code_full --deepspeed configs/ds_config_zero3.json --num_train_epochs 1

### ZeRO-3 with 2 GPUs

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=2 training/train_deepspeed_zero3.py --dataset_path ./data/glaive_code_full --deepspeed configs/ds_config_zero3.json --num_train_epochs 1

### ZeRO-3 with 3 GPUs

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=3 training/train_deepspeed_zero3.py --dataset_path ./data/glaive_code_full --deepspeed configs/ds_config_zero3.json --num_train_epochs 1

### ZeRO-3 with 4 GPUs

In [None]:
%%bash
source /home/chaudhari.paw/distributed-llm-training-inference/myenv/bin/activate
deepspeed --num_gpus=4 training/train_deepspeed_zero3.py --dataset_path ./data/glaive_code_full --deepspeed configs/ds_config_zero3.json --num_train_epochs 1
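To compare runs once they finish, the per-step dicts that the trainer prints (as in the ZeRO-2 1-GPU output) can be parsed from captured cell output. The sketch below is a minimal example, not part of the training scripts: `parse_log_records` is a hypothetical helper, and the sample text is a few lines copied from that output, including one truncated line, which is skipped.

```python
import ast
from statistics import mean

# Sample lines copied from the ZeRO-2 1-GPU output; the last one is truncated.
sample_output = """\
{'loss': 0.9429, 'grad_norm': 0.5983909368515015, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.8647, 'grad_norm': 1.2069456577301025, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.7048, 'grad_norm': 0.969007670879364, 'learning_rate': 0.0002, 'epoch': 0.0}
{'loss': 0.8843, 'grad_no
"""


def parse_log_records(text: str) -> list:
    """Extract complete per-step metric dicts from raw training output."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("{'loss'"):
            continue  # not a metrics line (progress bar, warning, etc.)
        try:
            record = ast.literal_eval(line)
        except (SyntaxError, ValueError):
            continue  # truncated or malformed line
        if isinstance(record, dict):
            records.append(record)
    return records


records = parse_log_records(sample_output)
losses = [record["loss"] for record in records]
print(f"{len(losses)} complete records, mean loss {mean(losses):.4f}")
```

`ast.literal_eval` is used instead of `json.loads` because the trainer prints Python dict literals with single quotes, which are not valid JSON.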