
Adding support for multi-node DDP training #708

Merged
quic-akuruvil merged 2 commits into quic:main from smedhe:multi_node_ddp on Jan 13, 2026

Conversation

@smedhe (Contributor) commented Jan 7, 2026

Add support for multi-node Distributed Data Parallel (DDP) training to the QEfficient finetuning pipeline. This enables scaling training across multiple nodes while keeping the existing single-node behavior unchanged.

Commands for DDP across 2 servers:

On the primary machine (whose IP is used as the master address), use node-rank 0:

```bash
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=2 --nproc-per-node=4 --node-rank=0 \
  --master_addr=<MASTER_NODE_IP> --master_port=8000 \
  -m QEfficient.cloud.finetune --device qaic --enable_ddp --seed 0 \
  --model_name "meta-llama/Llama-3.2-1B" --dataset alpaca_dataset \
  --train_batch_size 1 --val_batch_size 1 --num_epochs 1 \
  --max_train_step 200 --max_eval_step 50
```

On node 1, use node-rank 1:

```bash
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=2 --nproc-per-node=4 --node-rank=1 \
  --master_addr=<MASTER_NODE_IP> --master_port=8000 \
  -m QEfficient.cloud.finetune --device qaic --enable_ddp --seed 0 \
  --model_name "meta-llama/Llama-3.2-1B" --dataset alpaca_dataset \
  --train_batch_size 1 --val_batch_size 1 --num_epochs 1 \
  --max_train_step 200 --max_eval_step 50
```
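As a reference for how the ranks line up across the two nodes, here is a minimal sketch assuming only the standard environment variables that torchrun exports to each worker; nothing here is specific to this PR:

```python
import os

# torchrun exports these to every worker it spawns.
# For the commands above: --nnodes=2, --nproc-per-node=4.
local_rank = int(os.environ["LOCAL_RANK"])            # 0..3, index within the node
node_rank = int(os.environ["GROUP_RANK"])             # 0 on the primary machine, 1 on node 1
procs_per_node = int(os.environ["LOCAL_WORLD_SIZE"])  # 4

# Global rank and world size across both nodes.
world_size = int(os.environ["WORLD_SIZE"])            # 2 * 4 = 8
rank = int(os.environ["RANK"])                        # 0..7
assert rank == node_rank * procs_per_node + local_rank
```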

@quic-akuruvil (Contributor) left a comment

Please add the sample command for the reference interface, along with a description.

@quic-swatia (Contributor)

For PP, device_map.py uses get_rank(), which earlier returned the local rank but now returns the global rank. This will not break PP for now, but it's a wrong practice; if PP is used with multi-node training, it would break. Please call get_local_rank() in device_map once it is defined.
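For context, a minimal sketch of the distinction being discussed; get_local_rank below is a hypothetical helper (not necessarily the final name or location in this PR), and the point is that device selection should key off the per-node index rather than the global rank:

```python
import os
import torch.distributed as dist

def get_rank() -> int:
    # Global rank across all nodes, e.g. 0..7 for a 2-node x 4-process run.
    return dist.get_rank() if dist.is_initialized() else 0

def get_local_rank() -> int:
    # Hypothetical helper: rank within the current node (0..3), as set by torchrun.
    return int(os.environ.get("LOCAL_RANK", "0"))

# On node 1, the third worker sees get_rank() == 6 but get_local_rank() == 2.
# Only the local rank is a valid index into that node's visible QAIC devices,
# so device_map should use get_local_rank(); the two values coincide only on a
# single node, which is why using get_rank() has not broken PP so far.
```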

@quic-swatia (Contributor)

> For PP, device_map.py uses get_rank(), which earlier returned the local rank but now returns the global rank. This will not break PP for now, but it's a wrong practice; if PP is used with multi-node training, it would break. Please call get_local_rank() in device_map once it is defined.

After this change, we can also test whether PP + DDP across multiple nodes works with this PR. It seems like it might. It's okay if it doesn't; we can still go ahead with the PR.

@quic-meetkuma (Contributor) left a comment

Looks fine to me. Please update the finetune.md file. Imagine and other teams refer to the commands from there; they will need to change the command on their side.

@quic-akuruvil (Contributor) left a comment

Please also start a CI run for this, so that CI passes before merge.

@quic-swatia (Contributor) left a comment

The PR is good to merge with the latest changes. Please make sure these commands are added to the finetuning documentation. Thanks.

@quic-akuruvil (Contributor)

> Looks fine to me. Please update the finetune.md file. Imagine and other teams refer to the commands from there; they will need to change the command on their side.

Added as part of PR 717

quic-akuruvil merged commit c76d5ea into quic:main on Jan 13, 2026
4 checks passed
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 4, 2026
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 4, 2026
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 4, 2026
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 5, 2026
qcdipankar pushed a commit to qcdipankar/efficient-transformers that referenced this pull request Feb 8, 2026
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 16, 2026