
Adding support for multi-node DDP training #708

Merged
quic-akuruvil merged 2 commits into quic:main from smedhe:multi_node_ddp on Jan 13, 2026

Conversation

@smedhe (Contributor) commented Jan 7, 2026

Add support for multi-node Distributed Data Parallel (DDP) training to the QEfficient finetuning pipeline. This enables scaling training across multiple nodes while keeping the existing single-node behavior unchanged.

Commands for DDP across 2 servers:

On the primary machine (whose IP is used as the master address), use node-rank 0:

```bash
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=2 --nproc-per-node=4 --node-rank=0 \
  --master_addr=<MASTER_NODE_IP> --master_port=8000 \
  -m QEfficient.cloud.finetune --device qaic --enable_ddp --seed 0 \
  --model_name "meta-llama/Llama-3.2-1B" --dataset alpaca_dataset \
  --train_batch_size 1 --val_batch_size 1 --num_epochs 1 \
  --max_train_step 200 --max_eval_step 50
```

On node 1, use node-rank 1:

```bash
QAIC_VISIBLE_DEVICES=0,1,2,3 torchrun --nnodes=2 --nproc-per-node=4 --node-rank=1 \
  --master_addr=<MASTER_NODE_IP> --master_port=8000 \
  -m QEfficient.cloud.finetune --device qaic --enable_ddp --seed 0 \
  --model_name "meta-llama/Llama-3.2-1B" --dataset alpaca_dataset \
  --train_batch_size 1 --val_batch_size 1 --num_epochs 1 \
  --max_train_step 200 --max_eval_step 50
```
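As a reference for how the ranks line up across the two nodes, here is a minimal sketch assuming only the standard environment variables that torchrun exports to each worker; nothing here is specific to this PR:

```python
import os

# torchrun exports these to every worker it spawns.
# For the commands above: --nnodes=2, --nproc-per-node=4.
local_rank = int(os.environ["LOCAL_RANK"])            # 0..3, index within the node
node_rank = int(os.environ["GROUP_RANK"])             # 0 on the primary machine, 1 on node 1
procs_per_node = int(os.environ["LOCAL_WORLD_SIZE"])  # 4

# Global rank and world size across both nodes.
world_size = int(os.environ["WORLD_SIZE"])            # 2 * 4 = 8
rank = int(os.environ["RANK"])                        # 0..7
assert rank == node_rank * procs_per_node + local_rank
```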

@quic-akuruvil (Contributor) left a comment

Please add the sample command for the reference interface, along with a description.

@quic-swatia (Contributor)

For PP, device_map.py uses get_rank(), which earlier returned the local rank but now returns the global rank. This will not break PP for now, but it's a wrong practice; if PP is used with multi-node training, it would break. Please call get_local_rank() in device_map once it is defined.
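For context, a minimal sketch of the distinction being discussed; get_local_rank below is a hypothetical helper (not necessarily the final name or location in this PR), and the point is that device selection should key off the per-node index rather than the global rank:

```python
import os
import torch.distributed as dist

def get_rank() -> int:
    # Global rank across all nodes, e.g. 0..7 for a 2-node x 4-process run.
    return dist.get_rank() if dist.is_initialized() else 0

def get_local_rank() -> int:
    # Hypothetical helper: rank within the current node (0..3), as set by torchrun.
    return int(os.environ.get("LOCAL_RANK", "0"))

# On node 1, the third worker sees get_rank() == 6 but get_local_rank() == 2.
# Only the local rank is a valid index into that node's visible QAIC devices,
# so device_map should use get_local_rank(); the two values coincide only on a
# single node, which is why using get_rank() has not broken PP so far.
```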

@quic-swatia (Contributor)

> For PP, device_map.py uses get_rank(), which earlier returned the local rank but now returns the global rank. This will not break PP for now, but it's a wrong practice; if PP is used with multi-node training, it would break. Please call get_local_rank() in device_map once it is defined.

After this change, we can also test whether PP + DDP across multiple nodes works with this PR. It seems like it might. It's okay if it doesn't; we can still go ahead with the PR.

@quic-meetkuma (Contributor) left a comment

Looks fine to me. Please update the finetune.md file. Imagine and other teams refer to the commands from there; they will need to change the command on their side.

@quic-akuruvil (Contributor) left a comment

Please also start a CI run for this, so that CI passes before merge.

@quic-swatia (Contributor) left a comment

The PR is good to merge with the latest changes. Please make sure these commands are added to the finetuning documentation. Thanks.

@quic-akuruvil (Contributor)

> Looks fine to me. Please update the finetune.md file. Imagine and other teams refer to the commands from there; they will need to change the command on their side.

Added as part of PR 717

quic-akuruvil merged commit c76d5ea into quic:main on Jan 13, 2026
4 checks passed
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 4, 2026
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 4, 2026
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 4, 2026
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 5, 2026
qcdipankar pushed a commit to qcdipankar/efficient-transformers that referenced this pull request Feb 8, 2026
tchawada pushed a commit to tchawada/QEff_tanisha that referenced this pull request Feb 16, 2026