['dp_shard_mod_ep', 'dp_shard_in_ep', 'tp'], [4, 16, 4] errors out #1570

Description

@vwxyzjn

Hi TorchTitan team,

I am trying to run

PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml" uv run  --no-sync ./run_train.sh \
--training.steps=100000 \
--training.dataset=c4_test \
--training.local_batch_size=1 \
--parallelism.data_parallel_shard_degree=-1 \
--parallelism.tensor_parallel_degree=4 \
--parallelism.pipeline_parallel_degree=1 \
--parallelism.expert_parallel_degree=64 \
--parallelism.expert_tensor_parallel_degree=1 \
--metrics.log_freq=1 \

and got the following output:

+ export NGPU=8
+ NGPU=8
+ export LOG_RANK=0
+ LOG_RANK=0
+ export CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml
+ CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml
+ overrides=
+ '[' 9 -ne 0 ']'
+ overrides='--training.steps=100000 --training.dataset=c4_test --training.local_batch_size=1 --parallelism.data_parallel_shard_degree=-1 --parallelism.tensor_parallel_degree=4 --parallelism.pipeline_parallel_degree=1 --parallelism.expert_parallel_degree=64 --parallelism.expert_tensor_parallel_degree=1 --metrics.log_freq=1'
+ export TORCHFT_LIGHTHOUSE=http://localhost:29510
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ '[' -z 32 ']'
+ '[' 32 = 1 ']'
+ RDZV_ID=1
+ RDZV_HOST=costa-671b-5-worker-0.costa-671b-5
+ torchrun --nnodes 32 --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint costa-671b-5-worker-0.costa-671b-5:29500 --rdzv_id 1 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml --training.steps=100000 --training.dataset=c4_test --training.local_batch_size=1 --parallelism.data_parallel_shard_degree=-1 --parallelism.tensor_parallel_degree=4 --parallelism.pipeline_parallel_degree=1 --parallelism.expert_parallel_degree=64 --parallelism.expert_tensor_parallel_degree=1 --metrics.log_freq=1
W0814 06:32:53.078000 316 torch/distributed/run.py:803] 
W0814 06:32:53.078000 316 torch/distributed/run.py:803] *****************************************
W0814 06:32:53.078000 316 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W0814 06:32:53.078000 316 torch/distributed/run.py:803] *****************************************
[rank0]:/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:410: UserWarning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (Triggered internally at /pytorch/c10/core/AllocatorConfig.cpp:28.)
[rank0]:  torch._C._cuda_init()
[rank0]:[titan] 2025-08-14 06:33:13,382 - root - WARNING - tokenizer_path is deprecated, use model.hf_assets_path instead. Setting hf_assets_path to tokenizer_path temporarily.
[rank0]:[titan] 2025-08-14 06:33:13,382 - root - INFO - Starting job: DeepSeek-V3 671B model training
[rank0]:[titan] 2025-08-14 06:33:14,873 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:[titan] 2025-08-14 06:33:14,879 - root - INFO - Building 3-D device mesh with ['dp_shard_mod_ep', 'dp_shard_in_ep', 'tp'], [4, 16, 4]
[rank0]:[titan] 2025-08-14 06:33:14,890 - root - INFO - [GC] Initial GC collection 0.00 seconds
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO cudaDriverVersion 12090
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Bootstrap: Using eth0:10.1.68.153<0>
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Comm config Blocking set to 1
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Plugin name set by env to libnccl-net.so
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Loaded net plugin Libfabric (v10)
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v10 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Successfully loaded external plugin libnccl-net.so
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.15.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Using Libfabric version 2.1
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Using CUDA driver version 12090 with runtime 12090
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Configuring AWS-specific options
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 since this platform does not require a network flush.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Internode latency set at 75.0 us
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Using transport protocol RDMA (platform set)
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Selected provider is efa, fabric is efa-direct (found 32 nics)
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Setting NCCL_TOPO_FILE environment variable to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Creating one domain per process
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap82s0: 578a1dff7f090501
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[0]: 00000000000000000a01449900000000
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap99s0: 7b7dd78e6c201c01
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[1]: 00000000000000000a01449900000001
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap116s0: 7ba4fd7c85464600
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[2]: 00000000000000000a01449900000002
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap133s0: dd4414e66c5d5d00
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[3]: 00000000000000000a01449900000003
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap150s0: 3917a554ff838300
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[4]: 00000000000000000a01449900000004
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap167s0: b9b56ca9a49b9b00
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[5]: 00000000000000000a01449900000005
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap184s0: d90ac2d4f4bebe00
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[6]: 00000000000000000a01449900000006
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap201s0: 7b7607630dd6d600
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[7]: 00000000000000000a01449900000007
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Support for global registrations: true
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Initialized NET plugin Libfabric
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Assigned NET plugin Libfabric to comm
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Using network Libfabric
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO DMA-BUF is available on GPU device 0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO ncclCommInitRankScalable comm 0x2bff7340 rank 8 nranks 256 cudaDev 0 nvmlDev 0 busId 53000 commId 0xc11aff4bf0a5a0b0 - Init START
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO RAS client listening socket at ::1<28028>
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Bootstrap timings total 0.828328 (create 0.000043, send 0.000325, recv 0.000577, ring 0.826804, delay 0.000000)
[rank0]:[titan] 2025-08-14 06:33:21,267 - root - INFO - Loading tokenizer from tokenizer.json
[rank0]:[titan] 2025-08-14 06:33:21,491 - root - INFO - Preparing c4_test dataset from tests/assets/c4_test
[rank0]:[titan] 2025-08-14 06:33:21,520 - root - INFO - Building deepseek_v3 671B with DeepSeekV3ModelArgs(_enforced='This field is used to enforce all fields have defaults.', max_batch_size=8, max_seq_len=4096, dtype='fp8', vocab_size=129280, dim=7168, inter_dim=18432, moe_inter_dim=2048, n_layers=61, n_dense_layers=3, n_heads=128, norm_eps=1e-05, moe_args=MoEArgs(num_experts=256, num_shared_experts=1, score_func='sigmoid', route_norm=True, route_scale=2.5, score_before_experts=False, top_k=8, use_grouped_mm=True, load_balance_coeff=0.001), n_expert_groups=8, n_limited_groups=4, q_lora_rank=1536, kv_lora_rank=512, qk_nope_head_dim=128, qk_rope_head_dim=64, v_head_dim=128, use_flex_attn=False, attn_mask_type='causal', original_seq_len=4096, rope_theta=10000.0, rope_factor=40, beta_fast=32, beta_slow=1, mscale=1.0)
[rank0]:[titan] 2025-08-14 06:33:22,298 - root - INFO - CUDA capacity: NVIDIA H100 80GB HBM3 with 79.10GiB memory
[rank0]:[titan] 2025-08-14 06:33:22,615 - root - INFO - Total parameter count: dense 14,456,871,936, sparse 656,569,532,416, active 37,552,282,624
[rank0]:[titan] 2025-08-14 06:33:22,615 - root - INFO - Model deepseek_v3 671B size: 671,026,404,352 total parameters
[rank0]:[titan] 2025-08-14 06:33:22,822 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:[titan] 2025-08-14 06:33:22,958 - root - INFO - Applied full activation checkpointing to the model
[rank0]:[titan] 2025-08-14 06:33:23,606 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-08-14 06:33:25,109 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-08-14 06:33:25,109 - root - INFO - CUDA memory usage for model: 12.16GiB(15.38%)
[rank0]:[titan] 2025-08-14 06:33:25,114 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-08-14 06:33:25,114 - root - INFO - Trainer is initialized with local batch size 1, global batch size 64, gradient accumulation steps 1, sequence length 4096, total steps 100000 (warmup 2000)
[rank0]:[titan] 2025-08-14 06:33:25,114 - root - INFO - Training starts at step 1
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO MNNVL busId 0x53000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NVLS multicast support is available on dev 0 (NVLS_NCHANNELS 16)
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO comm 0x2bff7340 rank 8 nRanks 256 nNodes 32 localRanks 8 localRank 0 MNNVL 0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Trees [0] 9/-1/-1->8->16 [1] -1/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/-1/-1->8->15 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->15 [8] 9/16/0->8->24 [9] -1/-1/-1->8->15 [10] 9/-1/-1->8->15 [11] 9/-1/-1->8->15 [12] 9/-1/-1->8->15 [13] 9/-1/-1->8->15 [14] 9/-1/-1->8->15 [15] 9/-1/-1->8->15
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NCCL_P2P_NET_CHUNKSIZE set by environment to 524288.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO P2P Chunksize set to 524288
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so. 
[rank0]:costa-671b-5-worker-1:382:568 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 99
[rank0]:costa-671b-5-worker-1:382:561 [0] NCCL INFO [Proxy Service] Device 0 CPU core 25
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NCCL_NVLS_CHUNKSIZE set by environment to 524288.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO threadThresholds 8/8/64 | 2048/8/64 | 512 | 512
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 1 p2p channels per peer
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO ncclCommInitRankScalable comm 0x2bff7340 rank 8 nranks 256 cudaDev 0 nvmlDev 0 busId 53000 commId 0xc11aff4bf0a5a0b0 - Init COMPLETE
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Init timings - ncclCommInitRankScalable: rank 8 nranks 256 total 4.94 (kernels 0.25, alloc 2.23, bootstrap 0.83, allgathers 0.15, topo 1.20, graphs 0.01, connections 0.26, rest 0.01)
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 02/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 04/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 06/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 10/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 12/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 14/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 07/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 09/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 11/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 13/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 15/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO NCCL_NET_FORCE_FLUSH set by environment to 0.
[rank0]:costa-671b-5-worker-1:382:590 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 14
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 00/0 : 8[0] -> 16[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 08/0 : 8[0] -> 16[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Comm config Blocking set to 1
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Assigned NET plugin Libfabric to comm
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Using network Libfabric
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO DMA-BUF is available on GPU device 0
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO ncclCommInitRankConfig comm 0x3a5f7a70 rank 2 nranks 64 cudaDev 0 nvmlDev 0 busId 53000 commId 0x891e39595d474161 - Init START
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Bootstrap timings total 0.745780 (create 0.000038, send 0.000275, recv 0.040516, ring 0.704749, delay 0.000000)
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO MNNVL busId 0x53000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO comm 0x3a5f7a70 rank 2 nRanks 64 nNodes 32 localRanks 2 localRank 0 MNNVL 0
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Trees [0] 3/-1/-1->2->4 [1] -1/-1/-1->2->3 [2] 3/-1/-1->2->4 [3] -1/-1/-1->2->3 [4] 3/4/0->2->6 [5] -1/-1/-1->2->3 [6] 3/4/0->2->6 [7] -1/-1/-1->2->3
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO P2P Chunksize set to 524288
[rank0]:costa-671b-5-worker-1:382:663 [0] NCCL INFO [Proxy Service] Device 0 CPU core 26
[rank0]:costa-671b-5-worker-1:382:664 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 119
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO threadThresholds 8/8/64 | 512/8/64 | 512 | 512
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 1 p2p channels per peer
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO ncclCommInitRankConfig comm 0x3a5f7a70 rank 2 nranks 64 cudaDev 0 nvmlDev 0 busId 53000 commId 0x891e39595d474161 - Init COMPLETE
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 64 total 1.37 (kernels 0.00, alloc 0.00, bootstrap 0.75, allgathers 0.24, topo 0.35, graphs 0.00, connections 0.02, rest 0.00)
[rank0]:costa-671b-5-worker-1:382:671 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 26
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 00/0 : 1[4] -> 2[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 02/0 : 1[4] -> 2[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 04/0 : 1[4] -> 2[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 06/0 : 1[4] -> 2[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 00/0 : 2[0] -> 3[4] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 02/0 : 2[0] -> 3[4] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 04/0 : 2[0] -> 3[4] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 06/0 : 2[0] -> 3[4] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 01/0 : 2[0] -> 5[4] [send] via NET/Libfabric/4(3)/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 03/0 : 2[0] -> 5[4] [send] via NET/Libfabric/4(3)/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 05/0 : 2[0] -> 5[4] [send] via NET/Libfabric/4(3)/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 07/0 : 2[0] -> 5[4] [send] via NET/Libfabric/4(3)/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Connected all rings, use ring PXN 1 GDR 1
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Comm config Blocking set to 1
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Assigned NET plugin Libfabric to comm
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Using network Libfabric
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO DMA-BUF is available on GPU device 0
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO ncclCommInitRankConfig comm 0x43be0680 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 53000 commId 0x45d2e8f7f215955b - Init START
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Bootstrap timings total 0.001645 (create 0.000036, send 0.000096, recv 0.000778, ring 0.000291, delay 0.000000)
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO MNNVL busId 0x53000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO NVLS multicast support is available on dev 0 (NVLS_NCHANNELS 16)
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO comm 0x43be0680 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 00/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 01/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 02/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 03/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 04/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 05/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 06/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 07/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 08/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 09/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 10/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 11/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 12/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 13/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 14/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 15/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 16/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 17/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 18/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 19/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 20/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 21/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 22/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 23/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO P2P Chunksize set to 524288
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0
[rank0]:costa-671b-5-worker-1:382:698 [0] NCCL INFO [Proxy Service] Device 0 CPU core 17
[rank0]:costa-671b-5-worker-1:382:699 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 39
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO 24 coll channels, 24 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO CC Off, workFifoBytes 1048576
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO ncclCommInitRankConfig comm 0x43be0680 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 53000 commId 0x45d2e8f7f215955b - Init COMPLETE
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 4 total 1.57 (kernels 0.00, alloc 0.00, bootstrap 0.00, allgathers 0.01, topo 1.23, graphs 0.01, connections 0.28, rest 0.04)
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Comm config Blocking set to 1
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Assigned NET plugin Libfabric to comm
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Using network Libfabric
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO DMA-BUF is available on GPU device 0
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO ncclCommInitRankConfig comm 0x5bc85dc0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 53000 commId 0xbb118fbd103cc510 - Init START
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Bootstrap timings total 0.002449 (create 0.000028, send 0.000084, recv 0.001388, ring 0.000559, delay 0.000000)
W0814 06:34:52.848000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 382 closing signal SIGTERM
W0814 06:34:52.849000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 383 closing signal SIGTERM
W0814 06:34:52.850000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 384 closing signal SIGTERM
W0814 06:34:52.851000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 385 closing signal SIGTERM
W0814 06:34:52.852000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 386 closing signal SIGTERM
W0814 06:34:52.852000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 387 closing signal SIGTERM
W0814 06:34:52.853000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 388 closing signal SIGTERM
E0814 06:34:56.669000 316 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 7 (pid: 389) of binary: /home/ubuntu/code/thirdparty/torchtitan-dev/.venv/bin/python
E0814 06:34:56.673000 316 torch/distributed/elastic/multiprocessing/errors/error_handler.py:141] no error file defined for parent, to copy child error file (/tmp/torchelastic_90ov82__/1_gpm7ozb5/attempt_0/7/error.json)
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO MNNVL busId 0x53000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO comm 0x5bc85dc0 rank 0 nRanks 4 nNodes 4 localRanks 1 localRank 0 MNNVL 0
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Channel 00/04 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Channel 01/04 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Channel 02/04 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Channel 03/04 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] 2/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO P2P Chunksize set to 524288
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Check P2P Type isAllDirectP2p 0 directMode 0
[rank0]:costa-671b-5-worker-1:382:734 [0] NCCL INFO [Proxy Service] Device 0 CPU core 114
[rank0]:costa-671b-5-worker-1:382:735 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 100
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
Traceback (most recent call last):
  File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/bin/torchrun", line 10, in <module>
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO CC Off, workFifoBytes 1048576
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO ncclCommInitRankConfig comm 0x5bc85dc0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 53000 commId 0xbb118fbd103cc510 - Init COMPLETE
    sys.exit(main())
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 4 total 79.79 (kernels 0.00, alloc 0.00, bootstrap 0.00, allgathers 31.52, topo 48.26, graphs 0.00, connections 0.01, rest 0.00)
[rank0]:costa-671b-5-worker-1:382:741 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 33
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 00/0 : 3[0] -> 0[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 01/0 : 3[0] -> 0[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 02/0 : 3[0] -> 0[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 03/0 : 3[0] -> 0[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA
             ^^^^^^
  File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
    return f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 936, in main
    run(args)
  File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 927, in run
    elastic_launch(
  File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 157, in __call__
    return launch_agent(self._config, self._entrypoint, list(args))
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 294, in launch_agent
    raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError: 
============================================================
torchtitan.train FAILED
------------------------------------------------------------
Failures:
  <NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
  time      : 2025-08-14_06:34:50
  host      : costa-671b-5-worker-1.costa-671b-5.default.svc.cluster.local
  rank      : 15 (local_rank: 7)
  exitcode  : 1 (pid: 389)
  error_file: /tmp/torchelastic_90ov82__/1_gpm7ozb5/attempt_0/7/error.json
  traceback : Traceback (most recent call last):
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
      return f(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/train.py", line 576, in train
      self.train_step(data_iterator)
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/train.py", line 476, in train_step
      loss = self.forward_backward_step(input_dict, labels)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/train.py", line 452, in forward_backward_step
      pred = model_parts[0](inputs, eos_id=self.tokenizer.eos_id)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
      return inner()
             ^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
      result = forward_call(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/models/deepseek_v3/model/model.py", line 431, in forward
      h = layer(h, self.freqs_cis)
          ^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
      return inner()
             ^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
      result = forward_call(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py", line 171, in forward
      return self.checkpoint_fn(  # type: ignore[misc]
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
      return disable_fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
      return fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 495, in checkpoint
      ret = function(*args, **kwargs)
            ^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
      return forward_call(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/models/deepseek_v3/model/model.py", line 337, in forward
      x = x + self.moe(self.ffn_norm(x))
              ^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
      return inner()
             ^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
      result = forward_call(*args, **kwargs)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/models/moe.py", line 407, in forward
      routed_output = self.experts(routed_input, num_tokens_per_expert)
                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
      return self._call_impl(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
      return inner()
             ^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1808, in inner
      args_kwargs_result = hook(self, args, kwargs)  # type: ignore[misc]
                           ^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 62, in fsdp_hook_wrapper
      return torch._dynamo.disable(
             ^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
      return fn(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 248, in _pre_forward
      args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 386, in pre_forward
      self.unshard(self.unshard_async_op)
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 314, in unshard
      self._all_gather_result = foreach_all_gather(
                                ^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
      return func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py", line 275, in foreach_all_gather
      all_gather_work = all_gather_comm(
                        ^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py", line 89, in __call__
      return dist.all_gather_into_tensor(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
      return func(*args, **kwargs)
             ^^^^^^^^^^^^^^^^^^^^^
    File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3988, in all_gather_into_tensor
      work = group._allgather_base(output_tensor, input_tensor, opts)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:94, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.27.5
  ncclUnhandledCudaError: Call to CUDA function failed.
  Last error:
  Cuda failure 'device-side assert triggered'
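
For reference, here is a minimal sketch (plain Python, not taken from the torchtitan source) of how I read the [4, 16, 4] mesh in the log above as following from the launch flags. The factoring rules below are my assumption about how the degrees combine; only the world size and the resulting [4, 16, 4] sizes are confirmed by the log.

# Sketch: deriving the ['dp_shard_mod_ep', 'dp_shard_in_ep', 'tp'] mesh sizes
# from the launch flags. Assumed relations, not copied from torchtitan code.
world_size = 32 * 8               # 32 nodes x 8 GPUs = 256 ranks ("nranks 256" in the log)
tp, pp, ep, etp = 4, 1, 64, 1     # from the --parallelism.* flags above

dp_shard = world_size // (pp * tp)             # -1 is inferred: 256 // 4 = 64
dp_shard_in_ep = ep * etp // tp                # assumed: 64 * 1 // 4 = 16
dp_shard_mod_ep = dp_shard // dp_shard_in_ep   # 64 // 16 = 4

assert [dp_shard_mod_ep, dp_shard_in_ep, tp] == [4, 16, 4]  # matches the mesh log line

With that mesh, the failure itself surfaces later in the traceback above: the FSDP all-gather issued from the MoE experts' pre-forward hook fails with ncclUnhandledCudaError, caused by a device-side assert.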
  
