Hi TorchTitan team,
I am trying to run:
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml" uv run --no-sync ./run_train.sh \
--training.steps=100000 \
--training.dataset=c4_test \
--training.local_batch_size=1 \
--parallelism.data_parallel_shard_degree=-1 \
--parallelism.tensor_parallel_degree=4 \
--parallelism.pipeline_parallel_degree=1 \
--parallelism.expert_parallel_degree=64 \
--parallelism.expert_tensor_parallel_degree=1 \
--metrics.log_freq=1
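(For context, and as the log below confirms with its [4, 16, 4] mesh: the run spans 32 nodes × 8 GPUs = 256 ranks; with tensor_parallel_degree=4 and data_parallel_shard_degree=-1 the FSDP shard degree resolves to 256 / 4 = 64, and with expert_parallel_degree=64 and expert_tensor_parallel_degree=1 the device mesh is built as ['dp_shard_mod_ep', 'dp_shard_in_ep', 'tp'] = [4, 16, 4].)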
Running this, I got the following output:
+ export NGPU=8
+ NGPU=8
+ export LOG_RANK=0
+ LOG_RANK=0
+ export CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml
+ CONFIG_FILE=./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml
+ overrides=
+ '[' 9 -ne 0 ']'
+ overrides='--training.steps=100000 --training.dataset=c4_test --training.local_batch_size=1 --parallelism.data_parallel_shard_degree=-1 --parallelism.tensor_parallel_degree=4 --parallelism.pipeline_parallel_degree=1 --parallelism.expert_parallel_degree=64 --parallelism.expert_tensor_parallel_degree=1 --metrics.log_freq=1'
+ export TORCHFT_LIGHTHOUSE=http://localhost:29510
+ TORCHFT_LIGHTHOUSE=http://localhost:29510
+ export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True
+ '[' -z 32 ']'
+ '[' 32 = 1 ']'
+ RDZV_ID=1
+ RDZV_HOST=costa-671b-5-worker-0.costa-671b-5
+ torchrun --nnodes 32 --nproc_per_node=8 --rdzv_backend c10d --rdzv_endpoint costa-671b-5-worker-0.costa-671b-5:29500 --rdzv_id 1 --local-ranks-filter 0 --role rank --tee 3 -m torchtitan.train --job.config_file ./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml --training.steps=100000 --training.dataset=c4_test --training.local_batch_size=1 --parallelism.data_parallel_shard_degree=-1 --parallelism.tensor_parallel_degree=4 --parallelism.pipeline_parallel_degree=1 --parallelism.expert_parallel_degree=64 --parallelism.expert_tensor_parallel_degree=1 --metrics.log_freq=1
W0814 06:32:53.078000 316 torch/distributed/run.py:803]
W0814 06:32:53.078000 316 torch/distributed/run.py:803] *****************************************
W0814 06:32:53.078000 316 torch/distributed/run.py:803] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
W0814 06:32:53.078000 316 torch/distributed/run.py:803] *****************************************
[rank0]:/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/cuda/__init__.py:410: UserWarning: PYTORCH_CUDA_ALLOC_CONF is deprecated, use PYTORCH_ALLOC_CONF instead (Triggered internally at /pytorch/c10/core/AllocatorConfig.cpp:28.)
[rank0]: torch._C._cuda_init()
[rank0]:[titan] 2025-08-14 06:33:13,382 - root - WARNING - tokenizer_path is deprecated, use model.hf_assets_path instead. Setting hf_assets_path to tokenizer_path temporarily.
[rank0]:[titan] 2025-08-14 06:33:13,382 - root - INFO - Starting job: DeepSeek-V3 671B model training
[rank0]:[titan] 2025-08-14 06:33:14,873 - root - WARNING - ENV[TORCH_NCCL_ASYNC_ERROR_HANDLING] = 1 will be overridden to 3 based on job config
[rank0]:[titan] 2025-08-14 06:33:14,879 - root - INFO - Building 3-D device mesh with ['dp_shard_mod_ep', 'dp_shard_in_ep', 'tp'], [4, 16, 4]
[rank0]:[titan] 2025-08-14 06:33:14,890 - root - INFO - [GC] Initial GC collection 0.00 seconds
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO cudaDriverVersion 12090
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Bootstrap: Using eth0:10.1.68.153<0>
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO NCCL version 2.27.5+cuda12.9
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Comm config Blocking set to 1
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Plugin name set by env to libnccl-net.so
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Loaded net plugin Libfabric (v10)
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v10 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v9 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v8 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v7 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/Plugin: Failed to find ncclCollNetPlugin_v6 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Successfully loaded external plugin libnccl-net.so
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Initializing aws-ofi-nccl 1.15.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Using Libfabric version 2.1
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Using CUDA driver version 12090 with runtime 12090
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Configuring AWS-specific options
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Setting NCCL_NET_FORCE_FLUSH=0 since this platform does not require a network flush.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Setting NCCL_NVLSTREE_MAX_CHUNKSIZE to 512KiB
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Setting NCCL_NVLS_CHUNKSIZE to 512KiB
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Internode latency set at 75.0 us
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Using transport protocol RDMA (platform set)
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Selected provider is efa, fabric is efa-direct (found 32 nics)
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 0 device #0 0000:52:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 0 device #1 0000:51:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 0 device #2 0000:50:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 0 device #3 0000:4f:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 1 device #0 0000:63:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 1 device #1 0000:62:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 1 device #2 0000:61:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 1 device #3 0000:60:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 2 device #0 0000:74:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 2 device #1 0000:73:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 2 device #2 0000:72:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 2 device #3 0000:71:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 3 device #0 0000:85:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 3 device #1 0000:84:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 3 device #2 0000:83:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 3 device #3 0000:82:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 4 device #0 0000:96:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 4 device #1 0000:95:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 4 device #2 0000:94:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 4 device #3 0000:93:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 5 device #0 0000:a7:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 5 device #1 0000:a6:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 5 device #2 0000:a5:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 5 device #3 0000:a4:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 6 device #0 0000:b8:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 6 device #1 0000:b7:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 6 device #2 0000:b6:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 6 device #3 0000:b5:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 7 device #0 0000:c9:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 7 device #1 0000:c8:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 7 device #2 0000:c7:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI NIC group 7 device #3 0000:c6:00.0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Setting NCCL_TOPO_FILE environment variable to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Creating one domain per process
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap82s0: 578a1dff7f090501
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[0]: 00000000000000000a01449900000000
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap99s0: 7b7dd78e6c201c01
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[1]: 00000000000000000a01449900000001
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap116s0: 7ba4fd7c85464600
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[2]: 00000000000000000a01449900000002
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap133s0: dd4414e66c5d5d00
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[3]: 00000000000000000a01449900000003
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap150s0: 3917a554ff838300
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[4]: 00000000000000000a01449900000004
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap167s0: b9b56ca9a49b9b00
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[5]: 00000000000000000a01449900000005
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap184s0: d90ac2d4f4bebe00
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[6]: 00000000000000000a01449900000006
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID of rdmap201s0: 7b7607630dd6d600
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI GUID for dev[7]: 00000000000000000a01449900000007
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Support for global registrations: true
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NET/OFI Support for DMA-BUF registrations: false
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Initialized NET plugin Libfabric
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Assigned NET plugin Libfabric to comm
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Using network Libfabric
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO DMA-BUF is available on GPU device 0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO ncclCommInitRankScalable comm 0x2bff7340 rank 8 nranks 256 cudaDev 0 nvmlDev 0 busId 53000 commId 0xc11aff4bf0a5a0b0 - Init START
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO RAS client listening socket at ::1<28028>
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Bootstrap timings total 0.828328 (create 0.000043, send 0.000325, recv 0.000577, ring 0.826804, delay 0.000000)
[rank0]:[titan] 2025-08-14 06:33:21,267 - root - INFO - Loading tokenizer from tokenizer.json
[rank0]:[titan] 2025-08-14 06:33:21,491 - root - INFO - Preparing c4_test dataset from tests/assets/c4_test
[rank0]:[titan] 2025-08-14 06:33:21,520 - root - INFO - Building deepseek_v3 671B with DeepSeekV3ModelArgs(_enforced='This field is used to enforce all fields have defaults.', max_batch_size=8, max_seq_len=4096, dtype='fp8', vocab_size=129280, dim=7168, inter_dim=18432, moe_inter_dim=2048, n_layers=61, n_dense_layers=3, n_heads=128, norm_eps=1e-05, moe_args=MoEArgs(num_experts=256, num_shared_experts=1, score_func='sigmoid', route_norm=True, route_scale=2.5, score_before_experts=False, top_k=8, use_grouped_mm=True, load_balance_coeff=0.001), n_expert_groups=8, n_limited_groups=4, q_lora_rank=1536, kv_lora_rank=512, qk_nope_head_dim=128, qk_rope_head_dim=64, v_head_dim=128, use_flex_attn=False, attn_mask_type='causal', original_seq_len=4096, rope_theta=10000.0, rope_factor=40, beta_fast=32, beta_slow=1, mscale=1.0)
[rank0]:[titan] 2025-08-14 06:33:22,298 - root - INFO - CUDA capacity: NVIDIA H100 80GB HBM3 with 79.10GiB memory
[rank0]:[titan] 2025-08-14 06:33:22,615 - root - INFO - Total parameter count: dense 14,456,871,936, sparse 656,569,532,416, active 37,552,282,624
[rank0]:[titan] 2025-08-14 06:33:22,615 - root - INFO - Model deepseek_v3 671B size: 671,026,404,352 total parameters
[rank0]:[titan] 2025-08-14 06:33:22,822 - root - INFO - Applied Tensor Parallelism to the model
[rank0]:[titan] 2025-08-14 06:33:22,958 - root - INFO - Applied full activation checkpointing to the model
[rank0]:[titan] 2025-08-14 06:33:23,606 - root - INFO - Applied FSDP to the model
[rank0]:[titan] 2025-08-14 06:33:25,109 - root - INFO - Peak FLOPS used for computing MFU: 9.890e+14
[rank0]:[titan] 2025-08-14 06:33:25,109 - root - INFO - CUDA memory usage for model: 12.16GiB(15.38%)
[rank0]:[titan] 2025-08-14 06:33:25,114 - root - INFO - Mixed precision training is handled by fully_shard
[rank0]:[titan] 2025-08-14 06:33:25,114 - root - INFO - Trainer is initialized with local batch size 1, global batch size 64, gradient accumulation steps 1, sequence length 4096, total steps 100000 (warmup 2000)
[rank0]:[titan] 2025-08-14 06:33:25,114 - root - INFO - Training starts at step 1
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO MNNVL busId 0x53000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NVLS multicast support is available on dev 0 (NVLS_NCHANNELS 16)
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO comm 0x2bff7340 rank 8 nRanks 256 nNodes 32 localRanks 8 localRank 0 MNNVL 0
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Trees [0] 9/-1/-1->8->16 [1] -1/-1/-1->8->15 [2] 9/-1/-1->8->15 [3] 9/-1/-1->8->15 [4] 9/-1/-1->8->15 [5] 9/-1/-1->8->15 [6] 9/-1/-1->8->15 [7] 9/-1/-1->8->15 [8] 9/16/0->8->24 [9] -1/-1/-1->8->15 [10] 9/-1/-1->8->15 [11] 9/-1/-1->8->15 [12] 9/-1/-1->8->15 [13] 9/-1/-1->8->15 [14] 9/-1/-1->8->15 [15] 9/-1/-1->8->15
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NCCL_BUFFSIZE set by environment to 8388608.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NCCL_P2P_NET_CHUNKSIZE set by environment to 524288.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO P2P Chunksize set to 524288
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO PROFILER/Plugin: Could not find: libnccl-profiler.so.
[rank0]:costa-671b-5-worker-1:382:568 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 99
[rank0]:costa-671b-5-worker-1:382:561 [0] NCCL INFO [Proxy Service] Device 0 CPU core 25
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO NCCL_NVLS_CHUNKSIZE set by environment to 524288.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO threadThresholds 8/8/64 | 2048/8/64 | 512 | 512
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO 16 coll channels, 16 collnet channels, 16 nvls channels, 16 p2p channels, 1 p2p channels per peer
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO TUNER/Plugin: Could not find: libnccl-tuner.so. Using internal tuner plugin.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v4 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v3 symbol.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO TUNER/Plugin: Failed to find ncclTunerPlugin_v2 symbol, using internal tuner instead.
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO ncclCommInitRankScalable comm 0x2bff7340 rank 8 nranks 256 cudaDev 0 nvmlDev 0 busId 53000 commId 0xc11aff4bf0a5a0b0 - Init COMPLETE
[rank0]:costa-671b-5-worker-1:382:548 [0] NCCL INFO Init timings - ncclCommInitRankScalable: rank 8 nranks 256 total 4.94 (kernels 0.25, alloc 2.23, bootstrap 0.83, allgathers 0.15, topo 1.20, graphs 0.01, connections 0.26, rest 0.01)
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 02/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 04/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 06/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 10/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 12/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 14/0 : 8[0] -> 9[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 01/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 03/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 05/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 07/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 09/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 11/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 13/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 15/0 : 8[0] -> 15[7] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO NCCL_NET_FORCE_FLUSH set by environment to 0.
[rank0]:costa-671b-5-worker-1:382:590 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 14
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 01/0 : 0[0] -> 8[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 09/0 : 0[0] -> 8[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 00/0 : 8[0] -> 16[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Channel 08/0 : 8[0] -> 16[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:582 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Comm config Blocking set to 1
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Assigned NET plugin Libfabric to comm
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Using network Libfabric
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO DMA-BUF is available on GPU device 0
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO ncclCommInitRankConfig comm 0x3a5f7a70 rank 2 nranks 64 cudaDev 0 nvmlDev 0 busId 53000 commId 0x891e39595d474161 - Init START
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Bootstrap timings total 0.745780 (create 0.000038, send 0.000275, recv 0.040516, ring 0.704749, delay 0.000000)
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO MNNVL busId 0x53000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO comm 0x3a5f7a70 rank 2 nRanks 64 nNodes 32 localRanks 2 localRank 0 MNNVL 0
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Trees [0] 3/-1/-1->2->4 [1] -1/-1/-1->2->3 [2] 3/-1/-1->2->4 [3] -1/-1/-1->2->3 [4] 3/4/0->2->6 [5] -1/-1/-1->2->3 [6] 3/4/0->2->6 [7] -1/-1/-1->2->3
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO P2P Chunksize set to 524288
[rank0]:costa-671b-5-worker-1:382:663 [0] NCCL INFO [Proxy Service] Device 0 CPU core 26
[rank0]:costa-671b-5-worker-1:382:664 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 119
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO threadThresholds 8/8/64 | 512/8/64 | 512 | 512
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO 8 coll channels, 8 collnet channels, 0 nvls channels, 8 p2p channels, 1 p2p channels per peer
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO ncclCommInitRankConfig comm 0x3a5f7a70 rank 2 nranks 64 cudaDev 0 nvmlDev 0 busId 53000 commId 0x891e39595d474161 - Init COMPLETE
[rank0]:costa-671b-5-worker-1:382:635 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 2 nranks 64 total 1.37 (kernels 0.00, alloc 0.00, bootstrap 0.75, allgathers 0.24, topo 0.35, graphs 0.00, connections 0.02, rest 0.00)
[rank0]:costa-671b-5-worker-1:382:671 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 26
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 00/0 : 1[4] -> 2[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 02/0 : 1[4] -> 2[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 04/0 : 1[4] -> 2[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 06/0 : 1[4] -> 2[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 00/0 : 2[0] -> 3[4] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 02/0 : 2[0] -> 3[4] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 04/0 : 2[0] -> 3[4] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 06/0 : 2[0] -> 3[4] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 01/0 : 2[0] -> 5[4] [send] via NET/Libfabric/4(3)/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 03/0 : 2[0] -> 5[4] [send] via NET/Libfabric/4(3)/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 05/0 : 2[0] -> 5[4] [send] via NET/Libfabric/4(3)/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Channel 07/0 : 2[0] -> 5[4] [send] via NET/Libfabric/4(3)/GDRDMA
[rank0]:costa-671b-5-worker-1:382:668 [0] NCCL INFO Connected all rings, use ring PXN 1 GDR 1
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Comm config Blocking set to 1
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Assigned NET plugin Libfabric to comm
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Using network Libfabric
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO DMA-BUF is available on GPU device 0
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO ncclCommInitRankConfig comm 0x43be0680 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 53000 commId 0x45d2e8f7f215955b - Init START
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Bootstrap timings total 0.001645 (create 0.000036, send 0.000096, recv 0.000778, ring 0.000291, delay 0.000000)
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO MNNVL busId 0x53000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO NVLS multicast support is available on dev 0 (NVLS_NCHANNELS 16)
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO comm 0x43be0680 rank 0 nRanks 4 nNodes 1 localRanks 4 localRank 0 MNNVL 0
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 00/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 01/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 02/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 03/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 04/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 05/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 06/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 07/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 08/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 09/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 10/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 11/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 12/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 13/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 14/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 15/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 16/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 17/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 18/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 19/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 20/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 21/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 22/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Channel 23/24 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Trees [0] 1/-1/-1->0->-1 [1] 1/-1/-1->0->-1 [2] 1/-1/-1->0->-1 [3] 1/-1/-1->0->-1 [4] 1/-1/-1->0->-1 [5] 1/-1/-1->0->-1 [6] 1/-1/-1->0->-1 [7] 1/-1/-1->0->-1 [8] 1/-1/-1->0->-1 [9] 1/-1/-1->0->-1 [10] 1/-1/-1->0->-1 [11] 1/-1/-1->0->-1 [12] 1/-1/-1->0->-1 [13] 1/-1/-1->0->-1 [14] 1/-1/-1->0->-1 [15] 1/-1/-1->0->-1 [16] 1/-1/-1->0->-1 [17] 1/-1/-1->0->-1 [18] 1/-1/-1->0->-1 [19] 1/-1/-1->0->-1 [20] 1/-1/-1->0->-1 [21] 1/-1/-1->0->-1 [22] 1/-1/-1->0->-1 [23] 1/-1/-1->0->-1
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO P2P Chunksize set to 524288
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Check P2P Type isAllDirectP2p 1 directMode 0
[rank0]:costa-671b-5-worker-1:382:698 [0] NCCL INFO [Proxy Service] Device 0 CPU core 17
[rank0]:costa-671b-5-worker-1:382:699 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 39
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO 24 coll channels, 24 collnet channels, 16 nvls channels, 32 p2p channels, 32 p2p channels per peer
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO CC Off, workFifoBytes 1048576
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO ncclCommInitRankConfig comm 0x43be0680 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 53000 commId 0x45d2e8f7f215955b - Init COMPLETE
[rank0]:costa-671b-5-worker-1:382:675 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 4 total 1.57 (kernels 0.00, alloc 0.00, bootstrap 0.00, allgathers 0.01, topo 1.23, graphs 0.01, connections 0.28, rest 0.04)
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 04/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 05/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 06/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 07/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 08/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 09/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 10/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 11/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 12/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 13/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 14/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 15/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 16/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 17/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 18/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 19/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 20/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 21/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 22/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Channel 23/0 : 0[0] -> 1[1] via P2P/CUMEM
[rank0]:costa-671b-5-worker-1:382:704 [0] NCCL INFO Connected all rings, use ring PXN 0 GDR 1
[rank0]:costa-671b-5-worker-1:382:382 [0] NCCL INFO Comm config Blocking set to 1
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Assigned NET plugin Libfabric to comm
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Using network Libfabric
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO DMA-BUF is available on GPU device 0
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO ncclCommInitRankConfig comm 0x5bc85dc0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 53000 commId 0xbb118fbd103cc510 - Init START
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Bootstrap timings total 0.002449 (create 0.000028, send 0.000084, recv 0.001388, ring 0.000559, delay 0.000000)
W0814 06:34:52.848000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 382 closing signal SIGTERM
W0814 06:34:52.849000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 383 closing signal SIGTERM
W0814 06:34:52.850000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 384 closing signal SIGTERM
W0814 06:34:52.851000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 385 closing signal SIGTERM
W0814 06:34:52.852000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 386 closing signal SIGTERM
W0814 06:34:52.852000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 387 closing signal SIGTERM
W0814 06:34:52.853000 316 torch/distributed/elastic/multiprocessing/api.py:908] Sending process 388 closing signal SIGTERM
E0814 06:34:56.669000 316 torch/distributed/elastic/multiprocessing/api.py:882] failed (exitcode: 1) local_rank: 7 (pid: 389) of binary: /home/ubuntu/code/thirdparty/torchtitan-dev/.venv/bin/python
E0814 06:34:56.673000 316 torch/distributed/elastic/multiprocessing/errors/error_handler.py:141] no error file defined for parent, to copy child error file (/tmp/torchelastic_90ov82__/1_gpm7ozb5/attempt_0/7/error.json)
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO MNNVL busId 0x53000 fabric UUID 0.0 cliqueId 0x0 state 3 healthMask 0x0
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO NCCL_TOPO_FILE set by environment to /proc/self/fd/156
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Setting affinity for GPU 0 to 0-47,96-143
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO comm 0x5bc85dc0 rank 0 nRanks 4 nNodes 4 localRanks 1 localRank 0 MNNVL 0
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Channel 00/04 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Channel 01/04 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Channel 02/04 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Channel 03/04 : 0 1 2 3
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Trees [0] 2/-1/-1->0->-1 [1] 2/-1/-1->0->-1 [2] -1/-1/-1->0->1 [3] -1/-1/-1->0->1
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO P2P Chunksize set to 524288
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Check P2P Type isAllDirectP2p 0 directMode 0
[rank0]:costa-671b-5-worker-1:382:734 [0] NCCL INFO [Proxy Service] Device 0 CPU core 114
[rank0]:costa-671b-5-worker-1:382:735 [0] NCCL INFO [Proxy Service UDS] Device 0 CPU core 100
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO threadThresholds 8/8/64 | 32/8/64 | 512 | 512
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO 4 coll channels, 4 collnet channels, 0 nvls channels, 4 p2p channels, 2 p2p channels per peer
Traceback (most recent call last):
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/bin/torchrun", line 10, in <module>
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO CC Off, workFifoBytes 1048576
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO ncclCommInitRankConfig comm 0x5bc85dc0 rank 0 nranks 4 cudaDev 0 nvmlDev 0 busId 53000 commId 0xbb118fbd103cc510 - Init COMPLETE
sys.exit(main())
[rank0]:costa-671b-5-worker-1:382:718 [0] NCCL INFO Init timings - ncclCommInitRankConfig: rank 0 nranks 4 total 79.79 (kernels 0.00, alloc 0.00, bootstrap 0.00, allgathers 31.52, topo 48.26, graphs 0.00, connections 0.01, rest 0.00)
[rank0]:costa-671b-5-worker-1:382:741 [0] NCCL INFO [Proxy Progress] Device 0 CPU core 33
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 00/0 : 3[0] -> 0[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 01/0 : 3[0] -> 0[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 02/0 : 3[0] -> 0[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 03/0 : 3[0] -> 0[0] [receive] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 00/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 01/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 02/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA
[rank0]:costa-671b-5-worker-1:382:738 [0] NCCL INFO Channel 03/0 : 0[0] -> 1[0] [send] via NET/Libfabric/0/GDRDMA
^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 936, in main
run(args)
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/run.py", line 927, in run
elastic_launch(
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 157, in __call__
return launch_agent(self._config, self._entrypoint, list(args))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/launcher/api.py", line 294, in launch_agent
raise ChildFailedError(
torch.distributed.elastic.multiprocessing.errors.ChildFailedError:
============================================================
torchtitan.train FAILED
------------------------------------------------------------
Failures:
<NO_OTHER_FAILURES>
------------------------------------------------------------
Root Cause (first observed failure):
[0]:
time : 2025-08-14_06:34:50
host : costa-671b-5-worker-1.costa-671b-5.default.svc.cluster.local
rank : 15 (local_rank: 7)
exitcode : 1 (pid: 389)
error_file: /tmp/torchelastic_90ov82__/1_gpm7ozb5/attempt_0/7/error.json
traceback : Traceback (most recent call last):
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py", line 357, in wrapper
return f(*args, **kwargs)
^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/train.py", line 576, in train
self.train_step(data_iterator)
File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/train.py", line 476, in train_step
loss = self.forward_backward_step(input_dict, labels)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/train.py", line 452, in forward_backward_step
pred = model_parts[0](inputs, eos_id=self.tokenizer.eos_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
return inner()
^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/models/deepseek_v3/model/model.py", line 431, in forward
h = layer(h, self.freqs_cis)
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
return inner()
^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/algorithms/_checkpoint/checkpoint_wrapper.py", line 171, in forward
return self.checkpoint_fn( # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/_compile.py", line 53, in inner
return disable_fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/utils/checkpoint.py", line 495, in checkpoint
ret = function(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1786, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/models/deepseek_v3/model/model.py", line 337, in forward
x = x + self.moe(self.ffn_norm(x))
^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
return inner()
^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1829, in inner
result = forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/torchtitan/models/moe.py", line 407, in forward
routed_output = self.experts(routed_input, num_tokens_per_expert)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1775, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1881, in _call_impl
return inner()
^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/nn/modules/module.py", line 1808, in inner
args_kwargs_result = hook(self, args, kwargs) # type: ignore[misc]
^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 62, in fsdp_hook_wrapper
return torch._dynamo.disable(
^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/_dynamo/eval_frame.py", line 1005, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_state.py", line 248, in _pre_forward
args, kwargs = self._fsdp_param_group.pre_forward(module, args, kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 386, in pre_forward
self.unshard(self.unshard_async_op)
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_param_group.py", line 314, in unshard
self._all_gather_result = foreach_all_gather(
^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 120, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py", line 275, in foreach_all_gather
all_gather_work = all_gather_comm(
^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/fsdp/_fully_shard/_fsdp_collectives.py", line 89, in __call__
return dist.all_gather_into_tensor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/c10d_logger.py", line 81, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/ubuntu/code/thirdparty/torchtitan-dev/.venv/lib/python3.12/site-packages/torch/distributed/distributed_c10d.py", line 3988, in all_gather_into_tensor
work = group._allgather_base(output_tensor, input_tensor, opts)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
torch.distributed.DistBackendError: NCCL error in: /pytorch/torch/csrc/distributed/c10d/NCCLUtils.cpp:94, unhandled cuda error (run with NCCL_DEBUG=INFO for details), NCCL version 2.27.5
ncclUnhandledCudaError: Call to CUDA function failed.
Last error:
Cuda failure 'device-side assert triggered'
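In case it helps with reproduction, this is a minimal sketch of how I would rerun to try to localize the device-side assert; the extra CUDA_LAUNCH_BLOCKING and TORCH_SHOW_CPP_STACKTRACES variables are standard CUDA/PyTorch debugging switches I am adding here, not part of the original run:

# same invocation as above, but with synchronous kernel launches and C++ stack traces,
# and a single step so the failing kernel surfaces quickly
CUDA_LAUNCH_BLOCKING=1 TORCH_SHOW_CPP_STACKTRACES=1 \
PYTORCH_CUDA_ALLOC_CONF="expandable_segments:True" \
CONFIG_FILE="./torchtitan/models/deepseek_v3/train_configs/deepseek_v3_671b.toml" \
uv run --no-sync ./run_train.sh \
--training.steps=1 \
--training.dataset=c4_test \
--training.local_batch_size=1 \
--parallelism.data_parallel_shard_degree=-1 \
--parallelism.tensor_parallel_degree=4 \
--parallelism.pipeline_parallel_degree=1 \
--parallelism.expert_parallel_degree=64 \
--parallelism.expert_tensor_parallel_degree=1 \
--metrics.log_freq=1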