NCCL kernels take longer when composing CUDAGraph with SimpleFSDP #2071

Reported by @BoyuanFeng and @galv for the PR: https://github.com/pytorch/torchtitan/pull/2050.

Repro instructions:
```
# WITHOUT cudagraph
USE_EXPANDABLE_SEGMENTS=False NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4

# WITH cudagraph
USE_EXPANDABLE_SEGMENTS=False NGPU=8 CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4 --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes cudagraph

# In both cases, the profiler trace is stored in torchtitan/outputs/profile_trace
```
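To compare the two runs numerically, rather than by eye in a trace viewer, something like the following sketch may help. It assumes the traces are Chrome-trace JSON files (optionally gzipped), that GPU kernels appear as complete events (`"ph": "X"`) with `"cat": "kernel"`, and that NCCL kernels have `nccl` somewhere in their name. These are assumptions about the profiler output, not something confirmed in this issue, and the helper names `load_trace_events` and `nccl_kernel_stats` are made up for illustration.

```python
import gzip
import json
from collections import defaultdict


def load_trace_events(path):
    """Load the event list from a Chrome-trace JSON file (optionally gzipped)."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as f:
        trace = json.load(f)
    # Chrome traces are either {"traceEvents": [...]} or a bare list of events.
    return trace["traceEvents"] if isinstance(trace, dict) else trace


def nccl_kernel_stats(events):
    """Return {kernel_name: (count, total_duration_us)} for NCCL GPU kernels."""
    stats = defaultdict(lambda: [0, 0.0])
    for ev in events:
        # Assumption: GPU kernels are complete events ("ph": "X") in the
        # "kernel" category, with "dur" given in microseconds.
        if ev.get("ph") != "X" or ev.get("cat") != "kernel":
            continue
        name = ev.get("name", "")
        # Assumption: NCCL kernel names contain "nccl" (case-insensitive).
        if "nccl" not in name.lower():
            continue
        stats[name][0] += 1
        stats[name][1] += float(ev.get("dur", 0))
    return {name: (count, total) for name, (count, total) in stats.items()}
```

Running `nccl_kernel_stats(load_trace_events(path))` on one trace file from each run and comparing the per-kernel totals should show which collectives get slower under cudagraph. The trace file paths depend on the run and are not given here.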
