Conversation

@BoyuanFeng (Contributor) commented Nov 17, 2025

Features

  • Support SimpleFSDP and TP
  • Support static input indices to reduce copies
  • Support memory reuse to reduce memory consumption
  • Clean up cudagraphs when training finishes to avoid an NCCL hang in destroy_process_group

Command:

NCCL_GRAPH_REGISTER=0 NGPU=8 TRAIN_FILE=torchtitan.experiments.compiler_toolkit.train CONFIG_FILE=./torchtitan/models/llama3/train_configs/debug_model.toml ./run_train.sh --model.name compiler_toolkit.llama3 --parallelism.data_parallel_shard_degree=2 --parallelism.tensor_parallel_degree=4  --job.custom_config_module=torchtitan.experiments.compiler_toolkit.job_config --compile.passes cudagraph

Note: we use NCCL_GRAPH_REGISTER=0 due to a known issue where NCCL + cudagraphs + expandable segments results in an illegal memory access (IMA); see pytorch/pytorch#158029.

trace

Result

Numerics:
Achieved bitwise equivalence with and without the cudagraph pass on both llama3.1-8B and llama3.1-70B.

Performance:
[performance comparison image]

Raw log: llama3-8b, llama3-70b

Memory:
On llama3.1-70b, cudagraph uses ~7% more memory (143 GiB without vs 153 GiB with).

A few tricks reduced memory consumption (using llama3.1-70b with cudagraph as an example; see the sketch after this list):

  • Start: 161 GiB
  • + use the same stream for warmup and graph capture of both fwd and bwd: 160 GiB
  • + warm up in the cudagraph memory pool instead of the eager memory pool: 153 GiB
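
For context, here is a minimal sketch of the shared-stream trick using public torch.cuda APIs (fn, args, and the warmup count are illustrative; routing warmup allocations into the cudagraph pool itself relies on allocator internals not shown here):

import torch

def warmup_and_capture(fn, args, stream, pool):
    # Warm up on the same side stream later used for capture, so fwd
    # and bwd share one stream rather than allocating on two streams.
    stream.wait_stream(torch.cuda.current_stream())
    with torch.cuda.stream(stream):
        for _ in range(3):
            fn(*args)
    torch.cuda.current_stream().wait_stream(stream)

    # Capture into a shared pool so fwd and bwd graphs can reuse memory.
    graph = torch.cuda.CUDAGraph()
    with torch.cuda.graph(graph, pool=pool, stream=stream):
        out = fn(*args)
    return graph, out

stream = torch.cuda.Stream()
pool = torch.cuda.graph_pool_handle()
# capture fwd and bwd with the same `stream` and `pool`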

Static input copy:
On llama3.1-70B, the forward graph copies 1 tensor of 128 bytes and the backward graph copies 1 tensor of 0.98 GB, showing that static input indices are handled correctly.

Followup PR

In a follow-up PR, I will enable fx graph partition for DeepSeek-V3 (pytorch/pytorch#165945).

@BoyuanFeng marked this pull request as draft November 17, 2025 23:41

def copy_static_inputs(self, *args):
    for i in self.input_indices_to_copy:
        self.args[i].copy_(args[i])
@BoyuanFeng (author):

We could replace this for loop with a foreach copy. However, I empirically observed there is only 1 tensor to copy for fwd and 1 for bwd, so it is not worth the added code complexity here.
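
For reference, a hedged sketch of the foreach variant (assuming all inputs at these indices are CUDA tensors):

def copy_static_inputs(self, *args):
    # Batch all copies into one foreach kernel launch instead of one
    # copy_ launch per tensor; only worthwhile with many static inputs.
    dsts = [self.args[i] for i in self.input_indices_to_copy]
    srcs = [args[i] for i in self.input_indices_to_copy]
    if dsts:
        torch._foreach_copy_(dsts, srcs)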

with torch.cuda.graph(
    self.cudagraph, pool=self.graph_pool, stream=self.stream
):
    # `output` is managed by pytorch's cudagraph pool
    self.output = self.runnable(*args)
@BoyuanFeng (author):

We could potentially hold the output tensors via weakref to reduce memory. Will do in a follow-up PR.
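
A rough sketch of that idea (method and attribute names hypothetical): hold the cudagraph-pool-managed outputs weakly, so their storage can be recycled once the caller drops them.

import weakref

def stash_outputs(self, outputs):
    # Keep only weak references; once the caller releases an output,
    # its storage in the cudagraph pool becomes reusable.
    self._output_refs = [weakref.ref(t) for t in outputs]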

@BoyuanFeng marked this pull request as ready for review November 18, 2025 21:40
@eellison left a comment:

Looks good! For serious use, we should handle not persisting the input and output. I would also add assertions for assumptions that would otherwise manifest as silent incorrectness, at least behind a config. Also, we probably shouldn't globally turn off expandable segments when cudagraphs is not enabled.

Comment on lines 100 to 102
input_addresses = [
    x.data_ptr() if isinstance(x, torch.Tensor) else None for x in args
]

@eellison:

I guess we're assuming that the non-tensor inputs are the same every time? Should we just assert they're all tensors if we're not handling the other cases?

@BoyuanFeng (author):

IIUC, there would only be tensors and symints (for the moe layer). Let me add an assertion.

@BoyuanFeng (author):

There are also rng_state inputs of type torch._C.Generator, used by:

graphsafe_run_with_rng_state_2 = torch.ops.higher_order.graphsafe_run_with_rng_state(torch.ops.aten._scaled_dot_product_flash_attention.default, transpose_20, transpose_21, transpose_22, 0.0, True, scale = 0.25, rng_state = fwd_rng_state_2);

See the last 3 args in P2047035404
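
Putting this subthread together, the assertion could look roughly like the sketch below (the exact allowed set depends on what the pass supports):

def check_input_types(args):
    # Tensors are recorded by address, ints/symints may vary per step,
    # and generators come from graphsafe_run_with_rng_state; anything
    # else is unsupported and would silently replay stale values.
    for x in args:
        assert isinstance(
            x, (torch.Tensor, int, torch.SymInt, torch.Generator)
        ), f"unsupported cudagraph input type: {type(x)}"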


self.copy_static_inputs(*args)
self.cudagraph.replay()
return self.output

@eellison:

The persistent input and output are not good for memory, as you've commented.

@BoyuanFeng (author):

Yes, will add in the next PR.

self.args = None
self.output = None

def copy_static_inputs(self, *args):

@eellison:

If any of the static inputs changes, you'll get silent incorrectness. You might consider at least a config to check this.

@BoyuanFeng (author):

Yes, added.
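
A sketch of what such a debug check could look like (flag and attribute names hypothetical):

def check_static_inputs(self, *args):
    # Static inputs are baked into the graph by address; if one moves,
    # replay reads stale memory. Verify addresses when the flag is on.
    if not self.debug_check_static_inputs:
        return
    for i, addr in zip(self.static_input_indices, self.static_input_addresses):
        assert args[i].data_ptr() == addr, (
            f"static input {i} changed address; replay would be silently wrong"
        )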

Comment on lines +52 to +56
(
    _global_dummy_graph,
    _global_graph_pool,
    _global_graph_capture_stream,
) = init_global_graph_pool()

@eellison:

Does this work when backward is on a separate stream? Or is that not an issue?

@BoyuanFeng (author) commented Nov 19, 2025:

IIUC, this is not an issue currently, since fwd and bwd are on the same CUDA stream by default.

cudagraph trees also uses the same graph capture stream for both fwd and bwd:
https://github.com/pytorch/pytorch/blob/7a928397cda89b71c24b0efe9db6df7fb04a46cb/torch/_inductor/cudagraph_trees.py#L1945

run_train.sh (outdated)

# need to turn off expandable segments when using cudagraph, since
# it does not work with cg and nccl yet.
# https://github.com/pytorch/pytorch/issues/158029

@eellison:

Can we turn this off only when using cudagraph?

@BoyuanFeng (author):

Currently it's on by default. When using cudagraph, we need to explicitly turn it off with USE_EXPANDABLE_SEGMENTS=False [other commands].

Reply:

Instead of turning off expandable segments, you can turn off NCCL memory registration, as the issue suggests.

# in joint_graph_module. An explicit gc.collect() is necessary
# to clean up reference cycles.
for part in self.model_parts:
    part.joint_graph_module = None
Contributor:

I think joint_graph_module only exists for compiler toolkit experiments, so for other experiments or training runs, part won't have joint_graph_module.

@BoyuanFeng (author):

We can clean up only when a part has the joint_graph_module.
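
The eventual cleanup (moved into a Trainer subclass in #2064) could look roughly like this sketch:

import gc

def close(self) -> None:
    # Drop the captured graphs before destroy_process_group so NCCL
    # does not hang on resources still held by cudagraphs.
    for part in self.model_parts:
        if hasattr(part, "joint_graph_module"):
            part.joint_graph_module = None
    # joint_graph_module participates in reference cycles; collect
    # explicitly so the graphs are actually freed now.
    gc.collect()
    super().close()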

@yiming0416 (Contributor) left a comment:

experiments part LGTM

@tianyu-l (Contributor) left a comment:

Sounds great! Had a suggestion.

# in joint_graph_module. An explicit gc.collect() is necessary
# to clean up reference cycles.
for part in self.model_parts:
    if hasattr(part, "joint_graph_module"):
@tianyu-l:

joint_graph_module is exclusively used for compiler_toolkit, right? If it can't be made general, let's create a Trainer subclass to override this method. E.g. https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/torchcomms/train.py

Contributor:

@tianyu-l Trainer subclass added in #2064

@yiming0416 added a commit that referenced this pull request Nov 19, 2025:
"Adding CudaGraph pass (#2050) would require some custom logic in Trainer's close() method, so we create a Trainer subclass in compiler toolkit."

@BoyuanFeng merged commit f5e3a84 into main Nov 20, 2025. 5 checks passed.