Different numerical results during training with CUDA graphs #119873
Comments
There are some safety conditions which must be fulfilled for CUDA graphs to give valid results. Hypothetically, torch.compile on PT2 with cudagraphs should automatically test these safety conditions. If you're willing to take a detour and make your code torch.compile'able, it might tell you what the problem is.
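A minimal sketch of what that detour could look like, using a toy stand-in model rather than the actual Shard-wrapped decoder layers:

```python
import torch
import torch.nn as nn

# Toy stand-in for the sharded decoder stack; shapes are placeholders.
model = nn.Sequential(nn.Linear(1024, 1024), nn.GELU(), nn.Linear(1024, 1024)).cuda()

# mode="reduce-overhead" asks torch.compile to apply CUDA graphs where it
# considers them safe; if the model breaks a cudagraph assumption, PT2 should
# report it or fall back rather than silently produce wrong numbers.
compiled = torch.compile(model, mode="reduce-overhead")

x = torch.randn(8, 1024, device="cuda", requires_grad=True)
for _ in range(3):  # a few iterations so the graphs actually get recorded and replayed
    loss = compiled(x).sum()
    loss.backward()
```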
I did try torch.compile in both regular and 'reduce-overhead' (CUDA graphs) mode on the Shard class. In both cases, I got a rather lengthy error; I can post those later today. Is there anything obviously wrong with what I'm doing? Essentially the structure is this:
Well, it's not clear to me how you can "do the backward pass normally", because ordinarily when you run a forward pass, we set up an autograd graph on the CPU that says how to do the backward. If you replay a CUDA graph, though, that CPU work is skipped (which is the point of cudagraphs), and now we can't run backward. To cudagraph, you need to cudagraph both forward and backward (which is what PT2 would do for you). Not so sure about the multi-GPU interaction, though.
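For concreteness, one way to graph both passes of a submodule is torch.cuda.make_graphed_callables, which records the forward and the backward of a callable into CUDA graphs. A sketch with a toy module and placeholder shapes (not the actual Shard class):

```python
import torch
import torch.nn as nn

# Toy stand-in for one shard of decoder layers.
shard = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 1024)).cuda()

# Sample input with the static shape and requires_grad state the real inputs will have.
# make_graphed_callables warms up, then captures both the forward and backward graphs.
sample = torch.randn(8, 1024, device="cuda", requires_grad=True)
graphed_shard = torch.cuda.make_graphed_callables(shard, (sample,))

opt = torch.optim.SGD(shard.parameters(), lr=1e-3)
for _ in range(5):
    x = torch.randn(8, 1024, device="cuda", requires_grad=True)
    loss = graphed_shard(x).sum()   # replays the captured forward graph
    loss.backward()                 # replays the captured backward graph
    opt.step()
    opt.zero_grad(set_to_none=True)
```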
So if I'm understanding you correctly, I need to capture both the forward and backward pass in the CUDA graph for CUDA graph training to work? How would this work if I need to shard my model over multiple GPUs? CUDA graphs don't support capturing across multiple GPUs, so I'll need to split the model over multiple graphs, but that prevents capturing the backward pass in the same graph (there's not even one graph anymore)?
I know you're not going to like this answer... but if you use DDP instead of DP you will not have this problem :P
I'm not using DataParallel though? The shard wrapper thing is just a quick wrapper class I wrote to manually do sharding. I did consider FSDP - do you know if that works with CUDA graphs correctly?
cc @awgu, it feels like it could in principle but I don't know if we've done it in practice
@ezyang Could you please point to some documentation or blog post that discusses the capabilities of
🐛 Describe the bug
I am trying to wrap a sequence of Llama decoder layers (as implemented by Hugging Face) with a CUDA graph to speed up training. Without CUDA graphs, the loss decreases normally. With CUDA graphs, the loss decreases at a slower rate and does not converge to the same solution. I have some custom kernels in my linear layer that decompress quantized weights, but I verified these are not the source of the issue by pre-manifesting the decompressed weights and just calling x @ W.T in the linear layer.
Is there a specific way that CUDA graphs need to be set up to support training?
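For reference, the whole-network training capture pattern described in the PyTorch CUDA graphs documentation looks roughly like the sketch below (toy model and shapes, not the actual Llama setup): the whole step, including backward and the optimizer update, is captured once against static tensors and then replayed.

```python
import torch
import torch.nn as nn

# Toy stand-in model with placeholder shapes.
model = nn.Sequential(nn.Linear(512, 512), nn.ReLU(), nn.Linear(512, 512)).cuda()
loss_fn = nn.MSELoss()
opt = torch.optim.SGD(model.parameters(), lr=0.1)

# Static tensors that the captured graph will read from.
static_x = torch.randn(16, 512, device="cuda")
static_y = torch.randn(16, 512, device="cuda")

# Warmup on a side stream so lazily-allocated workspaces exist before capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        opt.zero_grad(set_to_none=True)
        loss_fn(model(static_x), static_y).backward()
        opt.step()
torch.cuda.current_stream().wait_stream(s)

# Capture one full training step: forward, backward, and optimizer update.
g = torch.cuda.CUDAGraph()
opt.zero_grad(set_to_none=True)
with torch.cuda.graph(g):
    static_loss = loss_fn(model(static_x), static_y)
    static_loss.backward()
    opt.step()

# Training loop: copy fresh data into the static tensors, then replay the step.
for _ in range(10):
    static_x.copy_(torch.randn(16, 512, device="cuda"))
    static_y.copy_(torch.randn(16, 512, device="cuda"))
    g.replay()  # no zero_grad needed; the captured backward refills .grad in place
```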
Details: to reproduce, run

python -m quantize_llama.finetune_e2e_llama --base_model meta-llama/Llama-2-7b-hf --hf_path relaxml/Llama-2-7b-E8P-2Bit --devset_size 8 --ft_valid_size 4 --ft_epochs 2 --ft_bs 1 --ctx_size 4096 --ft_update_freq 2 --ckpt_path /tmp/test --batch_size 4

with and without --ft_train_mode, which should give the same results. This happens when using only one GPU as well, so it should not be due to using more than one GPU.

Versions
Collecting environment information...
PyTorch version: 2.1.1+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A
OS: Ubuntu 20.04.5 LTS (x86_64)
GCC version: (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.31
Python version: 3.11.5 (main, Sep 11 2023, 13:54:46) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.4.0-139-generic-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration:
GPU 0: NVIDIA A100 80GB PCIe
GPU 1: NVIDIA A100 80GB PCIe
GPU 2: NVIDIA A100 80GB PCIe
GPU 3: NVIDIA A100 80GB PCIe
GPU 4: NVIDIA A100 80GB PCIe
GPU 5: NVIDIA A100 80GB PCIe
GPU 6: NVIDIA A100 80GB PCIe
GPU 7: NVIDIA A100 80GB PCIe
Nvidia driver version: 525.105.17
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 46 bits physical, 57 bits virtual
CPU(s): 96
On-line CPU(s) list: 0-95
Thread(s) per core: 2
Core(s) per socket: 6
Socket(s): 8
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 106
Model name: Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
Stepping: 6
CPU MHz: 1999.982
BogoMIPS: 3999.96
Virtualization: VT-x
Hypervisor vendor: KVM
Virtualization type: full
L1d cache: 1.5 MiB
L1i cache: 1.5 MiB
L2 cache: 192 MiB
L3 cache: 128 MiB
NUMA node0 CPU(s): 0-47
NUMA node1 CPU(s): 48-95
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Vulnerable: Clear CPU buffers attempted, no microcode; SMT Host state unknown
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Mitigation; TSX disabled
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology cpuid tsc_known_freq pni pclmulqdq vmx ssse3 fma cx16 pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves wbnoinvd arat avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid md_clear arch_capabilities
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] torch==2.1.1
[pip3] torchaudio==2.1.1
[pip3] torchvision==0.16.1
[pip3] triton==2.1.0
[conda] numpy 1.26.3 pypi_0 pypi
[conda] torch 2.1.1 pypi_0 pypi
[conda] torchaudio 2.1.1 pypi_0 pypi
[conda] torchvision 0.16.1 pypi_0 pypi
[conda] triton 2.1.0 pypi_0 pypi
cc @mcarilli @ezyang