Backpropagation to flex_attention `score_mod` biases fails based on presence of graph breaks #162228

### 🐛 Describe the bug

The following minimal code snippet:
```python
import torch
from torch.nn.attention.flex_attention import flex_attention

@torch.compile()
def test(x, y):
    # Materialize a bias matrix
    B, L, device = x.shape[0], x.shape[1], x.device
    b = torch.arange(B, device=device, dtype=torch.long).view(B, 1, 1)
    q_idx = torch.arange(L, device=device, dtype=torch.long).view(1, L, 1)
    kv_idx = torch.arange(L, device=device, dtype=torch.long).view(1, 1, L)
    bias_mat = y[b, q_idx] + y[b, kv_idx] # (B, L, L)

    # Dummy score_mod retrieving bias values
    def score_mod(score, b, h, q_idx, kv_idx):
        return score + bias_mat[b, q_idx, kv_idx]

    x_ = x[:, :, None].repeat(1, 1, 16, 1)
    # torch._dynamo.graph_break()
    return flex_attention(x_, x_, x_, score_mod=score_mod)


DEVICE = "cuda"
B, L, D = 2, 16, 64

x = torch.randn(B, L, D, device=DEVICE, requires_grad=True)
y = torch.randn(B, L, device=DEVICE, requires_grad=True)

test(x, y).mean().backward()  # backward() returns None, so there is nothing to assign

print(torch.__version__)
print(f"x: {(x.grad is not None) and (x.grad.norm() > 0)}, y: {(y.grad is not None) and (y.grad.norm() > 0)}")
assert x.grad.norm() > 0
assert y.grad.norm() > 0
```
fails to backpropagate gradients into `y`: `y.grad` stays `None`. I can reproduce this on the current stable PyTorch 2.8.0-cu12.8 installed via pip, on the current nightly, and on `2.8.0a0+34c6371d24.nv25.08` via the `nvidia/pytorch:25.08-py3` container, across two differently configured systems with H100s and H200s.

Curiously, un-commenting the `torch._dynamo.graph_break()` line makes the code work as expected.

The code above is a minified version of a longer snippet that showed the same behavior. Other variations of this snippet also change (seemingly at random, to me) whether gradients are propagated to `y`. Gradients are always successfully propagated to `x`.
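For comparison, the expected behavior can be checked against a plain eager implementation that does not go through `flex_attention` at all. The following is a minimal sketch, not the code from the report: `reference_attention` is a hypothetical helper, it assumes ordinary scaled dot-product attention with the bias added to the scores, and it uses the `(B, H, L, D)` layout that `flex_attention` documents (the snippet above passes a `(B, L, 16, D)` tensor, so there the repeated dimension acts as the sequence axis). It runs on CPU.

```python
import math

import torch


def reference_attention(x, y, num_heads=16):
    """Eager softmax attention with the additive bias y[b, q] + y[b, kv].

    Hypothetical reference helper: assumes a (B, H, L, D) layout and
    q = k = v, which is an assumption of this sketch.
    """
    B, L, D = x.shape
    # Same bias construction as in the report, written with broadcasting: (B, L, L)
    bias = y[:, :, None] + y[:, None, :]
    # Share one tensor across all heads: (B, H, L, D)
    qkv = x[:, None].expand(B, num_heads, L, D)
    # Scaled dot-product scores plus the bias, broadcast over heads: (B, H, L, L)
    scores = qkv @ qkv.transpose(-1, -2) / math.sqrt(D)
    scores = scores + bias[:, None]
    return torch.softmax(scores, dim=-1) @ qkv


torch.manual_seed(0)
x = torch.randn(2, 16, 64, requires_grad=True)
y = torch.randn(2, 16, requires_grad=True)

reference_attention(x, y).mean().backward()
print(f"x.grad populated: {x.grad is not None}, y.grad populated: {y.grad is not None}")
```

In this reference, `y.grad` is populated after `backward()`, which is the behavior the compiled `flex_attention` path would be expected to match.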

Some further things I've tried (in this case in the aforementioned NGC container):
- `backend="eager"`: works
- `backend="aot_eager"`: fails
- `backend="aot_eager_decomp_partition"`: fails
- `backend="inductor"`: fails

### Error logs

Running the script above produces:
```
2.8.0a0+34c6371d24.nv25.08
x: True, y: False
Traceback (most recent call last):
  File "/home/hpc/v104dd/v104dd11/dev/diffusion-video-pixel-diffusion-fsdp2/weird_shit.py", line 33, in <module>
    assert y.grad.norm() > 0
           ^^^^^^^^^^^
AttributeError: 'NoneType' object has no attribute 'norm'
```
indicating that gradients are (wrongly) not backpropagated to the bias term. Un-commenting the `torch._dynamo.graph_break()` line fixes this error.
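Note that the traceback is an `AttributeError`, not an `AssertionError`: `y.grad` is `None`, so the assertion never gets as far as comparing a norm. A None-safe check makes the three possible outcomes explicit. This is a minimal sketch; `grad_status` is a hypothetical helper name, not part of PyTorch.

```python
import torch


def grad_status(t):
    """Classify a leaf tensor's gradient as 'missing', 'zero', or 'nonzero'."""
    if t.grad is None:
        # backward() never reached this tensor (the failure mode in this report)
        return "missing"
    if torch.count_nonzero(t.grad).item() == 0:
        # A gradient was accumulated, but every entry is exactly zero
        return "zero"
    return "nonzero"


a = torch.ones(3, requires_grad=True)
b = torch.ones(3, requires_grad=True)
c = torch.ones(3, requires_grad=True)

# `a` gets a real gradient, `c` gets an all-zero one, `b` is not part of the graph.
((a * 2).sum() + (c * 0).sum()).backward()
print(grad_status(a), grad_status(b), grad_status(c))
```

Distinguishing "missing" from "zero" matters here because the failure is a gradient that is never produced, not one that is numerically zero.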

### Versions

Version with stable PyTorch 2.8:
```
PyTorch version: 2.8.0+cu128
Is debug build: False
CUDA used to build PyTorch: 12.8
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.5 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04.2) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.11.11 (main, Dec 11 2024, 16:28:39) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-130-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.8.93
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H200
Nvidia driver version: 550.144.03
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_adv_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_cnn_train.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_infer.so.8.9.7
/usr/lib/x86_64-linux-gnu/libcudnn_ops_train.so.8.9.7
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                         x86_64
CPU op-mode(s):                       32-bit, 64-bit
Address sizes:                        52 bits physical, 57 bits virtual
Byte Order:                           Little Endian
CPU(s):                               384
On-line CPU(s) list:                  0-383
Vendor ID:                            AuthenticAMD
Model name:                           AMD EPYC 9654 96-Core Processor
CPU family:                           25
Model:                                17
Thread(s) per core:                   2
Core(s) per socket:                   96
Socket(s):                            2
Stepping:                             1
Frequency boost:                      enabled
CPU max MHz:                          3707.8120
CPU min MHz:                          1500.0000
BogoMIPS:                             4800.14
Flags:                                fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 invpcid_single hw_pstate ssbd mba ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                       AMD-V
L1d cache:                            6 MiB (192 instances)
L1i cache:                            6 MiB (192 instances)
L2 cache:                             192 MiB (192 instances)
L3 cache:                             768 MiB (24 instances)
NUMA node(s):                         2
NUMA node0 CPU(s):                    0-95,192-287
NUMA node1 CPU(s):                    96-191,288-383
Vulnerability Gather data sampling:   Not affected
Vulnerability Itlb multihit:          Not affected
Vulnerability L1tf:                   Not affected
Vulnerability Mds:                    Not affected
Vulnerability Meltdown:               Not affected
Vulnerability Mmio stale data:        Not affected
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed:               Not affected
Vulnerability Spec rstack overflow:   Mitigation; safe RET
Vulnerability Spec store bypass:      Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:             Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:             Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP always-on; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                  Not affected
Vulnerability Tsx async abort:        Not affected

Versions of relevant libraries:
[pip3] ema-pytorch==0.7.7
[pip3] msgpack-numpy==0.4.8
[pip3] mypy-extensions==1.0.0
[pip3] numpy==2.3.2
[pip3] nvidia-cublas-cu12==12.8.4.1
[pip3] nvidia-cuda-cupti-cu12==12.8.90
[pip3] nvidia-cuda-nvrtc-cu12==12.8.93
[pip3] nvidia-cuda-runtime-cu12==12.8.90
[pip3] nvidia-cudnn-cu12==9.10.2.21
[pip3] nvidia-cufft-cu12==11.3.3.83
[pip3] nvidia-curand-cu12==10.3.9.90
[pip3] nvidia-cusolver-cu12==11.7.3.90
[pip3] nvidia-cusparse-cu12==12.5.8.93
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.8.93
[pip3] nvidia-nvtx-cu12==12.8.90
[pip3] nvtx==0.2.11
[pip3] open_clip_torch==3.1.0
[pip3] pynvjitlink-cu12==0.5.2
[pip3] pytorch-lightning==2.5.3
[pip3] pytorch-triton==3.4.0+gitf7888497
[pip3] torch==2.8.0
[pip3] torchcodec==0.6.0
[pip3] torchdata==0.11.0
[pip3] torchmetrics==1.8.1
[pip3] torchtitan==0.1.0
[pip3] torchvision==0.23.0
[pip3] triton==3.4.0
[conda] cuda-cudart               12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-cudart-dev           12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-cudart-dev_linux-64  12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-cudart-static        12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-cudart-static_linux-64 12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-cudart_linux-64      12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-cupti                12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-cupti-dev            12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-libraries            12.8.1                        0    nvidia/label/cuda-12.8.1
[conda] cuda-libraries-dev        12.8.1                        0    nvidia/label/cuda-12.8.1
[conda] cuda-nvrtc                12.8.93                       0    nvidia/label/cuda-12.8.1
[conda] cuda-nvrtc-dev            12.8.93                       0    nvidia/label/cuda-12.8.1
[conda] cuda-nvtx                 12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-opencl               12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] cuda-opencl-dev           12.8.90                       0    nvidia/label/cuda-12.8.1
[conda] ema-pytorch               0.7.7                    pypi_0    pypi
[conda] libcublas                 12.8.4.1                      0    nvidia/label/cuda-12.8.1
[conda] libcublas-dev             12.8.4.1                      0    nvidia/label/cuda-12.8.1
[conda] libcufft                  11.3.3.83                     0    nvidia/label/cuda-12.8.1
[conda] libcufft-dev              11.3.3.83                     0    nvidia/label/cuda-12.8.1
[conda] libcurand                 10.3.9.90                     0    nvidia/label/cuda-12.8.1
[conda] libcurand-dev             10.3.9.90                     0    nvidia/label/cuda-12.8.1
[conda] libcusolver               11.7.3.90                     0    nvidia/label/cuda-12.8.1
[conda] libcusolver-dev           11.7.3.90                     0    nvidia/label/cuda-12.8.1
[conda] libcusparse               12.5.8.93                     0    nvidia/label/cuda-12.8.1
[conda] libcusparse-dev           12.5.8.93                     0    nvidia/label/cuda-12.8.1
[conda] libnvjitlink              12.8.93                       1    nvidia/label/cuda-12.8.1
[conda] libnvjitlink-dev          12.8.93                       1    nvidia/label/cuda-12.8.1
[conda] msgpack-numpy             0.4.8                    pypi_0    pypi
[conda] numpy                     2.3.2                    pypi_0    pypi
[conda] nvidia-cublas-cu12        12.8.4.1                 pypi_0    pypi
[conda] nvidia-cuda-cupti-cu12    12.8.90                  pypi_0    pypi
[conda] nvidia-cuda-nvrtc-cu12    12.8.93                  pypi_0    pypi
[conda] nvidia-cuda-runtime-cu12  12.8.90                  pypi_0    pypi
[conda] nvidia-cudnn-cu12         9.10.2.21                pypi_0    pypi
[conda] nvidia-cufft-cu12         11.3.3.83                pypi_0    pypi
[conda] nvidia-curand-cu12        10.3.9.90                pypi_0    pypi
[conda] nvidia-cusolver-cu12      11.7.3.90                pypi_0    pypi
[conda] nvidia-cusparse-cu12      12.5.8.93                pypi_0    pypi
[conda] nvidia-cusparselt-cu12    0.7.1                    pypi_0    pypi
[conda] nvidia-nccl-cu12          2.27.3                   pypi_0    pypi
[conda] nvidia-nvjitlink-cu12     12.8.93                  pypi_0    pypi
[conda] nvidia-nvtx-cu12          12.8.90                  pypi_0    pypi
[conda] nvtx                      0.2.11                   pypi_0    pypi
[conda] open-clip-torch           3.1.0                    pypi_0    pypi
[conda] pynvjitlink-cu12          0.5.2                    pypi_0    pypi
[conda] pytorch-lightning         2.5.3                    pypi_0    pypi
[conda] pytorch-triton            3.4.0+gitf7888497          pypi_0    pypi
[conda] torch                     2.8.0                    pypi_0    pypi
[conda] torchcodec                0.6.0                    pypi_0    pypi
[conda] torchdata                 0.11.0                   pypi_0    pypi
[conda] torchmetrics              1.8.1                    pypi_0    pypi
[conda] torchtitan                0.1.0                    pypi_0    pypi
[conda] torchvision               0.23.0                   pypi_0    pypi
[conda] triton                    3.4.0                    pypi_0    pypi
```

Version with NGC Container:
```
PyTorch version: 2.8.0a0+34c6371d24.nv25.08
Is debug build: False
CUDA used to build PyTorch: 13.0
ROCM used to build PyTorch: N/A

OS: Ubuntu 24.04.2 LTS (x86_64)
GCC version: (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0
Clang version: Could not collect
CMake version: version 3.31.6
Libc version: glibc-2.39

Python version: 3.12.3 (main, Jun 18 2025, 17:59:45) [GCC 13.3.0] (64-bit runtime)
Python platform: Linux-5.14.0-570.35.1.el9_6.x86_64-x86_64-with-glibc2.39
Is CUDA available: True
CUDA runtime version: 13.0.48
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA H100
Nvidia driver version: 580.65.06
cuDNN version: Probably one of the following:
/usr/lib/x86_64-linux-gnu/libcudnn.so.9.12.0
/usr/lib/x86_64-linux-gnu/libcudnn_adv.so.9.12.0
/usr/lib/x86_64-linux-gnu/libcudnn_cnn.so.9.12.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_precompiled.so.9.12.0
/usr/lib/x86_64-linux-gnu/libcudnn_engines_runtime_compiled.so.9.12.0
/usr/lib/x86_64-linux-gnu/libcudnn_graph.so.9.12.0
/usr/lib/x86_64-linux-gnu/libcudnn_heuristic.so.9.12.0
/usr/lib/x86_64-linux-gnu/libcudnn_ops.so.9.12.0
Is XPU available: False
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True

CPU:
Architecture:                            x86_64
CPU op-mode(s):                          32-bit, 64-bit
Address sizes:                           52 bits physical, 57 bits virtual
Byte Order:                              Little Endian
CPU(s):                                  128
On-line CPU(s) list:                     0-127
Vendor ID:                               AuthenticAMD
Model name:                              AMD EPYC 9554 64-Core Processor
CPU family:                              25
Model:                                   17
Thread(s) per core:                      1
Core(s) per socket:                      64
Socket(s):                               2
Stepping:                                1
Frequency boost:                         enabled
CPU(s) scaling MHz:                      93%
CPU max MHz:                             3762.9880
CPU min MHz:                             1500.0000
BogoMIPS:                                6200.54
Flags:                                   fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good amd_lbr_v2 nopl xtopology nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 pcid sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba perfmon_v2 ibrs ibpb stibp ibrs_enhanced vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd amd_ppin cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif x2avic v_spec_ctrl vnmi avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq la57 rdpid overflow_recov succor smca fsrm flush_l1d debug_swap
L1d cache:                               4 MiB (128 instances)
L1i cache:                               4 MiB (128 instances)
L2 cache:                                128 MiB (128 instances)
L3 cache:                                512 MiB (16 instances)
NUMA node(s):                            8
NUMA node0 CPU(s):                       0-15
NUMA node1 CPU(s):                       16-31
NUMA node2 CPU(s):                       32-47
NUMA node3 CPU(s):                       48-63
NUMA node4 CPU(s):                       64-79
NUMA node5 CPU(s):                       80-95
NUMA node6 CPU(s):                       96-111
NUMA node7 CPU(s):                       112-127
Vulnerability Gather data sampling:      Not affected
Vulnerability Indirect target selection: Not affected
Vulnerability Itlb multihit:             Not affected
Vulnerability L1tf:                      Not affected
Vulnerability Mds:                       Not affected
Vulnerability Meltdown:                  Not affected
Vulnerability Mmio stale data:           Not affected
Vulnerability Reg file data sampling:    Not affected
Vulnerability Retbleed:                  Not affected
Vulnerability Spec rstack overflow:      Mitigation; Safe RET
Vulnerability Spec store bypass:         Mitigation; Speculative Store Bypass disabled via prctl
Vulnerability Spectre v1:                Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:                Mitigation; Enhanced / Automatic IBRS; IBPB conditional; STIBP disabled; RSB filling; PBRSB-eIBRS Not affected; BHI Not affected
Vulnerability Srbds:                     Not affected
Vulnerability Tsx async abort:           Not affected

Versions of relevant libraries:
[pip3] ema-pytorch==0.7.7
[pip3] intel-openmp==2021.4.0
[pip3] mkl==2021.1.1
[pip3] mkl-devel==2021.1.1
[pip3] mkl-include==2021.1.1
[pip3] mypy_extensions==1.1.0
[pip3] numpy==1.26.4
[pip3] nvidia-cudnn-frontend==1.13.0
[pip3] nvtx==0.2.11
[pip3] onnx==1.18.0
[pip3] optree==0.17.0
[pip3] pynvjitlink==0.7.0
[pip3] pytorch-triton==3.3.1+gitc8757738
[pip3] tbb==2021.13.1
[pip3] torch==2.8.0a0+34c6371d24.nv25.8
[pip3] torch_tensorrt==2.8.0a0
[pip3] torchao==0.12.0+git
[pip3] torchdata==0.11.0
[pip3] torchdiffeq==0.2.5
[pip3] torchprofile==0.0.4
[pip3] torchtitan==0.1.0
[pip3] torchvision==0.23.0a0+428a54c9
[conda] Could not collect
```

cc @ezyang @gchanan @zou3519 @kadeng @msaroufim @chauhang @penguinwu @bdhirsh
