
Poor bf16 precision in backward torch.asin #162907

@wangtianxia-sjtu

Description


🐛 Describe the bug

On CPU, the backward of torch.asin in bfloat16 differs from the GPU result by up to 2 ulp. The backward scales the incoming gradient by 1/sqrt(1 - x^2); running the identical computation in float32 shows that the GPU result is the more accurate one.

import torch

print(torch.__version__)

def test_asin_backward_bf16(device):
    print(f"{device}:")
    # Build the input from an exact uint16 bit pattern reinterpreted as bfloat16
    # (0x3e84 = 0.2578125).
    tensor_input = torch.tensor([0x3e84], dtype=torch.uint16).view(torch.bfloat16).to(device).requires_grad_(True)
    asin_output = torch.asin(tensor_input)
    # Incoming gradient, also from an exact bit pattern (0x3f0e = 0.5546875).
    external_grad = torch.tensor([0x3f0e], dtype=torch.uint16).view(torch.bfloat16).to(device).requires_grad_(True)
    asin_output.backward(external_grad)
    # Print the raw bit patterns of input, output, incoming grad, and computed grad.
    print(tensor_input.view(torch.uint16))
    print(asin_output.view(torch.uint16))
    print(external_grad.view(torch.uint16))
    print(tensor_input.grad.view(torch.uint16))

test_asin_backward_bf16('cpu')
test_asin_backward_bf16('cuda')

def test_asin_backward_fp32(device):
    print(f"{device}:")
    # Same values as above, as exact float32 bit patterns (0x3e840000 = 0.2578125).
    tensor_input = torch.tensor([0x3e840000], dtype=torch.uint32).view(torch.float32).to(device).requires_grad_(True)
    asin_output = torch.asin(tensor_input)
    # 0x3f0e0000 = 0.5546875.
    external_grad = torch.tensor([0x3f0e0000], dtype=torch.uint32).view(torch.float32).to(device).requires_grad_(True)
    asin_output.backward(external_grad)
    print(tensor_input.view(torch.uint32))
    print(asin_output.view(torch.uint32))
    print(external_grad.view(torch.uint32))
    print(tensor_input.grad.view(torch.uint32))

test_asin_backward_fp32('cpu')
test_asin_backward_fp32('cuda')
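
For readability, the raw bit patterns printed below can be decoded back into float values. A minimal helper (mine, not part of the original report; bits_to_float is a hypothetical name):

def bits_to_float(bits, int_dtype, float_dtype):
    # Reinterpret an integer bit pattern as the float value it encodes.
    return torch.tensor([bits], dtype=int_dtype).view(float_dtype).item()

print(bits_to_float(0x3f14, torch.uint16, torch.bfloat16))  # 0.578125   (cpu bf16 grad)
print(bits_to_float(0x3f12, torch.uint16, torch.bfloat16))  # 0.5703125  (cuda bf16 grad)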

Output

2.8.0+cu126
cpu:
tensor([16004], dtype=torch.uint16)
tensor([16006], dtype=torch.uint16)
tensor([16142], dtype=torch.uint16)
tensor([16148], dtype=torch.uint16) # this is 0x3f14
cuda:
tensor([16004], device='cuda:0', dtype=torch.uint16)
tensor([16006], device='cuda:0', dtype=torch.uint16)
tensor([16142], device='cuda:0', dtype=torch.uint16)
tensor([16146], device='cuda:0', dtype=torch.uint16) # this is 0x3f12
cpu:
tensor([1048838144], dtype=torch.uint32)
tensor([1048936961], dtype=torch.uint32)
tensor([1057882112], dtype=torch.uint32)
tensor([1058207712], dtype=torch.uint32)
cuda:
tensor([1048838144], device='cuda:0', dtype=torch.uint32)
tensor([1048936961], device='cuda:0', dtype=torch.uint32)
tensor([1057882112], device='cuda:0', dtype=torch.uint32)
tensor([1058207712], device='cuda:0', dtype=torch.uint32) # this is 0x3f12f7e0
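
Since d/dx asin(x) = 1/sqrt(1 - x^2), the expected gradient can be cross-checked in float64. A minimal sketch (mine, not from the original report), using the decoded bfloat16 inputs:

import math

x = 0.2578125                      # 0x3e84 as bfloat16
g = 0.5546875                      # 0x3f0e as bfloat16
ref = g / math.sqrt(1.0 - x * x)   # exact backward: grad / sqrt(1 - x^2)
print(ref)                         # ~0.574095

The nearest bfloat16 to the reference is 0x3f13 (0.57421875), and the float32 result 0x3f12f7e0 also rounds to 0x3f13. The cuda result 0x3f12 (0.5703125) sits about 0.97 ulp below the reference, while the cpu result 0x3f14 (0.578125) sits about 1.03 ulp above it, so the gpu result is indeed slightly closer to the true value.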

Versions

Collecting environment information...

Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable; SMT Host state unknown
Vulnerability Meltdown: Vulnerable
Vulnerability Mmio stale data: Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable

Versions of relevant libraries:
[pip3] intel-cmplr-lib-ur==2025.2.1
[pip3] intel-openmp==2025.2.1
[pip3] mkl==2025.2.0
[pip3] numpy==2.0.2
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] nvtx==0.2.13
[pip3] optree==0.17.0
[pip3] pynvjitlink-cu12==0.7.0
[pip3] tbb==2022.2.0
[pip3] tcmlib==1.4.0
[pip3] torch==2.8.0+cu126
[pip3] torchao==0.10.0
[pip3] torchaudio==2.8.0+cu126
[pip3] torchdata==0.11.0
[pip3] torchsummary==1.5.1
[pip3] torchtune==0.6.1
[pip3] torchvision==0.23.0+cu126
[pip3] triton==3.4.0
[pip3] umf==0.11.0
[conda] Could not collect

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @xmfan @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168

Labels

    module: autograd - Related to torch.autograd, and the autograd engine in general
    module: cpu - CPU specific problem (e.g., perf, algorithm)
    triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
