
Poor bf16 precision in backward torch.asin #162907

@wangtianxia-sjtu

Description


🐛 Describe the bug

On CPU, the backward of torch.asin in bfloat16 differs from the GPU result by up to 2 ulp. The backward scales the incoming gradient by 1/sqrt(1 - x^2); running the identical computation in float32 shows that the GPU result is the more accurate one.

import torch

print(torch.__version__)

def test_asin_backward_bf16(device):
    print(f"{device}:")
    # Build the input from an exact uint16 bit pattern reinterpreted as bfloat16
    # (0x3e84 = 0.2578125).
    tensor_input = torch.tensor([0x3e84], dtype=torch.uint16).view(torch.bfloat16).to(device).requires_grad_(True)
    asin_output = torch.asin(tensor_input)
    # Incoming gradient, also from an exact bit pattern (0x3f0e = 0.5546875).
    external_grad = torch.tensor([0x3f0e], dtype=torch.uint16).view(torch.bfloat16).to(device).requires_grad_(True)
    asin_output.backward(external_grad)
    # Print the raw bit patterns of input, output, incoming grad, and computed grad.
    print(tensor_input.view(torch.uint16))
    print(asin_output.view(torch.uint16))
    print(external_grad.view(torch.uint16))
    print(tensor_input.grad.view(torch.uint16))

test_asin_backward_bf16('cpu')
test_asin_backward_bf16('cuda')

def test_asin_backward_fp32(device):
    print(f"{device}:")
    # Same values as above, as exact float32 bit patterns (0x3e840000 = 0.2578125).
    tensor_input = torch.tensor([0x3e840000], dtype=torch.uint32).view(torch.float32).to(device).requires_grad_(True)
    asin_output = torch.asin(tensor_input)
    # 0x3f0e0000 = 0.5546875.
    external_grad = torch.tensor([0x3f0e0000], dtype=torch.uint32).view(torch.float32).to(device).requires_grad_(True)
    asin_output.backward(external_grad)
    print(tensor_input.view(torch.uint32))
    print(asin_output.view(torch.uint32))
    print(external_grad.view(torch.uint32))
    print(tensor_input.grad.view(torch.uint32))

test_asin_backward_fp32('cpu')
test_asin_backward_fp32('cuda')
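
For readability, the raw bit patterns printed below can be decoded back into float values. A minimal helper (mine, not part of the original report; bits_to_float is a hypothetical name):

def bits_to_float(bits, int_dtype, float_dtype):
    # Reinterpret an integer bit pattern as the float value it encodes.
    return torch.tensor([bits], dtype=int_dtype).view(float_dtype).item()

print(bits_to_float(0x3f14, torch.uint16, torch.bfloat16))  # 0.578125   (cpu bf16 grad)
print(bits_to_float(0x3f12, torch.uint16, torch.bfloat16))  # 0.5703125  (cuda bf16 grad)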

Output

2.8.0+cu126
cpu:
tensor([16004], dtype=torch.uint16)
tensor([16006], dtype=torch.uint16)
tensor([16142], dtype=torch.uint16)
tensor([16148], dtype=torch.uint16) # this is 0x3f14
cuda:
tensor([16004], device='cuda:0', dtype=torch.uint16)
tensor([16006], device='cuda:0', dtype=torch.uint16)
tensor([16142], device='cuda:0', dtype=torch.uint16)
tensor([16146], device='cuda:0', dtype=torch.uint16) # this is 0x3f12
cpu:
tensor([1048838144], dtype=torch.uint32)
tensor([1048936961], dtype=torch.uint32)
tensor([1057882112], dtype=torch.uint32)
tensor([1058207712], dtype=torch.uint32)
cuda:
tensor([1048838144], device='cuda:0', dtype=torch.uint32)
tensor([1048936961], device='cuda:0', dtype=torch.uint32)
tensor([1057882112], device='cuda:0', dtype=torch.uint32)
tensor([1058207712], device='cuda:0', dtype=torch.uint32) # this is 0x3f12f7e0
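
Since d/dx asin(x) = 1/sqrt(1 - x^2), the expected gradient can be cross-checked in float64. A minimal sketch (mine, not from the original report), using the decoded bfloat16 inputs:

import math

x = 0.2578125                      # 0x3e84 as bfloat16
g = 0.5546875                      # 0x3f0e as bfloat16
ref = g / math.sqrt(1.0 - x * x)   # exact backward: grad / sqrt(1 - x^2)
print(ref)                         # ~0.574095

The nearest bfloat16 to the reference is 0x3f13 (0.57421875), and the float32 result 0x3f12f7e0 also rounds to 0x3f13. The cuda result 0x3f12 (0.5703125) sits about 0.97 ulp below the reference, while the cpu result 0x3f14 (0.578125) sits about 1.03 ulp above it, so the gpu result is indeed slightly closer to the true value.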

Versions

Collecting environment information...

Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Mitigation; PTE Inversion
Vulnerability Mds: Vulnerable; SMT Host state unknown
Vulnerability Meltdown: Vulnerable
Vulnerability Mmio stale data: Vulnerable
Vulnerability Reg file data sampling: Not affected
Vulnerability Retbleed: Vulnerable
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
Vulnerability Spectre v2: Vulnerable; IBPB: disabled; STIBP: disabled; PBRSB-eIBRS: Not affected; BHI: Vulnerable
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Vulnerable

Versions of relevant libraries:
[pip3] intel-cmplr-lib-ur==2025.2.1
[pip3] intel-openmp==2025.2.1
[pip3] mkl==2025.2.0
[pip3] numpy==2.0.2
[pip3] nvidia-cublas-cu12==12.6.4.1
[pip3] nvidia-cuda-cupti-cu12==12.6.80
[pip3] nvidia-
[pip3] nvidia-cusolver-cu12==11.7.1.2
[pip3] nvidia-cusparse-cu12==12.5.4.2
[pip3] nvidia-cusparselt-cu12==0.7.1
[pip3] nvidia-nccl-cu12==2.27.3
[pip3] nvidia-nvjitlink-cu12==12.6.85
[pip3] nvidia-nvtx-cu12==12.6.77
[pip3] nvtx==0.2.13
[pip3] optree==0.17.0
[pip3] pynvjitlink-cu12==0.7.0
[pip3] tbb==2022.2.0
[pip3] tcmlib==1.4.0
[pip3] torch==2.8.0+cu126
[pip3] torchao==0.10.0
[pip3] torchaudio==2.8.0+cu126
[pip3] torchdata==0.11.0
[pip3] torchsummary==1.5.1
[pip3] torchtune==0.6.1
[pip3] torchvision==0.23.0+cu126
[pip3] triton==3.4.0
[pip3] umf==0.11.0
[conda] Could not collect

cc @ezyang @albanD @gqchen @nikitaved @soulitzer @Varal7 @xmfan @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168

Labels

    module: autograd - Related to torch.autograd, and the autograd engine in general
    module: cpu - CPU specific problem (e.g., perf, algorithm)
    triaged - This issue has been looked at by a team member, and triaged and prioritized into an appropriate module
