
use bfloat16 on nvidia V100 GPU #124996

Closed
bugm opened this issue Apr 26, 2024 · 2 comments
Labels
module: bfloat16, module: cuda (Related to torch.cuda, and CUDA support in general), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments


bugm commented Apr 26, 2024

🐛 Describe the bug

Hello!
It is said that bfloat16 is only supported on GPUs with a compute capability of at least 8.0, which means the NVIDIA V100 (compute capability 7.0) should not support bfloat16.

But I have tested the code below on a V100 machine and it runs successfully.

import torch
a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
c = torch.matmul(a, b)
print(c.dtype)
print(c.device)
print(torch.cuda.is_bf16_supported())

While the initialization and the operation succeed, torch.cuda.is_bf16_supported() here returns False:

torch.bfloat16
cuda:0
False

So I want to know what the situation is here. Thanks!

Versions

PyTorch version: 2.2.2+cu121
Is debug build: False
CUDA used to build PyTorch: 12.1
ROCM used to build PyTorch: N/A

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: Could not collect
Libc version: glibc-2.35

Python version: 3.12.2 | packaged by Anaconda, Inc. | (main, Feb 27 2024, 17:35:02) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-5.15.0-94-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: 12.1.105
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: Tesla V100-SXM2-32GB
Nvidia driver version: 535.154.05

cc @ptrblck

malfet added the module: bfloat16 and module: cuda (Related to torch.cuda, and CUDA support in general) labels Apr 26, 2024

malfet commented Apr 26, 2024

I think the distinction here is "supported by software" (i.e. emulation) vs "supported by hardware". torch.cuda.is_bf16_supported() tells you that your GPU hardware does not have native bf16 instructions, but software can easily emulate some bf16 operations by widening the input values and then running the computation in float32, though it will be slower.
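
For illustration only (not PyTorch's internal implementation), here is a minimal sketch of the two notions of support: the hardware check can be approximated with torch.cuda.get_device_capability(), and "software support" amounts to widening bf16 inputs to float32, computing, and rounding back.

import torch

# Hardware side: the V100 reports compute capability (7, 0); native bf16 requires (8, 0)+.
print(torch.cuda.get_device_capability(0))   # (7, 0) on a V100
print(torch.cuda.is_bf16_supported())        # False: no native bf16 instructions

# Software side: emulate by widening bf16 to float32, computing, and rounding back.
a = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
b = torch.randn(3, 3, dtype=torch.bfloat16, device="cuda")
c_emulated = (a.float() @ b.float()).to(torch.bfloat16)  # explicit float32 compute
c_direct = torch.matmul(a, b)                            # roughly what happens for you on this GPU
print(torch.allclose(c_emulated.float(), c_direct.float(), atol=1e-2))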

mikaylagawarecki added the triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module) label Apr 26, 2024

bugm commented Apr 29, 2024

I think the distinction here is "supported by software" (i.e. emulation) vs "supported by hardware". torch.cuda.is_bf16_supported() tells you that your GPU hardware does not have native bf16 instructions, but software can easily emulate some bf16 operations by widening the input values and then running the computation in float32, though it will be slower.

Thanks for your answer! I have tried bfloat16 mixed-precision training on a V100 GPU, and the time cost is almost the same as full fp32 training (even a little slower).
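
For reference, a minimal sketch of such a bf16 mixed-precision loop (hypothetical toy model, not the actual training code from this report) would look like the following; on a V100 the bf16 math is emulated via float32, so little or no speedup is expected:

import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x = torch.randn(64, 1024, device="cuda")
target = torch.randn(64, 1024, device="cuda")

for _ in range(100):
    optimizer.zero_grad()
    # autocast runs eligible ops in bfloat16; no GradScaler is needed for bf16
    with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
        loss = torch.nn.functional.mse_loss(model(x), target)
    loss.backward()
    optimizer.step()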

bugm closed this as completed Apr 29, 2024