RuntimeError: cutlassF: no kernel found to launch! #49
I have the same error.
Same error.
My conda environment is as below. GPU: RTX 5000
Name Version Build Channel
cc: @drisspg
Can you try using the patch release, or nightly?
@drisspg #46 (comment). @merveermann says they're using the nightly, I believe.
So this error is being thrown on nightly for these devices: V100, RTX 5000. Also, is it possible to give example inputs to SDPA that are causing this error to be thrown? My hunch is that compile is doing some memory-planning optimizations that cause the alignment check here to fail: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/attention.cu#L1023-L1027
It seems your GPU does not support bf16; change all torch.bfloat16 usages.
@drisspg I tested on a V100. Both eager and compiled run into the same error. I think the issue is that mem_eff_attention doesn't support bf16 on sm < 80: https://github.com/pytorch/pytorch/blob/main/aten/src/ATen/native/transformers/cuda/mem_eff_attention/kernels/cutlassF.h#L286 I tested with float16 and it works. Shall we default gpt-fast to float16 for V100 and under?
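A minimal sketch of the dtype gating this suggests (a hypothetical helper, not part of gpt-fast; in practice the capability tuple would come from `torch.cuda.get_device_capability()`):

```python
def pick_default_dtype(capability: tuple) -> str:
    """Pick a default SDPA compute dtype for a CUDA device.

    mem_eff_attention's cutlassF kernels only support bfloat16 on
    sm80 (Ampere) and newer, so older devices such as the V100 (sm70)
    or RTX 5000 (sm75) should fall back to float16.
    """
    major, _minor = capability
    return "bfloat16" if major >= 8 else "float16"

print(pick_default_dtype((7, 0)))  # V100 -> float16
print(pick_default_dtype((8, 0)))  # A100 -> bfloat16
```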
Ohh @yifuwang, thank you, that is a great catch. I will put up a PR right now to fix this in PyTorch.
# Summary Fix bug in detecting mem-eff capability for CUDA devices below sm80: pytorch-labs/gpt-fast#49 Pull Request resolved: #116026 Approved by: https://github.com/janeyx99
Thank you all. After changing all torch.bfloat16 to torch.float32, running with the unquantized model works well:

root@md:/home/projects/gpt-fast# CUDA_VISIBLE_DEVICES=0 python3 generate.py --compile --checkpoint_path /models/huggingface_models/meta-Llama-2-7b-hf/model_int8.pth --max_new_tokens 100
Loading model ...
Using int8 weight-only quantization!
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Time to load model: 2.52 seconds
[2023-12-19 00:54:26,247] [0/0] torch._dynamo.output_graph: [WARNING] nn.Module state_dict and backward hooks are not yet supported by torch.compile, but were detected in your model and will be silently ignored. See https://pytorch.org/docs/master/compile/nn-module.html for more information and limitations.
Compilation time: 101.21 seconds
Hello, my name is ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇
Time for inference 1: 4.87 sec total, 20.53 tokens/sec
Bandwidth achieved: 141.08 GB/s
Hello, my name is ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇
Time for inference 2: 4.87 sec total, 20.55 tokens/sec
Bandwidth achieved: 141.25 GB/s
Hello, my name is ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇
Time for inference 3: 4.87 sec total, 20.55 tokens/sec
Bandwidth achieved: 141.24 GB/s
Hello, my name is ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇
Time for inference 4: 4.87 sec total, 20.55 tokens/sec
Bandwidth achieved: 141.22 GB/s
Hello, my name is ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇ ⁇
Time for inference 5: 4.87 sec total, 20.55 tokens/sec
Bandwidth achieved: 141.24 GB/s
==========
Average tokens/sec: 20.55
Memory used: 8.01 GB
@goodboyyes2009 Did you re-run quantize.py after changing torch.bfloat16 to torch.float32?
@VendaCino Oh, sorry. I did re-run quantize.py, but I changed all torch.bfloat16 to torch.float16.
The '⁇ ⁇ ⁇' output is because of the tensor values; hope this information can help to trace the problem.

Update: it depends on the model weights, so when I test with TinyLlama it works well, but it fails when I use model.pth. So I guess something is wrong in WeightOnlyInt8Linear:

class WeightOnlyInt8Linear(torch.nn.Module):
    ...
    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return F.linear(input, self.weight.to(dtype=input.dtype)) * self.scales  # precision is lost here

Change it to

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        return (F.linear(input.to(dtype=torch.float32), self.weight.to(dtype=torch.float32)) * self.scales).to(dtype=input.dtype)

and everything looks good.
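The precision loss in that forward can be reproduced without a GPU. The sketch below is a hypothetical simulation (the numbers are made up, not from gpt-fast): it uses Python's `struct` format `'e'` to round values to binary16 (fp16), and contrasts accumulating a long dot product entirely in fp16 with computing in float32 and rounding only the final result, which is what the fix does:

```python
import struct

def fp16(x: float) -> float:
    """Round x to the nearest IEEE binary16 value via struct format 'e'."""
    return struct.unpack('e', struct.pack('e', x))[0]

# Hypothetical int8 layer: 4096 weights of value 2, activation 0.5,
# so every product is exactly 1.0 and the exact dot product is 4096.
n, w, a, scale = 4096, 2, 0.5, 0.001
exact = n * a * w * scale  # 4.096

# fp16 path, like F.linear(input, self.weight.to(input.dtype)) * self.scales
# with a float16 input: the accumulator saturates, because 2048.0 + 1.0
# rounds back to 2048.0 in binary16.
acc = 0.0
for _ in range(n):
    acc = fp16(acc + fp16(a * w))
half_result = fp16(acc * fp16(scale))

# float32 path, like the fix: compute in full precision, round only
# the final output back to the input dtype.
fixed_result = fp16(n * a * w * scale)

print(half_result, fixed_result, exact)  # half_result ends up near half of exact
```

On real weights the effect is subtler than this saturating toy case, but the direction is the same: every intermediate rounding in fp16 discards information that the float32 path keeps.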
OK. Thank you very much! @VendaCino |
root@md:/home/projects/gpt-fast# CUDA_VISIBLE_DEVICES=0 python3 generate.py --compile --checkpoint_path /models/huggingface_models/meta-Llama-2-7b-hf/model_int8.pth --max_new_tokens 100
Loading model ...
Using int8 weight-only quantization!
/opt/conda/lib/python3.10/site-packages/torch/_utils.py:831: UserWarning: TypedStorage is deprecated. It will be removed in the future and UntypedStorage will be the only storage class. This should only matter to you if you are using storages directly. To access UntypedStorage directly, use tensor.untyped_storage() instead of tensor.storage()
return self.fget.__get__(instance, owner)()
Time to load model: 2.33 seconds
Traceback (most recent call last):
File "/home/projects/gpt-fast/generate.py", line 407, in <module>
main(
File "/home/projects/gpt-fast/generate.py", line 346, in main
y, metrics = generate(
File "/opt/conda/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "/home/projects/gpt-fast/generate.py", line 167, in generate
next_token = prefill(model, prompt.view(1, -1), input_pos, **sampling_kwargs)
File "/home/projects/gpt-fast/generate.py", line 52, in prefill
logits = model(x, input_pos)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/gpt-fast/model.py", line 118, in forward
x = layer(x, input_pos, freqs_cis, mask)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/gpt-fast/model.py", line 137, in forward
h = x + self.attention(self.attention_norm(x), freqs_cis, mask, input_pos)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/opt/conda/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/projects/gpt-fast/model.py", line 186, in forward
y = F.scaled_dot_product_attention(q, k, v, attn_mask=mask, dropout_p=0.0)
RuntimeError: cutlassF: no kernel found to launch!
GPU: NVIDIA V100
conda list:
_libgcc_mutex 0.1 main
_openmp_mutex 5.1 1_gnu
asttokens 2.0.5 pyhd3eb1b0_0
astunparse 1.6.3 pypi_0 pypi
attrs 23.1.0 pypi_0 pypi
backcall 0.2.0 pyhd3eb1b0_0
beautifulsoup4 4.12.2 py310h06a4308_0
blas 1.0 mkl
boltons 23.0.0 py310h06a4308_0
brotlipy 0.7.0 py310h7f8727e_1002
bzip2 1.0.8 h7b6447c_0
c-ares 1.19.0 h5eee18b_0
ca-certificates 2023.08.22 h06a4308_0
certifi 2023.7.22 py310h06a4308_0
cffi 1.15.1 py310h5eee18b_3
chardet 4.0.0 py310h06a4308_1003
charset-normalizer 2.0.4 pyhd3eb1b0_0
click 8.0.4 py310h06a4308_0
cmake 3.26.4 h96355d8_0
conda 23.9.0 py310h06a4308_0
conda-build 3.27.0 py310h06a4308_0
conda-content-trust 0.2.0 py310h06a4308_0
conda-index 0.3.0 py310h06a4308_0
conda-libmamba-solver 23.7.0 py310h06a4308_0
conda-package-handling 2.2.0 py310h06a4308_0
conda-package-streaming 0.9.0 py310h06a4308_0
cryptography 41.0.3 py310hdda0065_0
cuda-cudart 11.8.89 0 nvidia
cuda-cupti 11.8.87 0 nvidia
cuda-libraries 11.8.0 0 nvidia
cuda-nvrtc 11.8.89 0 nvidia
cuda-nvtx 11.8.86 0 nvidia
cuda-runtime 11.8.0 0 nvidia
decorator 5.1.1 pyhd3eb1b0_0
dnspython 2.4.2 pypi_0 pypi
exceptiongroup 1.0.4 py310h06a4308_0
executing 0.8.3 pyhd3eb1b0_0
expat 2.5.0 h6a678d5_0
expecttest 0.1.6 pypi_0 pypi
ffmpeg 4.3 hf484d3e_0 pytorch
filelock 3.9.0 py310h06a4308_0
fmt 9.1.0 hdb19cb5_0
freetype 2.12.1 h4a9f257_0
fsspec 2023.9.2 pypi_0 pypi
giflib 5.2.1 h5eee18b_3
gmp 6.2.1 h295c915_3
gmpy2 2.1.2 py310heeb90bb_0
gnutls 3.6.15 he1e5248_0
hypothesis 6.87.1 pypi_0 pypi
icu 58.2 he6710b0_3
idna 3.4 py310h06a4308_0
intel-openmp 2023.1.0 hdb19cb5_46305
ipython 8.15.0 py310h06a4308_0
jedi 0.18.1 py310h06a4308_1
jinja2 3.1.2 py310h06a4308_0
jpeg 9e h5eee18b_1
jsonpatch 1.32 pyhd3eb1b0_0
jsonpointer 2.1 pyhd3eb1b0_0
krb5 1.20.1 h143b758_1
lame 3.100 h7b6447c_0
lcms2 2.12 h3be6417_0
ld_impl_linux-64 2.38 h1181459_1
lerc 3.0 h295c915_0
libarchive 3.6.2 h6ac8c49_2
libcublas 11.11.3.6 0 nvidia
libcufft 10.9.0.58 0 nvidia
libcufile 1.7.2.10 0 nvidia
libcurand 10.3.3.141 0 nvidia
libcurl 8.1.1 h251f7ec_1
libcusolver 11.4.1.48 0 nvidia
libcusparse 11.7.5.86 0 nvidia
libdeflate 1.17 h5eee18b_1
libedit 3.1.20221030 h5eee18b_0
libev 4.33 h7f8727e_1
libffi 3.4.4 h6a678d5_0
libgcc-ng 11.2.0 h1234567_1
libgomp 11.2.0 h1234567_1
libiconv 1.16 h7f8727e_2
libidn2 2.3.4 h5eee18b_0
libjpeg-turbo 2.0.0 h9bf148f_0 pytorch
liblief 0.12.3 h6a678d5_0
libmamba 1.4.1 h2dafd23_1
libmambapy 1.4.1 py310h2dafd23_1
libnghttp2 1.52.0 h2d74bed_1
libnpp 11.8.0.86 0 nvidia
libnvjpeg 11.9.0.86 0 nvidia
libpng 1.6.39 h5eee18b_0
libsolv 0.7.22 he621ea3_0
libssh2 1.10.0 hdbd6064_2
libstdcxx-ng 11.2.0 h1234567_1
libtasn1 4.19.0 h5eee18b_0
libtiff 4.5.1 h6a678d5_0
libunistring 0.9.10 h27cfd23_0
libuuid 1.41.5 h5eee18b_0
libuv 1.44.2 h5eee18b_0
libwebp 1.3.2 h11a3e52_0
libwebp-base 1.3.2 h5eee18b_0
libxml2 2.10.3 hcbfbd50_0
llvm-openmp 14.0.6 h9e868ea_0
lz4-c 1.9.4 h6a678d5_0
markupsafe 2.1.1 py310h7f8727e_0
matplotlib-inline 0.1.6 py310h06a4308_0
mkl 2023.1.0 h213fc3f_46343
mkl-service 2.4.0 py310h5eee18b_1
mkl_fft 1.3.8 py310h5eee18b_0
mkl_random 1.2.4 py310hdb19cb5_0
more-itertools 8.12.0 pyhd3eb1b0_0
mpc 1.1.0 h10f8cd9_1
mpfr 4.0.2 hb69a4c5_1
mpmath 1.3.0 py310h06a4308_0
ncurses 6.4 h6a678d5_0
nettle 3.7.3 hbbd107a_1
networkx 3.1 py310h06a4308_0
numpy 1.26.0 py310h5f9d8c6_0
numpy-base 1.26.0 py310hb5e798b_0
openh264 2.1.1 h4ff587b_0
openssl 3.0.11 h7f8727e_2
packaging 23.1 py310h06a4308_0
parso 0.8.3 pyhd3eb1b0_0
patch 2.7.6 h7b6447c_1001
patchelf 0.17.2 h6a678d5_0
pcre2 10.37 he7ceb23_1
pexpect 4.8.0 pyhd3eb1b0_3
pickleshare 0.7.5 pyhd3eb1b0_1003
pillow 9.4.0 py310h6a678d5_1
pip 23.2.1 py310h06a4308_0
pkginfo 1.9.6 py310h06a4308_0
pluggy 1.0.0 py310h06a4308_1
prompt-toolkit 3.0.36 py310h06a4308_0
psutil 5.9.0 py310h5eee18b_0
ptyprocess 0.7.0 pyhd3eb1b0_2
pure_eval 0.2.2 pyhd3eb1b0_0
py-lief 0.12.3 py310h6a678d5_0
pybind11-abi 4 hd3eb1b0_1
pycosat 0.6.4 py310h5eee18b_0
pycparser 2.21 pyhd3eb1b0_0
pygments 2.15.1 py310h06a4308_1
pyopenssl 23.2.0 py310h06a4308_0
pysocks 1.7.1 py310h06a4308_0
python 3.10.13 h955ad1f_0
python-etcd 0.4.5 pypi_0 pypi
python-libarchive-c 2.9 pyhd3eb1b0_1
pytorch 2.1.0 py3.10_cuda11.8_cudnn8.7.0_0 pytorch
pytorch-cuda 11.8 h7e8668a_5 pytorch
pytorch-mutex 1.0 cuda pytorch
pytz 2023.3.post1 py310h06a4308_0
pyyaml 6.0 py310h5eee18b_1
readline 8.2 h5eee18b_0
reproc 14.2.4 h295c915_1
reproc-cpp 14.2.4 h295c915_1
requests 2.31.0 py310h06a4308_0
rhash 1.4.3 hdbd6064_0
ruamel.yaml 0.17.21 py310h5eee18b_0
ruamel.yaml.clib 0.2.6 py310h5eee18b_1
sentencepiece 0.1.99 pypi_0 pypi
setuptools 68.0.0 py310h06a4308_0
six 1.16.0 pyhd3eb1b0_1
sortedcontainers 2.4.0 pypi_0 pypi
soupsieve 2.5 py310h06a4308_0
sqlite 3.41.2 h5eee18b_0
stack_data 0.2.0 pyhd3eb1b0_0
sympy 1.12 pypi_0 pypi
tbb 2021.8.0 hdb19cb5_0
tk 8.6.12 h1ccaba5_0
tomli 2.0.1 py310h06a4308_0
toolz 0.12.0 py310h06a4308_0
torchaudio 2.1.0 py310_cu118 pytorch
torchelastic 0.2.2 pypi_0 pypi
torchtriton 2.1.0 py310 pytorch
torchvision 0.16.0 py310_cu118 pytorch
tqdm 4.65.0 py310h2f386ee_0
traitlets 5.7.1 py310h06a4308_0
truststore 0.8.0 py310h06a4308_0
types-dataclasses 0.6.6 pypi_0 pypi
typing-extensions 4.8.0 pypi_0 pypi
typing_extensions 4.7.1 py310h06a4308_0
tzdata 2023c h04d1e81_0
urllib3 1.26.16 py310h06a4308_0
wcwidth 0.2.5 pyhd3eb1b0_0
wheel 0.41.2 py310h06a4308_0
xz 5.4.2 h5eee18b_0
yaml 0.2.5 h7b6447c_0
yaml-cpp 0.7.0 h295c915_1
zlib 1.2.13 h5eee18b_0
zstandard 0.19.0 py310h5eee18b_0
zstd 1.5.5 hc292b87_0