Will it work with Nvidia P40 24GB on Linux? #27
That would turn the q4 weights into half types without converting them first, so that definitely wouldn't work.

I'm planning to do a lot more work on support for the P40 specifically. It's a very attractive card for the obvious reasons, if it can be made to perform well. I don't have one to develop on, but the error message you're getting suggests that CUDA is providing its own atomicAdd operation for half2 types, while the extension also tries to provide its own for compatibility. Try commenting out the line flagged in your build log (the atomicAdd overload for half2 in cuda_compat.cuh). You may also have to delete the cached extension build directory so the extension gets recompiled.
Your fix resolved the reported compile issue, thank you for the quick and easy fix. But running the benchmark now hits a new error (an internal SentencePiece error):

python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
That's a new one. An internal error in SentencePiece would suggest either a corrupted tokenizer.model or perhaps the wrong version of SentencePiece installed? I'm using 0.1.97, if that helps.
I installed your sources in a fresh virtualenv using your scripts. The Vicuna model was downloaded from HF.
I can't think of anything else at the moment, really. That, or try a different model, or try downloading the tokenizer.model file again.
You were right again :) I tested the chatbot.
PyTorch waits in a busy loop whenever it synchronizes a CUDA stream, as far as I can tell. With a 13900K the CPU should easily be able to keep up with the P40, since my 12900K can keep up with a 4090. So while there are clearly still CPU bottlenecks that people with slower CPUs are running into (working on that), the CPU usage you're seeing is "normal." As for the speed on the P40, that obviously needs some work. But all the CUDA stuff is being rewritten as we speak, and there will be a bunch more tuning options soon, probably with some alternative code paths better suited to older GPUs.
I see. I'll be watching your git for updates and will try again when it has something for older GPUs.
Having read up on it a bit, good performance on the P40 might be a ways off, unfortunately. Apparently its FP16 performance is 1/64 of its FP32 performance. I guess it's emulated in the driver or something. So I don't know how much I can do to address that, other than provide an alternative FP32 path, which would limit the context length somewhat. And that's a big rewrite.
Just tested, and perf isn't good: 1.x it/s with no context. Maybe there is a way to just stop doing matmul in FP16? For reference, this is how AutoGPTQ does it in float: https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear_old.py. I run it with use_cuda_fp16=False.
Yep, it converts everything to FP32 on the fly. It's hard to get to 160 tokens/second that way, and hard to run a 30B model at full context length when the state takes up twice as much space. But I have some ideas to try out once I find a convenient way to test on a Pascal card.
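For a rough sense of scale (back-of-the-envelope, assuming the published LLaMA-30B shape of 60 layers and hidden size 6656, and a 2048-token context): the key/value cache alone is 2 × 60 × 2048 × 6656 ≈ 1.64 billion elements, i.e. roughly 3.3 GB in FP16 versus roughly 6.5 GB in FP32, on top of the quantized weights.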
I've got 2 of them, so if you need anything tested I can run it. I wish I was getting 160 t/s, but for some reason I'm not on the 3090s. I think that's a whole separate issue. Not sure what's doing it, whether it's my Xeon v4, the PCIe 3 system, or something up with my environment. I'm only getting 27 it/s on the 7B, so it has to be something.
I've got an extra P40 I can send you if you think it'll help you crank up the performance on these things.
I could do the same with an MI25 or MI60, if that was something you wanted.
Bitsandbytes perf for the P40 is not good either. About 1/2 speed as well.
Just tested with nohalf2, if I did it right, and it definitely went up on the P6000. This is the 7B though. Any reason not to make that change in cuda_compat within the repo? Does it do something required for Volta?
I think I'd need to know for sure exactly when half2 support is provided by CUDA and when it isn't, because there's still a half2 path that needs to compile even if it's never called. Unless it's …
In that case it's because of the compute capability you check for. From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd: the 32-bit __half2 version of atomicAdd() is only supported by devices of compute capability 6.x and higher, while the 16-bit __half version requires compute capability 7.x and higher.
Pascal is compute 6.1. Not sure how Maxwell fares on this repo; I don't think anyone has tried it yet. Pascal doesn't have an atomicAdd for half though, unless you write the function for it yourself.
I dug out an old 1080 Ti (Pascal) the other day to do some comparisons vs a Radeon VII (GCN 5.1, gfx906) and can confirm that @ardfork's #ifdef change worked for the 1080 Ti. (I saw there's some ongoing work hipifying for ROCm; happy to run a test. The old Radeon card runs llama.cpp with both CLBlast and hipBLAS, so I'll be curious to see how exllama compares.)
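For anyone following along, here is a minimal sketch of the kind of compatibility guard being discussed. It is not the repo's actual cuda_compat.cuh; the names just mirror the shim quoted in the build log, and it only illustrates gating the emulated overload on __CUDA_ARCH__ so it exists solely where CUDA does not already provide atomicAdd for half2 (per the compute capability thresholds quoted above):

```cuda
#include <cuda_fp16.h>

// Sketch of a guarded atomicAdd shim for half2 (illustrative only).
// CUDA provides atomicAdd(half2*, half2) natively on compute capability 6.x+,
// so the emulated version should only be compiled below that; otherwise the
// compiler sees two overloads that differ only by return type.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 600

__device__ __forceinline__ void atomicAdd_half2(half2* address, half2 val)
{
    // Emulate with a 32-bit compare-and-swap over the packed half2 value.
    unsigned int* addr_as_ui = reinterpret_cast<unsigned int*>(address);
    unsigned int old = *addr_as_ui;
    unsigned int assumed;
    do
    {
        assumed = old;
        half2 cur = *reinterpret_cast<half2*>(&assumed);
        // Add in FP32 via conversion intrinsics, which exist on all archs.
        float2 a = __half22float2(cur);
        float2 b = __half22float2(val);
        half2 sum = __floats2half2_rn(a.x + b.x, a.y + b.y);
        old = atomicCAS(addr_as_ui, assumed, *reinterpret_cast<unsigned int*>(&sum));
    }
    while (assumed != old);
}

__device__ __forceinline__ void atomicAdd(half2* address, half2 val)
{
    atomicAdd_half2(address, val);
}

#endif
```

On sm_61 cards like the P40 this guard leaves CUDA's built-in half2 overload in place, which is what avoids the "cannot overload functions distinguished by return type alone" error shown in the build log.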
Supposedly there is a way to pack two half2 ops into a single FP32 operation and gain a speedup, but I'm not sure if that only applies to the P100 or if it also works on the P40. Who knows where those people's repo went by now, since it's so old, or whether their numbers hold up.
Half precision would still require 2 bytes per weight, limiting model selection to 3B or 7B. 13B would not fit into 24GB, right?
I thought about int8 as well, but int8 is missing hardware matmul support.
@Ph0rk0z That doesn't seem accurate as far as I can tell, though maybe I'm misunderstanding. If so, feel free to brush me off; I'm struggling to trace this back to experience I last used years ago.

Below is one of the references I came across when digging, a few weeks ago when I had some free time. I started to implement something to show how we could use it, but frankly got lost about a quarter of the way in. It's been over a decade since I wrote C code in earnest, and it wasn't heavily math-based. I did get far enough to compare the speed of the matmul operation when using it, and it's definitely faster than FP16 or FP32, roughly 4 times faster than equivalent FP32 operations if my timings bore out correctly.

Basically, in CUDA land at least, it seems to boil down to leveraging __dp4a (or __dp2a, if we have any int16 as input) calls. These appear to be very fast on the P40 (I have 2 of the cards in my current test bed). I'm presuming this might be some standard function call, since it doesn't appear only in the CUDA-specific documentation; __dp4a is also referenced in Intel and AMD documentation, though I didn't dig far enough into that side to say exactly how.

This helps quite a bit regarding use of the dp4a and dp2a functions: https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/
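To make that concrete, here is a small, hypothetical sketch (the kernel name and packed layout are made up for illustration; this is not exllama's or AutoGPTQ's kernel) of how __dp4a can accumulate an int8 row-times-vector product into 32-bit integers, which is the building block a full matmul would be assembled from:

```cuda
#include <cuda_runtime.h>

// Illustrative int8 matrix-vector product using __dp4a.
// __dp4a(a, b, c) treats a and b as four packed signed 8-bit values,
// multiplies them lane-wise, and adds the sum to the 32-bit accumulator c.
// Requires compute capability 6.1+ (e.g. the P40).
__global__ void matvec_int8_dp4a(const int* __restrict__ mat,  // rows x cols_packed, 4 int8s per int
                                 const int* __restrict__ vec,  // cols_packed, 4 int8s per int
                                 int* __restrict__ out,        // rows
                                 int rows,
                                 int cols_packed)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    if (row >= rows) return;

    int acc = 0;
    for (int i = 0; i < cols_packed; ++i)
        acc = __dp4a(mat[row * cols_packed + i], vec[i], acc);

    out[row] = acc;  // a real kernel would still apply scales/zero points here
}
```

Turning this into a proper matmul for the quantized weights would still mean tiling over a second matrix and folding in the GPTQ scales and zero points, which is where the concerns raised below come in.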
Sounds like you got further than me; I am pretty rusty on the math here. This computes the dot product, but what about the matrix product? I thought they were different, and also used in stuff like bitsandbytes for some of the operations. Extrapolating from this person's post: https://forums.developer.nvidia.com/t/dp4a-instruction-usage-in-pascal-architecture-gpus/53309/6 we would need to write a matmul function using __dp4a, do what was done with atomicAdd, and then see whether it's faster or not. So the original problem remains.
The FP16 problem remains, but INT8 would present problems of its own. It's an integer type, after all, not a drop-in replacement for floats.
I'm developing an AI assistant for fiction writing. As the OpenAI API gets pretty expensive with all the inference tricks needed, I'm looking for a good local alternative for most of the inference, saving GPT-4 just for polishing the final results.
exllama looks pretty interesting, but I'm getting a compilation error.
Even though I'm a software developer in addition to being a fiction writer, I'm far from being an AI expert.
Would it be correct to assume from the lines below that the P40 is not supported currently?
-D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__
Maybe it was a silly try, but self.weight = tensors[key].half() did not work.
If the P40 will not work with exllama, could somebody advise whether oobabooga/GPTQ-for-LLaMa would work?
If not CUDA, maybe there are good options for an i9-13900K with 128GB of DDR5?
The full traceback:
python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 1, in
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/model.py", line 5, in
import cuda_ext
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/cuda_ext.py", line 14, in
exllama_ext = load(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1283, in load
return jit_compile(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in jit_compile
write_ninja_file_and_build_library(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in write_ninja_file_and_build_library
run_ninja_build(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
FAILED: q4v2_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^
1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu".
[2/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^
1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu".
ninja: build stopped: subcommand failed.