
will it work with Nvidia P40 24GB on Linux? #27

Open
waan1 opened this issue Jun 3, 2023 · 29 comments

@waan1 commented Jun 3, 2023

I'm developing an AI assistant for fiction writers. As the OpenAI API gets pretty expensive with all the inference tricks needed, I'm looking for a good local alternative for most of the inference, saving GPT-4 just for polishing the final results.
exllama looks pretty interesting, but I'm getting a compilation error.
Even though I'm a software developer in addition to being a fiction writer, I'm far from an AI expert.
Would it be correct to assume from the lines below that the P40 is not currently supported?
-D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__

Maybe it was a silly try, but self.weight = tensors[key].half() did not work.

If the P40 will not work with exllama, could somebody advise whether oobabooga/GPTQ-for-LLaMa would work?
If not CUDA, maybe there are good options for an i9-13900K with 128 GB of DDR5?

The full Traceback:
python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1900, in _run_ninja_build
subprocess.run(
File "/usr/lib/python3.10/subprocess.py", line 526, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['ninja', '-v']' returned non-zero exit status 1.

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 1, in
from model import ExLlama, ExLlamaCache, ExLlamaConfig
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/model.py", line 5, in
import cuda_ext
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/cuda_ext.py", line 14, in
exllama_ext = load(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1283, in load
return jit_compile(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1508, in jit_compile
write_ninja_file_and_build_library(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1623, in write_ninja_file_and_build_library
run_ninja_build(
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/utils/cpp_extension.py", line 1916, in run_ninja_build
raise RuntimeError(message) from e
RuntimeError: Error building extension 'exllama_ext': [1/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
FAILED: q4v2_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu -o q4v2_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^

1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/q4v2_matmul.cu".
[2/3] /opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
FAILED: half_matmul.cuda.o
/opt/cuda/bin/nvcc -DTORCH_EXTENSION_NAME=exllama_ext -DTORCH_API_INCLUDE_EXTENSION_H -DPYBIND11_COMPILER_TYPE="_gcc" -DPYBIND11_STDLIB="_libstdcpp" -DPYBIND11_BUILD_ABI="_cxxabi1011" -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/torch/csrc/api/include -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/TH -isystem /home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/torch/include/THC -isystem /opt/cuda/include -isystem /usr/include/python3.10 -D_GLIBCXX_USE_CXX11_ABI=0 -D__CUDA_NO_HALF_OPERATORS__ -D__CUDA_NO_HALF_CONVERSIONS__ -D__CUDA_NO_BFLOAT16_CONVERSIONS__ -D__CUDA_NO_HALF2_OPERATORS__ --expt-relaxed-constexpr -gencode=arch=compute_61,code=compute_61 -gencode=arch=compute_61,code=sm_61 --compiler-options '-fPIC' -std=c++17 -c /home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu -o half_matmul.cuda.o
/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/../cuda_compat.cuh(48): error: cannot overload functions distinguished by return type alone
void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
^

1 error detected in the compilation of "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/exllama_ext/cuda_func/half_matmul.cu".
ninja: build stopped: subcommand failed.

@turboderp (Owner)

Maybe it was a silly try, but self.weight = tensors[key].half() did not work.

That would turn the q4 weights into half types without converting them first. So that definitely wouldn't work.

I'm planning to do a lot more work on support for the P40 specifically. It's a very attractive card for the obvious reasons if it can be made to perform well. I don't have one to develop on, but the error message you're getting suggests that CUDA is providing its own atomicAdd operation for half2 types, while the extension also tries to provide its own for compatibility.

Try commenting out the following line in exllama_ext/cuda_func/cuda_compat.cuh:

__device__ __forceinline__ void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }

You may have to also delete the directory ~/.cache/torch_extensions/pyxxx_cuxxx/exllama_ext, since editing .h and .cuh files won't always trigger a rebuild.

@waan1 (Author) commented Jun 3, 2023

Your fix resolved the reported compile issue; thank you for the quick and easy fix.
But now there is a runtime issue. I'm not sure whether it's a showstopper for now. Would it be of any value if I created a new issue?

python test_benchmark_inference.py -d /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ -p -ppl
-- Loading model
-- Tokenizer: /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ/tokenizer.model
-- Model config: /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ/config.json
-- Model: /home/igorm/ai-assistant/agent-city/llm/models/Wizard-Vicuna-13B-Uncensored-GPTQ/Wizard-Vicuna-13B-Uncensored-GPTQ-4bit-128g.compat.no-act-order.safetensors
-- Sequence length: 2048
-- Options: ['attention: switched', 'matmul: switched', 'mlp: switched', 'perf', 'perplexity']
Traceback (most recent call last):
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 168, in
wrapper = timer("Load model", lambda: ModelWrapper(args))
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 72, in timer
ret = func()
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 168, in
wrapper = timer("Load model", lambda: ModelWrapper(args))
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/test_benchmark_inference.py", line 51, in init
self.tokenizer = ExLlamaTokenizer(self.tokenizer_model_path)
File "/home/igorm/ai-assistant/agent-city/llm/exllama/exllama/tokenizer.py", line 10, in init
self.tokenizer = SentencePieceProcessor(model_file = self.path)
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/sentencepiece/init.py", line 447, in Init
self.Load(model_file=model_file, model_proto=model_proto)
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/sentencepiece/init.py", line 905, in Load
return self.LoadFromFile(model_file)
File "/home/igorm/ai-assistant/agent-city/llm/exllama/myenv/lib/python3.10/site-packages/sentencepiece/init.py", line 310, in LoadFromFile
return _sentencepiece.SentencePieceProcessor_LoadFromFile(self, arg)
RuntimeError: Internal: src/sentencepiece_processor.cc(1101) [model_proto->ParseFromArray(serialized.data(), serialized.size())]

@turboderp (Owner)

That's a new one. An internal error in SentencePiece would suggest either you've got a corrupted tokenizer.model or the wrong version of SentencePiece installed perhaps? I'm using 0.1.97, if that helps.

@waan1 (Author) commented Jun 3, 2023

I installed your sources in a fresh virtualenv using your scripts,
and I've got version '0.1.99'.
Do you suggest I uninstall it and install 0.1.97?

@waan1 (Author) commented Jun 3, 2023

Vicuna was downloaded from HF.
I'll try loading a smaller model just in case.

@turboderp (Owner)

I can't think of anything else at the moment, really. That, or try a different model, or try downloading the tokenizer.model file again.

@waan1 (Author) commented Jun 3, 2023

You were right again :)
tokenizer.model was downloaded before I installed git-lfs.
I downloaded it again and it works now.
Thank you very much for the help!
And now we know that exllama is compatible with the P40.
It looks pretty slow, though.
Will play with it.

@waan1 (Author) commented Jun 3, 2023

Tested the chatbot.
One performance core of the CPU (CPU3) is at 100% (i9-13900K); the other 23 cores are idle.
The P40 is at 100%. It took only 9.2 GB of VRAM out of 24 GB.
I'm unclear on how both the CPU and GPU could be saturated at the same time.
Prompt: "what is the capital of (country)" - tried 10 times with different countries.
Responses took from 16 to 25 seconds. Sometimes it was just one word, sometimes it was "the capital of (country) is (city)".

@turboderp (Owner)

I'm unclear of how both CPU and GPU could be saturated at the same time.

PyTorch waits in a busy loop whenever it synchronizes a CUDA stream, as far as I can tell. With a 13900K the CPU should be easily able to keep up with the P40, since my 12900K can keep up with a 4090. So while there are clearly still CPU bottlenecks that people with slower CPUs are running into (working on that), the CPU usage you're seeing is "normal."

As for the speed on the P40, that obviously needs some work. But all the CUDA stuff is being rewritten as we speak, and there will be a bunch more tuning options soon, with probably some alternative code paths that should be better suited for older GPUs.

@waan1 (Author) commented Jun 4, 2023

I see. I'll be watching your repo for updates and will try again when it has something for older GPUs.

@turboderp (Owner)

Having read up on it a bit, good performance on the P40 might be a ways off, unfortunately. Apparently its FP16 performance is 1/64 of its FP32 performance. I guess it's emulated in the driver or something. So I don't know how much I can do to address that, other than provide an alternative FP32 path, which would limit the context length somewhat. And it's a big rewrite.

@Ph0rk0z (Contributor) commented Jun 4, 2023

Just tested and perf isn't good: 1.x it/s with no context. Maybe there is a way to just stop doing the matmul in FP16?

For reference, this is how AutoGPTQ does it in float: https://github.com/PanQiWei/AutoGPTQ/blob/main/auto_gptq/nn_modules/qlinear_old.py

I run it with use_cuda_fp16=false.

@turboderp (Owner)

Yep, it converts everything to FP32 on the fly. It's hard to get to 160 tokens/second that way, and hard to run a 30B model at full context length when the state takes up twice as much space. But I have some ideas to try out, once I find a convenient way to test on a Pascal card.

@Ph0rk0z (Contributor) commented Jun 4, 2023

I've got 2 of them, so if you need anything tested I can run it.

I wish I were getting 160 t/s, but for some reason I'm not on the 3090s. I think that's a whole separate issue. Not sure what's causing it, whether it's my Xeon v4 / PCIe 3 system or something up with my environment. I'm only getting 27 it/s on the 7B, so it has to be something.

@waan1 (Author) commented Jun 5, 2023

Having read up on it a bit, good performance on P40 might be a ways off, unfortunately. Apparently its FP16 performance is 1/64 of its FP32 performance.
Tesla P40
INT8 (TIOP/s): 47.0
FP32 (TFLOP/s): 11.8
Is it possible to use INT8 instead?
https://downloads.dell.com/manuals/all-products/esuprt_solutions_int/esuprt_solutions_int_solutions_resources/high-computing-solution-resources_white-papers13_en-us.pdf
Some people somehow achieve about 4080 performance on P40?
https://www.reddit.com/r/LocalLLaMA/comments/13n8bqh/my_results_using_a_tesla_p40/

@jterry333

I've got an extra P40 I can send you if you think it'll help you crank up the performance on these things

@jmoney7823956789378

I could do the same with an MI25 or MI60, if that was something you wanted.

@Ph0rk0z (Contributor) commented Jun 5, 2023

int8

bitsandbytes perf for the P40 is not good. About 1/2 speed as well.

@Ph0rk0z (Contributor) commented Jun 6, 2023

Just tested with nohalf2; if I did it right, it definitely went up on the P6000. This is the 7B though.

Output generated in 30.07 seconds (3.33 tokens/s, 100 tokens, context 18, seed 1652524018)

Any reason not to just leave

//__device__ __forceinline__ void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }

commented out in cuda_compat within the repo? Does it do something required for Volta?

@turboderp (Owner)

I think I'd need to know for sure exactly when half2 support is provided by CUDA and when it isn't, because there's still a half2 path that needs to compile even if it's never called. Unless it's #ifdef'd out, but then what exactly are the conditions for that?

@ardfork (Contributor) commented Jun 6, 2023

In that case it's because you check for __CUDA_ARCH__ < 700 for both the half and the half2 atomicAdd, when the half2 one should be __CUDA_ARCH__ < 600.

From https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomicadd:

The 32-bit __half2 floating-point version of atomicAdd() is only supported by devices of compute capability 6.x and higher. The atomicity of the __half2 or __nv_bfloat162 add operation is guaranteed separately for each of the two __half or __nv_bfloat16 elements; the entire __half2 or __nv_bfloat162 is not guaranteed to be atomic as a single 32-bit access.
The 16-bit __half floating-point version of atomicAdd() is only supported by devices of compute capability 7.x and higher.
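
Based on that, here's a minimal sketch of how the two overloads in cuda_compat.cuh could be guarded per those capability rules. This is only an illustration: the fallback names atomicAdd_half and atomicAdd_half2 are assumed to exist in the same header, and the exact conditions are obviously up to the repo.

#if defined(__CUDA_ARCH__)
#if __CUDA_ARCH__ < 700
// No native atomicAdd(half*) before compute capability 7.0
__device__ __forceinline__ void atomicAdd(half* address, half val) { atomicAdd_half(address, val); }
#endif
#if __CUDA_ARCH__ < 600
// No native atomicAdd(half2*) before compute capability 6.0
__device__ __forceinline__ void atomicAdd(half2* address, half2 val) { atomicAdd_half2(address, val); }
#endif
#endif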

@Ph0rk0z (Contributor) commented Jun 7, 2023

Pascal is compute 6.1. Not sure how Maxwell fares on this repo; I don't think anyone has tried it yet. Pascal doesn't have an atomicAdd for half though, unless you write the function for it yourself.



#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ < 700 && __CUDA_ARCH__ > 600
// Software atomicAdd for half on Pascal (no native __half atomicAdd below compute 7.0),
// adapted from https://github.com/torch/cutorch/blob/master/lib/THC/THCAtomics.cuh
__device__ __forceinline__ void atomicAddHalf(__half* address, c10::Half val) {
    // Operate on the aligned 32-bit word that contains the target 16-bit half
    unsigned int *address_as_ui = reinterpret_cast<unsigned int *>(reinterpret_cast<char *>(address) - (reinterpret_cast<size_t>(address) & 2));
    unsigned int old = *address_as_ui;
    unsigned int assumed;

    do {
        assumed = old;
        // Extract the 16 bits belonging to our half, add in float, convert back to half bits
        __half_raw hsum;
        hsum.x = reinterpret_cast<size_t>(address) & 2 ? (old >> 16) : (old & 0xffff);
        hsum = __half_raw(__float2half(__half2float(hsum) + static_cast<float>(val)));
        old = reinterpret_cast<size_t>(address) & 2
                 ? (old & 0xffff) | (hsum.x << 16)
                 : (old & 0xffff0000) | hsum.x;
        old = atomicCAS(address_as_ui, assumed, old);

    // Note: uses integer comparison to avoid hang in case of NaN (since NaN != NaN)
    } while (assumed != old);
}
#endif

@lhl (Contributor) commented Jun 8, 2023

I dug out an old 1080 Ti (Pascal) the other day to do some comparisons vs a Radeon VII (GCN 5.1 gfx906) and can confirm that @ardfork's #ifdef change worked for the 1080 Ti.

(I saw there's some ongoing work hipifying for ROCm; happy to run a test. The old Radeon card runs llama.cpp with both CLBlast and hipBLAS, so I'll be curious to see how exllama compares.)

@Ph0rk0z (Contributor) commented Jun 29, 2023

half2-HPEC2017.pdf

Supposedly there is a way to pack two half2 ops into a single FP32 operation and gain a speedup, but I'm not sure if that is accomplished only on the P100 or if it also works on the P40. Who knows where these people's repo went by now, since it's so old, or whether their numbers hold up.

@waan1 (Author) commented Jun 30, 2023

Half precision would still require 2 bytes per weight, limiting model selection to 3B or 7B; a 13B model would not fit into 24 GB, right?
Ideally, packing 8 weights x 4 bits into one fp32 would increase performance by 8 times minus overhead and allow up to 40B GPTQ models.
But I guess it is a lot of work.
The P40 does 4 times more INT8 operations than FP32 (47 TOPS vs 12 TFLOPS). Would it be easier to pack two weights into one int8? Still a lot of work though.
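
For illustration, GPTQ-style 4-bit packing already stores eight weights in one 32-bit word; here's a minimal sketch of unpacking and dequantizing one such word (the names q, scale and zero are hypothetical, this is not exllama's actual kernel code):

// Sketch: unpack eight 4-bit quantized weights from one 32-bit word and dequantize to float.
__device__ __forceinline__ void dequant_8x4bit(unsigned int q, float scale, float zero, float out[8])
{
    #pragma unroll
    for (int i = 0; i < 8; ++i)
    {
        int w = (q >> (4 * i)) & 0x0f;   // i-th 4-bit weight
        out[i] = (w - zero) * scale;     // dequantize
    }
}

The unpacking itself is cheap; the point being debated above is the arithmetic that follows it, which is where the FP16-vs-FP32 (or INT8) choice on the P40 matters.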

@Ph0rk0z (Contributor) commented Jun 30, 2023

I thought about int8 as well, but int8 is missing a hardware matrix multiply.

@drgnfr6 commented Jul 8, 2023

@Ph0rk0z That doesn't seem accurate as far as I can tell, though maybe I'm misunderstanding. If so, feel free to brush me off; I'm struggling to trace this back to experience I last used years ago. Below is one of the references I came across when digging a few weeks ago, when I had some free time. I started to implement something to show how we could use it, but frankly got lost about 1/4 of the way in. It's been over a decade since I wrote C code in earnest, and it wasn't heavily math-based. I did get far enough to compare the speed of the matmul operation when using it, and it's definitely faster than fp16 or fp32: roughly 4 times faster than equivalent fp32 operations if my timings bore out correctly.

Basically, in CUDA land at least, it seems to boil down to leveraging __dp4a (or __dp2a, if we have any int16 as input) calls. These appear to be very fast on the P40 (I have 2 of the cards in my current test bed). I'm presuming this might be some standard function call, as I didn't see it in the CUDA-specific documentation; the __dp4a function is also referenced in Intel and AMD documentation, though I didn't dig far enough into that side to say how exactly.

This helps quite a bit regarding use of the dp4a and dp2a functions: https://developer.nvidia.com/blog/mixed-precision-programming-cuda-8/
I'm not sure why no one uses the call in llama.cpp or exllama or similar; it seems to be perfectly functional, compiles under CUDA Toolkit 12.2, and is quite fast on P40s (I'd guess on others as well, given NVIDIA's specs on int-based ops), but I also couldn't find it in the official docs for the CUDA math API here either: https://docs.nvidia.com/cuda/cuda-math-api/modules.html
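
For what it's worth, a minimal sketch of the intrinsic in use: __dp4a takes two 32-bit words, treats each as four packed int8 lanes, multiplies the lanes pairwise and adds the result to a 32-bit integer accumulator (compute capability 6.1+). The function name dot_int8 and the packed-pointer arguments are illustrative only, not exllama code.

// Sketch: int8 dot product using __dp4a. a_packed / b_packed hold four int8 values per 32-bit word.
__device__ int dot_int8(const int* a_packed, const int* b_packed, int n_packed)
{
    int acc = 0;
    for (int i = 0; i < n_packed; ++i)
        acc = __dp4a(a_packed[i], b_packed[i], acc);  // acc += sum of 4 pairwise int8 products
    return acc;
}

A matmul would compute each output element as a dot product like this, with quantization scales applied to the int32 accumulator afterwards.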

@Ph0rk0z (Contributor) commented Jul 8, 2023

Sounds like you got further than me; I am pretty rusty on the math here. This computes the dot product, but what about the matrix product? I thought they were different, and also used in stuff like bitsandbytes for some of the operations.

Extrapolating from this guy's post: https://forums.developer.nvidia.com/t/dp4a-instruction-usage-in-pascal-architecture-gpus/53309/6

We would need to write a matmul function using dp4a, like what was done with atomicAdd, and then see whether it's faster or not.
But then everything else would also have to be changed from the FP16 exllama currently uses to FP32, because all FP16 ops are slow.

So that original problem remains.

@turboderp (Owner)

The FP16 problem remains, but INT8 would present problems of its own. It's an integer type, after all, not a drop-in replacement for floats.
