Generic packing algorithms from size N to M #284
Comments
This is quite cool and I've been thinking along similar lines. I think this could be what we ship, and it can be a baseline for smaller dtypes. I'd be specific somewhere in the function names or docs that this is padding-based, because conceptually I can imagine another alternative where, instead of wasting space, you could pack 8 uint3 into 3 uint8 as a more general algorithm, but that's finicky enough that we don't have to worry about it right now.
Also, @mobicham had been asking us about standardizing bitpacking logic, so I'm curious about his thoughts too.
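For reference, a rough sketch of that denser scheme (packing 8 uint3 values into 3 uint8 with no wasted bits); the function names are made up for illustration and are not part of this PR:

```python
import torch

def pack_uint3_dense(x: torch.Tensor) -> torch.Tensor:
    # x holds values in [0, 8) and its last dim is a multiple of 8.
    # Each group of 8 three-bit values fills exactly 24 bits = 3 bytes, so no bits are wasted.
    groups = x.reshape(*x.shape[:-1], -1, 8).to(torch.int32)
    shifts = torch.arange(0, 24, 3, dtype=torch.int32, device=x.device)
    words = (groups << shifts).sum(dim=-1)  # bit fields are disjoint, so sum == bitwise OR
    packed = torch.stack([(words >> s) & 0xFF for s in (0, 8, 16)], dim=-1)
    return packed.to(torch.uint8).reshape(*x.shape[:-1], -1)  # 3 bytes per 8 inputs

def unpack_uint3_dense(packed: torch.Tensor) -> torch.Tensor:
    b = packed.reshape(*packed.shape[:-1], -1, 3).to(torch.int32)
    words = b[..., 0] | (b[..., 1] << 8) | (b[..., 2] << 16)
    shifts = torch.arange(0, 24, 3, dtype=torch.int32, device=packed.device)
    return ((words.unsqueeze(-1) >> shifts) & 0x7).reshape(*packed.shape[:-1], -1).to(torch.uint8)

x = torch.randint(0, 8, (4, 16), dtype=torch.uint8)
assert torch.equal(unpack_uint3_dense(pack_uint3_dense(x)), x)
```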
Thanks @vayuda, very interesting, thanks for sharing! Normally, bit-unpacking is almost never used in isolation; it's either fused into a dequant kernel or a low-bit matmul kernel. There are two main things to consider when designing bitpacking logic:
@msaroufim do you know by any chance what kind of bitpacking logic is used in tiny_gemm?
@mobicham Thanks for the input. The interleaved accessing is interesting, though I'm not really sure what it means to fully take advantage of tensor cores. I think this is something we can iterate on. For now I can create a version that does row-wise pack/unpack. As per @msaroufim's suggestions, I will place these functions in the API file and write appropriate tests.
Even in relative isolation (without op support), bit packing/unpacking is still useful for reducing memory footprint when storing bool tensors / masks / bitsets. Of course, more op support is needed for compressed bool tensors / bit tensors / bitsets as well. (Similarly, for some other use cases it is still useful even when packing/unpacking is not fused into ops: where the bottleneck is actually memory efficiency, the speed overhead can be tolerated.)
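As a concrete illustration of that use case (the helper names below are hypothetical, not an existing API), a bool mask can be stored at one bit per element instead of one byte:

```python
import torch

def pack_bool_mask(mask: torch.Tensor) -> torch.Tensor:
    # mask is a bool tensor whose last dim is a multiple of 8; the result is ~8x smaller.
    bits = mask.to(torch.uint8).reshape(*mask.shape[:-1], -1, 8)
    packed = torch.zeros(bits.shape[:-1], dtype=torch.uint8, device=mask.device)
    for i in range(8):
        packed |= bits[..., i] << i
    return packed

def unpack_bool_mask(packed: torch.Tensor) -> torch.Tensor:
    bits = [(packed >> i) & 1 for i in range(8)]
    return torch.stack(bits, dim=-1).reshape(*packed.shape[:-1], -1).bool()

m = torch.rand(4, 64) > 0.5
assert torch.equal(unpack_bool_mask(pack_bool_mask(m)), m)
```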
To support sub-byte dtypes for quantization, I (and many others) believe it is better to pack these smaller dtypes into existing PyTorch dtypes, trading a bit of extra computation for reduced memory bandwidth contention. Here is a preliminary algorithm in PyTorch for doing this. It supports many types of conversions, as seen in the tests.
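Since the code itself isn't shown above, here is a minimal sketch of the padding-based idea, assuming the element width divides 8; `pack_uint8`/`unpack_uint8` are placeholder names, not the actual functions from this PR:

```python
import torch

def pack_uint8(x: torch.Tensor, nbits: int) -> torch.Tensor:
    # Row-wise, padding-based packing: each nbits-wide value gets a fixed slot in a uint8,
    # so 8 // nbits elements share one byte (nbits must be 1, 2, or 4 in this sketch).
    assert 8 % nbits == 0
    elems_per_byte = 8 // nbits
    assert x.shape[-1] % elems_per_byte == 0, "a real implementation would pad the last dim"
    vals = x.to(torch.uint8).reshape(*x.shape[:-1], -1, elems_per_byte)
    packed = torch.zeros(vals.shape[:-1], dtype=torch.uint8, device=x.device)
    for i in range(elems_per_byte):
        packed |= vals[..., i] << (i * nbits)
    return packed

def unpack_uint8(packed: torch.Tensor, nbits: int) -> torch.Tensor:
    elems_per_byte = 8 // nbits
    mask = (1 << nbits) - 1
    slots = [(packed >> (i * nbits)) & mask for i in range(elems_per_byte)]
    return torch.stack(slots, dim=-1).reshape(*packed.shape[:-1], -1)

x = torch.randint(0, 16, (4, 8), dtype=torch.uint8)  # 4-bit values stored in uint8
assert torch.equal(unpack_uint8(pack_uint8(x, nbits=4), nbits=4), x)
```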
Inspecting the compiled Triton code seems promising: it launches only one kernel and allocates only one buffer. Here is a snippet:
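The snippet itself isn't reproduced above; for anyone who wants to do the same inspection, one way to dump Inductor's generated Triton code (assuming a recent PyTorch with torch.compile and a CUDA device; `toy_unpack` is just an example function, not code from this PR):

```python
# Run with:  TORCH_LOGS="output_code" python inspect_packing.py
# Inductor then logs the Triton kernels it generates; the exact output varies by PyTorch version.
import torch

@torch.compile
def toy_unpack(packed):
    # Toy stand-in for a real unpack: pull two 4-bit values out of each byte.
    lo = packed & 0xF
    hi = packed >> 4
    return torch.stack([lo, hi], dim=-1).reshape(packed.shape[0], -1)

toy_unpack(torch.randint(0, 256, (64, 64), dtype=torch.uint8, device="cuda"))
```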