feat(quant): add observer #189
Conversation
feat(project): add quant
I'm currently traveling so I can't check in detail, but this looks very interesting. I think this could be more effective than 3 bits. When the code is complete, please let me know the results. I will merge this code.
Please run the llama benchmarks, because the current inference code was written assuming that all layers use the same bits and groupsize.
@qwopqwop200 please review.
Thanks for the great work. First of all, maybe I can implement a different wbits and groupsize for each layer. However, I currently don't have the environment to test this out. I think we can make this possible in maybe 3 days. And I'm not familiar with cpp, so I can't combine them.
No, to be precise, I'm on a trip right now, so I can't experiment. The trip ends soon, so I'll experiment from then on.
@qwopqwop200 I think this might have the potential to help with t5 model quantization as well. (It would be good to be able to disable quant on select layers too.)
Life is short, work is endless, so enjoy the trip!
@tpoisonooo |
Features
1. Add `--observer` to re-quantize any layer that has a big error. For example, with `<groupsize=128, wbits=4>`, it keeps trying until the error drops by 50% (a rough sketch of the idea follows below).

   I have spent 2 days trying to use a different `<groupsize, wbits>` for each layer, but in vain; it is really hard to refactor an unfamiliar repo. If `--observe` is enabled, no `tensor` or `safe_tensor` would be saved.

2. Add `--quant-directory` to export the quant table in toml+numpy format (see the second sketch below).
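A minimal sketch of what the `--observer` retry loop could look like. Note that `quantize_fn` and the shrinking-groupsize retry policy are hypothetical placeholders for illustration, not this PR's actual code:

```python
def observe_and_requantize(layer, quantize_fn, wbits=4, groupsize=128):
    """Re-quantize `layer` until the error drops to 50% of the first attempt.

    `quantize_fn(layer, wbits, groupsize) -> (qlayer, error)` is a hypothetical
    stand-in for the repo's real GPTQ quantization call.
    """
    qlayer, base_error = quantize_fn(layer, wbits, groupsize)
    target = base_error * 0.5
    # Illustrative retry policy: shrink the group size until the target is met.
    for gs in (64, 32, 16):
        candidate, error = quantize_fn(layer, wbits, gs)
        if error <= target:
            return candidate, error
    # No attempt reached the target; keep the original quantization.
    return qlayer, base_error
```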
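And a hedged sketch of the toml+numpy export: arrays go to `.npy` files while scalar settings go into a toml index. The `export_quant_table` helper, the table structure, and the on-disk layout are assumptions for illustration (using the third-party `toml` package), not necessarily the PR's exact format:

```python
import os
import numpy as np
import toml

def export_quant_table(table, quant_dir):
    """Write {name: {"scale": ndarray, "zero": ndarray, "wbits": int,
    "groupsize": int}} as per-layer .npy files plus a quant.toml index."""
    os.makedirs(quant_dir, exist_ok=True)
    meta = {}
    for name, entry in table.items():
        scale_path = f"{name}.scale.npy"
        zero_path = f"{name}.zero.npy"
        np.save(os.path.join(quant_dir, scale_path), entry["scale"])
        np.save(os.path.join(quant_dir, zero_path), entry["zero"])
        meta[name] = {
            "wbits": entry["wbits"],
            "groupsize": entry["groupsize"],
            "scale": scale_path,
            "zero": zero_path,
        }
    with open(os.path.join(quant_dir, "quant.toml"), "w") as f:
        toml.dump(meta, f)
```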
In `matmul_kernel`, `a_ptr` is fp16 and `b_ptr` is integer, which is bad for SIMD instructions (such as ARM/x86); see the illustration below. While llama.py supports multiple quantization methods, it is hard to implement them in cpp code (llama.cpp/ncnn, for example). Why not combine them?
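A rough numpy illustration of the mixed-dtype problem: because the two operands have different element types, the integer weights must be widened and dequantized before every multiply, which breaks uniform-lane SIMD on CPU. Shapes and scale/zero values here are made up:

```python
import numpy as np

a = np.random.rand(4, 128).astype(np.float16)              # fp16 activations
b_q = np.random.randint(0, 16, (128, 64), dtype=np.int8)   # int4 weights, stored in int8
scale, zero = np.float16(0.01), np.int8(8)

# The integer operand has to be converted to fp16 element by element before the
# matmul; a uniform-type kernel could feed both operands straight to SIMD lanes.
b = ((b_q - zero) * scale).astype(np.float16)
c = a @ b
```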
Typo

Fix `import *` in the `utils` module.

Others
Use `texttable` for better display, like this:
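For reference, a tiny `texttable` usage sketch (the rows shown are made-up values, not this PR's actual output):

```python
from texttable import Texttable

# Build a small ASCII table of per-layer quantization errors (illustrative data).
table = Texttable()
table.header(["layer", "error"])
table.add_row(["self_attn.q_proj", 0.42])
table.add_row(["mlp.down_proj", 1.37])
print(table.draw())
```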
Tests

These commands were tested: