
feat(quant): add observer #189

Merged · 12 commits merged from tpoisonooo:add-observer-triton into qwopqwop200:triton on Apr 18, 2023

Conversation

@tpoisonooo (Contributor) commented Apr 17, 2023

Features

1. Add --observe to re-quantize layers that have a large quantization error.

For example, with <groupsize=128, wbits=4> it would try

groupsize=128, wbits=4
groupsize=64, wbits=4
groupsize=32, wbits=4
groupsize=128, wbits=8
groupsize=64, wbits=8
groupsize=32, wbits=8

until the error drops by 50%.

I spent two days trying to use a different <groupsize, wbits> for each layer, but in vain; it is really hard to refactor an unfamiliar repo.
If --observe is enabled, no tensor or safetensor file is saved.
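
A minimal sketch of that fallback search, in the spirit of the description above (the helpers `quantize_layer` and `measure_error` are hypothetical placeholders, not the PR's actual API):

```python
# Hypothetical sketch of the observer's fallback: for a layer flagged with a
# large error, walk through smaller groupsizes and higher wbits until the error
# has dropped by at least 50% relative to the baseline <groupsize=128, wbits=4>.
CANDIDATES = [(128, 4), (64, 4), (32, 4), (128, 8), (64, 8), (32, 8)]

def requantize_with_observer(layer, quantize_layer, measure_error):
    # quantize_layer(layer, groupsize, wbits) -> quantized layer
    # measure_error(quantized)               -> scalar quantization error
    baseline = measure_error(quantize_layer(layer, 128, 4))
    best = None
    for groupsize, wbits in CANDIDATES:
        quantized = quantize_layer(layer, groupsize, wbits)
        error = measure_error(quantized)
        if error <= 0.5 * baseline:          # stop once the error has halved
            return quantized, (groupsize, wbits)
        if best is None or error < best[1]:
            best = (quantized, error, (groupsize, wbits))
    return best[0], best[2]                  # otherwise keep the lowest-error attempt
```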

2. Add --quant-directory to export the quant table in toml + numpy format.

In matmul_kernel, a_ptr is fp16 while b_ptr is an integer type, which is bad for SIMD instructions (such as on ARM/x86).

And while llama.py supports multiple quantization methods, they are hard to implement in C++ code (llama.cpp/ncnn, for example).

So why not combine them: export the quant table so those projects can consume the quantization results.
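
As a rough illustration only (the file layout and key names here are assumptions, not the exact format written by --quant-directory), a toml + numpy export could look like this:

```python
import os

import numpy as np
import toml  # pip install toml

def export_quant_table(layer_name, scale, zero, wbits, groupsize, out_dir):
    """Write per-layer quant parameters: .npy arrays plus a toml description."""
    os.makedirs(out_dir, exist_ok=True)
    scale_file = f"{layer_name}.scale.npy"
    zero_file = f"{layer_name}.zero.npy"
    np.save(os.path.join(out_dir, scale_file), np.asarray(scale, dtype=np.float32))
    np.save(os.path.join(out_dir, zero_file), np.asarray(zero, dtype=np.float32))
    meta = {
        layer_name: {
            "wbits": wbits,
            "groupsize": groupsize,
            "scale": scale_file,
            "zero": zero_file,
        }
    }
    # Append this layer's entry to a single quant.toml next to the arrays.
    with open(os.path.join(out_dir, "quant.toml"), "a") as f:
        toml.dump(meta, f)
```

With the parameters stored as plain numpy arrays and the layout described in toml, a C++ runtime such as llama.cpp or ncnn could read the table without depending on PyTorch.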

Cleanup

  1. Remove import *
  2. Move some files to a utils module
  3. Remove unused imports
  4. Update README, remove hard-coded values
  5. Add .gitignore

Others

  1. Add texttable for better display, like this:
+---------------------+------------+
|        name         |   error    |
+=====================+============+
| mlp.down_proj.31    | 259870.141 |
+---------------------+------------+
| self_attn.q_proj.27 | 132039.609 |
+---------------------+------------+
| self_attn.q_proj.30 | 130265.906 |
+---------------------+------------+
| mlp.gate_proj.30    | 128808.148 |
+---------------------+------------+
| mlp.gate_proj.29    | 128561.297 |
+---------------------+------------+
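
For reference, a table like the one above comes straight out of the texttable package; a minimal reproduction with the values shown above would be:

```python
from texttable import Texttable  # pip install texttable

table = Texttable()
table.header(["name", "error"])
table.add_rows(
    [
        ["mlp.down_proj.31", 259870.141],
        ["self_attn.q_proj.27", 132039.609],
        ["self_attn.q_proj.30", 130265.906],
        ["mlp.gate_proj.30", 128808.148],
        ["mlp.gate_proj.29", 128561.297],
    ],
    header=False,
)
print(table.draw())
```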

Tests

The following commands were tested:

# observe enabled: nothing is saved, even though --save is passed
python llama.py /llama/7B-huggingface-v4.28.0.dev/ c4 --wbits 4 --true-sequential --act-order --groupsize -1 --sym --observe --save llama7b-4bit-128g.pt

# save quant table
python llama.py /llama/7B-huggingface-v4.28.0.dev/ c4 --wbits 4 --true-sequential --act-order --groupsize 128 --sym --observe --quant-directory /workspace/llama-int4-table/

# observe disabled: no side effect on the main branch
$ cat quant-fp16-baseline.sh 
CUDA_VISIBLE_DEVICES=7 python llama.py /llama/7B-huggingface-v4.28.0.dev/ c4 --wbits 4 --true-sequential --act-order --groupsize 128 --save llama7b-4bit-128g.pt
CUDA_VISIBLE_DEVICES=7 python llama_inference.py ${MODEL_DIR} --wbits 4 --groupsize 128 --load llama7b-4bit-128g.pt --text "this is llama"
..
Done.
 ⁇  this is llama. The word is llama.
"What does it mean?" she said.
"The llama's a mammal," I said. "It's a type of camel, but it does

@tpoisonooo tpoisonooo closed this Apr 17, 2023
@tpoisonooo tpoisonooo reopened this Apr 17, 2023
@qwopqwop200 (Owner) commented Apr 17, 2023

I'm currently traveling so I can't check in detail, but this looks very interesting. I think this could be more effective than 3 bits. When the code is complete, please let me know the results. I will merge this code.

@qwopqwop200 (Owner) commented Apr 17, 2023

Please run the llama benchmarks, because the current inference code was written assuming that all layers use the same bits and groupsize.

@tpoisonooo tpoisonooo changed the title WIP: feat(quant): add observer feat(quant): add observer Apr 18, 2023
@tpoisonooo (Contributor, Author):

@qwopqwop200 please review.

@qwopqwop200 (Owner) commented Apr 18, 2023

Thanks for the great work. First of all, maybe I can implement a different wbits and groupsize for each layer. However, I currently don't have the environment to test this out. I think we can make this possible in maybe 3 days. And I'm not familiar with C++, so I can't combine them.

@qwopqwop200 qwopqwop200 merged commit fcf403f into qwopqwop200:triton Apr 18, 2023
@tpoisonooo tpoisonooo deleted the add-observer-triton branch April 18, 2023 11:44
@qwopqwop200 (Owner):

No, to be precise, I'm on a trip right now, so I can't experiment. The trip is over soon, so I'll experiment from then on.

@johnrobinsn (Contributor) commented Apr 18, 2023

@qwopqwop200 I think this might have the potential to help on the t5 model quant as well. (would be good to be able to disable quant on select layers too)

@tpoisonooo (Contributor, Author):

> No, to be precise, I'm on a trip right now, so I can't experiment. The trip is over soon, so I'll experiment from then on.

Life is short, work is endless, so enjoy the trip!

@johnrobinsn johnrobinsn mentioned this pull request Apr 18, 2023
@qwopqwop200 (Owner):

@tpoisonooo
As a result of my current experiments, observe does not work properly; these results seem to actually increase ppl.
https://github.com/qwopqwop200/GPTQ-for-LLaMa/tree/observe
