feat(quant): add observer #189
Conversation
feat(project): add quant
I'm currently traveling so I can't check in detail, but this looks very interesting. I think this could be more effective than 3 bits. When the code is complete, please let me know the results. I will merge this code.
Please run the llama benchmarks, because the current inference code was written assuming that all layers use the same bits and groupsize.
@qwopqwop200 please review.
Thanks for the great work. First of all, maybe I can implement a different wbits and groupsize for each layer. However, I currently don't have the environment to test this out. I think we can make this possible in maybe 3 days. And I'm not familiar with cpp, so I can't combine them.
No, to be precise, I'm on a trip right now, so I can't experiment. The trip ends soon, so I'll experiment from then on.
@qwopqwop200 I think this might have the potential to help with t5 model quantization as well. (It would be good to be able to disable quant on select layers too.)
Life is short, work is endless, so enjoy the trip!
@tpoisonooo |
Features
1. Add `--observer` to re-quantize any layer that has a big error. For example, with `<groupsize=128, wbits=4>`, it keeps trying until the error drops by 50% (a rough sketch of the idea follows below).

   I have spent 2 days trying to use a different `<groupsize, wbits>` for each layer, but in vain; it is really hard to refactor an unfamiliar repo. If `--observe` is enabled, no `tensor` or `safe_tensor` would be saved.

2. Add `--quant-directory` to export the quant table in toml+numpy format (see the second sketch below).
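A minimal sketch of what the `--observer` retry loop could look like. Note that `quantize_fn` and the shrinking-groupsize retry policy are hypothetical placeholders for illustration, not this PR's actual code:

```python
def observe_and_requantize(layer, quantize_fn, wbits=4, groupsize=128):
    """Re-quantize `layer` until the error drops to 50% of the first attempt.

    `quantize_fn(layer, wbits, groupsize) -> (qlayer, error)` is a hypothetical
    stand-in for the repo's real GPTQ quantization call.
    """
    qlayer, base_error = quantize_fn(layer, wbits, groupsize)
    target = base_error * 0.5
    # Illustrative retry policy: shrink the group size until the target is met.
    for gs in (64, 32, 16):
        candidate, error = quantize_fn(layer, wbits, gs)
        if error <= target:
            return candidate, error
    # No attempt reached the target; keep the original quantization.
    return qlayer, base_error
```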
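And a hedged sketch of the toml+numpy export: arrays go to `.npy` files while scalar settings go into a toml index. The `export_quant_table` helper, the table structure, and the on-disk layout are assumptions for illustration (using the third-party `toml` package), not necessarily the PR's exact format:

```python
import os
import numpy as np
import toml

def export_quant_table(table, quant_dir):
    """Write {name: {"scale": ndarray, "zero": ndarray, "wbits": int,
    "groupsize": int}} as per-layer .npy files plus a quant.toml index."""
    os.makedirs(quant_dir, exist_ok=True)
    meta = {}
    for name, entry in table.items():
        scale_path = f"{name}.scale.npy"
        zero_path = f"{name}.zero.npy"
        np.save(os.path.join(quant_dir, scale_path), entry["scale"])
        np.save(os.path.join(quant_dir, zero_path), entry["zero"])
        meta[name] = {
            "wbits": entry["wbits"],
            "groupsize": entry["groupsize"],
            "scale": scale_path,
            "zero": zero_path,
        }
    with open(os.path.join(quant_dir, "quant.toml"), "w") as f:
        toml.dump(meta, f)
```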
In `matmul_kernel`, `a_ptr` is fp16 and `b_ptr` is integer, which is bad for SIMD instructions (such as ARM/x86); see the illustration below. While llama.py supports multiple quantization methods, it is hard to implement them in cpp code (llama.cpp/ncnn, for example). Why not combine them?
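A rough numpy illustration of the mixed-dtype problem: because the two operands have different element types, the integer weights must be widened and dequantized before every multiply, which breaks uniform-lane SIMD on CPU. Shapes and scale/zero values here are made up:

```python
import numpy as np

a = np.random.rand(4, 128).astype(np.float16)              # fp16 activations
b_q = np.random.randint(0, 16, (128, 64), dtype=np.int8)   # int4 weights, stored in int8
scale, zero = np.float16(0.01), np.int8(8)

# The integer operand has to be converted to fp16 element by element before the
# matmul; a uniform-type kernel could feed both operands straight to SIMD lanes.
b = ((b_q - zero) * scale).astype(np.float16)
c = a @ b
```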
Typo

Fix `import *` in the `utils` module.

Others
Use `texttable` for better display, like this:
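For reference, a tiny `texttable` usage sketch (the rows shown are made-up values, not this PR's actual output):

```python
from texttable import Texttable

# Build a small ASCII table of per-layer quantization errors (illustrative data).
table = Texttable()
table.header(["layer", "error"])
table.add_row(["self_attn.q_proj", 0.42])
table.add_row(["mlp.down_proj", 1.37])
print(table.draw())
```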
Tests

These commands were tested: