Amazing 4bit inference speed!
I happened to see the new branch `fastest-inference-4bit`, so I ran a test:
LLaMA-13B | branch | Bits | group-size | memory(MiB) | PPL(c4) | Median(s/token) |
---|---|---|---|---|---|---|
FP16 | fastest-inference-4bit | 16 | - | 26634 | 6.96 | 0.0383 |
GPTQ | triton | 4 | 128 | 8590 | 6.97 | 0.0551 |
GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 19069 | 0.0344 |
But why is the PPL value so large?
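(For reference, PPL(c4) here is the usual perplexity over the c4 evaluation tokens, i.e. the exponentiated mean negative log-likelihood, so a value around 19000 suggests near-random predictions rather than a small quantization loss:)

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p_\theta(x_i \mid x_{<i})\right)$$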
Updated result:
- The `act-order` parameter is now kept consistent between building the compressed model and running the benchmark; with `groupsize = -1` and `act-order = false` also included:
LLaMA-13B | branch | Bits | group-size | memory(MiB) | PPL(c4) | Median(s/token) | act-order | speed up |
---|---|---|---|---|---|---|---|---|
FP16 | fastest-inference-4bit | 16 | - | 26634 | 6.96 | 0.0383 | - | 1x |
GPTQ | triton | 4 | 128 | 8590 | 6.97 | 0.0551 | - | 0.69x |
GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 6.97 | 0.0429 | true | 0.89x |
GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 7.03 | 0.0287 | false | 1.33x |
GPTQ | fastest-inference-4bit | 4 | -1 | 8448 | 7.12 | 0.0284 | false | 1.44x |
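For clarity, the speed-up column appears to be the FP16 median per-token latency divided by each row's median latency. A minimal sketch (the `speedup` helper is just illustrative, not part of the benchmark script):

```python
def speedup(fp16_median: float, quant_median: float) -> float:
    """Ratio of FP16 median per-token latency to the quantized median (higher is faster)."""
    return fp16_median / quant_median

# Example: the groupsize=128, act-order=false row above.
print(f"{speedup(0.0383, 0.0287):.2f}x")  # -> 1.33x
```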