Result with the branch fastest-inference-4bit #227

@alanxmay

Amazing 4bit inference speed!

I happened to see the new `fastest-inference-4bit` branch, so I ran a quick benchmark:

| LLaMA-13B | Branch | Bits | Group size | Memory (MiB) | PPL (c4) | Median (s/token) |
|-----------|--------|------|------------|--------------|----------|------------------|
| FP16 | fastest-inference-4bit | 16 | - | 26634 | 6.96 | 0.0383 |
| GPTQ | triton | 4 | 128 | 8590 | 6.97 | 0.0551 |
| GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 19069 | 0.0344 |

But why is the PPL value so large?
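For context on the scale: perplexity is the exponential of the mean per-token negative log-likelihood, so a jump from ~7 to ~19000 suggests the quantized weights are being decoded incorrectly rather than merely degraded. A minimal sketch of the metric (the loss values here are illustrative stand-ins, not taken from these runs):

```python
import math

def perplexity(nlls):
    """Perplexity = exp(mean per-token negative log-likelihood in nats)."""
    return math.exp(sum(nlls) / len(nlls))

# ~1.94 nats/token corresponds to the healthy PPL of ~6.96 seen above.
print(perplexity([1.94, 1.94, 1.94, 1.94]))

# A PPL of 19069 would require ~9.86 nats/token, close to the ~10.4 nats
# (PPL 32000) of uniform guessing over LLaMA's 32K-token vocabulary.
print(math.log(19069))
```

In other words, a PPL that large is near the random-guess ceiling, which usually points at a loading/decoding mismatch rather than quantization loss.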


Updated result:

  1. The act-order setting is now kept consistent between building the compressed model and running the benchmark.
  2. Added runs with group-size = -1 and act-order = false:
| LLaMA-13B | Branch | Bits | Group size | Memory (MiB) | PPL (c4) | Median (s/token) | act-order | Speed-up |
|-----------|--------|------|------------|--------------|----------|------------------|-----------|----------|
| FP16 | fastest-inference-4bit | 16 | - | 26634 | 6.96 | 0.0383 | - | 1x |
| GPTQ | triton | 4 | 128 | 8590 | 6.97 | 0.0551 | - | 0.69x |
| GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 6.97 | 0.0429 | true | 0.89x |
| GPTQ | fastest-inference-4bit | 4 | 128 | 8699 | 7.03 | 0.0287 | false | 1.33x |
| GPTQ | fastest-inference-4bit | 4 | -1 | 8448 | 7.12 | 0.0284 | false | 1.35x |
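The speed-up column is just the FP16 median latency divided by each row's median; checking the arithmetic on the numbers above:

```python
fp16_median = 0.0383  # s/token, FP16 baseline row

# Median s/token for each quantized configuration from the table.
medians = {
    "triton, g=128": 0.0551,
    "fastest-inference, g=128, act-order=true": 0.0429,
    "fastest-inference, g=128, act-order=false": 0.0287,
    "fastest-inference, g=-1, act-order=false": 0.0284,
}

for name, m in medians.items():
    print(f"{name}: {fp16_median / m:.2f}x relative to FP16")
```

Values below 1x are slower than FP16. Note that 0.0383/0.0551 rounds to 0.70 (the table's 0.69 looks truncated), and the last row works out to about 1.35x (0.0383/0.0284).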
