Will these optimizations be integrated into hf's code? #9

Open
lucasjinreal opened this issue Dec 1, 2023 · 7 comments

Comments

@lucasjinreal

so that everyone can use it out of the box?


aniketmaurya commented Dec 1, 2023

Most of these features are already supported in Lit-GPT (if you're looking to finetune LLMs), and more will be supported soon. You can use LLMs from the HF model hub.
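
For reference, a minimal sketch of what that looks like through litgpt's Python API (`LLM.load` / `generate`); the checkpoint name is a placeholder, and the exact entry points may differ between Lit-GPT versions (earlier releases used download/convert scripts instead):

```python
# Hedged sketch: load a Hugging Face hub checkpoint through litgpt's Python API.
# The model id is a placeholder; older Lit-GPT versions used script-based
# download/convert steps rather than LLM.load.
from litgpt import LLM

llm = LLM.load("microsoft/phi-2")  # downloads and converts the HF checkpoint
print(llm.generate("What does the KV cache store?", max_new_tokens=64))
```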


SunMarc commented Dec 1, 2023

Thanks for the interest! We already support most of the optimizations described here:
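
As a rough illustration (not an official transformers recipe), the basic pieces exposed in transformers look roughly like this; the checkpoint name and generation settings are placeholders, and how much torch.compile pays off depends on the kv-cache handling discussed below:

```python
# Illustrative sketch: fp16 weights plus torch.compile on the forward pass.
# Checkpoint name and generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Compile the decoder forward; generate() will call the compiled function.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tok("Hello, my name is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```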


Chillee commented Dec 1, 2023

@SunMarc I think there might still be some gaps in how the kv-cache is handled during inference. Specifically, the link you sent is about vision models, not text generation.

We should chat more about this; I'd love to see the techniques here integrated.
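
For context, the gist of the kv-cache approach in this repo is a statically allocated cache, so decode shapes never change. A simplified sketch (not the repo's exact code; buffer layout assumed) looks like:

```python
# Sketch of a statically allocated kv-cache: pre-allocate fixed-shape buffers
# and write each new token's keys/values into its position, so every decode
# step sees the same tensor shapes.
import torch
from torch import nn

class StaticKVCache(nn.Module):
    def __init__(self, batch, n_heads, max_seq_len, head_dim, dtype=torch.float16):
        super().__init__()
        shape = (batch, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape, dtype=dtype))
        self.register_buffer("v_cache", torch.zeros(shape, dtype=dtype))

    def update(self, input_pos, k_new, v_new):
        # input_pos: 1-D LongTensor of token positions; k_new/v_new: [B, H, T, D]
        self.k_cache.index_copy_(2, input_pos, k_new)
        self.v_cache.index_copy_(2, input_pos, v_new)
        return self.k_cache, self.v_cache

# Decode step: write one token's k/v at its position, then attend over the full
# cache with a mask covering only the valid positions, e.g.
#   k, v = cache.update(torch.tensor([pos], device=k_step.device), k_step, v_step)
```

Because the buffer shapes never change, torch.compile (and CUDA graphs) can capture a single specialized single-token decode step instead of recompiling as the sequence grows.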


SunMarc commented Dec 1, 2023

Yes, absolutely! cc @younesbelkada for visibility


yhyu13 commented Dec 3, 2023

These optimizations should already be in hf. Moreover, hardware-specific optimizations, like writing your own CUDA kernels for GPTQ and paged attention (e.g. flash_attn2), would make inference even faster.

https://github.com/turboderp/exllamav2 has benchmarked llama-7b at 190+ t/s on a single 3090 Ti, which matches this repo on 8xA100, but a 3090 Ti has only about 1/3 the FLOPS of a single A100. So hardware-level optimization is another driver as well.
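
For what it's worth, transformers exposes the FlashAttention-2 kernels behind a flag (requires the flash-attn package and a supported GPU); the checkpoint name below is a placeholder:

```python
# Hedged example: opt into FlashAttention-2 kernels when loading a model.
# Requires the flash-attn package; checkpoint name is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")
```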

@lucasjinreal (Author)

Hi, does torch.compile work with AWQ?

(It seems hf already supports AWQ, but the quantization approach might not be the same as in this repo.)
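
(For reference, transformers can load AWQ checkpoints directly, backed by the autoawq kernels; whether torch.compile composes cleanly with those custom kernels is exactly the open question here. A hedged sketch, with a placeholder checkpoint:)

```python
# Hedged sketch: loading an AWQ-quantized checkpoint with transformers
# (requires autoawq installed); the repo id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"  # placeholder AWQ checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
```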

How do you enable speculative decoding in hf?
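
(Transformers' version of this is assisted generation: pass a smaller draft model as `assistant_model` to `generate()`. A hedged sketch with placeholder model names; the draft model must share the target model's tokenizer:)

```python
# Hedged sketch of assisted generation (transformers' speculative decoding):
# a small draft model proposes tokens that the target model verifies.
# Model names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-7b-hf"           # placeholder target model
draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder draft model

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16).to("cuda")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16).to("cuda")

inputs = tok("The key to fast inference is", return_tensors="pt").to("cuda")
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```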


Chillee commented Dec 4, 2023

@yhyu13

https://github.com/turboderp/exllamav2 has benchmarked llama-7b at 190+ t/s on a single 3090 Ti, which matches this repo on 8xA100, but a 3090 Ti has only about 1/3 the FLOPS of a single A100.

To be clear, the benchmark in this repo is 197 t/s on a single A100 with a group size of 32, while exllamav2 is running on a single 4090 with a group size of 128.

Still certainly very good results from exllamav2 :)
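To make the group-size difference concrete, here is a toy illustration (not the repo's actual kernels) of group-wise int4 quantization: each group of weights shares one scale, so a group size of 32 tracks local weight magnitudes more closely than 128, at the cost of storing more scales and doing more dequantization work:

```python
# Toy group-wise int4 quantization: smaller groups -> lower reconstruction
# error, more scales to store. Shapes and values are illustrative only.
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int):
    # w: [out_features, in_features]; in_features must be divisible by group_size
    wg = w.reshape(w.shape[0], -1, group_size)                   # [O, n_groups, group_size]
    scales = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(wg / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scales).reshape(shape)

w = torch.randn(16, 256)
for gs in (32, 128):
    q, s = quantize_int4_groupwise(w, gs)
    err = (dequantize(q, s, w.shape) - w).abs().mean().item()
    print(f"group_size={gs}: mean abs reconstruction error = {err:.4f}")
```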
