Will these optimizations be integrated into hf's code? #9

Open
lucasjinreal opened this issue Dec 1, 2023 · 7 comments

Comments

@lucasjinreal

so that everyone can use it out of the box?


aniketmaurya commented Dec 1, 2023

Most of these features are already supported in Lit-GPT (if you're looking to finetune LLMs), and more will be supported soon. You can use LLMs from the HF model hub.
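
For reference, a minimal sketch of what that looks like through litgpt's Python API (`LLM.load` / `generate`); the checkpoint name is a placeholder, and the exact entry points may differ between Lit-GPT versions (earlier releases used download/convert scripts instead):

```python
# Hedged sketch: load a Hugging Face hub checkpoint through litgpt's Python API.
# The model id is a placeholder; older Lit-GPT versions used script-based
# download/convert steps rather than LLM.load.
from litgpt import LLM

llm = LLM.load("microsoft/phi-2")  # downloads and converts the HF checkpoint
print(llm.generate("What does the KV cache store?", max_new_tokens=64))
```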


SunMarc commented Dec 1, 2023

Thanks for the interest! We already support most of the optimizations described here:
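
As a rough illustration (not an official transformers recipe), the basic pieces exposed in transformers look roughly like this; the checkpoint name and generation settings are placeholders, and how much torch.compile pays off depends on the kv-cache handling discussed below:

```python
# Illustrative sketch: fp16 weights plus torch.compile on the forward pass.
# Checkpoint name and generation settings are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # placeholder checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")

# Compile the decoder forward; generate() will call the compiled function.
model.forward = torch.compile(model.forward, mode="reduce-overhead")

inputs = tok("Hello, my name is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
print(tok.decode(out[0], skip_special_tokens=True))
```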


Chillee commented Dec 1, 2023

@SunMarc I think there might still be some gaps in how the kv-cache is handled during inference. Specifically, the link you sent is about vision models, not text generation.

We should chat more about this; I'd love to see the techniques here integrated.
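
For context, the gist of the kv-cache approach in this repo is a statically allocated cache, so decode shapes never change. A simplified sketch (not the repo's exact code; buffer layout assumed) looks like:

```python
# Sketch of a statically allocated kv-cache: pre-allocate fixed-shape buffers
# and write each new token's keys/values into its position, so every decode
# step sees the same tensor shapes.
import torch
from torch import nn

class StaticKVCache(nn.Module):
    def __init__(self, batch, n_heads, max_seq_len, head_dim, dtype=torch.float16):
        super().__init__()
        shape = (batch, n_heads, max_seq_len, head_dim)
        self.register_buffer("k_cache", torch.zeros(shape, dtype=dtype))
        self.register_buffer("v_cache", torch.zeros(shape, dtype=dtype))

    def update(self, input_pos, k_new, v_new):
        # input_pos: 1-D LongTensor of token positions; k_new/v_new: [B, H, T, D]
        self.k_cache.index_copy_(2, input_pos, k_new)
        self.v_cache.index_copy_(2, input_pos, v_new)
        return self.k_cache, self.v_cache

# Decode step: write one token's k/v at its position, then attend over the full
# cache with a mask covering only the valid positions, e.g.
#   k, v = cache.update(torch.tensor([pos], device=k_step.device), k_step, v_step)
```

Because the buffer shapes never change, torch.compile (and CUDA graphs) can capture a single specialized single-token decode step instead of recompiling as the sequence grows.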


SunMarc commented Dec 1, 2023

Yes, absolutely! cc @younesbelkada for visibility


yhyu13 commented Dec 3, 2023

These optimizations should already be in hf. Moreover, hardware-specific optimizations, like writing your own CUDA kernels for GPTQ and paged attention (e.g. flash_attn2), would make inference even faster.

https://github.com/turboderp/exllamav2 has benchmarked llama-7b at 190+ t/s on a single 3090 Ti, which matches this repo on 8xA100, but a 3090 Ti has only about 1/3 the FLOPS of a single A100. So hardware-level optimization is another driver as well.
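
For what it's worth, transformers exposes the FlashAttention-2 kernels behind a flag (requires the flash-attn package and a supported GPU); the checkpoint name below is a placeholder:

```python
# Hedged example: opt into FlashAttention-2 kernels when loading a model.
# Requires the flash-attn package; checkpoint name is a placeholder.
import torch
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",              # placeholder
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2",
).to("cuda")
```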

@lucasjinreal (Author)

Hi, does torch.compile work with AWQ?

(It seems hf already supports AWQ, but the quantization approach might not be the same as in this repo.)
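
(For reference, transformers can load AWQ checkpoints directly, backed by the autoawq kernels; whether torch.compile composes cleanly with those custom kernels is exactly the open question here. A hedged sketch, with a placeholder checkpoint:)

```python
# Hedged sketch: loading an AWQ-quantized checkpoint with transformers
# (requires autoawq installed); the repo id is a placeholder.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "TheBloke/Mistral-7B-Instruct-v0.1-AWQ"  # placeholder AWQ checkpoint
tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, device_map="cuda:0")
```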

How do you enable speculative decoding in hf?
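
(Transformers' version of this is assisted generation: pass a smaller draft model as `assistant_model` to `generate()`. A hedged sketch with placeholder model names; the draft model must share the target model's tokenizer:)

```python
# Hedged sketch of assisted generation (transformers' speculative decoding):
# a small draft model proposes tokens that the target model verifies.
# Model names are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

target_id = "meta-llama/Llama-2-7b-hf"           # placeholder target model
draft_id = "TinyLlama/TinyLlama-1.1B-Chat-v1.0"  # placeholder draft model

tok = AutoTokenizer.from_pretrained(target_id)
target = AutoModelForCausalLM.from_pretrained(target_id, torch_dtype=torch.float16).to("cuda")
draft = AutoModelForCausalLM.from_pretrained(draft_id, torch_dtype=torch.float16).to("cuda")

inputs = tok("The key to fast inference is", return_tensors="pt").to("cuda")
out = target.generate(**inputs, assistant_model=draft, max_new_tokens=64)
print(tok.decode(out[0], skip_special_tokens=True))
```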


Chillee commented Dec 4, 2023

@yhyu13

https://github.com/turboderp/exllamav2 has benchmarked llama-7b at 190+ t/s on a single 3090 Ti, which matches this repo on 8xA100, but a 3090 Ti has only about 1/3 the FLOPS of a single A100.

To be clear, the benchmark in this repo is 197 t/s on a single A100 with a group size of 32, while exllamav2 is running on a single 4090 with a group size of 128.

Still certainly very good results from exllamav2 :)
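To make the group-size difference concrete, here is a toy illustration (not the repo's actual kernels) of group-wise int4 quantization: each group of weights shares one scale, so a group size of 32 tracks local weight magnitudes more closely than 128, at the cost of storing more scales and doing more dequantization work:

```python
# Toy group-wise int4 quantization: smaller groups -> lower reconstruction
# error, more scales to store. Shapes and values are illustrative only.
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int):
    # w: [out_features, in_features]; in_features must be divisible by group_size
    wg = w.reshape(w.shape[0], -1, group_size)                   # [O, n_groups, group_size]
    scales = wg.abs().amax(dim=-1, keepdim=True).clamp(min=1e-8) / 7.0
    q = torch.clamp(torch.round(wg / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize(q: torch.Tensor, scales: torch.Tensor, shape) -> torch.Tensor:
    return (q.float() * scales).reshape(shape)

w = torch.randn(16, 256)
for gs in (32, 128):
    q, s = quantize_int4_groupwise(w, gs)
    err = (dequantize(q, s, w.shape) - w).abs().mean().item()
    print(f"group_size={gs}: mean abs reconstruction error = {err:.4f}")
```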
