Question - possible to run starcoder with exllama? #14
Comments
I want to look at other architectures at some point, but it's not at the top of my list right now. SantaCoder uses a different pipeline and would require a different CUDA backend, and I don't want to get too ambitious before the Llama stuff is somewhat stable, at least.
Understood, and thank you for your contributions; this library is amazing. I will do some experimenting myself at some point to try to get StarCoder working with exllama, because this is by far the fastest inference there is, and it's not even close. As for doing it myself, would both the pipeline and the CUDA backend be inside GPTQ-for-SantaCoder? Do you have any tips if I were to try to make the SantaCoder architecture work with exllama?
Piggybacking off this: do you have a rough idea of the overhead involved in supporting models like MPT and Falcon (commercial-use models)? What about OpenLLaMA?
It would take some work. ExLlama doesn't use Transformers, which is much of the reason why it's fast, but it also means you can't plug in new models like MPT or Falcon as easily. That's the tradeoff with the Transformers library: it tries to fit every model into the same framework and provides a bunch of easy, model-agnostic interfaces (like pipelines), but you're not always getting as much out of each model as you could be. If you're doing research, that's perfectly fine, of course. Just throw another H100 at it if it's too slow, or let it finish processing overnight. But it's not a good foundation for targeting consumer hardware.

Now, with a GPTQ conversion of MPT, Falcon, or OpenLLaMA, I probably wouldn't need too much extra code to support those models; they're still all transformers, after all. But I'm worried about multiplying the number of cases to consider by four just to support some unproven new models that might be forgotten by the end of next week. Always keep in mind that benchmarks don't really mean anything at the end of the day.
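For context, the model-agnostic interface being described looks roughly like this; a minimal sketch, assuming the MPT checkpoint id as an example and that transformers is installed:

```python
from transformers import pipeline

# The same pipeline() call works for Llama, MPT, Falcon, etc., because the
# library hides model-specific details; that generality is the tradeoff
# discussed above. The model id here is just an example.
generator = pipeline(
    "text-generation",
    model="mosaicml/mpt-7b",
    trust_remote_code=True,
    device_map="auto",
)
print(generator("The tradeoff with generic frameworks is", max_new_tokens=32)[0]["generated_text"])
```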
@turboderp Let's say I wanted to extend exllama to support Falcon (which is now fully open source, by the way; that will probably drive adoption like crazy given that the 40B is currently SOTA, source: https://twitter.com/TIIuae/status/1661728921044553730). The reason I'm proposing Falcon here is that I feel hesitant building so much on top of Llama given its restrictive license. I do understand your point about not scaling to support every model (new ones come out every week at this rate), but I think Falcon is probably worth the time. If you don't think you can take this on, I would be happy to help; this could be a great learning experience for me, and I'm happy to fork into an exFalcon if you prefer this repo staying true to Llama. I would also appreciate some info on how I can get started understanding this codebase. I'm a generalist SWE by trade with mostly data/backend experience, so this low-level code is a little new to me.
@nikshepsvn The next step after I get the web UI to a usable state is probably going to be some more optimization. I need to strip out a bit of unused code, build out the C++ extension a bunch, and experiment with splitting matmuls across multiple GPUs. I think with tighter synchronization, multi-GPU setups should be able to get a significant speed boost on individual tokens, and with the extra 3090-Ti I'm getting today I hope to eventually double the performance on 65B. After that I need to get back to memory optimization, since I'm still not happy with Torch's memory overhead. Then, a little down the line, I expect to refactor a bunch, mostly to move all of the tuning parameters into a separate config and add a better test harness and an auto-tuning feature, since the number of parameters is going to become unmanageable otherwise. All in all, it's a bad place to start getting distracted.

Falcon absolutely looks interesting, especially after they changed their minds about charging for it, but it's still an unproven and censored model. I have no problem with people forking the code and building on it, but the disclaimer at the top of the readme is there for a reason. The project is still going through puberty, and anything you'd add to make it run Falcon would have to be reworked over and over again until the underlying engine stops changing.
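To picture what "splitting matmuls across multiple GPUs" means, here is a minimal sketch of a column-parallel matmul in plain PyTorch; this is only an illustration of the idea, not ExLlama's actual implementation, and the shapes are arbitrary:

```python
import torch

def column_parallel_matmul(x, weight_slices, devices):
    # Each device holds a column slice of the weight and computes its partial
    # product; the slices are then gathered and concatenated on the first device.
    partials = [x.to(dev) @ w for w, dev in zip(weight_slices, devices)]
    return torch.cat([p.to(devices[0]) for p in partials], dim=-1)

if torch.cuda.device_count() >= 2:
    devices = ["cuda:0", "cuda:1"]
    x = torch.randn(1, 4096, dtype=torch.float16)
    w = torch.randn(4096, 11008, dtype=torch.float16)
    slices = [w[:, :5504].to(devices[0]), w[:, 5504:].to(devices[1])]
    y = column_parallel_matmul(x, slices, devices)  # shape (1, 11008)
```

The synchronization point is the gather at the end, which is why tighter coordination between devices matters for per-token latency.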
Hi, sorry to bother you, but could you please take a moment to look at this StarCoder model for GPTQ? According to its description, the model was quantized with AutoGPTQ and supports CUDA and Triton but not GPTQ-for-LLaMa. However, I get an error when loading it, at self.pad_token_id = read_config["pad_token_id"].
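The failure described above looks like a missing key in the model's config.json; a minimal sketch of the difference between the failing lookup and a tolerant one (the file path and fallback value here are assumptions, not the project's actual code):

```python
import json

# If config.json has no "pad_token_id" entry, the direct lookup
# read_config["pad_token_id"] raises KeyError. A tolerant read falls back.
with open("models/starcoder-gptq/config.json") as f:  # hypothetical path
    read_config = json.load(f)

pad_token_id = read_config.get("pad_token_id")
if pad_token_id is None:
    pad_token_id = read_config.get("eos_token_id", 0)  # reuse EOS as padding
```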
StarCoder and StarChat use a different model architecture than Llama, so it wouldn't be easy to add support for them, no. I may get to it eventually, but it's not very high on my list right now.
Recently a 15.5B-parameter model called StarCoder was released: https://huggingface.co/bigcode/starcoder. You should be able to run it with text-generation-webui using a fork of GPTQ-for-llama called GPTQ-for-SantaCoder: https://github.com/mayank31398/GPTQ-for-SantaCoder. Since, as far as I can tell, these two libraries use the same underlying library (transformers) as well as the same quantization method (GPTQ), shouldn't it be possible to run this with exllama? Any ideas on how I would go about doing this? @turboderp @disarmyouwitha
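Not an exllama answer, but as a stopgap the AutoGPTQ route mentioned above in the thread would look roughly like this; a hedged sketch, where the quantized repo id is a placeholder rather than a verified checkpoint:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load a GPTQ-quantized StarCoder checkpoint with AutoGPTQ on a single GPU.
quantized_repo = "someuser/starcoder-gptq-4bit"  # placeholder id, not verified
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoGPTQForCausalLM.from_quantized(quantized_repo, device="cuda:0", use_triton=False)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```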