Question - possible to run starcoder with exllama? #14
Comments
I want to look at other architectures at some point, but it's not at the top of my list right now. SantaCoder uses a different pipeline and would require a different CUDA backend, and I don't want to get too ambitious before the Llama stuff is somewhat stable, at least.
Understood, and thank you for your contributions; this library is amazing. I will do some experimenting myself at some point to try to get StarCoder working with exllama, because this is by far the fastest inference there is, and it's not even close. As for doing it myself, would both the pipeline and the CUDA backend be inside GPTQ-for-SantaCoder? Do you have any tips if I were to try to make the SantaCoder architecture work with exllama?
Piggybacking off this: do you have a rough idea of the overhead involved in supporting models like MPT and Falcon (commercial-use models)? What about OpenLLaMA?
It would take some work. ExLlama doesn't use Transformers, which is much of the reason why it's fast, but it also means you can't plug in new models like MPT or Falcon as easily. That's the tradeoff with the Transformers library: it tries to fit every model into the same framework and provides a bunch of easy, model-agnostic interfaces (like pipelines), but you're not always getting as much out of each model as you could be. If you're doing research, that's perfectly fine, of course. Just throw another H100 at it if it's too slow, or let it finish processing overnight. But it's not a good foundation for targeting consumer hardware.

Now, with a GPTQ conversion of MPT, Falcon, or OpenLLaMA, I probably wouldn't need too much extra code to support those models; they're still all transformers, after all. But I'm worried about multiplying the number of cases to consider by four just to support some unproven new models that might be forgotten by the end of next week. Always keep in mind that benchmarks don't really mean anything at the end of the day.
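For context, the model-agnostic interface being described looks roughly like this; a minimal sketch, assuming the MPT checkpoint id as an example and that transformers is installed:

```python
from transformers import pipeline

# The same pipeline() call works for Llama, MPT, Falcon, etc., because the
# library hides model-specific details; that generality is the tradeoff
# discussed above. The model id here is just an example.
generator = pipeline(
    "text-generation",
    model="mosaicml/mpt-7b",
    trust_remote_code=True,
    device_map="auto",
)
print(generator("The tradeoff with generic frameworks is", max_new_tokens=32)[0]["generated_text"])
```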
@turboderp Let's say I wanted to extend exllama to support Falcon (which is now fully open source, by the way; that will probably drive adoption like crazy given that the 40B is currently SOTA, source: https://twitter.com/TIIuae/status/1661728921044553730). The reason I'm proposing Falcon here is that I feel hesitant building so much on top of Llama given its restrictive license. I do understand your point about not scaling to support every model (new ones come out every week at this rate), but I think Falcon is probably worth the time. If you don't think you can take this on, I would be happy to help; this could be a great learning experience for me, and I'm happy to fork into an exFalcon if you prefer this repo staying true to Llama. I would also appreciate some info on how I can get started understanding this codebase. I'm a generalist SWE by trade with mostly data/backend experience, so this low-level code is a little new to me.
@nikshepsvn The next step after I get the web UI to a usable state is probably going to be some more optimization. I need to strip out a bit of unused code, build out the C++ extension a bunch, and experiment with splitting matmuls across multiple GPUs. I think with tighter synchronization, multi-GPU setups should be able to get a significant speed boost on individual tokens, and with the extra 3090-Ti I'm getting today I hope to eventually double the performance on 65B. After that I need to get back to memory optimization, since I'm still not happy with Torch's memory overhead. Then, a little down the line, I expect to refactor a bunch, mostly to move all of the tuning parameters into a separate config and add a better test harness and an auto-tuning feature, since the number of parameters is going to become unmanageable otherwise. All in all, it's a bad place to start getting distracted.

Falcon absolutely looks interesting, especially after they changed their minds about charging for it, but it's still an unproven and censored model. I have no problem with people forking the code and building on it, but the disclaimer at the top of the readme is there for a reason. The project is still going through puberty, and anything you'd add to make it run Falcon would have to be reworked over and over again until the underlying engine stops changing.
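To picture what "splitting matmuls across multiple GPUs" means, here is a minimal sketch of a column-parallel matmul in plain PyTorch; this is only an illustration of the idea, not ExLlama's actual implementation, and the shapes are arbitrary:

```python
import torch

def column_parallel_matmul(x, weight_slices, devices):
    # Each device holds a column slice of the weight and computes its partial
    # product; the slices are then gathered and concatenated on the first device.
    partials = [x.to(dev) @ w for w, dev in zip(weight_slices, devices)]
    return torch.cat([p.to(devices[0]) for p in partials], dim=-1)

if torch.cuda.device_count() >= 2:
    devices = ["cuda:0", "cuda:1"]
    x = torch.randn(1, 4096, dtype=torch.float16)
    w = torch.randn(4096, 11008, dtype=torch.float16)
    slices = [w[:, :5504].to(devices[0]), w[:, 5504:].to(devices[1])]
    y = column_parallel_matmul(x, slices, devices)  # shape (1, 11008)
```

The synchronization point is the gather at the end, which is why tighter coordination between devices matters for per-token latency.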
Hi, sorry to bother you, but could you please take a moment to look at this StarCoder model for GPTQ? According to its description, the model was quantized with AutoGPTQ and supports CUDA and Triton but not GPTQ-for-LLaMa. However, I get an error when loading it, at self.pad_token_id = read_config["pad_token_id"].
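The failure described above looks like a missing key in the model's config.json; a minimal sketch of the difference between the failing lookup and a tolerant one (the file path and fallback value here are assumptions, not the project's actual code):

```python
import json

# If config.json has no "pad_token_id" entry, the direct lookup
# read_config["pad_token_id"] raises KeyError. A tolerant read falls back.
with open("models/starcoder-gptq/config.json") as f:  # hypothetical path
    read_config = json.load(f)

pad_token_id = read_config.get("pad_token_id")
if pad_token_id is None:
    pad_token_id = read_config.get("eos_token_id", 0)  # reuse EOS as padding
```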
StarCoder and StarChat use a different model architecture than Llama, so it wouldn't be easy to add support for them, no. I may get to it eventually, but it's not very high on my list right now.
Recently a 15.5B-parameter model called StarCoder was released: https://huggingface.co/bigcode/starcoder. You should be able to run it with text-generation-webui using a fork of GPTQ-for-llama called GPTQ-for-SantaCoder: https://github.com/mayank31398/GPTQ-for-SantaCoder. Since, as far as I can tell, these two libraries use the same underlying library (transformers) as well as the same quantization method (GPTQ), shouldn't it be possible to run this with exllama? Any ideas on how I would go about doing this? @turboderp @disarmyouwitha
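Not an exllama answer, but as a stopgap the AutoGPTQ route mentioned above in the thread would look roughly like this; a hedged sketch, where the quantized repo id is a placeholder rather than a verified checkpoint:

```python
from transformers import AutoTokenizer
from auto_gptq import AutoGPTQForCausalLM

# Load a GPTQ-quantized StarCoder checkpoint with AutoGPTQ on a single GPU.
quantized_repo = "someuser/starcoder-gptq-4bit"  # placeholder id, not verified
tokenizer = AutoTokenizer.from_pretrained("bigcode/starcoder")
model = AutoGPTQForCausalLM.from_quantized(quantized_repo, device="cuda:0", use_triton=False)

prompt = "def fibonacci(n):"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=64)[0]))
```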