Error when trying to run Wizard-Vicuna-13B-Uncensored-GPTQ #12

Has anyone seen an error like this before?

Comments
The last part is due to an incorrect function reference in the C++ extension. I've pushed a fix, but it's an inconsequential bug: it only prevents the extension from freeing the CUDA buffers when the application is exiting, and they would be freed automatically anyway. The actual error you're getting is from the safetensors library. From the looks of it, the file is just corrupt. But you never know. Do you have a download link for that specific model?
Figured it out; the problem was with how I fetched the model. Quick question: there is no UI/API for this yet, right? I was considering using this in a project, but I understand it's at an extremely early stage. To scale adoption, I think at a bare minimum there should be some API endpoints for interacting with exllama for inference. Any idea how you're prioritizing this? I would love to hop on and help if this is a little far out.
I'm working on a little UI that's mostly just going to serve as a demo (and so ChatGPT can teach me Javascript). It's coming along, and maybe I'll push something tomorrow. As for an API, it's still a little early, I think. I want to package it as a library at some point, but I'm hesitant right now; it feels like there are still a thousand little things I need to look at before I can give it a version number. I think once I've got a little UI up and running I'll also have a better sense of what the API should look like. In the meantime, @disarmyouwitha is working on a FastAPI interface, so maybe that's worth checking out.
Ah, thanks for the update. That's fair. I can see this being a useful project for serving GPTQ models in production via an API once we have commercially licensable models (like OpenLLaMA), but for now I think building for local use makes sense. Would love to contribute back with a basic API that emulates OpenAI, since I'll probably build that out for what I'm working on anyway. Thanks for the dope work here, btw -- inference speed is blazing fast running 2x exllama on a 3090 using a 13B model.

Heads up that running the benchmark fails, but chat works fine. This isn't blocking my prototyping right now, so I haven't looked deeper, but raising it here.
@turboderp Speaking of APIs, is there a way to keep state entirely in the prompt string and disable the internal cache? As far as I understand, right now, due to the internal cache and internal state, you can only have one conversation with a loaded model. It would be nice to be able to have different chats with different initial prompts using a single instance of the model loaded into memory. For example, the OpenAI chat API endpoint assumes that the client keeps the conversation history itself and provides the whole chat history with every request. Of course, adding support for keeping multiple internal states would probably be even better, but I guess that would take a lot more work than just providing an option to disable the internal state... BTW, thanks for your work -- inference results with 33B models on my 4090 are very impressive!
The cache is kind of essential for the speed. I'm sure OpenAI does caching transparently on their end somehow. I could see ways to go about that, like managing a dictionary of caches for a certain number of (most recent) prompt texts, but it would be complex, and you definitely don't want to end up swapping the cache tensors in and out of VRAM all the time.

I haven't tested the implementation on multiple threads at once, but at least alternating between inferences shouldn't be a problem. There might be an oversight somewhere, but the cache is supposed to be passed along with the hidden states throughout the forward pass, meaning the ExLlama class should be stateless. That's the idea, anyway. It builds from a config object, which gets embedded and kept, but the running state lives in the cache, which is kept by the generator, not the model.

So if you create multiple instances of ExLlamaGenerator, each referencing its own ExLlamaCache, they should all be able to use the same model, even taking turns on individual tokens. Of course you need to allocate the memory up front for each cache, but I personally prefer that approach over the risk of running out of memory down the line. And it's a lot faster to update the cache in-place instead of concatenating the way HF does, plus it saves gigabytes of VRAM.

You could alternatively just create a new generator every time you want to switch to a new context, or (probably better) you could keep a single cache and just reset it in place.
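For what it's worth, here is a rough sketch of that multiple-generator setup. It is not the project's documented API, just an illustration assuming the class names and generate_simple signature used in the repo's example scripts; all paths and session names below are placeholders.

```python
# Speculative sketch: one ExLlama model shared by several generators,
# each holding its own cache (i.e. its own conversation state).
from model import ExLlama, ExLlamaCache, ExLlamaConfig    # assumed module layout
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("/models/wizard-vicuna-13b/config.json")    # placeholder path
config.model_path = "/models/wizard-vicuna-13b/model.safetensors"  # placeholder path

model = ExLlama(config)                                            # weights loaded once
tokenizer = ExLlamaTokenizer("/models/wizard-vicuna-13b/tokenizer.model")

# One cache + generator per conversation; they all reference the same model.
sessions = {}
for session_id in ("chat_a", "chat_b"):
    cache = ExLlamaCache(model)                                    # VRAM allocated up front
    sessions[session_id] = ExLlamaGenerator(model, tokenizer, cache)

print(sessions["chat_a"].generate_simple("Hello! Who are you?", max_new_tokens=64))
print(sessions["chat_b"].generate_simple("Write a haiku about GPUs.", max_new_tokens=64))
```

Each cache holds only its own conversation's state, so switching between sessions never touches the others, at the cost of allocating each cache's VRAM up front.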
@epicfilemcnulty Check out my fork for some reference code for getting an API set up~ What I have is very bare-bones, but it should be pretty easy to understand and build on top of! I'm just using the generate_simple function and passing in the entire conversation as the prompt on each inference. This lets you have different conversations with different users at the same time.
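For reference, a minimal sketch of that kind of stateless endpoint (not the actual fork, just an illustration): it assumes a `generator` built as in the previous snippet, and the route and field names are made up.

```python
# Minimal stateless endpoint: the client sends the whole formatted conversation
# as one prompt string, and the server just runs generate_simple on it.
# Assumes `generator` was constructed as in the earlier sketch.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str                   # full conversation, formatted by the client
    max_new_tokens: int = 200

@app.post("/generate")            # hypothetical route name
def generate(req: GenerateRequest):
    text = generator.generate_simple(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": text}
```

Since each request carries its own history, clients can simply take turns against the same loaded model (subject to the threading caveat mentioned above).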
Closing this issue since it's resolved; feel free to continue API-related discussion in another issue.