Error when trying to run Wizard-Vicuna-13B-Uncensored-GPTQ #12

Has anyone seen an error like this before?

Comments
The last part is due to an incorrect function reference in the C++ extension. I've pushed a fix, but it's an inconsequential bug: it only prevents the extension from freeing the CUDA buffers when the application is exiting, and they would be freed automatically anyway. The actual error you're getting is from the safetensors library. From the looks of it, the file is just corrupt. But you never know. Do you have a download link for that specific model?
Figured it out; the problem was with how I fetched the model. Quick question: there is no UI/API for this yet, right? I was considering using this in a project, but I understand it's at an extremely early stage. To scale adoption, I think at a bare minimum there should be some API endpoints for interacting with exllama for inference. Any idea how you're prioritizing this? I would love to hop on and help if this is a little far out.
I'm working on a little UI that's mostly just going to serve as a demo (and so ChatGPT can teach me Javascript). It's coming along, and maybe I'll push something tomorrow. As for an API, it's still a little early, I think. I want to package it as a library at some point, but I'm hesitant right now; it feels like there are still a thousand little things I need to look at before I can give it a version number. I think once I've got a little UI up and running I'll also have a better sense of what the API should look like. In the meantime, @disarmyouwitha is working on a FastAPI interface, so maybe that's worth checking out.
Ah, thanks for the update. That's fair. I can see this being a useful project for serving GPTQ models in production via an API once we have commercially licensable models (like OpenLLaMA), but for now I think building for local use makes sense. Would love to contribute back with a basic API that emulates OpenAI, since I'll probably build that out for what I'm working on anyway. Thanks for the dope work here, btw -- inference speed is blazing fast running 2x exllama on a 3090 using a 13B model.

Heads up that running the benchmark fails, but chat works fine. This isn't blocking my prototyping right now, so I haven't looked deeper, but raising it here.
@turboderp Speaking of APIs, is there a way to keep state entirely in the prompt string and disable the internal cache? As far as I understand, right now, due to the internal cache and internal state, you can only have one conversation with a loaded model. It would be nice to be able to have different chats with different initial prompts using a single instance of the model loaded into memory. For example, the OpenAI chat API endpoint assumes that the client keeps the conversation history itself and provides the whole chat history with every request. Of course, adding support for keeping multiple internal states would probably be even better, but I guess that would take a lot more work than just providing an option to disable the internal state... BTW, thanks for your work -- inference results with 33B models on my 4090 are very impressive!
The cache is kind of essential for the speed. I'm sure OpenAI does caching transparently on their end somehow. I could see ways to go about that, like managing a dictionary of caches for a certain number of (most recent) prompt texts, but it would be complex, and you definitely don't want to end up swapping the cache tensors in and out of VRAM all the time.

I haven't tested the implementation on multiple threads at once, but at least alternating between inferences shouldn't be a problem. There might be an oversight somewhere, but the cache is supposed to be passed along with the hidden states throughout the forward pass, meaning the ExLlama class should be stateless. That's the idea, anyway. It builds from a config object, which gets embedded and kept, but the running state lives in the cache, which is kept by the generator, not the model.

So if you create multiple instances of ExLlamaGenerator, each referencing its own ExLlamaCache, they should all be able to use the same model, even taking turns on individual tokens. Of course you need to allocate the memory up front for each cache, but I personally prefer that approach over the risk of running out of memory down the line. And it's a lot faster to update the cache in-place instead of concatenating the way HF does, plus it saves gigabytes of VRAM.

You could alternatively just create a new generator every time you want to switch to a new context, or (probably better) you could keep a single cache and just reset it in place.
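For what it's worth, here is a rough sketch of that multiple-generator setup. It is not the project's documented API, just an illustration assuming the class names and generate_simple signature used in the repo's example scripts; all paths and session names below are placeholders.

```python
# Speculative sketch: one ExLlama model shared by several generators,
# each holding its own cache (i.e. its own conversation state).
from model import ExLlama, ExLlamaCache, ExLlamaConfig    # assumed module layout
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("/models/wizard-vicuna-13b/config.json")    # placeholder path
config.model_path = "/models/wizard-vicuna-13b/model.safetensors"  # placeholder path

model = ExLlama(config)                                            # weights loaded once
tokenizer = ExLlamaTokenizer("/models/wizard-vicuna-13b/tokenizer.model")

# One cache + generator per conversation; they all reference the same model.
sessions = {}
for session_id in ("chat_a", "chat_b"):
    cache = ExLlamaCache(model)                                    # VRAM allocated up front
    sessions[session_id] = ExLlamaGenerator(model, tokenizer, cache)

print(sessions["chat_a"].generate_simple("Hello! Who are you?", max_new_tokens=64))
print(sessions["chat_b"].generate_simple("Write a haiku about GPUs.", max_new_tokens=64))
```

Each cache holds only its own conversation's state, so switching between sessions never touches the others, at the cost of allocating each cache's VRAM up front.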
@epicfilemcnulty Check out my fork for some reference code for getting an API set up~ What I have is very bare-bones, but it should be pretty easy to understand and build on top of! I'm just using the generate_simple function and passing in the entire conversation as the prompt on each inference. This lets you have different conversations with different users at the same time.
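For reference, a minimal sketch of that kind of stateless endpoint (not the actual fork, just an illustration): it assumes a `generator` built as in the previous snippet, and the route and field names are made up.

```python
# Minimal stateless endpoint: the client sends the whole formatted conversation
# as one prompt string, and the server just runs generate_simple on it.
# Assumes `generator` was constructed as in the earlier sketch.
from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class GenerateRequest(BaseModel):
    prompt: str                   # full conversation, formatted by the client
    max_new_tokens: int = 200

@app.post("/generate")            # hypothetical route name
def generate(req: GenerateRequest):
    text = generator.generate_simple(req.prompt, max_new_tokens=req.max_new_tokens)
    return {"text": text}
```

Since each request carries its own history, clients can simply take turns against the same loaded model (subject to the threading caveat mentioned above).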
Closing this issue since it's resolved; feel free to continue API-related discussion in another issue.