test_benchmark_inference.py: I tried to change

    ids = torch.randint(0, 31999, (1, max_seq_len - gen_tokens)).cuda()

to

    ids = torch.randint(0, 31999, (2, max_seq_len - gen_tokens)).cuda()

and the following error was reported:

    Traceback (most recent call last):
      File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 168, in <module>
        logits = timer("Warmup", lambda: next_logits(ids, lora))
      File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 56, in timer
        ret = func()
      File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 168, in <lambda>
        logits = timer("Warmup", lambda: next_logits(ids, lora))
      File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 44, in next_logits
        n_logits = model.forward(input_ids, cache, last_id_only, lora=apply_lora, input_mask=input_mask)
      File "/root/exllama/exllama_231009/model.py", line 972, in forward
        r = self._forward(input_ids[:, chunk_begin : chunk_end],
      File "/root/exllama/exllama_231009/model.py", line 1058, in _forward
        hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
      File "/root/exllama/exllama_231009/model.py", line 536, in forward
        hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
      File "/root/exllama/exllama_231009/model.py", line 440, in forward
        new_keys = cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    RuntimeError: start (0) + length (2) exceeds dimension size (1).
I want to test the effect of GPTQ inference when the batch size is greater than 1. Is there any way to do this?
Yes, you'd want to specify the batch size when creating the cache. Change line 137 like so:
cache = ExLlamaCache(model, batch_size = 2)
Note that depending on the model this may use a lot more VRAM, so you might need to reduce the sequence length accordingly.
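For reference, here is a minimal sketch of the two edits together. It assumes the surrounding setup from test_benchmark_inference.py (model, max_seq_len, gen_tokens, lora, and the next_logits helper) is already in place as in the script; the bsz variable is only introduced here for illustration.

    # Batch size for the benchmark run (introduced here for illustration);
    # keep it consistent between the cache and the input ids.
    bsz = 2

    # Line 137: allocate the cache with the larger batch size so its
    # key/value tensors have room for every sequence in the batch.
    cache = ExLlamaCache(model, batch_size = bsz)

    # Build a batched prompt of random token ids with a matching first
    # dimension (the line that was changed from (1, ...) to (2, ...)).
    ids = torch.randint(0, 31999, (bsz, max_seq_len - gen_tokens)).cuda()

    # Warmup / benchmark call as in the script; the forward pass can now
    # narrow the cache along the batch dimension without overflowing it.
    logits = next_logits(ids, lora)

With the cache sized to match, the narrow(0, 0, bsz) call in model.py no longer exceeds the cache's batch dimension, and the benchmark runs with both sequences in parallel.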