
Does the benchmark support batch size > 1? #304

Closed
deltaguo opened this issue Oct 9, 2023 · 1 comment

Comments

deltaguo commented Oct 9, 2023

In test_benchmark_inference.py, I tried to change

ids = torch.randint(0, 31999, (1, max_seq_len - gen_tokens)).cuda()

to

ids = torch.randint(0, 31999, (2, max_seq_len - gen_tokens)).cuda()

and the following error was reported:

Traceback (most recent call last):
  File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 168, in <module>
    logits = timer("Warmup", lambda: next_logits(ids, lora))
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 56, in timer
    ret = func()
          ^^^^^^
  File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 168, in <lambda>
    logits = timer("Warmup", lambda: next_logits(ids, lora))
                                     ^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 44, in next_logits
    n_logits = model.forward(input_ids, cache, last_id_only, lora=apply_lora, input_mask=input_mask)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/model.py", line 972, in forward
    r = self._forward(input_ids[:, chunk_begin : chunk_end],
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/model.py", line 1058, in _forward
    hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/model.py", line 536, in forward
    hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/exllama/exllama_231009/model.py", line 440, in forward
    new_keys = cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (0) + length (2) exceeds dimension size (1).

I want to test GPTQ performance with batch size > 1. Is there a way to do this?

turboderp (Owner) commented

Yes, you'd want to specify the batch size when creating the cache. Change line 137 like so:

cache = ExLlamaCache(model, batch_size = 2)

Note that depending on the model this may use a lot more VRAM, so you might need to reduce the sequence length accordingly.
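For reference, a minimal sketch of how the two changes fit together. This assumes the rest of the benchmark script is unchanged; max_seq_len, gen_tokens, timer, next_logits, and lora are names taken from test_benchmark_inference.py, and the exact line numbers may differ between versions.

batch_size = 2

# Around line 137: create the cache with a matching batch size so the
# key/value tensors have room for every sequence in the batch.
cache = ExLlamaCache(model, batch_size = batch_size)

# Build the benchmark input with the same batch dimension (this is the
# line the original post modified).
ids = torch.randint(0, 31999, (batch_size, max_seq_len - gen_tokens)).cuda()

# Warmup pass, as in the script.
logits = timer("Warmup", lambda: next_logits(ids, lora))

Since the cache grows roughly linearly with batch size, reducing max_seq_len is one way to keep VRAM usage in check.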
