test_benchmark_inference.py: I tried to change

    ids = torch.randint(0, 31999, (1, max_seq_len - gen_tokens)).cuda()

to

    ids = torch.randint(0, 31999, (2, max_seq_len - gen_tokens)).cuda()

and the following error was reported:

    Traceback (most recent call last):
      File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 168, in <module>
        logits = timer("Warmup", lambda: next_logits(ids, lora))
      File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 56, in timer
        ret = func()
      File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 168, in <lambda>
        logits = timer("Warmup", lambda: next_logits(ids, lora))
      File "/root/exllama/exllama_231009/test_benchmark_inference.py", line 44, in next_logits
        n_logits = model.forward(input_ids, cache, last_id_only, lora=apply_lora, input_mask=input_mask)
      File "/root/exllama/exllama_231009/model.py", line 972, in forward
        r = self._forward(input_ids[:, chunk_begin : chunk_end],
      File "/root/exllama/exllama_231009/model.py", line 1058, in _forward
        hidden_states = decoder_layer.forward(hidden_states, cache, buffers[device], lora)
      File "/root/exllama/exllama_231009/model.py", line 536, in forward
        hidden_states = self.self_attn.forward(hidden_states, cache, buffer, lora)
      File "/root/exllama/exllama_231009/model.py", line 440, in forward
        new_keys = cache.key_states[self.index].narrow(2, past_len, q_len).narrow(0, 0, bsz)
    RuntimeError: start (0) + length (2) exceeds dimension size (1).
I want to test the effect of GPTQ inference when the batch size is greater than 1. Is there any way to do this?
Yes, you'd want to specify the batch size when creating the cache. Change line 137 like so:
cache = ExLlamaCache(model, batch_size = 2)
Note that depending on the model this may use a lot more VRAM, so you might need to reduce the sequence length accordingly.
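For reference, here is a minimal sketch of the two edits together. It assumes the surrounding setup from test_benchmark_inference.py (model, max_seq_len, gen_tokens, lora, and the next_logits helper) is already in place as in the script; the bsz variable is only introduced here for illustration.

    # Batch size for the benchmark run (introduced here for illustration);
    # keep it consistent between the cache and the input ids.
    bsz = 2

    # Line 137: allocate the cache with the larger batch size so its
    # key/value tensors have room for every sequence in the batch.
    cache = ExLlamaCache(model, batch_size = bsz)

    # Build a batched prompt of random token ids with a matching first
    # dimension (the line that was changed from (1, ...) to (2, ...)).
    ids = torch.randint(0, 31999, (bsz, max_seq_len - gen_tokens)).cuda()

    # Warmup / benchmark call as in the script; the forward pass can now
    # narrow the cache along the batch dimension without overflowing it.
    logits = next_logits(ids, lora)

With the cache sized to match, the narrow(0, 0, bsz) call in model.py no longer exceeds the cache's batch dimension, and the benchmark runs with both sequences in parallel.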