
Inference with 2 GPUs #69

Open
UranusSeven opened this issue Mar 19, 2023 · 4 comments · May be fixed by #74

Comments

@UranusSeven

Hi everyone,

I got the following exception when running generate.py with 2 GPUs:

Traceback (most recent call last):
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/gradio/blocks.py", line 1069, in process_api
    result = await self.call_function(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/gradio/blocks.py", line 878, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/wangzhong/repos/alpaca-lora/generate.py", line 103, in evaluate
    generation_output = model.generate(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/generation/utils.py", line 1490, in generate
    return self.beam_search(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/generation/utils.py", line 2749, in beam_search
    outputs = self(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/peft/tuners/lora.py", line 522, in forward
    result = super().forward(x)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 317, in forward
    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1698, in transform
    prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

I have tried the solution in #21, but it puts all of the workload on a single GPU.

I changed the code a little and here's the diff:

[The diff was shared as a screenshot and is not reproduced here.]
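As rough context only (not necessarily what the screenshot changed), a minimal sketch of how an 8-bit LLaMA base model plus a LoRA adapter is typically loaded across two GPUs via accelerate's device_map; the model and adapter names below are placeholders:

```python
# Hedged sketch only -- not the exact diff from the screenshot.
# Assumes transformers, peft, bitsandbytes and accelerate are installed;
# model/adapter names are placeholders.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = "decapoda-research/llama-7b-hf"   # placeholder base model
adapter = "tloen/alpaca-lora-7b"         # placeholder LoRA adapter

tokenizer = LlamaTokenizer.from_pretrained(base)
model = LlamaForCausalLM.from_pretrained(
    base,
    load_in_8bit=True,                    # 8-bit weights via bitsandbytes
    torch_dtype=torch.float16,
    device_map="auto",                    # let accelerate shard layers across GPUs
    max_memory={0: "10GiB", 1: "10GiB"},  # example per-GPU budgets to force a 2-GPU split
)
model = PeftModel.from_pretrained(model, adapter, torch_dtype=torch.float16)
model.eval()
```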

@arnocandel linked a pull request on Mar 19, 2023 that will close this issue
@T-Atlas
Contributor

T-Atlas commented Mar 25, 2023

I'd like to ask: I found that running inference with the model is much slower than training it. Is this normal? Are there any tricks to speed it up?

@lurenlym

lurenlym commented Mar 26, 2023

I'd like to ask: I found that running inference with the model is much slower than training it. Is this normal? Are there any tricks to speed it up?

Same question.

@Sowhat007

Sowhat007 commented Mar 29, 2023

Maybe you are running inference on the CPU. Switch to the right environment and try this in Python:
import torch
torch.cuda.is_available()
If it returns False, you are doing inference on the CPU. This is caused by mismatched PyTorch and CUDA versions. I solved the problem by reinstalling PyTorch following this guide (since I'm using CUDA 11.7):
https://pytorch.org/get-started/pytorch-2.0/
Hope it helps.
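A slightly fuller version of that check (plain PyTorch, nothing specific to this repo):

```python
# Sanity check that PyTorch can actually see the GPUs.
import torch

print(torch.__version__)          # e.g. a +cu117 build if it was installed for CUDA 11.7
print(torch.version.cuda)         # CUDA version PyTorch was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # must be True for GPU inference
print(torch.cuda.device_count())  # should report 2 on a 2-GPU machine
```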

@ManuelFay

During inference, the model runs N times sequentially to generate N tokens: each step has to wait for the previous step's output so the new token can be appended to the next step's input.
During training, the causal LM attention mask lets the model predict every next token at once, so the whole sequence is processed in a single forward pass. That means roughly N times fewer forward passes than generation (although training also adds a backward pass).
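A rough sketch of that difference, assuming a Hugging Face style causal-LM interface (the helper names here are illustrative, not from generate.py):

```python
import torch

def training_step(model, input_ids):
    # One forward pass over the whole sequence: the causal attention mask
    # lets every position predict its next token in parallel.
    out = model(input_ids=input_ids, labels=input_ids)
    out.loss.backward()  # plus one backward pass
    return out.loss

@torch.no_grad()
def generate_greedy(model, input_ids, max_new_tokens=32):
    # N sequential forward passes: each new token must wait for the previous one.
    for _ in range(max_new_tokens):
        logits = model(input_ids=input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```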
