
Inference with 2 GPUs #69

Open
UranusSeven opened this issue Mar 19, 2023 · 4 comments · May be fixed by #74

Comments

@UranusSeven

Hi everyone,

I got the following exception when running generate.py with 2 GPUs:

Traceback (most recent call last):
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/gradio/routes.py", line 393, in run_predict
    output = await app.get_blocks().process_api(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/gradio/blocks.py", line 1069, in process_api
    result = await self.call_function(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/gradio/blocks.py", line 878, in call_function
    prediction = await anyio.to_thread.run_sync(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/anyio/to_thread.py", line 31, in run_sync
    return await get_asynclib().run_sync_in_worker_thread(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 937, in run_sync_in_worker_thread
    return await future
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/anyio/_backends/_asyncio.py", line 867, in run
    result = context.run(func, *args)
  File "/home/wangzhong/repos/alpaca-lora/generate.py", line 103, in evaluate
    generation_output = model.generate(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/peft/peft_model.py", line 581, in generate
    outputs = self.base_model.generate(**kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
    return func(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/generation/utils.py", line 1490, in generate
    return self.beam_search(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/generation/utils.py", line 2749, in beam_search
    outputs = self(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 765, in forward
    outputs = self.model(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 614, in forward
    layer_outputs = decoder_layer(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 309, in forward
    hidden_states, self_attn_weights, present_key_value = self.self_attn(
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/transformers/models/llama/modeling_llama.py", line 209, in forward
    query_states = self.q_proj(hidden_states).view(bsz, q_len, self.num_heads, self.head_dim).transpose(1, 2)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/accelerate/hooks.py", line 165, in new_forward
    output = old_forward(*args, **kwargs)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/peft/tuners/lora.py", line 522, in forward
    result = super().forward(x)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/nn/modules.py", line 242, in forward
    out = bnb.matmul(x, self.weight, bias=self.bias, state=self.state)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 488, in matmul
    return MatMul8bitLt.apply(A, B, out, bias, state)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/torch/autograd/function.py", line 506, in apply
    return super().apply(*args, **kwargs)  # type: ignore[misc]
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/autograd/_functions.py", line 317, in forward
    state.CxB, state.SB = F.transform(state.CB, to_order=formatB)
  File "/home/wangzhong/miniconda3/envs/alpaca/lib/python3.9/site-packages/bitsandbytes/functional.py", line 1698, in transform
    prev_device = pre_call(A.device)
AttributeError: 'NoneType' object has no attribute 'device'

I have tried the solution in #21, but it puts all of the workload on a single GPU.

I changed the code a little and here's the diff:

[The diff was shared as a screenshot and is not reproduced here.]
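As rough context only (not necessarily what the screenshot changed), a minimal sketch of how an 8-bit LLaMA base model plus a LoRA adapter is typically loaded across two GPUs via accelerate's device_map; the model and adapter names below are placeholders:

```python
# Hedged sketch only -- not the exact diff from the screenshot.
# Assumes transformers, peft, bitsandbytes and accelerate are installed;
# model/adapter names are placeholders.
import torch
from transformers import LlamaForCausalLM, LlamaTokenizer
from peft import PeftModel

base = "decapoda-research/llama-7b-hf"   # placeholder base model
adapter = "tloen/alpaca-lora-7b"         # placeholder LoRA adapter

tokenizer = LlamaTokenizer.from_pretrained(base)
model = LlamaForCausalLM.from_pretrained(
    base,
    load_in_8bit=True,                    # 8-bit weights via bitsandbytes
    torch_dtype=torch.float16,
    device_map="auto",                    # let accelerate shard layers across GPUs
    max_memory={0: "10GiB", 1: "10GiB"},  # example per-GPU budgets to force a 2-GPU split
)
model = PeftModel.from_pretrained(model, adapter, torch_dtype=torch.float16)
model.eval()
```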

@arnocandel linked a pull request on Mar 19, 2023 that will close this issue
@T-Atlas
Contributor

T-Atlas commented Mar 25, 2023

I'd like to ask: I found that running inference with the model is much slower than training it. Is this normal? Are there any tricks to speed it up?

@lurenlym

lurenlym commented Mar 26, 2023

I'd like to ask: I found that running inference with the model is much slower than training it. Is this normal? Are there any tricks to speed it up?

Same question.

@Sowhat007

Sowhat007 commented Mar 29, 2023

Maybe you are running inference on the CPU. Switch to the right environment and try this in Python:
import torch
torch.cuda.is_available()
If it returns False, you are doing inference on the CPU. This is caused by mismatched PyTorch and CUDA versions. I solved the problem by reinstalling PyTorch following this guide (since I'm using CUDA 11.7):
https://pytorch.org/get-started/pytorch-2.0/
Hope it helps.
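A slightly fuller version of that check (plain PyTorch, nothing specific to this repo):

```python
# Sanity check that PyTorch can actually see the GPUs.
import torch

print(torch.__version__)          # e.g. a +cu117 build if it was installed for CUDA 11.7
print(torch.version.cuda)         # CUDA version PyTorch was built against (None for CPU-only builds)
print(torch.cuda.is_available())  # must be True for GPU inference
print(torch.cuda.device_count())  # should report 2 on a 2-GPU machine
```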

@ManuelFay

During inference, the model runs N times sequentially to generate N tokens: each step has to wait for the previous step's output so the new token can be appended to the next step's input.
During training, the causal LM attention mask lets the model predict every next token at once, so the whole sequence is processed in a single forward pass. That means roughly N times fewer forward passes than generation (although training also adds a backward pass).
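A rough sketch of that difference, assuming a Hugging Face style causal-LM interface (the helper names here are illustrative, not from generate.py):

```python
import torch

def training_step(model, input_ids):
    # One forward pass over the whole sequence: the causal attention mask
    # lets every position predict its next token in parallel.
    out = model(input_ids=input_ids, labels=input_ids)
    out.loss.backward()  # plus one backward pass
    return out.loss

@torch.no_grad()
def generate_greedy(model, input_ids, max_new_tokens=32):
    # N sequential forward passes: each new token must wait for the previous one.
    for _ in range(max_new_tokens):
        logits = model(input_ids=input_ids).logits
        next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)
    return input_ids
```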
