
Loading the model on multiple GPUs #46

Open
aamir-gmail opened this issue Apr 19, 2023 · 17 comments

Comments

@aamir-gmail

I have two 24GB 4090s. If possible, please provide an extra argument to demo.py to load the model either on the CPU or on two or more GPUs, and another argument to run in 16-bit to take advantage of the extra GPU RAM, instead of requiring edits to the config files.

@CyberTimon

I would also like to know how to do this.
I have 2x 3060 12GB cards, so I could load the 13B model, but it doesn't seem to be implemented.

@taomanwai

I have the same request.

@wJc-cn

wJc-cn commented May 6, 2023

I have the same request too.

@thcheung

thcheung commented Jun 6, 2023

1. Set the parameter device_map='auto' when loading the model with LlamaForCausalLM.from_pretrained().

2. Replace the line in demo.py with: chat = Chat(model, vis_processor, device='cuda')

It runs on two RTX 2080 Ti cards on my machine.
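A minimal sketch of step 1, assuming the LLaMA load in minigpt4/models/mini_gpt4.py looks roughly like the stock Hugging Face call below (the checkpoint path is a placeholder, not the repo's actual value):

import torch
from transformers import LlamaForCausalLM

# Sketch: device_map='auto' lets Accelerate shard the LLaMA weights across
# every visible GPU (spilling to CPU if needed) instead of pinning one device.
llama_model = LlamaForCausalLM.from_pretrained(
    "path/to/vicuna-13b",        # placeholder; use the checkpoint the repo config points to
    torch_dtype=torch.float16,
    device_map='auto',
)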

@sinsauzero

sinsauzero commented Jun 7, 2023

> Set the parameter device_map='auto' when loading the model with LlamaForCausalLM.from_pretrained().
> Replace the line in demo.py with: chat = Chat(model, vis_processor, device='cuda')
> It runs on two RTX 2080 Ti cards on my machine.

It seems the model is split across the two devices, but during inference tensors end up on both devices and it throws a device-mismatch error:
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

@thcheung

thcheung commented Jun 7, 2023

> Set the parameter device_map='auto' when loading the model with LlamaForCausalLM.from_pretrained().
> Replace the line in demo.py with: chat = Chat(model, vis_processor, device='cuda')
> It runs on two RTX 2080 Ti cards on my machine.
>
> It seems the model is split across the two devices, but during inference tensors end up on both devices and it throws a device-mismatch error: RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!

(1) Load the LLaMA model with device_map set to 'auto'. Change:

device_map={'': device_8bit}

to:

device_map = 'auto'

(2) Modify the line below, replacing 'cuda:{}'.format(args.gpu_id) with 'cuda'; tensors will then automatically be assigned to device 0 or device 1 if you have two devices. Change:

chat = Chat(model, vis_processor, device='cuda:{}'.format(args.gpu_id))

to:

chat = Chat(model, vis_processor, device='cuda')

(3) The .to(device) call can be removed from the line below, because LLaMA has already been loaded onto the GPUs automatically. Change:

model = model_cls.from_config(model_config).to('cuda:{}'.format(args.gpu_id))

to:

model = model_cls.from_config(model_config)

(4) When encoding the image, run the encoder on the CPU and then move the image embedding to the GPU. Change:

image_emb, _ = self.model.encode_img(image)
img_list.append(image_emb)

to:

image_emb, _ = self.model.encode_img(image.to('cpu'))
img_list.append(image_emb.to('cuda'))

The model should now work if you have multiple GPUs, each with limited memory.
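As a quick sanity check that the sharding actually happened, the placement Accelerate chose can be printed after the model is built in demo.py. hf_device_map is set by transformers whenever device_map is used; reaching it via model.llama_model follows the attribute name MiniGPT-4 uses for its language model, which is an assumption about the exact path:

import torch

# Sketch: confirm the LLaMA layers ended up spread over both GPUs.
# Expect entries like 'model.layers.0': 0 alongside 'model.layers.30': 1.
for module_name, device in model.llama_model.hf_device_map.items():
    print(f"{module_name} -> {device}")

# Rough per-GPU memory footprint, in bytes.
print(torch.cuda.memory_allocated(0), torch.cuda.memory_allocated(1))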

[image attachment]

@JainitBITW

I did all of these steps, but I still get:

Traceback (most recent call last):
  File "/home2/jainit/MiniGPT-4/demo.py", line 61, in <module>
    model = model_cls.from_config(model_config)
  File "/home2/jainit/MiniGPT-4/minigpt4/models/mini_gpt4.py", line 243, in from_config
    model = cls(
  File "/home2/jainit/MiniGPT-4/minigpt4/models/mini_gpt4.py", line 90, in __init__
    self.llama_model = LlamaForCausalLM.from_pretrained(
  File "/home2/jainit/torchy/lib/python3.9/site-packages/transformers/modeling_utils.py", line 2722, in from_pretrained
    max_memory = get_balanced_memory(
  File "/home2/jainit/torchy/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 731, in get_balanced_memory
    max_memory = get_max_memory(max_memory)
  File "/home2/jainit/torchy/lib/python3.9/site-packages/accelerate/utils/modeling.py", line 624, in get_max_memory
    _ = torch.tensor([0], device=i)
RuntimeError: CUDA error: out of memory
CUDA kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1.
Compile with TORCH_USE_CUDA_DSA to enable device-side assertions.
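For what it's worth, this OOM is raised while Accelerate probes every visible GPU (get_max_memory allocates a tiny tensor on each device), so a card that is already full or stuck trips it even though the model never touched it. Restarting the GPUs fixed it below; an alternative sketch, if a card needs to be capped or mostly avoided, is to hand from_pretrained an explicit max_memory budget (the sizes here are made-up examples):

import torch
from transformers import LlamaForCausalLM

# Sketch: cap what Accelerate may place on each GPU and allow CPU offload
# for whatever does not fit (budgets below are illustrative only).
llama_model = LlamaForCausalLM.from_pretrained(
    "path/to/vicuna-13b",                      # placeholder path
    torch_dtype=torch.float16,
    device_map='auto',
    max_memory={0: "10GiB", 1: "10GiB", "cpu": "30GiB"},
)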

@sushilkhadkaanon

@JainitBITW Is it working now for you?

@JainitBITW

Yes, I just restarted CUDA.

@sushilkhadkaanon

@JainitBITW Did you do anything apart from @thcheung's instructions?
Thanks anyway!

@JainitBITW

Nope, exactly the same.

@JainitBITW

What error are you getting?

@sushilkhadkaanon

I'm trying to run the 13B model on multiple GPUs. The authors have written that they currently don't support multi-GPU inference, so I want to be sure that inference on multiple GPUs is possible before provisioning the EC2 instance.

@JainitBITW

I think you can go ahead.

@sushilkhadkaanon

@JainitBITW @thcheung Thanks, it worked for me (8-bit). Any idea how to do it in 16-bit (low_resource = False)?
It is throwing this error:
RuntimeError: Input type (float) and bias type (c10::Half) should be the same

@daniellandau

> RuntimeError: Input type (float) and bias type (c10::Half) should be the same

I got past this error by setting vit_precision: "fp32" in minigpt_v2.yaml, but I didn't figure out what would be needed to make the input fp16 (half precision) as well, instead of making everything fp32.
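The mismatch means fp32 image tensors are hitting fp16 ViT weights. An untested alternative sketch that keeps the ViT in fp16 is to cast the image to the encoder's dtype before encoding; visual_encoder is the attribute name MiniGPT-4's model class appears to use, but that (and the exact call site inside Chat) is assumed here:

# Sketch, inside Chat's image-upload path: match the image dtype to the ViT weights
# instead of forcing vit_precision to "fp32".
vit_dtype = next(self.model.visual_encoder.parameters()).dtype   # e.g. torch.float16
image_emb, _ = self.model.encode_img(image.to(dtype=vit_dtype))
img_list.append(image_emb)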

@uiyo

uiyo commented Nov 13, 2023

My solution is:
CUDA_VISIBLE_DEVICES=1 python demo_v2.py --cfg-path eval_configs/minigptv2_eval.yaml --gpu-id 0
