
Inference with GPU took too much GPU RAM #16

Open
DungMinhDao opened this issue Jun 1, 2023 · 4 comments

Comments

@DungMinhDao

I tried running inference on GPU after making some modifications to the code:

llama/memory_pool.py:        self.sess = ort.InferenceSession(onnxfile, providers=['CUDAExecutionProvider'])

I found all the files that `import onnxruntime` and added `import torch` before it (to make sure all the necessary CUDA-related libraries were loaded). I also uninstalled onnxruntime and installed onnxruntime-gpu instead.
It ran fast, but it took 34GB of GPU memory to load the model. I tried lowering `--poolsize`, but the situation didn't change (and with `--poolsize` less than 10, some parts of the model couldn't be loaded onto either the GPU or the CPU).
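
Concretely, the session creation after my change looks roughly like this (the file name below is just a placeholder for the real onnxfile argument):

```python
# Sketch of the modification described above; "decoder.onnx" is a placeholder path.
import torch  # imported first so the CUDA libraries needed by onnxruntime-gpu are already loaded
import onnxruntime as ort

sess = ort.InferenceSession(
    "decoder.onnx",
    providers=["CUDAExecutionProvider"],  # run the graph on GPU instead of CPU
)
```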

@tpoisonooo (Owner) commented Jun 5, 2023

1B parameters needs ~4GB of memory in fp32 format, so the 34GB is roughly 26GB of llama weights plus extra overhead.

I guess that using fp16 mode to shrink the llama weights to ~13GB should bring it down to about 13GB + 6GB of overhead = 19GB.

Let me test poolsize < 10 on an A100 GPU later.
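
For reference, a back-of-the-envelope sketch of that estimate (assuming LLaMA-7B has roughly 6.7B parameters):

```python
# Rough weight-memory estimate behind the numbers above (assumes ~6.7B parameters).
params = 6.7e9
gib = 1024 ** 3

fp32_weights_gb = params * 4 / gib  # ~25-26 GiB in fp32 (4 bytes per parameter)
fp16_weights_gb = params * 2 / gib  # ~12-13 GiB in fp16 (2 bytes per parameter)

print(f"fp32 weights: ~{fp32_weights_gb:.1f} GiB, fp16 weights: ~{fp16_weights_gb:.1f} GiB")
```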

@tpoisonooo (Owner)

@DungMinhDao

@DungMinhDao (Author)

Thanks for replying. Somehow it still uses 34GB even after I switched to the fp16 branch of the HuggingFace model weights you linked to and specified ${FP16_ONNX_DIR}. Can you check whether memory_pool is implemented for fp16 usage on GPU, or tell me what command I should run to use the fp16 model on GPU? Many thanks.
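
In case it helps, this is the kind of generic onnx check I can run to confirm the files in ${FP16_ONNX_DIR} really contain fp16 weights (the file name is a placeholder):

```python
# Generic sanity check (not from this repo) that an exported ONNX file holds fp16 weights.
import onnx
from onnx import TensorProto

model = onnx.load("some_model_part.onnx")  # placeholder for a file under ${FP16_ONNX_DIR}
dtypes = {init.data_type for init in model.graph.initializer}
print("has fp16 initializers:", TensorProto.FLOAT16 in dtypes)
print("has fp32 initializers:", TensorProto.FLOAT in dtypes)
```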

@iamhere1 commented Jun 13, 2023

@tpoisonooo Thank you for your great work! However, I have the same problem as @DungMinhDao. I converted the model (7B) to fp16 using the tool script https://github.com/tpoisonooo/llama.onnx/blob/main/tools/convert-fp32-to-fp16.py. The model size is half that of the original fp32 model, but 32GB of memory is still not enough to load the fp16 model. Is there anything wrong?
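
For context, a generic fp32-to-fp16 ONNX conversion looks roughly like the sketch below, using onnxconverter_common (this may or may not match what tools/convert-fp32-to-fp16.py actually does, and the file names are placeholders):

```python
# Generic fp32 -> fp16 ONNX conversion sketch; not necessarily identical to the repo's script.
import onnx
from onnxconverter_common import float16

model_fp32 = onnx.load("model_fp32.onnx")
model_fp16 = float16.convert_float_to_float16(model_fp32, keep_io_types=True)
onnx.save(model_fp16, "model_fp16.onnx")
```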
