
7B ONNX model (float16) uses more than 32 GB of GPU memory #19

Open
iamhere1 opened this issue Jun 13, 2023 · 0 comments

Comments


iamhere1 commented Jun 13, 2023

Machine configuration: GPU with 32 GB of memory

Experiment steps:

  1. Merge the Chinese-Alpaca-Plus-7B model from here (https://github.com/ymcui/Chinese-LLaMA-Alpaca#%E6%A8%A1%E5%9E%8B%E4%B8%8B%E8%BD%BD) with the original LLaMA weights to obtain a new model, new_chinese;
  2. Using the method described in this project's documentation (https://github.com/tpoisonooo/llama.onnx), convert new_chinese to an ONNX model, new_chinese_onnx;
  3. Using the method described in the same documentation (https://github.com/tpoisonooo/llama.onnx), convert new_chinese_onnx to an fp16 model, new_chinese_onnx_fp16 (see the sketch after this list);
  4. Run python3 demo_llama.py:
    On GPU: the program aborts with an out-of-memory error before the model finishes loading;
    On CPU: the model loads successfully.
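
For reference, step 3 boils down to something like the sketch below. This assumes the commonly used onnxconverter-common package; the project's own conversion script may differ, and the paths are placeholders:

```python
import onnx
from onnxconverter_common import float16

# Load the exported fp32 ONNX model (placeholder path).
model = onnx.load("new_chinese_onnx/decoder.onnx")

# Cast fp32 weights and activations to fp16; keep_io_types leaves the
# graph inputs/outputs in fp32 so the calling code needs no changes.
model_fp16 = float16.convert_float_to_float16(model, keep_io_types=True)

# A 7B model exceeds protobuf's 2 GB limit, so the weights have to be
# stored as external data next to the .onnx file.
onnx.save_model(
    model_fp16,
    "new_chinese_onnx_fp16/decoder.onnx",
    save_as_external_data=True,
    all_tensors_to_one_file=True,
)
```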

Question:
Before the ONNX conversion, the model could be loaded into GPU memory and inference ran successfully.
After the conversion, loading fails with an out-of-memory error regardless of whether the model is further converted to fp16.
Is there any way to run inference with the converted ONNX model on the current 32 GB GPU?
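
For example, would capping the ONNX Runtime CUDA memory arena help? A sketch of what I have in mind (session options only; the model path is a placeholder):

```python
import onnxruntime as ort

sess_options = ort.SessionOptions()
# Disable the memory-pattern optimization so ORT does not pre-allocate
# large activation buffers up front.
sess_options.enable_mem_pattern = False

providers = [
    ("CUDAExecutionProvider", {
        # Grow the arena only by what each request needs instead of
        # doubling it, and cap total GPU allocations (in bytes).
        "arena_extend_strategy": "kSameAsRequested",
        "gpu_mem_limit": 30 * 1024 * 1024 * 1024,
    }),
    "CPUExecutionProvider",  # fallback for ops the CUDA EP does not support
]

session = ort.InferenceSession(
    "new_chinese_onnx_fp16/decoder.onnx",
    sess_options=sess_options,
    providers=providers,
)
```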
