For the latest Phi-3 demo, Chrome uses:
5.31 GB for the Renderer process and 4.16 GB for the GPU process, totaling almost 10 GB while running a ~2 GB model.
After the first inference, memory consumption jumps above 12 GB. That can't be normal.
Can you confirm you do not have any other tabs open? I can't see how this could be related to the application itself (not much is being rendered).
> 4.16 GB for GPU process

This makes sense, since it's (most likely) running in fp16 mode.
Can we make it run at lower precision if we use q4 quantization? (We can't, as ONNX doesn't support q4 compute yet.)
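A rough back-of-envelope sketch may help here. Assuming Phi-3-mini has ~3.8 B parameters (a figure from the model card, not stated in this thread), 4-bit storage does not translate into 4-bit memory use if the runtime dequantizes weights to fp16 for compute:

```python
# Back-of-envelope memory estimate.
# Assumptions (not from this thread): ~3.8e9 parameters,
# q4 = 4 bits/weight on disk, fp16 = 16 bits/weight in compute buffers.
PARAMS = 3.8e9

def to_gb(num_bytes: float) -> float:
    """Convert a byte count to GiB."""
    return num_bytes / 1024**3

q4_weights_gb = to_gb(PARAMS * 0.5)    # 4 bits = 0.5 bytes per weight
fp16_weights_gb = to_gb(PARAMS * 2.0)  # fp16 = 2 bytes per weight

print(f"q4 on disk:   ~{q4_weights_gb:.1f} GB")
print(f"fp16 buffers: ~{fp16_weights_gb:.1f} GB")
```

This ignores activations and the KV cache, so it is a lower bound, but it already shows why fp16 execution of a "2 GB" download can plausibly occupy several GB of GPU-process memory.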
To avoid any confusion, this is the downloaded model:
/Xenova/Phi-3-mini-4k-instruct_fp16/resolve/main/onnx/model_q4.onnx (838 MB)
/Xenova/Phi-3-mini-4k-instruct_fp16/resolve/main/onnx/model_q4.onnx_data (1454 MB)
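As a quick sanity check, the two file sizes above add up to roughly the "~2 GB model" mentioned earlier:

```python
# Sum of the two downloaded file sizes, using the MB values reported above.
onnx_mb = 838   # model_q4.onnx
data_mb = 1454  # model_q4.onnx_data

total_gb = (onnx_mb + data_mb) / 1024  # MB -> GB
print(f"total download: ~{total_gb:.2f} GB")
```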
Reproduction
Load the model and run a first inference.