Replies: 3 comments
-
Mixtral is roughly 50 GB of model weights (it's 8x7B), while the quantized version of Mistral 7B is three ~5 GB files, and that fits fine into the 24 GB of an RTX 4090. GGUF files are quantized models that use less memory. If you search for "quantized" in the discussions, there seem to be lots of requests for this feature but no acknowledgement.
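For anyone who wants to try the GGUF route outside of vLLM, here is a minimal sketch of loading a quant on the GPU with llama-cpp-python; the file path, quant level, and context size are placeholders for whatever you actually downloaded:

```python
from llama_cpp import Llama

# Sketch only: the GGUF path below is a placeholder for whichever
# quantized file you downloaded (e.g. a Q4_K_M quant).
llm = Llama(
    model_path="./mixtral-8x7b-instruct-v0.1.Q4_K_M.gguf",
    n_gpu_layers=-1,   # offload every layer to the 24 GB GPU
    n_ctx=4096,        # keep the context modest to limit KV-cache memory
)

out = llm("Q: What is GGUF? A:", max_tokens=64)
print(out["choices"][0]["text"])
```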
-
Did you manage to run it? I'm trying to run quantized models (e.g. from TheBloke) on my 2x RTX 4090 machine and I get OOM errors all the time.
-
Yes, I switched from vLLM to tabbyAPI, which runs models with EXL2 quantization. With this I was able to run multiple high-end models like Mixtral on a single RTX 4090. It even runs on Windows.
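In case it helps, tabbyAPI exposes an OpenAI-compatible endpoint, so once an EXL2 model is loaded something like the following sketch should work; the host, port, API key, and model name are placeholders for your own setup:

```python
import requests

# Placeholder values: adjust the host/port and API key to match your
# tabbyAPI configuration, and the model name to whatever it has loaded.
BASE_URL = "http://127.0.0.1:5000/v1"
API_KEY = "your-tabby-api-key"

resp = requests.post(
    f"{BASE_URL}/chat/completions",
    headers={"Authorization": f"Bearer {API_KEY}"},
    json={
        "model": "Mixtral-8x7B-exl2",
        "messages": [{"role": "user", "content": "Hello!"}],
        "max_tokens": 64,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])
```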
-
I'm able to run a Mixtral model with my RTX 4090 in the oobabooga text-generation-webui, using a GGUF Mixtral model (the Q4_K_M variant). With vLLM I'm trying to run different Mixtral models in vanilla and AWQ mode but always get different types of out-of-memory exceptions. Has anybody been able to run a Mixtral model in vLLM with an RTX 4090?
Thanks!
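For context, a minimal sketch of the kind of vLLM AWQ invocation described above; the repo name and the memory-related settings are illustrative assumptions, not a confirmed working configuration on a single RTX 4090:

```python
from vllm import LLM, SamplingParams

# Illustrative only: the AWQ repo and the memory-related settings are
# assumptions; whether this fits on a single 24 GB RTX 4090 is exactly
# the open question in this thread.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    dtype="half",                # AWQ kernels run in fp16
    max_model_len=4096,          # shorter context -> smaller KV cache
    gpu_memory_utilization=0.95,
    enforce_eager=True,          # skip CUDA graph capture to save memory
)

result = llm.generate(["Hello, Mixtral!"], SamplingParams(max_tokens=32))
print(result[0].outputs[0].text)
```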