Your current environment
How would you like to use Aphrodite?
I want to run this (https://huggingface.co/Qwen/Qwen1.5-14B-Chat).
I used the following command in ExLlamaV2 to convert the model to exl2 format at 8.0 bits per weight:
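Something along these lines, using ExLlamaV2's convert.py (the directory paths here are placeholders for my local ones):

```bash
# -i: source HF model dir, -o: scratch/working dir,
# -cf: output dir for the converted model, -b: target bits per weight
python convert.py -i ./Qwen1.5-14B-Chat -o ./work \
    -cf ./Qwen1.5-14B-Chat-8.0bpw-exl2 -b 8.0
```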
Then I serve the API using:
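Roughly the following (module path and flags written from memory, so treat them as approximate; the relevant parts are the exl2 quantization and the context length):

```bash
# Start Aphrodite's OpenAI-compatible server (default port 2242)
python -m aphrodite.endpoints.openai.api_server \
    --model ./Qwen1.5-14B-Chat-8.0bpw-exl2 \
    --quantization exl2 \
    --max-model-len 8192
```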
This uses all my VRAM on a 4090: 24212MiB / 24564MiB.
I notice in the log that the model itself only takes 14.07 GB, so it seems the KV cache takes a lot of VRAM, and I cannot set max-model-len above 8k. However, when I use TabbyAPI with the same exl2 model, I can comfortably use up to 20k context without issue. Is it by design that batching takes more VRAM, so less context can be used?
Another question is about temperature when making requests. Here is my request JSON to http://localhost:2242/v1/chat/completions:
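It is the standard OpenAI-compatible chat completion payload; the message content below is only an example, and the relevant field is temperature set to 0:

```json
{
  "model": "Qwen1.5-14B-Chat",
  "messages": [
    {"role": "user", "content": "Hello, who are you?"}
  ],
  "temperature": 0,
  "max_tokens": 512
}
```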
In my understanding, setting the temperature to 0 should produce the same, or at least very similar, responses each time. However, I am getting very different responses from the model. Is there another setting I should use if I want the response to be essentially identical every time I send the same input?