Description
OS
Linux
GPU Library
CUDA 12.x
Python version
3.10
Pytorch version
2.8.0+cu128
Model
gemma3 27b exl2
Describe the bug
Hello,
I'm a developer in South Korea using your framework, and I want to start by thanking you for building such an excellent library as ExLlamaV2.
I am currently using a combination of ExLlamaV2 (ExLlamaV2DynamicJob) with FastAPI and Redis to handle multiple concurrent user requests.
I've observed an issue where, even when the model's response to a user query is much shorter than max_new_tokens, generation appears to continue internally until max_new_tokens is reached before the End-of-Stream (EOS) status is finally reported.
This happens even though I have explicitly set stop_conditions. (For now, I'm working around it by forcibly closing the web connection when no data has been received for a set period, so that the next conversation can begin.)
Is there a recommended way to have the job report EOS as soon as the output is logically complete, rather than waiting until max_new_tokens has been fully generated?
Here is the code I use to create the job:
from exllamav2.generator import ExLlamaV2DynamicJob, ExLlamaV2Sampler

job = ExLlamaV2DynamicJob(
    input_ids = input_ids,                    # tokenized prompt tensor
    max_new_tokens = max_new_tokens,
    stop_conditions = get_stop_conditions(PROMPT_FORMAT, tokenizer),
    gen_settings = ExLlamaV2Sampler.Settings(),
    filter_prefer_eos = True,
    identifier = job_id,
)
The model I am using is Gemma 3.
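Since the model is Gemma 3, the stop conditions passed to the job are roughly the following. This is a sketch of what my get_stop_conditions helper produces for this prompt format; the exact token set comes from my own code, not from the library.

# Roughly what get_stop_conditions(PROMPT_FORMAT, tokenizer) returns here:
# the <eos> token id plus <end_of_turn>, which Gemma emits at the end of a turn.
stop_conditions = [
    tokenizer.eos_token_id,
    tokenizer.single_id("<end_of_turn>"),
]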
Thank you for your time and assistance!
Best regards.
Reproduction steps
.
Expected behavior
.
Logs
No response
Additional context
No response
Acknowledgements
- I have looked for similar issues before submitting this one.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.