Description
OS
Linux
GPU Library
CUDA 12.x
Python version
3.10
Pytorch version
2.8.0+cu128
Model
gemma3 27b exl2
Describe the bug
Hello,
I'm a developer in South Korea using your framework, and I want to start by thanking you for building such an excellent library as ExLlamaV2.
I am currently using a combination of ExLlamaV2 (ExLlamaV2DynamicJob) with FastAPI and Redis to handle multiple concurrent user requests.
I've observed an issue where, even when the model's response to a user query is much shorter than max_new_tokens, generation appears to continue internally until max_new_tokens is reached before the End-of-Stream (EOS) status is finally reported.
This happens even though I have explicitly set stop_conditions. (For now, I'm working around it by forcibly closing the web connection when no data has been received for a set period, so that the next conversation can begin.)
Is there a recommended way to have the job report EOS as soon as the output is logically complete, rather than waiting until max_new_tokens has been fully generated?
Here is the code I use to create the job:
from exllamav2.generator import ExLlamaV2DynamicJob, ExLlamaV2Sampler

job = ExLlamaV2DynamicJob(
    input_ids = input_ids,                    # tokenized prompt tensor
    max_new_tokens = max_new_tokens,
    stop_conditions = get_stop_conditions(PROMPT_FORMAT, tokenizer),
    gen_settings = ExLlamaV2Sampler.Settings(),
    filter_prefer_eos = True,
    identifier = job_id,
)
The model I am using is Gemma 3.
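Since the model is Gemma 3, the stop conditions passed to the job are roughly the following. This is a sketch of what my get_stop_conditions helper produces for this prompt format; the exact token set comes from my own code, not from the library.

# Roughly what get_stop_conditions(PROMPT_FORMAT, tokenizer) returns here:
# the <eos> token id plus <end_of_turn>, which Gemma emits at the end of a turn.
stop_conditions = [
    tokenizer.eos_token_id,
    tokenizer.single_id("<end_of_turn>"),
]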
Thank you for your time and assistance!
Best regards.
Reproduction steps
.
Expected behavior
.
Logs
No response
Additional context
No response
Acknowledgements
- I have looked for similar issues before submitting this one.
- I understand that the developers have lives and my issue will be answered when possible.
- I understand the developers of this program are human, and I will ask my questions politely.