Newest Unsloth version silently FORCES Qwen2VL tokenizer padding side to right in inference, while training is left #2138

@Nazzaroth2

I just migrated my code to my new dev server and noticed badly degraded OCR inference results with Qwen2VL.
I first suspected a mismatch between my older adapter and the newly uploaded qwen2 4bit quant (which was silently replaced 11 days ago; I address this in a different issue. This AT LEAST needs warnings in the future!).
But the degraded output stayed even after retraining the model on the same dataset with the new version of unsloth.

After further digging I now know the issue is the tokenizer padding side.

For some odd reason the tokenizer pads on the left during training, BUT the right side is forced during inference.

Here are the re-decoded input_ids that I get from unsloth/zoo 2025.2.15/2025.2.7:

'<|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|vision_end|>japanese OCR:\n<|im_end|>\n<|im_start|>assistant\n

And here is the same output for the newest version:
<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\n<|vision_start|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|image_pad|><|vision_end|>japanese OCR:\n<|im_end|>\n<|im_start|>assistant\n<|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|><|vision_pad|>
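For anyone who wants to check which case they are in without decoding and eyeballing the string, here is a minimal sketch. It only looks at where the pad token sits in one row of token ids. The helper name `detect_padding_side` is hypothetical (not part of unsloth or transformers), and the toy ids are made up; with a real batch you would presumably pass one row of `input_ids` as a list together with the tokenizer's `pad_token_id`.

```python
def detect_padding_side(input_ids, pad_token_id):
    """Return 'left', 'right', 'both', or 'none' for one row of token ids.

    Hypothetical helper: it only inspects the first and last token.
    """
    if not input_ids:
        return "none"
    starts_with_pad = input_ids[0] == pad_token_id
    ends_with_pad = input_ids[-1] == pad_token_id
    if starts_with_pad and ends_with_pad:
        return "both"
    if starts_with_pad:
        return "left"
    if ends_with_pad:
        return "right"
    return "none"


# Toy ids: 0 stands in for the pad token, the other numbers for prompt tokens.
PAD = 0
print(detect_padding_side([PAD, PAD, 11, 12, 13], PAD))  # left
print(detect_padding_side([11, 12, 13, PAD, PAD], PAD))  # right
```

The longest row in a batch has no padding at all and reports "none", so check one of the shorter rows.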

The most annoying part is that unsloth apparently forces the padding side internally now. I've set the padding side multiple times in my inference code with tokenizer.padding_side = "left", and right up until the model generates outputs the Python debugger reports a padding side of "left". But after the model.generate call, the tokenizer padding side is back to "right"?
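A rough way to make this overwrite visible in a script instead of in the debugger is sketched below. It is an assumption-laden workaround, not a fix: `watch_padding_side`, `DummyTokenizer` and `fake_generate` are all hypothetical names, and the dummy objects only simulate the behavior I am describing. The guard does not prevent anything from changing `padding_side` inside the block; it records the change and re-applies the requested side afterwards.

```python
from contextlib import contextmanager


@contextmanager
def watch_padding_side(tokenizer, side="left"):
    """Set tokenizer.padding_side, then report and undo any change made inside the block."""
    tokenizer.padding_side = side
    report = {"requested": side, "changed_to": None}
    try:
        yield report
    finally:
        if tokenizer.padding_side != side:
            # Something inside the block overwrote the user-specified value.
            report["changed_to"] = tokenizer.padding_side
        tokenizer.padding_side = side  # re-apply what the user asked for


class DummyTokenizer:
    """Stand-in for a real tokenizer; only carries the padding_side attribute."""
    padding_side = "right"


def fake_generate(tokenizer):
    """Simulates the reported behavior: the call flips padding_side back to 'right'."""
    tokenizer.padding_side = "right"


tok = DummyTokenizer()
with watch_padding_side(tok, "left") as report:
    fake_generate(tok)

print(report["changed_to"])  # right
print(tok.padding_side)      # left
```

With a real setup the `fake_generate(tok)` line would be the actual `model.generate(...)` call. Since padding is applied when the inputs are tokenized, the side has to be correct at that point, so the next batch should be tokenized only after the guard has re-applied "left".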

So yeah. We need 1) a consistent tokenizer padding side and 2) no overwriting of user-specified values.

I advocate for a consistent padding side of "left", as that keeps the token distance between the user input and the generated output always the same, while padding side "right" creates variable spacing between input and output.
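To illustrate the spacing argument, here is a small self-contained sketch with made-up token ids (0 as the pad id, two prompts of different length). `pad_batch` and `gap_before_generation` are hypothetical helpers written for this example only. They count how many pad tokens sit between the last real prompt token and the position where generated tokens would be appended.

```python
def pad_batch(sequences, pad_id, side):
    """Pad every sequence to the length of the longest one, on the given side."""
    if side not in ("left", "right"):
        raise ValueError("side must be 'left' or 'right'")
    width = max(len(seq) for seq in sequences)
    if side == "left":
        return [[pad_id] * (width - len(seq)) + seq for seq in sequences]
    return [seq + [pad_id] * (width - len(seq)) for seq in sequences]


def gap_before_generation(row, pad_id):
    """Count pad tokens between the last real prompt token and the end of the row."""
    gap = 0
    for token in reversed(row):
        if token != pad_id:
            break
        gap += 1
    return gap


PAD = 0
prompts = [[11, 12, 13, 14], [21, 22]]  # toy prompts of different length

left = pad_batch(prompts, PAD, "left")
right = pad_batch(prompts, PAD, "right")

print([gap_before_generation(row, PAD) for row in left])   # [0, 0]
print([gap_before_generation(row, PAD) for row in right])  # [0, 2]
```

With left padding every prompt ends directly where generation starts. With right padding the shorter prompt is separated from its output by a number of pad tokens that depends on the other prompts in the batch.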

Sorry that I am not going further and creating a PR. My git isn't yet quite set up for that.

Labels: bug, qwen-vl