Batch inference produces nonsense results for unsloth/mistral-7b-instruct-v0.2-bnb-4bit #267
Comments
@ziemowit-s I'll check this out! Sorry about the issue!
Don't worry, it's a relatively new library, so bugs are expected :)
Hey, just want to confirm: I have the exact same issue with my Llama model. Inference on single samples works fine, but it produces garbage on batches of multiple samples. I'm loading my model in bfloat16 without quantization.
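For context, a non-quantized bf16 load with FastLanguageModel looks roughly like the following - a minimal sketch reusing the from_pretrained arguments shown later in this thread, with a placeholder model name rather than the actual checkpoint:

import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "path/to/your-llama-model",  # placeholder, not the reporter's actual checkpoint
    max_seq_length = 2048,
    dtype = torch.bfloat16,  # full bf16 weights
    load_in_4bit = False,    # no 4-bit quantization
)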
@ziemowit-s @its5Q Apologies for the issues again :( Still debugging, so sorry about that!
Actually, I can confirm - batched inference is in fact breaking - I'm working on a fix ASAP - sorry for the wait, guys!
@ziemowit-s @its5Q Many apologies for the delay - I temporarily fixed it by disabling Unsloth's fast inference paths - it seems like I need to dig deeper into why this is happening :( Again, sorry for the inconvenience!
@ziemowit-s @its5Q I think I finally fixed it!!
Single inference is again faster; batched is at a similar speed for now.
Awesome, I'll test it as soon as I get to it.
Tried it myself and I'm getting the same weird output as before. One thing I've noticed is that the weird output only comes from the samples that are padded, while the longest prompt in the batch produces normal output. If all the samples in the batch are the same length in tokens, so that no padding is required, the model output for all samples is as expected.
@its5Q That's very weird :( For me it seems to work perfectly. I have an example if you can run this:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

inputs = [
    "Create a Python program using Pytorch to create a simple neural network for image classification.\n"\
    "You need to do the data preparation step, the training step, and the inference step as well.",
    "Create a Python program to compute all the primes.",
    "Write a long essay about happiness, and how to attain it. Provide clear markdown sections.",
    "20*20=?",
]

tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "left"

inputs = tokenizer(inputs, return_tensors = "pt", padding = True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 512, do_sample = False, use_cache = True)
decoded = tokenizer.batch_decode(outputs)

for text in decoded:
    print(text.replace(tokenizer.pad_token, ""))
    print("_" * 70)

You will get:
If you do them individually, I get:

Some differences via https://www.diffchecker.com/text-compare/ on the 1st one. No difference on the 2nd. The 3rd is very different to single decoding, but both are still coherent. And finally, 0 differences on the last one as well - the reasoning though is dumb lol
Also @its5Q you need to use padding_side = "left" or else the results will be wrong.
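For anyone skimming the thread, the two tokenizer lines from the snippet above are the important part - decoder-only models append new tokens on the right, so padding has to go on the left for every prompt to end at the generation boundary (here prompts stands for the list of input strings):

tokenizer.pad_token = tokenizer.unk_token   # Mistral's tokenizer defines no pad token by default
tokenizer.padding_side = "left"             # left-pad so generation continues from real tokens
batch = tokenizer(prompts, return_tensors = "pt", padding = True).to("cuda")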
Oh yeah, that's the problem, thanks. Now batched inference works as expected for me.
@its5Q I'm thinking about whether I can somehow default it to left, since people have said this has been an ongoing issue!
Wouldn't the difference be due to a random seed being calculated for each generation? In that case generations would differ even when comparing non-batched with non-batched.
I'm not an expert in the transformers/unsloth code, but couldn't you just add a line of code before generation to fix the seed?
@JIBSIL Oh, if you select do_sample = False (as in the snippet above), there is no sampling at all - decoding is greedy and deterministic, so a random seed doesn't come into it.
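As a side note (not from the thread): with do_sample = False the generate call is greedy and gives identical output run to run, so no seeding is involved; if you did want sampling, Transformers' set_seed makes it reproducible. A small sketch reusing model and inputs from the snippet above:

from transformers import set_seed

# Greedy decoding: deterministic, no randomness involved.
outputs = model.generate(**inputs, max_new_tokens = 512, do_sample = False, use_cache = True)

# Sampled decoding: only reproducible if the seed is fixed beforehand.
set_seed(42)
outputs = model.generate(**inputs, max_new_tokens = 512, do_sample = True, temperature = 0.7, use_cache = True)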
Ah, thanks for the clarification. However, in the newest release, I am encountering a different error:
Specifically using Gemma-7b. But as usual, Mistral works fine 🤣
@its5Q Whoops, you're correct! I decided to just run the notebook - I 100% finally fixed it now, oh lord, so sorry!!! The perils of supporting multiple models :(
Hi there,
after loading the model with:
I performed a batch inference:
The received answer is nonsensical. Since the batch consists of 3 elements and the second is the longest, that one is the only correct one; the other two are nonsense. When I shorten all the texts (to a maximum of 3000 characters), all the answers return to normal. It also works well when I run inference on each text in turn.
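The reporter's loading and generation snippets did not survive into this page, so here is a rough, purely illustrative sketch of a comparable batched-summarization setup (placeholder file names and prompt template; it includes the left-padding setup discussed in the comments above):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "left"  # required for batched generation with decoder-only models

# Three long documents of very different lengths (placeholder file names).
texts = [open(path).read() for path in ("doc1.txt", "doc2.txt", "doc3.txt")]
prompts = [f"[INST] Summarize the following text:\n{t} [/INST]" for t in texts]

batch = tokenizer(prompts, return_tensors = "pt", padding = True, truncation = True, max_length = 2048).to("cuda")
outputs = model.generate(**batch, max_new_tokens = 256, do_sample = False, use_cache = True)
for text in tokenizer.batch_decode(outputs, skip_special_tokens = True):
    print(text)
    print("_" * 70)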
texts.txt
nonsense_texts.txt
The texts used to generate the summaries are attached as texts.txt, and the nonsense answers are in nonsense_texts.txt (the 3 entries are separated by an <END> tag) so the problem can be reproduced. Below is an example of a nonsense answer: