Batch inference produces nonsense results for unsloth/mistral-7b-instruct-v0.2-bnb-4bit #267
Comments
@ziemowit-s I'll check this out! Sorry about the issue!
Don't worry, it's a relatively new library, so bugs are expected :)
Hey, just want to confirm: I have the exact same issue with my Llama model. Inference on single samples works fine, but it produces garbage on batches of multiple samples. I'm loading my model in bfloat16 without quantization.
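For context, a non-quantized bf16 load with FastLanguageModel looks roughly like the following - a minimal sketch reusing the from_pretrained arguments shown later in this thread, with a placeholder model name rather than the actual checkpoint:

import torch
from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "path/to/your-llama-model",  # placeholder, not the reporter's actual checkpoint
    max_seq_length = 2048,
    dtype = torch.bfloat16,  # full bf16 weights
    load_in_4bit = False,    # no 4-bit quantization
)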
@ziemowit-s @its5Q Apologies for the issues again :( Still debugging, so sorry about that!
Actually, I can confirm - batched inference is in fact breaking - I'm working on a fix ASAP - sorry for the wait, guys!
@ziemowit-s @its5Q Many apologies for the delay - I temporarily fixed it by disabling Unsloth's fast inference paths - it seems like I need to dig deeper into why this is happening :( Again, sorry for the inconvenience!
@ziemowit-s @its5Q I think I finally fixed it!!
Single inference is again faster; batched is at a similar speed for now.
Awesome, I'll test it as soon as I get to it.
Tried it myself and I'm getting the same weird output as before. One thing I've noticed is that the weird output only comes from the samples that are padded, while the longest prompt in the batch produces normal output. If all the samples in the batch are the same length in tokens, so that no padding is required, the model output for all samples is as expected.
@its5Q That's very weird :( For me it seems to work perfectly. I have an example if you can run this:

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

inputs = [
    "Create a Python program using Pytorch to create a simple neural network for image classification.\n"\
    "You need to do the data preparation step, the training step, and the inference step as well.",
    "Create a Python program to compute all the primes.",
    "Write a long essay about happiness, and how to attain it. Provide clear markdown sections.",
    "20*20=?",
]

tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "left"

inputs = tokenizer(inputs, return_tensors = "pt", padding = True).to("cuda")
outputs = model.generate(**inputs, max_new_tokens = 512, do_sample = False, use_cache = True)
decoded = tokenizer.batch_decode(outputs)

for text in decoded:
    print(text.replace(tokenizer.pad_token, ""))
    print("_" * 70)

You will get:
If you do them individually, I get:

Some differences via https://www.diffchecker.com/text-compare/ on the 1st one. No difference on the 2nd. The 3rd is very different to single decoding, but both are still coherent. And finally, 0 differences on the last one as well - the reasoning though is dumb lol
Also @its5Q you need to use padding_side = "left" or else the results will be wrong.
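For anyone skimming the thread, the two tokenizer lines from the snippet above are the important part - decoder-only models append new tokens on the right, so padding has to go on the left for every prompt to end at the generation boundary (here prompts stands for the list of input strings):

tokenizer.pad_token = tokenizer.unk_token   # Mistral's tokenizer defines no pad token by default
tokenizer.padding_side = "left"             # left-pad so generation continues from real tokens
batch = tokenizer(prompts, return_tensors = "pt", padding = True).to("cuda")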
Oh yeah, that's the problem, thanks. Now batched inference works as expected for me.
@its5Q I'm thinking about whether I can somehow default it to left, since people have said this has been an ongoing issue!
Wouldn't the difference be due to a random seed being calculated for each generation? In that case generations would differ even when comparing non-batched with non-batched.
I'm not an expert in the transformers/unsloth code, but couldn't you just add a line of code before generation to fix the seed?
@JIBSIL Oh, if you select do_sample = False (as in the snippet above), there is no sampling at all - decoding is greedy and deterministic, so a random seed doesn't come into it.
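As a side note (not from the thread): with do_sample = False the generate call is greedy and gives identical output run to run, so no seeding is involved; if you did want sampling, Transformers' set_seed makes it reproducible. A small sketch reusing model and inputs from the snippet above:

from transformers import set_seed

# Greedy decoding: deterministic, no randomness involved.
outputs = model.generate(**inputs, max_new_tokens = 512, do_sample = False, use_cache = True)

# Sampled decoding: only reproducible if the seed is fixed beforehand.
set_seed(42)
outputs = model.generate(**inputs, max_new_tokens = 512, do_sample = True, temperature = 0.7, use_cache = True)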
Ah, thanks for the clarification. However, in the newest release, I am encountering a different error:
Specifically using Gemma-7b. But as usual, Mistral works fine 🤣
@its5Q Whoops, you're correct! I decided to just run the notebook - I 100% finally fixed it now, oh lord, so sorry!!! The perils of supporting multiple models :(
Hi there,
after loading the model with:
I performed a batch inference:
The received answer is nonsensical. Since the batch consists of 3 elements and the second is the longest, that one is the only correct one; the other two are nonsense. When I shorten all the texts (to a maximum of 3000 characters), all the answers return to normal. It also works well when I run inference on each text in turn.
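The reporter's loading and generation snippets did not survive into this page, so here is a rough, purely illustrative sketch of a comparable batched-summarization setup (placeholder file names and prompt template; it includes the left-padding setup discussed in the comments above):

from unsloth import FastLanguageModel

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "unsloth/mistral-7b-instruct-v0.2-bnb-4bit",
    max_seq_length = 2048,
    dtype = None,
    load_in_4bit = True,
)

tokenizer.pad_token = tokenizer.unk_token
tokenizer.padding_side = "left"  # required for batched generation with decoder-only models

# Three long documents of very different lengths (placeholder file names).
texts = [open(path).read() for path in ("doc1.txt", "doc2.txt", "doc3.txt")]
prompts = [f"[INST] Summarize the following text:\n{t} [/INST]" for t in texts]

batch = tokenizer(prompts, return_tensors = "pt", padding = True, truncation = True, max_length = 2048).to("cuda")
outputs = model.generate(**batch, max_new_tokens = 256, do_sample = False, use_cache = True)
for text in tokenizer.batch_decode(outputs, skip_special_tokens = True):
    print(text)
    print("_" * 70)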
texts.txt
nonsense_texts.txt
The texts used to generate the summaries are attached as texts.txt, and the nonsense answers are in nonsense_texts.txt (the 3 entries are separated by an <END> tag) so the problem can be reproduced. Below is an example of a nonsense answer: