
Very poor output quality #47

Open
calebmor460 opened this issue Jun 12, 2023 · 55 comments

Comments

@calebmor460

I have noticed that while it massively increases inference speed, it massively decreases the quality of the outputs: instruct models become very obstinate and give completely irrelevant responses, words become misspelled, lines are repeated over and over, and it sometimes spams Chinese characters.

@turboderp
Owner

I haven't seen this at all. What model are you using? And what settings?

@calebmor460
Author

calebmor460 commented Jun 12, 2023

I tried Chronos 13B, WizardLM 13B, and Pygmalion 7B with temperatures between 0.5 and 1 and a context length of 2048. Lower temperatures do seem to wrangle it into behaving a little more, but I have to lower the temperature so much that the output is too “dry” to be useful. Using the same settings and models on normal GPTQ yields satisfactory results (albeit at unsatisfactory speed).

@turboderp
Owner

And just to be clear, is this in ExLlama's web UI or in Ooba?

@calebmor460
Author

calebmor460 commented Jun 12, 2023

Occam's fork of KoboldAI that allows using exllama.

Using GPTQ, said fork behaves normally.

@Panchovix
Contributor

Panchovix commented Jun 12, 2023

Not OP, but for context, the Kobold fork is here if you want to check it, turbo.

https://github.com/0cc4m/KoboldAI/tree/4bit-plugin (KoboldAI implementation to support GPTQ and exllama)

https://github.com/0cc4m/exllama (exllama fork on transformers branch, which builds exllama to work on Kobold)

They added Kobold samplers to that exllama fork.

So it seems these samplers are added

[screenshot of the Kobold sampler list]

(I'm not sure about rep pen slope though)

@turboderp
Owner

Okay. I really have enough work cut out for me with this, but I guess I should try installing Kobold at some point to see how they're using it. I would assume they're just taking the logits and passing them to the same samplers they use for other models, and that should just work. But there are some peculiarities to keep in mind, specifically regarding the cache, and that "context tokens" slider looks a little suspect. But idk.

@calebmor460
Author

Do you think maybe the code wasn't hooked up to the context correctly and it's actually running with an incredibly low context size?

@turboderp
Owner

I'm not sure what that slider does, but if it truncates the cache that would definitely lead to degenerate output since the position embeddings for cached entries would be wrong. But, looking at the Transformers wrapper they added I think it's just an issue with how the cache is being passed around. It has to stay in sync with the sequence for every forward pass.

E.g. if the model generates an EOS token, and their generator doesn't add that to the running sequence, it has to be removed from the cache. Or something similar along those lines. The cache being out of sync is the kind of thing which might leave it working poorly without crashing. But I'd have to install it and run it in a debugger to make sure. Which I will. After doing some other stuff first.
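For reference, a minimal sketch of the invariant described above: after each forward pass the cache should hold keys/values for exactly the tokens in the running sequence, minus the one just sampled. This is not Kobold's code; it just reuses the generator API that appears later in this thread, the model path is a placeholder, and the cache.current_seq_len attribute is my assumption about how ExLlamaCache tracks its length.

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import os, glob, torch

torch.set_grad_enabled(False)

model_directory = "/path/to/llama-7b-4bit"  # placeholder path
config = ExLlamaConfig(os.path.join(model_directory, "config.json"))
config.model_path = glob.glob(os.path.join(model_directory, "*.safetensors"))[0]
model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(os.path.join(model_directory, "tokenizer.model"))
generator = ExLlamaGenerator(model, tokenizer, cache)

ids = tokenizer.encode("Once upon a time,")
generator.gen_begin(ids)                  # resets the cache to match ids

for _ in range(64):
    token = generator.gen_single_token()  # appends to generator.sequence and the cache
    # Assumed invariant: the cache trails the sequence by exactly one token.
    assert cache.current_seq_len == generator.sequence.shape[-1] - 1
    if token.item() == tokenizer.eos_token_id:
        # If a frontend drops the EOS token from its copy of the sequence, the cached
        # entry for it has to be rolled back too, or the next pass runs out of sync.
        break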

@calebmor460
Author

Alright then, thank you for taking a look at it

@0cc4m

0cc4m commented Jun 13, 2023

@turboderp Apologies for this, this should have gone to me directly.

I do use the KoboldAI samplers; here's the code if you're interested. It seems to work the first time or two you generate, but breaks afterwards. I'm not yet sure why. I do call generator.gen_begin(gen_in), which resets the cache as far as I know.

@turboderp
Owner

Yes, using the KoboldAI samplers is the obvious choice for integrating into Kobold, so that's great. There's nothing special about the logits, after all. In fact you should just be able to bypass ExLlamaGenerator altogether and call the forward pass directly.

I'm going to install the 4-bit branch and have a play with it later today. But I don't see anything immediately wrong with how you're using it. gen_begin() should indeed reset the cache (gen_begin_reuse() should work as well, and is much faster in some cases), and you're appending every token produced by the forward pass, so the cache should stay in sync with the sequence.

I'll have a look though. It shouldn't be too hard to spot if the cache and the sequence go out of sync somehow.

@0cc4m

0cc4m commented Jun 13, 2023

It's not yet that user-friendly to install: you need to clone the branch, run install_requirements.sh, and then install the exllama package into the conda env with ./commandline.sh, pip install git+https://github.com/0cc4m/exllama. Then you can run it with ./play.sh.

@calebmor460
Author

I can confirm my issue is no longer present after Occam's latest commit to his KoboldAI fork. Thank you very much for your help.

@0cc4m

0cc4m commented Jun 13, 2023

But... I didn't fix anything yet.

@turboderp
Owner

Me neither. I'm still struggling to get it to load a model. :)

@0cc4m

0cc4m commented Jun 13, 2023

@turboderp Let me know if you need help.

@blauzim

blauzim commented Jun 13, 2023

I'm seeing a similar degradation in output quality. It used to match AutoGPTQ output quite closely, but the latest releases seem to be producing different results. I can get back the previous quality by setting

ExLlamaConfig.fused_attn = False

Hope this helps chase things down.

@turboderp
Owner

Well, it's up and running. I was just using a model that didn't have any gptq_bits key in its config and I got stuck on why it wasn't being recognized. Kind of a lot going on in aiserver.py. Maybe you should refactor to less than 10k lines? ;) But it's fine now.

I had to skip the call to tpool.execute() in generate(), just calling model.core_generate() directly in order to debug in PyCharm, but I don't see that having any side effects in this case.

I'm just not seeing anything amiss. It's correctly resetting the cache on each pass, then generating one token at a time and the cache grows as it should, staying exactly one token behind the sequence, and there really isn't much else happening.

The output also looks reasonable. Just trying with 7B Llama, but with the storywriter preset it is telling me a very cute little story that doesn't seem to be degenerating with multiple passes. It does the thing that small models like to do where it starts repeating itself, but you can throw it off by adding in "Until suddenly..." or some such, and all that behaves as I'd expect.

If I swap out gen_begin with gen_begin_reuse, it even seems to be correctly reusing the cache and only re-evaluating the prompt from the first changed token, to further show that it's working. I'm not sure how useful that feature is in Kobold since you're not truncating the sequence in larger steps, so it would only accelerate things until the context is filled up. And prompt eval is really fast already, so idk.

But all in all... I can't find anything wrong at the moment.

@turboderp
Owner

turboderp commented Jun 13, 2023

The fused attention step is mathematically equivalent to the regular attention, but there might be slight differences related to numerical precision. Maybe if some of the sampling methods are extremely sensitive?

It would help if I could reproduce it. Exactly what model and settings are you using to make this happen?

@blauzim

blauzim commented Jun 13, 2023

Here's an adjusted snippet of the code - nothing too complicated. llama is a Python class which executes a prompt. I've had the same issue with multiple different models from TheBloke. It might just be a user issue with how I'm using the exllama code. I've set up my code to run with either exllama, AutoGPTQ, GPTQ-for-LLaMa, or llama.cpp, so I've been comparing them and noticed this difference / issue.

llama.model_path = "models/Nous-Hermes-13B-GPTQ"
llama.tokenizer_model_path = llama.model_path + "/tokenizer.model"
llama.model_config_path = llama.model_path + "/config.json"
llama.model_safetensors_path = llama.model_path + "/" + [x for x in os.listdir(llama.model_path) if x.endswith('.safetensors')][0]
llama.config = ExLlamaConfig(llama.model_config_path)
llama.config.model_path = llama.model_safetensors_path
# llama.config.fused_attn = False
llama.config.max_seq_len = 2048
llama.model = ExLlama(llama.config)
llama.cache = ExLlamaCache(llama.model)
llama.tokenizer = ExLlamaTokenizer(llama.tokenizer_model_path)
llama.generator = ExLlamaGenerator(llama.model, llama.tokenizer, llama.cache)
llama.generator.settings.token_repetition_penalty_max = 1.2


with torch.no_grad():
    # torch.manual_seed(42)
    llama.generator.end_beam_search()

    ids = llama.generator.tokenizer.encode(prompt)
    #llama.generator.gen_begin(ids)
    llama.generator.gen_begin_reuse(ids)

    for i in range(request.max_tokens):
        token = llama.generator.gen_single_token()
        llama.generator.gen_prune_left
        if token.item() == llama.generator.tokenizer.eos_token_id: break
        for eos_token in stopping_criteria_list :
            if llama.generator.sequence_ends_with(eos_token) :
                break
    generated_ids = llama.generator.sequence[0][len(ids[0]):]
    generated_text = llama.generator.tokenizer.decode(generated_ids)

@turboderp
Owner

I'll have to try and see if I can reproduce it. One thing that stands out is the call to gen_prune_left() which I haven't looked at in ages. I think it's buggy when called during a beam search. Otherwise, calling it in the generation loop would continually reset the cache so performance would suffer a lot. Maybe it's just a copy/paste error?

Hermes is a model I haven't tested, though. I have found some finetunes to be strangely sensitive to rounding errors. I'll have to check that one out I guess.

@turboderp
Owner

I wrote a quick little script to try and spot any difference in the output between fused and regular attention:

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch
import os, glob

torch.set_grad_enabled(False)
torch.cuda._lazy_init()

model_directory =  "/mnt/str/models/_test_models/TheBloke_GPT4All-13B-snoozy-GPTQ/"
# model_directory =  "/mnt/str/models/llama-13b-4bit-128g/"

tokenizer_path = os.path.join(model_directory, "tokenizer.model")
model_config_path = os.path.join(model_directory, "config.json")
st_pattern = os.path.join(model_directory, "*.safetensors")
model_path = glob.glob(st_pattern)[0]

# Create config, model, tokenizer, generator

config = ExLlamaConfig(model_config_path)
config.model_path = model_path
model = ExLlama(config)
cache = ExLlamaCache(model)
tokenizer = ExLlamaTokenizer(tokenizer_path)
generator = ExLlamaGenerator(model, tokenizer, cache)
generator.disallow_tokens([tokenizer.eos_token_id])
generator.settings.token_repetition_penalty_max = 1.2
generator.settings.temperature = 0.6
generator.settings.top_p = 0.5

# Build a growing prompt

print ("")
print ("------------------- Regular attention --------------------")
print ("")

config.fused_attn = False
prompt = "Once upon a time,"
gen_tokens = 128
torch.manual_seed(69420)
for i in range(5): prompt = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(prompt)

print ("")
print ("------------------- Fused attention --------------------")
print ("")

config.fused_attn = True
prompt = "Once upon a time,"
gen_tokens = 128
torch.manual_seed(69420)
for i in range(5): prompt = generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(prompt)

This seems to consistently produce roughly the same output.

Now, I say roughly, but it's important to note that even with a fixed seed the implementation is always ever so slightly non-deterministic, which comes down to floating-point addition being non-associative and CUDA providing no guarantees about the order in which threads are launched. The difference is always small, but it's made a little larger by the use of FP16, where some other implementations use FP32, at least for intermediate results.

It's larger still in the fused attention because at the very end I've optimized away the addition of the residual connection by just doing the last matmul straight on top of the residual state. Mathematically that's the same thing, but it does change the order of additions quite a bit, which can lead to slightly different rounding behavior in the end.

Still, the differences are small in any case, and even though the generation happens in multiple steps, I'm just not seeing much divergence. And both are staying coherent, although that Hermes model really likes to write song lyrics for some reason. But it seems equally likely to do that with or without fused attention.
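As a small, self-contained illustration of the non-associativity mentioned above (a toy example, not code from the repo): two mathematically equivalent orderings of FP16 additions typically land on slightly different values.

import torch

x = torch.randn(4096, dtype = torch.float16)
residual = torch.randn(4096, dtype = torch.float16)

a = (x + residual).sum()      # add the residual first, then reduce
b = x.sum() + residual.sum()  # reduce first, then add
print(a.item(), b.item())     # mathematically equal, usually differ in the low-order bits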

@blauzim

blauzim commented Jun 13, 2023

Thanks, I ran the sample code you provided and it works cleanly. So it must be an issue in the code I'm using / how exllama is being called. The code is trying to be general across all the various GPTQ implementations, so it might have some cruft causing issues. I will do more testing and see if I can find out why.

@blauzim

blauzim commented Jun 14, 2023

Did some further digging. It seems to be related to creating the generator and tokenizer objects inside the "llama" class. When they are created at the top level it works, but when the exllama objects are created in a class it shows the fused-attention difference. Could it be some scoping issue? For the sample below it generates different creative texts, but when used for instruction following it produces very bad results when fused attention is on.

output :

Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64

------------------- With Fused Attention == True version --------------------

Once upon a time, the only way to get your hands on new music was by waiting for it to come out or finding an underground tape trading scene. Nowadays you can stream and download songs instantly from anywhere in world with just few clicks of mouse button!
The internet has also made sharing information about bands much easier than before – through social media sites like Facebook & Twitter as well blogs that cater specifically towards independent musicians (like this one). This makes discoverability so important because now anyone who wants access to their favorite band’s latest single without having any connection within industry gatekeepers such us record labels A&R people

Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64

------------------- With Fused Attention == False version --------------------

Once upon a time, the only way to get your hands on new music was by waiting for it to be released and then going out to buy […]
Filed Under: Entertainment Tagged With: Apple Music, Beats 1 Radio Station

code :

from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator
import torch
import os, glob

torch.set_grad_enabled(False)
torch.cuda._lazy_init()


######  set this to change from / to fused attention
_use_fused_attention = False

print ("")
print ("------------------- With Fused Attention == " + str(_use_fused_attention) + " version --------------------")
print ("")
model_directory =  "../../llama/nous-hermes-13b-gptq"
class llama_ex :
    model_directory :str | None = None
    tokenizer_model_path :str | None = None
    model_config_path :str | None = None
    model_safetensor_path :str | None = None
    n_ctx : int = 2048
    config : ExLlamaConfig  | None = None
    model : ExLlama | None = None
    cache : ExLlamaCache | None = None
    tokenizer : ExLlamaTokenizer | None = None
    generator : ExLlamaGenerator  | None = None

    def __init__(self, *args, **kwargs):
        for this_param in list(set(dir(self)) & set(kwargs.keys())) :
            setattr(self, this_param, kwargs[this_param])
        self.model_tokenizer_path = os.path.join(self.model_directory, "tokenizer.model")
        self.model_config_path = os.path.join(self.model_directory, "config.json")
        self.model_safetensors_path = os.path.join(self.model_directory, [x for x in os.listdir(self.model_directory) if x.endswith('.safetensors')][0])
        self.config = ExLlamaConfig(self.model_config_path)
        self.config.model_path = self.model_safetensors_path
        self.config.fused_attn = _use_fused_attention
        self.model = ExLlama(self.config)
        self.cache = ExLlamaCache(self.model)
        self.tokenizer = ExLlamaTokenizer(self.model_tokenizer_path)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)

llama = llama_ex(model_directory = model_directory)

prompt = "Once upon a time,"
llama.generator.settings.token_repetition_penalty_max = 1.5
llama.generator.settings.temperature = 0.5
llama.generator.settings.top_p = 0.1
llama.generator.settings.top_k = 40
gen_tokens = 128
torch.manual_seed(69420)

generated_text = llama.generator.generate_simple(prompt, max_new_tokens = gen_tokens)
print(generated_text)

@calebmor460
Author

So does this mean the fix has been rolled into the code, and if so, what files do I replace?

@turboderp
Owner

There isn't a fix, no, because I haven't been able to reproduce the problem yet. I'm working on a thorough perplexity test to run with all the different possible code paths, which should highlight if there are any significant differences in how the model evaluates depending on tuning parameters.

I know there are some numerical differences, at least, and it's possible that this divergence is just the result of the model ending up at a "tipping point" and then going down one path or another based on some small shift in the probabilities. That's not the same as poor output quality, though. There isn't a "correct" choice for any one token. So unless something is actually breaking and resulting in a broken probability distribution, what you really want is to avoid those tipping points in the first place.

I'll know more once these tests are set up. In the meantime you could try the new typical sampling feature, which does seem to produce more consistent results overall.

@calebmor460
Author

I will try that when I get a chance to, thank you

@QM60

QM60 commented Jun 17, 2023

For what it's worth, I've noticed output quality issues as well in Kobold, which I assumed was related to the sampling swap. However, I noticed similar issues with ooba's very recent exllama support, which doesn't touch exllama's native sampling.

One revealing thing. I was using Wizard-Vicuna-30b, which uses </s> as part of its prompt format. I noticed that I got "</s>" (as in the literal string, not the EOS token) creeping into the output, which never happened with normal transformers. This suggests that exllama is not interpreting </s> as a special token. If it doesn't check special_tokens_map.json, that would explain some things.
In addition, I had issues with very early/jarring EOS, and contraction fumbling (emitting words like can'm, don've, etc.), which is normally only an issue with GPTQ models that don't use desc_act. Neither happened with regular GPTQ. The early stopping may be a symptom of incorrect interpretation of </s> in the prompt, but I'm not sure if that's plausible for contraction fumbling.

@turboderp
Owner

turboderp commented Jun 17, 2023

Kobold doesn't use ExLlama's sampling, only logits from the model. Ooba does use the native sampling, though, as well as ExLlama's tokenizer which is just a straight SentencePiece instance reading the model file directly. I'll have to dig into the Transformers tokenizer to see if it does something special.

The special tokens map shouldn't lead to you seeing "</s>" in the output, especially when the file that defines that string isn't being read. I can look into some ways to take special_tokens_map.json into account, but it's going to be a little tricky when you have models on HF where that file looks like this:

{
  "bos_token": "</s>",
  "eos_token": "</s>",
  "pad_token": "[PAD]",
  "unk_token": "</s>"
}

The contractions are interesting, at least. Seems too oddly specific to not be a tokenizer issue, but I'm not sure what to make of it. I'll try to see if I can reproduce it. Have you seen it in Kobold too or just in Ooba?

@QM60

QM60 commented Jun 17, 2023

Seen it in both, but it's happening constantly in ooba, every other reply. It's very weird. It manifests in a few ways: just forgetting to finish (doesn'), finishing with a weird token (can'the), or cutting off the whole generation at a contraction (doesn'<EOS>). Again, the only other time I saw this was with 128g CUDA models without act-order, but it was rarer, and I assumed it was just quantization error. Fascinating that issues can manifest like this.

For the emitting </s> issue, I suspect this is happening because it's interpreted as a normal string (in the prompt from the chat history), causing the model to assume it should end generations with it. GPTQ parses it as EOS. This might be causing some issues, but I tried removing the EOS tokens from the prompt entirely, and the contraction glitches are still there. Weird.
I hear you on the weird model configs, although for models that expect EOS in the prompt (like those trained on vicuna 1.1 formats) I should hope they didn't do that.

For what it's worth, the initial exllama branch in Kobold (which was very early, before most of your optimizations, or even support for non-groupsize models) didn't have any generation bugs at all that I could detect.

@Panchovix
Contributor

Sorry for the question here, but the only samplers missing now are tfs (tail free sampling) and top_a, right?

@turboderp
Owner

@QM60 : Are you seeing it with different models or just a particular one? Is there one that does it more than others?

@turboderp
Owner

Okay, so I did a lot of in-depth testing and I did discover a bug in the handling of the cache during fused attention. It was a little extra sneaky because it doesn't manifest on 7B, and I've been testing a bunch on 7B because it's faster, and just validating on 13B and 33B, but not thoroughly enough to notice the difference.

Anyway, I don't know if this fixes all the issues, but it definitely improves the output on paper.

@EyeDeck
Contributor

EyeDeck commented Jun 17, 2023

Interesting, I was just investigating the "contraction disease" issue last night, because I could swear it only started showing up after I pulled a few days ago, so I was trying different commits to narrow down exactly when it started. I was going to spend more time on it today, but it looks like 1ef63db did indeed fix it.

For whatever little it's worth anymore, the narrowest I'd gotten it (before realizing I was dysfunctionally tired, heh), was

3c86994 (older): [output screenshot]

896da5d: [output screenshots]

dd63e07 (latest for comparison): [output screenshots]

So it's very plausible that b65d774 introduced it.

@QM60

QM60 commented Jun 17, 2023

Anyway, I don't know if this fixes all the issues, but it definitely improves the output on paper.

Massive improvement for me, the contraction issues are gone. The only remaining issue is the </s> one.
Which is interesting, because sentencepiece is deliberately designed NOT to read any control tokens (including EOS) from a normal text stream - which makes complete sense! But the normal llama tokenizer does parse </s> as one token, and vicuna-1.1 and derived models rely on seeing EOS in the prompt. Empirically, for Wizard-Vicuna-30b, omitting it from the prompt is not too bad, but makes it prone to emit very short or even empty replies fairly frequently.

Source on vicuna 1.1 prompt format, where they explicitly suggest using </s>:
https://github.com/lm-sys/FastChat/blob/7ae721fa3c881e1e24cf181305d127a316acd463/docs/vicuna_weights_version.md#example-prompt-weight-v11

Honestly, I could see the argument that llama's behavior is a bug, but that's what it is. No API lets you inject token IDs into a prompt, so people are used to embedding control tokens into text. I'm not sure if using plain sentencepiece instead of llamatokenizer will cause other issues though. Transformers seemed to have multiple tokenization bugs for llama which suggests something might be tricky about it.

@QM60

QM60 commented Jun 18, 2023

Alright, I celebrated too fast. Contractions are okay but there is still something off about the output from exllama. I suspect it's a sampling issue; perhaps related to sampling order? To identify the differences, I recommend using the sphinx-moth preset from ooba, which has very high temperature mediated by strict top-p and top-k. I think this is more susceptible to whatever the difference is.

Here is a simple comparison using ooba with a fixed seed, llama 30B (plain) quantized to 4 bit with act-order and no groupsize. Settings included. The difference in sanity speaks for itself, although I'm kinda fond of the exllama version's energy.

exllama comparison.txt

@turboderp
Owner

Well, the model is still FP16, regardless, whereas GPTQ-for-LLaMa can also be FP32 depending on how you use it. So it is more susceptible to numerical instability, even if it has no meaningful impact on perplexity.

I would question if sampling with a high temperature is really a good way to get more "varied" output anyway, compared to typical or tail-free sampling, or some such. It makes a lot of assumptions about how the model is supposed to act way outside of the conditions it was trained in, with some inherent assumptions about numerical accuracy. So getting it to work just right by tweaking magic numbers is a balancing act. You could try running without fused attention and MLP, or maybe try lowering the top-p threshold to see if that produces more equivalent output by cutting out some more of the noise on the tail-end of the probability distribution.

I'll have a look anyway a little later. There might be issues with the order of operations. We'll see. But generally speaking, to produce exactly the same output as other implementations in the extremes, I'd have to emulate those other implementations much more carefully, maybe even down to timings in the kernels to get CUDA threads to launch in the same order.

And of course FP32 would eat up a lot more VRAM. I have other ideas for improving accuracy, though, building on the LoRA support I'm currently working on.

@QM60

QM60 commented Jun 18, 2023

I was skeptical that FP32 would explain that kind of a difference in output, so I looked into it, and it's just a sampling difference. For exllama, top_p sampling inherits the base probabilities, from before top_k was applied. That means the same value of top_p is potentially a lot more strict in transformers, where probabilities are normalized at each step.

To confirm, I was able to get the same subjective level of output quality simply by adding one line after if top_p > 0.0: in generator.py to rescale probabilities.

top_probs = top_probs / torch.sum(top_probs, dim = -1)

This could explain some output differences reported here.
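A standalone sketch of that effect with toy numbers (this is not the actual generator.py code): applying the same top_p cutoff to probabilities that were not renormalized after top_k keeps more of the tail than Transformers does, since Transformers renormalizes between steps.

import torch

def top_k_then_top_p(probs, top_k, top_p, renormalize):
    top_probs, top_ids = torch.topk(probs, top_k)
    if renormalize:
        top_probs = top_probs / torch.sum(top_probs, dim = -1)  # the one-line fix
    cumsum = torch.cumsum(top_probs, dim = -1)
    num_kept = int((cumsum < top_p).sum().item()) + 1  # smallest prefix reaching top_p
    return top_ids[:num_kept]

probs = torch.tensor([0.25, 0.15, 0.12, 0.10, 0.08, 0.06, 0.06, 0.06, 0.06, 0.06])
print(top_k_then_top_p(probs, top_k = 5, top_p = 0.5, renormalize = False))  # keeps 3 tokens
print(top_k_then_top_p(probs, top_k = 5, top_p = 0.5, renormalize = True))   # keeps only 2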

@Panchovix
Contributor

I was skeptical that FP32 would explain that kind of a difference in output, so I looked into it, and it's just a sampling difference. For exllama, top_p sampling inherits the base probabilities, from before top_k was applied. That means the same value of top_p is potentially a lot more strict in transformers, where probabilities are normalized at each step.

To confirm, I was able to get the same subjective level of output quality simply by adding one line after if top_p > 0.0: in generator.py to rescale probabilities.

top_probs = top_probs / torch.sum(top_probs, dim = -1)

This could explain some output differences reported here.

Just tried this and indeed gets the same level of output quality. Nice catch!

@turboderp
Owner

Well... I'm not fond of making the sampling parameters interdependent in this way. But I guess it doesn't matter all that much since it's kind of all trial-and-error anyway, getting those parameters just right for subjectively satisfying output. So I've pushed an update to normalize the distribution after each sampler.

@Larryvrh

Larryvrh commented Jun 19, 2023

I wonder whether it would be possible to utilize the original decoding pipeline from Hugging Face Transformers to get more "aligned" result text, given that the logits are most likely similar?

@turboderp
Owner

turboderp commented Jun 19, 2023

It would be possible, I assume. KoboldAI does a similar thing to plug ExLlama's logits into the same samplers used for other implementations. I think it's much too limiting, though.

@QM60

QM60 commented Jun 21, 2023

Man... it's better, but I swear something's still off. I switched to Kobold just to rule out the tokenizer and samplers (which are the same there), but exllama still has seizures every so often. A few highlights:

  • She grins mischievouslyifizziq
  • every word and action soDD far
  • Wouldn't that be fun?"ований текст надо убрать
  • isn' readonly bad after all…

This doesn't seem likely to be a sampler issue, and I never saw things like that from GPTQ with the same model/settings. It's much better than before, but I'd still view the current forward pass with a bit of suspicion atm...

@turboderp
Owner

Are these still with extremely high temperature?

@turboderp
Owner

Suppressing the EOS token might be an instance where the model is extra sensitive to noise, or rounding errors from the FP16 math. If the model is 99.9% sure it wants to emit an EOS token, picking from the remaining 0.1% instead might be asking for trouble.

I'm thinking maybe it's more correct to lower the temperature by some amount proportional to the likelihood of the EOS token, or any other token that gets masked out. Maybe the distribution just ends up being really flat otherwise.

It's interesting that I've never really seen this myself, though.

@QM60

QM60 commented Jun 21, 2023

For what it's worth, I've never seen Kobold suppress EOS when called via its api. It doesn't do that for me; I have generations stop before max tokens have been reached quite often. And my results didn't come from "super high temperature", either, I believe it was about 0.7 with tfs 0.9. But still, sphinx-moth is a normal, popular preset, despite high temp, and as strict as it is with top-k and top-p, tiny FP16 rounding errors aren't likely to cause these kinds of issues.

Banning EOS can cause gibberish, although usually it's a different kind. But the thing is, 90% of gens feel completely normal, and the seizures are sudden and occur at random places in the middle of the text, not at natural stopping points. (The contraction is a clear example of that)

I have no idea how to debug this, though, if the results are nondeterministic due to concurrency. Except maybe fuzzing and comparing logits with transformers output within a tolerance. But that would only detect the bugs, not find them.

@turboderp
Owner

Fuzzing is probably the way to go. Perplexity tests aren't revealing anything out of the ordinary, but then they wouldn't if it's an intermittent thing that just corrupts a few numbers under some special conditions. That wouldn't affect the average over thousands of tokens.

I think I'll set up a script to run the same inference in ExLlama and GPTQ-for-LLaMa and look for anywhere the logits deviate considerably between the two. That should at least let me reproduce the error.

@QM60

QM60 commented Jun 23, 2023

For what it's worth, so far, ooba's exllama_hf adapter is giving me flawless output, or at least, indistinguishable from transformers. Not terribly surprising since I think it should have identical tokenizer and sampler behavior.

In summary, what I noticed:

  • Kobold's exllama = random seizures/outbursts, as mentioned
  • native exllama samplers = weird repetitiveness (even with sustain == -1), issues parsing special tokens in prompt
  • ooba's exllama HF adapter = perfect

The forward pass might be perfectly fine after all.

@turboderp
Owner

Well, the forward pass is mutating all the time, and there definitely were some issues with it that have been resolved.

The comparison between the HF samplers and the ones in ExLlama's generator is kind of apples-to-oranges. Without some concrete examples of generations that go "wrong" in ExLlama when they shouldn't, it's really hard to do anything about it. All I get out of that is basically "I tried strumming my bass the same way I strum my guitar, and it doesn't sound the same, so I guess my bass is broken." There could absolutely be bugs in the implementation, but it's impossible to find them based on a subjective sense of something being off.

I could just keep making it more and more identical to Transformers, but the whole point of the project wasn't to create a Transformers-compatible plugin for Ooba and Kobold, it was to build an alternative platform for experimenting with techniques that don't fit well in Transformers.

The tokenizer still shouldn't behave differently, though. With regards to BOS tokens, yes, but that's a separate issue of figuring out what to do about all the incorrect tokenizer configs floating around on HF.

@QM60

QM60 commented Jun 23, 2023

Was not complaining about samplers, just noting my experiences. I don't expect exllama's built-in samplers to be identical at all! If it needs new presets, that's fine. Swapping in known samplers is a very useful way to control for differences, though. And that's important, since it suggests that any complaints you're getting probably aren't due to bugs in the forward pass; at least, not anymore. That's good!

The tokenizer definitely does behave differently though, because it's just wrapping sentencepiece, which explicitly says it doesn't parse special tokens like </s> (by design). HF's tokenizer does parse them. That's not wrong per se. It might even be desirable for someone writing their own backend, since they could insert separator tokens manually, and know that users can't inject them. But it's different!
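For context, a small sketch of that tokenizer difference (it assumes a LLaMA tokenizer.model in the working directory; the exact piece split in the comment is only an example): plain SentencePiece encodes the literal text "</s>" as ordinary pieces and never emits the control token, whereas the Transformers LlamaTokenizer maps that string to the EOS token id.

import sentencepiece as spm

sp = spm.SentencePieceProcessor(model_file = "tokenizer.model")  # placeholder path to a LLaMA tokenizer
print(sp.encode("Hello</s>", out_type = str))  # e.g. ['▁Hello', '</', 's', '>'], no control token
print(sp.eos_id())                             # 2 for LLaMA, but never produced by encode()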

@turboderp
Owner

I'm not taking it as complaints, don't get me wrong. But I also don't want to doubt people when they say the output is bad, like in this case there was a bug causing the forward pass to run slightly incorrectly, and it's good that I found that. I also don't blame people for not having any other way to determine if the output is off than by comparing it to other implementations, because I don't either. It's a black box after all. Just sometimes I do wish it came more in the form of: "here's the output I got, here's (exactly) how I got it, and here's why I think it's wrong."

But yeah, it's not to be confrontational or anything, I just wish there was a better way of communicating that I don't view Transformers as the standard. Maybe I just need an FAQ. :)

As for the tokenizer, I do plan to add some support for special tokens. For models that rely on them it gets messy otherwise, if they have to be inserted after encoding and disappear when decoding.

@turboderp
Owner

So... how do things stand with the special tokens?

Having studied it a bit, it seems the authors of SentencePiece go out of their way to explain that control symbols are categorically invalid inputs to the encoder. Meanwhile, Transformers implements a very elaborate workaround so they can be encoded anyway. I imagine those two teams don't like each other very much.

Anyway, it wouldn't be too difficult to emulate what Transformers does, but it would be kind of messy so I'm wondering how many models actually include control symbols in their prompt format. Is it unique to Wizard-Vicuna, or is it more common than that?

@oobabooga

OpenAssistant also has special tokens like <|endoftext|> that were manually added to the tokenizer. See here: https://huggingface.co/OpenAssistant/oasst-rlhf-2-llama-30b-7k-steps-xor/blob/main/oasst-rlhf-2-llama-30b-7k-steps-xor/added_tokens.json

@turboderp
Owner

turboderp commented Jun 28, 2023

I see. That's an XOR release, but I assume some of the full releases I'm looking at are true enough to the original. It doesn't look like the tokenizer model is changed at all, so it's really all just spread across four different config files, with contradictions and all. Transformers is quite the framework..

Anyway, this makes me wonder if text-generation-webui makes any attempt at sanitizing user input, or if that's maybe just me overthinking things.

@pi6am

pi6am commented Aug 27, 2023

I believe I have a lead on this part of the issue:

Kobold's exllama = random seizures/outbursts, as mentioned

I managed to reproduce the problem with logging enabled and observed the following generation:

   id       gen     RepetitionP     TopP      softmax
 29892    24.8125     22.5628     22.5628        1       [,]
  1183    18.3281     16.7737       -inf         0       [ she]
  310     15.3672     14.1284       -inf         0       [ of]
  322     15.0312     13.7184       -inf         0       [ and]
 20265    1.68555     1.68555       -inf         0       [ Bened]
Selected: 20265 [ Bened]

Despite the comma being the only token with non-zero probability, KAI selected the Bened token instead. The cause is a bug in torch.multinomial that has been claimed fixed but not yet released (i.e. it's still active in PyTorch 2.0.1). This bug sometimes causes the multinomial function to select items with zero weight.

I worked around this bug in the KoboldAI exllama backend by checking the selected tokens and resampling when any zero probability tokens are chosen. I've verified with logging that this avoids the issue and in testing it seems to have solved all the problems with poor output quality.
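A minimal sketch of that workaround (mine, not the actual KoboldAI patch): re-draw whenever torch.multinomial returns an index whose weight is zero.

import torch

def sample_without_zeros(probs, num_samples = 1, max_retries = 10):
    # Work around the torch.multinomial bug by re-drawing if any selected
    # token has zero probability.
    for _ in range(max_retries):
        picks = torch.multinomial(probs, num_samples)
        if (probs[picks] > 0).all():
            return picks
    # Fall back to the most likely tokens if the bug keeps triggering.
    return torch.topk(probs, num_samples).indices

probs = torch.tensor([1.0, 0.0, 0.0, 0.0])  # only token 0 has non-zero weight
print(sample_without_zeros(probs))          # should always print tensor([0])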
