Very poor output quality #47
I have noticed that while it massively increases the inference speed, it massively decreases the quality of the outputs: instruct models become very obstinate and give completely irrelevant responses, words become misspelled, it repeats lines over and over, and it sometimes spams Chinese characters.
Comments
I haven't seen this at all. What model are you using? And what settings?
Tried on Chronos 13B, WizardLM 13B, and Pygmalion 7B. I used temperatures between 0.5 and 1 and a context length of 2048. Lower temperatures do seem to wrangle it into behaving a little more, but I have to lower the temperature so much that the output is too "dry" to be useful. However, using the same settings and models on normal GPTQ yields satisfactory results (albeit at unsatisfactory speed).
And just to be clear, is this in ExLlama's web UI or in Ooba?
0cc4m's fork of KoboldAI that allows using ExLlama as well as GPTQ; said fork behaves normally.
Not OP, but for context, the Kobold fork is here if you want to check it, turbo: https://github.com/0cc4m/KoboldAI/tree/4bit-plugin (KoboldAI implementation to support GPTQ and ExLlama). https://github.com/0cc4m/exllama (ExLlama fork, on the transformers branch, which builds ExLlama to work with Kobold). They added the Kobold samplers to that ExLlama fork, so it seems these samplers are covered (I'm not sure about rep pen slope, though).
Okay. I really have my work cut out for me with this already, but I guess I should try installing Kobold at some point to see how they're using it. I would assume they're just taking the logits and passing them to the same samplers they use for other models, and that should just work. But there are some peculiarities to keep in mind, specifically regarding the cache, and that "context tokens" slider looks a little suspect. But idk.
Do you think maybe the code wasn't hooked up to the context correctly, and it's actually running with an incredibly low context size?
I'm not sure what that slider does, but if it truncates the cache, that would definitely lead to degenerate output, since the position embeddings for cached entries would be wrong. But, looking at the Transformers wrapper they added, I think it's just an issue with how the cache is being passed around. It has to stay in sync with the sequence for every forward pass. E.g., if the model generates an EOS token and their generator doesn't add that to the running sequence, it has to be removed from the cache. Or something along those lines. The cache being out of sync is the kind of thing that might leave it working poorly without crashing. But I'd have to install it and run it in a debugger to make sure. Which I will. After doing some other stuff first.
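To illustrate the invariant (a minimal sketch against ExLlama's own `model.py` objects; the helper functions here are illustrative, not code from either project):

```python
import torch
from model import ExLlama, ExLlamaCache

# Invariant: cache.current_seq_len must always equal the number of tokens of
# the running sequence that have actually been through a forward pass.

def forward_in_sync(model: ExLlama, cache: ExLlamaCache, sequence: torch.Tensor):
    # Feed only the tokens the cache hasn't seen yet.
    new_tokens = sequence[:, cache.current_seq_len:]
    return model.forward(new_tokens, cache)

def drop_last_token(cache: ExLlamaCache, sequence: torch.Tensor) -> torch.Tensor:
    # If the generator discards a sampled token (e.g. an EOS it doesn't keep),
    # the cache must be rolled back too, or it silently goes out of sync.
    cache.current_seq_len -= 1
    return sequence[:, :-1]
```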
Alright then, thank you for taking a look at it.
@turboderp Apologies for this, this should have gone to me directly. I do use the KoboldAI samplers; here's the code if you're interested. It seems to work the first time or first few times you generate, but breaks afterwards. I'm not yet sure why. I do call
Yes, using the KoboldAI samplers is the obvious choice for integrating into Kobold, so that's great. There's nothing special about the logits, after all. In fact you should just be able to bypass

I'm going to install the 4-bit branch and have a play with it later today. But I don't see anything immediately wrong with how you're using it. I'll have a look, though. It shouldn't be too hard to spot if the cache and the sequence go out of sync somehow.
It's not yet that user-friendly to install: you need to clone the branch, run
I can confirm my issue is no longer present after 0cc4m's latest commit to his KoboldAI fork. Thank you very much for your help.
But... I didn't fix anything yet.
Me neither. I'm still struggling to get it to load a model. :)
@turboderp Let me know if you need help.
I'm seeing a similar degradation in output quality. It used to match AutoGPTQ output quite closely, but the latest releases seem to be producing different results. I can get back the previous quality by setting `ExLlamaConfig.fused_attn = False`. Hope this can help chase things down.
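Roughly like this, for anyone wanting to reproduce (a sketch following the repo's example scripts; paths are placeholders):

```python
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

config = ExLlamaConfig("/models/my-model/config.json")    # placeholder path
config.model_path = "/models/my-model/model.safetensors"  # placeholder path
config.fused_attn = False  # disable fused attention; restores the old output quality for me

model = ExLlama(config)
tokenizer = ExLlamaTokenizer("/models/my-model/tokenizer.model")
cache = ExLlamaCache(model)
generator = ExLlamaGenerator(model, tokenizer, cache)

print(generator.generate_simple("Once upon a time,", max_new_tokens=100))
```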
Well, it's up and running. I was just using a model that didn't have any

I had to skip the call to

I'm just not seeing anything amiss. It's correctly resetting the cache on each pass, then generating one token at a time, and the cache grows as it should, staying exactly one token behind the sequence. There really isn't much else happening. The output also looks reasonable. I'm just trying with 7B Llama, but with the storywriter preset it is telling me a very cute little story that doesn't seem to be degenerating over multiple passes. It does the thing that small models like to do where it starts repeating itself, but you can throw it off by adding in "Until suddenly..." or some such, and all of that behaves as I'd expect. If I swap out

But all in all... I can't find anything wrong at the moment.
The fused attention step is mathematically equivalent to the regular attention, but there might be slight differences related to numerical precision. Maybe some of the sampling methods are extremely sensitive? It would help if I could reproduce it. Exactly what model and settings are you using to make this happen?
Here's an adjusted snippet of the code; nothing too complicated. `llama` is a Python class which executes a prompt. I've had the same issue with multiple different models from TheBloke. It might just be a user issue with how I'm using the exllama code. I've set up my code to run with exllama, autogptq, gptq-for-llama, or llama.cpp, so I have been comparing them and noticed this difference/issue.
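A minimal sketch of what such a wrapper looks like (placeholder paths; the sampler settings here are assumptions):

```python
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

class Llama:
    """Wrapper that loads a GPTQ model via exllama and executes prompts."""

    def __init__(self, model_dir: str):
        # Standard exllama setup, as in the repo's example scripts.
        self.config = ExLlamaConfig(f"{model_dir}/config.json")
        self.config.model_path = f"{model_dir}/model.safetensors"
        self.model = ExLlama(self.config)
        self.tokenizer = ExLlamaTokenizer(f"{model_dir}/tokenizer.model")
        self.cache = ExLlamaCache(self.model)
        self.generator = ExLlamaGenerator(self.model, self.tokenizer, self.cache)

    def run(self, prompt: str, max_new_tokens: int = 200) -> str:
        self.generator.settings.temperature = 0.7  # assumed settings
        self.generator.settings.top_k = 40
        self.generator.settings.top_p = 0.75
        return self.generator.generate_simple(prompt, max_new_tokens=max_new_tokens)
```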
|
I'll have to try and see if I can reproduce it. One thing that stands out is the call to

Hermes is a model I haven't tested, though. I have found some finetunes to be strangely sensitive to rounding errors. I'll have to check that one out, I guess.
I wrote a quick little script to try and spot any difference in the output between fused and regular attention:
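Something along these lines (a sketch of the comparison, not the exact script; placeholder path, fixed seed per run):

```python
import torch
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

MODEL_DIR = "/models/my-model"  # placeholder
PROMPT = "Once upon a time,"

def generate(fused: bool, seed: int = 42) -> str:
    config = ExLlamaConfig(f"{MODEL_DIR}/config.json")
    config.model_path = f"{MODEL_DIR}/model.safetensors"
    config.fused_attn = fused  # toggle fused vs. regular attention
    model = ExLlama(config)
    tokenizer = ExLlamaTokenizer(f"{MODEL_DIR}/tokenizer.model")
    generator = ExLlamaGenerator(model, tokenizer, ExLlamaCache(model))
    torch.manual_seed(seed)  # fixed seed; output is still slightly non-deterministic
    return generator.generate_simple(PROMPT, max_new_tokens=128)

print("fused:  ", generate(True))
print("regular:", generate(False))
```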
This seems to consistently produce roughly the same output. Now, I say roughly, but it's important to note that even with a fixed seed the implementation is always ever so slightly non-deterministic, which comes down to floating-point addition being non-associative and CUDA providing no guarantees about the order in which threads are launched.

The difference is always small, but it's made a little larger by the use of FP16, where some other implementations use FP32, at least for intermediate results. It's larger still in the fused attention because, at the very end, I've optimized away the addition of the residual connection by just doing the last matmul straight on top of the residual state. Mathematically that's the same thing, but it does change the order of additions quite a bit, for potentially more different rounding behavior in the end.

Still, the differences are small in any case, and even though the generation happens in multiple steps, I'm just not seeing much divergence. And both are staying coherent, although that Hermes model really likes to write song lyrics for some reason. But it seems equally likely to do that with or without fused attention.
Thanks, I can run the sample code you provided and it works cleanly. So it must be an issue in the code I'm using / how exllama is being called. The code is trying to be general across all the various GPTQ implementations, so it might have some cruft causing issues. Will do more testing and see if I can find out why.
Did some further digging. It seems to be related to creating the generator and tokenizer objects inside the "llama" class. When created at the top level it works, but when the exllama objects are created in a class, it has the fused attention difference. Could it be some scoping issue? For the sample below it generates different creative texts, but when used for instruction following it produces very bad results when fused attention is on.

output:

Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64
------------------- With Fused Attention == True version --------------------
Once upon a time, the only way to get your hands on new music was by waiting for it to come out or finding an underground tape trading scene. Nowadays you can stream and download songs instantly from anywhere in world with just few clicks of mouse button!

Injected compiler path: C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\Hostx64\x64
------------------- With Fused Attention == False version --------------------
Once upon a time, the only way to get your hands on new music was by waiting for it to be released and then going out to buy […]

code:
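(A sketch of the two creation patterns being compared; `Llama` is the wrapper class sketched earlier, paths are placeholders:)

```python
from model import ExLlama, ExLlamaCache, ExLlamaConfig
from tokenizer import ExLlamaTokenizer
from generator import ExLlamaGenerator

MODEL_DIR = "/models/my-model"  # placeholder

# Pattern 1: exllama objects created and held inside the wrapper class.
llama = Llama(MODEL_DIR)
print(llama.run("Once upon a time,"))

# Pattern 2: the same objects created at the top level of the script.
config = ExLlamaConfig(f"{MODEL_DIR}/config.json")
config.model_path = f"{MODEL_DIR}/model.safetensors"
model = ExLlama(config)
tokenizer = ExLlamaTokenizer(f"{MODEL_DIR}/tokenizer.model")
generator = ExLlamaGenerator(model, tokenizer, ExLlamaCache(model))
print(generator.generate_simple("Once upon a time,", max_new_tokens=128))
```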
|
so does this mean the fix has been rolled into the code, and if so, what files do I replace? |
There isn't a fix, no, because I haven't been able to reproduce the problem yet. I'm working on a thorough perplexity test to run with all the different possible code paths, which should highlight if there are any significant differences in how the model evaluates depending on tuning parameters.

I know there are some numerical differences, at least, and it's possible that this divergence is just the result of the model ending up at a "tipping point" and then going down one path or another based on some small shift in the probabilities. But that's not the same as poor output quality. There isn't a "correct" choice for any one token. So unless something is actually breaking and resulting in a broken probability distribution, what you really want is to avoid those tipping points in the first place.

I'll know more once these tests are set up. In the meantime you could try the new typical sampling feature, which does seem to produce more consistent results overall.
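If you're calling the generator directly, that's presumably just a settings knob (an assumption about how it's exposed, with zero meaning disabled):

```python
generator.settings.typical = 0.25  # assumed: enables typical sampling; 0.0 disables it
```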
I will try that when I get a chance to, thank you |
For what it's worth, I've noticed output quality issues as well in Kobold, which I assumed were related to the sampling swap. However, I noticed similar issues with ooba's very recent exllama support, which doesn't touch exllama's native sampling. One revealing thing: I was using Wizard-Vicuna-30b, which uses `</s>` in its prompt format.
Kobold doesn't use ExLlama's sampling, only logits from the model. Ooba does use the native sampling, though, as well as ExLlama's tokenizer which is just a straight SentencePiece instance reading the model file directly. I'll have to dig into the Transformers tokenizer to see if it does something special. The special tokens map shouldn't lead to you seeing "</s>" in the output, especially when the file that defines that string isn't being read. I can look into some ways to take special_tokens_map.json into account, but it's going to be a little tricky when you have models on HF where that file looks like this:
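Something in this spirit, for illustration (not the literal file; the point is the contradictory and unhelpful values):

```json
{
  "bos_token": "</s>",
  "eos_token": "</s>",
  "unk_token": "</s>"
}
```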
The contractions are interesting, at least. Seems too oddly specific to not be a tokenizer issue, but I'm not sure what to make of it. I'll try to see if I can reproduce it. Have you seen it in Kobold too or just in Ooba? |
Seen it in both, but it's happening constantly in ooba, every other reply. It's very weird. It manifests in a few ways: just forgetting to finish contractions, and emitting `</s>` as literal text.

For what it's worth, the initial exllama branch in Kobold (which was very early, before most of your optimizations, or even support for non-groupsize models) didn't have any generation bugs at all that I could detect.
Sorry for the question here, but the only samplers missing now are tfs (tail free sampling) and top_a, right? |
@QM60 : Are you seeing it with different models or just a particular one? Is there one that does it more than others? |
Okay, so I did a lot of in-depth testing and I did discover a bug in the handling of the cache during fused attention. It was a little extra sneaky because it doesn't manifest on 7B, and I've been testing a bunch on 7B because it's faster, and just validating on 13B and 33B, but not thoroughly enough to notice the difference. Anyway, I don't know if this fixes all the issues, but it definitely improves the output on paper. |
Interesting, I was just investigating the "contraction disease" issue last night, because I could swear it only started showing up after I pulled a few days ago, so I was trying different commits to narrow down exactly when it started. I was going to spend more time on it today, but it looks like 1ef63db did indeed fix it. For whatever little it's worth anymore, the narrowest I'd gotten it (before realizing I was dysfunctionally tired, heh) was 3c86994 (older). So it's very plausible that b65d774 introduced it.
Massive improvement for me; the contraction issues are gone. The only remaining issue is the `</s>` handling. Source on the vicuna 1.1 prompt format, where they explicitly suggest using `</s>`. Honestly, I could see the argument that llama's behavior is a bug, but that's what it is. No API lets you inject token IDs into a prompt, so people are used to embedding control tokens into text. I'm not sure if using plain sentencepiece instead of LlamaTokenizer will cause other issues, though. Transformers seemed to have multiple tokenization bugs for llama, which suggests something might be tricky about it.
Alright, I celebrated too fast. Contractions are okay, but there is still something off about the output from exllama. I suspect it's a sampling issue, perhaps related to sampling order? To identify the differences, I recommend using the sphinx-moth preset from ooba, which has very high temperature mediated by strict top-p and top-k. I think this is more susceptible to whatever the difference is. Here is a simple comparison using ooba with a fixed seed: llama 30B (plain), quantized to 4-bit with act-order and no groupsize. Settings included. The difference in sanity speaks for itself, although I'm kinda fond of the exllama version's energy.
Well, the model is still FP16, regardless, whereas GPTQ-for-LLaMa can also be FP32 depending on how you use it. So it is more susceptible to numerical instability, even if it has no meaningful impact on perplexity.

I would question if sampling with a high temperature is really a good way to get more "varied" output anyway, compared to typical or tail-free sampling, or some such. It makes a lot of assumptions about how the model is supposed to act way outside of the conditions it was trained in, with some inherent assumptions about numerical accuracy. So getting it to work just right by tweaking magic numbers is a balancing act.

You could try running without fused attention and MLP, or maybe try lowering the top-p threshold to see if that produces more equivalent output by cutting out some more of the noise on the tail-end of the probability distribution.

I'll have a look anyway a little later. There might be issues with the order of operations. We'll see. But generally speaking, to produce exactly the same output as other implementations in the extremes, I'd have to emulate those other implementations much more carefully, maybe even down to timings in the kernels to get CUDA threads to launch in the same order. And of course FP32 would eat up a lot more VRAM. I have other ideas for improving accuracy, though, building on the LoRA support I'm currently working on.
I was skeptical that FP32 would explain that kind of a difference in output, so I looked into it, and it's just a sampling difference. For exllama, top_p sampling inherits the base probabilities from before top_k was applied. That means the same value of top_p is potentially a lot more strict in transformers, where probabilities are normalized at each step. To confirm, I was able to get the same subjective level of output quality simply by adding one line after the top-k step that renormalizes the remaining probabilities.
This could explain some output differences reported here. |
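In torch terms, the change amounts to this (a sketch of top-k/top-p with renormalization in between; not exllama's actual sampling code):

```python
import torch

def sample_top_k_top_p(probs: torch.Tensor, top_k: int, top_p: float) -> int:
    # Keep the top-k candidates.
    top_probs, top_ids = torch.topk(probs, top_k)

    # The "one line": renormalize so top-p sees a proper distribution,
    # matching what transformers does between sampling steps.
    top_probs = top_probs / top_probs.sum()

    # Nucleus (top-p) filtering on the renormalized values.
    cumulative = torch.cumsum(top_probs, dim=-1)
    keep = cumulative - top_probs < top_p  # always keeps the first token
    top_probs = top_probs * keep
    top_probs = top_probs / top_probs.sum()

    # Sample from what's left and map back to the vocabulary id.
    idx = torch.multinomial(top_probs, 1)
    return int(top_ids[idx])
```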
Just tried this, and it indeed gets the same level of output quality. Nice catch!
Well... I'm not fond of making the sampling parameters interdependent in this way. But I guess it doesn't matter all that much since it's kind of all trial-and-error anyway, getting those parameters just right for subjectively satisfying output. So I've pushed an update to normalize the distribution after each sampler. |
I wonder whether it would be possible to utilize the original decoding pipeline from huggingface transformers to get a more "aligned" result, given that the logits are most likely similar?
It would be possible, I assume. KoboldAI does a similar thing to plug ExLlama's logits into the same samplers used for other implementations. I think it's much too limiting, though. |
Man... it's better, but I swear something's still off. I switched to Kobold just to rule out the tokenizer and samplers (which are the same there), but exllama still has seizures every so often. A few highlights:
This doesn't seem likely to be a sampler issue, and I never saw things like that from GPTQ with the same model/settings. It's much better than before, but I'd still view the current forward pass with a bit of suspicion atm... |
Are these still with extremely high temperature? |
Suppressing the EOS token might be an instance where the model is extra sensitive to noise, or rounding errors from the FP16 math. If the model is 99.9% sure it wants to emit an EOS token, picking from the remaining 0.1% instead might be asking for trouble. I'm thinking maybe it's more correct to lower the temperature by some amount proportional to the likelihood of the EOS token, or any other token that gets masked out. Maybe the distribution just ends up being really flat otherwise. It's interesting that I've never really seen this myself, though. |
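Something like this, just as a sketch of the idea (nothing implemented):

```python
import torch

def adjusted_temperature(logits: torch.Tensor, banned_ids: list, temperature: float) -> float:
    # Shrink the temperature in proportion to the probability mass held by
    # the banned tokens (e.g. a 99.9%-likely EOS), so the remaining
    # distribution isn't flattened into noise.
    probs = torch.softmax(logits.float(), dim=-1)
    banned_mass = probs[banned_ids].sum().item()
    return temperature * (1.0 - banned_mass)
```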
For what it's worth, I've never seen Kobold suppress EOS when called via its API. It doesn't do that for me; I have generations stop before max tokens have been reached quite often. And my results didn't come from "super high temperature" either; I believe it was about 0.7 with tfs 0.9. But still, sphinx-moth is a normal, popular preset, despite the high temp, and as strict as it is with top-k and top-p, tiny FP16 rounding errors aren't likely to cause these kinds of issues. Banning EOS can cause gibberish, although usually it's a different kind. But the thing is, 90% of gens feel completely normal, and the seizures are sudden and occur at random places in the middle of the text, not at natural stopping points (the contraction issue is a clear example of that). I have no idea how to debug this, though, if the results are nondeterministic due to concurrency. Except maybe fuzzing and comparing logits with transformers output within a tolerance. But that would only detect the bugs, not find them.
Fuzzing is probably the way to go. Perplexity tests aren't revealing anything out of the ordinary, but then they wouldn't if it's an intermittent thing that just corrupts a few numbers under some special conditions. That wouldn't affect the average over thousands of tokens. I think I'll set up a script to run the same inference in ExLlama and GPTQ-for-LLaMa and look for anywhere the logits deviate considerably between the two. That should at least let me reproduce the error. |
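The comparison itself can be as simple as this (a sketch; it assumes you've captured logits from both implementations for the same inputs):

```python
import torch

def logit_deviation(logits_a: torch.Tensor, logits_b: torch.Tensor) -> float:
    # Report the worst absolute deviation between the two implementations;
    # a sudden spike here would point at the intermittent corruption.
    diff = (logits_a.float() - logits_b.float()).abs()
    pos = int(diff.argmax())
    print(f"max deviation {diff.max().item():.4f} at flat index {pos}")
    return diff.max().item()
```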
For what it's worth, so far, ooba's exllama_hf adapter is giving me flawless output, or at least, indistinguishable from transformers. Not terribly surprising since I think it should have identical tokenizer and sampler behavior. In summary, what I noticed:
The forward pass might be perfectly fine after all. |
Well, the forward pass is mutating all the time, and there definitely were some issues with it that have been resolved.

The comparison between the HF samplers and the ones in ExLlama's generator is kind of apples-to-oranges. Without some concrete examples of generations that go "wrong" in ExLlama when they shouldn't, it's really hard to do anything about it. All I get out of that is basically "I tried strumming my bass the same way I strum my guitar, and it doesn't sound the same, so I guess my bass is broken."

There could absolutely be bugs in the implementation, but it's impossible to find them based on a subjective sense of something being off. I could just keep making it more and more identical to Transformers, but the whole point of the project wasn't to create a Transformers-compatible plugin for Ooba and Kobold, it was to build an alternative platform for experimenting with techniques that don't fit well in Transformers.

The tokenizer still shouldn't behave differently, though. With regards to BOS tokens, yes, but that's a separate issue of figuring out what to do about all the incorrect tokenizer configs floating around on HF.
Was not complaining about samplers, just noting my experiences. I don't expect exllama's built-in samplers to be identical at all! If it needs new presets, that's fine. Swapping in known samplers is a very useful way to control for differences, though. And that's important, since it suggests that any complaints you're getting probably aren't due to bugs in the forward pass; at least, not anymore. That's good! The tokenizer definitely does behave differently, though, because it's just wrapping sentencepiece, which explicitly says it doesn't parse special tokens like `</s>`.
I'm not taking it as complaints, don't get me wrong. But I also don't want to doubt people when they say the output is bad; like in this case, there was a bug causing the forward pass to run slightly incorrectly, and it's good that I found that. I also don't blame people for not having any other way to determine if the output is off than by comparing it to other implementations, because I don't either. It's a black box, after all. Just sometimes I do wish it came more in the form of: "here's the output I got, here's (exactly) how I got it, and here's why I think it's wrong." But yeah, it's not to be confrontational or anything, I just wish there was a better way of communicating that I don't view Transformers as the standard. Maybe I just need an FAQ. :)

As for the tokenizer, I do plan to add some support for special tokens. For models that rely on them it gets messy otherwise, if they have to be inserted after encoding and disappear when decoding.
So... how do things stand with the special tokens? Having studied it a bit, it seems the authors of SentencePiece go out of their way to explain that control symbols are categorically invalid inputs to the encoder. Meanwhile, Transformers implements a very elaborate workaround so they can be encoded anyway. I imagine those two teams don't like each other very much. Anyway, it wouldn't be too difficult to emulate what Transformers does, but it would be kind of messy so I'm wondering how many models actually include control symbols in their prompt format. Is it unique to Wizard-Vicuna, or is it more common than that? |
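Emulating it would look roughly like this (a sketch, not Transformers' actual code): split the prompt on the special-token strings, encode the plain pieces, and splice the reserved ids back in.

```python
import re
import sentencepiece as spm

def encode_with_control_symbols(sp: spm.SentencePieceProcessor, text: str,
                                specials: dict) -> list:
    # specials maps control-symbol strings to their reserved ids, e.g. {"</s>": 2}.
    pattern = "(" + "|".join(re.escape(s) for s in specials) + ")"
    ids = []
    for piece in re.split(pattern, text):
        if piece in specials:
            ids.append(specials[piece])   # splice in the reserved id directly
        elif piece:
            ids.extend(sp.encode(piece))  # SentencePiece never sees the symbol
    return ids
```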
OpenAssistant also has special tokens, like `<|prompter|>` and `<|assistant|>`.
I see. That's an XOR release, but I assume some of the full releases I'm looking at are true enough to the original. It doesn't look like the tokenizer model is changed at all, so it's really all just spread across four different config files, contradictions and all. Transformers is quite the framework...

Anyway, this makes me wonder if text-generation-webui makes any attempt at sanitizing user input, or if that's maybe just me overthinking things.
I believe I have a lead on this part of the issue:
I managed to reproduce the problem with logging enabled and observed the following generation:
Despite the comma being the only token with non-zero probability, KAI selected a different token that had zero probability. I worked around this bug in the KoboldAI exllama backend by checking the selected tokens and resampling when any zero-probability tokens are chosen. I've verified with logging that this avoids the issue, and in testing it seems to have solved all the problems with poor output quality.
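The workaround is essentially this (a sketch; `sampler` stands in for whatever sampling call KAI uses):

```python
import torch

def sample_with_resampling(probs: torch.Tensor, sampler, max_tries: int = 10) -> int:
    # If the backend's sampler picks a token whose probability is zero,
    # discard the pick and sample again.
    for _ in range(max_tries):
        token = int(sampler(probs))
        if probs[token] > 0:
            return token
    # Fall back to the most likely token if resampling keeps failing.
    return int(probs.argmax())
```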