Actually apply mask from nsigma #1602
base: concedo_experimental
Conversation
Does it actually lead to a speedup? I'd imagine simply incrementing matching logits by an amount is very cheap and might even be faster, since it's only ever applied once per token. Also, I'm not sure it's a good idea. Generally you don't want to reorder the logits unnecessarily - top n sigma is supposed to be a warper, and resizing the candidates instead may have unforeseen effects. If you want to proceed with this, I suggest you PR the change to https://github.com/ggml-org/llama.cpp/blob/master/src/llama-sampling.cpp#L1786 instead. Their top_n_sigma algorithm is the same, and they might have more insight on modifying this sampler. Also paging @EquinoxPsychosis to take a look.
I'm not 💯 sure what you mean by 'warper'. If you mean that it's only supposed to adjust the distribution, that seems incorrect. It sets a cutoff and, even by the comment in the code, 'masks' the tokens, i.e. it culls all the ones that fail its check. Currently it "soft masks" them by just setting their log probs to a very low number, which 'should' make them non-selectable.
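To make the two behaviours being debated concrete, here is a minimal, hypothetical C++ sketch. It is not the actual KoboldCpp or llama.cpp code; the `Candidate` type, function names, and constants are illustrative only. It contrasts soft-masking filtered tokens by pushing their logits very low with actually removing them from the candidate list.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical candidate type, loosely modeled on llama.cpp-style samplers.
struct Candidate {
    int   id;
    float logit;
};

// "Soft mask": filtered tokens stay in the list, but their logits are pushed
// so low that softmax makes them effectively unpickable.
void soft_mask(std::vector<Candidate> & cands, const std::vector<bool> & keep) {
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (!keep[i]) {
            cands[i].logit = -999999.9f; // illustrative "very low number"
        }
    }
}

// "Actual mask": filtered tokens are removed outright, so every sampler that
// runs afterwards only ever sees the surviving candidates.
void hard_prune(std::vector<Candidate> & cands, const std::vector<bool> & keep) {
    std::size_t out = 0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (keep[i]) {
            cands[out++] = cands[i];
        }
    }
    cands.resize(out);
}
```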
In practice I don't think it leads to much speedup at all, probably because everything is already pre-filtered to the top 3000 candidates at the very first step. Moreover
That's true in most cases, and as noted, I think this is a neutral change in those cases - but in any workflow where temp isn't last, it's a decent improvement.
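To spell out why sampler order matters here (illustrative only, reusing the hypothetical `Candidate` type from the sketch above): every stage that runs after nsigma iterates over whatever candidates remain, so a real prune shrinks the work for all later stages, while a soft mask leaves the full pre-filtered list in place.

```cpp
// Hypothetical downstream stage, e.g. temperature applied after nsigma.
// Its cost is proportional to how many candidates are still in the list:
// after a real prune that may be a handful of tokens; after a soft mask it
// is still the whole pre-filtered set (e.g. ~3000 candidates).
void apply_temperature(std::vector<Candidate> & cands, float temp) {
    for (auto & c : cands) {
        c.logit /= temp;
    }
}
```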
Looking through the llama samplers, it looks like they only ever reset ... I don't know what the reasoning is behind using ... Perhaps @EquinoxPsychosis will know.
Yup, they're the ones who originally PR'd it. Let's see if we get some info.
Hi, sorry for the late response, I've not logged into GitHub for a while due to focusing on a couple of personal projects. Sorry about that. The sampler in question is a near-direct port of the one from Llama.cpp, with a few changes to make it compatible with Kobold.cpp, IIRC. The reason I did it that way instead of pruning is simple: I didn't know how to prune tokens, lol. Yup. That's it.
Nsigma currently modifies logits without actually applying a mask, although the comment in the code says it does.
Actually applying the intended mask improves generation performance when using nsigma.
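As a rough illustration of what "actually applying the mask" could look like: a sketch under the common top-n-sigma formulation of keeping tokens whose logit lies within n standard deviations of the maximum logit, then pruning the rest. This is not the PR's exact code, and it reuses the hypothetical `Candidate` type from the first sketch.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

void top_n_sigma_prune(std::vector<Candidate> & cands, float n) {
    if (cands.empty() || n <= 0.0f) {
        return;
    }

    // Compute max logit, mean, and standard deviation over the candidates.
    float max_logit = cands[0].logit;
    float mean      = 0.0f;
    for (const auto & c : cands) {
        max_logit = std::max(max_logit, c.logit);
        mean += c.logit;
    }
    mean /= cands.size();

    float var = 0.0f;
    for (const auto & c : cands) {
        var += (c.logit - mean) * (c.logit - mean);
    }
    const float sigma     = std::sqrt(var / cands.size());
    const float threshold = max_logit - n * sigma;

    // Actually apply the mask: drop candidates below the threshold instead of
    // merely down-weighting them.
    std::size_t out = 0;
    for (std::size_t i = 0; i < cands.size(); ++i) {
        if (cands[i].logit >= threshold) {
            cands[out++] = cands[i];
        }
    }
    cands.resize(out);
}
```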