
Actually apply mask from nsigma #1602


Open
Reithan wants to merge 1 commit into concedo_experimental

Conversation

@Reithan (author) commented Jun 14, 2025

Nsigma currently modifies logits without actually applying a mask, although the comment in the code says it does.
Actually applying the intended mask improves generation performance when using nsigma.
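For context, a rough sketch of the current behaviour (illustrative stand-in types only, not the actual koboldcpp structs): the n-sigma cutoff is computed, but tokens that fail it only get their logit pushed down instead of being dropped from the candidate list.

```cpp
// Illustrative sketch only, not the actual koboldcpp source.
#include <cstddef>

struct token_data       { int id; float logit; float p; };   // stand-in for llama_token_data
struct token_data_array { token_data * data; size_t size; }; // stand-in for llama_token_data_array

// Current behaviour ("soft mask"): tokens below the n-sigma cutoff stay in the
// array; only their logit is set to a very negative value.
static void top_n_sigma_soft_mask(token_data_array * cur_p, float max_logit, float sigma, float n) {
    const float threshold = max_logit - n * sigma;
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].logit < threshold) {
            cur_p->data[i].logit = -999.0f;   // "masked", but still visible to later samplers
        }
    }
}
```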

@Reithan changed the base branch from concedo to concedo_experimental on Jun 14, 2025, 06:05
@LostRuins (owner) commented

Does it actually lead to a speedup? I'd imagine simply incrementing matching logits by an amount is very cheap and might even be faster, since it's only ever applied once per token.

Also I'm not sure if that's a good idea. Generally you don't want to reorder the logits unnecessarily - top n sigma is supposed to be a warper, and resizing the candidates instead may have unforeseen effects.

If you want to proceed with it, I suggest you PR this change to https://github.com/ggml-org/llama.cpp/blob/master/src/llama-sampling.cpp#L1786 instead. Their algorithm for top_n_sigma there is the same, and perhaps they might have more insight on modifying this sampler.

Also paging @EquinoxPsychosis to take a look.

@Reithan (author) commented Jun 14, 2025

I'm not 💯 sure what you mean by 'warper'. If you mean it's only supposed to adjust the distribution, that seems incorrect. It sets a cutoff and, even by the comment in the code, 'masks' the tokens, i.e. it culls all the ones that fail its check.

Currently it "soft masks" these by just setting their log probs to a very low number, which 'should' make them non-selectable.
But if we just cull them instead, every operation after nsigma has far fewer tokens to process, which does lead to a decent speedup.

remove_if is stable (it preserves the relative order of the elements it keeps), so it won't reorder any of the non-dropped tokens either. Running remove_if vs the current for (size_t i = 0; i < cur_p->size; ++i) loop should be roughly identical in runtime (both are O(n)).
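For illustration, here's roughly what the pruning looks like (stand-in types again, not the exact PR diff): std::remove_if shifts the kept candidates to the front in their original order, and cur_p->size is then shrunk so downstream samplers never see the rest.

```cpp
// Illustrative sketch only (hypothetical stand-in types, not the exact PR diff).
#include <algorithm>
#include <cstddef>

struct token_data       { int id; float logit; float p; };
struct token_data_array { token_data * data; size_t size; };

// Prune instead of soft-masking: std::remove_if keeps surviving elements in their
// original relative order in a single O(n) pass, then cur_p->size is shrunk.
static void top_n_sigma_prune(token_data_array * cur_p, float threshold) {
    token_data * new_end = std::remove_if(
        cur_p->data, cur_p->data + cur_p->size,
        [threshold](const token_data & td) { return td.logit < threshold; });
    cur_p->size = (size_t)(new_end - cur_p->data);
}
```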

@LostRuins (owner) commented

In practice I don't think it leads to much speedup at all, probably because everything is already pre-filtered to the top 3000 candidates at the very first step. Moreover, top_n_sigma is usually applied along with temperature, which happens right at the end of the sampler chain, so it has very few samplers downstream of it.

@Reithan (author) commented Jun 14, 2025

That's true in most cases, and as noted, I think this is a neutral change in those cases - but in any workflow where temp isn't last, it's a decent improvement.

@Reithan (author) commented Jun 14, 2025

Looking through the llama samplers, it looks like they only ever reset cur_p->size in the p and k samplers.
Functionally, setting logits to -INFINITY does the same thing, but you're doing all the work of excluding tokens without benefiting from the efficiency gain of actually pruning the list.
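To make the two conventions concrete, here's a hedged illustration (stand-in types, not verbatim llama.cpp code): the p and k samplers shrink the candidate array directly, while top_n_sigma keeps the array intact and relies on very negative logits.

```cpp
// Illustrative sketch only (stand-in types; not verbatim llama.cpp code).
#include <cstddef>

struct token_data       { int id; float logit; float p; };
struct token_data_array { token_data * data; size_t size; };

// The k/p-sampler convention: once the candidates have been sorted/filtered,
// the array itself is shrunk, so downstream samplers never see dropped tokens.
// top_n_sigma instead leaves cur_p->size unchanged and relies on -INFINITY
// (-999.0f in the koboldcpp port) logits to exclude tokens after softmax.
static void truncate_candidates(token_data_array * cur_p, size_t keep) {
    if (keep < cur_p->size) {
        cur_p->size = keep;
    }
}
```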

I don't know what the reasoning is behind using -INFINITY rather than resetting cur_p->size, other than possibly a difference in authors? It doesn't seem like anyone has left comments explaining why one was used over the other in each place.

Perhaps @EquinoxPsychosis will know.

@LostRuins (owner) commented

Yup, they are the ones who originally PR'd it. Let's see if we get some info.

@EquinoxPsychosis commented Jun 26, 2025

Hi, sorry for the late response; I haven't logged into GitHub for a while because I've been focusing on a couple of personal projects. Sorry about that.

The sampler in question is a near-direct port of the one from Llama.cpp, with a few changes to make it compatible with Kobold.cpp. IIRC, the reason for applying -999.0 to the probabilities is that it's something I took directly from XTC, which states that "-INFINITY causes wonkiness downstream."

The reason I did it that way instead of pruning is simple: I didn't know how to prune tokens, lol. Yup. That's it.
If I did, I likely would have just pruned them instead.
