
Actually apply mask from nsigma #1602


Open
Reithan wants to merge 1 commit into concedo_experimental

Conversation

@Reithan (author) commented Jun 14, 2025

Nsigma currently modifies logits without actually applying a mask, although the comment in the code says it does.
Actually applying the intended mask improves generation performance when using nsigma.
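For context, a rough sketch of the current behaviour (illustrative stand-in types only, not the actual koboldcpp structs): the n-sigma cutoff is computed, but tokens that fail it only get their logit pushed down instead of being dropped from the candidate list.

```cpp
// Illustrative sketch only, not the actual koboldcpp source.
#include <cstddef>

struct token_data       { int id; float logit; float p; };   // stand-in for llama_token_data
struct token_data_array { token_data * data; size_t size; }; // stand-in for llama_token_data_array

// Current behaviour ("soft mask"): tokens below the n-sigma cutoff stay in the
// array; only their logit is set to a very negative value.
static void top_n_sigma_soft_mask(token_data_array * cur_p, float max_logit, float sigma, float n) {
    const float threshold = max_logit - n * sigma;
    for (size_t i = 0; i < cur_p->size; ++i) {
        if (cur_p->data[i].logit < threshold) {
            cur_p->data[i].logit = -999.0f;   // "masked", but still visible to later samplers
        }
    }
}
```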

@Reithan changed the base branch from concedo to concedo_experimental on Jun 14, 2025, 06:05
@LostRuins (owner) commented

Does it actually lead to a speedup? I'd imagine simply incrementing matching logits by an amount is very cheap and might even be faster, since it's only ever applied once per token.

Also I'm not sure if that's a good idea. Generally you don't want to reorder the logits unnecessarily - top n sigma is supposed to be a warper, and resizing the candidates instead may have unforeseen effects.

If you want to proceed with it, I suggest you PR this change to https://github.com/ggml-org/llama.cpp/blob/master/src/llama-sampling.cpp#L1786 instead. Their algorithm for top_n_sigma there is the same, and perhaps they might have more insight on modifying this sampler.

Also paging @EquinoxPsychosis to take a look.

@Reithan (author) commented Jun 14, 2025

I'm not 💯 sure what you mean by 'warper'. If you mean it's only supposed to adjust the distribution, that seems incorrect. It sets a cutoff and, even by the comment in the code, 'masks' the tokens, i.e. it culls all the ones that fail its check.

Currently it "soft masks" these by just setting their log probs to a very low number, which 'should' make them non-selectable.
But if we just cull them instead, every operation after nsigma has far fewer tokens to process, which does lead to a decent speedup.

remove_if is stable (it preserves the relative order of the elements it keeps), so it won't reorder any of the non-dropped tokens either. Running remove_if vs the current for (size_t i = 0; i < cur_p->size; ++i) loop should be roughly identical in runtime (both are O(n)).
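For illustration, here's roughly what the pruning looks like (stand-in types again, not the exact PR diff): std::remove_if shifts the kept candidates to the front in their original order, and cur_p->size is then shrunk so downstream samplers never see the rest.

```cpp
// Illustrative sketch only (hypothetical stand-in types, not the exact PR diff).
#include <algorithm>
#include <cstddef>

struct token_data       { int id; float logit; float p; };
struct token_data_array { token_data * data; size_t size; };

// Prune instead of soft-masking: std::remove_if keeps surviving elements in their
// original relative order in a single O(n) pass, then cur_p->size is shrunk.
static void top_n_sigma_prune(token_data_array * cur_p, float threshold) {
    token_data * new_end = std::remove_if(
        cur_p->data, cur_p->data + cur_p->size,
        [threshold](const token_data & td) { return td.logit < threshold; });
    cur_p->size = (size_t)(new_end - cur_p->data);
}
```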

@LostRuins (owner) commented

In practice I don't think it leads to much speedup at all, probably because everything is already pre-filtered to the top 3000 candidates at the very first step. Moreover, top_n_sigma is usually applied along with temperature, which happens right at the end of the sampler chain, so it has very few samplers downstream of it.

@Reithan (author) commented Jun 14, 2025

That's true in most cases, and as noted, I think this is a neutral change in those cases - but in any workflow where temp isn't last, it's a decent improvement.

@Reithan (author) commented Jun 14, 2025

Looking through the llama samplers, it looks like they only ever reset cur_p->size in the p and k samplers.
Functionally, setting logits to -INFINITY does the same thing, but you're doing all the work of excluding tokens without benefiting from the efficiency gain of actually pruning the list.
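To make the two conventions concrete, here's a hedged illustration (stand-in types, not verbatim llama.cpp code): the p and k samplers shrink the candidate array directly, while top_n_sigma keeps the array intact and relies on very negative logits.

```cpp
// Illustrative sketch only (stand-in types; not verbatim llama.cpp code).
#include <cstddef>

struct token_data       { int id; float logit; float p; };
struct token_data_array { token_data * data; size_t size; };

// The k/p-sampler convention: once the candidates have been sorted/filtered,
// the array itself is shrunk, so downstream samplers never see dropped tokens.
// top_n_sigma instead leaves cur_p->size unchanged and relies on -INFINITY
// (-999.0f in the koboldcpp port) logits to exclude tokens after softmax.
static void truncate_candidates(token_data_array * cur_p, size_t keep) {
    if (keep < cur_p->size) {
        cur_p->size = keep;
    }
}
```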

I don't know what the reasoning is behind using -INFINITY rather than resetting cur_p->size, other than possibly a difference in authors? It doesn't seem like anyone has left comments explaining why one was used over the other in each place.

Perhaps @EquinoxPsychosis will know.

@LostRuins (owner) commented

Yup, they are the ones who originally PR'd it. Let's see if we get some info.

@EquinoxPsychosis commented Jun 26, 2025

Hi, sorry for the late response; I haven't logged into GitHub for a while because I've been focusing on a couple of personal projects. Sorry about that.

The sampler in question is a near-direct port of the one from Llama.cpp, with a few changes to make it compatible with Kobold.cpp. IIRC, the reason for applying -999.0 to the probabilities is that it's something I took directly from XTC, which states that "-INFINITY causes wonkiness downstream."

The reason I did it that way instead of pruning is simple: I didn't know how to prune tokens, lol. Yup. That's it.
If I did, I likely would have just pruned them instead.
