
Add memoized cache to llama_grammar_reject_candidates_for_stack #1615


Conversation

Reithan commented Jun 22, 2025

This adds a memoized data cache to llama_grammar_reject_candidates_for_stack.

  1. The cache first checks the stack; this is a reasonably small check, since stacks are usually 20-60 bytes.
  2. If the stack is a hit, it checks the current candidate list size.
  • Candidate list sizes vary greatly: 50% are ~600 B or below and 75% are below 1280 B, but they max out around 72 KB.
  • Most 'problematic' recursion happens with lists of ~24 B.
    a. If the candidate list is <1280 B, it is also checked for a cache hit, to return early.
    b. If this wasn't a hit, the result of the function is cached. (A sketch of the scheme follows below.)

In all my tests so far, this change completely eliminates 'hangs' caused by branching explosions in complex grammars, and it speeds up 'heavy' grammar processing to run at nearly the same speed as no grammar at all.
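For concreteness, here is a minimal sketch of the scheme described above. All names here (hash_bytes, reject_candidates_uncached, reject_memos) are hypothetical stand-ins; the actual PR wires this into llama_grammar_reject_candidates_for_stack in llama-grammar.cpp, and the real identifiers and hash function may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a over the raw bytes of a vector (stand-in for whatever hash the PR uses).
static size_t hash_bytes(const std::vector<int> & v) {
    uint64_t h = 14695981039346656037ull;
    for (int x : v) {
        const unsigned char * p = reinterpret_cast<const unsigned char *>(&x);
        for (size_t i = 0; i < sizeof(x); ++i) { h ^= p[i]; h *= 1099511628211ull; }
    }
    return static_cast<size_t>(h);
}

// Stand-in for the original recursive rejection walk (placeholder body).
static std::vector<int> reject_candidates_uncached(const std::vector<int> & /*stack*/,
                                                   const std::vector<int> & candidates) {
    return candidates; // placeholder: the real function returns the rejected subset
}

// stack hash -> (candidate-list hash -> cached rejects)
static std::unordered_map<size_t, std::unordered_map<size_t, std::vector<int>>> reject_memos;

std::vector<int> reject_candidates_memoized(const std::vector<int> & stack,
                                            const std::vector<int> & candidates) {
    const size_t k_cutoff = 1280;                // only hash candidate lists smaller than this

    const size_t stack_hash = hash_bytes(stack); // cheap: stacks are ~20-60 bytes
    auto it = reject_memos.find(stack_hash);
    if (it != reject_memos.end() && candidates.size() * sizeof(int) < k_cutoff) {
        const size_t cand_hash = hash_bytes(candidates);
        auto hit = it->second.find(cand_hash);
        if (hit != it->second.end()) {
            return hit->second;                  // cache hit: skip the recursive walk entirely
        }
    }

    std::vector<int> rejects = reject_candidates_uncached(stack, candidates);
    if (candidates.size() * sizeof(int) < k_cutoff) {
        reject_memos[stack_hash][hash_bytes(candidates)] = rejects; // memoize for next time
    }
    return rejects;
}
```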

Reithan (Author) commented Jun 22, 2025

Example of grammar on vs. off with this change applied. This is with a fairly large and complex grammar block.

On: [screenshot]

Off: [screenshot]

LostRuins (Owner) commented:

I think for a 2% (35 vs 36) t/s change, this is probably not worth it for the additional complexity.

The original speedup in the previous PR was very good, but this one seems marginal at best and probably makes debugging harder.

LostRuins added the KIV for now (some issues prevent this from being merged) and needs review labels on Jun 22, 2025
Reithan (Author) commented Jun 22, 2025

> I think for a 2% (35 vs 36) t/s change, this is probably not worth it for the additional complexity.
>
> The original speedup in the previous PR was very good, but this one seems marginal at best and probably makes debugging harder.

Sorry, you misunderstand: the pictures compare no grammar vs. grammar, not no change vs. change.
Both pictures have the change applied.

Reithan (Author) commented Jun 23, 2025

The key problem being solved with this patch is "complex, slow grammars that cause a combinatorial explosion of stacks".

You can see this reliably with a test grammar like the one below, where each independently optional letter doubles the number of live parse branches, so a single segment fans out combinatorially:

```
# ── character classes ─────────────────────────────────────────────
U ::= [A-Z]        # one required upper-case letter
o ::= "" | [a-z]   # 18 independent “present or not” lower-case letters

# ── a single high-fan-out segment ─────────────────────────────────
segment ::= U o o o o o o o o o o o o o o o o o

# ── entry point: unlimited repetition of that segment ─────────────
root ::= segment+
```

On the un-patched branch, this runs at ~5 T/s on my setup: [screenshot]

With this patch, even setting the byte cutoff much lower, like 680 B, we see this: [screenshot]

Almost as fast as no grammar at all (35-37 T/s).

While this is more complex than no check at all, I think a simple 'hash and check' memoization scheme is pretty standard dynamic programming. This should be familiar to most engineers, so I don't worry too much about maintenance.

The key tuning knob would be the cutoff value. A high value like the current 1280 catches a great many branches, but also runs the hash very often, which may not be worth it, since most of those branches resolve quickly anyway.

A lower value like 640, or even less, will catch far fewer branches early, but will still stop the degenerate 'stuck' cases.

I could also reverse the checks and 'early-out' on candidate size before we even do the stack hash, to save even more time (sketched below).
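A minimal sketch of that reversed ordering, reusing the hypothetical reject_memos / hash_bytes names from the earlier sketch (not the exact PR code):

```cpp
// Reversed ordering: bail out on candidate-list size before paying for any
// hashing at all. Returns true and fills `out` only on a cache hit.
bool try_cache_lookup(const std::vector<int> & stack,
                      const std::vector<int> & candidates,
                      std::vector<int> & out) {
    const size_t k_cutoff = 1280;                    // tunable byte cutoff
    if (candidates.size() * sizeof(int) >= k_cutoff) {
        return false;                                // early-out: no hashing for large lists
    }
    auto it = reject_memos.find(hash_bytes(stack));  // stack hash only for small lists
    if (it == reject_memos.end()) {
        return false;
    }
    auto hit = it->second.find(hash_bytes(candidates));
    if (hit == it->second.end()) {
        return false;
    }
    out = hit->second;                               // cache hit
    return true;
}
```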

Reithan (Author) commented Jun 23, 2025

Tested doing the size check first and lowering the byte cutoff; both result in roughly the same speed.

tl;dr: this change makes complex grammars that normally get 'stuck' or slow down considerably run at 'normal' speed.
It does nothing for grammars that already run smoothly.

So, you get:

  • ~0% speed change in the normal case
  • 700%+ speedup in 'hard' cases

Reithan (Author) commented Jun 23, 2025

Actually, doing an aggressive size check first also speeds up some 'easy' cases, like the tool-call example: [screenshot]

40 T/s, up from 35 T/s.

I'll push that change.

Reithan force-pushed the add-memoized-cache-to-grammar-rejection branch from daaf02a to 4510328 on June 23, 2025 at 02:31
LostRuins (Owner) commented Jun 23, 2025

Do we never clear the cache? Will that be a problem? Seems like we might want to clear the cache upon each new gen.

Reithan (Author) commented Jun 23, 2025

> Do we never clear the cache? Will that be a problem? Seems like we might want to clear the cache upon each new gen.

I was debating that. In local use, not clearing the cache lets every new gen reuse the speedup, which helps quite a bit.

But if we're running in horde mode, or changing the grammar, or things like that, clearing the cache would be useful.

  • Testing memo cache reset now.

Reithan (Author) commented Jun 23, 2025

  • Added a cache reset tied to the current grammar-reload logic (sketched below).
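A minimal sketch of what such a reset amounts to, with a hypothetical hook name, reusing the reject_memos map from the earlier sketch:

```cpp
// Hypothetical hook: clear the memo cache whenever the grammar is (re)loaded,
// so cached rejection results can't leak across different grammars.
void on_grammar_reload() {
    reject_memos.clear(); // drop all memoized results; the cache warms up again per gen
}
```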

LostRuins (Owner) commented:

You accidentally committed version.txt

Reithan force-pushed the add-memoized-cache-to-grammar-rejection branch from 497f34c to 0fc4a88 on June 23, 2025 at 08:58
LostRuins removed the KIV for now label on Jun 24, 2025
LostRuins (Owner) commented:

Btw, it's not compiling on my own PC:

src/llama-grammar.cpp:915:58: error: cannot bind non-const lvalue reference of type 'std::__detail::_Node_iterator...

I had to change line 915 from

```cpp
if (auto & cache_hit2 = candidates_memos.find(candidates_hash); cache_hit2 != candidates_memos.end()) {
```

to

```cpp
if (auto cache_hit2 = candidates_memos.find(candidates_hash); cache_hit2 != candidates_memos.end()) {
```

and line 911 as well.

Other than that, seems to be fine.

Reithan (Author) commented Jun 25, 2025

Interesting. I guess it's just compiler differences. I've always made a habit of specifying & or * for auto types, just in case. But if your compiler won't accept it, that's fine; I'll remove those.
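For context, a minimal repro of the error: std::unordered_map::find returns its iterator by value, and a non-const lvalue reference cannot bind to that temporary. GCC rejects it, while MSVC has historically accepted the binding as a language extension, which would explain the difference between machines.

```cpp
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    // auto & bad = m.find(0);          // error on GCC: cannot bind a non-const
    //                                  // lvalue reference to an rvalue
    auto ok = m.find(0);                // fine: take the iterator by value
    const auto & also_ok = m.find(0);   // fine: const ref binds to the temporary
    return (ok == m.end() && also_ok == m.end()) ? 0 : 1;
}
```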

LostRuins (Owner) commented:

Merging. Do let me know if any issues arise (this goes for anyone else, too).

LostRuins merged commit 54dde5e into LostRuins:concedo_experimental on Jun 25, 2025