
Add memoized cache to llama_grammar_reject_candidates_for_stack #1615


Conversation

Reithan commented Jun 22, 2025

This adds a memoized data cache to llama_grammar_reject_candidates_for_stack.

  1. The cache first checks the stack; this is a reasonably small check, since stacks are usually 20-60 bytes.
  2. If the stack is a hit, it checks the current candidate list size.
  • Candidate list sizes vary greatly: 50% are ~600 B or below and 75% are below 1280 B, but they max out around 72 KB.
  • Most 'problematic' recursion happens with lists of ~24 B.
    a. If the candidate list is <1280 B, it is also checked for a cache hit, to return early.
    b. If this wasn't a hit, the result of the function is cached. (A sketch of the scheme follows below.)

In all my tests so far, this change completely eliminates 'hangs' caused by branching explosions in complex grammars, and it speeds up 'heavy' grammar processing to run at nearly the same speed as no grammar at all.
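For concreteness, here is a minimal sketch of the scheme described above. All names here (hash_bytes, reject_candidates_uncached, reject_memos) are hypothetical stand-ins; the actual PR wires this into llama_grammar_reject_candidates_for_stack in llama-grammar.cpp, and the real identifiers and hash function may differ.

```cpp
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

// FNV-1a over the raw bytes of a vector (stand-in for whatever hash the PR uses).
static size_t hash_bytes(const std::vector<int> & v) {
    uint64_t h = 14695981039346656037ull;
    for (int x : v) {
        const unsigned char * p = reinterpret_cast<const unsigned char *>(&x);
        for (size_t i = 0; i < sizeof(x); ++i) { h ^= p[i]; h *= 1099511628211ull; }
    }
    return static_cast<size_t>(h);
}

// Stand-in for the original recursive rejection walk (placeholder body).
static std::vector<int> reject_candidates_uncached(const std::vector<int> & /*stack*/,
                                                   const std::vector<int> & candidates) {
    return candidates; // placeholder: the real function returns the rejected subset
}

// stack hash -> (candidate-list hash -> cached rejects)
static std::unordered_map<size_t, std::unordered_map<size_t, std::vector<int>>> reject_memos;

std::vector<int> reject_candidates_memoized(const std::vector<int> & stack,
                                            const std::vector<int> & candidates) {
    const size_t k_cutoff = 1280;                // only hash candidate lists smaller than this

    const size_t stack_hash = hash_bytes(stack); // cheap: stacks are ~20-60 bytes
    auto it = reject_memos.find(stack_hash);
    if (it != reject_memos.end() && candidates.size() * sizeof(int) < k_cutoff) {
        const size_t cand_hash = hash_bytes(candidates);
        auto hit = it->second.find(cand_hash);
        if (hit != it->second.end()) {
            return hit->second;                  // cache hit: skip the recursive walk entirely
        }
    }

    std::vector<int> rejects = reject_candidates_uncached(stack, candidates);
    if (candidates.size() * sizeof(int) < k_cutoff) {
        reject_memos[stack_hash][hash_bytes(candidates)] = rejects; // memoize for next time
    }
    return rejects;
}
```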

Reithan (Author) commented Jun 22, 2025

Example of grammar on vs. off with this change applied. This is with a fairly large and complex grammar block.

On: [screenshot]

Off: [screenshot]

LostRuins (Owner) commented:

I think for a 2% (35 vs 36) t/s change, this is probably not worth it for the additional complexity.

The original speedup in the previous PR was very good, but this one seems marginal at best and probably makes debugging harder.

LostRuins added the KIV for now (some issues prevent this from being merged) and needs review labels on Jun 22, 2025
Reithan (Author) commented Jun 22, 2025

> I think for a 2% (35 vs 36) t/s change, this is probably not worth it for the additional complexity.
>
> The original speedup in the previous PR was very good, but this one seems marginal at best and probably makes debugging harder.

Sorry, you misunderstand: the pictures compare no grammar vs. grammar, not no change vs. change.
Both pictures have the change applied.

Reithan (Author) commented Jun 23, 2025

The key problem being solved with this patch is "complex, slow grammars that cause a combinatorial explosion of stacks".

You can see this reliably with a test grammar like the one below, where each independently optional letter doubles the number of live parse branches, so a single segment fans out combinatorially:

```
# ── character classes ─────────────────────────────────────────────
U ::= [A-Z]        # one required upper-case letter
o ::= "" | [a-z]   # 18 independent “present or not” lower-case letters

# ── a single high-fan-out segment ─────────────────────────────────
segment ::= U o o o o o o o o o o o o o o o o o

# ── entry point: unlimited repetition of that segment ─────────────
root ::= segment+
```

On the un-patched branch, this runs at ~5 T/s on my setup: [screenshot]

With this patch, even setting the byte cutoff much lower, like 680 B, we see this: [screenshot]

Almost as fast as no grammar at all (35-37 T/s).

While this is more complex than no check at all, I think a simple 'hash and check' memoization scheme is pretty standard dynamic programming. This should be familiar to most engineers, so I don't worry too much about maintenance.

The key tuning knob would be the cutoff value. A high value like the current 1280 catches a great many branches, but also runs the hash very often, which may not be worth it, since most of those branches resolve quickly anyway.

A lower value like 640, or even less, will catch far fewer branches early, but will still stop the degenerate 'stuck' cases.

I could also reverse the checks and 'early-out' on candidate size before we even do the stack hash, to save even more time (sketched below).
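A minimal sketch of that reversed ordering, reusing the hypothetical reject_memos / hash_bytes names from the earlier sketch (not the exact PR code):

```cpp
// Reversed ordering: bail out on candidate-list size before paying for any
// hashing at all. Returns true and fills `out` only on a cache hit.
bool try_cache_lookup(const std::vector<int> & stack,
                      const std::vector<int> & candidates,
                      std::vector<int> & out) {
    const size_t k_cutoff = 1280;                    // tunable byte cutoff
    if (candidates.size() * sizeof(int) >= k_cutoff) {
        return false;                                // early-out: no hashing for large lists
    }
    auto it = reject_memos.find(hash_bytes(stack));  // stack hash only for small lists
    if (it == reject_memos.end()) {
        return false;
    }
    auto hit = it->second.find(hash_bytes(candidates));
    if (hit == it->second.end()) {
        return false;
    }
    out = hit->second;                               // cache hit
    return true;
}
```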

Reithan (Author) commented Jun 23, 2025

Tested doing the size check first and lowering the byte cutoff; both result in roughly the same speed.

tl;dr: this change makes complex grammars that normally get 'stuck' or slow down considerably run at 'normal' speed.
It does nothing for grammars that already run smoothly.

So, you get:

  • ~0% speed change in the normal case
  • 700%+ speedup in 'hard' cases

Reithan (Author) commented Jun 23, 2025

Actually, doing an aggressive size check first also speeds up some 'easy' cases, like the tool-call example: [screenshot]

40 T/s, up from 35 T/s.

I'll push that change.

Reithan force-pushed the add-memoized-cache-to-grammar-rejection branch from daaf02a to 4510328 on June 23, 2025 at 02:31
LostRuins (Owner) commented Jun 23, 2025

Do we never clear the cache? Will that be a problem? Seems like we might want to clear the cache upon each new gen.

Reithan (Author) commented Jun 23, 2025

> Do we never clear the cache? Will that be a problem? Seems like we might want to clear the cache upon each new gen.

I was debating that. In local use, not clearing the cache lets every new gen reuse the speedup, which helps quite a bit.

But if we're running in horde mode, or changing the grammar, or things like that, clearing the cache would be useful.

  • Testing memo cache reset now.

Reithan (Author) commented Jun 23, 2025

  • Added a cache reset tied to the current grammar-reload logic (sketched below).
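A minimal sketch of what such a reset amounts to, with a hypothetical hook name, reusing the reject_memos map from the earlier sketch:

```cpp
// Hypothetical hook: clear the memo cache whenever the grammar is (re)loaded,
// so cached rejection results can't leak across different grammars.
void on_grammar_reload() {
    reject_memos.clear(); // drop all memoized results; the cache warms up again per gen
}
```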

LostRuins (Owner) commented:

You accidentally committed version.txt

Reithan force-pushed the add-memoized-cache-to-grammar-rejection branch from 497f34c to 0fc4a88 on June 23, 2025 at 08:58
LostRuins removed the KIV for now label on Jun 24, 2025
LostRuins (Owner) commented:

Btw, it's not compiling on my own PC:

src/llama-grammar.cpp:915:58: error: cannot bind non-const lvalue reference of type 'std::__detail::_Node_iterator...

I had to change line 915 from

```cpp
if (auto & cache_hit2 = candidates_memos.find(candidates_hash); cache_hit2 != candidates_memos.end()) {
```

to

```cpp
if (auto cache_hit2 = candidates_memos.find(candidates_hash); cache_hit2 != candidates_memos.end()) {
```

and line 911 as well.

Other than that, seems to be fine.

Reithan (Author) commented Jun 25, 2025

Interesting. I guess it's just compiler differences. I've always made a habit of specifying & or * for auto types, just in case. But if your compiler won't accept it, that's fine; I'll remove those.
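For context, a minimal repro of the error: std::unordered_map::find returns its iterator by value, and a non-const lvalue reference cannot bind to that temporary. GCC rejects it, while MSVC has historically accepted the binding as a language extension, which would explain the difference between machines.

```cpp
#include <unordered_map>

int main() {
    std::unordered_map<int, int> m;
    // auto & bad = m.find(0);          // error on GCC: cannot bind a non-const
    //                                  // lvalue reference to an rvalue
    auto ok = m.find(0);                // fine: take the iterator by value
    const auto & also_ok = m.find(0);   // fine: const ref binds to the temporary
    return (ok == m.end() && also_ok == m.end()) ? 0 : 1;
}
```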

LostRuins (Owner) commented:

Merging. Do let me know if any issues arise (this goes for anyone else, too).

LostRuins merged commit 54dde5e into LostRuins:concedo_experimental on Jun 25, 2025