Add memoized cache to llama_grammar_reject_candidates_for_stack
#1615
Conversation
I think for a 2% (35 vs 36 t/s) change, this is probably not worth the additional complexity. The original speedup in the previous PR was very good, but this one seems marginal at best and probably makes debugging harder.
Sorry, you misunderstand: the picture compares no grammar vs. grammar, not no change vs. change.
Tested with doing the size check first and with lowering the byte cutoff; both resulted in roughly the same speed. tl;dr: this change makes complex grammars that normally get 'stuck' or slow down considerably run at 'normal' speed.
Force-pushed from daaf02a to 4510328
Do we never clear the cache? Will that be a problem? It seems like we might want to clear the cache upon each new generation.
I was debating that. In local use, not clearing the cache lets it reuse that speedup across every new generation, which helps quite a bit. But if we're running in horde mode, or changing grammars, or things like that, clearing the cache would be useful.
You accidentally committed version.txt
Force-pushed from 497f34c to 0fc4a88
By the way, it's not compiling on my own PC; I had to change lines 911 and 915 as well. Other than that, it seems to be fine.
Interesting. I guess it's just compiler differences. I've always made a habit of specifying
Merging. Do let me know if any issues arise (this goes for anyone else too).
This adds a memoized data cache to llama_grammar_reject_candidates_for_stack:
a. If the candidate list is under 1280 bytes, the cache is also checked for a hit so the function can return early.
b. If it wasn't a hit, the result of the function is cached.
This change completely eliminates 'hangs' from branching explosions in complex grammars in all my tests so far, and speeds up 'heavy' grammar processing to run at nearly the same speed as no grammar at all.
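The shape of the memoization described above can be sketched as below. This is a simplified, hypothetical illustration, not the PR's actual code: the real llama_grammar_candidate structures and rejection logic are replaced with plain strings, and the cache key here is a naive serialization of the (stack, candidates) pair. Only the structure matters: a byte-size cutoff (the PR uses roughly 1280 bytes), an early return on a cache hit, and caching the result on a miss.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Illustrative stand-ins for the real grammar types.
using Candidates = std::vector<std::string>;

// Memo cache keyed by a serialized (stack, candidates) pair.
// In the PR's discussion, this cache is intentionally not cleared
// between generations, so repeated prompts keep benefiting from it.
static std::unordered_map<std::string, Candidates> g_reject_cache;

// Stand-in for the expensive rejection logic: reject every candidate
// that does not start with the stack's expected prefix. The real
// function walks grammar stacks; this is illustration only.
static Candidates reject_candidates_uncached(const std::string &stack,
                                             const Candidates &cands) {
    Candidates rejected;
    for (const auto &c : cands) {
        if (c.rfind(stack, 0) != 0) rejected.push_back(c);
    }
    return rejected;
}

static Candidates reject_candidates_memoized(const std::string &stack,
                                             const Candidates &cands) {
    // Only consult/populate the cache for small candidate lists,
    // mirroring the PR's byte cutoff.
    size_t bytes = 0;
    for (const auto &c : cands) bytes += c.size();
    const bool cacheable = bytes < 1280;

    std::string key;
    if (cacheable) {
        // Naive key: stack and candidates joined with separator bytes.
        key = stack + '\x1f';
        for (const auto &c : cands) { key += c; key += '\x1e'; }
        auto it = g_reject_cache.find(key);
        if (it != g_reject_cache.end()) return it->second;  // cache hit: early return
    }

    // Cache miss (or too large to cache): compute, then store.
    Candidates result = reject_candidates_uncached(stack, cands);
    if (cacheable) g_reject_cache.emplace(std::move(key), result);
    return result;
}
```

A real implementation would hash the stack and candidate contents rather than concatenating them, and, per the discussion above, might expose a way to clear the cache when the grammar changes or when running in horde mode.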