Recommended eval approach for measuring Hindsight memory quality vs a prior memory layer? #2022

nishad-kane · 2026-06-05T18:01:19Z

nishad-kane
Jun 5, 2026

Hey everyone, I'm a software engineer at ReFiBuy.ai (Agentic Commerce) working on LLM evaluation and the agent memory layer. We recently brought Hindsight in to replace our existing memory system, wiring it through the MCP server. I've tested all the core MCP tool calls (retain, recall, reflect, observations) and they're working great so far. Really clean integration.

Now I want to move past "it works" and actually quantify the win. Specifically I'm trying to measure three things: output quality (does recall surface the right facts at the right time), token cost reduction (are we injecting less context per turn than the old layer did), and multi-hop retrieval (can it chain facts across threads and sessions to answer something neither fact answers alone). Right now I've stood this up in a Promptfoo harness with test cases using LLM-rubric assertions, and that gets me directional signal, but it feels hand-rolled for a system as specific as Hindsight.

So the question: is there a recommended eval suite or methodology for measuring Hindsight's capabilities, ideally something I could point at an A/B comparison against a legacy memory backend? I'm thinking about things like recall precision/recall on a known fact set, retain-then-recall-across-threads correctness, multi-hop reasoning over retained facts, and token-per-turn deltas.
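To make the metrics in this question concrete, here is a minimal sketch of set-based precision/recall on a hand-labelled fact set plus a token-per-turn delta between two backends. The `RecallResult` shape, the fact IDs, and the `compare` helper are all hypothetical names for illustration; nothing here is a Hindsight API, and each backend's output would need to be mapped into this shape by your own adapter.

```python
from dataclasses import dataclass


@dataclass
class RecallResult:
    """What one backend returned for one query (hypothetical shape)."""
    fact_ids: set[str]     # IDs of the facts the backend surfaced
    context_tokens: int    # tokens injected into the prompt for this turn


def precision_recall(returned: set[str], expected: set[str]) -> tuple[float, float]:
    """Set-based precision and recall against a hand-labelled gold fact set."""
    hits = len(returned & expected)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(expected) if expected else 1.0
    return precision, recall


def compare(legacy: RecallResult, candidate: RecallResult, expected: set[str]) -> dict:
    """A/B one query: per-backend precision/recall plus the token-per-turn delta."""
    lp, lr = precision_recall(legacy.fact_ids, expected)
    cp, cr = precision_recall(candidate.fact_ids, expected)
    return {
        "legacy": {"precision": lp, "recall": lr},
        "candidate": {"precision": cp, "recall": cr},
        # Positive means the candidate injects more context than the legacy layer.
        "token_delta": candidate.context_tokens - legacy.context_tokens,
    }
```

For a multi-hop case, `expected` would hold the facts from each thread or session that the answer depends on, so a recall below 1.0 flags that at least one hop was never surfaced.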

If there's a canonical benchmark, a fixture dataset, or even just patterns other teams have used to validate their migration, I'd love a pointer. Happy to share back whatever I build if it's useful to others.

Thanks!

cdbartholomew · 2026-06-05T18:33:07Z

cdbartholomew
Jun 5, 2026
Maintainer

@nishad-kane We benchmark Hindsight using various standard benchmarks and document the results here: https://agentmemorybenchmark.ai/

Benchmarks like LongMemEval and BEAM are specifically designed to test things like precision/recall and multi-hop reasoning. Is that what you are looking for?

0 replies

nishad-kane · 2026-06-05T19:20:29Z

nishad-kane
Jun 5, 2026
Author

Thanks @cdbartholomew, that's a helpful resource! Precision/recall and multi-hop are among the core quantified metrics I'm after, so LongMemEval and BEAM are spot on.

One thing jumped out while digging through the LongMemEval split: Hindsight lands 94.6% vs 74% for hybrid-search and more than halves recall latency (700ms vs 1600ms), but it actually pulls more context tokens (43.6k vs 23.2k). That's a useful reframing for me. I came in assuming the win was token reduction, but the real story looks like accuracy and recall speed at nearly double the context budget.

Is that framing right? It would be good to know before I write up the comparison internally, so I don't oversell the wrong axis.
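For reference, the arithmetic behind that framing can be worked through in a few lines. The only inputs are the figures quoted in this comment; the variable names are just for illustration.

```python
# Figures quoted in the comment above (Hindsight vs hybrid-search).
hindsight = {"accuracy": 94.6, "latency_ms": 700, "context_tokens": 43_600}
hybrid = {"accuracy": 74.0, "latency_ms": 1600, "context_tokens": 23_200}

accuracy_gain_pts = hindsight["accuracy"] - hybrid["accuracy"]
latency_ratio = hindsight["latency_ms"] / hybrid["latency_ms"]
token_ratio = hindsight["context_tokens"] / hybrid["context_tokens"]

print(f"accuracy: +{accuracy_gain_pts:.1f} points")         # accuracy: +20.6 points
print(f"latency:  {latency_ratio:.2f}x of hybrid-search")   # latency:  0.44x of hybrid-search
print(f"tokens:   {token_ratio:.2f}x of hybrid-search")     # tokens:   1.88x of hybrid-search
```

So on these numbers the trade is about 20.6 accuracy points and under half the latency, in exchange for roughly 1.9x the context tokens.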

2 replies

cdbartholomew Jun 5, 2026
Maintainer

I am pretty sure we run these benchmarks with the maximum token budget on recall, because that's how competitors typically run them. You can lower the token budget (from high to low) to change the tradeoff between accuracy and context budget. This will have some impact on accuracy, but I don't think it drops off a cliff. @nicoloboschi will know better than me whether we have run the benchmarks with lower recall budgets and what the effect is.

I think another useful framing is the quality of recall as the dataset grows. That's what the BEAM benchmark shows. Top-K semantic search (RAG) doesn't perform very well as you get more data, but the BEAM results show that Hindsight outperforms it substantially on large datasets.

nishad-kane Jun 8, 2026
Author

Thanks for the detailed breakdown @cdbartholomew, both points land.

The max-token-budget detail is the one I'd have missed. Running recall at the high end makes sense for apples-to-apples competitor comparisons, but for us the budget knob is exactly what I want to sweep. We're token-sensitive per turn, so dialing recall from high to low and watching where accuracy starts to drop is more useful than a single number. I'd rather find the cliff deliberately than ship at max and eat the context cost everywhere.
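A minimal sketch of that sweep, assuming you wrap your own recall call and grader behind a single function. The `RecallFn` signature, the budget labels passed in, and both helper names are hypothetical; the thread only establishes that the budget can go from high to low, so any intermediate level is an assumption to check against the Hindsight docs.

```python
from typing import Callable, Iterable

# Hypothetical adapter: (query, budget) -> (answer_is_correct, context_tokens).
# Wire this to your own recall call plus whatever grader you already use.
RecallFn = Callable[[str, str], tuple[bool, int]]


def sweep_budgets(recall_fn: RecallFn, queries: list[str], budgets: Iterable[str]) -> list[dict]:
    """Run every query at each budget level; report accuracy and mean tokens."""
    if not queries:
        raise ValueError("need at least one query")
    rows = []
    for budget in budgets:
        results = [recall_fn(q, budget) for q in queries]
        correct = sum(1 for ok, _ in results if ok)
        tokens = sum(t for _, t in results)
        rows.append({
            "budget": budget,
            "accuracy": correct / len(results),
            "mean_tokens": tokens / len(results),
        })
    return rows


def largest_drop(rows: list[dict]) -> tuple[str, str, float]:
    """Return the adjacent budget pair with the biggest accuracy drop.

    Expects at least two rows, ordered from highest budget to lowest.
    """
    drops = [
        (a["budget"], b["budget"], a["accuracy"] - b["accuracy"])
        for a, b in zip(rows, rows[1:])
    ]
    return max(drops, key=lambda d: d[2])
```

Reading `accuracy` next to `mean_tokens` per row gives the tradeoff curve, and `largest_drop` points at the step where the cliff (if there is one) sits.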

Really appreciate the time.

jeffreyflynt · 2026-06-13T20:22:22Z

jeffreyflynt
Jun 13, 2026

One additional eval layer that may be useful here is PrecisionMemBench.

I wrote it to measure the memory-retrieval substrate directly, before final-answer quality hides what happened underneath. That makes it complementary to LongMemEval/BEAM/LoCoMo-style evals rather than a replacement for them.

The reason I think it may fit this discussion is that Hindsight already has a clean architectural separation between retain, recall, and reflect. PrecisionMemBench is aimed at the recall side of that split:

did the current fact return?
did stale/superseded facts stay out?
did conflicting memories resolve correctly?
did scope hold across users/threads?
did retrieval return only the relevant memory, or a larger context blob where the right fact is merely present?

I also have a Hindsight wrapper/result in the PrecisionMemBench repo already. In that run, Hindsight showed perfect recall on the single-turn retrieval set but lower precision, which is exactly the distinction this benchmark is meant to surface: the correct belief may be present, but surrounded by unrelated or superseded context.

That seems relevant to the token-budget discussion above. If Hindsight can vary recall budget, PrecisionMemBench gives a way to sweep that budget and score memory precision/over-retrieval directly, independently from final answer scoring.
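The checks listed above (current fact returned, stale facts kept out, scope held, only relevant memory returned) can be illustrated with a small scorer. This is a generic sketch and not PrecisionMemBench's actual code or schema: the `Case` fields, the memory IDs, and the `score` function are hypothetical, and it assumes each returned memory can be mapped to a labelled ID.

```python
from dataclasses import dataclass, field


@dataclass
class Case:
    """One retrieval probe with hand-labelled memory IDs (hypothetical schema)."""
    current: set[str]                                     # facts that should come back
    superseded: set[str] = field(default_factory=set)     # stale facts that must stay out
    out_of_scope: set[str] = field(default_factory=set)   # other users'/threads' memories


def score(case: Case, returned: list[str]) -> dict:
    """Score one retrieval result independently of any final-answer grading."""
    got = set(returned)
    relevant = got & case.current
    return {
        "current_returned": case.current <= got,           # every current fact is present
        "stale_leaked": sorted(got & case.superseded),     # superseded facts that slipped in
        "scope_leaked": sorted(got & case.out_of_scope),   # cross-user/thread leakage
        "precision": len(relevant) / len(got) if got else 0.0,
    }
```

A result can have `current_returned` true and still score low on `precision`, which is the "right fact merely present in a larger blob" case described above.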

There are a couple public examples/signals around this direction already:

gbrain has a public memory eval repo: https://github.com/garrytan/gbrain-evals
YourMemory added a direct PrecisionMemBench integration/results commit: sachitrafa/YourMemory@ab0d164

Repo here if useful: https://github.com/tenurehq/precisionMemBench

0 replies
