Recommended eval approach for measuring Hindsight memory quality vs a prior memory layer? #2022
Replies: 3 comments 2 replies
-
|
@nishad-kane We benchmark Hindsight using various standard benchmarks and document the results here: https://agentmemorybenchmark.ai/ Benchmarks like LongMemEval and BEAM are specifically designed to test things like precision/recall and multi-hop reasoning. Is that what you are looking for? |
-
|
Thanks @cdbartholomew, that's a great resource! Precision/recall and multi-hop are among the core quantified metrics I'm after, so LongMemEval and BEAM are spot on. One thing jumped out while digging through the LongMemEval split: Hindsight lands 94.6% vs 74% for hybrid-search and more than halves recall latency (700ms vs 1600ms), but it actually pulls more context tokens (43.6k vs 23.2k). That's a useful reframing for me. I came in assuming the win was token reduction, but the real story looks like accuracy and recall speed at a higher context budget (nearly 2x the tokens). Correct me if I'm framing this wrong. |
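For anyone who wants to sanity-check that framing, here is a quick sketch of the arithmetic. It uses only the figures quoted in this comment; the dictionary keys and the `compare` helper are names made up for illustration.

```python
# Figures quoted above from the LongMemEval split (Hindsight vs hybrid-search).
hindsight = {"accuracy": 0.946, "latency_ms": 700, "context_tokens": 43_600}
hybrid = {"accuracy": 0.74, "latency_ms": 1600, "context_tokens": 23_200}


def compare(new, old):
    """Return the accuracy gain in percentage points plus latency and token ratios."""
    return {
        "accuracy_gain_pts": round((new["accuracy"] - old["accuracy"]) * 100, 1),
        "latency_ratio": round(new["latency_ms"] / old["latency_ms"], 2),
        "token_ratio": round(new["context_tokens"] / old["context_tokens"], 2),
    }


summary = compare(hindsight, hybrid)
print(summary)
```

A latency ratio below 1 means faster recall, while a token ratio above 1 means more context is injected per query, which is the trade-off described above.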
-
|
One additional eval layer that may be useful here is PrecisionMemBench. I wrote it to measure the memory-retrieval substrate directly, before final-answer quality hides what happened underneath. That makes it complementary to LongMemEval/BEAM/LoCoMo-style evals rather than a replacement for them. The reason I think it may fit this discussion is that Hindsight already has a clean architectural separation between retain, recall, and reflect. PrecisionMemBench is aimed at the recall side of that split:
I also have a Hindsight wrapper/result in the PrecisionMemBench repo already. In that run, Hindsight showed perfect recall on the single-turn retrieval set but lower precision, which is exactly the distinction this benchmark is meant to surface: the correct belief may be present, but surrounded by unrelated or superseded context. That seems relevant to the token-budget discussion above. If Hindsight can vary its recall budget, PrecisionMemBench gives a way to sweep that budget and score memory precision/over-retrieval directly, independently of final-answer scoring. There are a couple of public examples/signals around this direction already:
Repo here if useful: https://github.com/tenurehq/precisionMemBench |
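To make the budget-sweep idea concrete, here is a minimal sketch. It is not PrecisionMemBench's or Hindsight's actual API: `recall_fn`, `fake_recall`, the memory IDs, and the interpretation of "budget" as a count of returned memories are all assumptions for illustration.

```python
from typing import Callable, Iterable


def precision_recall(retrieved: Iterable[str], relevant: Iterable[str]) -> tuple[float, float]:
    """Set-based precision and recall over memory IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def sweep_budget(recall_fn: Callable[[str, int], list[str]], query: str,
                 relevant: set[str], budgets: list[int]) -> list[dict]:
    """Score retrieval quality at each budget, independent of any final answer."""
    rows = []
    for budget in budgets:
        p, r = precision_recall(recall_fn(query, budget), relevant)
        rows.append({"budget": budget, "precision": p, "recall": r})
    return rows


# Hypothetical stand-in backend: returns the top-`budget` IDs from a fixed ranking.
RANKED = ["m1", "m7", "m2", "m9", "m4"]


def fake_recall(query: str, budget: int) -> list[str]:
    return RANKED[:budget]


rows = sweep_budget(fake_recall, "example query", {"m1", "m2"}, [1, 3, 5])
for row in rows:
    print(row)
```

In this toy run recall saturates at a budget of 3 while precision keeps falling as the budget grows, which is the over-retrieval pattern described above.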
-
Hey everyone, I'm a software engineer at ReFiBuy.ai (Agentic Commerce) working on LLM evaluation and the agent memory layer. We recently brought Hindsight in to replace our existing memory system, wiring it through the MCP server. I've tested all the core MCP tool calls (retain, recall, reflect, observations) and they're working great so far; it's a really clean integration.
Now I want to move past "it works" and actually quantify the win. Specifically I'm trying to measure three things: output quality (does recall surface the right facts at the right time), token cost reduction (are we injecting less context per turn than the old layer did), and multi-hop retrieval (can it chain facts across threads and sessions to answer something neither fact answers alone). Right now I've stood this up in a Promptfoo harness with test cases using LLM-rubric assertions, and that gets me directional signal, but it feels hand-rolled for a system as specific as Hindsight.
So the question: is there a recommended eval suite or methodology for measuring Hindsight's capabilities, ideally something I could point at an A/B comparison against a legacy memory backend? I'm thinking about things like precision/recall of the recall tool on a known fact set, retain-then-recall-across-threads correctness, multi-hop reasoning over retained facts, and token-per-turn deltas.
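To make the ask concrete, this is roughly the shape of A/B harness I have in mind. It's only a sketch: the fixture, both backends, and the fact IDs are hypothetical stand-ins (not Hindsight's API), and whitespace word counts are a crude proxy for real tokenizer counts.

```python
from statistics import mean

# Hypothetical fixture: each case lists the gold fact IDs a good recall should surface.
FIXTURE = [
    {"query": "Which carrier does the customer prefer?", "gold": {"f1"}},
    # Multi-hop style case: needs two facts retained in different sessions.
    {"query": "What did they order after switching carriers?", "gold": {"f1", "f3"}},
]


def score_backend(recall_fn, fixture):
    """Average precision, recall, and injected tokens per turn for one backend."""
    precisions, recalls, tokens = [], [], []
    for case in fixture:
        memories = recall_fn(case["query"])  # list of (fact_id, text) pairs
        ids = {fact_id for fact_id, _ in memories}
        hits = len(ids & case["gold"])
        precisions.append(hits / len(ids) if ids else 0.0)
        recalls.append(hits / len(case["gold"]))
        # Assumption: whitespace word count stands in for a real token count.
        tokens.append(sum(len(text.split()) for _, text in memories))
    return {"precision": mean(precisions), "recall": mean(recalls),
            "tokens_per_turn": mean(tokens)}


# Stand-in "legacy" backend: returns the same broad context for every query.
def legacy_recall(query):
    return [("f1", "prefers carrier A"), ("f2", "likes blue"),
            ("f9", "old address on file")]


# Stand-in "new" backend: returns only the facts relevant to each query.
NEW_RESULTS = {
    FIXTURE[0]["query"]: [("f1", "prefers carrier A")],
    FIXTURE[1]["query"]: [("f1", "prefers carrier A"), ("f3", "ordered running shoes")],
}


def new_recall(query):
    return NEW_RESULTS.get(query, [])


legacy = score_backend(legacy_recall, FIXTURE)
new = score_backend(new_recall, FIXTURE)
token_delta = new["tokens_per_turn"] - legacy["tokens_per_turn"]
print(legacy, new, token_delta)
```

Swapping the two stand-in functions for real calls into each memory backend would give the same three numbers per side, so the comparison stays like for like.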
If there's a canonical benchmark, a fixture dataset, or even just patterns other teams have used to validate their migration, I'd love a pointer. Happy to share back whatever I build if it's useful to others.
Thanks!