benchmark: single-session-preference category underperforms at 66.7% #48

@salishforge

Description

Observation

LongMemEval single-session-preference questions score 66.7% R@5, the weakest category by 19 percentage points (the next lowest, single-session-assistant, scores 85.7%).

Analysis Needed

  1. Examine the 10 failing preference questions: what patterns do they share?
  2. Hypothesis: preference questions use implicit or indirect language that full-text search (FTS) struggles to match.
  3. Hypothesis: semantic search (with embeddings) may significantly improve this category.
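
One rough way to test the FTS hypothesis is to compare how many question terms appear in the gold evidence for hits versus misses. The sketch below is only an illustration: the record fields (`question`, `evidence`, `hit`), the stopword list, and the function names are assumptions, not part of this project's benchmark output format.

```python
import re
from statistics import mean

# Small illustrative stopword list (an assumption, not a project setting).
STOPWORDS = {
    "a", "an", "the", "i", "my", "me", "to", "for", "of", "and", "is",
    "what", "should", "can", "you", "do", "in", "on", "it", "its",
}


def tokens(text):
    """Lowercase alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def overlap(question, evidence):
    """Fraction of question terms that also occur in the evidence text."""
    q = tokens(question)
    if not q:
        return 0.0
    return len(q & tokens(evidence)) / len(q)


def mean_overlap_by_outcome(records):
    """Average lexical overlap for retrieved (hit) and missed questions.

    Each record is a dict with hypothetical keys:
    'question', 'evidence' (gold session text), and 'hit' (bool).
    """
    hits = [overlap(r["question"], r["evidence"]) for r in records if r["hit"]]
    misses = [overlap(r["question"], r["evidence"]) for r in records if not r["hit"]]
    return {
        "hit": mean(hits) if hits else None,
        "miss": mean(misses) if misses else None,
    }
```

If misses show clearly lower overlap than hits, that would support the idea that indirect wording is what keyword matching fails on.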

Action Items

  • Run benchmark with semantic and hybrid modes to compare
  • Analyze failing questions for common patterns
  • Consider adding preference-specific indexing or boosting
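
For the mode comparison, a minimal sketch of one possible hybrid strategy is shown below. It uses reciprocal rank fusion (RRF) to merge an FTS ranking with a semantic ranking; this is a common technique chosen for illustration, not necessarily how this project's hybrid mode works. The function names, the `k=60` constant, and the "any gold session in the top k" reading of R@5 are all assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60, top_n=5):
    """Merge ranked lists of session ids with reciprocal rank fusion.

    Each id scores sum(1 / (k + rank)) over the lists it appears in.
    Ties are broken by id so the output is deterministic.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    ordered = sorted(scores, key=lambda d: (-scores[d], d))
    return ordered[:top_n]


def recall_at_k(retrieved, gold, k=5):
    """1.0 if any gold session id is in the top k results, else 0.0.

    Assumed definition; the benchmark harness may count hits differently.
    """
    return 1.0 if any(d in gold for d in retrieved[:k]) else 0.0


# Hypothetical rankings for a single preference question.
fts_ranking = ["s1", "s2", "s3"]
semantic_ranking = ["s2", "s4"]
fused = rrf_fuse([fts_ranking, semantic_ranking])
```

Averaging `recall_at_k` per category for each mode (FTS only, semantic only, fused) would give the side-by-side comparison the first action item asks for.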

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), performance (Performance improvements)
