[Feature] Memory Architecture: Phase 2 - Preprocessing, scoring, and entity consolidation #708

@senamakel

Summary

Design and implement the preprocessing and scoring phase for memory ingestion so OpenHuman can filter weak documents, merge duplicate entities, and prioritize the highest-value evidence before indexing.

Problem

Raw memory ingestion will contain a large amount of low-value or redundant material. If everything is indexed equally, retrieval quality will degrade, costs will rise, and higher-level summaries will be polluted by weak evidence.

The system needs a deterministic preprocessing layer that can score and prune data before it reaches the summary/index pipeline.

Constraints:

  • The scoring model should be explainable and debuggable.
  • Signals need to work without requiring a frontier model on every document.
  • Entity identity should survive cross-platform duplication.
  • The design should integrate with the existing OpenHuman memory/graph concepts rather than creating an isolated ranking system.

Solution (optional)

Build a preprocessing pipeline over normalized chunks/documents that evaluates multiple importance signals in parallel and computes a final importance score, which determines whether each item is retained or dropped.

Suggested scoring inputs from the proposed architecture:

  • entity extraction and structural NLP signals
  • metadata weighting by interaction type and source context
  • source-type scoring
  • boosts from existing memory/graph relevance and topic momentum
  • direct-interaction weighting
  • duplicate-entity resolution across platforms/accounts

The phase should emit both a keep/drop decision and transparent scoring metadata for downstream debugging.
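
As a rough illustration of the keep/drop decision with transparent scoring metadata, the sketch below combines per-signal scores into a weighted total and records each signal's contribution. Everything named here is an assumption made for the example: the signal names, the equal weights, the 0.5 threshold, and the `ScoreResult` and `score_chunk` identifiers are hypothetical and are not part of the proposal or of any existing OpenHuman API.

```python
from dataclasses import dataclass, field

# Hypothetical weights; the names and values are illustrative only.
WEIGHTS = {
    "metadata": 0.25,
    "source": 0.25,
    "graph_relevance": 0.25,
    "direct_interaction": 0.25,
}
KEEP_THRESHOLD = 0.5  # assumed configurable cutoff


@dataclass
class ScoreResult:
    score: float
    keep: bool
    components: dict[str, float] = field(default_factory=dict)
    rationale: list[str] = field(default_factory=list)


def score_chunk(signals: dict[str, float],
                threshold: float = KEEP_THRESHOLD) -> ScoreResult:
    """Combine per-signal scores in [0, 1] into a weighted importance score."""
    components: dict[str, float] = {}
    rationale: list[str] = []
    for name, weight in WEIGHTS.items():
        # Clamp to [0, 1]; a missing signal contributes nothing.
        raw = min(max(signals.get(name, 0.0), 0.0), 1.0)
        components[name] = weight * raw
        rationale.append(
            f"{name}: raw={raw:.2f} weight={weight:.2f} contrib={weight * raw:.3f}"
        )
    total = sum(components.values())
    keep = total >= threshold
    rationale.append(
        f"total={total:.3f} threshold={threshold:.2f} -> {'keep' if keep else 'drop'}"
    )
    return ScoreResult(total, keep, components, rationale)
```

A linear weighted sum is only one possible choice; it is used here because every component can be read directly from `components` and `rationale`, which fits the explainability constraint. The `rationale` lines could be written to a debug log as-is.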

Acceptance criteria

  • A preprocessing pipeline exists that runs after ingestion and before summary/index creation.
  • The pipeline computes an explainable importance score from multiple signals rather than a single heuristic.
  • Scoring signals include metadata/context weighting, source weighting, existing-graph relevance, and direct-interaction weighting.
  • The system supports entity extraction suitable for later linking and topic assignment.
  • Duplicate entities from different platforms/accounts can be resolved into a shared logical entity.
  • Documents/chunks below a configured threshold can be dropped before indexing.
  • The system stores scoring rationale or component scores for diagnostics and tuning.
  • Debug logging makes it possible to trace why an item was kept, boosted, merged, or dropped.
  • Tests cover threshold behavior, entity merge behavior, and representative high-value vs low-value cases.
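
The duplicate-entity criterion above could be approached, in its simplest form, by grouping platform records that share a normalized identifier. The sketch below uses a union-find structure so that merges are transitive. The record shape (`platform`, `account`, `identifiers`), the `normalize` rule, and the `consolidate` function are assumptions made for this example; exact-match linking is a deliberate simplification, and fuzzy or probabilistic matching is out of scope here.

```python
from collections import defaultdict


def normalize(kind: str, value: str) -> tuple[str, str]:
    """Canonicalize an identifier so trivially different spellings compare equal."""
    return kind, value.strip().lower()


def consolidate(records: list[dict]) -> list[dict]:
    """Group platform records that share at least one normalized identifier."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Map each identifier to the first record that carried it.
    first_seen: dict[tuple[str, str], int] = {}
    for idx, rec in enumerate(records):
        for kind, value in rec["identifiers"]:
            key = normalize(kind, value)
            if key in first_seen:
                # Union the two records' sets.
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups: dict[int, list[int]] = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)

    entities = []
    for members in groups.values():
        entities.append({
            "members": [(records[i]["platform"], records[i]["account"]) for i in members],
            "identifiers": sorted(
                {normalize(k, v) for i in members for k, v in records[i]["identifiers"]}
            ),
        })
    return entities
```

For example, a Slack record with email `Ada@Example.com`, a GitHub record with the same email plus handle `ada-l`, and a Discord record with handle `ADA-L` would end up in one logical entity, even though the Slack and Discord records share no identifier directly. A test for entity merge behavior could assert exactly this kind of transitive case.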

Status: Done
