[Feature] Memory Architecture: Phase 2 - Preprocessing, scoring, and entity consolidation #708

## Summary

Design and implement the preprocessing and scoring phase for memory ingestion so that OpenHuman can filter out weak documents, merge duplicate entities, and prioritize the highest-value evidence before indexing.

## Problem

Raw ingested memory will contain a large amount of low-value or redundant material. If everything is indexed equally, retrieval quality will degrade, costs will rise, and higher-level summaries will be polluted by weak evidence.

The system needs a deterministic preprocessing layer that scores and prunes data before it reaches the summary/index pipeline.

Constraints:
- The scoring model should be explainable and debuggable.
- Signals need to work without requiring a frontier model on every document.
- Entity identity should survive cross-platform duplication.
- The design should integrate with the existing OpenHuman memory/graph concepts rather than creating an isolated ranking system.

## Solution (optional)

Build a preprocessing pipeline over normalized chunks/documents that evaluates multiple importance signals in parallel and combines them into a final importance score, which determines whether each item is retained or dropped.

Suggested scoring inputs from the proposed architecture:
- entity extraction and structural NLP signals
- metadata weighting by interaction type and source context
- source-type scoring
- boosts from existing memory/graph relevance and topic momentum
- direct-interaction weighting
- duplicate-entity resolution across platforms/accounts

The phase should emit both a keep/drop decision and transparent scoring metadata for downstream debugging.
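
To make the intended output shape concrete, here is a minimal sketch of how the signals above might be combined. It assumes a simple weighted sum; the signal names, weights, threshold, and the `ScoreResult` / `score_chunk` names are all hypothetical placeholders and not part of any existing OpenHuman API. Each signal is assumed to be pre-normalized to the range 0 to 1 by its own extractor.

```python
from dataclasses import dataclass, field

# Hypothetical signal names and weights, chosen only for illustration.
# They sum to 1.0 so the final score also lands in [0, 1].
WEIGHTS = {
    "entity_density": 0.20,      # entity extraction / structural NLP
    "metadata_context": 0.20,    # interaction type and source context
    "source_type": 0.15,         # source-type scoring
    "graph_relevance": 0.25,     # existing memory/graph relevance, topic momentum
    "direct_interaction": 0.20,  # direct-interaction weighting
}
KEEP_THRESHOLD = 0.40  # assumed to come from configuration in a real system


@dataclass
class ScoreResult:
    score: float
    keep: bool
    # Per-signal weighted contributions, stored for diagnostics and tuning.
    components: dict[str, float] = field(default_factory=dict)


def score_chunk(signals: dict[str, float],
                threshold: float = KEEP_THRESHOLD) -> ScoreResult:
    """Combine per-signal values into one score plus a keep/drop decision."""
    components = {}
    for name, weight in WEIGHTS.items():
        # Missing signals count as 0; out-of-range values are clamped.
        value = min(max(signals.get(name, 0.0), 0.0), 1.0)
        components[name] = round(weight * value, 4)
    total = round(sum(components.values()), 4)
    return ScoreResult(score=total, keep=total >= threshold,
                       components=components)


high = score_chunk({
    "entity_density": 0.8,
    "metadata_context": 0.9,
    "source_type": 0.7,
    "graph_relevance": 0.9,
    "direct_interaction": 1.0,
})
low = score_chunk({"entity_density": 0.1, "source_type": 0.2})

print(high.keep, high.components)
print(low.keep, low.components)
```

A weighted sum is only one option. Its appeal here is that every component contribution is visible in the result, which matches the explainability constraint, and it needs no model call per document.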

## Acceptance criteria

- [ ] A preprocessing pipeline exists that runs after ingestion and before summary/index creation.
- [ ] The pipeline computes an explainable importance score from multiple signals rather than a single heuristic.
- [ ] Scoring signals include metadata/context weighting, source weighting, existing-graph relevance, and direct-interaction weighting.
- [ ] The system supports entity extraction suitable for later linking and topic assignment.
- [ ] Duplicate entities from different platforms/accounts can be resolved into a shared logical entity.
- [ ] Documents/chunks below a configured threshold can be dropped before indexing.
- [ ] The system stores scoring rationale or component scores for diagnostics and tuning.
- [ ] Debug logging makes it possible to trace why an item was kept, boosted, merged, or dropped.
- [ ] Tests cover threshold behavior, entity merge behavior, and representative high-value vs low-value cases.
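
As an illustration of the duplicate-entity criterion above, the following sketch merges records that share any normalized identifier, using a small union-find. The record shape (`platform`, `account`, `identifiers`), the `consolidate` function, and the `entity-N` IDs are hypothetical. Matching here is exact after lowercasing and trimming, which is an assumption; a real resolver would likely need stronger normalization (for example for phone numbers) and possibly fuzzy matching.

```python
from collections import defaultdict


def normalize(identifier: str) -> str:
    # Deliberately simple: trim whitespace and lowercase.
    return identifier.strip().lower()


def consolidate(records: list[dict]) -> list[dict]:
    """Group records that share at least one identifier, transitively."""
    parent = list(range(len(records)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the lowest index as root so entity IDs are deterministic.
            parent[max(ra, rb)] = min(ra, rb)

    first_seen: dict[str, int] = {}
    for idx, rec in enumerate(records):
        for ident in rec["identifiers"]:
            key = normalize(ident)
            if key in first_seen:
                union(idx, first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)

    merged = []
    for root in sorted(groups):
        members = groups[root]
        merged.append({
            "entity_id": f"entity-{root}",
            "sources": [f'{records[i]["platform"]}:{records[i]["account"]}'
                        for i in members],
            "identifiers": sorted({normalize(x)
                                   for i in members
                                   for x in records[i]["identifiers"]}),
        })
    return merged


records = [
    {"platform": "slack", "account": "U123", "identifiers": ["Ada@Example.com"]},
    {"platform": "email", "account": "ada",
     "identifiers": ["ada@example.com", "+1-555-0100"]},
    {"platform": "whatsapp", "account": "ada-wa", "identifiers": ["+1-555-0100"]},
    {"platform": "slack", "account": "U999", "identifiers": ["bob@example.com"]},
]
entities = consolidate(records)
for entity in entities:
    print(entity["entity_id"], entity["sources"])
```

The first three records collapse into one logical entity even though the Slack and WhatsApp accounts share no identifier directly; they are linked through the email record. Keeping the `sources` list on the merged entity preserves provenance, so a merge decision can be traced later.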

## Related

- Parent issue: #711
- Depends on the ingestion architecture for normalized chunks and provenance metadata.

