Add entity resolution to deduplicate and merge cross-source observations #250

@ravisuhag

Context

Compass stores entities from multiple sources as separate records. A Kafka topic ingested from two different systems creates two unrelated entities with different URNs. As a result the graph is fragmented: context assembly, impact analysis, and search all operate on disconnected duplicates.

Entity resolution is the mechanism that matches incoming observations against existing entities, merges properties, and maintains unified identity. This is the prerequisite for a coherent knowledge graph.

Scope

Tier 1: Exact URN Match

  • When an observation arrives with a URN that already exists, merge properties into the existing entity
  • Track provenance: which source contributed which properties
  • Idempotent: re-sending the same observation must not create duplicates or change the stored state
  • Upsert partially does this today, but without provenance tracking or a merge strategy
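
The Tier 1 behaviour above can be sketched roughly as follows. This is an illustrative in-memory model, not Compass code: `Entity`, `EntityStore`, `resolve_exact`, and the source names are hypothetical, and the sketch assumes that a replayed or unchanged value keeps its original provenance.

```python
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class Entity:
    urn: str
    properties: Dict[str, Any] = field(default_factory=dict)
    # Property name -> source that last changed it (hypothetical provenance shape).
    provenance: Dict[str, str] = field(default_factory=dict)


class EntityStore:
    """In-memory stand-in for the entity table, keyed by URN."""

    def __init__(self) -> None:
        self.entities: Dict[str, Entity] = {}

    def resolve_exact(self, urn: str, source: str, properties: Dict[str, Any]) -> Entity:
        # An existing URN is reused; an unknown URN creates exactly one entity.
        entity = self.entities.setdefault(urn, Entity(urn=urn))
        for key, value in properties.items():
            # Skip writes that change nothing, so replays leave state untouched.
            if key in entity.provenance and entity.properties.get(key) == value:
                continue
            entity.properties[key] = value
            entity.provenance[key] = source
        return entity


store = EntityStore()
store.resolve_exact("urn:kafka:orders", "source-a", {"partitions": 12})
store.resolve_exact("urn:kafka:orders", "source-b", {"owner": "payments"})
# Replaying the first observation changes neither properties nor provenance.
replayed = store.resolve_exact("urn:kafka:orders", "source-a", {"partitions": 12})
```

Skipping no-op writes is one way to get idempotency; a real implementation would also need to decide how concurrent writers and deletions are handled.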

Tier 2: Heuristic Matching

  • Match observations where URN differs but type + name + source pattern suggests the same logical entity
  • Configurable matching rules (e.g., "bigquery table names map to dbt model names via this pattern")
  • Candidate scoring with a confidence threshold
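
A minimal sketch of how rule-based candidate scoring might look. Everything here is an assumption made for illustration: the `Observation` and `MatchRule` shapes, the regex-based rule format, the 0.7/0.3 weights, and the 0.6 threshold are all hypothetical, and the example treats both records as type `table`.

```python
import re
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Observation:
    urn: str
    type: str
    name: str
    source: str


@dataclass(frozen=True)
class MatchRule:
    """Hypothetical rule: for one source, a regex whose first group is the canonical name."""
    source: str
    pattern: str


def canonical_name(obs: Observation, rules: List[MatchRule]) -> str:
    for rule in rules:
        if rule.source == obs.source:
            match = re.fullmatch(rule.pattern, obs.name)
            if match:
                return match.group(1).lower()
    return obs.name.lower()


def score(incoming: Observation, candidate: Observation, rules: List[MatchRule]) -> float:
    # Type equality acts as a gate: different types never match.
    if incoming.type != candidate.type:
        return 0.0
    total = 0.0
    if canonical_name(incoming, rules) == canonical_name(candidate, rules):
        total += 0.7  # illustrative weight
    if incoming.name.lower() == candidate.name.lower():
        total += 0.3  # illustrative weight
    return total


def best_match(
    incoming: Observation,
    candidates: List[Observation],
    rules: List[MatchRule],
    threshold: float,
) -> Optional[Tuple[Observation, float]]:
    scored = [(c, score(incoming, c, rules)) for c in candidates]
    scored = [pair for pair in scored if pair[1] >= threshold]
    # Return the highest-scoring candidate at or above the threshold, if any.
    return max(scored, key=lambda pair: pair[1]) if scored else None


# Rule: a BigQuery name "project.dataset.table" maps to the bare table name.
rules = [MatchRule(source="bigquery", pattern=r"[^.]+\.[^.]+\.(.+)")]
existing = [
    Observation("urn:dbt:orders", "table", "orders", "dbt"),
    Observation("urn:dbt:users", "table", "users", "dbt"),
]
incoming = Observation("urn:bq:orders", "table", "proj.sales.Orders", "bigquery")
result = best_match(incoming, existing, rules, threshold=0.6)
```

Keeping rules as data rather than code is what makes the matching configurable; the threshold decides when a candidate is trusted enough to merge.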

Tier 3: Semantic Similarity (follow-up)

  • Use embedding similarity to catch non-obvious matches
  • Only viable after the embedding pipeline has indexed sufficient entities
  • Should be a signal fed into Tier 2 scoring, not a standalone matcher
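
One way to treat embedding similarity as a signal rather than a standalone matcher is to blend it into the heuristic score, and to fall back to the heuristic score alone when no embedding is available yet. The function names and the 0.3 weight below are assumptions for illustration only.

```python
import math
from typing import Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector carries no direction, so treat it as no signal
    return dot / (norm_a * norm_b)


def combined_score(
    heuristic_score: float,
    similarity: Optional[float],
    semantic_weight: float = 0.3,  # illustrative weight, not a recommended value
) -> float:
    # Entities that have not been embedded yet contribute no semantic signal.
    if similarity is None:
        return heuristic_score
    clamped = max(0.0, similarity)  # negative cosine is treated as "no evidence"
    return (1.0 - semantic_weight) * heuristic_score + semantic_weight * clamped


sim = cosine_similarity([1.0, 0.0], [1.0, 0.0])
blended = combined_score(0.5, sim)
```

Because the semantic term is only a weighted contribution, a high similarity alone cannot push a pair over the threshold unless the weight is raised deliberately.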

Merge Strategy

  • When a match is found, merge properties from the new observation into the existing entity
  • Default: last-write-wins per field
  • Track which source contributed which properties (provenance)
  • Resolution audit log: record what was matched, merged, and why
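
The merge strategy could be sketched as a pure function that applies last-write-wins per field and returns an audit entry describing what changed. The `FieldChange` and `AuditEntry` shapes, and the `reason` strings, are hypothetical; the issue does not specify an audit log format.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional, Tuple


@dataclass(frozen=True)
class FieldChange:
    field: str
    old_value: Any
    new_value: Any
    old_source: Optional[str]
    new_source: str


@dataclass(frozen=True)
class AuditEntry:
    urn: str
    source: str
    reason: str  # e.g. "exact_urn" or "heuristic:0.70" (illustrative values)
    changes: Tuple[FieldChange, ...]


def merge_last_write_wins(
    urn: str,
    properties: Dict[str, Any],
    provenance: Dict[str, str],
    incoming: Dict[str, Any],
    source: str,
    reason: str,
) -> Tuple[Dict[str, Any], Dict[str, str], AuditEntry]:
    # Work on copies so the caller's state is only replaced, never mutated in place.
    merged = dict(properties)
    merged_provenance = dict(provenance)
    changes = []
    for key in sorted(incoming):
        value = incoming[key]
        if key in merged and merged[key] == value:
            continue  # unchanged field: keep the existing value and its provenance
        changes.append(
            FieldChange(key, merged.get(key), value, merged_provenance.get(key), source)
        )
        merged[key] = value  # last write wins for this field
        merged_provenance[key] = source
    return merged, merged_provenance, AuditEntry(urn, source, reason, tuple(changes))


props, prov, audit = merge_last_write_wins(
    "urn:bq:orders",
    {"owner": "team-a", "rows": 1},
    {"owner": "source-a", "rows": "source-a"},
    {"owner": "team-b", "rows": 1, "description": "Orders table"},
    source="source-b",
    reason="exact_urn",
)
```

Replaying the same observation against the merged state yields an audit entry with no changes, which keeps the log consistent with the idempotency requirement.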

Design Considerations

  • Resolution must be idempotent
  • Meteor sends raw observations and Compass resolves them; keep the interface between the two simple
  • Start with Tier 1 (exact URN match with provenance). Ship it. Tier 2 and 3 are follow-ups.
  • Graph-aware ranking (#237, "Add graph-aware ranking to search results") depends on a coherent, deduplicated graph, so entity resolution should ship first
