Multi-label entities: use max similarity across all names during propagation #27

@monneyboi

Problem

Entities can have multiple names across sources (e.g. "Meridian Technologies" and "Meridian Tech"), but only one name is stored on Entity.name. Name similarity seeding uses only that single name, so it underestimates similarity whenever the closest-matching name pair does not involve the stored name.

This is also a prerequisite for progressive merging (#25): when entities merge during propagation, the merged entity must retain all names from both sides so that subsequent name-similarity seeding remains accurate without ad-hoc canonical remapping.
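
To illustrate the merge requirement, here is a minimal sketch that assumes entities already carry a names list. The Entity shape, the merge_entities helper, and the choice to keep the left-hand id are all hypothetical and not taken from the codebase or from #25.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    # Hypothetical minimal shape; the real Entity likely has more fields.
    id: str
    names: list[str] = field(default_factory=list)


def merge_entities(a: Entity, b: Entity) -> Entity:
    """Merge two entities, keeping every name from both sides."""
    # dict.fromkeys drops exact duplicates while preserving first-seen order.
    merged_names = list(dict.fromkeys(a.names + b.names))
    # Keeping a.id is an arbitrary choice made only for this sketch.
    return Entity(id=a.id, names=merged_names)
```

Because the merged entity holds the union of names, later seeding can compare against all of them directly, with no remapping to a canonical name.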

Correct behaviour

Entities carry a list of names, all equally valid; there is no canonical or display name. The initial name similarity between two entities should be the maximum over all cross-entity name pairs:

max(soft_tfidf(name_a, name_b) for name_a in entity_a.names for name_b in entity_b.names)

Approach

  1. Replace Entity.name: str with Entity.names: list[str].
  2. Populate from all occurrence names at load time (for single-source graphs, names = [name]).
  3. In propagate_similarity, compute name similarity seed as max over all name pairs for each entity pair.
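
The steps above could be sketched roughly as follows. This is an illustration under stated assumptions, not the project's implementation: the Entity shape and name_similarity_seed are hypothetical, and token_jaccard is a simple stand-in used only because the real soft_tfidf signature is not shown here.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable


@dataclass
class Entity:
    # Hypothetical minimal shape: names replaces the single name field.
    id: str
    names: list[str] = field(default_factory=list)


def token_jaccard(a: str, b: str) -> float:
    """Stand-in for soft_tfidf: Jaccard overlap of lowercased tokens."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def name_similarity_seed(
    entity_a: Entity,
    entity_b: Entity,
    sim: Callable[[str, str], float] = token_jaccard,
) -> float:
    """Return the best similarity over all cross-entity name pairs."""
    pairs = product(entity_a.names, entity_b.names)
    # default=0.0 covers entities that have no names at all.
    return max((sim(x, y) for x, y in pairs), default=0.0)
```

For a single-source graph each entity has exactly one name, so the max reduces to the current single-pair comparison. The cost grows with the product of the two name-list lengths, which should be small for typical entities.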

Context
