Closed
Description
Problem
Entities can have multiple names across sources (e.g. "Meridian Technologies" and "Meridian Tech"), but only one name is stored on Entity.name. Name similarity seeding uses only that single name, under-estimating similarity when the closest name pair isn't the one stored.
This is also a prerequisite for progressive merging (#25): when entities merge during propagation, the merged entity must retain all names from both sides so that subsequent name-similarity seeding remains accurate without ad-hoc canonical remapping.
Correct behaviour
Entities carry a list of names, all equally valid — no canonical/display name. Initial name similarity between two entities should be:
max(soft_tfidf(name_a, name_b) for name_a in entity_a.names for name_b in entity_b.names)
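The expression above can be sketched as a small helper. This is a minimal illustration, not the project's code: the function name name_similarity_seed is hypothetical, placeholder_similarity is a difflib-based stand-in for soft_tfidf (whose real signature is not shown in this issue), and returning 0.0 for an entity with no names is an assumption.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Callable, Sequence


def placeholder_similarity(name_a: str, name_b: str) -> float:
    """Stand-in scorer in [0, 1]; NOT soft_tfidf, used only so the sketch runs."""
    return SequenceMatcher(None, name_a.lower(), name_b.lower()).ratio()


def name_similarity_seed(
    names_a: Sequence[str],
    names_b: Sequence[str],
    sim: Callable[[str, str], float] = placeholder_similarity,
) -> float:
    """Best score over every (name_a, name_b) pair across the two entities."""
    # default=0.0 is an assumption for the case where either name list is empty.
    return max((sim(a, b) for a, b in product(names_a, names_b)), default=0.0)


# With only the stored name, the pair scores below 1.0; with all names
# available, the exact "Meridian Tech" match is found.
single = name_similarity_seed(["Meridian Technologies"], ["Meridian Tech"])
multi = name_similarity_seed(
    ["Meridian Technologies", "Meridian Tech"], ["Meridian Tech"]
)
```

Passing the scorer as a parameter keeps the max-over-pairs logic independent of whichever similarity function the project actually uses.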
Approach
- Replace Entity.name: str with Entity.names: list[str].
- Populate from all occurrence names at load time (for single-source graphs, names = [name]).
- In propagate_similarity, compute the name similarity seed as the max over all name pairs for each entity pair.
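The data-model side of these steps might look roughly like the following. It is a sketch under stated assumptions: the real Entity class has other fields not shown in this issue, and the helpers entity_from_occurrences and merge_entities are hypothetical names introduced only for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterable


@dataclass
class Entity:
    # All names are equally valid; there is no canonical or display name.
    names: list[str] = field(default_factory=list)

    def add_name(self, name: str) -> None:
        """Append a name unless it is already present (order is preserved)."""
        if name not in self.names:
            self.names.append(name)


def entity_from_occurrences(occurrence_names: Iterable[str]) -> Entity:
    """Hypothetical loader: collect every occurrence name onto one entity."""
    entity = Entity()
    for name in occurrence_names:
        entity.add_name(name)
    return entity


def merge_entities(a: Entity, b: Entity) -> Entity:
    """Hypothetical merge: the result keeps all names from both sides."""
    merged = Entity(names=list(a.names))  # copy so the inputs stay untouched
    for name in b.names:
        merged.add_name(name)
    return merged


# A single-source graph yields a one-element list, i.e. names = [name].
single_source = entity_from_occurrences(["Meridian Technologies"])
merged = merge_entities(single_source, Entity(names=["Meridian Tech"]))
```

Because a merged entity keeps the union of names, later seeding can take the max over all name pairs without any canonical remapping.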
Context
- Currently latent on single-pass matching (per-article graphs are single-source, single-name).
- Becomes critical for progressive merging, where merged entities accumulate names across epochs.
- Previously tracked as #3 ("Multi-label entities: use max similarity across all names during propagation"); recreated with updated scope.