Multi-label entities: use max similarity across all names #28
Entities can carry multiple names (e.g. "Meridian Technologies" and "Meridian Tech"). Name similarity seeding now computes max(soft_tfidf) across all name pairs, ensuring the closest match is always used.

- Node.name: str → Node.names: list[str]
- Graph.add_entity() accepts str or list[str]
- IDF built from all names across all entities
- Graph I/O serializes names list, loads legacy single-name format
- Functionality pooling uses first name as representative

Closes #27

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
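As a rough illustration of the max-over-name-pairs idea in the commit message above, here is a minimal sketch. It is not the code from this PR: `as_names`, `token_jaccard`, and `max_name_similarity` are hypothetical names, and `token_jaccard` is a simple stand-in for the project's `soft_tfidf`, whose signature is not shown here.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Union


@dataclass
class Node:
    # Simplified stand-in for the project's Node; only the names field is modelled.
    names: list[str]


def as_names(name: Union[str, list[str]]) -> list[str]:
    """Normalise a single name or a list of names to a list (hypothetical helper)."""
    return [name] if isinstance(name, str) else list(name)


def token_jaccard(a: str, b: str) -> float:
    """Stand-in similarity, NOT soft_tfidf: Jaccard overlap of lowercase tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def max_name_similarity(
    a: Node, b: Node, sim: Callable[[str, str], float] = token_jaccard
) -> float:
    """Return the best similarity over every (name of a, name of b) pair."""
    return max((sim(x, y) for x, y in product(a.names, b.names)), default=0.0)


meridian = Node(as_names(["Meridian Technologies", "Meridian Tech"]))
other = Node(as_names("Meridian Tech"))
best = max_name_similarity(meridian, other)  # the exact-match pair dominates
```

Taking the maximum means one close alias is enough to seed a strong similarity, even when the other names of the two entities differ.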
One issue worth fixing: backward-compatibility shim for legacy single-name JSON format.
load_graph (graph.py:52-56) adds fallback logic to handle the old "name" key:
    raw_names = node_data.get("names") or [node_data["name"]]
    names=raw_names if isinstance(raw_names, list) else [raw_names],

There are no old-format graph files in the repo and no external users. Per CLAUDE.md conventions ("never add backward-compatibility shims... Refactor completely"), this should just read node_data["names"] directly. The test_load_legacy_single_name_format test should be removed as well, since it only tests the shim.
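A minimal sketch of what the shim-free read might look like, assuming each serialized node is a dict with a "names" list. `read_node_names` is a hypothetical helper, not a function from graph.py:

```python
def read_node_names(node_data: dict) -> list[str]:
    """Read the names list directly, with no fallback to a legacy "name" key."""
    # A legacy single-name record has no "names" key, so this raises KeyError
    # instead of being silently converted.
    return list(node_data["names"])


names = read_node_names({"names": ["Meridian Technologies", "Meridian Tech"]})
```

With this shape, an old-format file fails loudly at load time, which is consistent with dropping the legacy test.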
Everything else looks correct: the max-over-name-pairs seeding, IDF over all names, names[0] for functionality phrase pairs (same semantics as the old .name), and the union-find display. Clean refactor otherwise.
monneyboi
left a comment
Remove the backward-compatibility shim in load_graph — just read node_data["names"] directly. Delete test_load_legacy_single_name_format as well.
Per review: no old-format files exist and no external users, so the fallback in load_graph and its test are unnecessary shims.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Looks good. All 57 tests pass. The Node.name → Node.names refactor is applied consistently across graph.py, match.py, and all tests. Name similarity seeding correctly takes the max over all name pairs, IDF includes all names, and the backward-compat shim was properly removed per CLAUDE.md conventions. No dead code or half-finished refactors.
Summary
- Replace Node.name: str with Node.names: list[str] so entities can carry multiple names (e.g. "Meridian Technologies" and "Meridian Tech")
- propagate_similarity now computes max(soft_tfidf(a, b)) across all name pairs, ensuring the closest match is always used

Context
Prerequisite for progressive merging (#25): when entities merge during propagation, the merged entity retains all names from both sides so subsequent name-similarity seeding remains accurate.
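To make the name-retention idea concrete, here is a small sketch of one way merged names could be combined. It assumes that order should be preserved and exact duplicates dropped; `merge_names` is a hypothetical helper, not part of this PR or of #25:

```python
def merge_names(names_a: list[str], names_b: list[str]) -> list[str]:
    """Union of both name lists, keeping first-seen order and dropping exact duplicates."""
    # dict.fromkeys preserves insertion order, so names_a[0] stays first.
    return list(dict.fromkeys([*names_a, *names_b]))


merged = merge_names(
    ["Meridian Technologies"],
    ["Meridian Tech", "Meridian Technologies"],
)
```

Keeping the first side's order means names[0] is unchanged after the merge, which matters if the first name continues to serve as the representative for functionality pooling.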
Test plan
Closes #27
🤖 Generated with Claude Code