Multi-label entities: use max similarity across all names #28

Merged: monneyboi merged 2 commits into main from multi-label-entities on Mar 27, 2026

@claude claude bot commented Mar 27, 2026

Summary

  • Replaces Node.name: str with Node.names: list[str] so entities can carry multiple names (e.g. "Meridian Technologies" and "Meridian Tech")
  • Name similarity seeding in propagate_similarity now computes max(soft_tfidf(a, b)) across all name pairs, ensuring the closest match is always used
  • IDF computation includes all names from all entities
  • Graph I/O serializes the names list and loads legacy single-name format for backward compatibility
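
The max-over-name-pairs seeding described above can be sketched as follows. This is a minimal illustration, not the code from this PR: the real soft_tfidf is not shown in this conversation, so `stand_in_similarity` is a placeholder built on the standard library, and `max_name_similarity` is a hypothetical helper name.

```python
from difflib import SequenceMatcher
from itertools import product
from typing import Callable


def stand_in_similarity(a: str, b: str) -> float:
    """Placeholder for soft_tfidf (assumption): a character-level ratio in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def max_name_similarity(
    names_a: list[str],
    names_b: list[str],
    similarity: Callable[[str, str], float] = stand_in_similarity,
) -> float:
    """Return the best score over every (name_a, name_b) pair."""
    # default=0.0 covers the case where either entity has no names.
    return max(
        (similarity(a, b) for a, b in product(names_a, names_b)),
        default=0.0,
    )


left = ["Meridian Technologies", "Meridian Tech"]
right = ["Meridian Tech Inc."]
best = max_name_similarity(left, right)
```

Because the maximum is taken over all pairs, adding a name to an entity can only raise or preserve its seed score against another entity, never lower it.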

Context

Prerequisite for progressive merging (#25): when entities merge during propagation, the merged entity retains all names from both sides so subsequent name-similarity seeding remains accurate.
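
One way the "retains all names from both sides" behavior could look is an order-preserving union. The sketch below is an assumption about how progressive merging (#25) might combine name lists; `merge_names` is a hypothetical helper and is not part of this PR.

```python
def merge_names(names_a: list[str], names_b: list[str]) -> list[str]:
    """Union of both name lists: first occurrence wins, order is preserved."""
    # dict.fromkeys keeps insertion order and drops later duplicates.
    return list(dict.fromkeys([*names_a, *names_b]))


merged = merge_names(
    ["Meridian Technologies"],
    ["Meridian Tech", "Meridian Technologies"],
)
# merged == ['Meridian Technologies', 'Meridian Tech']
```

Preserving order keeps the first entity's names[0] in front, which matters if the first name continues to serve as the representative for functionality pooling.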

Test plan

  • New test: multi-label entity seeds similarity from best name pair
  • New test: all names contribute to IDF computation
  • New test: multi-label names survive save/load round-trip
  • New test: legacy single-name JSON format loads correctly
  • All 58 existing tests pass unchanged

Closes #27

🤖 Generated with Claude Code

Entities can carry multiple names (e.g. "Meridian Technologies" and
"Meridian Tech"). Name similarity seeding now computes max(soft_tfidf)
across all name pairs, ensuring the closest match is always used.

- Node.name: str → Node.names: list[str]
- Graph.add_entity() accepts str or list[str]
- IDF built from all names across all entities
- Graph I/O serializes names list, loads legacy single-name format
- Functionality pooling uses first name as representative

Closes #27

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
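
Two of the bullets in this commit message, accepting str or list[str] and building IDF from all names, can be illustrated together. The sketch assumes that each name counts as its own document and uses a common smoothed IDF formula; the weighting actually used in match.py is not shown here, and `normalize_names` and `build_idf` are hypothetical names.

```python
import math
from collections import Counter
from typing import Union


def normalize_names(names: Union[str, list[str]]) -> list[str]:
    """Accept a single name or a list of names; always return a list."""
    return [names] if isinstance(names, str) else list(names)


def build_idf(entities: list[list[str]]) -> dict[str, float]:
    """Smoothed IDF where every name of every entity is one document (assumption)."""
    documents = [name.lower().split() for names in entities for name in names]
    # Count each token at most once per name.
    doc_freq = Counter(token for doc in documents for token in set(doc))
    total = len(documents)
    return {
        token: math.log((1 + total) / (1 + df)) + 1.0
        for token, df in doc_freq.items()
    }


entities = [
    normalize_names("Meridian Technologies"),
    normalize_names(["Meridian Tech", "Acme Corp"]),
]
idf = build_idf(entities)
```

With this input, "meridian" appears in two of three names and so receives a lower weight than "acme", which appears in one.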

@claude claude bot left a comment

One issue worth fixing: the backward-compatibility shim for the legacy single-name JSON format.

load_graph (graph.py:52-56) adds fallback logic to handle the old "name" key:

raw_names = node_data.get("names") or [node_data["name"]]
names=raw_names if isinstance(raw_names, list) else [raw_names],

There are no old-format graph files in the repo and no external users. Per CLAUDE.md conventions ("never add backward-compatibility shims... Refactor completely"), this should just read node_data["names"] directly. The test_load_legacy_single_name_format test should be removed as well — it tests the shim.
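
For comparison, a strict loader without the fallback might look like the sketch below. The `Node` dataclass and its `id` field are hypothetical stand-ins (the real class in graph.py is not shown in this review), and `node_from_dict` is an illustrative name, not the actual load_graph code.

```python
from dataclasses import dataclass


@dataclass
class Node:
    # Hypothetical stand-in; the real Node likely has more fields.
    id: str
    names: list[str]


def node_from_dict(node_data: dict) -> Node:
    """Read the names list directly; legacy {"name": ...} data raises KeyError."""
    return Node(id=node_data["id"], names=list(node_data["names"]))


node = node_from_dict({"id": "n1", "names": ["Meridian Technologies", "Meridian Tech"]})
```

Failing loudly on a missing "names" key keeps a single code path and surfaces stale files immediately instead of silently converting them.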

Everything else looks correct: the max-over-name-pairs seeding, IDF over all names, names[0] for functionality phrase pairs (same semantics as the old .name), and the union-find display. Clean refactor otherwise.

@monneyboi monneyboi left a comment

Remove the backward-compatibility shim in load_graph — just read node_data["names"] directly. Delete test_load_legacy_single_name_format as well.

Per review: no old-format files exist and there are no external users, so
the fallback in load_graph and its test are unnecessary shims.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@claude claude bot left a comment

Looks good. All 57 tests pass. The Node.name → Node.names refactor is applied consistently across graph.py, match.py, and all tests. Name similarity seeding correctly takes the max over all name pairs, IDF includes all names, and the backward-compat shim was properly removed per CLAUDE.md conventions. No dead code or half-finished refactors.

@monneyboi monneyboi merged commit 8a635b6 into main Mar 27, 2026
2 checks passed

Linked issue: Multi-label entities: use max similarity across all names during propagation
