Implement progressive merging via epoch-based propagation with union-find#29
Conversation
The negative factor reads from name_seed (fixed) and doesn't depend on anything that changes within the positive fixpoint loop. Moving it to a single post-convergence pass makes the inner loop a clean monotone fixpoint and prepares for epoch-based progressive merging. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Split the positive fixpoint loop and negative dampening pass into standalone functions. propagate_similarity() now calls these in sequence — same API, same behavior. This separation enables the epoch loop in the next step. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Wrap the propagate + dampen cycle in an outer epoch loop. High-confidence merges (>= merge_threshold, default 0.9) are committed between epochs via union-find. Merged entities' neighborhoods are unioned for subsequent epochs, allowing evidence from transitively-matched entities to compound.

Key design decisions:
- Adjacency lists are deduplicated per epoch to prevent inflated evidence from merged entities having multiple copies of structurally identical edges
- Name-only seed (not carried-forward confidence) is used for negative evidence to prevent circular reinforcement across epochs
- Best confidence across all epochs is preserved per pair, ensuring progressive merging never worsens scores vs single-epoch behavior
- With default parameters, existing behavior is preserved

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
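The commit above describes an outer epoch loop that commits high-confidence merges via union-find between propagation passes. A minimal sketch of that control flow — all names here (`UnionFind`, `progressive_merge`, `score_fn`) are illustrative stand-ins, not the actual worldgraph API:

```python
# Illustrative sketch of the epoch loop: propagate/dampen per epoch
# (abstracted as score_fn), then commit merges >= merge_threshold.

class UnionFind:
    def __init__(self):
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra


def progressive_merge(score_fn, merge_threshold=0.9, max_epochs=5):
    """Run score_fn each epoch; commit high-confidence merges via union-find."""
    uf = UnionFind()
    best: dict[tuple[str, str], float] = {}
    for _ in range(max_epochs):
        confidence = score_fn(uf)  # one propagate + dampen cycle
        merged_any = False
        for (a, b), conf in confidence.items():
            # keep the best score per pair: epochs never worsen results
            best[(a, b)] = max(best.get((a, b), 0.0), conf)
            if conf >= merge_threshold and uf.find(a) != uf.find(b):
                uf.union(a, b)  # commit the merge between epochs
                merged_any = True
        if not merged_any:  # no new merges: further epochs change nothing
            break
    return best, uf
```

In the real pipeline, `score_fn` would rebuild adjacency and pairs from the current union-find state before propagating; here it is a black box so the commit-and-iterate skeleton stands out.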
Three articles where A+B merge in epoch 1 (identical names + strong structural overlap). Article C's "Meridian Tech Corp" has moderate name similarity and neighbors split between A-unique (Austin) and B-unique (Volta Systems). Without progressive merging, C sees only pairwise evidence. With progressive merging, the enriched A+B neighborhood provides additional structural paths, producing measurably higher confidence. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…rges Two unrelated clusters with isomorphic structure (NovaTech/DataVault vs Quantum Labs/ClearSky) stay separate even with progressive merging enabled. Within-cluster merges work correctly while cross-cluster isolation is maintained. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add --merge-threshold and --max-epochs options to the match CLI command, passing them through run_matching to the propagation pipeline. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Requesting changes (can't use --request-changes on own PR)
Dead code: _build_weighted_adjacency and _build_forward_adjacency (match.py:155-208)
These two functions are defined but never called — they were replaced by _build_epoch_adjacency but left in the file. Per CLAUDE.md: "never add backward-compatibility shims, preserve stale signatures, or keep dead code around 'just in case'. Refactor completely."
Delete both functions. No other issues found — the rest of the implementation is clean and correct.
These functions were replaced by _build_epoch_adjacency but left behind. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Review: No issues found — this is ready to merge.
Clean decomposition of the propagation loop into propagate_positive / apply_negative, and the epoch-based progressive merging is well-structured. Specific things I verified:
- Behavioral equivalence for the single-epoch case: since the negative factor depends only on the fixed `name_seed` and `forward_adj`, moving it from per-iteration to post-convergence produces identical final values. The convergence criterion difference (dampened vs undampened delta) is negligible.
- Adjacency deduplication in `_build_epoch_adjacency` correctly prevents inflated evidence from merged entities' redundant edges.
- Name-only seed for negative evidence prevents circular reinforcement across epochs — the design from `docs/progressive_merging.md` is faithfully implemented.
- `best_confidence` tracking correctly expands canonical-rep scores to original entity pairs and preserves the best score across epochs, so progressive merging never worsens results.
- Return type change (`propagate_similarity` now returns `tuple[Confidence, UnionFind]`) is handled correctly at the only call site in `match_graphs`.
- No dead code, no shims, no half-finished refactors — the old `_build_weighted_adjacency` and `_build_forward_adjacency` are cleanly replaced by `_build_epoch_adjacency`.
One minor note (not blocking): `_build_epoch_adjacency` doesn't skip edges where src == tgt after mapping through union-find (self-loops from merging connected entities). This is unlikely in practice, since entities connected by a relation are semantically distinct, but an `if src == tgt: continue` guard would be a clean defensive addition.
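The suggested guard can be sketched as follows, assuming the adjacency builder walks raw edges and maps both endpoints through the union-find before appending neighbors (`canonical_edges` and the tuple shape are hypothetical, not the actual match.py code):

```python
# Hypothetical sketch of the self-loop guard. `find` stands in for
# UnionFind.find; edges are (src, tgt, relation) tuples.

def canonical_edges(edges, find):
    """Map endpoints to canonical reps, dropping self-loops created by merges."""
    for src, tgt, relation in edges:
        src, tgt = find(src), find(tgt)
        if src == tgt:
            continue  # merged entities that were connected: skip the self-loop
        yield src, tgt, relation
```

Without the guard, a relation between two entities that later merge would surface as a spurious self-neighbor in both propagation directions.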
When merged entities were previously connected by a relation, mapping both endpoints through union-find produces src == tgt. These self-loops would pollute adjacency lists with spurious neighbors. Filter them out. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
monneyboi left a comment
Replace the epoch-based double loop with a single propagation loop using incremental canonical adjacency.
The current design runs an inner fixpoint loop to convergence, then an outer epoch loop that rebuilds adjacency, re-seeds confidence, and rebuilds pairs between epochs. This is wasteful (most re-convergence work is redundant since only merged entities' neighborhoods changed) and harder to follow than necessary.
The epoch split exists because the docs sketched it that way, but the algorithm doesn't require it. A single loop with inline merging works if we handle adjacency dedup correctly — and we can do that incrementally instead of rebuilding from scratch each epoch.
The key insight: maintain a canonical_adj alongside the UnionFind, updated incrementally on merge.
The adjacency is consumed identically in both positive propagation and negative evidence: adj.get(entity_id, []) → iterate Neighbor(entity_id, relation, func_weight). The consumer doesn't care how the list was built. So on merge of A and B into canonical rep C:
```python
# O(degree_A + degree_B) per merge
combined = canonical_adj[A] + canonical_adj[B]
seen = set()
deduped = []
for nbr in combined:
    canon_nbr = uf.find(nbr.entity_id)
    key = (canon_nbr, nbr.relation)
    if key not in seen:
        seen.add(key)
        deduped.append(Neighbor(canon_nbr, nbr.relation, nbr.func_weight))
canonical_adj[C] = deduped
```

The inner loop stays flat — same structure as the current code, no member iteration, no adjacency rebuild:

```python
for neighbor_a in canonical_adj.get(ca, []):
    for neighbor_b in canonical_adj.get(cb, []):
        nbr_conf = prev_base.get((neighbor_a.entity_id, neighbor_b.entity_id), 0.0)
```

Why dedup matters (it can't be skipped): without dedup, an entity appearing in 5 articles inflates structural evidence from ~0.59 to ~0.99 on the same underlying fact. The exponential sum saturates, but not fast enough — the inflation is severe at exactly the scale we're targeting (major entities appearing across many articles).
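The ~0.59 → ~0.99 figure falls out directly if duplicate edges combine noisy-or style — a sketch under that assumption (the actual combination rule lives in match.py):

```python
# Noisy-or style saturation: k duplicate copies of the same edge, each
# contributing evidence w, combine to 1 - (1 - w)**k. With w = 0.59 and
# k = 5 this reproduces the ~0.59 -> ~0.99 inflation cited above.

def combined_evidence(w: float, k: int) -> float:
    return 1.0 - (1.0 - w) ** k

single = combined_evidence(0.59, 1)    # one genuine edge
inflated = combined_evidence(0.59, 5)  # same fact duplicated across 5 articles
print(round(single, 2), round(inflated, 2))  # 0.59 0.99
```

The duplicates carry no new information, yet they push the pair past any plausible merge threshold — which is why the dedup key is (canonical neighbor, relation) rather than the raw edge.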
What this eliminates:
- `_build_epoch_adjacency` (full O(|edges|) rebuild per epoch)
- `_build_epoch_pairs` (full O(n²) rebuild per epoch)
- `_seed_epoch_confidence` (full O(n² × names²) re-seeding per epoch)
- The `best_confidence` tracking and final member-expansion loop
- The `propagate_positive`/`apply_negative` split into separate functions (they fold back into the single loop)
What this keeps:
- UnionFind for entity identity (already have it)
- Canonical adjacency for neighborhood identity (new, incremental, O(degree) per merge)
- Pairs list and confidence dict keyed by canonical reps, updated on merge
- Deduplication of redundant edges between same canonical entities
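"Pairs list and confidence dict keyed by canonical reps, updated on merge" can be sketched as a re-keying pass run after each merge — the helper name and signature here are illustrative (the PR later extracts a similar `_remap_confidence`, whose actual shape may differ):

```python
# Illustrative sketch: re-key a {(a, b): score} confidence dict through
# the union-find after a merge, keeping the best score when pairs collide.

def remap_confidence(confidence, find):
    remapped = {}
    for (a, b), conf in confidence.items():
        key = (find(a), find(b))
        if key[0] == key[1]:
            continue  # pair collapsed into one entity by the merge
        remapped[key] = max(remapped.get(key, 0.0), conf)
    return remapped
```

Keeping the max on collision matters: when A and B merge, the pairs (A, C) and (B, C) collapse into one canonical pair, and the stronger of the two scores is the one the enriched neighborhood should build on.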
The epoch-based design ran an inner fixpoint loop to convergence, then an outer epoch loop that rebuilt adjacency, re-seeded confidence, and rebuilt pairs between epochs. This was wasteful and harder to follow. Replace it with a single propagation loop that maintains a canonical_adj alongside the UnionFind, updated incrementally on merge (O(degree) per merge instead of an O(|edges|) full rebuild). Positive evidence runs to convergence, then negative dampening is applied once, then merges are committed inline. Enriched neighborhoods compound across merge cycles.

Key changes:
- Remove _build_epoch_adjacency, _build_epoch_pairs, _seed_epoch_confidence
- Remove propagate_positive and apply_negative as separate functions
- Remove max_epochs parameter (single loop handles everything)
- Add _build_adjacency, _build_pairs, _seed_confidence (initial setup)
- Incremental canonical_adj dedup on merge prevents evidence inflation
- Re-seed from name similarity after merge to prevent compounding dampening
- Negative evidence resolves neighbor IDs through UnionFind and treats same-canonical-entity neighbors as perfect matches

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Bug: negative evidence uses the wrong functionality direction.
_build_adjacency (line 295) assigns functionality weights matching the old positive-propagation adjacency — source gets func.inverse, target gets func.forward:
```python
adjacency[src].append(Neighbor(tgt, edge.relation, func.inverse))
adjacency[tgt].append(Neighbor(src, edge.relation, func.forward))
```

But `_one_sided_negative` reads `neighbor_a.func_weight` as the negative-evidence weight. Per the algorithm design and docs/negative_evidence.md line 38 ("Negative evidence uses forward functionality"), it needs the opposite direction: forward for outgoing neighbors, inverse for incoming.
The old code had a separate _build_forward_adjacency with the swapped directions specifically for this. The refactor to a single canonical_adj dropped that distinction — negative evidence now weights mismatches by inverse functionality (outgoing) / forward functionality (incoming) instead of forward / inverse. This is semantically wrong: a functional relation (high forward func, e.g. "acquired") should generate strong negative evidence when its target doesn't match, but now it's weighted by inverse func which measures something different (how unique is the source given the target?).
The fix: since _one_sided_negative iterates over adj.get(id_a, []) and reads neighbor.func_weight, you need the weight to reflect the correct direction for negative evidence. Two options:
- Store both weights in `Neighbor` (add a field), and use the right one in each context.
- Build a parallel `neg_adj` with swapped directions (like the old `_build_forward_adjacency`), maintained incrementally alongside `canonical_adj`.
Option 1 is cleaner since it avoids maintaining two adjacency structures through merges.
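A sketch of option 1, under the direction assignments quoted in the review (the `add_edge` helper and the bare `fwd`/`inv` floats are illustrative; the real code reads them off an edge's functionality object):

```python
from typing import NamedTuple

class Neighbor(NamedTuple):
    entity_id: str
    relation: str
    func_weight: float      # used by positive propagation
    neg_func_weight: float  # used by negative evidence (opposite direction)

def add_edge(adjacency, src, tgt, relation, fwd, inv):
    # Outgoing neighbor: positive uses inverse func, negative uses forward func.
    adjacency.setdefault(src, []).append(Neighbor(tgt, relation, inv, fwd))
    # Incoming neighbor: positive uses forward func, negative uses inverse func.
    adjacency.setdefault(tgt, []).append(Neighbor(src, relation, fwd, inv))
```

With both weights on the tuple, `_one_sided_negative` switches to `neighbor.neg_func_weight` and the single canonical adjacency survives merges unchanged — no second structure to keep in sync.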
The refactor to a single canonical_adj dropped the distinction between positive and negative functionality directions. Negative evidence needs forward func for outgoing neighbors and inverse for incoming (the opposite of positive propagation). Add neg_func_weight to Neighbor so each context uses the correct weight. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
No issues found — ready to merge.
The negative-evidence functionality fix is correct: both weights stored per Neighbor, func_weight for positive propagation and neg_func_weight for negative evidence, with the right direction assignments in _build_adjacency. Single-loop progressive merging is clean: positive converges before negative dampening, incremental O(degree) adjacency updates on merge, re-seed prevents dampening compounding across cycles. No dead code or shims remain.
- Remove canon_graphs tracking (dead code: all pairs originate cross-graph, so the len == 1 filter never triggers; ra == rb already handles self-pairs)
- Replace pre_merge_canons/pre_merge_graphs with a simple set comprehension
- Extract a _remap_confidence helper for the identical confidence/name_seed remapping operations

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
No issues found — this is ready to merge.
Verified: functionality weight directions in _build_adjacency correctly assign func_weight (positive) and neg_func_weight (negative) matching the old separate adjacency builders. Incremental adjacency merge dedup is correct (same-relation entries have identical weights). Single-loop design with positive-converge-then-negative-then-merge is sound — re-seeding prevents dampening compounding across cycles. Final expansion to original entity pairs is correct. No dead code or shims.
Summary
`docs/progressive_merging.md`

Closes #25
Implementation
Each commit is a standalone step where all existing tests pass:
- `propagate_positive()` and `apply_negative()` — split into standalone functions, same API and behavior
- Default parameters (`merge_threshold=0.9`, `max_epochs=5`) reproduce previous behavior
- `--merge-threshold` and `--max-epochs` options added to `worldgraph match`

Test plan
- Progressive vs single-epoch comparison (`max_epochs>1` vs `max_epochs=1`)
- `match_graphs()` API backward-compatible (new params have defaults reproducing current behavior)

🤖 Generated with Claude Code