Skip to content

Implement progressive merging via epoch-based propagation with union-find #25

@monneyboi

Description

@monneyboi

Summary

Implement epoch-based progressive merging as described in docs/progressive_merging.md. High-confidence entity merges should be committed during propagation (not just in post-processing), enriching neighborhoods for subsequent iterations. This lets evidence from transitively-matched entities compound — the core benefit described in the design doc.

Background

Currently, propagate_similarity() runs to convergence, then build_match_groups() applies union-find as a separate post-processing step. An entity appearing in 5 articles only benefits from pairwise similarity — it never gets the combined neighborhood of its merged selves. Progressive merging fixes this by periodically committing high-confidence merges and continuing propagation with enriched neighborhoods.

Implementation plan

Each step should be a standalone commit where all 54 existing tests pass. Follow the conventions in CLAUDE.md — clean refactors, not patches.

Step 1: Move negative dampening outside the inner propagation loop

This is a pure behavior-preserving refactor. In the current code, the negative factor is already constant per pair within the inner loop (it reads from name_seed, which is fixed, and positive_base, which the negative factor doesn't depend on). Moving it out doesn't change any final values — only makes the inner loop a clean monotone fixpoint.

Concretely in propagate_similarity():

  • The inner for _ in range(max_iter) loop should only compute positive_base (remove the negative factor application from inside the loop)
  • Convergence check should use positive_base changes only
  • After the loop converges, apply negative dampening in a single pass over all pairs above neg_gate

All existing tests must pass unchanged.

Step 2: Extract propagate_positive() and apply_negative() as separate functions

Refactor propagate_similarity() into:

  • propagate_positive(graph, adjacency, pairs, positive_base, ...) -> positive_base — the monotone fixpoint inner loop
  • apply_negative(positive_base, forward_adj, rel_sim, name_seed, ...) -> confidence — one-shot post-convergence dampening
  • propagate_similarity() calls these two in sequence (same API, same behavior)

All existing tests must pass unchanged.

Step 3: Add the epoch loop with union-find merging

Wrap the propagate + dampen cycle in an outer epoch loop. The union-find (already in the codebase) tracks merges between epochs.

New parameters for propagate_similarity() (or a new top-level function if cleaner):

  • merge_threshold (float, default 0.9) — only pairs above this are merged between epochs
  • max_epochs (int, default 5) — maximum number of merge epochs

Epoch loop pseudocode:

uf = UnionFind()
for epoch in range(max_epochs):
    1. Build adjacency using UF canonical reps
       (merged entity's neighborhood = union of all members' adjacencies)
    2. Build cross-graph pairs between canonical reps only
       (map original pairs through uf.find(), deduplicate, skip same-group)
    3. Seed name similarity: max across all member-pair name comparisons
    
    4. propagate_positive() to convergence
    5. apply_negative()
    
    6. Find pairs above merge_threshold where uf.find(a) != uf.find(b)
    7. If none → break (global convergence)
    8. uf.union() the new merges
    9. Carry forward: confidence(canon_a, canon_b) = max over member pairs

Return confidence and incorporate UF into the final grouping.

Important: with default merge_threshold=0.9 and max_epochs=5, existing tests should pass unchanged — pairs that currently score below 0.9 won't trigger any progressive merges, so the epoch loop exits after epoch 1 with no merges.

Between-epoch negative evidence can use converged positive_base (not just name_seed) for neighbor-match checks — there's no circular reinforcement risk since it's between epochs, not within the inner loop. However, if this causes test failures, fall back to using name_seed across epochs too.

Step 4: Write the enriched-neighborhood test

Add a Layer 2 test in test_propagation.py that demonstrates the enriched-neighborhood effect — the core benefit of progressive merging.

The scenario from the design doc:

  • Article A: [Meridian Tech] —acquired→ [DataVault]
  • Article B: [Meridian Technologies] —bought→ [DataVault Inc] and [Meridian Technologies] —headquartered in→ [Austin]
  • Article C: [Meridian] —headquartered in→ [Austin, TX]

After epoch 1, A's and B's Meridian merge (strong name sim + structural match via acquisition). The merged entity now has edges to both DataVault and Austin. In epoch 2, C's "Meridian" benefits from TWO structural paths.

The test should verify that C's "Meridian" matches the A+B merged entity. Without progressive merging (max_epochs=1), C's "Meridian" may not match due to low name similarity ("Meridian" vs "Meridian Technologies") and only one structural path per pairwise comparison.

If the scenario doesn't cleanly demonstrate the effect (e.g., name similarity alone is sufficient), adjust entity names to make name similarity insufficient without the enriched neighborhood. The test should fail with max_epochs=1 (or at least produce lower confidence) and pass with max_epochs>1.

Step 5: Integration test

Add a Layer 3 test in test_integration.py verifying that progressive merging doesn't cause cascading false merges. Reuse the cross-cluster isolation pattern from test_identical_names_different_contexts_no_merge — two unrelated clusters with similar structure should stay separate even with progressive merging enabled.

Acceptance criteria

  • All 54 existing tests pass at every step
  • Negative dampening is cleanly separated from the positive fixpoint loop
  • Progressive merging is implemented via union-find (no graph mutation)
  • At least one test demonstrates the enriched-neighborhood benefit
  • At least one test verifies no cascading false merges
  • match_graphs() API remains backward-compatible (new params have defaults that reproduce current behavior)

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions