# Researcher Deduplication

This notebook demonstrates using everyrow's `dedupe()` utility to clean a researcher/candidate database where the same person appears multiple times due to career changes, name variations, and data quality issues.

**Use Case:** Your candidate database has the same researchers listed multiple times: once from a conference 3 years ago (when they were at Stanford), again from a recent paper (now at OpenAI), and maybe a third time with just initials. You need to deduplicate without losing the career history.

**Why everyrow?** The `dedupe()` function understands that "Dr. A. Butoi" at Stanford in 2021 and "Alexandra Butoi" at Google in 2024 might be the same person who changed jobs. Traditional fuzzy matching compares strings, so an abbreviated name or a changed institution simply looks like a mismatch.
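
To make that limitation concrete, here is a minimal sketch that uses Python's standard-library `difflib` as a stand-in for a generic fuzzy matcher. The notebook itself does not use `difflib`, and the helper name `name_similarity` is hypothetical. It scores the name pair from the example above against two records that share the exact name "Michael Chen", the distractor checked later in this notebook.

```python
from difflib import SequenceMatcher


def name_similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


# Same person according to the example above, written two different ways.
same_person_score = name_similarity("Dr. A. Butoi", "Alexandra Butoi")

# Two records with an identical name, which may still be different people.
same_name_score = name_similarity("Michael Chen", "Michael Chen")

print(f"Dr. A. Butoi vs Alexandra Butoi: {same_person_score:.2f}")
print(f"Michael Chen vs Michael Chen:    {same_name_score:.2f}")
```

Identical names always score 1.0 while the name variants score lower, so any single threshold on this score either merges the namesakes or splits the variants. The score carries no information about institutions, years, or GitHub handles.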

In [None]:
import asyncio
from dotenv import load_dotenv
load_dotenv()

import pandas as pd
from everyrow import create_session
from everyrow.ops import dedupe

## Load Researcher Data

In [None]:
researchers_df = pd.read_csv("../data/researchers.csv")

# Fill NaN values with empty strings to avoid JSON serialization issues
researchers_df = researchers_df.fillna("")

print(f"Researcher records: {len(researchers_df)}")
print("Expected unique people: ~12")
researchers_df

## Define Deduplication Rule

In [None]:
DEDUPE_RULE = """
Two rows represent the SAME PERSON if they are the same researcher/individual.

Consider as the same person:
- Name variations: initials (A. Butoi = Alexandra Butoi), nicknames (Bob = Robert, Mike = Michael)
- Typos in names (Sara = Sarah)
- Same person at different institutions over time (career transitions are common in academia/tech)
- Same GitHub handle is a strong signal of same person
- Same email domain root can indicate same person (even if subdomain differs)

Do NOT consider as same person:
- Same common name but clearly different fields/institutions with no linking signals
- Same first name only with different last names
- Different GitHub handles usually means different people (unless one is missing)

When GitHub handles match, that's strong evidence of same person even with institution changes.
When a common name appears with no linking signals (different github, different field), keep separate.
"""

## Run Deduplication

In [None]:
async def run_dedupe():
    async with create_session(name="Researcher Deduplication") as session:
        print(f"Session URL: {session.get_url()}")
        print("\nDeduplicating researcher records...\n")
        
        result = await dedupe(
            session=session,
            input=researchers_df,
            equivalence_relation=DEDUPE_RULE,
        )
        
        return result.data

results_df = await run_dedupe()

## Analyze Results

In [None]:
print(f"\n{'='*60}")
print("DEDUPLICATION RESULTS")
print(f"{'='*60}")
print(f"  Original records:      {len(researchers_df)}")
print(f"  Unique researchers:    {results_df['cluster_id'].nunique()}")
print(f"  Duplicate records:     {len(researchers_df) - results_df['cluster_id'].nunique()}")

In [None]:
# Show duplicate clusters (people with multiple records)
print("\nRESEARCHERS WITH MULTIPLE RECORDS:")
print("-" * 80)

cluster_counts = results_df['cluster_id'].value_counts()
duplicate_clusters = cluster_counts[cluster_counts > 1].index

for cluster_id in duplicate_clusters:
    cluster = results_df[results_df['cluster_id'] == cluster_id].sort_values('year')
    
    print(f"\n--- Person (Cluster {cluster_id}) ---")
    print("Career timeline:")
    for _, row in cluster.iterrows():
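        # Missing values may be NaN or "" (NaN was replaced with "" at load time)
        github = f"@{row['github']}" if pd.notna(row['github']) and row['github'] != "" else "(no github)"
        email = row['email'] if pd.notna(row['email']) and row['email'] != "" else "(no email)"
        print(f"  {row['year']}: {row['name']:20} | {row['institution']:25} | {github} | {email}")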

In [None]:
# Check if the "distractor" Michael Chen was correctly kept separate
# Escape the dot so "M. Chen" matches a literal period rather than any character
michael_chens = results_df[results_df['name'].str.contains(r'Michael Chen|Mike Chen|M\. Chen', case=False, na=False)]

print("\nMICHAEL CHEN DISAMBIGUATION CHECK:")
print("-" * 50)
print(f"Records with 'Michael/Mike Chen': {len(michael_chens)}")
print(f"Unique clusters: {michael_chens['cluster_id'].nunique()}")
print("\nDetails:")
for _, row in michael_chens.iterrows():
    print(f"  Cluster {row['cluster_id']}: {row['name']:15} | {row['institution']:20} | {row.get('github', 'N/A')}")
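
The rule above treats a matching GitHub handle as strong evidence of the same person, so one quick sanity check on the output is to look for handles that ended up in more than one cluster. The sketch below is an illustration, not part of everyrow: the helper `split_github_handles` and the sample rows are hypothetical. It assumes missing handles are empty strings (as after `fillna("")`) and compares handles case-insensitively.

```python
from collections import defaultdict


def split_github_handles(records):
    """Return {handle: sorted cluster ids} for handles spread over several clusters."""
    clusters_by_handle = defaultdict(set)
    for rec in records:
        handle = (rec.get("github") or "").strip().lower()
        if handle:  # skip records with no handle
            clusters_by_handle[handle].add(rec["cluster_id"])
    return {h: sorted(c) for h, c in clusters_by_handle.items() if len(c) > 1}


# Hypothetical rows for illustration only; not taken from researchers.csv.
example_records = [
    {"name": "Alexandra Butoi", "github": "abutoi", "cluster_id": 0},
    {"name": "Dr. A. Butoi", "github": "abutoi", "cluster_id": 0},
    {"name": "Michael Chen", "github": "mchen", "cluster_id": 1},
    {"name": "Mike Chen", "github": "MChen", "cluster_id": 2},
    {"name": "M. Chen", "github": "", "cluster_id": 3},
]

print(split_github_handles(example_records))
```

On the notebook's data this could be called as `split_github_handles(results_df.to_dict("records"))`. An empty dictionary means every shared handle landed in a single cluster; any entry is a candidate for manual review.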

In [None]:
# Show unique researchers (no duplicates found)
singleton_clusters = cluster_counts[cluster_counts == 1].index
singletons = results_df[results_df['cluster_id'].isin(singleton_clusters)]

print(f"\nUNIQUE RESEARCHERS ({len(singletons)} with single record):")
print("-" * 60)
for _, row in singletons.iterrows():
    print(f"  {row['name']:25} | {row['institution']:25} | {row['year']}")

In [None]:
# Create consolidated view (most recent record per person)
consolidated = results_df.sort_values('year', ascending=False).groupby('cluster_id').first().reset_index()

print(f"\nCONSOLIDATED RESEARCHER LIST ({len(consolidated)} unique people):")
print("-" * 70)
print(f"{'Name':<25} | {'Current Institution':<30} | {'Year'}")
print("-" * 70)
for _, row in consolidated.sort_values('name').iterrows():
    print(f"{row['name']:<25} | {row['institution']:<30} | {row['year']}")
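
The consolidated view keeps only the most recent record per person, which drops the earlier affiliations. Since the stated goal is to deduplicate without losing career history, one possible approach is to collect every `(year, institution)` pair per cluster alongside it. This is a minimal sketch: the helper `career_history` and the sample rows are hypothetical, and it assumes the `cluster_id`, `year`, and `institution` columns used elsewhere in this notebook.

```python
from collections import defaultdict


def career_history(records):
    """Map each cluster_id to its (year, institution) pairs, oldest first."""
    history = defaultdict(list)
    for rec in records:
        history[rec["cluster_id"]].append((rec["year"], rec["institution"]))
    # set() drops exact repeats; sorted() orders by year, then institution
    return {cid: sorted(set(entries)) for cid, entries in history.items()}


# Hypothetical rows for illustration only; not taken from researchers.csv.
example_records = [
    {"cluster_id": 0, "name": "Alexandra Butoi", "institution": "Google", "year": 2024},
    {"cluster_id": 0, "name": "Dr. A. Butoi", "institution": "Stanford", "year": 2021},
    {"cluster_id": 1, "name": "Michael Chen", "institution": "MIT", "year": 2022},
]

for cid, entries in career_history(example_records).items():
    timeline = " -> ".join(f"{inst} ({year})" for year, inst in entries)
    print(f"Cluster {cid}: {timeline}")
```

With the real output, `career_history(results_df.to_dict("records"))` would give one timeline per cluster that can be stored next to the consolidated row.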

In [None]:
# Export deduplicated list
consolidated.to_csv("researchers_deduplicated.csv", index=False)
print(f"\nExported {len(consolidated)} unique researchers to researchers_deduplicated.csv")

In [None]:
# Full results with cluster assignments
results_df.sort_values(['cluster_id', 'year'])