## Fuzzy Deduplication Workflow

This notebook shows how to:
1. Load config from `config.yml`.
2. Load Dataset1 and Dataset2, mapping Dataset2 to Dataset1's schema.
3. Block and fuzzy-match records.
4. Remove Dataset2 rows that duplicate Dataset1 rows.
5. Save the final combined data.

In [None]:
import pandas as pd
from src.fuzzy_helpers import (
    load_config,
    load_dataset1,
    load_dataset2_and_map,
    combine_dataframes,
    block_data,
    build_comparison_features,
    classify_duplicates,
    merge_duplicate_pairs,
    keep_dataset1_and_unmatched_dataset2
)

### 1. Load Configuration

Our `config.yml` file contains:
- dataset1_path
- dataset2_path
- output_path
- dataset1_columns
- dataset2_to_dataset1
- blocking_keys
- fuzzy_keys
- similarity_threshold
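
As a quick sanity check, the keys listed above can be verified before the rest of the notebook runs. The following is a minimal sketch and not part of `src.fuzzy_helpers`: `find_missing_keys` is a hypothetical helper, and the sample paths and column names are made up for illustration.

```python
# Keys the notebook reads from config.yml (taken from the list above).
REQUIRED_KEYS = [
    "dataset1_path",
    "dataset2_path",
    "output_path",
    "dataset1_columns",
    "dataset2_to_dataset1",
    "blocking_keys",
    "fuzzy_keys",
    "similarity_threshold",
]


def find_missing_keys(config, required=REQUIRED_KEYS):
    """Hypothetical helper: return the required keys absent from `config`."""
    return [key for key in required if key not in config]


# Illustrative values only; real values come from config.yml.
example_config = {
    "dataset1_path": "data/dataset1.csv",
    "dataset2_path": "data/dataset2.csv",
    "output_path": "data/deduped.csv",
    "dataset1_columns": ["first_name", "last_name", "zip"],
    "dataset2_to_dataset1": {"fname": "first_name", "surname": "last_name"},
    "blocking_keys": ["zip"],
    "fuzzy_keys": ["first_name", "last_name"],
    "similarity_threshold": 0.85,
}

missing = find_missing_keys(example_config)
if missing:
    raise KeyError(f"config is missing keys: {missing}")
```

Failing early here gives a clearer error than a `KeyError` halfway through the notebook.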

In [None]:
config_path = "config.yml"  # Adjust if needed
config = load_config(config_path)

dataset1_path = config["dataset1_path"]
dataset2_path = config["dataset2_path"]
output_path = config["output_path"]

dataset1_columns = config["dataset1_columns"]
dataset2_to_dataset1 = config["dataset2_to_dataset1"]
blocking_keys = config["blocking_keys"]
fuzzy_keys = config["fuzzy_keys"]
similarity_threshold = float(config["similarity_threshold"])

### 2. Load & Clean Datasets

Dataset1 is expected to already be in the final schema, but we still ensure every column in `dataset1_columns` exists.  
Dataset2 is mapped to the same schema using the `dataset2_to_dataset1` dictionary.
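
To illustrate what mapping to a shared schema can look like, here is a small sketch using plain pandas. It is an assumption about the general approach, not the actual body of `load_dataset2_and_map`; `map_to_schema` and the column names are hypothetical.

```python
import pandas as pd


def map_to_schema(df, column_map, target_columns):
    """Hypothetical helper: rename columns, then keep exactly `target_columns`.

    Columns missing after the rename are created and filled with NaN;
    columns not in `target_columns` are dropped.
    """
    renamed = df.rename(columns=column_map)
    return renamed.reindex(columns=target_columns)


# Illustrative data only.
raw = pd.DataFrame({"fname": ["Ann"], "surname": ["Lee"], "extra": [1]})
mapped = map_to_schema(
    raw,
    column_map={"fname": "first_name", "surname": "last_name"},
    target_columns=["first_name", "last_name", "city"],
)
print(mapped)
```

Using `reindex` for both adding and dropping columns keeps the output column order identical to the target schema, which makes the later concatenation straightforward.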


In [None]:
df1 = load_dataset1(dataset1_path, dataset1_columns)
df2_mapped = load_dataset2_and_map(dataset2_path, dataset1_columns, dataset2_to_dataset1)

In [None]:
df1

In [None]:
df2_mapped

In [None]:
# Pre-processing: normalize date columns to "MM/DD/YYYY" in both datasets.
# date_cols is empty by default, so this cell does nothing until you list column names.
date_cols = []

for df in (df1, df2_mapped):
    for col in date_cols:
        if col in df.columns:
            # Parse to datetime (invalid values become NaT), then format as mm/dd/yyyy.
            df[col] = (
                pd.to_datetime(df[col], errors="coerce")
                .dt.strftime("%m/%d/%Y")
            )

### 3. Combine

In [None]:
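# Tag each row with its dataset of origin (1 or 2), then stack the two frames.
# Alternative helper (currently unused): combined_df = combine_dataframes(df1, df2_mapped)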
df1["source"] = 1
df2_mapped["source"] = 2
combined_df = pd.concat([df1, df2_mapped], ignore_index=True)


**Optional**: Quick cleanup, e.g., filling NaNs with empty strings in text columns. The cell below is commented out; uncomment it to apply the cleanup.


In [None]:
# for col in combined_df.columns:
#     if combined_df[col].dtype == object:
#         combined_df[col] = combined_df[col].fillna("")

In [None]:
combined_df

### 4. Block Data

We create candidate pairs only for rows that share the same values in `blocking_keys`.
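
The idea behind blocking can be sketched in a few lines of plain Python. This is an illustration of the technique under simple assumptions (records as dictionaries, exact equality on the blocking keys), not the implementation of `block_data`; `candidate_pairs_by_block` and the sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs_by_block(records, blocking_keys):
    """Return index pairs (i, j), i < j, for records equal on every blocking key."""
    blocks = defaultdict(list)
    for idx, record in enumerate(records):
        key = tuple(record[k] for k in blocking_keys)
        blocks[key].append(idx)

    pairs = []
    for indices in blocks.values():
        # Only records inside the same block are compared with each other.
        pairs.extend(combinations(indices, 2))
    return pairs


# Illustrative records only.
records = [
    {"zip": "10001", "name": "Ann Lee"},
    {"zip": "10001", "name": "Anne Lee"},
    {"zip": "94105", "name": "Bob Ray"},
]
pairs = candidate_pairs_by_block(records, ["zip"])
print(pairs)
```

Blocking trades recall for speed: records that disagree on a blocking key are never compared, so the keys should be fields that are rarely wrong or missing.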

In [None]:
candidate_pairs = block_data(combined_df, blocking_keys)

### 5. Build Fuzzy Comparison Features

Each candidate pair is scored with Jaro-Winkler similarity on the `fuzzy_keys` columns.
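
For readers unfamiliar with the metric, below is a compact pure-Python sketch of Jaro-Winkler similarity. It assumes the common defaults (prefix scale 0.1, prefix capped at 4 characters); `build_comparison_features` may rely on a library whose edge-case handling differs, so scores are not guaranteed to match exactly.

```python
def jaro(s1, s2):
    """Jaro similarity in [0, 1]."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0

    # Characters match if equal and no further apart than this window.
    window = max(max(len1, len2) // 2 - 1, 0)
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, ch in enumerate(s1):
        for j in range(max(0, i - window), min(len2, i + window + 1)):
            if not matched2[j] and s2[j] == ch:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0

    # Count matched characters that appear in a different order.
    k = 0
    out_of_order = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[k]:
                k += 1
            if s1[i] != s2[k]:
                out_of_order += 1
            k += 1
    transpositions = out_of_order / 2

    return (
        matches / len1 + matches / len2 + (matches - transpositions) / matches
    ) / 3


def jaro_winkler(s1, s2, prefix_scale=0.1):
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    base = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1[:4], s2[:4]):
        if a != b:
            break
        prefix += 1
    return base + prefix * prefix_scale * (1 - base)


print(round(jaro_winkler("MARTHA", "MARHTA"), 4))
```

The prefix boost is why Jaro-Winkler is popular for names: typos tend to occur later in a word, so strings that agree at the start are scored as more similar.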

In [None]:
features_df = build_comparison_features(combined_df, candidate_pairs, fuzzy_keys)
features_df

### 6. Classify Duplicates

A candidate pair is flagged as a duplicate when its average similarity across the `fuzzy_keys` fields is >= `similarity_threshold`.
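
The averaging rule can be illustrated with a short sketch. It assumes per-field scores are available as a mapping from pair to field scores; `classify_by_average` and the numbers are hypothetical and only mirror the rule described above, not the internals of `classify_duplicates`.

```python
def classify_by_average(pair_scores, threshold):
    """Return the pairs whose mean per-field similarity is >= threshold.

    `pair_scores` maps (i, j) -> {field_name: similarity}.
    """
    duplicates = []
    for pair, scores in pair_scores.items():
        average = sum(scores.values()) / len(scores)
        if average >= threshold:
            duplicates.append(pair)
    return duplicates


# Illustrative scores only.
pair_scores = {
    (0, 1): {"name": 0.96, "address": 0.90},  # mean is about 0.93
    (0, 2): {"name": 0.55, "address": 0.95},  # mean is about 0.75
}
duplicates = classify_by_average(pair_scores, threshold=0.85)
print(duplicates)
```

Note that averaging lets one strong field compensate for a weak one; if that is undesirable, a per-field minimum is a common alternative.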

In [None]:
duplicates_idx = classify_duplicates(features_df, similarity_threshold)
print(f"Found {len(duplicates_idx)} pairs classified as duplicates.")

### 7. Create De-duplicated Dataset
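
This step keeps every Dataset1 row and adds only the Dataset2 rows that were not matched. The sketch below shows one way that rule could work, assuming each row carries a `source` value of 1 or 2 and that duplicate pairs are positions in the combined data. `keep_source1_and_unmatched_source2` is a hypothetical stand-in, not the code of `keep_dataset1_and_unmatched_dataset2`.

```python
def keep_source1_and_unmatched_source2(rows, duplicate_pairs):
    """Drop Dataset2 rows that were paired with a Dataset1 row.

    `rows` is a list of dicts with a "source" key (1 or 2);
    `duplicate_pairs` holds (i, j) positions into `rows`.
    """
    to_drop = set()
    for i, j in duplicate_pairs:
        sources = (rows[i]["source"], rows[j]["source"])
        # Only cross-dataset pairs cause a Dataset2 row to be dropped.
        if sources == (1, 2):
            to_drop.add(j)
        elif sources == (2, 1):
            to_drop.add(i)
    return [row for idx, row in enumerate(rows) if idx not in to_drop]


# Illustrative rows only.
rows = [
    {"name": "Ann Lee", "source": 1},
    {"name": "Bob Ray", "source": 1},
    {"name": "Anne Lee", "source": 2},
    {"name": "Cy Poe", "source": 2},
]
kept = keep_source1_and_unmatched_source2(rows, [(0, 2)])
print([row["name"] for row in kept])
```

In this sketch, pairs within the same dataset are ignored, so Dataset1 is never reduced; whether the real helper treats same-dataset pairs the same way is not shown in this notebook.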

In [None]:
# Alternative (disabled): keep the earliest row from each connected group of duplicates.
# deduped_df = merge_duplicate_pairs(combined_df, duplicates_idx)

In [None]:
# Keep all of Dataset1, add only unmatched Dataset2
deduped_df = keep_dataset1_and_unmatched_dataset2(combined_df, duplicates_idx)


In [None]:
deduped_df

In [None]:
print(f"Deduplicated dataframe has {len(deduped_df)} rows.")

### 8. Save Final Dataset

In [None]:
deduped_df.to_csv(output_path, index=False)
print(f"De-duplicated dataset saved to: {output_path}")