# Data Deduplication using Clustering
**Objective**: Learn and implement data deduplication techniques.

**Task**: Deduplication Using K-means Clustering

**Steps**:
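1. Data Set: Obtain a dataset containing duplicate customer records (the code below generates a synthetic one).
2. Preprocess: Standardize the numerical features so that a large-valued feature such as Income does not dominate the distance calculation.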
3. Apply K-means: Use K-means clustering to find and group similar customer records.
4. Identify Duplicates: Identify and remove duplicates within clusters.
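
Steps 2 and 4 can be illustrated on their own, without the clustering step. The following is a minimal sketch using only the Python standard library. The toy records, the helper names `standardize` and `near_duplicates`, and the threshold of 0.1 are assumptions made for illustration; they are not part of the notebook code. K-means (step 3) is left out on purpose: its role is only to restrict these pairwise comparisons to records in the same cluster.

```python
import math
import statistics

# Hypothetical toy records: (customer_id, age, income).
# IDs 1 and 4 are a near-duplicate pair; the others are distinct customers.
records = [
    (1, 25, 62000),
    (2, 52, 125000),
    (3, 38, 91000),
    (4, 25, 62500),
]

def standardize(values):
    """Z-score a column using the population standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

# Step 2: scale each numerical column independently.
ages = standardize([r[1] for r in records])
incomes = standardize([r[2] for r in records])
scaled = {r[0]: (a, i) for r, a, i in zip(records, ages, incomes)}

def near_duplicates(points, threshold):
    """Return ID pairs whose Euclidean distance in scaled space is below threshold."""
    ids = sorted(points)
    return [
        (a, b)
        for idx, a in enumerate(ids)
        for b in ids[idx + 1:]
        if math.dist(points[a], points[b]) < threshold
    ]

# Step 4: flag pairs that are closer than the (assumed) threshold.
pairs = near_duplicates(scaled, threshold=0.1)
print(pairs)  # [(1, 4)]
```

In raw units the two flagged records differ by 500 in Income and 0 in Age. After scaling, that gap is a small fraction of one standard deviation, which is why a threshold expressed in standardized units is easier to reason about than one in mixed units.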

In [1]:
import pandas as pd
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import pairwise_distances

# Step 1: Generate a dataset containing duplicate customer records (replace with your actual data)
np.random.seed(42)
num_records = 100
data = {
    'CustomerID': np.arange(1, num_records + 1),
    'Name': [f'Customer {i}' for i in range(1, num_records + 1)],
    'Age': np.random.randint(20, 60, num_records),
    'Income': np.random.randint(30000, 150000, num_records),
    'City': np.random.choice(['Bengaluru', 'Mumbai', 'Delhi', 'Chennai'], num_records)
}
df = pd.DataFrame(data)

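# Introduce some duplicates: the names repeat existing customers (one with a typo).
# Note: Age and Income are hard-coded here and may not match the original records' values.
df_duplicates = pd.DataFrame({
    'CustomerID': [101, 102, 103, 104, 105],
    'Name': ['Customer 5', 'Custumer 15', 'Customer 32', 'Customer 67', 'Customer 88'],  # 'Custumer 15' is a deliberate typo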
    'Age': [25, 38, 45, 52, 31],
    'Income': [62000, 91000, 78000, 125000, 48000],
    'City': ['Bengaluru', 'Mumbai', 'Delhi', 'Chennai', 'Bengaluru']
})

df = pd.concat([df, df_duplicates], ignore_index=True)
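# NOTE: for a mixed-dtype DataFrame, df.values is a copy, so np.random.shuffle(df.values)
# leaves df in its original order (as the output below shows). To actually shuffle the rows, use:
# df = df.sample(frac=1, random_state=42).reset_index(drop=True)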

print("Original DataFrame with Potential Duplicates:")
print(df.head())

# Step 2: Preprocess: Standardize the data to ensure better clustering.
# Select numerical features for clustering
numerical_features = ['Age', 'Income']
X = df[numerical_features].copy()

# Standardize the numerical features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled_df = pd.DataFrame(X_scaled, columns=numerical_features)

# Combine the identifier columns with the scaled numerical features (for later identification)
df_processed = pd.concat([df[['CustomerID', 'Name', 'City']].reset_index(drop=True), X_scaled_df], axis=1)
print("\nProcessed DataFrame (Scaled Numerical Features):")
print(df_processed.head())

# Step 3: Apply K-means: Use K-means clustering to find and group similar customer records.
# The number of clusters is a tuning choice; experiment with it (e.g. by comparing silhouette scores).
# For demonstration, we use 15 clusters so that each cluster holds only a handful of similar records.
n_clusters = 15
kmeans = KMeans(n_clusters=n_clusters, random_state=42, n_init=10)
df_processed['Cluster'] = kmeans.fit_predict(df_processed[numerical_features])

print("\nDataFrame with Cluster Assignments:")
print(df_processed.head())

# Step 4: Identify Duplicates: Identify and remove duplicates within clusters.
def identify_and_remove_duplicates(cluster_df):
    if len(cluster_df) <= 1:
        return cluster_df
    # Sort so that the lowest CustomerID in each duplicate pair is the one kept
    cluster_df = cluster_df.sort_values('CustomerID')
    # Calculate pairwise distances based on the scaled numerical features
    distances = pairwise_distances(cluster_df[numerical_features], metric='euclidean')
    # Distance threshold (in standard-deviation units) below which two records count as duplicates.
    # A smaller threshold means records need to be very similar to be considered duplicates;
    # 1.0 is loose and may need tuning downward for real data.
    similarity_threshold = 1.0
    duplicates_to_drop = set()
    for i in range(len(cluster_df)):
        for j in range(i + 1, len(cluster_df)):
            if distances[i, j] < similarity_threshold:
                # Drop the later record (higher CustomerID) and keep the earlier one
                duplicates_to_drop.add(cluster_df.index[j])
    return cluster_df.drop(index=list(duplicates_to_drop))

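# Apply the duplicate removal function to each cluster and recombine the results.
# Iterating over the groups explicitly keeps the 'Cluster' column regardless of pandas version.
df_deduplicated = pd.concat(
    [identify_and_remove_duplicates(group) for _, group in df_processed.groupby('Cluster')],
    ignore_index=True
)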

# Merge back with the original non-scaled data to show all columns
df_final = pd.merge(df[['CustomerID', 'Name', 'Age', 'Income', 'City']],
                    df_deduplicated[['CustomerID', 'Cluster']],
                    on='CustomerID', how='inner')

print("\nDeduplicated DataFrame:")
print(df_final.head())
print(f"\nNumber of original records: {len(df)}")
print(f"Number of deduplicated records: {len(df_final)}")

Original DataFrame with Potential Duplicates:
   CustomerID        Name  Age  Income       City
0           1  Customer 1   58  119135     Mumbai
1           2  Customer 2   48   65222  Bengaluru
2           3  Customer 3   34  107373     Mumbai
3           4  Customer 4   27  109575    Chennai
4           5  Customer 5   40  126354    Chennai

Processed DataFrame (Scaled Numerical Features):
   CustomerID        Name       City       Age    Income
0           1  Customer 1     Mumbai  1.685131  0.680862
1           2  Customer 2  Bengaluru  0.809196 -0.845697
2           3  Customer 3     Mumbai -0.417112  0.347818
3           4  Customer 4    Chennai -1.030265  0.410168
4           5  Customer 5    Chennai  0.108449  0.885270

DataFrame with Cluster Assignments:
   CustomerID        Name       City       Age    Income  Cluster
0           1  Customer 1     Mumbai  1.685131  0.680862        9
1           2  Customer 2  Bengaluru  0.809196 -0.845697        8
2           3  Customer 3  