<a href="https://colab.research.google.com/github/shinjinisen/data-privacy-pynb/blob/main/sampledatastet.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [2]:
import pandas as pd
!pip install recordlinkage



We create two datasets, dataset_A and dataset_B, each containing information about individuals.

In [151]:
import pandas as pd

# Sample dataset A
data_A = {
    'ID': [1, 2, 3, 4],
    'Name': ['John Smith', 'Alice Johnson', 'Rabert Brown', 'Emma Liu'],
    'Age': [30, 25, 35, 28],
    'Address': ['123 Main St', '456 Elm St', '689 Oak St', '101 Pine St'],
    'Phone': ['111-111-1111', '222-222-2222', '334-333-3333', '944-444-4444']
}

dataset_A = pd.DataFrame(data_A)

# Sample dataset B
data_B = {
    'ID': [5, 6, 7, 8],
    'Name': ['John Smith', 'Alice Johnson', 'Michael Clark', 'Emma Lee'],
    'Age': [30, 26, 35, 27],
    'Address': ['123 Main St', '456 Elm St', '789 Oak St', '101 Pine St'],
    'Phone': ['111-111-1111', '222-222-2222', '333-333-3333', '555-555-5555']
}

dataset_B = pd.DataFrame(data_B)

Now that we have our sample datasets, let's proceed with the record linkage steps:

**1. Data Preparation:**

In this step, we prepare the datasets for record linkage by cleaning and transforming them as needed. This may involve:

- Removing irrelevant columns.

- Standardizing data formats (e.g., converting all text to lowercase).

- Handling missing values.

- Normalizing data (e.g., removing accents, standardizing date formats).

In [145]:
# Convert text to lowercase
dataset_A['Name'] = dataset_A['Name'].str.lower()
dataset_A['Address'] = dataset_A['Address'].str.lower()

dataset_B['Name'] = dataset_B['Name'].str.lower()
dataset_B['Address'] = dataset_B['Address'].str.lower()

# Drop rows that contain missing values, if any
dataset_A.dropna(inplace=True)
dataset_B.dropna(inplace=True)

# Normalize data: strip accents by decomposing characters (NFKD)
# and discarding everything that is not ASCII
import unicodedata
dataset_A['Name'] = dataset_A['Name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore').decode())
dataset_B['Name'] = dataset_B['Name'].apply(lambda x: unicodedata.normalize('NFKD', x).encode('ASCII', 'ignore').decode())
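
The cleaning steps above can also be collected into one reusable helper. The following is a minimal sketch, not the notebook's own method: the function name normalize_text is hypothetical, and collapsing repeated whitespace is an extra assumption that the cell above does not make. It passes None through unchanged so the caller can decide how to treat missing values.

```python
import unicodedata

def normalize_text(value):
    """Hypothetical helper: lowercase, strip accents, collapse whitespace."""
    if value is None:
        return None  # leave missing values for the caller to handle
    # NFKD splits accented characters into base letter + combining mark;
    # encoding to ASCII with 'ignore' then drops the combining marks.
    ascii_text = (
        unicodedata.normalize("NFKD", value)
        .encode("ascii", "ignore")
        .decode("ascii")
    )
    # split() without arguments also trims leading/trailing whitespace.
    return " ".join(ascii_text.lower().split())

print(normalize_text("José  Núñez "))  # jose nunez
```

With pandas, such a helper could be applied column by column, for example with dataset_A['Name'].map(normalize_text).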

**2. Blocking:**

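Blocking reduces the number of record pairs that must be compared. Records are grouped into blocks by a key attribute, and only records that share a block are compared with each other.

We'll create blocks from a simple shared attribute: the first letter of the Name field. Note that the code stores this in a column called Last_Name_Block, although str[0] takes the first character of the full name, which is the first-name initial.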

In [157]:
# Create blocks based on the first character of the Name field
# (str[0] returns the first-name initial, despite the column name)
dataset_A['Last_Name_Block'] = dataset_A['Name'].str[0]
dataset_B['Last_Name_Block'] = dataset_B['Name'].str[0]
# Alternative blocking key: exact Address match.
# The comparison step uses Last_Name_Block, not these pairs.
blocked_pairs = []
for address in dataset_A['Address'].unique():
    block_A = dataset_A[dataset_A['Address'] == address]
    block_B = dataset_B[dataset_B['Address'] == address]
    for index_A in block_A.index:
        for index_B in block_B.index:
            blocked_pairs.append((index_A, index_B))
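
To see how much work blocking saves, the sketch below counts candidate pairs with and without it. It is an illustration in plain Python rather than the notebook's DataFrame code: the names block_key and candidate_pairs are hypothetical, and the records are the lowercased names from the sample datasets.

```python
from collections import defaultdict

# (ID, lowercased Name) copied from the sample datasets.
records_a = [(1, "john smith"), (2, "alice johnson"),
             (3, "rabert brown"), (4, "emma liu")]
records_b = [(5, "john smith"), (6, "alice johnson"),
             (7, "michael clark"), (8, "emma lee")]

def block_key(name):
    """Hypothetical blocking key: first character of the name."""
    return name[0]

def candidate_pairs(left, right, key=block_key):
    # Index the right-hand records by block key once...
    index = defaultdict(list)
    for rid, name in right:
        index[key(name)].append(rid)
    # ...then pair each left record only with records in the same block.
    return [(lid, rid)
            for lid, name in left
            for rid in index.get(key(name), [])]

pairs = candidate_pairs(records_a, records_b)
print(len(records_a) * len(records_b))  # 16 pairs without blocking
print(pairs)                            # [(1, 5), (2, 6), (4, 8)]
```

Indexing one side by key avoids the nested scan over both datasets. The trade-off is that a true match whose key differs (for example, a typo in the first letter) never becomes a candidate.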


**3. Comparison:**

In this step, we compute the similarity between pairs of records based on their attributes. Common similarity metrics include:

- Edit distance for textual data.

- Jaccard similarity for sets.

- Cosine similarity for vectors.

We'll compare pairs of records within each block on the Name attribute. The score comes from fuzzywuzzy's token_set_ratio, which applies fuzzy string matching to the sets of tokens in each name; it is related to, but not the same as, a pure Jaccard similarity or a raw edit distance.
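
For reference, the three metrics listed above can be sketched in plain Python. These are textbook versions written for illustration, not the functions fuzzywuzzy uses internally; the choice of word tokens for Jaccard and character counts for cosine is an assumption made here.

```python
import math
from collections import Counter

def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def jaccard(a, b):
    """Jaccard similarity of the word-token sets of two strings."""
    sa, sb = set(a.split()), set(b.split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def cosine(a, b):
    """Cosine similarity of character-count vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0

print(edit_distance("emma liu", "emma lee"))  # 2
print(jaccard("emma liu", "emma lee"))        # 0.3333333333333333
```

Edit distance is a count where lower means more similar, while Jaccard and cosine are similarities in the range 0 to 1, so the scores need to be put on a common scale before a single threshold is applied to them.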

In [158]:
!pip install fuzzywuzzy
import fuzzywuzzy




In [159]:
from fuzzywuzzy import fuzz

# Score two names with fuzzywuzzy's token_set_ratio,
# rescaled from the library's 0-100 range to 0-1
def calculate_similarity(name1, name2):
    return fuzz.token_set_ratio(name1, name2) / 100.0

# Calculate similarity scores for pairs of records
similarity_scores = []
for index_A, record_A in dataset_A.iterrows():
    for index_B, record_B in dataset_B.iterrows():
        if record_A['Last_Name_Block'] == record_B['Last_Name_Block']:
            similarity_score = calculate_similarity(record_A['Name'], record_B['Name'])
            similarity_scores.append({
                'ID_A': record_A['ID'],
                'ID_B': record_B['ID'],
                'Similarity_Score': similarity_score
            })

# Convert the similarity scores to a DataFrame
similarity_df = pd.DataFrame(similarity_scores)


**4. Classification:**

We classify each candidate pair as a match or a non-match by applying a threshold to its similarity score: pairs at or above the threshold are potential matches, and the rest are non-matches.

In [160]:
# Set a similarity threshold
similarity_threshold = 0.8

# Classify pairs of records
matches = similarity_df[similarity_df['Similarity_Score'] >= similarity_threshold]
non_matches = similarity_df[similarity_df['Similarity_Score'] < similarity_threshold]
print(matches)
print(non_matches)

   ID_A  ID_B  Similarity_Score
0     1     5               1.0
1     2     6               1.0
   ID_A  ID_B  Similarity_Score
2     4     8              0.75
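
The choice of threshold decides which pairs count as matches. The sketch below reuses the three scores printed above and reclassifies them at a few thresholds; the classify helper and the thresholds other than 0.8 are illustrative assumptions.

```python
# (ID_A, ID_B, Similarity_Score) taken from the output above.
scored_pairs = [(1, 5, 1.0), (2, 6, 1.0), (4, 8, 0.75)]

def classify(pairs, threshold):
    """Split scored pairs into matches (>= threshold) and non-matches."""
    matches = [(a, b) for a, b, score in pairs if score >= threshold]
    non_matches = [(a, b) for a, b, score in pairs if score < threshold]
    return matches, non_matches

for threshold in (0.7, 0.8, 0.9):
    matched, unmatched = classify(scored_pairs, threshold)
    print(threshold, len(matched), len(unmatched))
# 0.7 3 0
# 0.8 2 1
# 0.9 2 1
```

Lowering the threshold to 0.7 would accept the pair (4, 8), "Emma Liu" and "Emma Lee", as a match, which shows how a looser threshold trades precision for recall.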


**5. Evaluation of Record Linkage:**

We'll evaluate the results by comparing them to a reference dataset (ground truth) if available.

In [161]:
# Assuming we have a ground truth dataset
ground_truth = {
    'ID_A': [1, 2, 3],
    'ID_B': [5, 6, 7]
}

ground_truth_df = pd.DataFrame(ground_truth)

# Calculate precision, recall, and F1-score
true_positives = len(matches.merge(ground_truth_df, on=['ID_A', 'ID_B']))
false_positives = len(matches) - true_positives
false_negatives = len(ground_truth_df) - true_positives

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1_score = 2 * (precision * recall) / (precision + recall)

print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1_score)



Precision: 1.0
Recall: 0.6666666666666666
F1-score: 0.8
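
The same metrics can be computed on sets of ID pairs, with guards for the case where a denominator is zero (for example, when no pair passes the threshold). This is a sketch under that assumption; the function name evaluate is hypothetical, and returning 0.0 for an undefined metric is a convention chosen here.

```python
def evaluate(predicted, truth):
    """Precision, recall and F1 for sets of (ID_A, ID_B) pairs."""
    true_positives = len(predicted & truth)
    false_positives = len(predicted - truth)
    false_negatives = len(truth - predicted)

    # Convention assumed here: an undefined ratio is reported as 0.0.
    precision = (true_positives / (true_positives + false_positives)
                 if predicted else 0.0)
    recall = (true_positives / (true_positives + false_negatives)
              if truth else 0.0)
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Matches and ground truth from the cells above.
predicted = {(1, 5), (2, 6)}
truth = {(1, 5), (2, 6), (3, 7)}

precision, recall, f1 = evaluate(predicted, truth)
print(precision)         # 1.0
print(round(recall, 4))  # 0.6667
print(round(f1, 4))      # 0.8
```

Here there are 2 true positives, 0 false positives, and 1 false negative (the pair (3, 7), which blocking never proposed), which reproduces the values printed above.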


**6. Post-processing:**

In this step, we'll resolve duplicates by merging or flagging them for further review. For demonstration purposes, let's print out the matched records and mark them as potential duplicates.

In [162]:
# Print out matched records and mark them as potential duplicates
for index, row in matches.iterrows():
    record_A = dataset_A[dataset_A['ID'] == row['ID_A']].iloc[0]
    record_B = dataset_B[dataset_B['ID'] == row['ID_B']].iloc[0]
    print("Potential Duplicate:")
    print("Record A:", record_A)
    print("Record B:", record_B)
    print()


Potential Duplicate:
Record A: ID                            1
Name                 John Smith
Age                          30
Address             123 Main St
Phone              111-111-1111
Last_Name_Block               J
Name: 0, dtype: object
Record B: ID                            5
Name                 John Smith
Age                          30
Address             123 Main St
Phone              111-111-1111
Last_Name_Block               J
Name: 0, dtype: object

Potential Duplicate:
Record A: ID                             2
Name               Alice Johnson
Age                           25
Address               456 Elm St
Phone               222-222-2222
Last_Name_Block                A
Name: 1, dtype: object
Record B: ID                             6
Name               Alice Johnson
Age                           26
Address               456 Elm St
Phone               222-222-2222
Last_Name_Block                A
Name: 1, dtype: object
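
One way to go from printing duplicates to flagging them is to record which fields disagree within each matched pair. The sketch below uses plain dictionaries with the values shown in the output above; the helper differing_fields and the "merge" and "review" labels are hypothetical, not part of the notebook.

```python
FIELDS = ["Name", "Age", "Address", "Phone"]

# Matched record pairs, with values as printed in the output above.
matched_pairs = [
    ({"ID": 1, "Name": "John Smith", "Age": 30,
      "Address": "123 Main St", "Phone": "111-111-1111"},
     {"ID": 5, "Name": "John Smith", "Age": 30,
      "Address": "123 Main St", "Phone": "111-111-1111"}),
    ({"ID": 2, "Name": "Alice Johnson", "Age": 25,
      "Address": "456 Elm St", "Phone": "222-222-2222"},
     {"ID": 6, "Name": "Alice Johnson", "Age": 26,
      "Address": "456 Elm St", "Phone": "222-222-2222"}),
]

def differing_fields(rec_a, rec_b, fields=FIELDS):
    """Return the fields whose values differ between two records."""
    return [field for field in fields if rec_a[field] != rec_b[field]]

review = []
for rec_a, rec_b in matched_pairs:
    diffs = differing_fields(rec_a, rec_b)
    # Hypothetical policy: identical records can be merged automatically,
    # anything with a conflicting field goes to manual review.
    action = "merge" if not diffs else "review"
    review.append((rec_a["ID"], rec_b["ID"], diffs, action))

print(review)
# [(1, 5, [], 'merge'), (2, 6, ['Age'], 'review')]
```

Pair (2, 6) agrees on name, address, and phone but differs in Age (25 versus 26), so it is flagged for review rather than merged.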

