### Anonymizing data
Data anonymization is the process of removing or modifying personal identifiers in a dataset so that individuals cannot be identified. This protects privacy and confidentiality and lowers the risk of sharing or analyzing the data.

1. Data Masking

    Definition: Replacing original data with fictitious but realistic-looking data to protect sensitive information.
    Example:
        Replace real credit card numbers (1234-5678-9123-4567) with masked values (XXXX-XXXX-XXXX-4567).
    Use Case: Often used in test environments where the structure of the data needs to be preserved but actual values cannot be disclosed.
    Strength: Preserves data usability without revealing sensitive information.
    Weakness: The format and any unmasked characters remain visible, so a value can sometimes be narrowed down or re-identified when combined with other data.


In [1]:
import pandas as pd

# Load the CSV file
data = pd.read_csv('anonymized_data.csv')

# Display the first few rows of the original data
print("Original Data:")
data.head()

Original Data:


Unnamed: 0,Name,Email,Phone,Address,Date_of_Birth,Salary
0,Donna Schneider,goodmancesar@gmail.com,989.248.2739x0758,"6932 Mckenzie Ports Apt. 212\nJamestown, VA 27364",1974-09-06,63834
1,Kyle Brewer,denisemanning@jenkins.com,621-610-7149,"364 Salinas Port Apt. 615\nLake Carlburgh, FL ...",1980-03-23,56634
2,Jacob Young,deniseadams@hotmail.com,847-944-9635,"87898 Ortiz Divide\nWrightport, IN 59110",1940-06-10,132530
3,Jessica Anderson,ericbryant@yahoo.com,242-458-9961,"0527 Pugh Stravenue Apt. 655\nWest Kelly, MN 8...",1977-12-12,92139
4,Jonathan Lopez,sullivanmelissa@hotmail.com,107-180-4156,"208 Julia Junction\nLake Susan, ME 63583",1978-04-21,129014


In [2]:
# Define a masking function
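def mask_string(value, mask_char="X", unmasked_length=3):
    """Masks all but the last `unmasked_length` characters of a string."""
    # Clamp the visible length to the string length; slicing from a computed
    # start also handles unmasked_length=0 (value[-0:] would return everything).
    visible = min(max(unmasked_length, 0), len(value))
    return mask_char * (len(value) - visible) + value[len(value) - visible:]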

# Mask sensitive fields
data_masked = data.copy()
data_masked["Email"] = data_masked["Email"].apply(lambda x: mask_string(x.split("@")[0]) + "@" + x.split("@")[1])
data_masked["Phone"] = data_masked["Phone"].apply(mask_string)
data_masked["Address"] = "XXXXX"  # every address becomes the same placeholder

# Display the masked data
print("Masked Data:")
data_masked.head()


Masked Data:


Unnamed: 0,Name,Email,Phone,Address,Date_of_Birth,Salary
0,Donna Schneider,XXXXXXXXXsar@gmail.com,XXXXXXXXXXXXXX758,XXXXX,1974-09-06,63834
1,Kyle Brewer,XXXXXXXXXXing@jenkins.com,XXXXXXXXX149,XXXXX,1980-03-23,56634
2,Jacob Young,XXXXXXXXams@hotmail.com,XXXXXXXXX635,XXXXX,1940-06-10,132530
3,Jessica Anderson,XXXXXXXant@yahoo.com,XXXXXXXXX961,XXXXX,1977-12-12,92139
4,Jonathan Lopez,XXXXXXXXXXXXssa@hotmail.com,XXXXXXXXX156,XXXXX,1978-04-21,129014



1. Email Masking: The username is masked except for its last three characters; the domain (`@domain.com`) remains visible.
   - Example: `john.doe@example.com` becomes `XXXXXdoe@example.com`.
2. Phone Masking: All but the last three characters are masked, including separators.
   - Example: `123-456-7890` becomes `XXXXXXXXX890`.
3. Address Masking: Replaces every address with `"XXXXX"`.
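
The credit card example in the definition keeps the dashes (`XXXX-XXXX-XXXX-4567`), whereas `mask_string` replaces every character before the last few, separators included. A minimal sketch of a separator-preserving variant is shown below; `mask_digits` and its `visible_digits` parameter are hypothetical names introduced here for illustration and are not part of the notebook.

```python
def mask_digits(value, mask_char="X", visible_digits=4):
    """Mask every digit except the last `visible_digits`, keeping separators."""
    total_digits = sum(ch.isdigit() for ch in value)
    digits_to_mask = max(total_digits - visible_digits, 0)

    masked = []
    for ch in value:
        if ch.isdigit() and digits_to_mask > 0:
            # Replace leading digits until only the visible ones remain
            masked.append(mask_char)
            digits_to_mask -= 1
        else:
            # Separators and the trailing visible digits pass through unchanged
            masked.append(ch)
    return "".join(masked)


print(mask_digits("1234-5678-9123-4567"))             # XXXX-XXXX-XXXX-4567
print(mask_digits("123-456-7890", visible_digits=3))  # XXX-XXX-X890
```

Keeping separators makes the masked value look like the original format, which helps in test environments, at the cost of revealing the layout of the field.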



2. Pseudonymization

    Definition: Replacing identifying data with artificial identifiers or pseudonyms. The mapping to original data is kept separate and secure.
    Example:
        Replace "John Smith" with "User_12345" or a hash value.
    Use Case: GDPR-compliant applications where identifiers are transformed but can still be linked back to the original data with proper authorization.
    Strength: Allows re-identification if needed (e.g., for legal or medical reasons).
    Weakness: If the mapping file is compromised, data can be re-identified.
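
The definition above mentions a mapping that is kept separate from the data. One way this could look is sketched below, assuming random `User_...` pseudonyms and an in-memory dictionary as the mapping; `build_pseudonym_map` is a hypothetical helper, and a real deployment would store the mapping in a separate, access-controlled location.

```python
import secrets


def build_pseudonym_map(values, prefix="User_"):
    """Assign each distinct value a random pseudonym and return the lookup table."""
    mapping = {}
    used = set()
    for value in values:
        if value in mapping:
            continue  # repeated values reuse their existing pseudonym
        pseudonym = prefix + secrets.token_hex(4)
        while pseudonym in used:  # regenerate on the rare collision
            pseudonym = prefix + secrets.token_hex(4)
        used.add(pseudonym)
        mapping[value] = pseudonym
    return mapping


names = ["John Smith", "Jane Doe", "John Smith"]
name_map = build_pseudonym_map(names)  # keep this table apart from the shared data
pseudonymized = [name_map[name] for name in names]

# Authorized re-identification simply inverts the table
reverse_map = {pseudonym: original for original, pseudonym in name_map.items()}
print(pseudonymized)
```

Because the pseudonyms are random, they reveal nothing about the original values; everything needed for re-identification lives in the mapping table.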

In [3]:
import hashlib
# Define a pseudonymization function
def pseudonymize(value):
    """Generates a pseudonym for a given value using a hash function."""
    return hashlib.sha256(value.encode()).hexdigest()

# Pseudonymize sensitive fields
data_pseudonymized = data.copy()
data_pseudonymized["Name"] = data_pseudonymized["Name"].apply(pseudonymize)
data_pseudonymized["Email"] = data_pseudonymized["Email"].apply(pseudonymize)
data_pseudonymized["Phone"] = data_pseudonymized["Phone"].apply(pseudonymize)

# Display the pseudonymized data
print("Pseudonymized Data:")
data_pseudonymized.head()


Pseudonymized Data:


Unnamed: 0,Name,Email,Phone,Address,Date_of_Birth,Salary
0,f1b6b221ccd58badb4ff9bf9c9213cea70d160a46d2497...,3301df7fe54d8365844320e0325c2bdc3147272966050c...,81a6c840d832154c1b0a3afaee4aa23dd4d1cc5689e31e...,"6932 Mckenzie Ports Apt. 212\nJamestown, VA 27364",1974-09-06,63834
1,51dbc377e8cfd06ecc05b6f77ae0736316393db3b74364...,d8403dbebc1cc777afea83c6b7121d7dc7d81520995ad5...,eb4159f6448567b3b662143db2fece5341c65993737262...,"364 Salinas Port Apt. 615\nLake Carlburgh, FL ...",1980-03-23,56634
2,354683464e3fb40633fb5a8163cff6959e7754a65bc692...,23606288543313d27aa49be9df328ed97f28e6f0688fe7...,fe627a0d308dd0da681cccf80dc428665d9d1b4b387141...,"87898 Ortiz Divide\nWrightport, IN 59110",1940-06-10,132530
3,37ce3e402e8ff739aca60d90dc189087dec4415074efb0...,74a7775f6f36753cc975e07290f33e7451afff6d5d3bed...,2ac47dade2ca840e4c39328238fd4bf921b9195bad9ce8...,"0527 Pugh Stravenue Apt. 655\nWest Kelly, MN 8...",1977-12-12,92139
4,4d1d2194198980f71c9e2468b848b1a40c1e6058577f89...,dc4acdd5f7f8a679d9b68346b09cbbd88bec8edda92aab...,f19e6855e336aa92425b692f20474eafacc81e22d97780...,"208 Julia Junction\nLake Susan, ME 63583",1978-04-21,129014


Name Pseudonymization:

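    Converts each name into a SHA-256 digest (64 hexadecimal characters). The same name always yields the same digest, so records stay linkable.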
    Example: "John Smith" becomes "8b5a864ccf3dbb5e3d...".

Email Pseudonymization:

    Replaces email addresses with hash strings, keeping distinct addresses distinct without revealing the address itself.
    Example: "john.doe@example.com" becomes "a6c99f1f3bff...".

Phone Pseudonymization:

    Similarly, converts phone numbers into hash strings.
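
A plain SHA-256 digest of a low-entropy value such as a phone number can be recovered by hashing candidate values and comparing the results. A keyed hash (HMAC) avoids this as long as the key stays secret. The sketch below is one possible variant of `pseudonymize`, not the notebook's method; `pseudonymize_keyed` and `SECRET_KEY` are hypothetical names, and the key shown is a placeholder.

```python
import hashlib
import hmac

# Placeholder only: in practice the key would come from a secrets manager
SECRET_KEY = b"replace-with-a-randomly-generated-key"


def pseudonymize_keyed(value, key=SECRET_KEY):
    """Return the HMAC-SHA256 hex digest of a value under a secret key."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


digest = pseudonymize_keyed("123-456-7890")
print(len(digest))                                    # 64
print(digest == pseudonymize_keyed("123-456-7890"))   # True: same key, same input
```

In this setup the key plays the role of the mapping file: anyone holding it can recompute pseudonyms for known values, so it needs the same protection.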



3. Data Aggregation

    Definition: Summarizing or grouping data to prevent individual identification.
    Example:
        Instead of recording individual salaries, report the average salary for a department.
    Use Case: Publishing demographic or statistical data without exposing individual records.
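    Strength: Removes individual records from the output, which hides direct identifiers and makes indirect identification harder.
    Weakness: Reduces granularity, making the data less useful for detailed analysis. Very small groups can still reveal individual values.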

In [4]:
# Define salary ranges for aggregation
salary_bins = [30000, 50000, 70000, 100000, 150000]
salary_labels = ["30k-50k", "50k-70k", "70k-100k", "100k-150k"]

# Add a new column for salary ranges
data["Salary_Range"] = pd.cut(data["Salary"], bins=salary_bins, labels=salary_labels, right=False)

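# Aggregate data to show the average salary and count of individuals in each range
# (observed=False keeps empty ranges and avoids the pandas FutureWarning for categorical keys)
aggregated_data = data.groupby("Salary_Range", observed=False).agg(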
    Avg_Salary=("Salary", "mean"),
    Count=("Salary", "size")
).reset_index()

# Display the aggregated data
print("Aggregated Data:")
print(aggregated_data)


Aggregated Data:
  Salary_Range     Avg_Salary  Count
0      30k-50k   39889.416667     12
1      50k-70k   59804.473684     19
2     70k-100k   86605.666667     33
3    100k-150k  130639.750000     36
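
The definition's example reports an average salary per department. A group with only one or two members would still expose individual salaries, so a common precaution is to drop groups below a minimum size before publishing. The sketch below assumes a toy DataFrame with a `Department` column (the notebook's CSV has none) and an arbitrary threshold, `MIN_GROUP_SIZE`.

```python
import pandas as pd

# Hypothetical toy data; not taken from the notebook's CSV
df = pd.DataFrame({
    "Department": ["Sales", "Sales", "Sales", "IT", "IT", "Legal"],
    "Salary": [50000, 60000, 70000, 80000, 90000, 120000],
})

MIN_GROUP_SIZE = 2  # assumed threshold; choose one that fits the data and policy

# Average salary and headcount per department
summary = (
    df.groupby("Department")
      .agg(Avg_Salary=("Salary", "mean"), Count=("Salary", "size"))
      .reset_index()
)

# Suppress groups too small to hide an individual
publishable = summary[summary["Count"] >= MIN_GROUP_SIZE].reset_index(drop=True)
print(publishable)
```

Here the single-person `Legal` group is removed, because its "average" would be that person's exact salary.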





4. Random Data Generation

    Definition: Replacing sensitive data with randomly generated values that have no relationship to the original data.
    Example:
        Replace a person's real name with a randomly generated string like "XZ8D6P".
    Use Case: Creating dummy datasets for testing or training machine learning models.
    Strength: The generated values have no link to real-world individuals, so they cannot be reversed to recover the originals.
    Weakness: Does not preserve statistical or relational properties of the original data.

In [5]:
import faker

# Initialize Faker for generating random data
fake = faker.Faker()

# Copy the original data
data_randomized = data.copy()

# Generate random values for sensitive fields
data_randomized["Name"] = [fake.name() for _ in range(len(data_randomized))]
data_randomized["Email"] = [fake.email() for _ in range(len(data_randomized))]
data_randomized["Phone"] = [fake.phone_number() for _ in range(len(data_randomized))]
data_randomized["Address"] = [fake.address() for _ in range(len(data_randomized))]

# Display the randomized data
print("Randomized Data:")
data_randomized.head()


Randomized Data:


Unnamed: 0,Name,Email,Phone,Address,Date_of_Birth,Salary,Salary_Range
0,Tristan Rhodes,irivera@example.net,822.281.9810x060,"08108 Miguel Views\nScottfurt, AL 27546",1974-09-06,63834,50k-70k
1,Allen Garcia,zjones@example.net,9113319372,"0351 Williams Center Apt. 326\nJonesfurt, IL 3...",1980-03-23,56634,50k-70k
2,Chloe Santiago DVM,tyates@example.net,496.718.4525,"285 Mullins Mount Suite 729\nHarrisonchester, ...",1940-06-10,132530,100k-150k
3,Jeffrey Brown,frycody@example.com,8997528540,"7911 Thomas Ways\nEast Julianside, MA 28720",1977-12-12,92139,70k-100k
4,Denise Cervantes,stephanie78@example.com,+1-467-598-3095x17852,"922 Noah Stream\nLisaborough, PW 37168",1978-04-21,129014,100k-150k


Name Randomization:

    Replaces each name with a randomly generated name, e.g., "John Smith" might become "Aisha Ali".

Email Randomization:

    Replaces each email with a randomly generated one.

Phone and Address Randomization:

    Each phone number and address is replaced with a randomly generated value in the correct format.
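
The definition's example replaces a name with a random string such as "XZ8D6P" rather than a realistic fake name. A minimal sketch of that variant using only the standard library follows; `random_token` and `ALPHABET` are hypothetical names introduced for illustration.

```python
import secrets
import string

ALPHABET = string.ascii_uppercase + string.digits


def random_token(length=6):
    """Return a random string of uppercase letters and digits."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


names = ["John Smith", "Jane Doe", "John Smith"]

# Each row gets a fresh token; no mapping back to the original is kept
randomized = [random_token() for _ in names]
print(randomized)
```

Unlike pseudonymization, the two "John Smith" rows receive unrelated tokens, which is why this approach does not preserve relationships between records.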