# Anonymization Techniques

In this notebook, we will explore various techniques used to anonymize data, ensuring that personal and sensitive information is protected. Each section will cover a different technique, including definitions, purposes, and examples.


1. Data Masking
2. Data Pseudonymization
3. Generalization
4. Suppression
5. Noise Addition
6. Data Swapping
7. Aggregation
8. Tokenization
9. Synthetic Data Generation
10. Differential Privacy

## 1. Data Masking

**Definition:** Data masking replaces sensitive information with obscured values to protect privacy.

**Example:**

In [1]:
import pandas as pd

# Sample data
data = {'Patient ID': ['123456789', '987654321', '567890123'], 'Name': ['Alice', 'Bob', 'Charlie']}
df = pd.DataFrame(data)

# Masking patient IDs
df['Masked Patient ID'] = df['Patient ID'].apply(lambda x: x[:3] + 'XXXXX' + x[-2:])
print(df)


  Patient ID     Name Masked Patient ID
0  123456789    Alice        123XXXXX89
1  987654321      Bob        987XXXXX21
2  567890123  Charlie        567XXXXX23


**Advantages:**

- Simple to implement
- Effective at hiding specific pieces of sensitive data

**Disadvantages:**

- Can be reversible if the masking pattern is predictable
- Limited protection if not combined with other techniques

**Level of Security:** `MODERATE`

- Effective for hiding specific data points but can be vulnerable if patterns are predictable
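
The example above produces a ten-character mask for a nine-character ID, so the masked field no longer has the original length. Where length must be preserved, a helper along the following lines could be used. This is a minimal sketch: `mask_middle` and its parameters are hypothetical names introduced here for illustration, not part of any library.

```python
def mask_middle(value: str, keep_start: int = 3, keep_end: int = 2,
                mask_char: str = "X") -> str:
    """Mask all but the first keep_start and last keep_end characters.

    Hypothetical helper for illustration; output length equals input length.
    """
    # Values too short to keep both edges are masked entirely.
    if len(value) <= keep_start + keep_end:
        return mask_char * len(value)
    masked_len = len(value) - keep_start - keep_end
    return value[:keep_start] + mask_char * masked_len + value[len(value) - keep_end:]


patient_ids = ["123456789", "987654321", "567890123"]
masked_ids = [mask_middle(pid) for pid in patient_ids]
print(masked_ids)  # ['123XXXX89', '987XXXX21', '567XXXX23']
```

Masking short values entirely avoids the case where the kept prefix and suffix would together reveal the whole value.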

## 2. Data Pseudonymization

**Definition:** Pseudonymization replaces private identifiers with pseudonyms or fake identifiers.

**Example:**

In [2]:
import uuid
import pandas as pd

# Sample data
data = {'Employee ID': ['E12345', 'E98765', 'E56789'], 'Name': ['Alice', 'Bob', 'Charlie']}
df = pd.DataFrame(data)

# Pseudonymizing employee IDs
df['Pseudonymized Employee ID'] = [str(uuid.uuid4()) for _ in range(df.shape[0])]
print(df)


  Employee ID     Name             Pseudonymized Employee ID
0      E12345    Alice  f2afc7ee-c52f-441c-94df-aa18edd77d2b
1      E98765      Bob  39f56b97-3ad6-4197-8b49-d3e515888cf4
2      E56789  Charlie  319a3999-9758-43f7-b49a-de9e54973e22


**Advantages:**

- Reduces the risk of re-identification
- Can be reversed if the pseudonym mapping is retained, which preserves data utility

**Disadvantages:**

- Potential for re-identification if pseudonyms are not managed securely
- Requires careful management of pseudonym mapping

**Level of Security:** `HIGH`

- Provides a balance between data utility and privacy but requires secure management of pseudonym mappings.
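
The cell above draws a fresh UUID for every row and does not keep a mapping, so a repeated ID would receive different pseudonyms and the result could not be reversed. The sketch below shows one way to keep pseudonyms consistent and reversible. The sample values and the names `pseudonym_map` and `reverse_map` are assumptions made for illustration.

```python
import uuid
import pandas as pd

# Illustrative data in which one employee ID appears twice
df = pd.DataFrame({'Employee ID': ['E12345', 'E98765', 'E12345'],
                   'Hours': [8, 6, 7]})

# One pseudonym per distinct ID. This mapping is the re-identification key
# and is assumed to be stored separately from the released data.
pseudonym_map = {emp_id: str(uuid.uuid4()) for emp_id in df['Employee ID'].unique()}

df_pseudo = df.copy()
df_pseudo['Pseudonymized Employee ID'] = df_pseudo['Employee ID'].map(pseudonym_map)
df_pseudo = df_pseudo.drop(columns=['Employee ID'])
print(df_pseudo)

# Authorized reversal: invert the mapping and look the pseudonyms up
reverse_map = {pseudonym: emp_id for emp_id, pseudonym in pseudonym_map.items()}
recovered = df_pseudo['Pseudonymized Employee ID'].map(reverse_map)
```

Dropping the original column before release matters here: a table that carries both the ID and its pseudonym gives away the mapping.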

## 3. Generalization

**Definition:** Generalization dilutes the granularity of data to make it less specific.

**Example:**

In [3]:
# Sample data
data = {'Income': [15000, 45000, 78000, 120000, 200000]}
df = pd.DataFrame(data)

# Generalizing income into ranges
# right=False makes each bin include its lower edge, so 200000 falls in '200K+'
df['Income Range'] = pd.cut(df['Income'], bins=[0, 30000, 60000, 100000, 200000, float('inf')], labels=['0-30K', '30-60K', '60-100K', '100-200K', '200K+'], right=False)
print(df)


   Income Income Range
0   15000        0-30K
1   45000       30-60K
2   78000      60-100K
3  120000     100-200K
4  200000        200K+


**Advantages:**

- Reduces the risk of re-identification by making data less specific
- Simple to implement and understand

**Disadvantages:**

- Loss of data granularity can reduce data utility
- Ranges that are too narrow offer little protection, while ranges that are too broad discard useful detail

**Level of Security:** `MODERATE`

- Effective for making data less specific but may reduce data utility
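
How much protection a set of ranges gives depends on how many records share each range. One rough way to compare schemes is to look at the smallest non-empty group, as sketched below. The income values and both bin layouts are assumptions chosen for illustration.

```python
import pandas as pd

incomes = pd.Series([15000, 22000, 45000, 52000, 78000, 85000], name='Income')

# Narrow 10K-wide bands versus three broad bands
narrow = pd.cut(incomes, bins=list(range(0, 100001, 10000)), right=False)
broad = pd.cut(incomes, bins=[0, 30000, 60000, 100000], right=False)

# Size of the smallest non-empty group under each scheme
narrow_counts = narrow.value_counts()
broad_counts = broad.value_counts()
min_narrow = narrow_counts[narrow_counts > 0].min()
min_broad = broad_counts[broad_counts > 0].min()
print(min_narrow, min_broad)  # 1 2
```

With the narrow bands every record sits alone in its range, so the range still singles out one person. The broad bands put at least two records in each range.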

## 4. Suppression

**Definition:** Suppression involves removing sensitive information from the dataset.

**Example:**

In [4]:
# Sample data
data = {'Name': ['Alice', 'Bob', 'Charlie'], 'Address': ['123 Main St', '456 Elm St', '789 Oak St']}
df = pd.DataFrame(data)

# Suppressing addresses
df = df.drop(columns=['Address'])
print(df)


      Name
0    Alice
1      Bob
2  Charlie


**Advantages:**

- Completely removes sensitive information
- Simple to implement

**Disadvantages:**

- Loss of data can significantly reduce data utility
- Not applicable if the removed data is essential for analysis

**Level of Security:** `HIGH`

- Provides strong privacy protection but at the cost of data utility
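
Suppression does not have to remove a whole column. When only a few values are identifying, those cells alone can be blanked, which keeps more of the data usable. The following is a minimal sketch under assumed data and an assumed threshold `min_count`.

```python
import pandas as pd

df = pd.DataFrame({
    'Record ID': [1, 2, 3, 4],
    'Diagnosis': ['Flu', 'Flu', 'Rare Condition X', 'Flu'],
})

# Suppress only values that occur fewer than min_count times (assumed threshold)
min_count = 2
occurrences = df['Diagnosis'].map(df['Diagnosis'].value_counts())
df['Diagnosis'] = df['Diagnosis'].mask(occurrences < min_count, '*')
print(df)
```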

## 5. Noise Addition

**Definition:** Noise addition introduces random noise to the data to mask the original values.

**Example:**

In [5]:
import numpy as np

# Sample data
data = {'Student ID': [1, 2, 3], 'Grade': [85, 90, 95]}
df = pd.DataFrame(data)

# Adding noise to grades
noise = np.random.normal(0, 2, df.shape[0])  # Mean = 0, SD = 2
df['Noisy Grade'] = df['Grade'] + noise
print(df)


   Student ID  Grade  Noisy Grade
0           1     85    86.463928
1           2     90    92.947898
2           3     95    92.185015


**Advantages:**

- Preserves data structure while adding privacy
- Can be adjusted to balance privacy and data utility

**Disadvantages:**

- Too much noise can distort the data and reduce its utility
- May not be suitable for all types of data

**Level of Security:** `MODERATE TO HIGH`

- Effective if noise is appropriately calibrated but can reduce data utility if overused
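
The trade-off between privacy and utility can be made visible by varying the standard deviation of the noise. The sketch below uses simulated grades and three arbitrary SD values (both are assumptions) and reports how far individual records and the overall mean move.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # seeded so the sketch is repeatable
grades = rng.integers(60, 101, size=1000).astype(float)  # simulated grades

rmse_by_sd = {}
for sd in (1, 5, 20):
    noisy = grades + rng.normal(0, sd, size=grades.size)
    # Per-record distortion grows with the noise level
    rmse_by_sd[sd] = float(np.sqrt(np.mean((noisy - grades) ** 2)))
    # The mean moves far less, because zero-mean noise largely averages out
    mean_shift = abs(noisy.mean() - grades.mean())
    print(f"SD={sd:>2}: per-record RMSE={rmse_by_sd[sd]:.2f}, shift in mean={mean_shift:.2f}")
```

Aggregate statistics tolerate much more noise than individual values do, which is why noise addition suits analyses that rely on averages or totals.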

## 6. Data Swapping

**Definition:** Data swapping involves exchanging values between records to disrupt direct relationships.

**Example:**

In [6]:
# Sample data
data = {'Customer ID': [1, 2, 3], 'Postal Code': ['12345', '67890', '54321']}
df = pd.DataFrame(data)

# Swapping postal codes
df['Swapped Postal Code'] = df['Postal Code'].sample(frac=1).reset_index(drop=True)
print(df)


   Customer ID Postal Code Swapped Postal Code
0            1       12345               67890
1            2       67890               12345
2            3       54321               54321


**Advantages:**

- Maintains overall data distribution
- Disrupts direct linkage between attributes

**Disadvantages:**

- Swapping can sometimes be detected and reversed
- Not suitable for small datasets

**Level of Security:** `MODERATE`

- Effective for disrupting direct linkages but can be vulnerable to reverse engineering
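
In the output above, customer 3 keeps its own postal code, because a plain random shuffle can leave values in place. If every record must receive a value from a different record, one option is to pass values around a random cycle. This is a sketch on assumed sample data, and the `donor` array is a name introduced here.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Customer ID': [1, 2, 3, 4],
                   'Postal Code': ['12345', '67890', '54321', '24680']})

rng = np.random.default_rng(seed=0)
order = rng.permutation(len(df))       # random ordering of row positions
donor = np.empty(len(df), dtype=int)
donor[order] = np.roll(order, -1)      # each row takes the value of the next row in that ordering

df['Swapped Postal Code'] = df['Postal Code'].to_numpy()[donor]
print(df)
```

Because the rows form a single cycle, no row is its own donor as long as there are at least two rows. Rows that happen to share a value with their donor would still look unchanged.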

## 7. Aggregation

**Definition:** Aggregation summarizes data to a level where individual data points are no longer distinguishable.

**Example:**

In [7]:
# Sample data
data = {'Customer Type': ['Retail', 'Corporate', 'Retail', 'Corporate'], 'Transaction Amount': [100, 200, 150, 250]}
df = pd.DataFrame(data)

# Aggregating transaction amounts by customer type
df_aggregated = df.groupby('Customer Type', as_index=False)['Transaction Amount'].sum()
print(df_aggregated)


  Customer Type  Transaction Amount
0     Corporate                 450
1        Retail                 250


**Advantages:**

- Protects individual data points while providing useful insights
- Reduces risk of re-identification

**Disadvantages:**

- Loss of individual-level detail
- Not suitable for all types of analysis

**Level of Security:** `HIGH`

- Effective for protecting individual data points but reduces data granularity
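
An aggregate computed over a single record is just that record's value, so aggregation only protects individuals when groups are large enough. A common safeguard is to publish only groups above a minimum size. The data and the threshold `min_group_size` below are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Customer Type': ['Retail', 'Retail', 'Retail', 'Corporate', 'Government'],
    'Transaction Amount': [100, 150, 120, 250, 900],
})

min_group_size = 3  # assumed threshold

# Keep the group size next to the total so small groups can be filtered out
summary = df.groupby('Customer Type', as_index=False).agg(
    Count=('Transaction Amount', 'size'),
    Total=('Transaction Amount', 'sum'),
)
publishable = summary[summary['Count'] >= min_group_size]
print(publishable)
```

Here only the `Retail` group survives. The `Corporate` and `Government` totals would each have exposed one customer's transaction.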

## 8. Tokenization

**Definition:** Tokenization replaces sensitive data elements with non-sensitive equivalents (tokens).

**Example:**

In [8]:
import hashlib
import pandas as pd

# Sample data
data = {'User ID': [1, 2, 3], 'Email': ['alice@example.com', 'bob@example.com', 'charlie@example.com']}
df = pd.DataFrame(data)

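# Deriving tokens from email addresses with an unsalted SHA-256 hash (kept simple for illustration).
# A hash cannot be mapped back without a lookup table, and hashes of guessable inputs can be matched by re-hashing candidates.
df['Tokenized Email'] = df['Email'].apply(lambda x: hashlib.sha256(x.encode()).hexdigest())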
print(df)


   User ID                Email  \
0        1    alice@example.com   
1        2      bob@example.com   
2        3  charlie@example.com   

                                     Tokenized Email  
0  ff8d9819fc0e12bf0d24892e45987e249a28dce836a85c...  
1  5ff860bf1190596c7188ab851db691f0f3169c453936e9...  
2  add7232b65bb559f896cbcfa9a600170a7ca381a036678...  


**Advantages:**

- Can be used to securely store and manage sensitive data
- Tokens can be mapped back to original values if necessary

**Disadvantages:**

- Requires secure management of token mappings
- Potential for token re-identification if mappings are exposed

**Level of Security:** `HIGH`

- Provides strong security but depends on the secure management of tokens
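
The cell above derives tokens by hashing, which gives no way to map a token back to its value. Tokenization in the sense described under Advantages usually relies on random tokens plus a stored mapping, often called a vault. The class below is a minimal in-memory sketch. `TokenVault` and its methods are hypothetical names, and a real deployment would keep the mapping in secured storage.

```python
import secrets


class TokenVault:
    """Hypothetical in-memory token vault, for illustration only."""

    def __init__(self):
        self._token_by_value = {}
        self._value_by_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse the existing token so a value always maps to the same token
        if value in self._token_by_value:
            return self._token_by_value[value]
        token = secrets.token_hex(16)  # random, so it reveals nothing about the value
        self._token_by_value[value] = token
        self._value_by_token[token] = value
        return token

    def detokenize(self, token: str) -> str:
        # Only possible with access to the vault's mapping
        return self._value_by_token[token]


vault = TokenVault()
emails = ['alice@example.com', 'bob@example.com', 'alice@example.com']
tokens = [vault.tokenize(email) for email in emails]
print(tokens)
print(vault.detokenize(tokens[1]))  # bob@example.com
```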

## 9. Synthetic Data Generation

**Definition:** Synthetic data generation creates artificial data that has similar statistical properties to the original data but does not relate to real individuals.

**Example:**

In [9]:
from sklearn.datasets import make_classification
import pandas as pd

# Generating synthetic health data
X, y = make_classification(n_samples=100, n_features=5, random_state=42)
df_synthetic = pd.DataFrame(X, columns=[f'Health_Metric_{i}' for i in range(1, 6)])
df_synthetic['Condition'] = y
print(df_synthetic.head())


   Health_Metric_1  Health_Metric_2  Health_Metric_3  Health_Metric_4  \
0        -0.430668         0.672873        -0.724280        -0.539630   
1         0.211646        -0.843897         0.534794         0.825848   
2         1.092675         0.409106         1.100096        -0.942751   
3         1.519901        -0.773361         1.998053         0.155132   
4        -0.453901        -2.183473         0.244724         2.591239   

   Health_Metric_5  Condition  
0        -0.651600          0  
1         0.681953          1  
2        -0.981509          0  
3        -0.385314          0  
4        -0.484234          1  


**Advantages:**

- Low risk of re-identification, since synthetic records do not correspond to real individuals
- Useful for testing and training ML models

**Disadvantages:**

- May not perfectly represent the original data
- Can be challenging to generate realistic synthetic data

**Level of Security:** `VERY HIGH`

- Provides the highest level of privacy protection but depends on the quality of synthetic data generation
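
The cell above generates data from scratch and does not model any original dataset. To illustrate the "similar statistical properties" part of the definition, the sketch below fits a mean vector and covariance matrix to a stand-in dataset and samples new rows from a multivariate normal. The column names and the simulated "original" values are assumptions, and a Gaussian fit is only one simple modelling choice.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)

# Stand-in for a real dataset (simulated here for illustration)
original = pd.DataFrame({'Height_cm': rng.normal(170, 10, 500)})
original['Weight_kg'] = 0.9 * original['Height_cm'] - 85 + rng.normal(0, 5, 500)

# Fit simple summary statistics, then sample new rows from them
mean = original.mean().to_numpy()
cov = original.cov().to_numpy()
synthetic = pd.DataFrame(
    rng.multivariate_normal(mean, cov, size=len(original)),
    columns=original.columns,
)

# The correlation structure should be close in both tables
print(original.corr().round(2))
print(synthetic.corr().round(2))
```

No synthetic row copies an original row, yet the relationship between the two columns is preserved. A model fitted too closely to a small dataset could still leak information about it.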

## 10. Differential Privacy

**Definition:** Differential privacy adds random noise to data or query results to ensure that the output does not reveal much about any single individual's data.

Differential privacy resembles noise addition in that both add noise to protect privacy. The key difference is how the noise is calibrated: differential privacy scales the noise to the query's sensitivity and to a privacy budget ($\epsilon$), which yields a formal privacy guarantee. A smaller $\epsilon$ means more noise and stronger privacy. The guarantee is that including or excluding any single individual's record changes the distribution of outputs only slightly, which limits what can be inferred about that individual.
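
For reference, the formal guarantee is usually stated as follows. The symbols $M$, $D$, $D'$, $S$, $f$ and $\Delta f$ are notation introduced here and do not appear elsewhere in this notebook. A randomized mechanism $M$ is $\epsilon$-differentially private if, for every pair of datasets $D$ and $D'$ that differ in one individual's record and for every set of outputs $S$,

$$
\Pr[M(D) \in S] \le e^{\epsilon} \, \Pr[M(D') \in S].
$$

The Laplace mechanism meets this bound for a numeric query $f$ by releasing

$$
M(D) = f(D) + \mathrm{Lap}\!\left(\frac{\Delta f}{\epsilon}\right), \qquad \Delta f = \max_{D, D'} \lvert f(D) - f(D') \rvert,
$$

where $\Delta f$ is the sensitivity of $f$. The noise scale $\Delta f / \epsilon$ corresponds to `sensitivity / epsilon` in the code.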


**Example:**

In [10]:
import numpy as np
import pandas as pd

# Sample data
data = {'User ID': [1, 2, 3, 4, 5], 'Location': [5, 15, 25, 35, 45]}
df = pd.DataFrame(data)

# Adding Laplace noise for differential privacy
epsilon = 0.1      # privacy budget: smaller values mean more noise
sensitivity = 1.0  # assumed sensitivity of the released value
df['Noisy Location'] = df['Location'] + np.random.laplace(0, sensitivity / epsilon, df.shape[0])
print(df)


   User ID  Location  Noisy Location
0        1         5        3.738993
1        2        15       33.893026
2        3        25      -30.191876
3        4        35       38.416218
4        5        45       25.750632


**Advantages:**

- Provides strong privacy guarantees
- Allows useful data analysis while protecting individual privacy

**Disadvantages:**

- Complexity in implementation
- May reduce data accuracy if not properly calibrated

**Level of Security:** `VERY HIGH`

- Provides strong privacy guarantees but requires careful implementation
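
The example in this section perturbs each record directly. Differential privacy is also commonly applied to query results, where the sensitivity is easier to justify. The sketch below releases a noisy count, for which adding or removing one person changes the result by at most 1. `dp_count` is a hypothetical helper, and the ages are made-up values.

```python
import numpy as np

rng = np.random.default_rng(seed=0)


def dp_count(values, predicate, epsilon, rng):
    """Release a count with Laplace noise (hypothetical helper for illustration)."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # one person changes a count by at most 1
    return true_count + rng.laplace(0.0, sensitivity / epsilon)


ages = [23, 35, 41, 52, 29, 64, 38, 47]  # illustrative data; 4 values are >= 40
for epsilon in (0.1, 1.0, 10.0):
    noisy = dp_count(ages, lambda a: a >= 40, epsilon, rng)
    print(f"epsilon={epsilon}: noisy count = {noisy:.2f} (true count = 4)")
```

Larger values of `epsilon` give answers closer to the true count and weaker privacy. Each released answer also spends part of the privacy budget, so repeated queries on the same data need to be accounted for.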