# 01. Data Protection Strategies

## 📚 Prerequisites (What You Need First)

**BEFORE starting this notebook**, you should have completed:
- ✅ **Unit 1: Ethics Foundations** - Understanding ethical principles
- ✅ **Unit 2: Bias and Justice** - Understanding fairness in AI
- ✅ **Basic Python knowledge**: Functions, data manipulation, dictionaries
- ✅ **Understanding of data privacy**: Basic concepts of privacy and security

**If you haven't completed these**, you might struggle with:
- Understanding why data protection matters
- Knowing which protection technique to use
- Understanding encryption and anonymization concepts

---

## 🔗 Where This Notebook Fits

**This is the FIRST example in Unit 3** - it teaches you how to protect data!

**Why this example FIRST?**
- **Before** you can use privacy technologies, you need to understand basic protection
- **Before** you can implement differential privacy, you need data protection basics
- **Before** you can ensure GDPR compliance, you need protection strategies

**Builds on**:
- 📓 Unit 1: Ethics Foundations (ethical principles)
- 📓 Unit 2: Bias and Justice (fairness concepts)

**Leads to**:
- 📓 Example 2: Privacy Technologies (advanced privacy techniques)
- 📓 Example 3: Differential Privacy (mathematical privacy guarantees)
- 📓 Example 4: GDPR Compliance (regulatory compliance)
- 📓 Example 5: Secure Development (secure coding practices)

**Why this order?**
1. Data protection provides **foundation** (needed before advanced techniques)
2. Data protection teaches **basic techniques** (encryption, anonymization)
3. Data protection shows **practical approaches** (real-world strategies)

---

## The Story: Protecting What Matters

Imagine you're a bank responsible for customers' money. **Before** you can offer services, you need vaults, locks, and security systems. **After** those protections are in place, customer assets are much harder to steal.

The same applies to data. **Before** protection, sensitive data sits in a form that anyone with access can read or misuse. In this notebook we learn three techniques: encryption, anonymization, and pseudonymization. **After** applying them, the data is much harder to misuse, although no single technique makes misuse impossible.

---

## Why Data Protection Matters

Data protection is essential for ethical AI:
- **Privacy**: Protect individuals' personal information
- **Security**: Prevent unauthorized access to data
- **Compliance**: Meet legal and regulatory requirements
- **Trust**: Build user confidence in your systems
- **Ethics**: Respect individuals' right to privacy

## Learning Objectives
1. Understand encryption techniques for data protection
2. Learn secure data storage practices
3. Understand anonymization and pseudonymization
4. Compare different data protection strategies
5. Apply protection techniques to real data
6. Understand when to use each technique

In [1]:
# Step 1: Import necessary libraries
# These libraries help us implement data protection strategies

import numpy as np  # For numerical operations: Arrays, calculations, random number generation
import pandas as pd  # For data manipulation: DataFrames, data analysis
import matplotlib.pyplot as plt  # For creating visualizations: Charts, graphs, comparisons
import seaborn as sns  # For statistical visualizations: Heatmaps, advanced plots
from cryptography.fernet import Fernet  # For encryption: Symmetric encryption (Fernet)
import hashlib  # For hashing: Create hash-based pseudonyms
import warnings  # For suppressing warnings: Clean output
import os  # For file operations: Saving images

# Suppress warnings: Clean output
warnings.filterwarnings('ignore')  # Ignore warnings: Suppress non-critical warnings

# Configure plotting: Set default styles for better visualizations
plt.rcParams['font.size'] = 10  # Font size: Make text readable (10pt is good for most displays)
plt.rcParams['figure.figsize'] = (14, 8)  # Figure size: 14 inches wide, 8 inches tall (good for detailed charts)
sns.set_style("whitegrid")  # Style: White background with grid for clean look

print("✅ Libraries imported successfully!")
print("\n📚 What each library does:")
print("   - numpy/pandas: Data manipulation and numerical operations")
print("   - matplotlib/seaborn: Create visualizations (charts, heatmaps)")
print("   - cryptography.fernet: Encryption (symmetric encryption)")
print("   - hashlib: Hashing (create pseudonyms)")
print("   - os: File operations (saving images)")


✅ Libraries imported successfully!

📚 What each library does:
   - numpy/pandas: Data manipulation and numerical operations
   - matplotlib/seaborn: Create visualizations (charts, heatmaps)
   - cryptography.fernet: Encryption (symmetric encryption)
   - hashlib: Hashing (create pseudonyms)
   - os: File operations (saving images)


## Part 2: Encryption Techniques

### 📚 Prerequisites (What You Need First)
- ✅ **Library imports** (from Part 1) - Understanding encryption tools
- ✅ **Understanding of security** - Basic concepts of data security

### 🔗 Relationship: What This Builds On
This is the first data protection technique: encryption.
- Builds on: Understanding of data security, cryptography basics
- Shows: How to encrypt data to prevent unauthorized access

### 📖 The Story
**Before encryption**: We have sensitive data that anyone can read.
**After encryption**: We have encrypted data that only parties holding the key can decrypt!

---

## Step 2: Encryption Techniques

**BEFORE**: We have sensitive data in plain text that anyone can read.

**AFTER**: We'll encrypt data so only authorized parties with the key can decrypt it!

**Why encryption?** It protects data at rest and in transit. Fernet, the scheme used in this notebook, is authenticated symmetric encryption (AES in CBC mode plus an HMAC-SHA256 tag), so it provides:
- **Confidentiality**: Only parties holding the key can read the data
- **Integrity**: Decryption fails if the ciphertext has been tampered with
- **Authentication**: A valid token can only have been produced by someone holding the key
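
The integrity property can be checked directly. The following is a minimal sketch, assuming the `cryptography` package is installed; the sample plaintext and the helper name `try_decrypt` are made up for illustration. It flips one bit inside a token and also tries a different key, and in both cases Fernet is expected to raise `InvalidToken` instead of returning corrupted data.

```python
import base64
from cryptography.fernet import Fernet, InvalidToken

key = Fernet.generate_key()
cipher = Fernet(key)
token = cipher.encrypt(b"account=12345;balance=900")  # hypothetical sample data

# Simulate tampering: decode the token, flip one bit in the ciphertext
# region (bytes 25 onward), and re-encode it.
raw = bytearray(base64.urlsafe_b64decode(token))
raw[30] ^= 0x01
tampered = base64.urlsafe_b64encode(bytes(raw))

def try_decrypt(cipher, token):
    """Return the plaintext, or None if the token fails verification."""
    try:
        return cipher.decrypt(token).decode()
    except InvalidToken:
        return None

results = {
    "untouched": try_decrypt(cipher, token),
    "tampered": try_decrypt(cipher, tampered),
    "wrong_key": try_decrypt(Fernet(Fernet.generate_key()), token),
}
print(results)
```

Returning `None` here is only a convenience for the demonstration; real code would usually let the exception propagate or log the failure.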


In [3]:
# Step 2: Implement encryption techniques
# This shows how to encrypt and decrypt data using symmetric encryption

# BEFORE: We have sensitive data in plain text
# AFTER: We'll encrypt data so only authorized parties can read it

print("\n" + "="*80)
print("🔐 ENCRYPTION TECHNIQUES")
print("="*80)
print("\nWe'll implement:")
print("  1. Generate encryption key")
print("  2. Encrypt sensitive data")
print("  3. Decrypt data with the key\n")

def generate_encryption_key():
    """
    Generate a symmetric encryption key.
    
    HOW IT WORKS:
    1. Uses Fernet (symmetric encryption) to generate a key
    2. Key is a URL-safe base64-encoded 32-byte key
    3. Same key is used for encryption and decryption
    
    ⏰ WHEN to use: Before encrypting data - need a key first
    💡 WHY use: Symmetric encryption is fast and secure for data protection
    """
    return Fernet.generate_key()  # Generate key: Create new encryption key

def encrypt_data(data, key):
    """
    Encrypt data using Fernet symmetric encryption.
    
    HOW IT WORKS:
    1. Creates Fernet cipher with the key
    2. Converts data to bytes (if string)
    3. Encrypts the data using the cipher
    4. Returns encrypted bytes
    
    ⏰ WHEN to use: To protect sensitive data from unauthorized access
    💡 WHY use: Encryption ensures only parties holding the key can read data
    """
    f = Fernet(key)  # Create cipher: Initialize Fernet with encryption key
    if isinstance(data, bytes):  # Check type: Bytes can be encrypted as-is
        payload = data
    else:  # Otherwise: Convert to string (no-op for str), then to UTF-8 bytes
        payload = str(data).encode()
    return f.encrypt(payload)  # Return: Encrypted data as bytes (a Fernet token)

def decrypt_data(encrypted_data, key):
    """
    Decrypt data using Fernet symmetric encryption.
    
    HOW IT WORKS:
    1. Creates Fernet cipher with the key
    2. Decrypts the encrypted data
    3. Converts bytes back to string
    4. Returns decrypted string
    
    ⏰ WHEN to use: To read encrypted data (need the same key used for encryption)
    💡 WHY use: Allows authorized parties to access protected data
    """
    f = Fernet(key)  # Create cipher: Initialize Fernet with decryption key
    decrypted = f.decrypt(encrypted_data)  # Decrypt: Decrypt the encrypted data
    return decrypted.decode()  # Return: Decrypted data as string

# Demonstrate encryption
print("Demonstrating encryption...")
key = generate_encryption_key()  # Generate key: Create encryption key
original_data = "Sensitive information: SSN 123-45-6789"  # Original: Sensitive data to protect
encrypted = encrypt_data(original_data, key)  # Encrypt: Protect the data
decrypted = decrypt_data(encrypted, key)  # Decrypt: Recover the data

print(f"Original: {original_data}")  # Print: Show original data
print(f"Encrypted: {encrypted}")  # Print: Show encrypted data (bytes)
print(f"Decrypted: {decrypted}")  # Print: Show decrypted data (should match original)
print("✅ Encryption demonstration complete!")



🔐 ENCRYPTION TECHNIQUES

We'll implement:
  1. Generate encryption key
  2. Encrypt sensitive data
  3. Decrypt data with the key

Demonstrating encryption...
Original: Sensitive information: SSN 123-45-6789
Encrypted: b'gAAAAABpLJlaZXfKQ-5vG5JhekEsdcbMKlPQtsi1dSYro35fq-s-BEkE9UDWoogEaQVqWCTQTA14QcMARYmepSP_h6WXR1g5nP3LDH_j8wVhOD-4KeYZLpfQX0jJGbsQRMafRNeB0XrF'
Decrypted: Sensitive information: SSN 123-45-6789
✅ Encryption demonstration complete!
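
The `Encrypted:` value in this output is a Fernet token. According to the published Fernet specification, a token is the URL-safe base64 encoding of one version byte (`0x80`), an 8-byte big-endian timestamp, a 16-byte IV, the AES-CBC ciphertext, and a 32-byte HMAC-SHA256 tag. The sketch below uses only the standard library to split the token printed above into those fields; the helper name `split_fernet_token` is hypothetical and not part of any library. Note that the timestamp is stored unencrypted, so a token reveals when it was created.

```python
import base64
from datetime import datetime, timezone

# Token copied from the notebook output above.
TOKEN = (
    b"gAAAAABpLJlaZXfKQ-5vG5JhekEsdcbMKlPQtsi1dSYro35fq-s-BEkE9UDWoogEaQVq"
    b"WCTQTA14QcMARYmepSP_h6WXR1g5nP3LDH_j8wVhOD-4KeYZLpfQX0jJGbsQRMafRNeB0XrF"
)

def split_fernet_token(token):
    """Split a Fernet token into its fields (no decryption, no verification)."""
    raw = base64.urlsafe_b64decode(token)
    return {
        "version": raw[0],                               # expected to be 0x80
        "timestamp": int.from_bytes(raw[1:9], "big"),    # seconds since epoch
        "iv": raw[9:25],                                 # 16-byte AES-CBC IV
        "ciphertext": raw[25:-32],                       # padded to 16-byte blocks
        "hmac": raw[-32:],                               # HMAC-SHA256 tag
    }

fields = split_fernet_token(TOKEN)
created = datetime.fromtimestamp(fields["timestamp"], tz=timezone.utc)

print("version:", hex(fields["version"]))
print("created (UTC):", created.isoformat())
print("iv / ciphertext / hmac lengths:",
      len(fields["iv"]), len(fields["ciphertext"]), len(fields["hmac"]))
```

The 38-character plaintext is padded to 48 bytes of ciphertext, which is why the token is noticeably longer than the original string.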


In [4]:
# Part 3: Anonymization and Pseudonymization Techniques
# This shows how to anonymize and pseudonymize data

# BEFORE: We have identifying information in our data
# AFTER: We'll remove or mask identifying information

print("\n" + "="*80)
print("🔒 ANONYMIZATION AND PSEUDONYMIZATION TECHNIQUES")
print("="*80)
print("\nWe'll implement:")
print("  1. Anonymization: Remove identifying information")
print("  2. Pseudonymization: Replace with hashed values\n")

def anonymize_data(df, columns_to_anonymize):
    """
    Anonymize data by removing or masking identifying information.
    
    HOW IT WORKS:
    1. Creates a copy of the DataFrame
    2. Replaces specified columns with generic identifiers (ID_0, ID_1, etc.)
    3. Returns anonymized DataFrame
    
    ⏰ WHEN to use: When you need to remove direct identifiers permanently
    💡 WHY use: Removing direct identifiers makes re-identification harder; the
       remaining columns (e.g. age, salary) can still act as quasi-identifiers
    """
    df_anonymized = df.copy()  # Copy: Create copy to avoid modifying original
    for col in columns_to_anonymize:  # Loop through columns: Process each column to anonymize
        if col in df_anonymized.columns:  # Check: Ensure column exists
            # Replace with generic identifiers: ID_0, ID_1, ID_2, etc.
            df_anonymized[col] = [f'ID_{i}' for i in range(len(df_anonymized))]  # Anonymize: Replace with generic IDs
    return df_anonymized  # Return: Anonymized DataFrame

def pseudonymize_data(df, columns_to_pseudonymize, salt='default_salt'):
    """
    Pseudonymize data by replacing with hashed values.
    
    HOW IT WORKS:
    1. Creates a copy of the DataFrame
    2. Hashes each value in specified columns using SHA-256
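    3. Appends a salt before hashing; this only helps if the salt is kept secret
       (the shared default salt offers little protection for guessable values)
    4. Returns pseudonymized DataFrame
    
    ⏰ WHEN to use: When records must stay linkable (same input -> same pseudonym)
       without exposing the original identifier
    💡 WHY use: Pseudonymization allows linking while protecting identity. The hash
       is one-way: mapping back requires a separately stored lookup table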
    """
    df_pseudonymized = df.copy()  # Copy: Create copy to avoid modifying original
    for col in columns_to_pseudonymize:  # Loop through columns: Process each column to pseudonymize
        if col in df_pseudonymized.columns:  # Check: Ensure column exists
            # Create hash-based pseudonyms: Hash each value with salt
            df_pseudonymized[col] = df_pseudonymized[col].apply(
                lambda x: hashlib.sha256((str(x) + salt).encode()).hexdigest()[:16]  # Hash: SHA-256 hash, first 16 chars
            )
    return df_pseudonymized  # Return: Pseudonymized DataFrame
# ============================================================================
# DATA PROTECTION COMPARISON
# ============================================================================
def demonstrate_data_protection():
    """
    Demonstrate different data protection techniques
    """
    # Create sample sensitive data
    np.random.seed(42)
    n_samples = 100
    data = {
        'name': [f'Person_{i}' for i in range(n_samples)],
        'email': [f'user{i}@example.com' for i in range(n_samples)],
        'ssn': [f'{np.random.randint(100,999)}-{np.random.randint(10,99)}-{np.random.randint(1000,9999)}' 
                for _ in range(n_samples)],
        'salary': np.random.normal(50000, 15000, n_samples),
        'age': np.random.randint(25, 65, n_samples)
    }
    df = pd.DataFrame(data)
    print("="*80)
    print("ORIGINAL DATA (First 5 rows):")
    print("="*80)
    print(df.head())
    # 1. Anonymization
    print("\n" + "="*80)
    print("1. ANONYMIZATION")
    print("="*80)
    df_anonymized = anonymize_data(df, ['name', 'email', 'ssn'])
    print("\nAnonymized Data (First 5 rows):")
    print(df_anonymized.head())
    # 2. Pseudonymization
    print("\n" + "="*80)
    print("2. PSEUDONYMIZATION")
    print("="*80)
    df_pseudonymized = pseudonymize_data(df, ['name', 'email', 'ssn'])
    print("\nPseudonymized Data (First 5 rows):")
    print(df_pseudonymized.head())
    # 3. Encryption
    print("\n" + "="*80)
    print("3. ENCRYPTION")
    print("="*80)
    key = generate_encryption_key()
    # Demo only: never print or log key material in real code
    print(f"Generated encryption key: {key[:20]}...")
    # Encrypt sensitive column
    sample_email = df['email'].iloc[0]
    encrypted_email = encrypt_data(sample_email, key)
    decrypted_email = decrypt_data(encrypted_email, key)
    print(f"\nOriginal email: {sample_email}")
    print(f"Encrypted: {encrypted_email[:50]}...")
    print(f"Decrypted: {decrypted_email}")
    return df, df_anonymized, df_pseudonymized
# ============================================================================
# VISUALIZATIONS
# ============================================================================
def plot_data_protection_comparison(df, df_anonymized, df_pseudonymized):
    """
    Visualize data protection techniques comparison
    """
    fig, axes = plt.subplots(2, 2, figsize=(16, 12))
    # Original data distribution
    axes[0, 0].hist(df['salary'], bins=20, color='#e74c3c', alpha=0.7, edgecolor='black')
    axes[0, 0].set_title('Original Data: Salary Distribution', fontsize=12, fontweight='bold')
    axes[0, 0].set_xlabel('Salary')
    axes[0, 0].set_ylabel('Frequency')
    axes[0, 0].grid(alpha=0.3)
    # Anonymized data (same distribution, different identifiers)
    axes[0, 1].hist(df_anonymized['salary'], bins=20, color='#3498db', alpha=0.7, edgecolor='black')
    axes[0, 1].set_title('Anonymized Data: Salary Distribution', fontsize=12, fontweight='bold')
    axes[0, 1].set_xlabel('Salary')
    axes[0, 1].set_ylabel('Frequency')
    axes[0, 1].grid(alpha=0.3)
    # Pseudonymized data
    axes[1, 0].hist(df_pseudonymized['salary'], bins=20, color='#2ecc71', alpha=0.7, edgecolor='black')
    axes[1, 0].set_title('Pseudonymized Data: Salary Distribution', fontsize=12, fontweight='bold')
    axes[1, 0].set_xlabel('Salary')
    axes[1, 0].set_ylabel('Frequency')
    axes[1, 0].grid(alpha=0.3)
    # Protection techniques comparison
    # NOTE: These scores are illustrative, subjective ratings, not measurements
    techniques = ['Original', 'Anonymized', 'Pseudonymized', 'Encrypted']
    privacy_level = [1, 7, 8, 10]  # Privacy level (1-10), illustrative
    utility_level = [10, 9, 8, 7]  # Data utility (1-10), illustrative
    x = np.arange(len(techniques))
    width = 0.35
    axes[1, 1].bar(x - width/2, privacy_level, width, label='Privacy Level', alpha=0.8, color='#9b59b6')
    axes[1, 1].bar(x + width/2, utility_level, width, label='Data Utility', alpha=0.8, color='#f39c12')
    axes[1, 1].set_xlabel('Protection Technique', fontsize=11, fontweight='bold')
    axes[1, 1].set_ylabel('Score (1-10)', fontsize=11, fontweight='bold')
    axes[1, 1].set_title('Privacy vs Utility Trade-off', fontsize=12, fontweight='bold')
    axes[1, 1].set_xticks(x)
    axes[1, 1].set_xticklabels(techniques, rotation=15)
    axes[1, 1].legend()
    axes[1, 1].grid(axis='y', alpha=0.3)
    axes[1, 1].set_ylim([0, 11])
    plt.tight_layout()
    output_path = 'data_protection_comparison.png'  # Output file: Same name as in the message below
    plt.savefig(output_path, dpi=300, bbox_inches='tight')
    print(f"\n✅ Saved: {output_path}")
    plt.close()
# ============================================================================
# MAIN EXECUTION
# ============================================================================
if __name__ == "__main__":
    print("="*80)
    print("Unit 3 - Example 1: Data Protection Strategies")
    print("="*80)
    # Demonstrate data protection techniques
    df, df_anonymized, df_pseudonymized = demonstrate_data_protection()
    # Create visualizations
    print("\n" + "="*80)
    print("Creating Visualizations...")
    print("="*80)
    plot_data_protection_comparison(df, df_anonymized, df_pseudonymized)
    # Summary
    print("\n" + "="*80)
    print("SUMMARY")
    print("="*80)
    print("\nKey Takeaways:")
    print("1. Anonymization removes direct identifiers; quasi-identifiers may remain")
    print("2. Pseudonymization replaces identifiers with consistent, one-way hashes")
    print("3. Encryption protects data at rest and in transit")
    print("4. Each technique has trade-offs between privacy and utility")
    print("5. Choose protection technique based on use case requirements")
    print("="*80 + "\n")
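
`pseudonymize_data` above hashes `str(x) + salt` with SHA-256 and a default salt. Because that salt is visible in the code, anyone who can guess candidate values can recompute the pseudonyms and match them. The sketch below first reproduces that weakness on the notebook's email pattern, then shows one common alternative: a keyed hash (HMAC-SHA256) with a secret key. It uses only the standard library; the helper names `salted_sha256` and `keyed_pseudonym` and the in-memory key are assumptions for illustration, and in practice the key would come from a secrets manager.

```python
import hashlib
import hmac
import secrets

def salted_sha256(value, salt="default_salt"):
    """Same construction as pseudonymize_data: SHA-256 of value + salt, 16 hex chars."""
    return hashlib.sha256((str(value) + salt).encode()).hexdigest()[:16]

def keyed_pseudonym(value, secret_key, length=16):
    """Keyed alternative: HMAC-SHA256 of the value under a secret key."""
    digest = hmac.new(secret_key, str(value).encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:length]

# 1. Dictionary attack on the salted hash: the attacker knows the salt and
#    the email pattern, so hashing every candidate recovers the identity.
target = salted_sha256("user42@example.com")
candidates = [f"user{i}@example.com" for i in range(100)]
recovered = [c for c in candidates if salted_sha256(c) == target]
print("Recovered from salted hash:", recovered)

# 2. Keyed pseudonyms: stable under one key (records stay linkable),
#    but not reproducible without that key.
secret_key = secrets.token_bytes(32)   # assumption: kept outside the dataset
other_key = secrets.token_bytes(32)    # stands in for an attacker's guess

same_key_a = keyed_pseudonym("user42@example.com", secret_key)
same_key_b = keyed_pseudonym("user42@example.com", secret_key)
different_key = keyed_pseudonym("user42@example.com", other_key)

print("Linkable under the same key:", same_key_a == same_key_b)
print("Matches under a different key:", same_key_a == different_key)
```

Neither variant is reversible in the decryption sense. If the original values must be recoverable, keep a separate, access-controlled mapping table or use encryption instead.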



🔒 ANONYMIZATION AND PSEUDONYMIZATION TECHNIQUES

We'll implement:
  1. Anonymization: Remove identifying information
  2. Pseudonymization: Replace with hashed values

Unit 3 - Example 1: Data Protection Strategies
ORIGINAL DATA (First 5 rows):
       name              email          ssn        salary  age
0  Person_0  user0@example.com  202-61-1860  84100.177370   36
1  Person_1  user1@example.com  370-81-6734  52204.619757   27
2  Person_2  user2@example.com  221-92-5426  58094.138024   25
3  Person_3  user3@example.com  558-97-9322  27525.565439   57
4  Person_4  user4@example.com  761-62-1769  49828.348288   64

1. ANONYMIZATION

Anonymized Data (First 5 rows):
   name email   ssn        salary  age
0  ID_0  ID_0  ID_0  84100.177370   36
1  ID_1  ID_1  ID_1  52204.619757   27
2  ID_2  ID_2  ID_2  58094.138024   25
3  ID_3  ID_3  ID_3  27525.565439   57
4  ID_4  ID_4  ID_4  49828.348288   64

2. PSEUDONYMIZATION

Pseudonymized Data (First 5 rows):
               name            