# 🔒 Sensitive Data Demo: Pre-Commit Hooks in Action

This notebook demonstrates how **pre-commit hooks** protect against accidentally committing sensitive data.

## 🎯 **What We'll Demo:**
1. **`nbstripout`** - Automatically removes notebook outputs and execution metadata
2. **`detect-secrets`** - Scans for API keys, passwords, and other sensitive patterns
3. **Real-world scenarios** - Common mistakes developers make

## ⚠️ **IMPORTANT SAFETY NOTES:**
- All sensitive data in this notebook is **FAKE** for demonstration purposes
- **NEVER** commit real credentials to version control
- This demo shows what **NOT** to do in real projects

---

## 📋 **Pre-Commit Hooks Overview**

Pre-commit hooks run automatically before each commit to:
- ✅ **Prevent security leaks** - Block sensitive data from reaching the repository
- ✅ **Clean notebooks** - Remove execution outputs that could contain sensitive info
- ✅ **Maintain consistency** - Ensure code quality and formatting standards
- ✅ **Catch mistakes early** - Prevent problems before they reach the remote repository
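
To make the blocking behavior concrete, here is a minimal sketch of a hook-style check. It assumes the usual hook contract: the hook receives file paths and signals failure with a non-zero exit status, which aborts the commit. The function name `check_files` and the `FORBIDDEN` marker are hypothetical and not part of any real hook.

```python
import tempfile
from pathlib import Path

# Hypothetical marker used only for this illustration.
FORBIDDEN = "password="

def check_files(paths):
    """Return 0 if every file is clean, 1 if any file contains the marker."""
    failed = False
    for path in paths:
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        if FORBIDDEN in text:
            print(f"{path}: contains forbidden marker {FORBIDDEN!r}")
            failed = True
    return 1 if failed else 0

# Usage: two temporary files stand in for staged files (contents are fake).
with tempfile.TemporaryDirectory() as tmp:
    clean = Path(tmp) / "clean.py"
    dirty = Path(tmp) / "dirty.py"
    clean.write_text("x = 1\n", encoding="utf-8")
    dirty.write_text("password=hunter2\n", encoding="utf-8")
    clean_status = check_files([clean])
    dirty_status = check_files([clean, dirty])

print(clean_status, dirty_status)
```

A real hook would end with `sys.exit(check_files(sys.argv[1:]))` so that the exit status reaches the hook runner.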

## 1. 🚨 Hardcoded Secrets Detection Demo

The following cell contains **FAKE** sensitive data that `detect-secrets` will catch:

In [None]:
# ❌ BAD PRACTICE: Hardcoded secrets (FAKE data for demo)
# These will be detected by detect-secrets hook!

import requests
import pandas as pd

# 🚨 FAKE API Keys - detect-secrets will flag these!
OPENAI_API_KEY = "sk-1234567890abcdefghijklmnopqrstuvwxyz123456"
STRIPE_SECRET_KEY = "sk_test_fake123456789012345678901234567890abc"
AWS_ACCESS_KEY = "AKIAFAKEAWSACCESSKEY12345"
AWS_SECRET_KEY = "fake/aws/secret/access/key/1234567890abcdefghijklmn"

# 🚨 FAKE Database credentials - also flagged!
DATABASE_URL = "postgresql://user:password123@localhost:5432/myapp"
MONGODB_CONNECTION = "mongodb://admin:supersecret@localhost:27017/database"

# 🚨 FAKE JWT Secret - high entropy string detection!
JWT_SECRET = "my_super_secret_jwt_signing_key_with_high_entropy_2025"

# 🚨 FAKE access tokens and API keys
GITHUB_TOKEN = "ghp_fakeGitHubPersonalAccessToken123456789012345"
GOOGLE_API_KEY = "AIzaSyFakeGoogleMapsApiKey123456789012345678"

print("🔥 This cell contains FAKE sensitive data!")
print("🛡️ Pre-commit hooks would prevent this from being committed!")
print("📊 This cell defines 9 fake credentials")
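
As a rough illustration of how a secret scanner can combine pattern rules with an entropy check, here is a minimal sketch. The two regular expressions, the `scan_line` helper, and the threshold of 4.0 bits are assumptions chosen for this demo; they are not the rules `detect-secrets` itself ships.

```python
import math
import re
from collections import Counter

# Illustrative patterns only; real scanners ship many more rules.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "github_token": re.compile(r"ghp_[A-Za-z0-9]{36}"),
}

def shannon_entropy(text):
    """Shannon entropy of a string, in bits per character."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def scan_line(line, entropy_threshold=4.0):
    """Return the names of the rules that a single line of text triggers."""
    hits = [name for name, pattern in PATTERNS.items() if pattern.search(line)]
    # Check long quoted string literals for unusually random-looking content.
    for literal in re.findall(r'"([^"]{20,})"', line):
        if shannon_entropy(literal) > entropy_threshold:
            hits.append("high_entropy_string")
            break
    return hits

aws_hits = scan_line('AWS_ACCESS_KEY = "AKIAFAKEAWSACCESSKEY12345"')
plain_hits = scan_line('greeting = "hello"')
print(aws_hits, plain_hits)
```

Pattern rules catch credentials with a known prefix, while the entropy check is a fallback for random-looking strings that match no known format.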

In [None]:
# ✅ GOOD PRACTICE: Proper secret management
import os
from dotenv import load_dotenv

# Load environment variables from .env file (not committed to git!)
load_dotenv()

# Safe way to access secrets
def get_api_key(service_name):
    """Safely retrieve API key from environment variables."""
    api_key = os.getenv(f'{service_name.upper()}_API_KEY')
    if not api_key:
        raise ValueError(f"❌ {service_name} API key not found in environment!")
    return api_key

# Safe configuration loading
def get_database_config():
    """Get database configuration from environment."""
    return {
        'host': os.getenv('DB_HOST', 'localhost'),
        'port': os.getenv('DB_PORT', '5432'),
        'database': os.getenv('DB_NAME'),
        'username': os.getenv('DB_USER'),
        'password': os.getenv('DB_PASSWORD')  # Never hardcode!
    }

# Example usage (will fail safely if env vars not set)
try:
    # This is the CORRECT way to handle API keys
    openai_key = get_api_key('openai')
    print("✅ API key loaded safely from environment")
except ValueError as e:
    print(e)
    print("ℹ️ This is expected - we don't have real keys in our demo environment")

print("\n🎯 Key Security Principles:")
print("   1. Never hardcode secrets in source code")
print("   2. Use environment variables or secret managers")
print("   3. Keep local .env files out of git (add them to .gitignore)")
print("   4. Fail safely when secrets are missing")
print("   5. Use pre-commit hooks to catch mistakes")
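
The "fail safely" principle can be taken one step further by reporting every missing variable at once instead of stopping at the first. The sketch below is one possible approach; `load_required_secrets` is a hypothetical helper, and a plain dict stands in for `os.environ` so the example runs without any real configuration.

```python
import os

def load_required_secrets(names, environ=None):
    """Return a dict of secrets, or raise one error naming every missing variable."""
    environ = os.environ if environ is None else environ
    # Treat unset and empty values the same way: both count as missing.
    missing = [name for name in names if not environ.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return {name: environ[name] for name in names}

# Usage with a plain dict standing in for os.environ (values are fake).
fake_env = {"DB_USER": "demo_user", "DB_PASSWORD": ""}
try:
    load_required_secrets(["DB_USER", "DB_PASSWORD", "DB_NAME"], environ=fake_env)
except RuntimeError as exc:
    error_message = str(exc)
    print(error_message)
```

The error message lists variable names only, so the failure itself never echoes a secret value.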

## 2. 📓 Notebook Output Risks Demo

Jupyter notebook outputs can accidentally expose sensitive data. The `nbstripout` hook prevents this by cleaning notebooks before commit.
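
Conceptually, the cleaning step amounts to emptying each code cell's `outputs` list and resetting its `execution_count`. The sketch below shows that idea on a hand-written notebook dict. It is a simplified illustration, not `nbstripout`'s actual implementation, and `strip_outputs` is a hypothetical name.

```python
import copy

def strip_outputs(notebook):
    """Return a copy of a notebook dict with code-cell outputs and counts cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") == "code":
            cell["outputs"] = []
            cell["execution_count"] = None
    return cleaned

# A tiny hand-written notebook dict with fake data in its output.
demo_nb = {
    "cells": [
        {
            "cell_type": "code",
            "execution_count": 3,
            "metadata": {},
            "source": ["print(ssn)"],
            "outputs": [
                {"output_type": "stream", "name": "stdout", "text": ["123-45-6789\n"]}
            ],
        },
        {"cell_type": "markdown", "metadata": {}, "source": ["# Notes"]},
    ],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 5,
}

stripped = strip_outputs(demo_nb)
print(stripped["cells"][0])
```

The source of every cell survives unchanged; only the results of running it are dropped.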

In [None]:
# 🚨 RISKY: This cell will produce output containing "sensitive" data
# nbstripout will remove this output before commit!

import pandas as pd
import json

# Simulate loading data that might contain sensitive information
fake_user_data = {
    'users': [
        {
            'id': 1,
            'name': 'John Doe',
            'email': 'john.doe@company.com',
            'ssn': '123-45-6789',  # 🚨 Sensitive!
            'api_token': 'fake_token_abc123',  # 🚨 Sensitive!
            'salary': 75000  # 🚨 Sensitive!
        },
        {
            'id': 2,
            'name': 'Jane Smith', 
            'email': 'jane.smith@company.com',
            'ssn': '987-65-4321',  # 🚨 Sensitive!
            'api_token': 'fake_token_xyz789',  # 🚨 Sensitive!
            'salary': 82000  # 🚨 Sensitive!
        }
    ]
}

print("⚠️ WARNING: This output contains sensitive fake data!")
print("📊 User Data (FAKE - for demo only):")
print(json.dumps(fake_user_data, indent=2))

# Create a DataFrame that would expose sensitive data
df = pd.DataFrame(fake_user_data['users'])
print("\n📋 DataFrame with sensitive columns:")
print(df.head())

print("\n🔥 DANGER ZONE:")
print("• SSNs, API tokens, and salary data exposed in output!")
print("• Without nbstripout, this would be committed to git!")
print("• Anyone with repo access could see this sensitive data!")
print("• nbstripout prevents this by cleaning outputs before commit")

In [None]:
# 📈 Visualization with embedded sensitive data
import matplotlib.pyplot as plt
import numpy as np

# Simulate plotting sensitive business metrics
fake_revenue_data = {
    'Q1 2024': 2500000,  # $2.5M
    'Q2 2024': 3200000,  # $3.2M  
    'Q3 2024': 2800000,  # $2.8M
    'Q4 2024': 3800000   # $3.8M
}

quarters = list(fake_revenue_data.keys())
revenues = list(fake_revenue_data.values())

plt.figure(figsize=(10, 6))
plt.bar(quarters, revenues, color=['#1f77b4', '#ff7f0e', '#2ca02c', '#d62728'])
plt.title('🚨 CONFIDENTIAL: Quarterly Revenue (FAKE DATA)', fontsize=14, fontweight='bold')
plt.ylabel('Revenue ($USD)')
plt.xlabel('Quarter')

# Add value labels on bars
for i, v in enumerate(revenues):
    plt.text(i, v + 50000, f'${v:,.0f}', ha='center', va='bottom', fontweight='bold')

plt.grid(axis='y', alpha=0.3)
plt.tight_layout()

# 🚨 This plot contains "confidential" financial data!
plt.show()

print("⚠️ CRITICAL SECURITY ISSUE:")
print("• Plot contains confidential revenue figures!")
print("• Image would be embedded in notebook JSON!")
print("• Without nbstripout: Plot permanently stored in git history!")
print("• With nbstripout: Plot removed before commit!")
print("\n✅ nbstripout protection:")
print("  ✓ Removes cell outputs, including embedded matplotlib/plotly images")
print("  ✓ Strips execution metadata")
print("  ✓ Cleans cell execution counts")
print("  ✓ Keeps source code and markdown cells intact")
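
Plots are stored in the notebook file as base64 text under an `image/png` key in the cell's outputs, which is why they end up in git history. The sketch below measures how many bytes of image data a notebook dict carries. The helper name `embedded_image_bytes` and the fake payload are assumptions made for this illustration.

```python
import base64

def embedded_image_bytes(notebook):
    """Sum the decoded size of all image/png payloads stored in cell outputs."""
    total = 0
    for cell in notebook.get("cells", []):
        for output in cell.get("outputs", []):
            payload = output.get("data", {}).get("image/png")
            if payload:
                # The base64 text may be stored as one string or a list of lines.
                if isinstance(payload, list):
                    payload = "".join(payload)
                total += len(base64.b64decode(payload))
    return total

# Fake 1000-byte payload standing in for a real rendered plot.
fake_png = base64.b64encode(b"\x89PNG" + b"\x00" * 996).decode("ascii")
demo_nb = {
    "cells": [
        {
            "cell_type": "code",
            "outputs": [
                {"output_type": "display_data", "data": {"image/png": fake_png}, "metadata": {}}
            ],
        }
    ]
}
image_bytes = embedded_image_bytes(demo_nb)
print(image_bytes)
```

Running a check like this before and after stripping makes the effect on file size easy to show in a demo.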

## 3. 🕐 Execution Metadata Risks

Jupyter notebooks store execution metadata that can reveal sensitive information about when, where, and how code was run.
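
To show where this information typically sits in the file, the sketch below inspects a notebook dict for kernel details and per-cell timing. The values in `demo_nb` are made up, `summarize_metadata` is a hypothetical helper, and the cell-level `execution` key is an assumption: it only appears when the front end is set to record timing.

```python
def summarize_metadata(notebook):
    """Collect the metadata fields in a notebook dict that describe how it was run."""
    meta = notebook.get("metadata", {})
    timed_cells = [
        index
        for index, cell in enumerate(notebook.get("cells", []))
        if cell.get("metadata", {}).get("execution")
    ]
    return {
        "kernel": meta.get("kernelspec", {}).get("name"),
        "language_version": meta.get("language_info", {}).get("version"),
        "cells_with_timing": timed_cells,
    }

# Hand-written example; all values are illustrative.
demo_nb = {
    "metadata": {
        "kernelspec": {"name": "python3", "display_name": "Python 3"},
        "language_info": {"name": "python", "version": "3.11.4"},
    },
    "cells": [
        {
            "cell_type": "code",
            "source": [],
            "outputs": [],
            "execution_count": 1,
            "metadata": {"execution": {"iopub.execute_input": "2025-01-01T09:00:00.000Z"}},
        },
        {"cell_type": "markdown", "source": [], "metadata": {}},
    ],
}
summary = summarize_metadata(demo_nb)
print(summary)
```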

In [None]:
# 🕵️ Metadata exposure demonstration
import datetime
import platform
import os
import getpass

# This cell exposes system and environment information
current_time = datetime.datetime.now()
username = getpass.getuser()
hostname = platform.node()
system_info = platform.platform()
python_version = platform.python_version()
working_directory = os.getcwd()

print("🚨 METADATA EXPOSURE RISKS:")
print("=" * 50)
print(f"🕐 Execution Time: {current_time}")
print(f"👤 Username: {username}")
print(f"💻 Hostname: {hostname}")
print(f"🖥️  System: {system_info}")
print(f"🐍 Python Version: {python_version}")
print(f"📁 Working Directory: {working_directory}")

# Environment variables might contain sensitive paths
sensitive_env_vars = ['PATH', 'HOME', 'USER', 'USERPROFILE']
print(f"\n🔍 Environment Variables (partial):")
for var in sensitive_env_vars:
    value = os.getenv(var, 'Not set')
    if len(str(value)) > 60:
        value = str(value)[:60] + "..."
    print(f"  {var}: {value}")

print(f"\n⚠️ SECURITY IMPLICATIONS:")
print("• Execution timestamps reveal when work was done")
print("• Usernames and hostnames expose developer identities") 
print("• System info reveals internal infrastructure details")
print("• File paths might expose sensitive directory structures")
print("• Environment variables could contain secrets")

print("\n🛡️ nbstripout PROTECTION:")
print("✅ Resets execution_count in all code cells")
print("✅ Clears all output data")
print("✅ Strips execution-related cell metadata (more keys via --extra-keys)")
print("✅ Keeps source code and markdown cells intact")
print("✅ Keeps outputs and execution metadata out of git history")

## 4. 🧪 Testing Pre-Commit Hooks

Let's demonstrate how to test our pre-commit configuration and see the hooks in action.

In [None]:
# 🔧 Pre-commit hook testing utilities
import subprocess
import json
import os
from pathlib import Path

def run_command(command):
    """Run a shell command and return the result."""
    try:
        result = subprocess.run(command, shell=True, capture_output=True, text=True, cwd='..')
        return {
            'success': result.returncode == 0,
            'stdout': result.stdout.strip(),
            'stderr': result.stderr.strip(),
            'returncode': result.returncode
        }
    except Exception as e:
        return {
            'success': False,
            'stdout': '',
            'stderr': str(e),
            'returncode': -1
        }

def check_notebook_for_outputs(notebook_path):
    """Check if a notebook contains outputs or execution metadata."""
    try:
        with open(notebook_path, 'r', encoding='utf-8') as f:
            nb_data = json.load(f)
        
        has_outputs = False
        has_execution_count = False
        output_count = 0
        
        for cell in nb_data.get('cells', []):
            if cell.get('outputs'):
                has_outputs = True
                output_count += len(cell['outputs'])
            if cell.get('execution_count'):
                has_execution_count = True
                
        return {
            'has_outputs': has_outputs,
            'has_execution_count': has_execution_count,
            'output_count': output_count,
            'total_cells': len(nb_data.get('cells', []))
        }
    except Exception as e:
        return {'error': str(e)}

# Test current notebook status
print("📊 CURRENT NOTEBOOK STATUS:")
print("=" * 40)

notebook_path = Path("../notebooks/sensitive_data_demo.ipynb")
if notebook_path.exists():
    status = check_notebook_for_outputs(notebook_path)
    if 'error' not in status:
        print(f"📄 Notebook: {notebook_path.name}")
        print(f"📱 Total cells: {status['total_cells']}")
        print(f"🔢 Has execution counts: {'Yes' if status['has_execution_count'] else 'No'}")
        print(f"📤 Has outputs: {'Yes' if status['has_outputs'] else 'No'}")
        print(f"📊 Output count: {status['output_count']}")
        
        if status['has_outputs'] or status['has_execution_count']:
            print("\n⚠️ SECURITY RISK: Notebook contains execution data!")
            print("🛡️ nbstripout would clean this before commit")
        else:
            print("\n✅ SAFE: Notebook is clean (no outputs/metadata)")
    else:
        print(f"❌ Error reading notebook: {status['error']}")
else:
    print(f"❌ Notebook not found: {notebook_path}")

print(f"\n🧪 TESTING SCENARIOS:")
print("=" * 40)

# Scenario 1: Check if detect-secrets would catch our fake secrets
print("1. 🔍 Secret Detection Test:")
print("   • This notebook contains 8+ fake secrets")
print("   • detect-secrets hook should flag them")
print("   • Commit would be BLOCKED until secrets are removed")

# Scenario 2: Check nbstripout effectiveness  
print("\n2. 📓 Notebook Stripout Test:")
print("   • Current notebook will have outputs after running cells")
print("   • nbstripout would remove all outputs before commit") 
print("   • Only clean source code would be committed")

# Scenario 3: File size check
try:
    notebook_size = notebook_path.stat().st_size if notebook_path.exists() else 0
    print(f"\n3. 📏 File Size Test:")
    print(f"   • Current notebook size: {notebook_size:,} bytes")
    if notebook_size > 1024 * 1000:  # ~1 MB; assumes the hook runs with --maxkb=1000
        print("   • ⚠️ Exceeds the ~1 MB size limit!")
        print("   • check-added-large-files hook would block commit")
    else:
        print("   • ✅ Within size limits")
except Exception as e:
    print(f"   • ❌ Error checking size: {e}")

print(f"\n🎯 DEMO COMMANDS:")
print("=" * 40)
print("# Test individual hooks:")
print("pre-commit run detect-secrets --all-files")
print("pre-commit run nbstripout --all-files") 
print("pre-commit run check-added-large-files --all-files")
print("\n# Test all hooks:")
print("pre-commit run --all-files")
print("\n# See what files are being ignored:")
print("git status --ignored")
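
The demo commands above can also be driven from Python. The sketch below builds the argument list explicitly instead of passing a string with `shell=True`, which avoids shell-quoting problems. `build_hook_command` and `run_hook` are hypothetical helpers, and the default `cwd=".."` assumes the notebook lives one level below the repository root.

```python
import shutil
import subprocess

def build_hook_command(hook_id=None):
    """Build the argument list for `pre-commit run`, optionally for one hook."""
    command = ["pre-commit", "run"]
    if hook_id:
        command.append(hook_id)
    command.append("--all-files")
    return command

def run_hook(hook_id=None, cwd=".."):
    """Run pre-commit and return its exit code, or None if it is not installed."""
    if shutil.which("pre-commit") is None:
        return None
    result = subprocess.run(
        build_hook_command(hook_id), cwd=cwd, capture_output=True, text=True
    )
    print(result.stdout)
    return result.returncode

# Only the command is printed here; call run_hook() to actually execute it.
print(build_hook_command("detect-secrets"))
```

A non-zero return code from `run_hook` means at least one hook failed or modified files.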

## 5. 📋 Demo Summary & Best Practices

### 🎯 **What We've Demonstrated:**

#### 🚨 **Security Risks Without Pre-Commit Hooks:**
1. **Hardcoded Secrets** - API keys, passwords, tokens in source code
2. **Sensitive Outputs** - User data, financial info, credentials in notebook outputs  
3. **Metadata Leakage** - Usernames, timestamps, system info in execution metadata
4. **Large Files** - Notebooks with embedded images/data exceeding size limits

#### 🛡️ **Protection Provided by Pre-Commit Hooks:**
1. **`detect-secrets`** - Scans for 20+ types of secrets and high-entropy strings
2. **`nbstripout`** - Removes outputs, execution counts, and execution metadata
3. **`check-added-large-files`** - Prevents commits of files over size threshold
4. **`ggshield`** - GitGuardian enterprise-grade secret detection

---

### 🎓 **Key Learning Points:**

#### ✅ **DO:**
- Use environment variables for all secrets
- Configure comprehensive pre-commit hooks
- Regularly clean notebook outputs
- Test your security setup
- Use `.env` files (and gitignore them!)

#### ❌ **DON'T:**
- Hardcode any credentials in source code
- Commit notebooks with outputs
- Skip pre-commit hook setup
- Ignore security warnings
- Share real credentials in demos
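
One quick way to act on the `.env` advice is to confirm that the file is actually covered by `.gitignore`. The sketch below is a deliberately rough check using `fnmatch`; it does not implement full gitignore semantics (negation, directory rules, nested ignore files), and `is_env_ignored` is a hypothetical helper.

```python
from fnmatch import fnmatch

def is_env_ignored(gitignore_text, filename=".env"):
    """Rough check: does any non-comment, non-negated pattern match the filename?"""
    for raw in gitignore_text.splitlines():
        pattern = raw.strip()
        if not pattern or pattern.startswith("#") or pattern.startswith("!"):
            continue
        if fnmatch(filename, pattern.lstrip("/")):
            return True
    return False

covered = is_env_ignored("# secrets\n.env\n*.log\n")
not_covered = is_env_ignored("*.log\n__pycache__/\n")
print(covered, not_covered)
```

For an authoritative answer, `git check-ignore .env` asks git itself.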

---

### 🚀 **Live Demo Script:**

1. **Show the "bad" code** - Point out hardcoded secrets in the first cell
2. **Run the cells** - Generate outputs with sensitive data
3. **Check git status** - Show notebook would be committed with outputs
4. **Run pre-commit hooks** - Demonstrate how they catch issues
5. **Show clean notebook** - After nbstripout removes outputs
6. **Commit safely** - Only clean code gets committed

---

### 📊 **Expected Hook Results:**

When you run `pre-commit run --all-files` on this notebook:

- ✅ **`detect-secrets`** will flag 8+ fake secrets  
- ✅ **`nbstripout`** will clean all outputs and execution metadata
- ✅ **`check-added-large-files`** will pass (unless outputs make it too large)
- ✅ **Other hooks** (formatting, linting) will process the code

**Result: Secure, clean code ready for collaboration! 🎉**