# Lab 2: Sensitive Data Detection & AI Cataloguing

**Data Discovery: Harnessing AI, AGI & Vector Databases - Day 2**

| Duration | Difficulty | Libraries | Exercises |
|---|---|---|---|
| 90 min | Intermediate | numpy, pandas, re, spaCy, scikit-learn, sentence-transformers, chromadb, matplotlib | 5 |

In this lab, you'll practice:
- Scanning text for PII using regex patterns
- Using spaCy NER for entity extraction and hybrid detection
- Computing automated risk scores for data assets
- Building a compliance dashboard with matplotlib
- Integrating risk metadata into a vector catalogue

---

## Setup

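First, import the required libraries and load the spaCy model. The `en_core_web_sm` model must already be installed (for example via `python -m spacy download en_core_web_sm`).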

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import re
import matplotlib.pyplot as plt
from collections import Counter

# NLP
import spacy

# ML & Vector DB
from sklearn.feature_extraction.text import TfidfVectorizer
from sentence_transformers import SentenceTransformer
import chromadb

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

# Load spaCy model
nlp = spacy.load('en_core_web_sm')

print("Libraries loaded successfully!")

## Part 1: Generate Synthetic Documents

We'll create 200 synthetic text documents that simulate HR memos, financial reports, medical forms, marketing data, and legal documents, each seeded with various types of PII.

In [None]:
np.random.seed(42)

first_names = ['John', 'Jane', 'Robert', 'Maria', 'David', 'Sarah', 'Michael', 'Emily', 'James', 'Lisa']
last_names = ['Smith', 'Johnson', 'Williams', 'Brown', 'Jones', 'Garcia', 'Miller', 'Davis', 'Rodriguez', 'Wilson']
companies = ['Acme Corp', 'GlobalTech', 'MedPlus Health', 'FinanceFirst', 'DataDriven Inc']
cities = ['New York', 'San Francisco', 'Chicago', 'Boston', 'Seattle', 'Austin', 'Denver', 'Atlanta']

def random_ssn():
    return f"{np.random.randint(100,999)}-{np.random.randint(10,99)}-{np.random.randint(1000,9999)}"

def random_cc():
    return f"{np.random.randint(4000,4999)}-{np.random.randint(1000,9999)}-{np.random.randint(1000,9999)}-{np.random.randint(1000,9999)}"

def random_email(first, last):
    domains = ['company.com', 'email.org', 'corp.net', 'enterprise.io']
    return f"{first.lower()}.{last.lower()}@{np.random.choice(domains)}"

def random_phone():
    return f"({np.random.randint(200,999)}) {np.random.randint(200,999)}-{np.random.randint(1000,9999)}"

templates = {
    'hr_memo': [
        "Employee {name} (SSN: {ssn}) has been promoted to Senior Analyst effective March 2024. Contact: {email}, Phone: {phone}. Based in {city}.",
        "Termination notice for {name}, SSN: {ssn}. Final paycheck to be sent to address on file. HR contact: {email}. Processed by {company}.",
        "{name} from {company} submitted a leave request. Employee ID: EMP-{emp_id}. Emergency contact phone: {phone}. Location: {city}.",
        "Salary adjustment memo: {name} (SSN: {ssn}) annual compensation increased to ${salary:,}. Effective date: January 2024. Department: {company}.",
    ],
    'financial_report': [
        "Invoice #INV-{inv_id} for {company}: Payment of ${amount:,.2f} via credit card {cc}. Approved by {name}. Contact: {email}.",
        "Expense report submitted by {name} ({email}) for ${amount:,.2f}. Corporate card ending {cc_last4}. Reimbursement approved by finance team at {company}.",
        "Quarterly financial summary for {company}: Revenue ${amount:,.2f}. Prepared by {name}, CFO. Confidential. Phone: {phone}.",
        "Wire transfer confirmation: ${amount:,.2f} sent to account ending {acct_last4} for {name} at {company}. Reference: TXN-{txn_id}.",
    ],
    'medical_form': [
        "Patient: {name}, DOB: {dob}, SSN: {ssn}. Diagnosis: Type 2 Diabetes. Prescribed Metformin 500mg. {doctor} at {city} Medical Center.",
        "Insurance claim for {name} (Member ID: MED-{med_id}). Procedure: Annual physical exam. Provider: {company} Health. Phone: {phone}.",
        "Medical records request for {name}, DOB: {dob}. Records to be sent to {doctor} at {city} General Hospital. Patient email: {email}.",
    ],
    'marketing_data': [
        "Campaign analytics report for {company}: {impressions:,} impressions, {clicks:,} clicks, {conversions} conversions. Manager: {name}, {email}.",
        "Customer profile: {name}, {city}. Purchase history includes {purchases} orders. Email: {email}. Phone: {phone}. Loyalty tier: Gold.",
        "Event registration: {name} from {company} registered for AI Summit 2024 in {city}. Contact: {email}. Dietary: vegetarian.",
    ],
    'legal_document': [
        "Non-disclosure agreement between {name} and {company}. Effective date: January 2024. Jurisdiction: {city}. Contact: {email}.",
        "Data processing agreement: {company} processes personal data of EU residents per GDPR Art. 28. DPO: {name}, {email}, {phone}.",
        "Contract #CTR-{ctr_id} between {name} and {company}. Value: ${amount:,.2f}. Signed in {city}. Witness: {witness}.",
    ],
}

documents = []
for i in range(200):
    doc_type = np.random.choice(list(templates.keys()))
    template = np.random.choice(templates[doc_type])
    first = np.random.choice(first_names)
    last = np.random.choice(last_names)
    name = f"{first} {last}"
    
    doc_text = template.format(
        name=name,
        ssn=random_ssn(),
        cc=random_cc(),
        cc_last4=f"{np.random.randint(1000,9999)}",
        email=random_email(first, last),
        phone=random_phone(),
        city=np.random.choice(cities),
        company=np.random.choice(companies),
        salary=np.random.randint(50000, 200000),
        amount=np.random.uniform(100, 500000),
        emp_id=np.random.randint(10000, 99999),
        inv_id=np.random.randint(10000, 99999),
        txn_id=np.random.randint(100000, 999999),
        acct_last4=f"{np.random.randint(1000,9999)}",
        med_id=np.random.randint(100000, 999999),
        dob=f"{np.random.randint(1,13):02d}/{np.random.randint(1,29):02d}/{np.random.randint(1950,2000)}",  # randint's upper bound is exclusive
        doctor=f"Dr. {np.random.choice(last_names)}",
        impressions=np.random.randint(10000, 1000000),
        clicks=np.random.randint(100, 50000),
        conversions=np.random.randint(10, 1000),
        purchases=np.random.randint(1, 50),
        ctr_id=np.random.randint(10000, 99999),
        witness=f"{np.random.choice(first_names)} {np.random.choice(last_names)}",
    )
    
    documents.append({
        'doc_id': f'DOC-{i+1:04d}',
        'doc_type': doc_type,
        'text': doc_text,
        'department': 'HR' if doc_type == 'hr_memo' else doc_type.split('_')[0].title(),
    })

docs_df = pd.DataFrame(documents)
print(f"Generated {len(docs_df)} documents")
print("\nDocument type distribution:")
print(docs_df['doc_type'].value_counts())
print("\nSample document:")
print(docs_df.iloc[0]['text'])

## Exercise 1.1: Regex PII Scanning

Build a regex-based scanner to detect SSNs, credit card numbers, emails, and phone numbers in each document.

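**Your Task:** Implement the scanner, then compute precision and recall for each PII type by comparing its detections with the PII each document is known to contain (the templates determine which fields appear).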
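
Precision and recall can be computed per PII type at document level: a document counts as a true positive when the type is both detected and actually present. The sketch below assumes ground truth is available as one set of PII type names per document (for these synthetic documents it could be derived from the template placeholders). `precision_recall` and its argument names are hypothetical helpers, not part of the starter code.

```python
def precision_recall(predicted, actual, pii_type):
    """Document-level precision and recall for one PII type.

    predicted, actual: lists of sets, one set of PII type names per document.
    """
    pairs = list(zip(predicted, actual))
    tp = sum(1 for p, a in pairs if pii_type in p and pii_type in a)
    fp = sum(1 for p, a in pairs if pii_type in p and pii_type not in a)
    fn = sum(1 for p, a in pairs if pii_type not in p and pii_type in a)
    # Return 0.0 when a denominator is empty rather than dividing by zero
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Toy example: detector output vs. ground truth for three documents
predicted = [{"ssn", "email"}, {"ssn"}, set()]
actual = [{"ssn", "email"}, set(), {"ssn"}]
print(precision_recall(predicted, actual, "ssn"))  # (0.5, 0.5)
```

For `ssn`, the first document is a true positive, the second a false positive, and the third a false negative, which gives 0.5 for both metrics.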

In [None]:
def scan_pii_regex(text):
    """Scan text for PII using regex patterns.
    
    Detect: SSN, credit card, email, phone
    
    Returns: dict of {pii_type: [matches]}
    """
    # TODO: Define regex patterns for SSN, credit_card, email, phone
    # TODO: Apply each pattern to the text
    # TODO: Return dict of findings
    pass

# TODO: Apply scan_pii_regex to all documents
# TODO: Add columns for each PII type count
# TODO: Print summary statistics
pass
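
One possible shape for the scanner is sketched below. The patterns are assumptions tuned to the formats the synthetic generator emits (`NNN-NN-NNNN` SSNs, four hyphenated groups of four digits for cards, `(NNN) NNN-NNNN` phones). Real data would need broader patterns and extra validation. `PII_PATTERNS` is a hypothetical name.

```python
import re

# Assumed patterns, matched to the synthetic formats in this lab only
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d{4}[- ]){3}\d{4}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"),
    "phone": re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}"),
}


def scan_pii_regex(text):
    """Return {pii_type: [matches]} for every pattern, empty list if none."""
    return {name: pattern.findall(text) for name, pattern in PII_PATTERNS.items()}


sample = (
    "Employee Jane Smith (SSN: 123-45-6789) has been promoted. "
    "Contact: jane.smith@corp.net, Phone: (415) 555-0199."
)
findings = scan_pii_regex(sample)
for pii_type, matches in findings.items():
    print(f"{pii_type}: {matches}")
```

To apply it across the corpus, something like `docs_df['text'].apply(scan_pii_regex)` yields one findings dict per row, from which per-type count columns (for example a hypothetical `ssn_count`) can be derived with `len`.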

## Exercise 1.2: NER with spaCy

Use spaCy's Named Entity Recognition to extract PERSON, ORG, and GPE entities, then combine with regex findings for hybrid detection.

**Your Task:** Extract NER entities and merge with regex results.

In [None]:
def extract_ner_entities(text):
    """Extract named entities using spaCy.
    
    Extract: PERSON, ORG, GPE entities
    
    Returns: dict of {entity_type: [entities]}
    """
    # TODO: Process text with spaCy nlp()
    # TODO: Extract PERSON, ORG, GPE entities
    # TODO: Return dict of findings
    pass

# TODO: Apply to all documents and add NER columns
# TODO: Create a combined 'total_pii_types' column (regex + NER unique types)
# TODO: Print a sample showing both regex and NER findings
pass
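
A minimal sketch of the hybrid step follows. It assumes the spaCy pipeline is passed in as `nlp` (the object loaded in Setup) and relies only on `doc.ents`, `ent.text`, and `ent.label_`. The extra `nlp` parameter and the `count_pii_types` helper are hypothetical and differ from the starter signature.

```python
def extract_ner_entities(text, nlp, labels=("PERSON", "ORG", "GPE")):
    """Return {label: [unique entity texts]} for the requested labels."""
    found = {label: [] for label in labels}
    for ent in nlp(text).ents:
        # Keep only requested labels and skip repeated mentions
        if ent.label_ in found and ent.text not in found[ent.label_]:
            found[ent.label_].append(ent.text)
    return found


def count_pii_types(regex_findings, ner_findings):
    """Count distinct regex and NER types that have at least one hit."""
    merged = {**regex_findings, **ner_findings}
    return sum(1 for hits in merged.values() if hits)
```

In the notebook this might be called as `extract_ner_entities(text, nlp)`. The small English model can mislabel names and companies, so the NER counts are best treated as approximate.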

## Exercise 2.1: Risk Scoring

Compute a 0-100 risk score for each document based on the types and volume of PII found and on the document type, which serves as a proxy for the applicable regulations.

**Your Task:** Implement a risk scoring function.

In [None]:
def compute_risk_score(row):
    """Compute a 0-100 risk score for a document.
    
    Scoring factors:
    - SSN found: +30 points
    - Credit card found: +25 points
    - Email found: +10 points
    - Phone found: +10 points
    - PERSON entities found: +5 per entity (max 15)
    - Medical document type: +15 points
    - Financial document type: +10 points
    
    Cap at 100.
    
    Returns: integer risk score 0-100
    """
    # TODO: Compute score based on PII findings
    # TODO: Add document type bonus
    # TODO: Cap at 100 and return
    pass

# TODO: Apply risk scoring to all documents
# TODO: Assign risk tiers: Critical (76-100), High (51-75), Medium (26-50), Low (0-25)
# TODO: Print distribution of risk tiers
pass
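
The scoring factors listed in the docstring can be sketched as a pair of lookup tables plus a cap. The column names (`ssn_count`, `credit_card_count`, `email_count`, `phone_count`, `person_count`) are assumptions about how the earlier exercises stored their results. The function accepts any mapping, so it works with a plain dict or a DataFrame row.

```python
# Assumed column names; adjust to match the columns created earlier
PII_WEIGHTS = {"ssn_count": 30, "credit_card_count": 25, "email_count": 10, "phone_count": 10}
DOC_TYPE_BONUS = {"medical_form": 15, "financial_report": 10}


def compute_risk_score(row):
    """Score a document 0-100 from its PII counts and document type."""
    score = sum(weight for col, weight in PII_WEIGHTS.items() if row[col] > 0)
    score += min(5 * row["person_count"], 15)        # +5 per PERSON, max 15
    score += DOC_TYPE_BONUS.get(row["doc_type"], 0)  # document type bonus
    return int(min(score, 100))


def risk_tier(score):
    """Map a score to the tier boundaries given in the exercise."""
    if score >= 76:
        return "Critical"
    if score >= 51:
        return "High"
    if score >= 26:
        return "Medium"
    return "Low"


example = {
    "ssn_count": 1, "credit_card_count": 0, "email_count": 1,
    "phone_count": 0, "person_count": 2, "doc_type": "medical_form",
}
score = compute_risk_score(example)
print(score, risk_tier(score))  # 65 High
```

The example scores 30 (SSN) + 10 (email) + 10 (two PERSON entities) + 15 (medical form) = 65. On the DataFrame, `docs_df.apply(compute_risk_score, axis=1)` would produce the score column.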

## Exercise 2.2: Compliance Dashboard

Create a 2x2 matplotlib dashboard showing the PII type distribution, the risk tier distribution, the average risk score by document type, and the number of documents each regulation applies to.

**Your Task:** Build the compliance dashboard.

In [None]:
def build_compliance_dashboard(docs_df):
    """Build a 2x2 compliance dashboard.
    
    Plots:
    1. Top-left: PII type distribution (bar chart)
    2. Top-right: Risk tier distribution (pie chart)
    3. Bottom-left: Average risk score by document type (horizontal bar)
    4. Bottom-right: Applicable regulations count (bar chart)
       - GDPR: docs with PERSON entities
       - HIPAA: medical_form docs
       - PCI-DSS: docs with credit card findings
       - CCPA: docs with email + phone
    """
    # TODO: Create 2x2 subplot figure (16, 12)
    # TODO: Implement each of the 4 visualisations
    pass

build_compliance_dashboard(docs_df)
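
The bottom-right panel needs a count of documents per regulation. The rules from the docstring can be sketched as a per-document flagging function followed by a tally. The column names are the same assumed ones (`person_count`, `credit_card_count`, `email_count`, `phone_count`), and both helper names are hypothetical.

```python
from collections import Counter


def applicable_regulations(row):
    """Return the regulations a document falls under, per the lab's rules."""
    regs = []
    if row["person_count"] > 0:
        regs.append("GDPR")
    if row["doc_type"] == "medical_form":
        regs.append("HIPAA")
    if row["credit_card_count"] > 0:
        regs.append("PCI-DSS")
    if row["email_count"] > 0 and row["phone_count"] > 0:
        regs.append("CCPA")
    return regs


def regulation_counts(rows):
    """Tally how many documents each regulation applies to."""
    return Counter(reg for row in rows for reg in applicable_regulations(row))


rows = [
    {"doc_type": "medical_form", "person_count": 1, "credit_card_count": 0,
     "email_count": 1, "phone_count": 1},
    {"doc_type": "financial_report", "person_count": 0, "credit_card_count": 1,
     "email_count": 1, "phone_count": 0},
]
counts = regulation_counts(rows)
print(dict(counts))
```

With the DataFrame, `regulation_counts(docs_df.to_dict("records"))` gives the tally, and `axes[1, 1].bar(list(counts.keys()), list(counts.values()))` would draw it on the fourth subplot. These rules are teaching simplifications, not legal tests of applicability.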

## Exercise 2.3: Catalogue Integration

Build a vector store that includes risk metadata, then perform filtered queries to find high-risk documents matching specific search criteria.

**Your Task:** Create a ChromaDB collection with risk metadata and run filtered semantic queries.

In [None]:
def build_risk_catalogue(docs_df):
    """Build a vector catalogue with risk metadata.
    
    Steps:
    1. Load SentenceTransformer('all-MiniLM-L6-v2')
    2. Encode document texts
    3. Create ChromaDB collection 'risk_catalogue'
    4. Add with metadata: doc_type, risk_score, risk_tier
    
    Returns: (collection, model)
    """
    # TODO: Load model and encode texts
    # TODO: Create ChromaDB collection
    # TODO: Add embeddings with risk metadata
    # TODO: Return (collection, model)
    pass

catalogue_result = build_risk_catalogue(docs_df)
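
The loading step can be sketched as below, assuming the collection and embedding model are created by the caller and that `risk_score` and `risk_tier` columns already exist. `add_to_catalogue` is a hypothetical helper with a different signature from the starter function. It uses only `model.encode(...)` and `collection.add(ids=..., embeddings=..., documents=..., metadatas=...)`.

```python
def add_to_catalogue(records, collection, model):
    """Embed each record's text and store it with risk metadata.

    records: list of dicts with doc_id, text, doc_type, risk_score, risk_tier.
    """
    texts = [r["text"] for r in records]
    # Convert vectors to plain lists of floats (encode may return an array)
    embeddings = [[float(x) for x in vec] for vec in model.encode(texts)]
    # Cast metadata to plain Python types; DataFrame values may be NumPy scalars
    metadatas = [
        {
            "doc_type": str(r["doc_type"]),
            "risk_score": int(r["risk_score"]),
            "risk_tier": str(r["risk_tier"]),
        }
        for r in records
    ]
    collection.add(
        ids=[r["doc_id"] for r in records],
        embeddings=embeddings,
        documents=texts,
        metadatas=metadatas,
    )
    return collection
```

In the notebook, the inputs might come from `chromadb.Client().get_or_create_collection(name='risk_catalogue')`, `SentenceTransformer('all-MiniLM-L6-v2')`, and `docs_df.to_dict('records')`.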

In [None]:
def filtered_risk_search(collection, query, risk_tier=None, n_results=5):
    """Search the risk catalogue with optional risk tier filter.
    
    If risk_tier is provided, filter results to that tier.
    Print query, results with doc_type, risk_score, and text preview.
    """
    # TODO: Build where clause if risk_tier is provided
    # TODO: Query collection
    # TODO: Print formatted results
    pass

# Test queries
if catalogue_result:
    collection, model = catalogue_result
    filtered_risk_search(collection, "employee personal data", risk_tier="Critical")
    filtered_risk_search(collection, "financial transactions and payments")
    filtered_risk_search(collection, "medical patient records", risk_tier="Critical")
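
A filtered query combines a query embedding with a metadata `where` clause. The sketch below assumes the embedding model is passed in explicitly, unlike the starter signature, and adds a hypothetical `min_score` option to show how two conditions combine under `$and`. A single condition is passed on its own because `$and` expects at least two clauses.

```python
def build_where(risk_tier=None, min_score=None):
    """Build a Chroma-style metadata filter, or None when unfiltered."""
    clauses = []
    if risk_tier is not None:
        clauses.append({"risk_tier": {"$eq": risk_tier}})
    if min_score is not None:
        clauses.append({"risk_score": {"$gte": min_score}})
    if not clauses:
        return None
    return clauses[0] if len(clauses) == 1 else {"$and": clauses}


def filtered_risk_search(collection, model, query, risk_tier=None, n_results=5):
    """Run a semantic query, optionally restricted to one risk tier."""
    query_embedding = [float(x) for x in model.encode([query])[0]]
    results = collection.query(
        query_embeddings=[query_embedding],
        n_results=n_results,
        where=build_where(risk_tier),
    )
    print(f"Query: {query!r} (risk_tier={risk_tier})")
    # Results are nested per query; index 0 is the single query sent above
    for text, meta in zip(results["documents"][0], results["metadatas"][0]):
        print(f"  [{meta['risk_tier']} {meta['risk_score']}] {meta['doc_type']}: {text[:80]}")
    return results
```

If a tier has fewer than `n_results` matching documents, the query simply returns fewer hits.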

## Summary

In this lab, you learned how to:

1. **Scan** documents for PII using regex patterns and evaluate the results with precision and recall
2. **Extract** named entities with spaCy NER for hybrid PII detection
3. **Score** data assets for compliance risk on a 0-100 scale
4. **Visualise** compliance posture with a multi-chart dashboard
5. **Integrate** risk metadata into a vector catalogue for filtered semantic search

---

*Data Discovery: Harnessing AI, AGI & Vector Databases | AI Elevate*