# Lab 4: Data Governance & Automated Policy Engine

**Data Discovery: Harnessing AI, AGI & Vector Databases - Day 2**

| Duration | Difficulty | Libraries | Exercises |
|---|---|---|---|
| 90 min | Intermediate | pandas, numpy, matplotlib, networkx | 6 |

In this lab, you'll practise:
- Defining governance policies as executable rule functions
- Scanning data assets against policies to detect violations
- Modelling data lineage as a directed graph
- Simulating role-based access control (RBAC)
- Building a compliance dashboard
- Generating automated compliance reports

---

## Student Notes & Background

### Why Data Governance Matters

As organisations accumulate thousands of data assets across departments, the question shifts from *"Can we find this data?"* to *"Should anyone be accessing this data, and is it compliant with our policies?"* **Data governance** is the framework of policies, processes, and controls that ensures data is managed responsibly throughout its lifecycle.

Without automated governance, organisations face:
- **Compliance violations** — PII stored without encryption, breaching GDPR, HIPAA, or CCPA
- **Security risks** — unauthorised users accessing sensitive data
- **Data decay** — stale, orphaned, or redundant data consuming resources
- **Audit failures** — inability to demonstrate compliance to regulators

### Key Concepts

#### 1. Policy-as-Code
Traditional governance relies on written policies that humans interpret and enforce manually. **Policy-as-code** encodes governance rules as executable functions that can be run automatically against every data asset in your catalogue. Each policy check takes an asset's metadata and returns a verdict: *compliant* or *violation* with a detail message.

The benefits are significant:
- **Consistency** — every asset is evaluated by the same rules
- **Speed** — hundreds of assets scanned in seconds rather than weeks of manual audit
- **Auditability** — every check produces a traceable record
- **Composability** — new policies can be added without modifying existing ones
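
As a minimal sketch of this pattern, the two checks below share one signature and are run from a small registry. The policies themselves (`check_description_present`, `check_name_is_lowercase`), the `description` field, and the `EXAMPLE_POLICIES` registry are hypothetical illustrations; they are not among the six policies you will implement in this lab.

```python
# Illustrative only: hypothetical policies, not the lab's six.
def check_description_present(asset):
    """Return (is_violation, detail) for one asset's metadata dict."""
    description = asset.get("description") or ""
    if not description.strip():
        return True, f"{asset['asset_id']}: description is missing"
    return False, ""


def check_name_is_lowercase(asset):
    """Return (is_violation, detail); flags names containing uppercase letters."""
    name = asset.get("name", "")
    if name != name.lower():
        return True, f"{asset['asset_id']}: name '{name}' is not lowercase"
    return False, ""


# Adding a policy means appending one entry; existing checks are untouched.
EXAMPLE_POLICIES = [
    {"id": "EX-001", "severity": "Low", "check": check_description_present},
    {"id": "EX-002", "severity": "Low", "check": check_name_is_lowercase},
]

sample_asset = {"asset_id": "ASSET-0001", "name": "HR_Payroll", "description": ""}

findings = []
for policy in EXAMPLE_POLICIES:
    is_violation, detail = policy["check"](sample_asset)
    if is_violation:
        findings.append((policy["id"], detail))

for policy_id, detail in findings:
    print(policy_id, detail)
```

Because every check has the same inputs and outputs, each one can be unit-tested in isolation with a hand-written dict.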

#### 2. Policy Severity Levels
Not all violations are equal. A standard severity classification:

| Severity | Description | Response Time |
|---|---|---|
| **Critical** | Immediate risk of data breach or regulatory penalty | Fix within 24 hours |
| **High** | Significant compliance gap requiring prompt action | Fix within 1 week |
| **Medium** | Best-practice deviation that should be addressed | Fix within 1 month |
| **Low** | Minor improvement opportunity | Address in next review cycle |
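
One way to use this table in code is to rank severities for triage and derive a remediation deadline from the response time. The sketch below assumes that "1 month" can be approximated as 30 days and that Low findings carry no fixed deadline; the names `SEVERITY_RANK`, `RESPONSE_WINDOW`, and `remediation_deadline` are hypothetical.

```python
from datetime import datetime, timedelta

# Lower rank = more urgent; used as a sort key.
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}

RESPONSE_WINDOW = {
    "Critical": timedelta(hours=24),
    "High": timedelta(weeks=1),
    "Medium": timedelta(days=30),  # assumption: "1 month" treated as 30 days
    "Low": None,                   # no fixed deadline; next review cycle
}


def remediation_deadline(severity, detected_at):
    """Return the fix-by datetime, or None when there is no fixed deadline."""
    window = RESPONSE_WINDOW[severity]
    return None if window is None else detected_at + window


violations = [
    {"asset_id": "ASSET-0003", "severity": "Medium"},
    {"asset_id": "ASSET-0001", "severity": "Critical"},
    {"asset_id": "ASSET-0002", "severity": "Low"},
]

# Most urgent findings first.
triaged = sorted(violations, key=lambda v: SEVERITY_RANK[v["severity"]])

detected_at = datetime(2024, 1, 1, 9, 0)  # example timestamp
for v in triaged:
    print(v["asset_id"], v["severity"], remediation_deadline(v["severity"], detected_at))
```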

#### 3. Data Lineage
**Data lineage** tracks how data flows through an organisation — from source systems (databases, APIs, files), through transformations (ETL jobs, ML pipelines, aggregations), to outputs (dashboards, reports, exports). Lineage is modelled as a **directed acyclic graph (DAG)** where:
- **Nodes** represent data assets or processing steps
- **Edges** represent data flow from upstream to downstream

Lineage analysis reveals:
- **Bottleneck nodes** — transforms that many pipelines depend on (high failure impact)
- **Long dependency chains** — complex pipelines that are harder to debug and audit
- **Blast radius** — if a source system fails, which downstream outputs are affected?
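
The three analyses above can be sketched on a tiny hand-built graph. The node names are invented for illustration, and betweenness centrality is used here as one possible proxy for a bottleneck; node degree would be another reasonable choice.

```python
import networkx as nx

# Edges point from upstream to downstream.
lineage = nx.DiGraph()
lineage.add_edges_from([
    ("crm_db", "clean_customers"),
    ("orders_api", "clean_orders"),
    ("clean_customers", "customer_orders_join"),
    ("clean_orders", "customer_orders_join"),
    ("customer_orders_join", "revenue_dashboard"),
    ("customer_orders_join", "churn_report"),
])

# Lineage should contain no cycles; dag_longest_path requires a DAG.
assert nx.is_directed_acyclic_graph(lineage)

# Blast radius: everything reachable downstream of a failing source.
blast_radius = nx.descendants(lineage, "crm_db")

# Longest dependency chain, as a list of nodes from source to output.
longest_chain = nx.dag_longest_path(lineage)

# Bottleneck candidate: the node that sits on the most shortest paths.
centrality = nx.betweenness_centrality(lineage)
bottleneck = max(centrality, key=centrality.get)

print("Blast radius of crm_db:", sorted(blast_radius))
print("Longest chain:", " -> ".join(longest_chain))
print("Bottleneck candidate:", bottleneck)
```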

#### 4. Role-Based Access Control (RBAC)
**RBAC** restricts data access based on a user's role rather than their individual identity. A typical enterprise model:

| Role | Access Level | Typical Users |
|---|---|---|
| **Admin** | All sensitivity levels | Data platform team, CTO |
| **Analyst** | Up to Confidential | Data analysts, business intelligence |
| **Engineer** | Up to Internal | Software engineers, DevOps |
| **Viewer** | Public only | General staff, external partners |

A **violation** occurs when a user in a lower-privilege role accesses data above their clearance level. Monitoring access patterns helps detect both policy gaps and potential security incidents.
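
The clearance check reduces to comparing two positions in an ordered list of sensitivity levels. A minimal sketch, assuming the ordering Public < Internal < Confidential < Restricted and lowercase role keys (the dictionary and function names are hypothetical):

```python
# Higher rank = more sensitive.
SENSITIVITY_RANK = {"Public": 0, "Internal": 1, "Confidential": 2, "Restricted": 3}

# Maximum sensitivity each role may access, taken from the table above.
ROLE_MAX_SENSITIVITY = {
    "admin": "Restricted",
    "analyst": "Confidential",
    "engineer": "Internal",
    "viewer": "Public",
}


def is_access_violation(role, asset_sensitivity):
    """True when the asset is more sensitive than the role's clearance."""
    allowed_rank = SENSITIVITY_RANK[ROLE_MAX_SENSITIVITY[role]]
    return SENSITIVITY_RANK[asset_sensitivity] > allowed_rank


print(is_access_violation("engineer", "Confidential"))  # engineer exceeds Internal
print(is_access_violation("analyst", "Confidential"))   # at the limit, allowed
```

Storing the ordering as integer ranks keeps the check to a single comparison and makes it easy to add a level later.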

#### 5. Compliance Dashboards
A **compliance dashboard** provides at-a-glance visibility into your governance posture. Effective dashboards show:
- Overall compliance score (percentage of clean assets)
- Violations broken down by severity, category, and policy
- Trends over time (is compliance improving or degrading?)
- Sensitivity distribution of violating assets
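
The headline numbers behind such a dashboard are simple aggregations. The sketch below uses a four-asset toy catalogue and assumes the score is defined as the percentage of assets with no violations at all; the column names mirror those used later in the lab.

```python
import pandas as pd

assets = pd.DataFrame({"asset_id": ["A1", "A2", "A3", "A4"]})
violations = pd.DataFrame({
    "asset_id": ["A1", "A1", "A3"],
    "severity": ["Critical", "High", "Medium"],
})

# An asset with several violations still counts once as non-compliant.
violating_assets = violations["asset_id"].nunique()
compliance_score = 100 * (1 - violating_assets / len(assets))

# Breakdown used for a severity bar chart.
by_severity = violations["severity"].value_counts()

print(f"Compliance score: {compliance_score:.1f}%")
print(by_severity)
```

Counting distinct assets rather than rows matters here: two of the three violations belong to the same asset, so two of four assets are non-compliant.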

#### 6. Compliance Reports
Automated **compliance reports** translate raw violation data into actionable documents for different audiences:
- **Executive summary** for leadership (score, headline numbers, rating)
- **Critical findings** for the security team (specific assets and violations)
- **Department breakdown** for data stewards (their team's compliance posture)
- **Recommendations** for the governance committee (prioritised remediation actions)
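
As a rough sketch of how violation records become report sections, the function below renders three of the four sections as plain text. The function name, layout, and sample records are hypothetical, and the recommendations section is deliberately left out.

```python
from collections import Counter


def render_mini_report(total_assets, violations):
    """Build a short plain-text report from a list of violation dicts."""
    violating = {v["asset_id"] for v in violations}
    score = 100 * (1 - len(violating) / total_assets)
    critical = [v for v in violations if v["severity"] == "Critical"]
    per_department = Counter(v["category"] for v in violations)

    lines = [
        "EXECUTIVE SUMMARY",
        f"  Compliance score: {score:.1f}% "
        f"({total_assets} assets, {len(violations)} violations)",
        "CRITICAL FINDINGS",
    ]
    # Fall back to a placeholder line when there are no critical findings.
    lines += [f"  {v['asset_id']}: {v['detail']}" for v in critical] or ["  none"]
    lines.append("DEPARTMENT BREAKDOWN")
    lines += [f"  {dept}: {count} violation(s)"
              for dept, count in per_department.most_common()]
    return "\n".join(lines)


sample_violations = [
    {"asset_id": "A1", "severity": "Critical", "category": "HR", "detail": "PII not encrypted"},
    {"asset_id": "A1", "severity": "High", "category": "HR", "detail": "no owner"},
    {"asset_id": "A3", "severity": "Medium", "category": "Finance", "detail": "stale data"},
]

mini_report = render_mini_report(total_assets=4, violations=sample_violations)
print(mini_report)
```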

### What You'll Build

In this lab, you will:
1. **Define** 6 governance policy check functions covering PII encryption, retention periods, stale data, owner assignment, access control, and backup compliance
2. **Build** an automated scanner that runs all policies against 300 synthetic data assets and collects violations into a structured DataFrame
3. **Model** data lineage as a NetworkX directed graph with source, transform, and output nodes, then analyse it for bottlenecks and long dependency chains
4. **Simulate** 1,000 RBAC access requests across 4 roles and visualise violation rates as a heatmap
5. **Create** a 6-panel governance dashboard with matplotlib showing severity breakdowns, compliance rates, and an overall score
6. **Generate** a formatted compliance report with executive summary, critical findings, department breakdown, and actionable recommendations

### Prerequisites
- Familiarity with pandas DataFrames, numpy, and matplotlib
- Understanding of set operations (for access control checks)
- Concepts from Labs 1-2: data asset metadata, sensitivity levels, PII detection
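
If set operations are rusty, the short refresher below covers what the access control check relies on. The user names are examples only; in the lab data these fields are stored as lists, so they would need wrapping in `set(...)` first.

```python
approved_users = {"alice", "bob", "carol"}
actual_users = {"bob", "dave"}

# Users who accessed the asset without approval.
unauthorised = actual_users - approved_users

# Approved users who never accessed it (candidates for access review).
unused_grants = approved_users - actual_users

# True only when every actual user is approved (subset test).
fully_authorised = actual_users <= approved_users

print(sorted(unauthorised), sorted(unused_grants), fully_authorised)
```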

### Tips
- Each policy function follows the same signature, `def check_policy(row) -> tuple[bool, str]`, and returns `(is_violation, detail)`; this makes them composable and testable
- The synthetic data includes deliberate governance gaps (unencrypted PII, missing owners, stale assets) so you will find violations
- For the lineage graph, `networkx` provides powerful analysis functions like `nx.dag_longest_path()`, `nx.descendants()`, and `nx.betweenness_centrality()`
- When building the dashboard, use `plt.tight_layout()` to prevent label overlap in the 2×3 grid
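
For the dashboard layout, a bare 2×3 skeleton might look like the sketch below. The panel titles, axis labels, and the `87.5%` figure are placeholders rather than results; only the grid shape, the text-only panel technique, and the `tight_layout()` call are the point.

```python
import matplotlib.pyplot as plt

# 2 rows x 3 columns; axes is a 2D array indexed as axes[row, col].
fig, axes = plt.subplots(2, 3, figsize=(18, 10))

for index, ax in enumerate(axes.flat, start=1):
    ax.set_title(f"Panel {index}")   # placeholder title
    ax.set_xlabel("x label")
    ax.set_ylabel("y label")

# A text-only panel (e.g. an overall score): hide the axes, centre the text.
score_ax = axes[1, 2]
score_ax.axis("off")
score_ax.text(0.5, 0.5, "87.5%", ha="center", va="center",
              fontsize=48, transform=score_ax.transAxes)  # placeholder value

# Recompute spacing so titles and labels of neighbouring panels do not collide.
plt.tight_layout()
```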

---

## Setup

First, let's import the necessary libraries.

```python
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import json
from collections import defaultdict
from datetime import datetime, timedelta

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries loaded successfully!")
```

## Part 1: Generate Synthetic Data Assets

We'll create 300 data assets with governance-relevant metadata, including PII flags, encryption status, retention periods, and access patterns.

```python
np.random.seed(42)

categories = ['HR', 'Finance', 'Marketing', 'Engineering', 'Legal']
owners = ['alice', 'bob', 'carol', 'dave', 'eve', None]
sensitivity_levels = ['Public', 'Internal', 'Confidential', 'Restricted']

n_assets = 300
assets = []

all_users = ['alice', 'bob', 'carol', 'dave', 'eve', 'frank', 'grace', 'henry', 'iris', 'jack']

for i in range(n_assets):
    cat = np.random.choice(categories)
    sens = np.random.choice(sensitivity_levels, p=[0.15, 0.30, 0.30, 0.25])
    has_pii = cat in ['HR', 'Legal'] or (cat == 'Finance' and np.random.random() < 0.5)
    has_encryption = (sens in ['Confidential', 'Restricted'] and np.random.random() < 0.7) or np.random.random() < 0.2

    # Determine approved users based on sensitivity
    if sens == 'Restricted':
        n_approved = np.random.randint(1, 4)
    elif sens == 'Confidential':
        n_approved = np.random.randint(2, 6)
    else:
        n_approved = np.random.randint(3, 10)
    approved = list(np.random.choice(all_users, size=n_approved, replace=False))

    # Actual users (sometimes includes unauthorized)
    n_actual = np.random.randint(1, min(n_approved + 3, len(all_users)))
    actual = list(np.random.choice(all_users, size=n_actual, replace=False))

    assets.append({
        'asset_id': f'ASSET-{i+1:04d}',
        'name': f'{cat.lower()}_asset_{i+1:04d}',
        'category': cat,
        'owner': np.random.choice(owners, p=[0.2, 0.2, 0.2, 0.2, 0.15, 0.05]),
        'sensitivity': sens,
        'has_pii': has_pii,
        'has_encryption': has_encryption,
        'last_access_days_ago': np.random.randint(0, 400),
        'retention_days': np.random.choice([90, 180, 365, 730, 1095]),
        'days_since_creation': np.random.randint(30, 1200),
        'access_count_30d': np.random.randint(0, 200),
        'has_backup': np.random.random() < 0.6,
        'approved_users': approved,
        'actual_users': actual,
    })

assets_df = pd.DataFrame(assets)
print(f"Generated {len(assets_df)} data assets")
print("\nSensitivity distribution:")
print(assets_df['sensitivity'].value_counts())
print(f"\nPII assets: {assets_df['has_pii'].sum()} ({assets_df['has_pii'].mean()*100:.1f}%)")
print(f"Encrypted:  {assets_df['has_encryption'].sum()} ({assets_df['has_encryption'].mean()*100:.1f}%)")
assets_df.head()
```

## Exercise 1.1: Define Governance Policies

Create executable policy rule functions that check data assets for compliance violations.

**Your Task:** Implement 6 policy check functions. Each takes a row (asset dict) and returns `(bool, str)` — whether the asset violates the policy and a description of the violation.

```python
def check_pii_encryption(row):
    """CRITICAL: Assets containing PII must be encrypted.
    Returns: (is_violation: bool, detail: str)
    """
    # YOUR CODE HERE
    pass

def check_retention_compliance(row):
    """HIGH: Assets past their retention period must be flagged for review.
    Violation if days_since_creation > retention_days.
    Returns: (is_violation: bool, detail: str)
    """
    # YOUR CODE HERE
    pass

def check_stale_data(row):
    """MEDIUM: Assets not accessed in 180+ days should be reviewed.
    Returns: (is_violation: bool, detail: str)
    """
    # YOUR CODE HERE
    pass

def check_owner_assigned(row):
    """HIGH: All data assets must have an assigned owner.
    Returns: (is_violation: bool, detail: str)
    """
    # YOUR CODE HERE
    pass

def check_access_control(row):
    """CRITICAL: No unauthorized users should be accessing assets.
    Violation if any user in actual_users is not in approved_users.
    Returns: (is_violation: bool, detail: str)
    """
    # YOUR CODE HERE
    pass

def check_backup_compliance(row):
    """HIGH: Confidential and Restricted assets must have backups.
    Returns: (is_violation: bool, detail: str)
    """
    # YOUR CODE HERE
    pass

# Policy registry
POLICIES = [
    {'id': 'POL-001', 'name': 'PII Encryption Required', 'severity': 'Critical', 'check': check_pii_encryption},
    {'id': 'POL-002', 'name': 'Retention Compliance', 'severity': 'High', 'check': check_retention_compliance},
    {'id': 'POL-003', 'name': 'Stale Data Review', 'severity': 'Medium', 'check': check_stale_data},
    {'id': 'POL-004', 'name': 'Owner Assignment', 'severity': 'High', 'check': check_owner_assigned},
    {'id': 'POL-005', 'name': 'Access Control', 'severity': 'Critical', 'check': check_access_control},
    {'id': 'POL-006', 'name': 'Backup Compliance', 'severity': 'High', 'check': check_backup_compliance},
]

print(f"Defined {len(POLICIES)} governance policies")
for p in POLICIES:
    print(f"  [{p['severity']}] {p['id']}: {p['name']}")
```

## Exercise 1.2: Automated Policy Scanning

Run all policies against all data assets to produce a violations report.

**Your Task:** Scan every asset against every policy, collect all violations, and produce a summary.

```python
def scan_all_assets(assets_df, policies):
    """Scan all assets against all policies.

    Steps:
    1. For each asset, run each policy check function
    2. Collect violations as dicts: {asset_id, policy_id, policy_name, severity, detail, category}
    3. Return as a DataFrame

    Returns: violations DataFrame
    """
    # YOUR CODE HERE
    pass

violations_df = scan_all_assets(assets_df, POLICIES)

if violations_df is not None and len(violations_df) > 0:
    print(f"Total violations found: {len(violations_df)}")
    print(f"Assets with violations: {violations_df['asset_id'].nunique()} / {len(assets_df)}")
    print("\nViolations by severity:")
    print(violations_df['severity'].value_counts())
    print("\nViolations by policy:")
    print(violations_df['policy_name'].value_counts())
```

## Exercise 2.1: Data Lineage Tracking

Build a directed graph representing data lineage — how data flows from source systems through transformations to outputs.

**Your Task:** Generate synthetic lineage data, build a NetworkX directed graph, and visualise it.

```python
def build_lineage_graph(n_pipelines=30):
    """Build a data lineage directed graph.

    Steps:
    1. Create source nodes (databases, APIs, files)
    2. Create transform nodes (ETL jobs, ML models, aggregations)
    3. Create output nodes (reports, dashboards, exports)
    4. Connect them in realistic pipeline patterns
    5. Visualize with node coloring by type

    Node types: 'source' (green), 'transform' (orange), 'output' (blue)

    Returns: NetworkX DiGraph
    """
    # YOUR CODE HERE
    pass

G = build_lineage_graph()

if G is not None:
    print(f"Lineage graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")
```

```python
def analyze_lineage(G):
    """Analyze the lineage graph for governance insights.

    Find:
    - Most connected nodes (highest degree): potential bottlenecks
    - Longest paths: most complex data pipelines
    - Source nodes with most downstream dependents

    Print findings.
    """
    # YOUR CODE HERE
    pass

if G is not None:
    analyze_lineage(G)
```

## Exercise 2.2: Access Control Simulation

Model a role-based access control (RBAC) system and simulate access requests to detect violations.

**Your Task:** Define roles with sensitivity permissions, simulate access requests, and analyse violations.

```python
def simulate_rbac(assets_df, n_requests=1000):
    """Simulate RBAC access requests and detect violations.

    Roles and their maximum allowed sensitivity:
    - admin: Restricted (can access everything)
    - analyst: Confidential
    - engineer: Internal
    - viewer: Public

    Sensitivity ordering: Public < Internal < Confidential < Restricted

    Steps:
    1. Define role -> max_sensitivity mapping
    2. Generate random access requests (user, role, asset)
    3. Check if the role is allowed to access that sensitivity level
    4. Compute violation rates by role
    5. Plot a heatmap of violations by role vs sensitivity level

    Returns: access_log DataFrame
    """
    # YOUR CODE HERE
    pass

access_log = simulate_rbac(assets_df)

if access_log is not None:
    print(f"\nTotal requests: {len(access_log)}")
    print(f"Violations: {access_log['violation'].sum()} ({access_log['violation'].mean()*100:.1f}%)")
```

## Exercise 3.1: Governance Dashboard

Build a comprehensive 2×3 compliance dashboard visualising the governance state of your data catalogue.

**Your Task:** Create a 6-panel matplotlib dashboard.

```python
def build_governance_dashboard(assets_df, violations_df):
    """Build a 2x3 governance compliance dashboard.

    Panels:
    1. Top-left: Violations by severity (bar chart)
    2. Top-centre: Violations by category (grouped bar showing category vs severity)
    3. Top-right: Asset compliance rate (pie chart: compliant vs non-compliant)
    4. Bottom-left: Top violated policies (horizontal bar)
    5. Bottom-centre: Sensitivity distribution of violating assets (bar)
    6. Bottom-right: Overall compliance score (large text showing percentage)

    Use figsize=(18, 10) and tight_layout.
    """
    # YOUR CODE HERE
    pass

if violations_df is not None and len(violations_df) > 0:
    build_governance_dashboard(assets_df, violations_df)
```

## Exercise 3.2: Compliance Report Generator

Generate a text-based compliance report with executive summary, critical findings, and per-department breakdown.

**Your Task:** Implement a report generator that produces a formatted compliance report.

```python
def generate_compliance_report(assets_df, violations_df):
    """Generate a formatted compliance report.

    Sections:
    1. Executive Summary: overall compliance score, total assets, total violations
    2. Critical Findings: list all Critical severity violations with asset details
    3. Department Breakdown: for each category, show violation count, top issues
    4. Recommendations: top 3 actionable recommendations based on findings

    Returns: formatted report string
    """
    # YOUR CODE HERE
    pass

if violations_df is not None and len(violations_df) > 0:
    report = generate_compliance_report(assets_df, violations_df)
    if report:
        print(report)
```

## Summary

In this lab, you learned how to:

1. **Define** governance policies as executable rule functions
2. **Scan** data assets against policies to detect compliance violations
3. **Model** data lineage as a directed graph and identify bottlenecks
4. **Simulate** role-based access control to detect unauthorised access
5. **Build** a comprehensive governance compliance dashboard
6. **Generate** automated compliance reports with findings and recommendations

---

*Data Discovery: Harnessing AI, AGI & Vector Databases | AI Elevate*