# Lab 4: Data Governance & Automated Policy Engine

**Data Discovery: Harnessing AI, AGI & Vector Databases - Day 2**

| Duration | Framework | Sections |
|---|---|---|
| 90 min | pandas, numpy, matplotlib, networkx | 6 |

In this lab, you'll explore:
- Defining governance policies as executable rule functions
- Scanning data assets against policies to detect violations
- Modelling data lineage as a directed graph
- Simulating role-based access control (RBAC)
- Building a compliance dashboard
- Generating automated compliance reports

---

## Student Notes & Background

### Why Data Governance Matters

As organisations accumulate thousands of data assets across departments, the question shifts from *"Can we find this data?"* to *"Should anyone be accessing this data, and is it compliant with our policies?"* **Data governance** is the framework of policies, processes, and controls that ensures data is managed responsibly throughout its lifecycle.

Without automated governance, organisations face:
- **Compliance violations** — PII stored without encryption, breaching GDPR, HIPAA, or CCPA
- **Security risks** — unauthorised users accessing sensitive data
- **Data decay** — stale, orphaned, or redundant data consuming resources
- **Audit failures** — inability to demonstrate compliance to regulators

### Key Concepts

#### 1. Policy-as-Code
Traditional governance relies on written policies that humans interpret and enforce manually. **Policy-as-code** encodes governance rules as executable functions that can be run automatically against every data asset in your catalogue. Each policy check takes an asset's metadata and returns a verdict: *compliant* or *violation* with a detail message.

The benefits are significant:
- **Consistency** — every asset is evaluated by the same rules
- **Speed** — hundreds of assets scanned in seconds rather than weeks of manual audit
- **Auditability** — every check produces a traceable record
- **Composability** — new policies can be added without modifying existing ones
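
As a minimal sketch of the idea, a policy can be written as a plain function that takes one asset's metadata and returns a `(violated, detail)` pair. The `check_has_description` rule, the `description` field, and the `POLICIES` registry below are hypothetical names used only for illustration; they are not part of the lab code.

```python
from typing import Callable, Dict, List, Tuple

# An asset is represented here as a plain dict of metadata (illustrative fields).
Asset = dict
PolicyCheck = Callable[[Asset], Tuple[bool, str]]


def check_pii_encrypted(asset: Asset) -> Tuple[bool, str]:
    """Violation if the asset holds PII but is not encrypted."""
    if asset["has_pii"] and not asset["has_encryption"]:
        return True, "PII present but asset is not encrypted"
    return False, ""


def check_has_description(asset: Asset) -> Tuple[bool, str]:
    """Hypothetical extra rule: every asset needs a non-empty description."""
    if not asset.get("description"):
        return True, "Asset has no description"
    return False, ""


# Registry of policies; a new rule is added here without touching the others.
POLICIES: Dict[str, PolicyCheck] = {
    "pii_encryption": check_pii_encrypted,
    "has_description": check_has_description,
}


def evaluate(asset: Asset) -> List[Tuple[str, str]]:
    """Run every registered policy; return (policy_name, detail) per violation."""
    findings = []
    for name, check in POLICIES.items():
        violated, detail = check(asset)
        if violated:
            findings.append((name, detail))
    return findings


asset = {"has_pii": True, "has_encryption": False, "description": "HR payroll export"}
findings = evaluate(asset)
print(findings)
```

Because each check is independent, it can be exercised with a single hand-written dict, which is what makes the rules easy to test in isolation.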

#### 2. Policy Severity Levels
Not all violations are equal. A standard severity classification:

| Severity | Description | Response Time |
|---|---|---|
| **Critical** | Immediate risk of data breach or regulatory penalty | Fix within 24 hours |
| **High** | Significant compliance gap requiring prompt action | Fix within 1 week |
| **Medium** | Best-practice deviation that should be addressed | Fix within 1 month |
| **Low** | Minor improvement opportunity | Address in next review cycle |
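
One way to make the response times in the table above actionable is to turn them into due dates and a work queue. The sketch below is illustrative only: it assumes "1 month" can be approximated as 30 days and treats Low as having no fixed deadline, and names such as `RESPONSE_WINDOW` and `due_date` are hypothetical.

```python
from datetime import date, timedelta

# Response windows taken from the severity table; "1 month" is approximated
# as 30 days, and Low has no fixed deadline (next review cycle).
RESPONSE_WINDOW = {
    "Critical": timedelta(days=1),
    "High": timedelta(weeks=1),
    "Medium": timedelta(days=30),
    "Low": None,
}
SEVERITY_RANK = {"Critical": 0, "High": 1, "Medium": 2, "Low": 3}


def due_date(severity, found_on):
    """Return the remediation deadline, or None when no fixed deadline applies."""
    window = RESPONSE_WINDOW[severity]
    return None if window is None else found_on + window


# Illustrative findings: (asset_id, severity)
findings = [("ASSET-0003", "Medium"), ("ASSET-0001", "Critical"), ("ASSET-0002", "High")]
found_on = date(2024, 1, 1)

# Most severe findings first
queue = sorted(findings, key=lambda f: SEVERITY_RANK[f[1]])
for asset_id, severity in queue:
    print(asset_id, severity, due_date(severity, found_on))
```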

#### 3. Data Lineage
**Data lineage** tracks how data flows through an organisation: from source systems (databases, APIs, files), through transformations (ETL jobs, ML pipelines, aggregations), to outputs (dashboards, reports, exports). Lineage is modelled as a directed graph, ideally a **directed acyclic graph (DAG)**, although real or randomly generated lineage can contain cycles that the analysis must handle. In this graph:
- **Nodes** represent data assets or processing steps
- **Edges** represent data flow from upstream to downstream

Lineage analysis reveals:
- **Bottleneck nodes** — transforms that many pipelines depend on (high failure impact)
- **Long dependency chains** — complex pipelines that are harder to debug and audit
- **Blast radius** — if a source system fails, which downstream outputs are affected?
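
The three analyses above can be sketched on a toy graph with NetworkX. The node names are invented for illustration; the calls used (`nx.descendants`, `nx.dag_longest_path`, `nx.betweenness_centrality`) are standard NetworkX functions.

```python
import networkx as nx

# Toy lineage graph; node names are illustrative only.
G = nx.DiGraph()
G.add_edges_from([
    ("crm_db", "etl_clean"),
    ("erp_db", "etl_clean"),
    ("etl_clean", "aggregate"),
    ("aggregate", "finance_dashboard"),
    ("etl_clean", "ml_training_set"),
])

# Blast radius: everything reachable downstream of a failing source.
blast_radius = nx.descendants(G, "crm_db")
print(sorted(blast_radius))

# Longest dependency chain (only defined when the graph is acyclic).
if nx.is_directed_acyclic_graph(G):
    print(" -> ".join(nx.dag_longest_path(G)))

# Bottleneck candidate: the node lying on the most shortest paths.
centrality = nx.betweenness_centrality(G)
bottleneck = max(centrality, key=centrality.get)
print(bottleneck)
```

Checking `nx.is_directed_acyclic_graph` first avoids the exception that `nx.dag_longest_path` raises on a graph with cycles.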

#### 4. Role-Based Access Control (RBAC)
**RBAC** restricts data access based on a user's role rather than their individual identity. A typical enterprise model:

| Role | Access Level | Typical Users |
|---|---|---|
| **Admin** | All sensitivity levels | Data platform team, CTO |
| **Analyst** | Up to Confidential | Data analysts, business intelligence |
| **Engineer** | Up to Internal | Software engineers, DevOps |
| **Viewer** | Public only | General staff, external partners |

A **violation** occurs when a user in a lower-privilege role accesses data above their clearance level. Monitoring access patterns helps detect both policy gaps and potential security incidents.
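
Because the sensitivity levels form an ordered hierarchy, the check can be sketched as a comparison of positions in that order. This is one possible formulation, assuming each role has a single maximum level as in the table above; `LEVELS`, `ROLE_CLEARANCE`, and `is_allowed` are illustrative names.

```python
# Sensitivity levels in increasing order of restriction.
LEVELS = ["Public", "Internal", "Confidential", "Restricted"]

# Highest level each role may read, following the role table.
ROLE_CLEARANCE = {
    "admin": "Restricted",
    "analyst": "Confidential",
    "engineer": "Internal",
    "viewer": "Public",
}


def is_allowed(role: str, sensitivity: str) -> bool:
    """True when the asset's level is at or below the role's clearance."""
    return LEVELS.index(sensitivity) <= LEVELS.index(ROLE_CLEARANCE[role])


print(is_allowed("analyst", "Confidential"))   # within clearance
print(is_allowed("engineer", "Confidential"))  # above clearance
```

An explicit allow-list per role is an equally valid design and is easier to extend when roles do not fit a strict hierarchy.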

#### 5. Compliance Dashboards
A **compliance dashboard** provides at-a-glance visibility into your governance posture. Effective dashboards show:
- Overall compliance score (percentage of clean assets)
- Violations broken down by severity, category, and policy
- Trends over time (is compliance improving or degrading?)
- Sensitivity distribution of violating assets
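
The first two items can be computed directly from a violations table. The sketch below uses a small hand-made table (the asset IDs and rows are illustrative) and assumes the score is defined as the share of assets with no violation at all.

```python
import pandas as pd

# Toy inputs: five assets, three violations spread over two of them.
asset_ids = ["A1", "A2", "A3", "A4", "A5"]
violations = pd.DataFrame({
    "asset_id": ["A1", "A1", "A3"],
    "severity": ["Critical", "Medium", "High"],
})

# An asset is "clean" when it appears in no violation row.
n_violating = violations["asset_id"].nunique()
score = (len(asset_ids) - n_violating) / len(asset_ids) * 100
print(f"Compliance score: {score:.1f}%")

# Violations per severity, in a fixed display order.
order = ["Critical", "High", "Medium", "Low"]
by_severity = violations["severity"].value_counts().reindex(order, fill_value=0)
print(by_severity.to_dict())
```

Counting distinct assets rather than rows matters here: an asset with several violations still counts only once against the score.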

#### 6. Compliance Reports
Automated **compliance reports** translate raw violation data into actionable documents for different audiences:
- **Executive summary** for leadership (score, headline numbers, rating)
- **Critical findings** for the security team (specific assets and violations)
- **Department breakdown** for data stewards (their team's compliance posture)
- **Recommendations** for the governance committee (prioritised remediation actions)

### What You'll Build

In this lab, you will:
1. **Define** 8 governance policy check functions covering PII encryption, retention periods, stale data, owner assignment, access control, sensitivity review, classification, and backup compliance
2. **Build** an automated scanner that runs all policies against 300 synthetic data assets and collects violations into a structured DataFrame
3. **Model** data lineage as a NetworkX directed graph with source, transform, and output nodes, then analyse it for bottlenecks and long dependency chains
4. **Simulate** 2,000 RBAC access requests across 4 roles and visualise denial rates as a heatmap
5. **Create** a 6-panel governance dashboard with matplotlib showing severity breakdowns, compliance rates, and an overall score
6. **Generate** a formatted compliance report with executive summary, critical findings, department breakdown, and actionable recommendations

### Prerequisites
- Familiarity with pandas DataFrames, numpy, and matplotlib
- Understanding of set operations (for access control checks)
- Concepts from Labs 1-2: data asset metadata, sensitivity levels, PII detection

### Tips
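- Each policy function follows the same signature, `def check_policy(asset) -> tuple[bool, str]`, and returns `(violated, detail)`, where `violated` is `True` when the asset breaks the policy; this makes the functions composable and testable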
- The synthetic data includes deliberate governance gaps (unencrypted PII, missing owners, stale assets) so you will find violations
- For the lineage graph, `networkx` provides powerful analysis functions like `nx.dag_longest_path()`, `nx.descendants()`, and `nx.betweenness_centrality()`
- When building the dashboard, use `plt.tight_layout()` to prevent label overlap in the 2×3 grid

---

## Setup

First, let's import the necessary libraries.

In [None]:
# Core libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import networkx as nx
import json
from collections import defaultdict
from datetime import datetime, timedelta

# Settings
%matplotlib inline
plt.style.use('seaborn-v0_8-whitegrid')

print("Libraries loaded successfully!")

## Part 1: Generate Synthetic Data Assets

We'll create 300 data assets with governance-relevant metadata including PII flags, encryption status, retention periods, and access patterns.

In [None]:
np.random.seed(42)

categories = ['HR', 'Finance', 'Marketing', 'Engineering', 'Legal']
owners = ['alice', 'bob', 'carol', 'dave', 'eve', None]
sensitivity_levels = ['Public', 'Internal', 'Confidential', 'Restricted']

n_assets = 300
assets = []

all_users = ['alice', 'bob', 'carol', 'dave', 'eve', 'frank', 'grace', 'henry', 'iris', 'jack']

for i in range(n_assets):
    cat = np.random.choice(categories)
    sens = np.random.choice(sensitivity_levels, p=[0.15, 0.30, 0.30, 0.25])
    has_pii = cat in ['HR', 'Legal'] or (cat == 'Finance' and np.random.random() < 0.5)
    has_encryption = (sens in ['Confidential', 'Restricted'] and np.random.random() < 0.7) or np.random.random() < 0.2
    
    # Determine approved users based on sensitivity
    if sens == 'Restricted':
        n_approved = np.random.randint(1, 4)
    elif sens == 'Confidential':
        n_approved = np.random.randint(2, 6)
    else:
        n_approved = np.random.randint(3, 10)
    approved = list(np.random.choice(all_users, size=n_approved, replace=False))
    
    # Actual users (sometimes includes unauthorized)
    n_actual = np.random.randint(1, min(n_approved + 3, len(all_users)))
    actual = list(np.random.choice(all_users, size=n_actual, replace=False))
    
    assets.append({
        'asset_id': f'ASSET-{i+1:04d}',
        'name': f'{cat.lower()}_asset_{i+1:04d}',
        'category': cat,
        'owner': np.random.choice(owners, p=[0.2, 0.2, 0.2, 0.2, 0.15, 0.05]),
        'sensitivity': sens,
        'has_pii': has_pii,
        'has_encryption': has_encryption,
        'last_access_days_ago': np.random.randint(0, 400),
        'retention_days': np.random.choice([90, 180, 365, 730, 1095]),
        'days_since_creation': np.random.randint(30, 1200),
        'access_count_30d': np.random.randint(0, 200),
        'has_backup': np.random.random() < 0.6,
        'approved_users': approved,
        'actual_users': actual,
    })

assets_df = pd.DataFrame(assets)
print(f"Generated {len(assets_df)} data assets")
print(f"\nSensitivity distribution:")
print(assets_df['sensitivity'].value_counts())
print(f"\nPII assets: {assets_df['has_pii'].sum()} ({assets_df['has_pii'].mean()*100:.1f}%)")
print(f"Encrypted:  {assets_df['has_encryption'].sum()} ({assets_df['has_encryption'].mean()*100:.1f}%)")
assets_df.head()

## Section 1.1: Governance Policies

The code below defines 8 governance policy check functions as executable rules. Each function takes an asset's metadata and returns a `(violated, detail)` pair indicating whether the asset violates the policy. This "policy-as-code" approach ensures every asset is evaluated consistently and automatically.

In [None]:
def check_pii_encryption(asset):
    """POL-001: PII data must be encrypted."""
    if asset['has_pii'] and not asset['has_encryption']:
        return True, f"Asset contains PII but is NOT encrypted (sensitivity={asset['sensitivity']})"
    return False, ""


def check_retention_compliance(asset):
    """POL-002: Assets older than their retention period must be flagged."""
    if asset['days_since_creation'] > asset['retention_days']:
        over = asset['days_since_creation'] - asset['retention_days']
        return True, f"Asset is {over} days past its {asset['retention_days']}-day retention period"
    return False, ""


def check_stale_data(asset):
    """POL-003: Assets not accessed in 180+ days should be reviewed."""
    if asset['last_access_days_ago'] >= 180:
        return True, f"Asset last accessed {asset['last_access_days_ago']} days ago (stale threshold: 180)"
    return False, ""


def check_owner_assigned(asset):
    """POL-004: Every asset must have an assigned owner."""
    if asset['owner'] is None or (isinstance(asset['owner'], float) and np.isnan(asset['owner'])):
        return True, "Asset has no owner assigned"
    return False, ""


def check_access_control(asset):
    """POL-005: Only approved users should access an asset."""
    approved = set(asset['approved_users'])
    actual = set(asset['actual_users'])
    unauthorized = actual - approved
    if unauthorized:
        return True, f"{len(unauthorized)} unauthorized user(s): {', '.join(sorted(unauthorized))}"
    return False, ""


def check_sensitivity_review(asset):
    """POL-006: Restricted assets accessed by >5 users need review."""
    if asset['sensitivity'] == 'Restricted' and len(asset['actual_users']) > 5:
        return True, (f"Restricted asset accessed by {len(asset['actual_users'])} users "
                      f"(max recommended: 5)")
    return False, ""


def check_classification(asset):
    """POL-007: All assets must have a sensitivity classification."""
    if not asset['sensitivity'] or asset['sensitivity'] == '':
        return True, "Asset has no sensitivity classification"
    return False, ""


def check_backup_compliance(asset):
    """POL-008: Confidential and Restricted assets must have backups."""
    if asset['sensitivity'] in ('Confidential', 'Restricted') and not asset['has_backup']:
        return True, f"No backup for {asset['sensitivity']} asset"
    return False, ""


POLICY_FUNCTIONS = {
    'check_pii_encryption':      check_pii_encryption,
    'check_retention_compliance': check_retention_compliance,
    'check_stale_data':          check_stale_data,
    'check_owner_assigned':      check_owner_assigned,
    'check_access_control':      check_access_control,
    'check_sensitivity_review':  check_sensitivity_review,
    'check_classification':      check_classification,
    'check_backup_compliance':   check_backup_compliance,
}

print(f"Defined {len(POLICY_FUNCTIONS)} policy check functions:")
for name, fn in POLICY_FUNCTIONS.items():
    print(f"  - {name:35s}  {fn.__doc__}")

### Analysis Questions

1. How would you prioritise remediation across Critical, High, and Medium violations?
2. Which policy would be hardest to implement in a real enterprise? Why?

## Section 1.2: Automated Policy Scanning

The code below runs every policy against every data asset, collecting all violations into a structured DataFrame. This automated scanning can process hundreds of assets in seconds — compared to weeks of manual audit.

In [None]:
# Build the policies metadata table (links policy IDs to check functions)
policies_df = pd.DataFrame([
    {'policy_id': 'POL-001', 'name': 'PII Encryption', 'description': 'Assets containing PII must be encrypted', 'check_function_name': 'check_pii_encryption', 'severity': 'Critical'},
    {'policy_id': 'POL-002', 'name': 'Retention Compliance', 'description': 'Assets past retention must be flagged', 'check_function_name': 'check_retention_compliance', 'severity': 'High'},
    {'policy_id': 'POL-003', 'name': 'Stale Data', 'description': 'Assets not accessed in 180+ days need review', 'check_function_name': 'check_stale_data', 'severity': 'Medium'},
    {'policy_id': 'POL-004', 'name': 'Owner Assigned', 'description': 'Every asset must have an owner', 'check_function_name': 'check_owner_assigned', 'severity': 'High'},
    {'policy_id': 'POL-005', 'name': 'Access Control', 'description': 'Only approved users should access assets', 'check_function_name': 'check_access_control', 'severity': 'Critical'},
    {'policy_id': 'POL-006', 'name': 'Sensitivity Review', 'description': 'Restricted assets with >5 users need review', 'check_function_name': 'check_sensitivity_review', 'severity': 'High'},
    {'policy_id': 'POL-007', 'name': 'Classification Required', 'description': 'All assets must have sensitivity set', 'check_function_name': 'check_classification', 'severity': 'Medium'},
    {'policy_id': 'POL-008', 'name': 'Backup Compliance', 'description': 'Confidential+ assets need backups', 'check_function_name': 'check_backup_compliance', 'severity': 'High'},
])

def run_policy_scan(assets_df, policies_df, policy_functions):
    """Scan all assets against all policies and collect violations."""
    violations = []
    for _, policy in policies_df.iterrows():
        fn = policy_functions[policy['check_function_name']]
        for _, asset in assets_df.iterrows():
            violated, detail = fn(asset)
            if violated:
                violations.append({
                    'asset_id': asset['asset_id'], 'asset_name': asset['name'],
                    'category': asset['category'], 'sensitivity': asset['sensitivity'],
                    'policy_id': policy['policy_id'], 'policy_name': policy['name'],
                    'severity': policy['severity'], 'detail': detail,
                })
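    # A fixed column list keeps the result usable even when no violations are found
    columns = ['asset_id', 'asset_name', 'category', 'sensitivity',
               'policy_id', 'policy_name', 'severity', 'detail']
    return pd.DataFrame(violations, columns=columns)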

violations_df = run_policy_scan(assets_df, policies_df, POLICY_FUNCTIONS)

print(f"Total violations found: {len(violations_df)}")
print(f"Assets with at least one violation: {violations_df['asset_id'].nunique()} / {len(assets_df)}")
print(f"\nViolations by severity:")
print(violations_df['severity'].value_counts())
print(f"\nViolations by policy:")
print(violations_df['policy_name'].value_counts())
print(f"\nViolations by category:")
print(violations_df['category'].value_counts())
print(f"\nSample violations:")
violations_df.head(10)

### Analysis Questions

1. What percentage of assets have at least one violation? Is this realistic for a real organisation?
2. Which department has the most violations? What structural factors cause this?
3. Are certain policies violated together (co-occurrence)? What does this suggest?
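
For the co-occurrence question, one possible approach is to build an asset-by-policy indicator matrix and multiply it by its own transpose. The sketch below uses a toy table with the same `asset_id` and `policy_name` columns the scanner produces; the rows themselves are made up for illustration.

```python
import pandas as pd

# Toy violations table (illustrative rows only).
v = pd.DataFrame({
    "asset_id":    ["A1", "A1", "A2", "A2", "A3"],
    "policy_name": ["PII Encryption", "Owner Assigned",
                    "PII Encryption", "Owner Assigned", "Stale Data"],
})

# Asset x policy indicator matrix (1 = asset violates that policy).
indicator = pd.crosstab(v["asset_id"], v["policy_name"]).clip(upper=1)

# Policy x policy matrix: off-diagonal entries count assets violating both,
# diagonal entries count assets violating that policy at all.
cooccurrence = indicator.T.dot(indicator)
print(cooccurrence)
```

The same three lines should work on `violations_df` in place of `v`, assuming the column names are unchanged.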

## Section 2.1: Data Lineage Tracking

The code below builds a directed graph representing data lineage — how data flows from source systems (databases, APIs) through transformations (ETL jobs, ML pipelines) to outputs (dashboards, reports). The graph is analysed for bottlenecks and long dependency chains.

In [None]:
np.random.seed(123)
G = nx.DiGraph()

source_nodes = ['CRM_Database', 'ERP_System', 'HR_Portal', 'Web_Analytics',
    'IoT_Sensors', 'Email_Server', 'Payment_Gateway', 'Social_API',
    'Survey_Platform', 'Legacy_Mainframe', 'Cloud_Storage', 'Partner_Feed']
transform_nodes = ['ETL_Pipeline_A', 'ETL_Pipeline_B', 'Spark_Job_Clean',
    'Spark_Job_Enrich', 'Python_Transform', 'dbt_Model_Stg',
    'dbt_Model_Int', 'dbt_Model_Mart', 'ML_Feature_Eng',
    'Anonymiser', 'Data_Quality_Check', 'Aggregator',
    'Dedup_Service', 'Schema_Validator', 'Encryption_Layer']
output_nodes = ['DW_Finance_Mart', 'DW_HR_Mart', 'DW_Marketing_Mart',
    'DW_Engineering_Mart', 'DW_Legal_Mart', 'BI_Dashboard',
    'ML_Training_Set', 'Customer_360', 'Compliance_Report',
    'Executive_Summary', 'Data_Catalogue', 'Archive_S3']

for s in source_nodes: G.add_node(s, node_type='source')
for t in transform_nodes: G.add_node(t, node_type='transform')
for o in output_nodes: G.add_node(o, node_type='output')

for _ in range(50):
    src = np.random.choice(source_nodes)
    chain = list(np.random.choice(transform_nodes, size=np.random.randint(1,4), replace=False))
    out = np.random.choice(output_nodes)
    G.add_edge(src, chain[0])
    for j in range(len(chain)-1): G.add_edge(chain[j], chain[j+1])
    G.add_edge(chain[-1], out)

print(f"Lineage graph: {G.number_of_nodes()} nodes, {G.number_of_edges()} edges")

color_map = {'source': '#3b82f6', 'transform': '#f59e0b', 'output': '#10b981'}
node_colors = [color_map[G.nodes[n].get('node_type','transform')] for n in G.nodes]

fig, ax = plt.subplots(figsize=(18, 12))
pos = nx.spring_layout(G, seed=42, k=1.8)
nx.draw_networkx_nodes(G, pos, node_color=node_colors, node_size=600, alpha=0.9, ax=ax)
nx.draw_networkx_labels(G, pos, font_size=6, font_weight='bold', ax=ax)
nx.draw_networkx_edges(G, pos, edge_color='#94a3b8', arrows=True, arrowsize=15, width=1.2, alpha=0.7, ax=ax)

from matplotlib.lines import Line2D
legend_elements = [
    Line2D([0],[0], marker='o', color='w', markerfacecolor='#3b82f6', markersize=12, label='Source'),
    Line2D([0],[0], marker='o', color='w', markerfacecolor='#f59e0b', markersize=12, label='Transform'),
    Line2D([0],[0], marker='o', color='w', markerfacecolor='#10b981', markersize=12, label='Output')]
ax.legend(handles=legend_elements, loc='upper left', fontsize=11)
ax.set_title('Data Lineage Graph', fontsize=16, fontweight='bold')
plt.tight_layout(); plt.show()

print("\n--- Lineage Analysis ---")
try:
    longest_path = nx.dag_longest_path(G)
    print(f"Longest lineage path ({len(longest_path)} nodes):")
    print(f"  {' -> '.join(longest_path)}")
except nx.NetworkXUnfeasible:
    print("Graph contains cycles. Finding longest simple path...")
    max_path = []
    for s in source_nodes:
        for o in output_nodes:
            try:
                for p in nx.all_simple_paths(G, s, o):
                    if len(p) > len(max_path): max_path = p
            except nx.NetworkXError: pass
    if max_path: print(f"Longest path ({len(max_path)} nodes): {' -> '.join(max_path)}")

degree_sorted = sorted(G.degree(), key=lambda x: x[1], reverse=True)[:10]
print(f"\nTop 10 most-connected nodes:")
for node, deg in degree_sorted:
    print(f"  {node:30s}  type={G.nodes[node].get('node_type','?'):10s}  degree={deg}")

betweenness = nx.betweenness_centrality(G)
top_b = sorted(betweenness.items(), key=lambda x: x[1], reverse=True)[:5]
print(f"\nTop 5 bottleneck nodes (betweenness centrality):")
for node, bc in top_b:
    print(f"  {node:30s}  type={G.nodes[node].get('node_type','?'):10s}  centrality={bc:.4f}")


### Analysis Questions

1. Which transform nodes are bottlenecks (highest betweenness centrality)? What's the risk if they fail?
2. What's the longest dependency chain? How does this affect debugging and auditing?
3. If a source database goes offline, how many downstream outputs are affected?

## Section 2.2: RBAC Access Control

The code below simulates a role-based access control system with four roles (admin, analyst, engineer, viewer) and 2,000 access requests. Each role has a maximum sensitivity level it can access, and a request for an asset above that level is denied. The denial rates are visualised as a heatmap showing which role/sensitivity combinations are blocked.

In [None]:
np.random.seed(99)
roles = {
    'admin':    {'allowed': ['Public', 'Internal', 'Confidential', 'Restricted']},
    'analyst':  {'allowed': ['Public', 'Internal', 'Confidential']},
    'engineer': {'allowed': ['Public', 'Internal']},
    'viewer':   {'allowed': ['Public']},
}
user_roles = {}
for user in all_users:
    user_roles[user] = np.random.choice(list(roles.keys()), p=[0.10, 0.30, 0.35, 0.25])

print("User role assignments:")
for user, role in sorted(user_roles.items()):
    print(f"  {user:10s} -> {role}")

access_log = []
for _ in range(2000):
    user = np.random.choice(all_users)
    asset = assets_df.iloc[np.random.randint(0, len(assets_df))]
    role = user_roles[user]
    granted = asset['sensitivity'] in roles[role]['allowed']
    access_log.append({'user': user, 'role': role, 'asset_id': asset['asset_id'],
        'sensitivity': asset['sensitivity'], 'category': asset['category'], 'granted': granted})

access_df = pd.DataFrame(access_log)
print(f"\nSimulated {len(access_df)} access requests")
print(f"Granted: {access_df['granted'].sum()}  |  Denied: {(~access_df['granted']).sum()}")
print(f"Overall denial rate: {(~access_df['granted']).mean():.1%}")

denial_matrix = access_df.groupby(['role','sensitivity'])['granted'].apply(
    lambda x: (~x).mean()).unstack(fill_value=0)
role_order = ['admin', 'analyst', 'engineer', 'viewer']
sens_order = ['Public', 'Internal', 'Confidential', 'Restricted']
denial_matrix = denial_matrix.reindex(index=role_order, columns=sens_order, fill_value=0)
print(f"\nDenial rate matrix:")
print(denial_matrix.round(3))

fig, ax = plt.subplots(figsize=(10, 6))
im = ax.imshow(denial_matrix.values, cmap='YlOrRd', aspect='auto', vmin=0, vmax=1)
ax.set_xticks(range(4)); ax.set_xticklabels(sens_order, fontsize=12)
ax.set_yticks(range(4)); ax.set_yticklabels(role_order, fontsize=12)
for i in range(4):
    for j in range(4):
        val = denial_matrix.values[i, j]
        ax.text(j, i, f'{val:.0%}', ha='center', va='center',
                fontsize=14, fontweight='bold', color='white' if val > 0.5 else 'black')
plt.colorbar(im, ax=ax, label='Denial Rate')
ax.set_title('Access Denial Rate: Role vs Sensitivity Level', fontsize=14, fontweight='bold')
ax.set_xlabel('Asset Sensitivity', fontsize=12); ax.set_ylabel('User Role', fontsize=12)
plt.tight_layout(); plt.show()

print("\nPer-role access summary:")
print(access_df.groupby('role').agg(
    total=('granted','count'), granted=('granted','sum'),
    denied=('granted', lambda x: (~x).sum()),
    denial_rate=('granted', lambda x: f"{(~x).mean():.1%}")))

### Analysis Questions

1. Which role has the highest access denial rate? Is the access model too restrictive or too permissive?
2. Look at the heatmap — which role/sensitivity combination causes the most violations?
3. How would you redesign the role hierarchy to reduce violations while maintaining security?

## Section 3.1: Governance Dashboard

The code below creates a 6-panel governance dashboard showing: violations by severity, violations by category, compliance trend, top violated policies, sensitivity distribution of violating assets, and an overall compliance score gauge.

In [None]:
violating_ids = violations_df['asset_id'].unique()
n_compliant = len(assets_df) - len(violating_ids)
compliance_pct = n_compliant / len(assets_df) * 100

fig, axes = plt.subplots(2, 3, figsize=(20, 12))

# 1. Violations by severity
ax = axes[0, 0]
sev_order = ['Critical', 'High', 'Medium', 'Low']
sev_colors = {'Critical': '#ef4444', 'High': '#f59e0b', 'Medium': '#3b82f6', 'Low': '#10b981'}
sev_counts = violations_df['severity'].value_counts().reindex(sev_order, fill_value=0)
bars = ax.bar(sev_counts.index, sev_counts.values, color=[sev_colors[s] for s in sev_counts.index])
for bar, val in zip(bars, sev_counts.values):
    ax.text(bar.get_x()+bar.get_width()/2, bar.get_height()+1, str(val), ha='center', fontweight='bold')
ax.set_title('Violations by Severity', fontsize=12, fontweight='bold'); ax.set_ylabel('Count')

# 2. Violations by category (stacked)
ax = axes[0, 1]
cat_sev = violations_df.groupby(['category','severity']).size().unstack(fill_value=0).reindex(columns=sev_order, fill_value=0)
cat_sev.plot(kind='bar', stacked=True, ax=ax, color=[sev_colors[s] for s in sev_order])
ax.set_title('Violations by Category & Severity', fontsize=12, fontweight='bold')
ax.set_ylabel('Count'); ax.legend(title='Severity', fontsize=8); ax.tick_params(axis='x', rotation=45)

# 3. Compliance trend
ax = axes[0, 2]
months = pd.date_range('2024-01-01', periods=12, freq='MS')
np.random.seed(55)
trend = [min(60 + i*2.5 + np.random.normal(0, 1.5), 98) for i in range(12)]
ax.plot(months, trend, 'o-', color='#3b82f6', linewidth=2, markersize=6)
ax.fill_between(months, trend, alpha=0.15, color='#3b82f6')
ax.axhline(y=85, color='#ef4444', linestyle='--', alpha=0.7, label='Target (85%)')
ax.set_title('Compliance Rate Trend (Simulated)', fontsize=12, fontweight='bold')
ax.set_ylabel('Compliance %'); ax.set_ylim(50, 100); ax.legend(fontsize=9)
ax.tick_params(axis='x', rotation=45)

# 4. Top violated policies
ax = axes[1, 0]
pc = violations_df['policy_name'].value_counts().head(10)
psev = violations_df.drop_duplicates('policy_name').set_index('policy_name')['severity']
bc = [sev_colors.get(psev.get(p, 'Medium'), '#3b82f6') for p in pc.index]
ax.barh(pc.index[::-1], pc.values[::-1], color=bc[::-1])
ax.set_title('Top Violated Policies', fontsize=12, fontweight='bold'); ax.set_xlabel('Violations')

# 5. Sensitivity pie
ax = axes[1, 1]
sd = violations_df.drop_duplicates('asset_id')['sensitivity'].value_counts()
scm = {'Public':'#10b981','Internal':'#3b82f6','Confidential':'#f59e0b','Restricted':'#ef4444'}
ax.pie(sd.values, labels=sd.index, colors=[scm.get(s,'#94a3b8') for s in sd.index],
       autopct='%1.1f%%', startangle=90, textprops={'fontsize': 10})
ax.set_title('Sensitivity of Violating Assets', fontsize=12, fontweight='bold')

# 6. Gauge
ax = axes[1, 2]
theta = np.linspace(np.pi, 0, 100)
ax.plot(np.cos(theta), np.sin(theta), color='#e2e8f0', linewidth=25, solid_capstyle='round')
score = compliance_pct
tf = np.linspace(np.pi, np.pi-(np.pi*score/100), max(int(score), 2))
gc = '#10b981' if score >= 80 else '#f59e0b' if score >= 60 else '#ef4444'
ax.plot(np.cos(tf), np.sin(tf), color=gc, linewidth=25, solid_capstyle='round')
ax.text(0, 0.15, f'{score:.1f}%', ha='center', va='center', fontsize=36, fontweight='bold', color=gc)
ax.text(0, -0.15, 'Overall Compliance', ha='center', va='center', fontsize=12, color='#64748b')
ax.text(0, -0.35, f'{n_compliant} of {len(assets_df)} assets compliant', ha='center', va='center', fontsize=10, color='#94a3b8')
ax.set_xlim(-1.3, 1.3); ax.set_ylim(-0.5, 1.3); ax.set_aspect('equal'); ax.axis('off')
ax.set_title('Compliance Score', fontsize=12, fontweight='bold')

plt.suptitle('Data Governance Dashboard', fontsize=18, fontweight='bold', y=1.01)
plt.tight_layout(); plt.show()

### Analysis Questions

1. What's the overall compliance score? Would you be comfortable presenting this to a regulator?
2. Which single improvement would have the biggest impact on the score?

## Section 3.2: Compliance Report

The code below generates a formatted compliance report with an executive summary, critical findings, per-department breakdown, and actionable recommendations. This is the kind of document that governance teams present to leadership and regulators.

In [None]:
def generate_compliance_report(assets_df, violations_df, policies_df):
    """Generate a formatted text compliance report."""
    total = len(assets_df)
    vids = violations_df['asset_id'].unique()
    n_v = len(vids); n_c = total - n_v
    score = n_c / total * 100
    sev_counts = violations_df['severity'].value_counts()

    L = []
    L.append('=' * 72)
    L.append('        DATA GOVERNANCE COMPLIANCE REPORT')
    L.append(f'        Generated: {datetime.now().strftime("%Y-%m-%d %H:%M:%S")}')
    L.append('=' * 72); L.append('')

    L.append('EXECUTIVE SUMMARY'); L.append('-' * 40)
    L.append(f'  Overall Compliance Score:  {score:.1f}%')
    L.append(f'  Total Assets Scanned:      {total}')
    L.append(f'  Compliant Assets:          {n_c}')
    L.append(f'  Non-Compliant Assets:      {n_v}')
    L.append(f'  Total Violations Found:    {len(violations_df)}')
    L.append(f'  Policies Evaluated:        {len(policies_df)}'); L.append('')

    if score >= 90: rating = 'EXCELLENT - Governance posture is strong.'
    elif score >= 75: rating = 'GOOD - Minor remediation needed.'
    elif score >= 60: rating = 'FAIR - Significant gaps require attention.'
    else: rating = 'POOR - Urgent remediation required.'
    L.append(f'  Rating: {rating}'); L.append('')

    L.append('VIOLATIONS BY SEVERITY'); L.append('-' * 40)
    for sev in ['Critical', 'High', 'Medium', 'Low']:
        cnt = sev_counts.get(sev, 0)
        L.append(f'  {sev:10s}  {cnt:4d}  {"#" * (cnt // 3)}')
    L.append('')

    L.append('CRITICAL FINDINGS'); L.append('-' * 40)
    crit = violations_df[violations_df['severity'] == 'Critical']
    if len(crit) > 0:
        for i, (pol, cnt) in enumerate(crit['policy_name'].value_counts().items(), 1):
            L.append(f'  {i}. {pol}: {cnt} violations')
            for _, r in crit[crit['policy_name']==pol].head(3).iterrows():
                L.append(f'     - {r["asset_id"]} ({r["category"]}): {r["detail"]}')
    else:
        L.append('  No critical violations found.')
    L.append('')

    L.append('RECOMMENDATIONS'); L.append('-' * 40)
    recs = []
    cc = sev_counts.get('Critical', 0); hc = sev_counts.get('High', 0)
    if cc > 0:
        recs.append(f'[URGENT] Address {cc} critical violations immediately, '
                    f'focusing on PII encryption and access control gaps.')
    if hc > 0:
        recs.append(f'[HIGH] Remediate {hc} high-severity violations within 1 week, '
                    f'including retention compliance and owner assignment.')
    ov = violations_df[violations_df['policy_name']=='Owner Assigned']
    if len(ov) > 0:
        recs.append(f'[PROCESS] Assign owners to {len(ov)} orphaned assets. '
                    f'Implement mandatory owner field in asset registration.')
    sv = violations_df[violations_df['policy_name']=='Stale Data']
    if len(sv) > 0:
        recs.append(f'[HYGIENE] Review {len(sv)} stale assets for archival or deletion. '
                    f'Consider automated lifecycle management.')
    recs.append('[ONGOING] Schedule quarterly governance scans and track compliance trends.')
    for i, rec in enumerate(recs, 1):
        L.append(f'  {i}. {rec}')
    L.append('')

    L.append('PER-DEPARTMENT BREAKDOWN'); L.append('-' * 40)
    for cat in sorted(assets_df['category'].unique()):
        ca = assets_df[assets_df['category']==cat]
        cv = violations_df[violations_df['category']==cat]
        cv_n = cv['asset_id'].nunique(); ct = len(ca)
        cc_ = ct - cv_n; cs = cc_/ct*100 if ct > 0 else 100
        L.append(f'  {cat}')
        L.append(f'    Assets:     {ct}')
        L.append(f'    Compliant:  {cc_} ({cs:.1f}%)')
        L.append(f'    Violations: {len(cv)}')
        csev = cv['severity'].value_counts()
        parts = [f'{s}: {csev.get(s,0)}' for s in ['Critical','High','Medium','Low'] if csev.get(s,0)>0]
        if parts: L.append(f'    Breakdown:  {", ".join(parts)}')
        L.append('')

    L.append('=' * 72)
    L.append('  End of Report')
    L.append('  Data Discovery: Harnessing AI, AGI & Vector Databases | AI Elevate')
    L.append('=' * 72)
    return '\n'.join(L)

report = generate_compliance_report(assets_df, violations_df, policies_df)
print(report)

### Analysis Questions

1. Does the executive summary effectively communicate the compliance posture?
2. Are the recommendations actionable? What's missing?
3. Who should receive this report, and how often should it be generated?

## Summary

In this lab, you learned how to:

1. **Define** governance policies as executable rule functions
2. **Scan** data assets against policies to detect compliance violations
3. **Model** data lineage as a directed graph and identify bottlenecks
4. **Simulate** role-based access control to detect unauthorised access
5. **Build** a comprehensive governance compliance dashboard
6. **Generate** automated compliance reports with findings and recommendations

---

*Data Discovery: Harnessing AI, AGI & Vector Databases | AI Elevate*