# Lesson 4 - Exercise 3: Create Materialized View for Fare Analytics

## Learning Objectives

1. Understand when and why to use materialized views
2. Write CREATE MATERIALIZED VIEW syntax
3. Implement REFRESH logic
4. Validate materialized view results

## Prerequisites

- `aws_config.py` with your credentials
- `data/van_transit_trips_postgres.csv` data file
- Completed Exercise 2 (this notebook checks that its tables exist but does not create them)

In [None]:
# ========= SETUP: Imports, Config, and Functions (Run this cell first!) =========

import os
import time
from datetime import datetime
from typing import Dict, Any, List

import pandas as pd
import numpy as np
import boto3

# Load AWS credentials from aws_config.py
import aws_config

# Configuration
AWS_REGION = os.getenv('AWS_REGION')
REDSHIFT_DATABASE = os.getenv('REDSHIFT_DATABASE')
REDSHIFT_WORKGROUP = os.getenv('REDSHIFT_WORKGROUP')
REDSHIFT_SECRET_ARN = os.getenv('REDSHIFT_SECRET_ARN', None)  # Optional

print("Configuration:")
print(f"   Region: {AWS_REGION}")
print(f"   Database: {REDSHIFT_DATABASE}")
print(f"   Workgroup: {REDSHIFT_WORKGROUP}")

# Redshift Data API client
session_boto = boto3.Session(region_name=AWS_REGION)
rsd = session_boto.client("redshift-data", region_name=AWS_REGION)


def _rs_kwargs() -> Dict[str, Any]:
    """Build connection arguments for Redshift Data API."""
    base = dict(Database=REDSHIFT_DATABASE)
    if REDSHIFT_WORKGROUP:
        base["WorkgroupName"] = REDSHIFT_WORKGROUP
        if REDSHIFT_SECRET_ARN:
            base["SecretArn"] = REDSHIFT_SECRET_ARN
    return base


def rs_exec(sql: str, return_results=False, timeout_s=900):
    """Execute SQL on Redshift via the Data API."""
    sql = sql.strip()
    if not sql:
        return None
    
    kwargs = _rs_kwargs()
    kwargs["Sql"] = sql
    
    sid = rsd.execute_statement(**kwargs)["Id"]
    
    start = time.time()
    while True:
        d = rsd.describe_statement(Id=sid)
        if d["Status"] in ("FINISHED", "FAILED", "ABORTED"):
            break
        if time.time() - start > timeout_s:
            raise TimeoutError("Redshift statement timeout")
        time.sleep(0.5)
    
    if d["Status"] != "FINISHED":
        raise RuntimeError(f"Redshift SQL failed: {d.get('Error')}\n---\n{sql}")
    
    if return_results or sql.strip().lower().startswith("select"):
        out, next_token = [], None
        while True:
            args = dict(Id=sid)
            if next_token:
                args["NextToken"] = next_token
            r = rsd.get_statement_result(**args)
            cols = [c["name"] for c in r["ColumnMetadata"]]
            for rec in r["Records"]:
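                # NULL cells arrive as {"isNull": True}; map them to None instead of True
                row = [None if cell.get("isNull") else next(iter(cell.values())) for cell in rec]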
                out.append(dict(zip(cols, row)))
            next_token = r.get("NextToken")
            if not next_token:
                break
        return out
    
    return None


print("\nFunctions defined: rs_exec()")
print("Setup complete!")

---
## Check Prerequisites

This exercise requires the `dw_fact_trips` and `dw_dim_fare_class` tables from Exercise 2. If either is missing, run Exercise 2 first.

In [None]:
# ========= Check if prerequisite tables exist =========

print("Checking prerequisite tables...")
print("=" * 60)

try:
    result = rs_exec("SELECT COUNT(*) AS cnt FROM public.dw_fact_trips;")
    print(f"   dw_fact_trips: {result[0]['cnt']} rows")
    
    result = rs_exec("SELECT COUNT(*) AS cnt FROM public.dw_dim_fare_class;")
    print(f"   dw_dim_fare_class: {result[0]['cnt']} rows")
    
    print("\n   Prerequisites met! Ready to create materialized view.")
except Exception as e:
    print(f"   ERROR: {e}")
    print("\n   Please run Exercise 2 first to create the required tables.")

---
## Step 1: Create the Materialized View

A materialized view stores the precomputed result of a query, so reads return the stored rows instead of re-running the underlying joins and aggregations.

### When to Use Materialized Views

| Use Case | Benefit |
|----------|--------|
| Frequently run aggregations | Avoid repeated computation |
| Complex joins | Pre-join tables for faster queries |
| Dashboard queries | Faster, more consistent response times |
| Data that changes infrequently | Refresh only when needed |
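
To make the first two rows of the table concrete, here is a minimal local sketch. It uses Python's built-in `sqlite3` module as a stand-in: SQLite has no materialized views, so a summary table built with `CREATE TABLE ... AS SELECT` plays that role. The `trips` and `fare_summary` tables, their columns, and the sample rows are hypothetical and are not part of the exercise data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (fare_class TEXT, fare REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [("adult", 3.0), ("adult", 4.0), ("senior", 2.0)],  # made-up rows
)

# The aggregation we would otherwise re-run on every read
AGG_SQL = """
SELECT fare_class, COUNT(*) AS trip_count, SUM(fare) AS total_revenue
FROM trips
GROUP BY fare_class
"""

# Compute the aggregation once and store the result (stand-in for a materialized view)
conn.execute(f"CREATE TABLE fare_summary AS {AGG_SQL}")

# Later reads scan the small stored table, not the base table
stored = conn.execute(
    "SELECT fare_class, trip_count, total_revenue FROM fare_summary ORDER BY fare_class"
).fetchall()

# Recomputing from the base table gives the same rows, at the cost of redoing the work
live = conn.execute(f"SELECT * FROM ({AGG_SQL}) ORDER BY fare_class").fetchall()

print(stored)
print(stored == live)
```

The stored and recomputed rows agree only while the base table is unchanged, which is why the last row of the table pairs materialized views with data that changes infrequently.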

**TODO**: Create a materialized view `dw_mv_fare_summary` that aggregates fare metrics by fare_class.

Include these columns:
- `fare_class` (from `dw_dim_fare_class`)
- `trip_count` (`COUNT(*)`)
- `total_revenue_cad` (SUM of `total_fare_cad`)
- `avg_fare_cad` (AVG of `total_fare_cad`)
- `min_fare_cad`, `max_fare_cad` (MIN and MAX of `total_fare_cad`)
- `total_discounts_cad` (SUM of `discount_amount_cad`)
- `avg_discount_rate` (AVG of `discount_rate`)
- `avg_distance_km`, `total_distance_km` (AVG and SUM of `distance_km`)
- `avg_transfers` (AVG of `transfers`)
- `on_time_trips` (COUNT where on_time_arrival = TRUE)
- `on_time_pct` (percentage of on-time trips)
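
The last two columns need a conditional count, which is easy to get wrong. The sketch below shows one common pattern (`SUM(CASE WHEN ...)` divided by `COUNT(*)`) on a toy table, again using `sqlite3` as a local stand-in. The `demo_trips` table and its `on_time` column are hypothetical; in Redshift the condition would reference the real `on_time_arrival` column, and the exact boolean syntax may differ from what SQLite accepts here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite has no BOOLEAN type, so on_time is stored as 0/1 (an assumption of this sketch)
conn.execute("CREATE TABLE demo_trips (fare_class TEXT, on_time INTEGER)")
conn.executemany(
    "INSERT INTO demo_trips VALUES (?, ?)",
    [("adult", 1), ("adult", 1), ("adult", 0), ("adult", 1),
     ("senior", 0), ("senior", 1)],
)

rows = conn.execute("""
SELECT
    fare_class,
    COUNT(*) AS trip_count,
    -- count only the rows that satisfy the condition
    SUM(CASE WHEN on_time = 1 THEN 1 ELSE 0 END) AS on_time_trips,
    -- multiply by 100.0 so the division is not done in integer arithmetic
    ROUND(100.0 * SUM(CASE WHEN on_time = 1 THEN 1 ELSE 0 END) / COUNT(*), 1) AS on_time_pct
FROM demo_trips
GROUP BY fare_class
ORDER BY fare_class
""").fetchall()

for row in rows:
    print(row)
```

Multiplying by `100.0` before dividing matters because dividing one integer count by another would otherwise truncate the result.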

In [None]:
# ========= STEP 1: Create Materialized View =========

print("Creating materialized view...")
print("=" * 60)

# Drop if exists
rs_exec("DROP MATERIALIZED VIEW IF EXISTS public.dw_mv_fare_summary;")
print("   Dropped existing view (if any)")

# TODO: Create the materialized view
rs_exec("""
CREATE MATERIALIZED VIEW public.dw_mv_fare_summary AS
SELECT 
    -- TODO: Select dfc.fare_class from dw_dim_fare_class
    -- TODO: Add COUNT(*) AS trip_count
    -- TODO: Add SUM(f.total_fare_cad) AS total_revenue_cad
    -- TODO: Add AVG(f.total_fare_cad) AS avg_fare_cad
    -- TODO: Add MIN/MAX fare
    -- TODO: Add SUM(f.discount_amount_cad) AS total_discounts_cad
    -- TODO: Add AVG(f.discount_rate) AS avg_discount_rate
    -- TODO: Add AVG/SUM distance_km
    -- TODO: Add AVG(f.transfers) AS avg_transfers
    -- TODO: Add on_time_trips count and on_time_pct
    
FROM public.dw_fact_trips f
JOIN public.dw_dim_fare_class dfc ON f.fare_class_sk = dfc.fare_class_sk
GROUP BY dfc.fare_class;
""")

print("   Materialized view created!")

# Verify
result = rs_exec("SELECT COUNT(*) AS cnt FROM public.dw_mv_fare_summary;")
print(f"   Rows in view: {result[0]['cnt']}")

---
## Step 2: Refresh the Materialized View

A materialized view is a stored snapshot. After the base tables change, it keeps returning the old results until it is refreshed.

**TODO**: Refresh the materialized view using `REFRESH MATERIALIZED VIEW`.
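
Why a refresh is needed can be seen in a small local simulation. This is a sketch only: it uses `sqlite3`, where a summary table rebuilt by a helper function stands in for the materialized view, and the rebuild corresponds to `REFRESH MATERIALIZED VIEW` in Redshift. The table names, the `refresh_summary` helper, and the data are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (fare_class TEXT, fare REAL)")
conn.executemany("INSERT INTO trips VALUES (?, ?)", [("adult", 3.0), ("senior", 2.0)])


def refresh_summary(conn):
    """Rebuild the stand-in summary table from the current base table."""
    conn.execute("DROP TABLE IF EXISTS fare_summary")
    conn.execute(
        "CREATE TABLE fare_summary AS "
        "SELECT fare_class, COUNT(*) AS trip_count FROM trips GROUP BY fare_class"
    )


def summary_total(conn):
    """Total trips according to the stored summary."""
    return conn.execute("SELECT SUM(trip_count) FROM fare_summary").fetchone()[0]


refresh_summary(conn)
before = summary_total(conn)

# New data lands in the base table; the stored summary does not see it
conn.execute("INSERT INTO trips VALUES ('adult', 3.5)")
stale = summary_total(conn)

# Only after the rebuild does the summary include the new trip
refresh_summary(conn)
fresh = summary_total(conn)

print(before, stale, fresh)
```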

In [None]:
# ========= STEP 2: Refresh Materialized View =========

print("Refreshing materialized view...")
print("=" * 60)

# TODO: Write the REFRESH MATERIALIZED VIEW statement
rs_exec("-- TODO: REFRESH MATERIALIZED VIEW public.dw_mv_fare_summary;")

print("   Materialized view refreshed!")

---
## Step 3: Query the Materialized View

In [None]:
# ========= STEP 3: Query Materialized View =========

print("Fare Summary by Fare Class:")
print("=" * 60)

results = rs_exec("""
SELECT 
    fare_class,
    trip_count,
    ROUND(total_revenue_cad, 2) AS total_revenue,
    ROUND(avg_fare_cad, 2) AS avg_fare,
    ROUND(avg_distance_km, 2) AS avg_distance,
    on_time_pct
FROM public.dw_mv_fare_summary
ORDER BY total_revenue_cad DESC;
""")

display(pd.DataFrame(results))

In [None]:
# ========= Additional Analytics =========

print("Additional Analytics")
print("=" * 60)

print("\n1. Revenue Share by Fare Class:")
results = rs_exec("""
SELECT 
    fare_class,
    ROUND(total_revenue_cad, 2) AS revenue,
    ROUND(total_revenue_cad / SUM(total_revenue_cad) OVER() * 100, 1) AS pct_of_total
FROM public.dw_mv_fare_summary
ORDER BY total_revenue_cad DESC;
""")
display(pd.DataFrame(results))

print("\n2. Discount Analysis:")
results = rs_exec("""
SELECT 
    fare_class,
    trip_count,
    ROUND(total_discounts_cad, 2) AS total_discounts,
    ROUND(avg_discount_rate * 100, 1) AS avg_discount_pct
FROM public.dw_mv_fare_summary
ORDER BY total_discounts_cad DESC;
""")
display(pd.DataFrame(results))

---
## Step 4: Validate Materialized View

Ensure the materialized view matches results from the base tables.

**TODO**: Write validation queries to compare:
1. Total trip count from MV vs base tables
2. Total revenue from MV vs base tables
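
One way to structure the comparison is a small helper that checks trip counts exactly and revenue within a tolerance, since summed decimals can come back as strings or floats depending on the column type. This is a sketch under assumptions: `totals_match` is a hypothetical helper, the `total_trips` and `total_revenue` keys mirror the aliases the validation cell expects, and the sample rows are made up.

```python
import math


def totals_match(mv_row, base_row, abs_tol=0.01):
    """Return True if trip counts are equal and revenue agrees to within abs_tol."""
    trips_equal = int(mv_row["total_trips"]) == int(base_row["total_trips"])
    revenue_close = math.isclose(
        float(mv_row["total_revenue"]),
        float(base_row["total_revenue"]),
        abs_tol=abs_tol,  # one cent by default
    )
    return trips_equal and revenue_close


# Made-up rows; values may arrive as numbers or strings
mv_row = {"total_trips": 1000, "total_revenue": "2750.50"}
base_row = {"total_trips": "1000", "total_revenue": 2750.5}
stale_row = {"total_trips": 990, "total_revenue": "2720.00"}

fresh_ok = totals_match(mv_row, base_row)
stale_ok = totals_match(stale_row, base_row)
print(fresh_ok, stale_ok)
```

Comparing revenue with a tolerance avoids false mismatches from floating-point rounding while still catching a view that is missing rows.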

In [None]:
# ========= STEP 4: Validate MV matches base tables =========

print("Validating materialized view against base tables...")
print("=" * 60)

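# TODO: Query MV for total trips and revenue
# return_results=True is needed because the statement begins with a comment, not SELECT
mv_result = rs_exec("""
-- TODO: SELECT SUM(trip_count) AS total_trips, SUM(total_revenue_cad) AS total_revenue
-- FROM public.dw_mv_fare_summary

""", return_results=True)

# TODO: Query base tables directly for comparison
base_result = rs_exec("""
-- TODO: SELECT COUNT(*) AS total_trips, SUM(f.total_fare_cad) AS total_revenue
-- FROM public.dw_fact_trips f
-- JOIN public.dw_dim_fare_class to match MV logic

""", return_results=True)

print(f"\nMaterialized View: {mv_result[0]['total_trips']} trips, ${mv_result[0]['total_revenue']} revenue")
print(f"Base Tables:       {base_result[0]['total_trips']} trips, ${base_result[0]['total_revenue']} revenue")

# Compare both totals: trip counts exactly, revenue to within one cent
trips_match = str(mv_result[0]['total_trips']) == str(base_result[0]['total_trips'])
revenue_match = abs(float(mv_result[0]['total_revenue']) - float(base_result[0]['total_revenue'])) < 0.01

if trips_match and revenue_match:
    print("\nVALIDATION PASSED!")
else:
    print("\nWARNING: Mismatch - consider refreshing the view.")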

---
## Summary

### Lesson 4 - Exercise 3 Complete

You learned:

1. **When to use materialized views** (frequent, expensive queries)
2. **CREATE MATERIALIZED VIEW** syntax
3. **REFRESH MATERIALIZED VIEW** to update data
4. **How to validate MV results** against base tables
5. **Basic MV maintenance**: refresh after the base tables change, then re-validate