# 04_Cost_Aware_Decision_Logic

In [0]:
# ============================================
# 04_Cost_Aware_Decision_Logic.ipynb
# --------------------------------------------
# Purpose:
#   Transform ML risk predictions into
#   cost-aware investigation decisions
#   under limited operational capacity.
#
# Key Innovation:
#   Optimize expected financial savings,
#   NOT just prediction accuracy.
#
# Business Problem:
#   - Limited investigators (capacity constraint)
#   - Each case has different potential loss
#   - Each investigation has a cost
#   - Goal: Maximize value captured within capacity
#
# Evaluation Alignment:
#   - AI Innovation & Insight Generation
#   - Business Impact & Practical Use
#   - Database <-> AI Workflow
#
# Output:
#   - gold_decision_recommendations table
#   - Actionable investigate/don't decisions
# ============================================

### Imports & Configuration

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, lit, when, sum as spark_sum,
    avg, count, round as spark_round, row_number
)
from pyspark.sql.window import Window
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

spark = SparkSession.getActiveSession()

CATALOG = "cost_aware_capstone"
SCHEMA = "risk_decisioning"

In [0]:

# Build fully qualified table names from the catalog and schema defined above
SILVER_TABLE = f"{CATALOG}.{SCHEMA}.silver_cost_aware_features"

RISK_TABLE = f"{CATALOG}.{SCHEMA}.ml_risk_predictions"

GOLD_TABLE = f"{CATALOG}.{SCHEMA}.gold_decision_recommendations"

---
## Business Constraints Definition

### Operational Reality at NexGen Financial Services (fictional)

| Constraint | Value | Rationale |
|------------|-------|-----------|
| Daily Investigation Capacity | 50 | Limited fraud analyst headcount |
| Avg Investigation Time | 45 mins | Manual review process |
| Investigation Cost | $75-150 | Labor + system costs |

**The Core Challenge**: We receive 500+ alerts daily but can only investigate 50. Which 50 should we choose?

In [0]:
# Business constraint: How many cases can we investigate daily?
DAILY_INVESTIGATION_CAPACITY = 50

# Edge case handling: Ensure capacity is valid
if DAILY_INVESTIGATION_CAPACITY <= 0:
    raise ValueError("Investigation capacity must be positive")

print(f"Daily Investigation Capacity: {DAILY_INVESTIGATION_CAPACITY} cases")

### Read Inputs

In [0]:
silver_df = spark.table(SILVER_TABLE)
risk_df = spark.table(RISK_TABLE)

# Edge case: Check for empty tables
silver_count = silver_df.count()
risk_count = risk_df.count()

if silver_count == 0 or risk_count == 0:
    raise ValueError("Input tables are empty. Run previous notebooks first.")

print(f"Silver table records: {silver_count:,}")
print(f"Risk predictions records: {risk_count:,}")

# Join features with predictions
df = (
    silver_df
    .join(risk_df, on="case_id", how="inner")
)

join_count = df.count()
print(f"Joined records: {join_count:,}")

# Edge case: Check for data loss in join
if join_count < min(silver_count, risk_count) * 0.9:
    print("WARNING: Significant data loss in join. Check case_id matching.")

---
## Expected Loss Modeling

### The Mathematical Foundation

**Traditional Approach** (Risk-First):
```
Priority = Risk Probability
Investigate: Top N by risk score
```

**Our Cost-Aware Approach**:
```
Expected Loss if Ignored = P(fraud) × Fraud Loss if Missed
Expected Savings if Investigated = Expected Loss if Ignored - Investigation Cost
Priority = Expected Savings
```

### Why This Matters

| Case | Risk | Potential Loss | Inv. Cost | Expected Savings | Priority |
|------|------|----------------|-----------|------------------|----------|
| A | 90% | $500 | $100 | $350 | Lower |
| B | 50% | $10,000 | $100 | $4,900 | **Higher** |

Case B is less likely to be fraud, but its expected savings are 14 times higher ($4,900 vs. $350), so the cost-aware ranking puts it first.
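
The arithmetic behind the table can be checked with a few lines of plain Python. This is a minimal sketch, not part of the Spark pipeline: the `Case` dataclass and the `expected_savings` helper are hypothetical names introduced only for illustration, and the two cases are the ones from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Case:
    # Hypothetical container mirroring the notebook's column names
    case_id: str
    risk_probability: float      # P(fraud), between 0 and 1
    fraud_loss_if_missed: float  # dollars lost if the fraud is missed
    investigation_cost: float    # dollars spent on the investigation


def expected_savings(case: Case) -> float:
    """Expected loss if ignored minus the cost of investigating."""
    expected_loss_if_ignored = case.risk_probability * case.fraud_loss_if_missed
    return expected_loss_if_ignored - case.investigation_cost


cases = [
    Case("A", 0.90, 500.0, 100.0),
    Case("B", 0.50, 10_000.0, 100.0),
]

savings = {c.case_id: expected_savings(c) for c in cases}
by_savings = sorted(cases, key=expected_savings, reverse=True)
by_risk = sorted(cases, key=lambda c: c.risk_probability, reverse=True)

for case_id, value in savings.items():
    print(f"Case {case_id}: expected savings = ${value:,.2f}")
print("Cost-aware order:", [c.case_id for c in by_savings])
print("Risk-first order:", [c.case_id for c in by_risk])
```

The two orderings disagree on these inputs: ranking by risk puts A first, while ranking by expected savings puts B first.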

In [0]:
# Calculate expected values
decision_df = (
    df
    # Expected loss if we DON'T investigate
    .withColumn(
        "expected_loss_if_ignored",
        col("risk_probability") * col("fraud_loss_if_missed")
    )
    # Net expected savings from investigating
    # (what we save minus what it costs to investigate)
    .withColumn(
        "expected_savings_if_investigated",
        col("expected_loss_if_ignored") - col("investigation_cost")
    )
    # Flag cases whose expected savings are positive
    # (cases at or below zero are not worth the investigation cost)
    .withColumn(
        "worth_investigating",
        col("expected_savings_if_investigated") > 0
    )
)

# Summary statistics
print("EXPECTED VALUE SUMMARY")
print("=" * 50)
decision_df.select(
    spark_round(avg("expected_loss_if_ignored"), 2).alias("avg_expected_loss"),
    spark_round(avg("expected_savings_if_investigated"), 2).alias("avg_expected_savings"),
    spark_sum(when(col("worth_investigating"), 1).otherwise(0)).alias("cases_worth_investigating")
).show()

---
## Optimization: Rank by Financial Impact

### Greedy Optimization

For a single capacity constraint with a linear objective, **greedy selection is optimal**:

1. Rank all cases by expected savings (descending)
2. Select top K cases (K = capacity)
3. This maximizes total expected savings

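**Why this works**: This is a special case of the knapsack problem in which every item has the same "weight" (one investigation slot). With equal weights, swapping any selected case for an unselected one with lower expected savings can only reduce the total, so the top K by expected savings is optimal. The one caveat is that a case with non-positive expected savings should be left uninvestigated even when a slot is free.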
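
The optimality claim can be sanity-checked on a small example by comparing the greedy pick against an exhaustive search. The sketch below is illustrative only: `greedy_select`, `brute_force_best`, and the toy savings values are hypothetical and not part of the pipeline. It also assumes that cases with non-positive expected savings are skipped, which goes slightly beyond the three steps listed above.

```python
from itertools import combinations


def greedy_select(savings, capacity):
    """Return indices of up to `capacity` cases with the highest positive savings."""
    ranked = sorted(range(len(savings)), key=lambda i: savings[i], reverse=True)
    # Assumption: never spend a slot on a case with non-positive expected savings
    return [i for i in ranked[:capacity] if savings[i] > 0]


def brute_force_best(savings, capacity):
    """Best achievable total over every subset of at most `capacity` cases."""
    best = 0.0  # selecting nothing is always allowed
    for k in range(1, capacity + 1):
        for subset in combinations(range(len(savings)), k):
            best = max(best, sum(savings[i] for i in subset))
    return best


# Hypothetical expected savings (dollars) for seven cases
toy_savings = [350.0, 4900.0, -40.0, 120.0, 900.0, -5.0, 60.0]
capacity = 3

chosen = greedy_select(toy_savings, capacity)
greedy_total = sum(toy_savings[i] for i in chosen)
optimal_total = brute_force_best(toy_savings, capacity)

print("Greedy picks indices:", chosen)
print(f"Greedy total:  ${greedy_total:,.2f}")
print(f"Optimal total: ${optimal_total:,.2f}")
```

The exhaustive search is only feasible for tiny inputs. It is here to confirm the greedy result, not as an alternative method.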

In [0]:
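# Rank by expected savings (descending = highest value first).
# case_id breaks ties so the ranking is deterministic across runs.
# Note: a window without partitionBy pulls all rows into one partition,
# which is fine for a few hundred daily alerts but will not scale to very large tables.
window_spec = Window.orderBy(
    col("expected_savings_if_investigated").desc(),
    col("case_id").asc()
)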

ranked_df = (
    decision_df
    .withColumn("priority_rank", row_number().over(window_spec))
)

# Show top 5 highest-value cases
print("TOP 5 HIGHEST-VALUE CASES")
ranked_df.select(
    "case_id", 
    spark_round("risk_probability", 3).alias("risk"),
    spark_round("fraud_loss_if_missed", 2).alias("potential_loss"),
    spark_round("expected_savings_if_investigated", 2).alias("expected_savings"),
    "priority_rank"
).filter(col("priority_rank") <= 5).show()

### Apply Capacity Constraint (Optimization Step)

In [0]:
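# Apply capacity constraint: investigate the top K cases,
# but only those with positive expected savings
final_df = (
    ranked_df
    .withColumn(
        "decision",
        when(
            (col("priority_rank") <= lit(DAILY_INVESTIGATION_CAPACITY))
            & (col("expected_savings_if_investigated") > 0),
            1
        ).otherwise(0)
    )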
    # Add decision explanation for interpretability
    .withColumn(
        "decision_reason",
        when(
            col("decision") == 1,
            "Within capacity, positive expected savings"
        ).when(
            col("expected_savings_if_investigated") <= 0,
            "Negative expected savings"
        ).otherwise(
            "Below capacity threshold"
        )
    )
)

# Decision summary
print("DECISION SUMMARY")
print("=" * 50)
final_df.groupBy("decision").agg(
    count("*").alias("count"),
    spark_round(avg("risk_probability"), 3).alias("avg_risk"),
    spark_round(spark_sum("expected_savings_if_investigated"), 2).alias("total_expected_savings")
).orderBy(col("decision").desc()).show()

### Final Gold output

In [0]:
# Select columns for Gold table
gold_df = final_df.select(
    "case_id",
    "risk_probability",
    "investigation_cost",
    "fraud_loss_if_missed",
    "expected_loss_if_ignored",
    "expected_savings_if_investigated",
    "priority_rank",
    "decision",
    "decision_reason"
)

# Validate output
print("GOLD TABLE SCHEMA")
gold_df.printSchema()

### Write Gold Table

In [0]:
(
    gold_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(GOLD_TABLE)
)

In [0]:
display(gold_df)

### Preview Decisions

In [0]:
display(spark.sql(f"""
    SELECT *
    FROM {GOLD_TABLE}
    ORDER BY expected_savings_if_investigated DESC
    LIMIT 10
"""))

---
## Baseline Comparison: Cost-Aware vs Risk-First

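Let's measure how much more expected savings the cost-aware ranking captures than the traditional "highest risk first" strategy. Because the cost-aware ranking sorts directly on expected savings, it cannot do worse on this metric, so the comparison shows the size of the gap, not whether one exists.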
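
To make the comparison concrete before running it on the Gold table, here is a toy sketch in plain Python. The four cases, the helper names `total_savings_top_k` and `pct_improvement`, and the capacity of 2 are all hypothetical and chosen only for illustration.

```python
def total_savings_top_k(cases, k, rank_key):
    """Sum expected savings over the k cases ranked highest by `rank_key`."""
    ranked = sorted(cases, key=rank_key, reverse=True)
    return sum(c["expected_savings"] for c in ranked[:k])


def pct_improvement(ours, baseline):
    """Relative improvement in percent, or None when the baseline is not positive."""
    if baseline <= 0:
        return None
    return (ours - baseline) / baseline * 100


# Hypothetical cases: risk and expected savings deliberately disagree
toy_cases = [
    {"case_id": "A", "risk_probability": 0.90, "expected_savings": 350.0},
    {"case_id": "B", "risk_probability": 0.50, "expected_savings": 4900.0},
    {"case_id": "C", "risk_probability": 0.80, "expected_savings": 700.0},
    {"case_id": "D", "risk_probability": 0.30, "expected_savings": 2000.0},
]
k = 2  # toy capacity

cost_aware = total_savings_top_k(toy_cases, k, lambda c: c["expected_savings"])
risk_first = total_savings_top_k(toy_cases, k, lambda c: c["risk_probability"])
improvement = pct_improvement(cost_aware, risk_first)

print(f"Cost-aware total: ${cost_aware:,.2f}")
print(f"Risk-first total: ${risk_first:,.2f}")
if improvement is not None:
    print(f"Improvement: {improvement:.1f}%")
```

The size of the gap depends entirely on how strongly risk and potential loss disagree in the data, so the toy figures say nothing about what to expect on real alerts.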

In [0]:
# Compare strategies
comparison_df = spark.sql(f"""
    WITH ranked_data AS (
        SELECT
            *,
            ROW_NUMBER() OVER (ORDER BY expected_savings_if_investigated DESC) AS cost_aware_rank,
            ROW_NUMBER() OVER (ORDER BY risk_probability DESC) AS risk_first_rank
        FROM {GOLD_TABLE}
    )
    SELECT
        'Cost-Aware (Ours)' AS strategy,
        ROUND(SUM(CASE WHEN cost_aware_rank <= {DAILY_INVESTIGATION_CAPACITY} 
                       THEN expected_savings_if_investigated ELSE 0 END), 2) AS total_savings
    FROM ranked_data
    
    UNION ALL
    
    SELECT
        'Risk-First (Baseline)' AS strategy,
        ROUND(SUM(CASE WHEN risk_first_rank <= {DAILY_INVESTIGATION_CAPACITY} 
                       THEN expected_savings_if_investigated ELSE 0 END), 2)
    FROM ranked_data
""")

comparison_pdf = comparison_df.toPandas()

# Calculate improvement (index by strategy name for readable lookups)
savings_by_strategy = comparison_pdf.set_index("strategy")["total_savings"]
our_savings = float(savings_by_strategy["Cost-Aware (Ours)"])
baseline_savings = float(savings_by_strategy["Risk-First (Baseline)"])

print("STRATEGY COMPARISON")
print("=" * 50)
display(comparison_df)

# A percentage is only meaningful against a positive baseline
if baseline_savings > 0:
    improvement = (our_savings - baseline_savings) / baseline_savings * 100
    print(f"\nImprovement: {improvement:.1f}% more expected savings with the cost-aware approach")
else:
    print(f"\nAbsolute gain over baseline: ${our_savings - baseline_savings:,.2f}")

---
## Key Takeaways

### What We Built
1. **Cost-aware decision engine** that optimizes for business value
2. **Capacity-constrained optimization** reflecting real-world limitations
3. **Transparent decision logic** with explainable rankings

### Business Value Delivered
- Maximizes expected savings given limited resources
- Outperforms traditional risk-first prioritization on total expected savings
- Provides actionable recommendations, not just scores


---
**Next**: See `05_Gold_Analytics_and_Insights` for business impact analysis