# 05_Gold_Analytics_and_Insights

## Business Value Quantification & ROI Analysis

This notebook provides:
- **Executive-ready business metrics**
- **ROI calculations** comparing our approach to baselines
- **What-if scenarios** for capacity planning
- **Actionable insights** for stakeholders

**Evaluation Alignment**:
- Business Impact & Practical Use
- AI Innovation & Insight Generation
- Documentation & Explainability

## Setup & Table References

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    sum as spark_sum, avg, count, 
    round as spark_round, col, when, lit
)
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

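# getActiveSession() returns None when no session is active, so fall back to creating one
spark = SparkSession.getActiveSession() or SparkSession.builder.getOrCreate()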

# Table references
GOLD_TABLE = "cost_aware_capstone.risk_decisioning.gold_decision_recommendations"
SILVER_TABLE = "cost_aware_capstone.risk_decisioning.silver_cost_aware_features"
RISK_TABLE = "cost_aware_capstone.risk_decisioning.ml_risk_predictions"

# Business constants
DAILY_CAPACITY = 50
INVESTIGATOR_HOURLY_RATE = 50  # $ per hour
AVG_INVESTIGATION_TIME = 0.75  # hours

print("Analytics notebook configured")

---
## 1. ROI Analysis: Cost-Aware vs Baseline Strategies

### Quantifying the Value of AI-Driven Optimization

In [0]:
# Compare multiple strategies
strategy_comparison = spark.sql(f"""
    WITH ranked AS (
        SELECT
            *,
            ROW_NUMBER() OVER (ORDER BY expected_savings_if_investigated DESC) AS cost_aware_rank,
            ROW_NUMBER() OVER (ORDER BY risk_probability DESC) AS risk_first_rank,
            ROW_NUMBER() OVER (ORDER BY fraud_loss_if_missed DESC) AS loss_first_rank,
            ROW_NUMBER() OVER (ORDER BY RAND(42)) AS random_rank
        FROM {GOLD_TABLE}
    )
    SELECT 'Cost-Aware (Ours)' AS strategy,
           ROUND(SUM(CASE WHEN cost_aware_rank <= {DAILY_CAPACITY} THEN expected_savings_if_investigated END), 2) AS savings,
           ROUND(SUM(CASE WHEN cost_aware_rank <= {DAILY_CAPACITY} THEN investigation_cost END), 2) AS cost
    FROM ranked
    UNION ALL
    SELECT 'Risk-First', 
           ROUND(SUM(CASE WHEN risk_first_rank <= {DAILY_CAPACITY} THEN expected_savings_if_investigated END), 2),
           ROUND(SUM(CASE WHEN risk_first_rank <= {DAILY_CAPACITY} THEN investigation_cost END), 2)
    FROM ranked
    UNION ALL
    SELECT 'Loss-First',
           ROUND(SUM(CASE WHEN loss_first_rank <= {DAILY_CAPACITY} THEN expected_savings_if_investigated END), 2),
           ROUND(SUM(CASE WHEN loss_first_rank <= {DAILY_CAPACITY} THEN investigation_cost END), 2)
    FROM ranked
    UNION ALL
    SELECT 'Random',
           ROUND(SUM(CASE WHEN random_rank <= {DAILY_CAPACITY} THEN expected_savings_if_investigated END), 2),
           ROUND(SUM(CASE WHEN random_rank <= {DAILY_CAPACITY} THEN investigation_cost END), 2)
    FROM ranked
""")

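strategy_pdf = strategy_comparison.toPandas()
# Cast to float so the arithmetic below also works if Spark returned DECIMAL columns
strategy_pdf[['savings', 'cost']] = strategy_pdf[['savings', 'cost']].astype(float)
strategy_pdf['roi'] = strategy_pdf['savings'] / strategy_pdf['cost']
strategy_pdf['net_value'] = strategy_pdf['savings'] - strategy_pdf['cost']

print(f"STRATEGY COMPARISON ({DAILY_CAPACITY} Investigations)")
print("=" * 70)
# Display the pandas frame so the derived roi and net_value columns are shown
display(strategy_pdf)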

# Calculate improvement over baselines
our_savings = strategy_pdf[strategy_pdf['strategy'] == 'Cost-Aware (Ours)']['savings'].values[0]
risk_savings = strategy_pdf[strategy_pdf['strategy'] == 'Risk-First']['savings'].values[0]
random_savings = strategy_pdf[strategy_pdf['strategy'] == 'Random']['savings'].values[0]

print("\nIMPROVEMENT OVER BASELINES:")
# The "+" format flag prints the correct sign even if a baseline outperforms us
print(f"   vs Risk-First: {(our_savings - risk_savings)/risk_savings*100:+.1f}%")
print(f"   vs Random:     {(our_savings - random_savings)/random_savings*100:+.1f}%")
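To make the comparison above concrete, here is a minimal pure-Python sketch of the same top-N selection logic on a few made-up cases. The case values and the `select_top_n` and `summarize` helpers are hypothetical illustrations; only the field names are taken from the columns used in the query.

```python
# Hypothetical cases; field names mirror the Gold table columns used above.
cases = [
    {"id": "A", "risk_probability": 0.90, "expected_savings_if_investigated": 400.0, "investigation_cost": 40.0},
    {"id": "B", "risk_probability": 0.30, "expected_savings_if_investigated": 2500.0, "investigation_cost": 40.0},
    {"id": "C", "risk_probability": 0.80, "expected_savings_if_investigated": 300.0, "investigation_cost": 40.0},
    {"id": "D", "risk_probability": 0.10, "expected_savings_if_investigated": 50.0, "investigation_cost": 40.0},
]

def select_top_n(rows, key, n):
    """Return the n rows with the largest value of `key` (like ROW_NUMBER ... DESC <= n)."""
    return sorted(rows, key=lambda r: r[key], reverse=True)[:n]

def summarize(selected):
    """Aggregate savings and cost, then derive ROI and net value as in the notebook."""
    savings = sum(r["expected_savings_if_investigated"] for r in selected)
    cost = sum(r["investigation_cost"] for r in selected)
    return {"savings": savings, "cost": cost, "roi": savings / cost, "net_value": savings - cost}

capacity = 2  # stand-in for DAILY_CAPACITY
cost_aware = summarize(select_top_n(cases, "expected_savings_if_investigated", capacity))
risk_first = summarize(select_top_n(cases, "risk_probability", capacity))

print("cost-aware:", cost_aware)
print("risk-first:", risk_first)
```

In this toy data, case B has a modest fraud probability but large expected savings, so risk-first ranking skips it while cost-aware ranking selects it.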

---
## 2. What-If Analysis: Capacity Scenarios

### Business Question: What if we hire more investigators?

In [0]:
# Simulate different capacity scenarios
capacity_scenarios = [25, 50, 75, 100, 150, 200]
scenario_results = []

for cap in capacity_scenarios:
    result = spark.sql(f"""
        SELECT
            {cap} AS capacity,
            ROUND(SUM(CASE WHEN priority_rank <= {cap} THEN expected_savings_if_investigated END), 2) AS total_savings,
            ROUND(SUM(CASE WHEN priority_rank <= {cap} THEN investigation_cost END), 2) AS total_cost
        FROM {GOLD_TABLE}
    """).toPandas()
    
    result['net_value'] = result['total_savings'] - result['total_cost']
    result['roi'] = result['total_savings'] / result['total_cost']
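    # Marginal savings relative to the previous (smaller) capacity scenario;
    # the first scenario has no predecessor, so its marginal equals its total
    prev_savings = scenario_results[-1]['total_savings'].values[0] if scenario_results else 0
    result['marginal_savings'] = result['total_savings'] - prev_savings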
    scenario_results.append(result)

scenarios_df = pd.concat(scenario_results, ignore_index=True)

print("CAPACITY SCENARIO ANALYSIS")
print("=" * 70)
print(scenarios_df.to_string(index=False))

# Visualize
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

axes[0].plot(scenarios_df['capacity'], scenarios_df['total_savings'], 'go-', lw=2, markersize=8, label='Expected Savings')
axes[0].plot(scenarios_df['capacity'], scenarios_df['total_cost'], 'rs--', lw=2, markersize=8, label='Investigation Cost')
axes[0].axvline(x=DAILY_CAPACITY, color='blue', linestyle=':', lw=2, alpha=0.7, label='Current Capacity')
axes[0].set_xlabel('Investigation Capacity', fontsize=11)
axes[0].set_ylabel('Amount ($)', fontsize=11)
axes[0].set_title('Savings vs Cost by Capacity', fontsize=13, fontweight='bold')
axes[0].legend()
axes[0].grid(True, alpha=0.3)

axes[1].plot(scenarios_df['capacity'], scenarios_df['roi'], 'b^-', lw=2, markersize=8)
axes[1].axvline(x=DAILY_CAPACITY, color='red', linestyle=':', lw=2, alpha=0.7, label='Current Capacity')
axes[1].set_xlabel('Investigation Capacity', fontsize=11)
axes[1].set_ylabel('ROI (Savings / Cost)', fontsize=11)
axes[1].set_title('ROI Diminishing Returns', fontsize=13, fontweight='bold')
axes[1].legend()
axes[1].grid(True, alpha=0.3)

plt.tight_layout()
plt.show()

print("\nINSIGHT: Increasing capacity from 50->75 would increase savings but with lower marginal ROI.")
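The insight refers to marginal ROI, which the scenario table does not include as a column. A minimal sketch of how it could be derived from cumulative scenario totals follows. The `marginal_roi` helper and the numbers are hypothetical, and marginal ROI is assumed here to mean the extra savings divided by the extra cost between consecutive capacity levels.

```python
def marginal_roi(scenarios):
    """Given (capacity, total_savings, total_cost) tuples sorted by capacity,
    return (capacity, marginal_roi) pairs, where marginal ROI is the change in
    savings divided by the change in cost since the previous capacity level."""
    results = []
    prev_savings, prev_cost = 0.0, 0.0
    for capacity, total_savings, total_cost in scenarios:
        extra_savings = total_savings - prev_savings
        extra_cost = total_cost - prev_cost
        # Guard against a zero denominator (no additional cost incurred)
        roi = extra_savings / extra_cost if extra_cost else None
        results.append((capacity, roi))
        prev_savings, prev_cost = total_savings, total_cost
    return results

# Hypothetical cumulative totals, not results from the Gold table
example = [(25, 5000.0, 1000.0), (50, 8000.0, 2000.0), (75, 9500.0, 3000.0)]
for capacity, roi in marginal_roi(example):
    print(capacity, roi)
```

A falling marginal ROI across rows is what diminishing returns looks like in this framing, even while cumulative savings keep rising.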

---
## 3. Risk Coverage Analysis

### How well do we cover high-risk cases?

In [0]:
# Risk coverage by decile
risk_coverage = spark.sql(f"""
    WITH deciles AS (
        SELECT *,
               NTILE(10) OVER (ORDER BY risk_probability) AS risk_decile
        FROM {GOLD_TABLE}
    )
    SELECT
        risk_decile,
        COUNT(*) AS total_cases,
        SUM(decision) AS investigated,
        ROUND(SUM(decision) * 100.0 / COUNT(*), 1) AS coverage_pct,
        ROUND(AVG(risk_probability) * 100, 1) AS avg_risk_pct,
        ROUND(SUM(expected_savings_if_investigated * decision), 2) AS savings_captured
    FROM deciles
    GROUP BY risk_decile
    ORDER BY risk_decile
""")

print("RISK COVERAGE BY DECILE")
print("=" * 70)
display(risk_coverage)

coverage_pdf = risk_coverage.toPandas()

# Visualize
fig, ax = plt.subplots(figsize=(10, 5))
bars = ax.bar(coverage_pdf['risk_decile'], coverage_pdf['coverage_pct'], 
              color=plt.cm.RdYlGn_r(np.linspace(0.2, 0.8, 10)))
ax.set_xlabel('Risk Decile (1=Lowest, 10=Highest)', fontsize=11)
ax.set_ylabel('Investigation Coverage (%)', fontsize=11)
ax.set_title('Investigation Coverage by Risk Decile', fontsize=13, fontweight='bold')
ax.set_xticks(range(1, 11))
ax.grid(True, alpha=0.3)

for bar, val in zip(bars, coverage_pdf['coverage_pct']):
    ax.text(bar.get_x() + bar.get_width()/2, bar.get_height() + 1, 
            f'{val:.0f}%', ha='center', fontsize=9)

plt.tight_layout()
plt.show()

print("\nINSIGHT: High coverage in high-risk deciles, but we also investigate")
print("   some lower-risk cases when their potential loss is high.")
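For readers less familiar with the window function used above, this is a minimal pure-Python sketch of how `NTILE` assigns buckets and how coverage per bucket is then computed. The `ntile` and `coverage_by_bucket` helpers and the sample rows are hypothetical; the sketch assumes `decision` is stored as 0 or 1, as the `SUM(decision)` in the query implies.

```python
from collections import defaultdict

def ntile(n_rows, n_buckets):
    """1-based bucket number for each of n_rows ordered rows, following SQL NTILE:
    buckets are as equal as possible and earlier buckets absorb the remainder."""
    base, extra = divmod(n_rows, n_buckets)
    buckets = []
    for b in range(1, n_buckets + 1):
        size = base + (1 if b <= extra else 0)
        buckets.extend([b] * size)
    return buckets

def coverage_by_bucket(rows, n_buckets):
    """rows: (risk_probability, decision) pairs; returns {bucket: coverage_pct}."""
    ordered = sorted(rows, key=lambda r: r[0])  # ascending risk, so bucket 1 is lowest
    totals, investigated = defaultdict(int), defaultdict(int)
    for bucket, (_, decision) in zip(ntile(len(ordered), n_buckets), ordered):
        totals[bucket] += 1
        investigated[bucket] += decision
    return {b: round(100.0 * investigated[b] / totals[b], 1) for b in sorted(totals)}

# Hypothetical rows: (risk_probability, decision)
sample = [(0.05, 0), (0.10, 0), (0.20, 1), (0.40, 0), (0.70, 1), (0.95, 1)]
print(coverage_by_bucket(sample, 3))
```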

---
## 4. Actionable Insights for Stakeholders

### Key Findings & Recommendations

KEY FINDING #1: Cost-Aware Outperforms Traditional Approaches
   In the Section 1 comparison, cost-aware ranking captures more expected
   savings than risk-first or random selection at the same capacity,
   because it weighs both the probability AND the financial impact of
   each case.

   ACTION: Implement cost-aware prioritization in production.


KEY FINDING #2: Diminishing Returns Beyond Current Capacity
   ROI decreases as investigation capacity increases, which indicates
   that the highest-value cases are already captured at current capacity.

   ACTION: Because added capacity yields lower marginal returns,
   consider automation for lower-priority cases instead.


KEY FINDING #3: Some Lower-Risk Cases Have High Value
   Cases with moderate risk but very high potential losses are
   prioritized by the system because their expected value is high.

   ACTION: Don't filter purely on risk score. Rank by expected value.

 
KEY FINDING #4: Investigation Costs Are Well-Justified
   An ROI (savings / cost) above 1 means investigations generate
   positive net value: each dollar spent on investigation returns
   more than a dollar in prevented fraud losses.

   ACTION: Maintain or increase the investigation budget while ROI stays above 1.
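
The link between ROI and net value can be checked with a few lines of arithmetic. The sketch below is illustrative only: it assumes the cost of one investigation is investigator labor alone (hourly rate times average hours, using the constants from the setup cell), which may not match how `investigation_cost` is derived in the Gold table, and the `roi_and_net` helper is hypothetical.

```python
INVESTIGATOR_HOURLY_RATE = 50   # $ per hour, from the setup cell
AVG_INVESTIGATION_TIME = 0.75   # hours, from the setup cell

# Assumption: cost per investigation is labor only
cost_per_investigation = INVESTIGATOR_HOURLY_RATE * AVG_INVESTIGATION_TIME

def roi_and_net(savings, cost):
    """Return (ROI, net value) using the notebook's definitions:
    ROI = savings / cost and net value = savings - cost."""
    return savings / cost, savings - cost

# ROI > 1 exactly when savings exceed cost, i.e. when net value is positive
for savings in (20.0, 37.5, 150.0):  # hypothetical expected savings per case
    roi, net = roi_and_net(savings, cost_per_investigation)
    print(f"savings={savings:>6.2f}  roi={roi:.2f}  net={net:+.2f}")
```

Under this labor-only assumption, a case breaks even at $37.50 of expected savings.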


---
## Notebook Complete

**Next Steps**:
1. Review `06_Interactive_Dashboard.ipynb` for detailed visualizations
2. Present findings to stakeholders
3. Implement in production environment

---