# 02_Silver_Feature_Engineering

## Data Cleaning & ML Feature Creation

This notebook implements the **Silver layer** of the Medallion Architecture:
- **Purpose**: Clean data and engineer business-meaningful features
- **Input**: Bronze Delta table (raw data)
- **Output**: Silver Delta table (ML-ready features)

### Feature Engineering Philosophy
The features capture both **risk signals** and **business value**:
- Risk indicators (behavioral patterns)
- Cost-aware features (expected loss proxies)
- Normalized features (log-scaled amounts)

In [0]:
# ============================================
# 02_Silver_Feature_Engineering.ipynb
# --------------------------------------------
# Purpose:
#   Clean Bronze data and engineer
#   cost-aware ML features for risk prediction
#   and decision optimization.
#
# Silver Principles:
#   - Schema enforcement
#   - Feature creation
#   - Business-aware transformations
#
# Output:
#   - silver_cost_aware_features
# ============================================

### Configuration & Imports

In [0]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import (
    col, when, log1p
)

In [0]:
spark = SparkSession.getActiveSession()

CATALOG = "cost_aware_capstone"
SCHEMA = "risk_decisioning"

In [0]:
# Build fully qualified table names from the catalog and schema defined above
BRONZE_TABLE = f"{CATALOG}.{SCHEMA}.bronze_cost_aware_cases"

SILVER_TABLE = f"{CATALOG}.{SCHEMA}.silver_cost_aware_features"

### Read Bronze Data

In [0]:
bronze_df = spark.table(BRONZE_TABLE)

bronze_df.printSchema()

In [0]:
display(bronze_df)

### Data Cleaning

In [0]:
# Data cleaning with edge case handling
original_count = bronze_df.count()

clean_df = (
    bronze_df
    # Remove invalid records
    .filter(col("transaction_amount") > 0)
    .filter(col("investigation_cost") > 0)
    .filter(col("account_age_days") > 0)
    # Handle potential nulls
    .na.drop(subset=["case_id", "transaction_amount", "fraud_loss_if_missed"])
)

clean_count = clean_df.count()
removed_count = original_count - clean_count
# Guard against an empty Bronze table to avoid division by zero
removed_frac = removed_count / original_count if original_count else 0.0

print("DATA CLEANING SUMMARY")
print(f"   Original records:  {original_count:,}")
print(f"   Clean records:     {clean_count:,}")
print(f"   Removed:           {removed_count:,} ({removed_frac:.1%})")

# Edge case: flag unusually heavy data loss
if removed_frac > 0.1:
    print("WARNING: More than 10% of data removed. Review data quality.")
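To make the cleaning rules concrete, here is a minimal plain-Python sketch of the same row-level logic, runnable without Spark. The helper `is_valid_case` and the sample rows are hypothetical and exist only for illustration. `None` stands in for a Spark NULL: in Spark, a comparison such as `col(...) > 0` against NULL evaluates to NULL, and `filter` keeps only rows where the condition is true, so those rows are dropped as well.

```python
from typing import Any, Mapping

# Column names taken from the cleaning cell above.
REQUIRED_FIELDS = ("case_id", "transaction_amount", "fraud_loss_if_missed")
POSITIVE_FIELDS = ("transaction_amount", "investigation_cost", "account_age_days")


def is_valid_case(row: Mapping[str, Any]) -> bool:
    """Hypothetical row-level mirror of the Spark filters (illustration only)."""
    # Mirrors .na.drop(subset=...): a missing required field drops the row.
    if any(row.get(name) is None for name in REQUIRED_FIELDS):
        return False
    # Mirrors .filter(col(...) > 0): None stands in for NULL, which never passes.
    for name in POSITIVE_FIELDS:
        value = row.get(name)
        if value is None or value <= 0:
            return False
    return True


# Made-up sample rows, not taken from the Bronze table.
sample_rows = [
    {"case_id": "A1", "transaction_amount": 120.0, "investigation_cost": 15.0,
     "account_age_days": 400, "fraud_loss_if_missed": 300.0},
    {"case_id": "A2", "transaction_amount": 0.0, "investigation_cost": 15.0,
     "account_age_days": 400, "fraud_loss_if_missed": 300.0},
    {"case_id": "A3", "transaction_amount": 120.0, "investigation_cost": None,
     "account_age_days": 400, "fraud_loss_if_missed": 300.0},
    {"case_id": "A4", "transaction_amount": 120.0, "investigation_cost": 15.0,
     "account_age_days": 0, "fraud_loss_if_missed": 300.0},
    {"case_id": None, "transaction_amount": 120.0, "investigation_cost": 15.0,
     "account_age_days": 400, "fraud_loss_if_missed": 300.0},
]

kept = [row["case_id"] for row in sample_rows if is_valid_case(row)]
print(kept)
```

Note that the `account_age_days > 0` rule also removes accounts created on the same day (age 0), which is worth keeping in mind when reviewing the removal percentage.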

### Feature Engineering

#### Log-Scaled Monetary Features (Reduce Skew for Model Training)

In [0]:
feature_df = (
    clean_df
    .withColumn(
        "log_transaction_amount",
        log1p(col("transaction_amount"))
    )
    .withColumn(
        "log_fraud_loss_if_missed",
        log1p(col("fraud_loss_if_missed"))
    )
)
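As a rough illustration of why the amounts are log-scaled, the sketch below applies the same transform, `log1p(x) = ln(1 + x)`, in plain Python. The helper names `log_scale` and `unscale` and the sample amounts are assumptions made for this example; they are not part of the notebook.

```python
import math


def log_scale(amount: float) -> float:
    """Same transform as Spark's log1p: natural log of (1 + amount)."""
    return math.log1p(amount)


def unscale(log_amount: float) -> float:
    """Inverse transform, useful when reporting values in original units."""
    return math.expm1(log_amount)


# Made-up amounts spanning several orders of magnitude.
amounts = [0.0, 10.0, 1_000.0, 100_000.0]
scaled = [log_scale(a) for a in amounts]

for amount, value in zip(amounts, scaled):
    print(f"{amount:>12,.2f} -> {value:.3f}")
```

The raw amounts span five orders of magnitude, while the scaled values stay within a narrow range of roughly 0 to 12. Using `log1p` rather than a plain logarithm keeps an amount of 0 well defined, since it maps to 0.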

#### Behavioral Risk Score

**Business Rationale**: Combine multiple risk signals into a single composite score.

Formula: `behavioral_risk_score = 0.4 × tx_velocity_24h + 2.5 × unusual_location_flag + 2.0 × device_change_flag`

The weights are based on domain expertise:
- Transaction velocity: Moderate signal (0.4)
- Unusual location: Strong signal (2.5), since geographic anomalies are significant
- Device change: Strong signal (2.0), since device changes are common in account takeover

In [0]:
feature_df = feature_df.withColumn(
    "behavioral_risk_score",
    (
        col("tx_velocity_24h") * 0.4 +
        col("unusual_location_flag") * 2.5 +
        col("device_change_flag") * 2.0
    )
)
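A worked example can help when sanity-checking the weights. The plain-Python sketch below recomputes the score for two made-up cases; the `WEIGHTS` dictionary and the `behavioral_risk_score` function are hypothetical helpers that mirror the formula, and the flags are assumed to be 0/1.

```python
import math

# Weights copied from the formula above.
WEIGHTS = {
    "tx_velocity_24h": 0.4,
    "unusual_location_flag": 2.5,
    "device_change_flag": 2.0,
}


def behavioral_risk_score(case: dict) -> float:
    """Weighted sum of the three behavioral signals (illustrative helper)."""
    return sum(weight * case[name] for name, weight in WEIGHTS.items())


# Made-up cases; flags are assumed to be 0 or 1.
both_flags = {"tx_velocity_24h": 0, "unusual_location_flag": 1, "device_change_flag": 1}
high_velocity = {"tx_velocity_24h": 12, "unusual_location_flag": 0, "device_change_flag": 0}

score_flags = behavioral_risk_score(both_flags)
score_velocity = behavioral_risk_score(high_velocity)

print(f"both flags set:  {score_flags:.2f}")
print(f"12 tx in 24h:    {score_velocity:.2f}")
```

Because `tx_velocity_24h` is an unbounded count while the flags contribute at most 4.5 combined, a case with 12 transactions in 24 hours scores higher than a case with both flags set. If that is not the intended behavior, capping or scaling the velocity term before weighting is one option.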

#### Expected Loss Proxy (Cost-Aware Feature)

**Innovation**: We include business cost information as a feature, not just as a post-hoc filter.

Formula: `expected_loss_proxy = 0.05 × transaction_amount + 0.95 × fraud_loss_if_missed`

This helps the model learn that some cases are inherently more valuable to investigate.

In [0]:
feature_df = feature_df.withColumn(
    "expected_loss_proxy",
    col("transaction_amount") * 0.05 +
    col("fraud_loss_if_missed") * 0.95
)
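The proxy can be checked by hand with a small plain-Python sketch. The function name `expected_loss_proxy` mirrors the column name, and the input values are made up for illustration.

```python
import math


def expected_loss_proxy(transaction_amount: float, fraud_loss_if_missed: float) -> float:
    """Weighted blend from the formula above: 5% amount, 95% potential loss."""
    return 0.05 * transaction_amount + 0.95 * fraud_loss_if_missed


# Made-up case: a 200 transaction with a potential fraud loss of 1,000.
proxy = expected_loss_proxy(transaction_amount=200.0, fraud_loss_if_missed=1_000.0)
print(f"expected_loss_proxy = {proxy:.2f}")  # 0.05 * 200 + 0.95 * 1000
```

Since the two weights sum to 1, the proxy is a weighted average that always lies between `transaction_amount` and `fraud_loss_if_missed`, and it sits much closer to the potential fraud loss.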

#### Final Silver Feature Selection

In [0]:
silver_df = (
    feature_df
    .select(
        "case_id",
        # ML features
        "log_transaction_amount",
        "tx_velocity_24h",
        "unusual_location_flag",
        "device_change_flag",
        "account_age_days",
        "behavioral_risk_score",
        "expected_loss_proxy",
        # cost fields
        "investigation_cost",
        "fraud_loss_if_missed",
        # label
        col("label_fraud").alias("label")
    )
)

---

#### Write Silver Delta Table

In [0]:
(
    silver_df
    .write
    .format("delta")
    .mode("overwrite")
    .saveAsTable(SILVER_TABLE)
)

#### Validation & Preview

In [0]:
display(spark.sql(f"""
    SELECT COUNT(*) AS silver_row_count
    FROM {SILVER_TABLE}
"""))

In [0]:
display(spark.sql(f"""
    SELECT *
    FROM {SILVER_TABLE}
    LIMIT 5
"""))

---
## Silver Layer Complete

**Features Created**:

| Feature | Type | Purpose |
|---------|------|---------|
| `log_transaction_amount` | Numeric | Log-scaled transaction amount (`log1p`) |
| `tx_velocity_24h` | Numeric | Behavioral signal |
| `unusual_location_flag` | Binary | Geographic anomaly |
| `device_change_flag` | Binary | Device anomaly |
| `account_age_days` | Numeric | Tenure risk |
| `behavioral_risk_score` | Composite | Combined risk signals |
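| `expected_loss_proxy` | Composite | Business value signal |

The Silver table also carries `case_id`, the cost fields `investigation_cost` and `fraud_loss_if_missed`, and the `label` column (renamed from `label_fraud`). `log_fraud_loss_if_missed` is computed during feature engineering but is not included in the final selection.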

**Next**: Run `03_ML_Risk_Prediction.ipynb` for model training

---