# Task A: Baseline Model & Offline Audit

**Objectives:**
1. Perform a data audit (rows, columns, duplicates, class imbalance)
2. Train a baseline fraud detection model
3. Report PR-AUC and recall at the chosen threshold
4. Save the model for streaming inference

In [1]:
from pathlib import Path
import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import LogisticRegression, RandomForestClassifier
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator

# Setup paths
WORK = Path("/home/jovyan/work")
DATA = WORK / "data"
MODELS = WORK / "models"
AUDIT = WORK / "audit_results"

for p in [DATA, MODELS, AUDIT]:
    p.mkdir(parents=True, exist_ok=True)

csv_path = str(DATA / "creditcard.csv")
print(f"CSV Path: {csv_path}")

CSV Path: /home/jovyan/work/data/creditcard.csv


## 1. Data Audit

In [2]:
# Load data with pandas for quick audit
df = pd.read_csv(csv_path)

print("="*80)
print("DATA AUDIT REPORT")
print("="*80)

# 1. Basic dimensions
n_rows, n_cols = df.shape
print(f"\n1. DATASET DIMENSIONS")
print(f"   - Number of rows: {n_rows:,}")
print(f"   - Number of columns: {n_cols}")
print(f"\n   Column names: {list(df.columns)}")

DATA AUDIT REPORT

1. DATASET DIMENSIONS
   - Number of rows: 284,807
   - Number of columns: 31

   Column names: ['Time', 'V1', 'V2', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'Amount', 'Class']


In [3]:
# 2. Duplicates analysis
n_duplicates_before = df.duplicated().sum()
duplicate_pct_before = (n_duplicates_before / n_rows) * 100

print(f"\n2. DUPLICATE ANALYSIS")
print(f"   - Duplicates (before drop): {n_duplicates_before:,} ({duplicate_pct_before:.4f}%)")

# Drop duplicates
df_clean = df.drop_duplicates()
n_rows_after = len(df_clean)
n_duplicates_removed = n_rows - n_rows_after

print(f"   - Duplicates removed: {n_duplicates_removed:,}")
print(f"   - Rows after drop: {n_rows_after:,}")


2. DUPLICATE ANALYSIS
   - Duplicates (before drop): 1,081 (0.3796%)
   - Duplicates removed: 1,081
   - Rows after drop: 283,726


In [4]:
# 3. Class imbalance
class_counts = df_clean['Class'].value_counts().sort_index()
class_0 = class_counts[0]
class_1 = class_counts[1]
total = class_0 + class_1

pct_0 = (class_0 / total) * 100
pct_1 = (class_1 / total) * 100
imbalance_ratio = class_0 / class_1

print(f"\n3. CLASS IMBALANCE")
print(f"   - Class 0 (Normal): {class_0:,} ({pct_0:.4f}%)")
print(f"   - Class 1 (Fraud):  {class_1:,} ({pct_1:.4f}%)")
print(f"   - Imbalance ratio (0:1): {imbalance_ratio:.2f}:1")
print(f"   - This is a HIGHLY IMBALANCED dataset")


3. CLASS IMBALANCE
   - Class 0 (Normal): 283,253 (99.8333%)
   - Class 1 (Fraud):  473 (0.1667%)
   - Imbalance ratio (0:1): 598.84:1
   - This is a HIGHLY IMBALANCED dataset


In [5]:
# 4. Missing values
missing_counts = df_clean.isnull().sum()
total_missing = missing_counts.sum()

print(f"\n4. MISSING VALUES")
print(f"   - Total missing cells: {total_missing}")
if total_missing > 0:
    print(f"\n   Columns with missing values:")
    for col, count in missing_counts[missing_counts > 0].items():
        print(f"      {col}: {count}")
else:
    print(f"   - No missing values detected")


4. MISSING VALUES
   - Total missing cells: 0
   - No missing values detected


In [6]:
# 5. Statistical summary for key fields
print(f"\n5. STATISTICAL SUMMARY")
print(f"\nAmount column statistics:")
print(df_clean['Amount'].describe())

print(f"\nTime column statistics:")
print(df_clean['Time'].describe())


5. STATISTICAL SUMMARY

Amount column statistics:
count    283726.000000
mean         88.472687
std         250.399437
min           0.000000
25%           5.600000
50%          22.000000
75%          77.510000
max       25691.160000
Name: Amount, dtype: float64

Time column statistics:
count    283726.000000
mean      94811.077600
std       47481.047891
min           0.000000
25%       54204.750000
50%       84692.500000
75%      139298.000000
max      172792.000000
Name: Time, dtype: float64


In [7]:
# 6. Create audit summary table
audit_summary = pd.DataFrame({
    'Metric': [
        'Total Rows (Original)',
        'Total Columns',
        'Duplicates Found',
        'Duplicates %',
        'Rows After Dedup',
        'Missing Values',
        'Class 0 Count',
        'Class 0 %',
        'Class 1 Count',
        'Class 1 %',
        'Imbalance Ratio (0:1)'
    ],
    'Value': [
        f"{n_rows:,}",
        f"{n_cols}",
        f"{n_duplicates_before:,}",
        f"{duplicate_pct_before:.4f}%",
        f"{n_rows_after:,}",
        f"{total_missing}",
        f"{class_0:,}",
        f"{pct_0:.4f}%",
        f"{class_1:,}",
        f"{pct_1:.4f}%",
        f"{imbalance_ratio:.2f}:1"
    ]
})

print("\n" + "="*80)
print("AUDIT SUMMARY TABLE")
print("="*80)
print(audit_summary.to_string(index=False))

# Save audit summary
audit_summary.to_csv(AUDIT / "data_audit_summary.csv", index=False)
print(f"\nAudit summary saved to: {AUDIT / 'data_audit_summary.csv'}")


AUDIT SUMMARY TABLE
               Metric    Value
Total Rows (Original)  284,807
        Total Columns       31
     Duplicates Found    1,081
         Duplicates %  0.3796%
     Rows After Dedup  283,726
       Missing Values        0
        Class 0 Count  283,253
            Class 0 % 99.8333%
        Class 1 Count      473
            Class 1 %  0.1667%
Imbalance Ratio (0:1) 598.84:1

Audit summary saved to: /home/jovyan/work/audit_results/data_audit_summary.csv


## 2. Baseline Model Training

We will train two models:
1. **Logistic Regression** (fast, interpretable)
2. **Random Forest** (tree-based, handles non-linearity)

Both use inverse-frequency class weights (each fraud row is weighted by the normal-to-fraud ratio of the training set) to handle the imbalance.

In [8]:
# Initialize Spark
spark = (SparkSession.builder
         .appName("TaskA-Baseline-Model")
         .master("local[*]")
         .config("spark.sql.shuffle.partitions", "8")
         .getOrCreate())

spark.sparkContext.setLogLevel("WARN")
print(f"Spark version: {spark.version}")

Spark version: 3.5.0


In [9]:
# Load clean data to Spark
sdf = spark.createDataFrame(df_clean)

# Define features and label
feature_cols = [f"V{i}" for i in range(1, 29)] + ["Amount"]
label_col = "Class"

print(f"Feature columns ({len(feature_cols)}): {feature_cols[:5]} ... {feature_cols[-2:]}")
print(f"Label column: {label_col}")

Feature columns (29): ['V1', 'V2', 'V3', 'V4', 'V5'] ... ['V28', 'Amount']
Label column: Class


In [10]:
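# Split data: roughly 80% train / 20% test.
# randomSplit is random and unstratified, so split sizes and class ratios are approximate.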
train_sdf, test_sdf = sdf.randomSplit([0.8, 0.2], seed=42)

print(f"Training set size: {train_sdf.count():,}")
print(f"Test set size: {test_sdf.count():,}")

Training set size: 227,157
Test set size: 56,569


In [11]:
# Calculate class weights for training set
train_class_counts = train_sdf.groupBy(label_col).count().collect()
counts_dict = {row[label_col]: row["count"] for row in train_class_counts}

n_class_0 = counts_dict.get(0, 1)
n_class_1 = counts_dict.get(1, 1)
weight_class_1 = n_class_0 / n_class_1
weight_class_0 = 1.0

print(f"\nTraining set class distribution:")
print(f"   Class 0: {n_class_0:,}")
print(f"   Class 1: {n_class_1:,}")
print(f"   Weight for class 1 (fraud): {weight_class_1:.2f}")

# Add weight column. Only training uses it (via weightCol); the evaluators below
# set no weightCol, so the copy on the test set does not affect the reported metrics.
weight_expr = (
    F.when(F.col(label_col) == 1, F.lit(float(weight_class_1)))
     .otherwise(F.lit(float(weight_class_0)))
)
train_sdf = train_sdf.withColumn("weight", weight_expr)
test_sdf = test_sdf.withColumn("weight", weight_expr)


Training set class distribution:
   Class 0: 226,792
   Class 1: 365
   Weight for class 1 (fraud): 621.35
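
The weight above is an inverse-frequency weight: each fraud row counts as many times as there are normal rows per fraud row, so both classes contribute the same total weight to the training loss. A minimal sketch of that arithmetic in plain Python, using the training counts printed above (the helper name `inverse_frequency_weight` is illustrative and not part of the notebook):

```python
def inverse_frequency_weight(n_majority: int, n_minority: int) -> float:
    """Weight for each minority-class row so both classes carry equal total weight."""
    if n_minority <= 0:
        raise ValueError("minority class must have at least one row")
    return n_majority / n_minority


# Training-set counts reported in the output above
n_class_0, n_class_1 = 226_792, 365

w1 = inverse_frequency_weight(n_class_0, n_class_1)
total_weight_0 = n_class_0 * 1.0  # normal rows keep weight 1.0
total_weight_1 = n_class_1 * w1   # fraud rows are up-weighted

print(f"fraud weight: {w1:.2f}")
print(f"total weight, class 0: {total_weight_0:,.0f}")
print(f"total weight, class 1: {total_weight_1:,.0f}")
```

After weighting, the 365 fraud rows carry the same total weight as the 226,792 normal rows, which is what stops the loss from being dominated by the majority class.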


### 2.1 Model 1: Logistic Regression

In [12]:
# Build Logistic Regression pipeline
assembler_lr = VectorAssembler(inputCols=feature_cols, outputCol="features_raw")
scaler_lr = StandardScaler(inputCol="features_raw", outputCol="features", withMean=True, withStd=True)
lr = LogisticRegression(
    featuresCol="features",
    labelCol=label_col,
    weightCol="weight",
    maxIter=100,
    regParam=0.01,
    elasticNetParam=0.0  # L2 regularization
)

pipeline_lr = Pipeline(stages=[assembler_lr, scaler_lr, lr])

print("Training Logistic Regression model...")
model_lr = pipeline_lr.fit(train_sdf)
print("Training complete!")

Training Logistic Regression model...
Training complete!


In [13]:
# Evaluate Logistic Regression
pred_lr = model_lr.transform(test_sdf)

evaluator_pr = BinaryClassificationEvaluator(
    labelCol=label_col,
    rawPredictionCol="rawPrediction",
    metricName="areaUnderPR"
)

evaluator_roc = BinaryClassificationEvaluator(
    labelCol=label_col,
    rawPredictionCol="rawPrediction",
    metricName="areaUnderROC"
)

pr_auc_lr = evaluator_pr.evaluate(pred_lr)
roc_auc_lr = evaluator_roc.evaluate(pred_lr)

print(f"\nLogistic Regression Performance:")
print(f"   PR-AUC:  {pr_auc_lr:.6f}")
print(f"   ROC-AUC: {roc_auc_lr:.6f}")


Logistic Regression Performance:
   PR-AUC:  0.632238
   ROC-AUC: 0.965672


### 2.2 Model 2: Random Forest

In [14]:
# Build Random Forest pipeline
assembler_rf = VectorAssembler(inputCols=feature_cols, outputCol="features")
rf = RandomForestClassifier(
    featuresCol="features",
    labelCol=label_col,
    weightCol="weight",
    numTrees=100,
    maxDepth=10,
    seed=42
)

pipeline_rf = Pipeline(stages=[assembler_rf, rf])

print("Training Random Forest model...")
model_rf = pipeline_rf.fit(train_sdf)
print("Training complete!")

Training Random Forest model...
Training complete!


In [15]:
# Evaluate Random Forest
pred_rf = model_rf.transform(test_sdf)

pr_auc_rf = evaluator_pr.evaluate(pred_rf)
roc_auc_rf = evaluator_roc.evaluate(pred_rf)

print(f"\nRandom Forest Performance:")
print(f"   PR-AUC:  {pr_auc_rf:.6f}")
print(f"   ROC-AUC: {roc_auc_rf:.6f}")


Random Forest Performance:
   PR-AUC:  0.784108
   ROC-AUC: 0.957809


## 3. Threshold Selection & Recall Analysis

In [16]:
# Extract the fraud (class 1) probability for both models.
# vector_to_array (Spark >= 3.0) avoids a Python UDF and, unlike a bare `except`,
# does not silently turn extraction errors into a probability of 0.0.
from pyspark.ml.functions import vector_to_array

pred_lr = pred_lr.withColumn("fraud_prob", vector_to_array(F.col("probability"))[1])
pred_rf = pred_rf.withColumn("fraud_prob", vector_to_array(F.col("probability"))[1])

In [17]:
# Function to calculate precision, recall at different thresholds
def calc_metrics_at_threshold(pred_df, threshold):
    pred_with_alert = pred_df.withColumn(
        "predicted",
        F.when(F.col("fraud_prob") >= threshold, 1).otherwise(0)
    )
    
    # Calculate TP, FP, TN, FN
    metrics = pred_with_alert.groupBy().agg(
        F.sum(F.when((F.col("predicted") == 1) & (F.col(label_col) == 1), 1).otherwise(0)).alias("tp"),
        F.sum(F.when((F.col("predicted") == 1) & (F.col(label_col) == 0), 1).otherwise(0)).alias("fp"),
        F.sum(F.when((F.col("predicted") == 0) & (F.col(label_col) == 0), 1).otherwise(0)).alias("tn"),
        F.sum(F.when((F.col("predicted") == 0) & (F.col(label_col) == 1), 1).otherwise(0)).alias("fn"),
        F.sum(F.col(label_col)).alias("total_fraud")
    ).collect()[0]
    
    tp = metrics["tp"] or 0
    fp = metrics["fp"] or 0
    tn = metrics["tn"] or 0
    fn = metrics["fn"] or 0
    total_fraud = metrics["total_fraud"] or 1
    
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / total_fraud if total_fraud > 0 else 0
    f1 = 2 * (precision * recall) / (precision + recall) if (precision + recall) > 0 else 0
    
    return {
        "threshold": threshold,
        "tp": tp,
        "fp": fp,
        "tn": tn,
        "fn": fn,
        "precision": precision,
        "recall": recall,
        "f1": f1
    }

# Test multiple thresholds
thresholds = [0.3, 0.5, 0.7, 0.8, 0.9, 0.95, 0.99]

print("\n" + "="*80)
print("THRESHOLD ANALYSIS - LOGISTIC REGRESSION")
print("="*80)

results_lr = []
for thresh in thresholds:
    metrics = calc_metrics_at_threshold(pred_lr, thresh)
    results_lr.append(metrics)
    print(f"Threshold: {thresh:.2f} | Precision: {metrics['precision']:.4f} | Recall: {metrics['recall']:.4f} | F1: {metrics['f1']:.4f} | TP: {metrics['tp']} | FP: {metrics['fp']} | FN: {metrics['fn']}")

df_results_lr = pd.DataFrame(results_lr)


THRESHOLD ANALYSIS - LOGISTIC REGRESSION
Threshold: 0.30 | Precision: 0.0338 | Recall: 0.8981 | F1: 0.0651 | TP: 97 | FP: 2773 | FN: 11
Threshold: 0.50 | Precision: 0.1141 | Recall: 0.8611 | F1: 0.2015 | TP: 93 | FP: 722 | FN: 15
Threshold: 0.70 | Precision: 0.3061 | Recall: 0.8333 | F1: 0.4478 | TP: 90 | FP: 204 | FN: 18
Threshold: 0.80 | Precision: 0.5176 | Recall: 0.8148 | F1: 0.6331 | TP: 88 | FP: 82 | FN: 20
Threshold: 0.90 | Precision: 0.6343 | Recall: 0.7870 | F1: 0.7025 | TP: 85 | FP: 49 | FN: 23
Threshold: 0.95 | Precision: 0.6875 | Recall: 0.7130 | F1: 0.7000 | TP: 77 | FP: 35 | FN: 31
Threshold: 0.99 | Precision: 0.7558 | Recall: 0.6019 | F1: 0.6701 | TP: 65 | FP: 21 | FN: 43


In [18]:
print("\n" + "="*80)
print("THRESHOLD ANALYSIS - RANDOM FOREST")
print("="*80)

results_rf = []
for thresh in thresholds:
    metrics = calc_metrics_at_threshold(pred_rf, thresh)
    results_rf.append(metrics)
    print(f"Threshold: {thresh:.2f} | Precision: {metrics['precision']:.4f} | Recall: {metrics['recall']:.4f} | F1: {metrics['f1']:.4f} | TP: {metrics['tp']} | FP: {metrics['fp']} | FN: {metrics['fn']}")

df_results_rf = pd.DataFrame(results_rf)


THRESHOLD ANALYSIS - RANDOM FOREST
Threshold: 0.30 | Precision: 0.7236 | Recall: 0.8241 | F1: 0.7706 | TP: 89 | FP: 34 | FN: 19
Threshold: 0.50 | Precision: 0.8300 | Recall: 0.7685 | F1: 0.7981 | TP: 83 | FP: 17 | FN: 25
Threshold: 0.70 | Precision: 0.8621 | Recall: 0.6944 | F1: 0.7692 | TP: 75 | FP: 12 | FN: 33
Threshold: 0.80 | Precision: 0.8816 | Recall: 0.6204 | F1: 0.7283 | TP: 67 | FP: 9 | FN: 41
Threshold: 0.90 | Precision: 0.9265 | Recall: 0.5833 | F1: 0.7159 | TP: 63 | FP: 5 | FN: 45
Threshold: 0.95 | Precision: 0.9492 | Recall: 0.5185 | F1: 0.6707 | TP: 56 | FP: 3 | FN: 52
Threshold: 0.99 | Precision: 1.0000 | Recall: 0.2315 | F1: 0.3759 | TP: 25 | FP: 0 | FN: 83
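
Each row in the tables above follows directly from its confusion counts. As a sanity check, this standalone sketch recomputes precision, recall, and F1 for the Random Forest row at threshold 0.50 (TP = 83, FP = 17, FN = 25). The function name `prf_from_counts` is hypothetical.

```python
def prf_from_counts(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0.0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom > 0 else 0.0
    return precision, recall, f1


precision, recall, f1 = prf_from_counts(tp=83, fp=17, fn=25)
print(f"Precision: {precision:.4f} | Recall: {recall:.4f} | F1: {f1:.4f}")
# Precision: 0.8300 | Recall: 0.7685 | F1: 0.7981
```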


## 4. Select Best Model & Threshold

In [19]:
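# Choose the threshold with the highest F1 among those tested.
# Note: the selection uses the test set, so the reported metrics are somewhat
# optimistic; a separate validation split would give a cleaner estimate.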
best_lr = df_results_lr.loc[df_results_lr['f1'].idxmax()]
best_rf = df_results_rf.loc[df_results_rf['f1'].idxmax()]

print("\n" + "="*80)
print("BEST THRESHOLD SELECTION (Based on F1 Score)")
print("="*80)

print(f"\nLogistic Regression:")
print(f"   Best Threshold: {best_lr['threshold']:.2f}")
print(f"   Precision: {best_lr['precision']:.4f}")
print(f"   Recall: {best_lr['recall']:.4f}")
print(f"   F1 Score: {best_lr['f1']:.4f}")
print(f"   PR-AUC: {pr_auc_lr:.6f}")

print(f"\nRandom Forest:")
print(f"   Best Threshold: {best_rf['threshold']:.2f}")
print(f"   Precision: {best_rf['precision']:.4f}")
print(f"   Recall: {best_rf['recall']:.4f}")
print(f"   F1 Score: {best_rf['f1']:.4f}")
print(f"   PR-AUC: {pr_auc_rf:.6f}")


BEST THRESHOLD SELECTION (Based on F1 Score)

Logistic Regression:
   Best Threshold: 0.90
   Precision: 0.6343
   Recall: 0.7870
   F1 Score: 0.7025
   PR-AUC: 0.632238

Random Forest:
   Best Threshold: 0.50
   Precision: 0.8300
   Recall: 0.7685
   F1 Score: 0.7981
   PR-AUC: 0.784108


In [20]:
# Select final model (based on PR-AUC)
if pr_auc_lr >= pr_auc_rf:
    final_model = model_lr
    final_model_name = "Logistic Regression"
    final_pr_auc = pr_auc_lr
    final_threshold = best_lr['threshold']
    final_precision = best_lr['precision']
    final_recall = best_lr['recall']
    final_f1 = best_lr['f1']
else:
    final_model = model_rf
    final_model_name = "Random Forest"
    final_pr_auc = pr_auc_rf
    final_threshold = best_rf['threshold']
    final_precision = best_rf['precision']
    final_recall = best_rf['recall']
    final_f1 = best_rf['f1']

print("\n" + "="*80)
print("FINAL MODEL SELECTION")
print("="*80)
print(f"\nSelected Model: {final_model_name}")
print(f"PR-AUC: {final_pr_auc:.6f}")
print(f"\nRecommended Operating Point:")
print(f"   Threshold: {final_threshold:.2f}")
print(f"   Precision: {final_precision:.4f}")
print(f"   Recall: {final_recall:.4f}")
print(f"   F1 Score: {final_f1:.4f}")

print(f"\nRationale for threshold selection:")
print(f"   - Threshold {final_threshold:.2f} maximizes F1 score")
print(f"   - Balances precision ({final_precision:.2%}) and recall ({final_recall:.2%})")
print(f"   - In production, adjust based on business cost of false positives vs false negatives")


FINAL MODEL SELECTION

Selected Model: Random Forest
PR-AUC: 0.784108

Recommended Operating Point:
   Threshold: 0.50
   Precision: 0.8300
   Recall: 0.7685
   F1 Score: 0.7981

Rationale for threshold selection:
   - Threshold 0.50 maximizes F1 score
   - Balances precision (83.00%) and recall (76.85%)
   - In production, adjust based on business cost of false positives vs false negatives


## 5. Save Models & Results

In [21]:
import json

# Save Logistic Regression model; overwrite() replaces any existing model directory
model_lr_path = str(MODELS / "fraud_lr_model")
model_lr.write().overwrite().save(model_lr_path)
print(f"Logistic Regression model saved to: {model_lr_path}")

# Save Random Forest model
model_rf_path = str(MODELS / "fraud_rf_model")
model_rf.write().overwrite().save(model_rf_path)
print(f"Random Forest model saved to: {model_rf_path}")

# Save threshold analysis results
df_results_lr.to_csv(AUDIT / "threshold_analysis_lr.csv", index=False)
df_results_rf.to_csv(AUDIT / "threshold_analysis_rf.csv", index=False)
print(f"\nThreshold analysis saved to: {AUDIT}")

# Save final model selection summary
final_summary = {
    "selected_model": final_model_name,
    "pr_auc": float(final_pr_auc),
    "recommended_threshold": float(final_threshold),
    "precision_at_threshold": float(final_precision),
    "recall_at_threshold": float(final_recall),
    "f1_at_threshold": float(final_f1),
    "model_comparison": {
        "logistic_regression_pr_auc": float(pr_auc_lr),
        "random_forest_pr_auc": float(pr_auc_rf)
    }
}

with open(AUDIT / "model_selection_summary.json", "w") as f:
    json.dump(final_summary, f, indent=2)

print(f"Model selection summary saved to: {AUDIT / 'model_selection_summary.json'}")

Logistic Regression model saved to: /home/jovyan/work/models/fraud_lr_model
Random Forest model saved to: /home/jovyan/work/models/fraud_rf_model

Threshold analysis saved to: /home/jovyan/work/audit_results
Model selection summary saved to: /home/jovyan/work/audit_results/model_selection_summary.json
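
For the streaming step, a saved pipeline can be reloaded with `PipelineModel.load` and applied to new rows. The following is only a sketch under assumptions: the function name `score_transactions`, the `alert` column, and the default threshold are illustrative, and the incoming DataFrame is assumed to carry the same `V1`..`V28` and `Amount` columns used in training. It needs a PySpark installation (Spark 3.0 or later for `vector_to_array`).

```python
def score_transactions(df, model_path: str, threshold: float = 0.5):
    """Load a saved PipelineModel and flag rows whose fraud probability meets the threshold.

    `df` is a Spark DataFrame containing the feature columns used in training.
    """
    # Imported lazily so this module can be imported without PySpark installed.
    from pyspark.ml import PipelineModel
    from pyspark.ml.functions import vector_to_array
    from pyspark.sql import functions as F

    model = PipelineModel.load(model_path)
    scored = model.transform(df)
    return (
        scored
        # Element 1 of the probability vector is the class-1 (fraud) probability
        .withColumn("fraud_prob", vector_to_array(F.col("probability"))[1])
        .withColumn("alert", (F.col("fraud_prob") >= F.lit(threshold)).cast("int"))
    )


# Example call (new_sdf is a hypothetical DataFrame of incoming transactions):
# alerts = score_transactions(new_sdf, "/home/jovyan/work/models/fraud_rf_model", threshold=0.5)
```

The saved pipelines include their feature-assembly stages, so the caller passes raw columns rather than a pre-built feature vector.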


## 6. Task A Deliverables Summary

### Deliverable A.1: Audit Summary Table
See the audit summary table in Section 1 and the saved CSV file (`audit_results/data_audit_summary.csv`).

### Deliverable A.2: Model Performance
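- **Selected Model:** Random Forest, chosen for its higher PR-AUC (Logistic Regression scored 0.632238)
- **PR-AUC:** 0.784108 on the test set
- **Recommended Threshold:** 0.50, the F1-maximizing value among the seven thresholds tested (precision 0.8300, recall 0.7685, F1 0.7981)
- **Rationale:** The threshold balances precision and recall. In production, adjust it based on: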
  - Cost of false positives (operational overhead of reviewing alerts)
  - Cost of false negatives (financial loss from missed fraud)
  - Alert volume capacity of fraud investigation team
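
One way to make this cost trade-off concrete is to attach a cost to each error type and pick the threshold with the lowest total. The sketch below reuses the Random Forest FP/FN counts from the threshold analysis; the two cost figures are made-up placeholders, not values from this project, and should be replaced with real business estimates.

```python
# (threshold, false positives, false negatives) from the Random Forest table
RF_ROWS = [
    (0.30, 34, 19),
    (0.50, 17, 25),
    (0.70, 12, 33),
    (0.80, 9, 41),
    (0.90, 5, 45),
    (0.95, 3, 52),
    (0.99, 0, 83),
]

# Hypothetical costs (assumptions for illustration only)
COST_PER_FP = 5.0    # analyst time to review one false alert
COST_PER_FN = 100.0  # average loss from one missed fraud


def total_cost(fp: int, fn: int,
               cost_fp: float = COST_PER_FP,
               cost_fn: float = COST_PER_FN) -> float:
    """Total cost of the errors made at one threshold."""
    return fp * cost_fp + fn * cost_fn


costs = {thr: total_cost(fp, fn) for thr, fp, fn in RF_ROWS}
best_threshold = min(costs, key=costs.get)

for thr, cost in costs.items():
    print(f"Threshold {thr:.2f}: total cost {cost:,.0f}")
print(f"Lowest-cost threshold: {best_threshold:.2f}")
```

With these placeholder costs, where a missed fraud is 20 times as expensive as a false alert, the lowest-cost threshold is 0.30 rather than the F1-optimal 0.50, which illustrates why F1 alone may not match the business objective.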

In [22]:
print("\n" + "="*80)
print("TASK A COMPLETE")
print("="*80)
print("\nAll deliverables saved to:")
print(f"   - Audit summary: {AUDIT / 'data_audit_summary.csv'}")
print(f"   - Threshold analysis: {AUDIT}")
print(f"   - Model selection: {AUDIT / 'model_selection_summary.json'}")
print(f"   - Models: {MODELS}")


TASK A COMPLETE

All deliverables saved to:
   - Audit summary: /home/jovyan/work/audit_results/data_audit_summary.csv
   - Threshold analysis: /home/jovyan/work/audit_results
   - Model selection: /home/jovyan/work/audit_results/model_selection_summary.json
   - Models: /home/jovyan/work/models
