<a href="https://colab.research.google.com/github/sr6awi/ieee_fraud_detection/blob/main/notebooks/01_scoping.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# IEEE-CIS Fraud Detection — 01_scoping.ipynb
**Phase:** Scoping  
**Owner:** Salem Ihsan Abidrabbu  
**Last updated:** 2025-09-30 (Asia/Dubai)  

> Use this notebook ONLY for scoping decisions. No EDA/modeling here. Keep it crisp and actionable.

---

## 1) Problem framing
**Business context**  
- ☑ *Decision supported*: Approve, decline, or send transaction for manual review/OTP.  
- ☑ *Primary stakeholders*: Risk & Fraud team (primary), Payments team (integration), Customer support (disputes).  
- ☑ *Downstream systems*: Rules engine + case management tool.  
- ☑ *Decision latency budget*: <200ms online scoring.  

**ML task**  
- ☑ *Type*: Supervised binary classification (`isFraud`).  
- ☑ *Prediction timing*: Authorization-time (before payment completes).  
- ☑ *Decisioning approach*: Probability score [0–1] + threshold bands → high = auto-decline, medium = review/OTP, low = approve.  
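
The banded decisioning described above can be sketched as a small mapping from score to action. This is only an illustration: the `Action` enum, the `decide` function, and both cut-off values (0.30 and 0.90) are hypothetical placeholders, not thresholds agreed with the Risk team.

```python
from enum import Enum


class Action(Enum):
    APPROVE = "approve"
    REVIEW = "review_or_otp"
    DECLINE = "auto_decline"


# Hypothetical cut-offs for illustration only; real values come out of the
# thresholding phase and need sign-off from Ops.
REVIEW_THRESHOLD = 0.30
DECLINE_THRESHOLD = 0.90


def decide(score: float,
           review_at: float = REVIEW_THRESHOLD,
           decline_at: float = DECLINE_THRESHOLD) -> Action:
    """Map a fraud probability in [0, 1] to one of three actions."""
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"score must be in [0, 1], got {score}")
    if score >= decline_at:
        return Action.DECLINE
    if score >= review_at:
        return Action.REVIEW
    return Action.APPROVE
```

Keeping the two cut-offs as parameters means the bands can be re-tuned later without touching the decision logic.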

---

## 2) Objectives
**Business objectives (KPIs)**  
- ☑ Reduce chargeback loss by: **20–30%**.  
- ☑ Reduce manual review workload by: **30–40%**.  
- ☑ Keep false-positive rate below: **1–2%**.  

**ML objectives (offline)**  
- ☑ Achieve **PR-AUC ≥ 0.40**, **ROC-AUC ≥ 0.95**.  
- ☑ Recall@Top 5% ≥ **80%**.  
- ☑ Calibration: **ECE ≤ 0.05**.  
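
PR-AUC and ROC-AUC are available in common libraries, but Recall@Top 5% and ECE usually need a few lines of custom code. A minimal NumPy sketch follows. The function names are hypothetical, and the ECE variant shown uses equal-width bins, which is one common definition and an assumption here, since the notebook does not specify the binning scheme.

```python
import numpy as np


def recall_at_top_frac(y_true, y_score, frac=0.05):
    """Share of all frauds captured in the top `frac` highest-scored rows."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    k = max(1, int(np.ceil(frac * len(y_true))))
    top_k = np.argsort(-y_score, kind="stable")[:k]
    total_pos = y_true.sum()
    if total_pos == 0:
        return float("nan")  # recall is undefined without positives
    return float(y_true[top_k].sum() / total_pos)


def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |observed rate - mean predicted prob|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    # Bin index per prediction; clip so that prob == 1.0 lands in the last bin.
    bin_ids = np.minimum((y_prob * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by the share of rows in the bin
    return float(ece)
```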

---

## 3) Success criteria & guardrails
- ☑ **Go/No-Go**: Go only if PR-AUC ≥ 0.40 AND latency <200ms; otherwise No-Go.  
- ☑ **Fairness/Abuse**: No protected attributes; monitor proxy leakage.  
- ☑ **Privacy**: No raw PII in logs; hash/tokenize identifiers.  
- ☑ **Observability**: Monitor latency, throughput, error rate, score distribution.  
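
For the privacy guardrail, one possible way to tokenize identifiers before they reach logs is a keyed hash. The sketch below uses HMAC-SHA256 from the standard library. The function name, the lower-casing normalization, and the key handling are assumptions for illustration; in practice the key would come from a secret store, not from source code.

```python
import hashlib
import hmac


def tokenize_identifier(value: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 token (hex string) for an identifier."""
    # Normalize so that "User@Example.com " and "user@example.com" match.
    normalized = value.strip().lower().encode("utf-8")
    return hmac.new(secret_key, normalized, hashlib.sha256).hexdigest()


# Placeholder key for the example only; never hard-code a real key.
EXAMPLE_KEY = b"replace-me-with-a-managed-secret"
token = tokenize_identifier("User@Example.com ", EXAMPLE_KEY)
```

A keyed hash is used instead of a plain hash so that tokens cannot be reversed by hashing a list of likely identifiers without the key.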



## 4) Constraints & risks
- **Data imbalance**: Fraud ≈ 3–5% → must use PR-AUC, cost-sensitive metrics, class weights, or resampling.  
- **Leakage risk**: Features like `TransactionDT` or identity joins may leak future info. Split carefully by **time/user**.  
- **Compute limits**: Running on Colab GPUs (T4/V100) → careful memory handling, chunked data loading.  
- **Regulatory**: Some stakeholders may require explainability → add reason codes or interpretable overlays.  
- **Operational trade-off**: Threshold tuning affects customer friction (false positives vs missed fraud). Must align with Ops.  
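
As a small illustration of the class-weight option mentioned under data imbalance, the sketch below computes the common "balanced" heuristic, `n_samples / (n_classes * class_count)`. The function name is hypothetical and the 4% fraud rate is a made-up example inside the 3–5% range quoted above.

```python
import numpy as np


def balanced_class_weights(y):
    """Weights n_samples / (n_classes * class_count), keyed by class label."""
    y = np.asarray(y)
    classes, counts = np.unique(y, return_counts=True)
    weights = len(y) / (len(classes) * counts)
    return {int(c): float(w) for c, w in zip(classes, weights)}


# Toy labels: 96 legitimate rows and 4 fraud rows (4% fraud).
toy_labels = [0] * 96 + [1] * 4
toy_weights = balanced_class_weights(toy_labels)  # fraud weight is 12.5
```

With these toy labels each fraud row counts 24 times as much as a legitimate one (12.5 vs. roughly 0.52), which matches the negative-to-positive ratio of 96/4.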

## 5) Data sources & access
- **Primary dataset**: IEEE-CIS Fraud Detection (train/test CSVs).  
- **Local/Colab path**: `/content/data/ieee/` (adjust if needed).  
- **Storage plan**: Keep raw read-only; save processed under `/content/data/ieee/processed/`.  
- **Sensitive fields**: Email domains, device/browser info → treat as quasi-identifiers.  
- **Data dictionary**: Build during EDA (separate notebook).  

> If running in Colab: mount Google Drive and copy dataset into `/content/data/ieee/raw/`.  
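
The Colab step above could look roughly like this. Both helper names are hypothetical, and the Drive folder in the commented usage is an assumed location that needs to be replaced with wherever the CSVs actually live.

```python
import shutil
from pathlib import Path


def mount_drive_if_colab(mount_point="/content/drive"):
    """Mount Google Drive when running inside Colab; return False elsewhere."""
    try:
        from google.colab import drive  # only importable inside Colab
    except ImportError:
        return False
    drive.mount(mount_point)
    return True


def copy_raw_csvs(src_dir, dst_dir):
    """Copy every *.csv from src_dir into dst_dir; return the copied file names."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for csv_path in sorted(src.glob("*.csv")):
        shutil.copy2(csv_path, dst / csv_path.name)
        copied.append(csv_path.name)
    return copied


# Usage (the Drive folder name is an assumption):
# if mount_drive_if_colab():
#     copy_raw_csvs("/content/drive/MyDrive/ieee_fraud", "/content/data/ieee/raw")
```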



```python
# ==== Project config (scoping phase) ====
import random
from dataclasses import dataclass
from pathlib import Path

import numpy as np

@dataclass
class Config:
    PROJECT: str = "ieee_fraud_detection"
    PHASE: str   = "scoping"
    SEED: int    = 42
    # Change this path if needed (Colab vs local).
    # RAW_DIR and PROC_DIR are derived from this default when the class is
    # defined, so edit DATA_DIR here rather than passing it to Config().
    DATA_DIR: Path = Path("/content/data/ieee")
    RAW_DIR:  Path = DATA_DIR / "raw"
    PROC_DIR: Path = DATA_DIR / "processed"
    ARTIFACTS_DIR: Path = Path("/content/artifacts")

CFG = Config()

# Reproducibility
random.seed(CFG.SEED)
np.random.seed(CFG.SEED)

# Create folders (no-op if they already exist)
for p in [CFG.DATA_DIR, CFG.RAW_DIR, CFG.PROC_DIR, CFG.ARTIFACTS_DIR]:
    p.mkdir(parents=True, exist_ok=True)

CFG
```


```text
Config(PROJECT='ieee_fraud_detection', PHASE='scoping', SEED=42, DATA_DIR=PosixPath('/content/data/ieee'), RAW_DIR=PosixPath('/content/data/ieee/raw'), PROC_DIR=PosixPath('/content/data/ieee/processed'), ARTIFACTS_DIR=PosixPath('/content/artifacts'))
```

## 6) Evaluation plan (offline, scoping view)
- **Metric focus**:  
  - PR-AUC (robust for class imbalance)  
  - ROC-AUC  
  - Recall@k and Precision@k (for Ops review capacity)  
  - FPR at fixed TPR (business trade-offs)  

- **Thresholding**:  
  - Pick operating point via **cost matrix** (fraud loss vs. review cost vs. customer friction).  

- **Baselines planned** (to implement in modeling phase):  
  - Dummy predictor (prevalence baseline)  
  - Logistic Regression (with class weights)  
  - LightGBM / XGBoost (tree-based, handles high-dim tabular)  
  - (Optional) CatBoost (handles categorical features natively)  

- **Outputs**:  
  - Probability score [0,1]  
  - Optional “reason codes” (from interpretable features/rules) for Ops.

---
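
To make the cost-matrix thresholding in Section 6 concrete, here is a minimal sketch that scans a grid of thresholds and keeps the one with the lowest total cost. It is simplified to a single threshold with two costs (missed fraud and a wrongly flagged transaction); the function name and any cost values passed in are hypothetical, and a three-band version with a separate review cost would extend the same idea.

```python
import numpy as np


def pick_threshold(y_true, y_prob, cost_fn, cost_fp, grid=None):
    """Return (threshold, total_cost) minimising cost_fn * FN + cost_fp * FP."""
    y_true = np.asarray(y_true).astype(bool)
    y_prob = np.asarray(y_prob, dtype=float)
    if grid is None:
        grid = np.linspace(0.0, 1.0, 101)
    best_t, best_cost = None, float("inf")
    for t in grid:
        flagged = y_prob >= t
        fn = np.sum(y_true & ~flagged)   # missed fraud
        fp = np.sum(~y_true & flagged)   # legitimate transactions flagged
        cost = cost_fn * fn + cost_fp * fp
        if cost < best_cost:             # ties keep the lowest threshold
            best_t, best_cost = float(t), float(cost)
    return best_t, best_cost
```

The threshold should be chosen on validation data and then confirmed once on the locked test set, so the operating point is not tuned to the final holdout.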


## 7) Experimental design & splitting strategy
- **Primary split**: **Time-based** using `TransactionDT` → ensures no future info leaks into training.  
- **Group integrity**: Keep all records of a user/device/card in the same fold (avoid leakage).  
- **Validation strategy**:  
  - Time-based train/val/test (e.g., 70/15/15)  
  - GroupKFold (by user/card) as a backup.  
- **Holdout test set**: Final test locked and only evaluated once.  
- **Hyperparameter tuning**:  
  - Early stopping on validation.  
  - Optionally nested CV if time/resources allow.
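
A minimal sketch of the time-based 70/15/15 split plus a group-integrity check is shown below. The function names are hypothetical, and the group key (user, device, or card) is an assumption, since that key still has to be defined during EDA. How to resolve groups that span splits (drop them, or reassign them to the earliest split) is left open here.

```python
import numpy as np


def time_based_split(transaction_dt, train_frac=0.70, val_frac=0.15):
    """Return (train_idx, val_idx, test_idx), ordered by ascending time."""
    order = np.argsort(np.asarray(transaction_dt), kind="stable")
    n = len(order)
    n_train = int(round(n * train_frac))
    n_val = int(round(n * val_frac))
    return (order[:n_train],
            order[n_train:n_train + n_val],
            order[n_train + n_val:])


def groups_spanning_splits(groups, *index_sets):
    """Return group keys that appear in more than one split (integrity check)."""
    groups = np.asarray(groups)
    seen, leaking = set(), set()
    for idx in index_sets:
        current = set(groups[idx].tolist())
        leaking |= seen & current
        seen |= current
    return leaking
```

An empty set from `groups_spanning_splits` means no group crosses a split boundary; a non-empty set lists the keys that need attention before training.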

## 8) Deliverables & 5-day timeline
**Planned deliverables**
- ☑ `01_scoping.ipynb` (this file)  
- ☐ `02_eda.ipynb` (schema check, leakage checks, nulls, drift)  
- ☐ `03_feature_engineering.ipynb` (encodings, aggregates, feature pipelines)  
- ☐ `04_modeling.ipynb` (baselines → tuned models)  
- ☐ `05_evaluation_thresholding.ipynb` (cost matrix, thresholding, operating point)  
- ☐ `06_deployment_plan.md` (inference graph, latency, monitoring plan)  

**Target timeline**
- **Day 1:** Scoping (this notebook) ☑  
- **Day 2:** EDA + leakage checks ☐  
- **Day 3:** Baselines (LR, LGBM) ☐  
- **Day 4:** Feature engineering + tuning ☐  
- **Day 5:** Thresholding + packaging + final report ☐  

---

## 9) Approvals & next actions
**Decisions to finalize now**
- ☐ Confirm business KPI targets with stakeholders.  
- ☐ Confirm latency budget (<200ms) with Ops/Payments.  
- ☐ Approve data split strategy (time-based + group integrity).  

**Next actions**
- ☐ Fill in data dictionary during `02_eda.ipynb`.  
- ☐ Upload raw dataset to `/content/data/ieee/raw/`.  
- ☐ Open GitHub issues for each deliverable (copy from Section 8 checklist).  