# Task 2 - Model Building and Training

# Task 2 To-Do List


#### 1. Data Preparation
- [ ] **Load and Split Data**
  - [ ] Load `creditcard.csv` and `Fraud_Data.csv`
  - [ ] For each dataset:
    - [ ] Separate features (X) and target (y)
    - [ ] Split into train/test (80/20) with `stratify=y`
    - [ ] Verify class distribution in splits
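
The split-and-verify steps above could look roughly like the sketch below. It runs on a small synthetic DataFrame rather than the real files, and the helper name `split_features_target` and the target column name `Class` are assumptions made for illustration; substitute the actual target column of each dataset.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

RANDOM_STATE = 42


def split_features_target(df, target_col, test_size=0.2):
    """Separate features and target, then make a stratified train/test split."""
    X = df.drop(columns=[target_col])
    y = df[target_col]
    return train_test_split(
        X, y, test_size=test_size, stratify=y, random_state=RANDOM_STATE
    )


# Synthetic stand-in for a fraud dataset: 1,000 rows, 5% positive class.
rng = np.random.default_rng(RANDOM_STATE)
demo = pd.DataFrame({
    "amount": rng.exponential(scale=50.0, size=1000),
    "hour": rng.integers(0, 24, size=1000),
    "Class": np.repeat([0, 1], [950, 50]),  # assumed target column name
})

X_train, X_test, y_train, y_test = split_features_target(demo, "Class")

# With stratify=y, the fraud rate should be nearly identical in both splits.
print(f"train fraud rate: {y_train.mean():.3f}")
print(f"test fraud rate:  {y_test.mean():.3f}")
```

Comparing the two printed rates is a quick way to confirm that stratification preserved the class distribution.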

#### 2. Baseline Model (Logistic Regression)
- [ ] **Implementation**
  - [ ] Initialize with `class_weight='balanced'`
  - [ ] Fit on training data
  - [ ] Make predictions on test set

- [ ] **Evaluation**
  - [ ] Calculate and log:
    - [ ] AUC-PR score
    - [ ] F1-Score
    - [ ] Confusion Matrix
  - [ ] Save metrics to `results/baseline_metrics.json`
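
As a minimal sketch of the baseline steps above, the example below fits a class-weighted logistic regression on synthetic imbalanced data and collects the three metrics in a dictionary. AUC-PR is approximated here with scikit-learn's `average_precision_score`, which is one common way to summarize the precision-recall curve; the scaling step, the synthetic data, and the dictionary keys are assumptions, not part of the original plan.

```python
import json

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, confusion_matrix, f1_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

RANDOM_STATE = 42

# Synthetic imbalanced data standing in for the real fraud features.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=RANDOM_STATE
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=RANDOM_STATE
)

# Scaling is fitted on the training data only, inside the pipeline.
baseline = make_pipeline(
    StandardScaler(),
    LogisticRegression(
        class_weight="balanced", max_iter=1000, random_state=RANDOM_STATE
    ),
)
baseline.fit(X_train, y_train)

y_score = baseline.predict_proba(X_test)[:, 1]  # probability of the positive class
y_pred = baseline.predict(X_test)

baseline_metrics = {
    "auc_pr": float(average_precision_score(y_test, y_score)),
    "f1": float(f1_score(y_test, y_pred)),
    "confusion_matrix": confusion_matrix(y_test, y_pred).tolist(),
}
print(json.dumps(baseline_metrics, indent=2))
```

Because every value in `baseline_metrics` is a plain Python type, the dictionary can be written to `results/baseline_metrics.json` with `json.dump` once that directory exists.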

#### 3. Ensemble Model (Choose One)
- [ ] **Model Selection**
  - [ ] Option 1: Random Forest
  - [ ] Option 2: XGBoost
  - [ ] Option 3: LightGBM

- [ ] **Hyperparameter Tuning**
  - [ ] Define parameter grid
  - [ ] Use `RandomizedSearchCV` with 5-fold CV
  - [ ] Optimize for AUC-PR
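
The tuning steps above might be sketched as follows, assuming Random Forest is the chosen ensemble. The parameter ranges, `n_iter=5`, and the synthetic data are illustrative assumptions kept small so the example runs quickly; `scoring="average_precision"` is used as the AUC-PR proxy.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

RANDOM_STATE = 42

# Small synthetic imbalanced dataset so the search finishes in seconds.
X, y = make_classification(
    n_samples=600, n_features=10, weights=[0.9, 0.1], random_state=RANDOM_STATE
)

# Illustrative search space; real ranges depend on the dataset.
param_distributions = {
    "n_estimators": [25, 50, 100],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 5],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(
        class_weight="balanced", random_state=RANDOM_STATE
    ),
    param_distributions=param_distributions,
    n_iter=5,                      # number of sampled parameter combinations
    scoring="average_precision",   # proxy for AUC-PR
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE),
    random_state=RANDOM_STATE,
    n_jobs=1,
)
search.fit(X, y)

print(search.best_params_)
print(f"best CV average precision: {search.best_score_:.3f}")
```

After fitting, `search.best_estimator_` holds the model refit on all of the data passed to `fit`, using the best parameters found.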

#### 4. Cross-Validation
- [ ] **Stratified K-Fold (k=5)**
  - [ ] For each model (baseline + ensemble):
    - [ ] Calculate metrics per fold
    - [ ] Compute mean and std of:
      - [ ] AUC-PR
      - [ ] F1-Score
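
One way to get per-fold metrics with their mean and standard deviation is `cross_validate` with a dictionary of scorers, sketched below for the baseline model on synthetic data. The metric labels `auc_pr` and `f1` are names chosen for this example; the same loop could be repeated for the ensemble model.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

RANDOM_STATE = 42

X, y = make_classification(
    n_samples=1000, n_features=10, weights=[0.9, 0.1], random_state=RANDOM_STATE
)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=RANDOM_STATE)
model = LogisticRegression(class_weight="balanced", max_iter=1000)

# Each scorer produces one score per fold under the key "test_<label>".
cv_results = cross_validate(
    model, X, y, cv=cv, scoring={"auc_pr": "average_precision", "f1": "f1"}
)

cv_summary = {}
for metric in ("auc_pr", "f1"):
    scores = cv_results[f"test_{metric}"]
    cv_summary[metric] = {"mean": float(scores.mean()), "std": float(scores.std())}
    print(f"{metric}: {scores.mean():.3f} +/- {scores.std():.3f}")
```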

#### 5. Model Comparison
- [ ] **Performance Summary**
  - [ ] Create comparison table
  - [ ] Generate visualizations:
    - [ ] ROC curves
    - [ ] Precision-Recall curves
  - [ ] Document model selection decision
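
For the visualizations, a helper along these lines could draw ROC and precision-recall curves for several models on shared axes. The function name `plot_model_curves`, the model labels, and the toy labels and scores are all hypothetical; in practice the scores would come from each model's `predict_proba` on the test set.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
    roc_curve,
)


def plot_model_curves(y_true, scores_by_model):
    """Plot ROC and precision-recall curves for each model side by side."""
    fig, (ax_roc, ax_pr) = plt.subplots(1, 2, figsize=(10, 4))
    for name, y_score in scores_by_model.items():
        fpr, tpr, _ = roc_curve(y_true, y_score)
        precision, recall, _ = precision_recall_curve(y_true, y_score)
        auc = roc_auc_score(y_true, y_score)
        ap = average_precision_score(y_true, y_score)
        ax_roc.plot(fpr, tpr, label=f"{name} (AUC={auc:.2f})")
        ax_pr.plot(recall, precision, label=f"{name} (AP={ap:.2f})")
    ax_roc.set(xlabel="False positive rate", ylabel="True positive rate", title="ROC")
    ax_pr.set(xlabel="Recall", ylabel="Precision", title="Precision-Recall")
    ax_roc.legend()
    ax_pr.legend()
    fig.tight_layout()
    return fig


# Toy labels and scores for two hypothetical models.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
scores_by_model = {
    "baseline": np.array([0.1, 0.4, 0.35, 0.2, 0.8, 0.6, 0.5, 0.9]),
    "ensemble": np.array([0.05, 0.2, 0.3, 0.1, 0.9, 0.4, 0.7, 0.95]),
}
fig = plot_model_curves(y_true, scores_by_model)
```

The returned figure can be written to disk with `fig.savefig(...)`. On heavily imbalanced data the precision-recall panel is usually the more informative of the two.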

#### 6. Save Outputs
- [ ] **Artifacts**
  - [ ] Save trained models to `models/`
  - [ ] Save evaluation plots to `results/plots/`
  - [ ] Update `notebooks/model-training.ipynb` with all steps
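
Saving a trained model and its metrics might look like the sketch below, which uses `joblib` for the model and JSON for the metrics. The helper `save_artifacts`, the file naming scheme, and the placeholder metric value are assumptions; the demo writes to a temporary directory, whereas the notebook would pass `models/` and `results/`.

```python
import json
import tempfile
from pathlib import Path

import joblib
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression


def save_artifacts(model, metrics, models_dir, results_dir, name):
    """Persist a fitted model (joblib) and its metrics (JSON); return both paths."""
    models_dir, results_dir = Path(models_dir), Path(results_dir)
    models_dir.mkdir(parents=True, exist_ok=True)
    results_dir.mkdir(parents=True, exist_ok=True)
    model_path = models_dir / f"{name}.joblib"
    metrics_path = results_dir / f"{name}_metrics.json"
    joblib.dump(model, model_path)
    metrics_path.write_text(json.dumps(metrics, indent=2))
    return model_path, metrics_path


X, y = make_classification(n_samples=200, random_state=42)
model = LogisticRegression(max_iter=1000).fit(X, y)

# Demo in a temporary directory; the metric value is a placeholder, not a result.
with tempfile.TemporaryDirectory() as tmp:
    model_path, metrics_path = save_artifacts(
        model, {"f1": 0.5}, Path(tmp) / "models", Path(tmp) / "results", "baseline"
    )
    reloaded = joblib.load(model_path)
    same_predictions = bool((reloaded.predict(X) == model.predict(X)).all())
    saved_metrics = json.loads(metrics_path.read_text())

print(f"reloaded model matches original: {same_predictions}")
```

Reloading the model and comparing predictions, as done here, is a cheap check that the saved artifact is usable.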

#### 7. Final Checks
- [ ] **Code Review**
  - [ ] Ensure reproducibility
  - [ ] Verify all rubric items are addressed
  - [ ] Check for hardcoded values

## Data loading and preparation

```python
# Data manipulation
import pandas as pd
import numpy as np
from collections import Counter

# Machine Learning
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

# Utilities
import os
import json
from pathlib import Path
import warnings
warnings.filterwarnings('ignore')

# Set random seed for reproducibility
RANDOM_STATE = 42
np.random.seed(RANDOM_STATE)

# Set display options
pd.set_option('display.float_format', '{:.2f}'.format)
pd.set_option('display.max_columns', None)

# Create directories if they don't exist
os.makedirs('../data/processed', exist_ok=True)
os.makedirs('../results', exist_ok=True)
```