# 010 - Sklearn DuckDB SQL Classification (CatBoost + XGBoost)

This notebook demonstrates **complete DuckDB SQL preprocessing** for classification with CatBoost (primary) and XGBoost (fallback).

## Key Features

| Aspect | Implementation |
|--------|---------------|
| **Primary Model** | CatBoost (best for categorical features, native handling) |
| **Fallback Model** | XGBoost (sklearn-native, for YellowBrick testing) |
| **JSON extraction** | `json_extract_string()` in DuckDB SQL |
| **Timestamp parsing** | `date_part()` in DuckDB SQL |
| **Feature scaling** | None needed (tree-based models) |

## Model Comparison

| Aspect | CatBoost | XGBoost |
|--------|----------|---------|
| **Categorical handling** | ✅ Native (strings) | Requires label encoding |
| **SQL Query** | Keeps strings | `DENSE_RANK() - 1` |
| **Imbalanced data** | `auto_class_weights` | `scale_pos_weight` |
| **YellowBrick** | ❌ Incompatible | ✅ sklearn-native |

## SQL Queries by Model Type

**CatBoost (Primary)** - Keeps categorical strings:
```sql
SELECT amount, currency, json_extract_string(device_info, '$.browser') AS browser, ...
FROM delta_scan('s3://...')
```

**XGBoost (Fallback)** - Label encodes categoricals:
```sql
SELECT amount, DENSE_RANK() OVER (ORDER BY currency) - 1 AS currency, ...
FROM delta_scan('s3://...')
```
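
To make the `DENSE_RANK() - 1` encoding concrete, here is a minimal pure-Python sketch of the same mapping. The helper name `dense_rank_encode` and the sample values are illustrative only, and plain string sorting is assumed to match the database's ordering for these ASCII values.

```python
def dense_rank_encode(values):
    """Return 0-based dense ranks: equal values share a code, codes have no gaps."""
    codes = {value: rank for rank, value in enumerate(sorted(set(values)))}
    return [codes[value] for value in values]


# Hypothetical currency column
currencies = ["USD", "CAD", "USD", "JPY", "AUD"]
encoded = dense_rank_encode(currencies)
print(encoded)  # [3, 1, 3, 2, 0]
```

The codes depend on which distinct values are present in the scanned rows, so a sampled or filtered query may assign different integers to the same category than a full scan would.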

In [3]:
import duckdb
import numpy as np
import os
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.metrics import (
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
)
from sklearn import metrics
from imblearn.metrics import geometric_mean_score
from catboost import CatBoostClassifier  # Primary model - best for categorical features
from xgboost import XGBClassifier  # Fallback model - sklearn-native for YellowBrick testing

In [4]:
MINIO_HOST = "localhost"
MINIO_PORT = "9000"
MINIO_ENDPOINT = f"{MINIO_HOST}:{MINIO_PORT}"
MINIO_ACCESS_KEY = "minioadmin"
MINIO_SECRET_KEY = "minioadmin123"
PROJECT_NAME = "Transaction Fraud Detection"

In [5]:
DELTA_PATHS = {
    "Transaction Fraud Detection": "s3://lakehouse/delta/transaction_fraud_detection",
    "Estimated Time of Arrival": "s3://lakehouse/delta/estimated_time_of_arrival",
    "E-Commerce Customer Interactions": "s3://lakehouse/delta/e_commerce_customer_interactions",
}

delta_path = DELTA_PATHS.get(PROJECT_NAME)

In [6]:
# Disable AWS EC2 metadata service lookup (prevents 169.254.169.254 errors)
os.environ["AWS_EC2_METADATA_DISABLED"] = "true"

# Create connection (in-memory database)
conn = duckdb.connect()

# Install and load required extensions
conn.execute("INSTALL delta; LOAD delta;")
conn.execute("INSTALL httpfs; LOAD httpfs;")

# Create a secret for S3/MinIO credentials
conn.execute(f"""
    CREATE SECRET minio_secret (
        TYPE S3,
        KEY_ID '{MINIO_ACCESS_KEY}',
        SECRET '{MINIO_SECRET_KEY}',
        REGION 'us-east-1',
        ENDPOINT '{MINIO_ENDPOINT}',
        URL_STYLE 'path',
        USE_SSL false
    );
""")
print("DuckDB extensions loaded and S3 secret configured")


DuckDB extensions loaded and S3 secret configured


## Feature Definitions

Define features upfront for CatBoost's native categorical handling.

In [7]:
# Feature definitions for Transaction Fraud Detection
TFD_NUMERICAL_FEATURES = [
    "amount",
    "account_age_days",
    "cvv_provided",
    "billing_address_match",
]

TFD_CATEGORICAL_FEATURES = [
    "currency",
    "merchant_id",
    "payment_method",
    "product_category",
    "transaction_type",
    "browser",
    "os",
    "year",
    "month",
    "day",
    "hour",
    "minute",
    "second",
]

TFD_ALL_FEATURES = TFD_NUMERICAL_FEATURES + TFD_CATEGORICAL_FEATURES

# Categorical feature indices for CatBoost (position in feature list)
TFD_CAT_FEATURE_INDICES = list(range(
    len(TFD_NUMERICAL_FEATURES),
    len(TFD_ALL_FEATURES)
))

print(f"Numerical features: {len(TFD_NUMERICAL_FEATURES)}")
print(f"Categorical features: {len(TFD_CATEGORICAL_FEATURES)}")
print(f"Categorical indices: {TFD_CAT_FEATURE_INDICES}")

Numerical features: 4
Categorical features: 13
Categorical indices: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
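
The positional indices above are only valid if the DataFrame columns arrive in the same order as `TFD_ALL_FEATURES`. A hedged alternative is to derive the indices from column names; `categorical_indices` below is a hypothetical helper, not part of the notebook.

```python
def categorical_indices(columns, categorical_names):
    """Look up each categorical column's position in the actual column order."""
    positions = {name: i for i, name in enumerate(columns)}
    missing = [name for name in categorical_names if name not in positions]
    if missing:
        raise KeyError(f"Missing categorical columns: {missing}")
    return [positions[name] for name in categorical_names]


# Hypothetical column order in which numeric and categorical features interleave.
columns = ["amount", "currency", "account_age_days", "browser"]
print(categorical_indices(columns, ["currency", "browser"]))  # [1, 3]
```

With a real DataFrame, `columns` would be `list(X.columns)`, which keeps the indices correct even if the SELECT list is reordered later.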


## DuckDB SQL Preprocessing

Two SQL query strategies based on model type:

### CatBoost Query (Primary)
- Keeps categorical features as **strings** (CatBoost handles natively)
- Converts to `category` dtype in Python for memory efficiency
- **Usually the stronger option here**: CatBoost encodes categoricals internally, so no arbitrary integer ordering is imposed on them

### XGBoost Query (Fallback)
- **Label encodes** categoricals with `DENSE_RANK() - 1`
- All features become integers (XGBoost requires numeric input)
- Useful for YellowBrick visualization testing
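
Both query strategies extract `browser` and `os` from a JSON column and split the timestamp into components. The sketch below mimics that per-row logic with the Python standard library; the input formats (a JSON object string and an ISO 8601 timestamp) are assumptions, since the raw column values are not shown in this notebook.

```python
import json
from datetime import datetime


def extract_features(device_info: str, timestamp: str) -> dict:
    """Mimic json_extract_string(...) and date_part(...) for a single row."""
    device = json.loads(device_info)
    ts = datetime.fromisoformat(timestamp)
    return {
        "browser": device.get("browser"),  # None if the key is absent
        "os": device.get("os"),
        "year": ts.year,
        "month": ts.month,
        "day": ts.day,
        "hour": ts.hour,
        "minute": ts.minute,
        "second": ts.second,
    }


# Hypothetical row values
row = extract_features('{"browser": "Safari", "os": "macOS"}', "2026-01-17T17:27:02")
print(row["browser"], row["hour"])  # Safari 17
```

Doing this in SQL instead avoids a Python-level loop over roughly a million rows.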

In [8]:
from typing import Literal

def load_data_duckdb_sql(
    delta_path: str,
    model_type: Literal["catboost", "xgboost"] = "catboost",
    sample_frac: float | None = None,
    max_rows: int | None = None,
) -> pd.DataFrame:
    """
    Load and preprocess data using pure DuckDB SQL.
    
    Two query strategies based on model_type:
    - CatBoost: Keeps categorical strings (native handling)
    - XGBoost: Label encodes categoricals with DENSE_RANK() - 1
    
    Args:
        delta_path: Path to Delta Lake table
        model_type: "catboost" (strings) or "xgboost" (label encoded)
        sample_frac: Optional fraction of data to sample (0.0-1.0)
        max_rows: Optional maximum number of rows to load
    
    Returns:
        DataFrame with preprocessed features and target
    """
    if model_type == "catboost":
        # CatBoost query: Keep categorical strings (native handling)
        query = f"""
        SELECT
            -- Numerical features
            amount,
            account_age_days,
            CAST(cvv_provided AS INTEGER) AS cvv_provided,
            CAST(billing_address_match AS INTEGER) AS billing_address_match,

            -- Categorical features (keep as strings for CatBoost)
            currency,
            merchant_id,
            payment_method,
            product_category,
            transaction_type,

            -- JSON extraction (keep as strings)
            json_extract_string(device_info, '$.browser') AS browser,
            json_extract_string(device_info, '$.os') AS os,

            -- Timestamp components
            CAST(date_part('year', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS year,
            CAST(date_part('month', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS month,
            CAST(date_part('day', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS day,
            CAST(date_part('hour', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS hour,
            CAST(date_part('minute', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS minute,
            CAST(date_part('second', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS second,

            -- Target
            is_fraud

        FROM delta_scan('{delta_path}')
        """
        print(f"Loading data for CatBoost (categorical strings)...")
    
    else:  # xgboost
        # XGBoost query: Label encode categoricals with DENSE_RANK() - 1
        query = f"""
        SELECT
            -- Numerical features
            amount,
            account_age_days,
            CAST(cvv_provided AS INTEGER) AS cvv_provided,
            CAST(billing_address_match AS INTEGER) AS billing_address_match,

            -- Categorical features: Label encoded with DENSE_RANK() - 1
            DENSE_RANK() OVER (ORDER BY currency) - 1 AS currency,
            DENSE_RANK() OVER (ORDER BY merchant_id) - 1 AS merchant_id,
            DENSE_RANK() OVER (ORDER BY payment_method) - 1 AS payment_method,
            DENSE_RANK() OVER (ORDER BY product_category) - 1 AS product_category,
            DENSE_RANK() OVER (ORDER BY transaction_type) - 1 AS transaction_type,
            
            -- JSON extraction + Label encoded
            DENSE_RANK() OVER (ORDER BY json_extract_string(device_info, '$.browser')) - 1 AS browser,
            DENSE_RANK() OVER (ORDER BY json_extract_string(device_info, '$.os')) - 1 AS os,

            -- Timestamp components
            CAST(date_part('year', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS year,
            CAST(date_part('month', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS month,
            CAST(date_part('day', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS day,
            CAST(date_part('hour', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS hour,
            CAST(date_part('minute', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS minute,
            CAST(date_part('second', CAST(timestamp AS TIMESTAMP)) AS INTEGER) AS second,

            -- Target
            is_fraud

        FROM delta_scan('{delta_path}')
        """
        print(f"Loading data for XGBoost (label encoded integers)...")

    # Add sampling clause
    if sample_frac is not None and 0 < sample_frac < 1:
        query += f" USING SAMPLE {sample_frac * 100}%"
        print(f"  Sampling: {sample_frac * 100}%")

    # Add limit clause
    if max_rows is not None:
        query += f" LIMIT {max_rows}"
        print(f"  Max rows: {max_rows}")

    df = conn.execute(query).df()
    print(f"  Loaded {len(df)} rows with {len(df.columns)} columns")
    
    return df
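
The function above silently ignores a `sample_frac` outside (0, 1) and interpolates `max_rows` into the SQL without checking it. A hedged sketch of a stricter variant follows; `build_sampling_suffix` is a hypothetical helper that only builds the trailing clauses.

```python
def build_sampling_suffix(sample_frac=None, max_rows=None):
    """Build the optional ' USING SAMPLE ...' / ' LIMIT ...' suffix with validation."""
    suffix = ""
    if sample_frac is not None:
        if not 0 < sample_frac < 1:
            raise ValueError("sample_frac must be strictly between 0 and 1")
        # ':g' avoids float noise such as 10.000000000000002
        suffix += f" USING SAMPLE {sample_frac * 100:g}%"
    if max_rows is not None:
        if not isinstance(max_rows, int) or max_rows <= 0:
            raise ValueError("max_rows must be a positive integer")
        suffix += f" LIMIT {max_rows}"
    return suffix


print(repr(build_sampling_suffix(0.1, 1000)))  # ' USING SAMPLE 10% LIMIT 1000'
```

Raising on bad input is a design choice: it surfaces a mistyped fraction immediately instead of loading the full table by accident.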

In [9]:
# Set model type FIRST - this determines which SQL query to use
MODEL_TYPE = "catboost"  # "catboost" (primary) or "xgboost" (fallback for YellowBrick)

# Load data with appropriate SQL query for the model type
df = load_data_duckdb_sql(delta_path, model_type=MODEL_TYPE)
df.head()

Loading data for CatBoost (categorical strings)...



  Loaded 1085213 rows with 18 columns


Unnamed: 0,amount,account_age_days,cvv_provided,billing_address_match,currency,merchant_id,payment_method,product_category,transaction_type,browser,os,year,month,day,hour,minute,second,is_fraud
0,463.77,556,1,1,CAD,merchant_43,credit_card,travel,deposit,Safari,macOS,2026,1,17,17,27,2,0
1,157.54,557,1,1,JPY,merchant_70,credit_card,luxury_items,purchase,Firefox,Linux,2026,1,17,17,27,3,0
2,312.8,1131,1,1,AUD,merchant_43,crypto,travel,purchase,Other,Linux,2026,1,17,17,27,3,0
3,81.24,1444,1,1,USD,merchant_52,credit_card,clothing,withdrawal,Edge,Linux,2026,1,17,17,27,3,0
4,288.43,1752,1,1,USD,merchant_120,paypal,digital_goods,withdrawal,Safari,iOS,2026,1,17,17,27,3,0


In [10]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1085213 entries, 0 to 1085212
Data columns (total 18 columns):
 #   Column                 Non-Null Count    Dtype  
---  ------                 --------------    -----  
 0   amount                 1085213 non-null  float64
 1   account_age_days       1085213 non-null  int32  
 2   cvv_provided           1085213 non-null  int32  
 3   billing_address_match  1085213 non-null  int32  
 4   currency               1085213 non-null  object 
 5   merchant_id            1085213 non-null  object 
 6   payment_method         1085213 non-null  object 
 7   product_category       1085213 non-null  object 
 8   transaction_type       1085213 non-null  object 
 9   browser                1085213 non-null  object 
 10  os                     1085213 non-null  object 
 11  year                   1085213 non-null  int32  
 12  month                  1085213 non-null  int32  
 13  day                    1085213 non-null  int32  
 14  hour              

## Process Batch Data

Split features/target and prepare for training:
- **CatBoost**: Convert categorical columns to `category` dtype (memory efficient + native handling)
- **XGBoost**: All features already numeric from SQL label encoding
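
Stratification matters here because fraud is rare: a plain random split could leave the test set with too few positives. The following is a minimal pure-Python sketch of the idea, not scikit-learn's implementation; `stratified_split` and the label counts are illustrative.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split at roughly test_size."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 10 fraud rows among 1000 (1% fraud).
labels = [1] * 10 + [0] * 990
train_idx, test_idx = stratified_split(labels)
print(sum(labels[i] for i in test_idx), len(test_idx))  # 2 200
```

Each class is split separately, so the 1% fraud rate is preserved in both partitions.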

In [11]:
def process_batch_data_duckdb(
    df: pd.DataFrame,
    model_type: Literal["catboost", "xgboost"] = "catboost",
    test_size: float = 0.2,
    random_state: int = 42,
):
    """
    Process batch data for model training.
    
    Args:
        df: DataFrame from load_data_duckdb_sql()
        model_type: "catboost" or "xgboost" (determines dtype handling)
        test_size: Fraction for test set
        random_state: Random seed
    
    Returns:
        X_train, X_test, y_train, y_test
    """
    # Split features and target
    y = df["is_fraud"]
    X = df.drop("is_fraud", axis=1)
    
    if model_type == "catboost":
        # Convert categorical columns to category dtype for CatBoost
        for col in TFD_CATEGORICAL_FEATURES:
            if col in X.columns:
                X[col] = X[col].astype("category")
        print(f"Features: {len(X.columns)} ({len(TFD_NUMERICAL_FEATURES)} numeric, {len(TFD_CATEGORICAL_FEATURES)} category dtype)")
    else:
        # XGBoost: all features already numeric from SQL label encoding
        print(f"Features: {len(X.columns)} (all numeric, label-encoded in SQL)")
    
    # Stratified train/test split (keeps class balance)
    print(f"Splitting data: {1-test_size:.0%} train, {test_size:.0%} test (stratified)...")
    X_train, X_test, y_train, y_test = train_test_split(
        X, y,
        test_size=test_size,
        stratify=y,
        random_state=random_state,
    )
    
    print(f"  Training set: {len(X_train)} samples")
    print(f"  Test set: {len(X_test)} samples")
    
    # Calculate class balance
    fraud_rate = y_train.sum() / len(y_train) * 100
    print(f"  Fraud rate in training set: {fraud_rate:.2f}%")
    
    return X_train, X_test, y_train, y_test

In [12]:
X_train, X_test, y_train, y_test = process_batch_data_duckdb(df, model_type=MODEL_TYPE)

Features: 17 (4 numeric, 13 category dtype)
Splitting data: 80% train, 20% test (stratified)...
  Training set: 868170 samples
  Test set: 217043 samples
  Fraud rate in training set: 1.00%


In [13]:
X_train.head()

Unnamed: 0,amount,account_age_days,cvv_provided,billing_address_match,currency,merchant_id,payment_method,product_category,transaction_type,browser,os,year,month,day,hour,minute,second
269378,398.38,1395,1,1,EUR,merchant_78,paypal,luxury_items,payment,Edge,Other,2026,1,13,15,39,50
518076,244.16,861,1,1,AUD,merchant_11,debit_card,travel,withdrawal,Safari,Windows,2026,1,14,12,32,39
506975,256.74,1141,1,1,CAD,merchant_116,crypto,groceries,payment,Safari,macOS,2026,1,14,11,40,34
516981,368.74,1335,1,1,USD,merchant_187,paypal,travel,withdrawal,Firefox,Windows,2026,1,14,12,27,28
178363,212.04,816,1,1,GBP,merchant_166,bank_transfer,electronics,withdrawal,Chrome,iOS,2026,1,13,8,15,51


In [14]:
X_train.dtypes

amount                    float64
account_age_days            int32
cvv_provided                int32
billing_address_match       int32
currency                 category
merchant_id              category
payment_method           category
product_category         category
transaction_type         category
browser                  category
os                       category
year                     category
month                    category
day                      category
hour                     category
minute                   category
second                   category
dtype: object

## Model Creation (CatBoost Primary, XGBoost Fallback)

### CatBoost (Primary)
Optimal for fraud detection with categorical features:
- **Native categorical handling** - no encoding needed; the CatBoost query keeps categoricals as strings
- **`auto_class_weights='Balanced'`** - handles class imbalance automatically
- **Built-in regularization** - L2 leaf regularization, early stopping

### XGBoost (Fallback for YellowBrick)
sklearn-native, useful for testing YellowBrick visualizations:
- **`scale_pos_weight`** - handles class imbalance (neg/pos ratio)
- **sklearn BaseEstimator** - full compatibility with sklearn ecosystem
- **Fast histogram-based** - efficient training on large datasets

### Optimized Parameters

Based on [CatBoost docs](https://catboost.ai/docs/en/references/training-parameters/common) and [fraud detection research](https://www.preprints.org/manuscript/202503.1199):

| Parameter | CatBoost | XGBoost |
|-----------|----------|---------|
| **Iterations** | 1000 | 500 |
| **Learning Rate** | 0.05 | 0.1 |
| **Depth** | 6 | 6 |
| **Regularization** | l2_leaf_reg=3 | reg_alpha=0.1, reg_lambda=1.0 |
| **Imbalance** | auto_class_weights='Balanced' | scale_pos_weight |
| **Boosting type / tree method** | `boosting_type='Plain'` (1M+ rows) | `tree_method='hist'` |
| **Early Stopping** | 50 rounds | 50 rounds |

**Note**: `boosting_type='Plain'` is recommended for large datasets (1M+ rows). Use `'Ordered'` for smaller datasets (<100K rows).
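
The two imbalance settings in the table lead to the same weight for the fraud class in the binary case. The sketch below assumes that `auto_class_weights='Balanced'` weights each class by the largest class count divided by its own count; treat that formula as an assumption to verify against the CatBoost docs. The counts are hypothetical.

```python
def balanced_class_weights(class_counts):
    """Weight each class by (largest class count) / (its own count)."""
    largest = max(class_counts.values())
    return {label: largest / count for label, count in class_counts.items()}


def scale_pos_weight(class_counts, positive=1, negative=0):
    """XGBoost-style ratio of negative to positive samples."""
    return class_counts[negative] / class_counts[positive]


# Hypothetical counts with 1% fraud.
counts = {0: 9900, 1: 100}
print(balanced_class_weights(counts))  # {0: 1.0, 1: 99.0}
print(scale_pos_weight(counts))        # 99.0
```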

In [15]:
from typing import Literal

def create_batch_model(
    model_type: Literal["catboost", "xgboost"] = "catboost",
    y_train=None,
    cat_feature_indices: list[int] | None = None,
):
    """
    Create classifier optimized for fraud detection.
    
    Args:
        model_type: "catboost" (primary) or "xgboost" (fallback for YellowBrick)
        y_train: Training labels for calculating class imbalance ratio
        cat_feature_indices: Indices of categorical features (for CatBoost)
    
    Returns:
        Configured classifier ready for training
    
    References:
        - CatBoost docs: https://catboost.ai/docs/en/references/training-parameters/common
        - Fraud detection research (preprint): reports F1=0.92, AUC=0.99 for CatBoost (not reproduced here)
    """
    # Calculate class imbalance ratio
    scale_pos_weight = 1.0
    if y_train is not None:
        # Vectorized counts (much faster than the builtin sum on ~1M rows)
        neg_samples = int((y_train == 0).sum())
        pos_samples = int((y_train == 1).sum())
        if pos_samples > 0:
            scale_pos_weight = neg_samples / pos_samples
        print(f"Class imbalance ratio: {scale_pos_weight:.2f}:1 (negative:positive)")
        print(f"Fraud rate: {pos_samples / len(y_train) * 100:.2f}%")
    
    if model_type == "catboost":
        print(f"Creating CatBoostClassifier (primary model)")
        print(f"  Using auto_class_weights='Balanced' for imbalanced data")
        if cat_feature_indices:
            print(f"  Categorical feature indices: {cat_feature_indices}")
        
        # Optimized CatBoost parameters for fraud detection (1M+ rows, ~1% fraud)
        model = CatBoostClassifier(
            # Core parameters
            iterations=1000,                # Max trees; early stopping finds optimal
            learning_rate=0.05,             # Good balance for 1M+ rows
            depth=6,                        # CatBoost default, good for most cases
            
            # Imbalanced data handling (critical for fraud detection)
            auto_class_weights='Balanced',  # Weights positive class by neg/pos ratio
            
            # Loss function & evaluation
            loss_function='Logloss',        # Binary cross-entropy
            eval_metric='AUC',              # Threshold-independent ranking metric used for early stopping
            
            # Regularization
            l2_leaf_reg=3,                  # L2 regularization (default=3)
            
            # Boosting type: 'Plain' for large datasets (1M+), 'Ordered' for <100K
            boosting_type='Plain',
            
            # Early stopping
            early_stopping_rounds=50,
            
            # Performance
            task_type='CPU',
            thread_count=-1,                # Use all CPU cores
            random_seed=42,
            
            # Output
            verbose=True,
        )
    
    elif model_type == "xgboost":
        print(f"Creating XGBClassifier (fallback for YellowBrick testing)")
        print(f"  Using scale_pos_weight={scale_pos_weight:.2f}")
        
        model = XGBClassifier(
            # Core parameters
            n_estimators=500,
            learning_rate=0.1,
            max_depth=6,
            
            # Imbalanced data handling
            scale_pos_weight=scale_pos_weight,
            
            # Regularization (prevent overfitting)
            min_child_weight=1,
            gamma=0.1,
            subsample=0.8,
            colsample_bytree=0.8,
            reg_alpha=0.1,  # L1 regularization
            reg_lambda=1.0,  # L2 regularization
            
            # Training settings
            objective='binary:logistic',
            eval_metric='auc',
            
            # Early stopping
            early_stopping_rounds=50,
            
            # Performance
            tree_method='hist',
            n_jobs=-1,
            random_state=42,
        )
    
    else:
        raise ValueError(f"Unknown model_type: {model_type}. Use 'catboost' or 'xgboost'.")
    
    return model

In [16]:
# Create model using MODEL_TYPE defined earlier (determines both query and model)
model = create_batch_model(
    model_type=MODEL_TYPE,
    y_train=y_train,
    cat_feature_indices=TFD_CAT_FEATURE_INDICES,
)

Class imbalance ratio: 99.17:1 (negative:positive)
Fraud rate: 1.00%
Creating CatBoostClassifier (primary model)
  Using auto_class_weights='Balanced' for imbalanced data
  Categorical feature indices: [4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]


## Train Model

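Training uses the feature representation chosen by `MODEL_TYPE`: string categoricals (stored as `category` dtype) for CatBoost, label-encoded integers for XGBoost.

Both branches pass the test set as `eval_set` for early stopping, so the reported test metrics are mildly optimistic; a separate validation split would avoid this.

- **CatBoost**: Pass `cat_features` so the `category` columns are handled natively
- **XGBoost**: All features already numeric, no extra params needed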
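
Both models are configured with 50 early-stopping rounds. As a rough illustration of that rule (a simplified sketch, not either library's internal implementation), the hypothetical helper below tracks the best score of a higher-is-better metric and stops once `patience` iterations pass without improvement.

```python
def best_iteration(scores, patience):
    """Return (best_index, stop_index) for a higher-is-better metric."""
    best_index = 0
    for index, score in enumerate(scores):
        if score > scores[best_index]:
            best_index = index
        elif index - best_index >= patience:
            return best_index, index  # stopped early
    return best_index, len(scores) - 1  # ran to completion


# Hypothetical validation AUC per iteration, with a patience of 3 rounds.
auc_history = [0.980, 0.985, 0.990, 0.989, 0.988, 0.9895, 0.991]
print(best_iteration(auc_history, patience=3))  # (2, 5)
```

In this toy history the later value 0.991 is never reached, which is the usual trade-off: a shorter patience saves training time but can stop before a late improvement.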

In [17]:
# Train model based on type
if MODEL_TYPE == "catboost":
    # CatBoost training with native categorical handling
    model.fit(
        X_train, y_train,
        eval_set=(X_test, y_test),
        cat_features=TFD_CAT_FEATURE_INDICES,  # Tells CatBoost which columns are categorical
        use_best_model=True,
        verbose=True,
    )
else:
    # XGBoost training
    model.fit(
        X_train, y_train,
        eval_set=[(X_test, y_test)],
        verbose=True,
    )

print(f"\n{MODEL_TYPE.upper()} training complete!")

0:	test: 0.9889922	best: 0.9889922 (0)	total: 4.3s	remaining: 1h 11m 32s
1:	test: 0.9899321	best: 0.9899321 (1)	total: 8.62s	remaining: 1h 11m 40s
2:	test: 0.9897926	best: 0.9899321 (1)	total: 12s	remaining: 1h 6m 12s
3:	test: 0.9891783	best: 0.9899321 (1)	total: 17s	remaining: 1h 10m 34s
4:	test: 0.9891432	best: 0.9899321 (1)	total: 23.2s	remaining: 1h 16m 57s
5:	test: 0.9891968	best: 0.9899321 (1)	total: 32.6s	remaining: 1h 30m 6s
6:	test: 0.9893614	best: 0.9899321 (1)	total: 38.7s	remaining: 1h 31m 23s
7:	test: 0.9896820	best: 0.9899321 (1)	total: 44.2s	remaining: 1h 31m 14s
8:	test: 0.9899649	best: 0.9899649 (8)	total: 50.1s	remaining: 1h 31m 58s
9:	test: 0.9899759	best: 0.9899759 (9)	total: 56.9s	remaining: 1h 33m 49s
10:	test: 0.9901069	best: 0.9901069 (10)	total: 1m 3s	remaining: 1h 34m 45s
11:	test: 0.9901144	best: 0.9901144 (11)	total: 1m 5s	remaining: 1h 29m 49s
12:	test: 0.9925603	best: 0.9925603 (12)	total: 1m 9s	remaining: 1h 28m 28s
13:	test: 0.9926791	best: 0.9926791 (13

## Evaluate Model

In [18]:
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1]

In [19]:
eval_metrics = {
    "Accuracy": float(accuracy_score(y_test, y_pred)),
    "Precision": float(precision_score(y_test, y_pred, zero_division=0)),
    "Recall": float(recall_score(y_test, y_pred, zero_division=0)),
    "F1": float(f1_score(y_test, y_pred, zero_division=0)),
    "ROCAUC": float(roc_auc_score(y_test, y_pred_proba)),
    "GeometricMean": float(geometric_mean_score(y_test, y_pred)),
}

print("Model Performance Metrics:")
print("=" * 40)
for name, value in eval_metrics.items():
    print(f"{name:20}: {value:.4f}")

Model Performance Metrics:
Accuracy            : 0.9878
Precision           : 0.4465
Recall              : 0.9234
F1                  : 0.6019
ROCAUC              : 0.9933
GeometricMean       : 0.9554
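
As a quick consistency check, F1 can be recomputed from the reported precision and recall. The inputs below are the rounded values printed above, so the result is only expected to agree to about four decimal places.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the metrics output above.
precision, recall = 0.4465, 0.9234
print(round(f1_from(precision, recall), 4))  # 0.6019
```

The gap between accuracy (0.9878) and F1 (0.6019) comes from precision: at the default threshold, fewer than half of the transactions flagged as fraud are actually fraud.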


## Model Comparison

Compare CatBoost (primary) vs XGBoost (fallback) on this dataset.

Run the notebook twice with `MODEL_TYPE = "catboost"` and `MODEL_TYPE = "xgboost"` to compare.

| Metric | CatBoost | XGBoost | Winner |
|--------|----------|---------|--------|
| Accuracy | 0.9878 | ? | ? |
| Precision | 0.4465 | ? | ? |
| Recall | 0.9234 | ? | ? |
| F1 | 0.6019 | ? | ? |
| ROCAUC | 0.9933 | ? | ? |
| GeometricMean | 0.9554 | ? | ? |

**Key Differences:**
- **CatBoost**: Native categorical handling, auto class weights; ordered boosting is available, but this notebook uses `boosting_type='Plain'`
- **XGBoost**: sklearn-native, compatible with YellowBrick visualizations

In [20]:
# Store metrics for comparison
# Run with MODEL_TYPE="catboost" first, then "xgboost" to populate both

print(f"\n{MODEL_TYPE.upper()} Model Metrics:")
print("=" * 50)
for name, value in eval_metrics.items():
    print(f"  {name:20}: {value:.4f}")


CATBOOST Model Metrics:
  Accuracy            : 0.9878
  Precision           : 0.4465
  Recall              : 0.9234
  F1                  : 0.6019
  ROCAUC              : 0.9933
  GeometricMean       : 0.9554


In [21]:
# -----------------------------------------------------------------------------
# PRIMARY METRICS - Class-based metrics for fraud detection
# These use y_pred (predicted labels), not probabilities
# -----------------------------------------------------------------------------
primary_metric_functions = {
    # Recall: Fraud detection rate (minimize missed fraud)
    # TP / (TP + FN) - How many actual frauds did we catch?
    "recall_score": metrics.recall_score,
    # Precision: False alarm rate (customer experience)
    # TP / (TP + FP) - Of predicted frauds, how many were actually fraud?
    "precision_score": metrics.precision_score,
    # F1: Harmonic mean of Precision & Recall
    # Best when you want balance between precision and recall
    "f1_score": metrics.f1_score,
    # F-beta with beta=2: Weights Recall 2x more than Precision
    # CRITICAL for fraud detection where missing fraud is costly
    "fbeta_score": metrics.fbeta_score,
}
primary_metric_args = {
    "recall_score": {
        "pos_label": 1,
        "average": "binary",
        "zero_division": 0.0,
    },
    "precision_score": {
        "pos_label": 1,
        "average": "binary",
        "zero_division": 0.0,
    },
    "f1_score": {
        "pos_label": 1,
        "average": "binary",
        "zero_division": 0.0,
    },
    "fbeta_score": {
        "beta": 2.0,
        "pos_label": 1,
        "average": "binary",
        "zero_division": 0.0,
    },
}
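
The paired `*_functions` / `*_args` dictionaries are meant to be consumed together. The evaluation loop itself is not shown in this section, so the following is one plausible pattern rather than the notebook's actual code. To keep it self-contained it uses hypothetical pure-Python stand-ins (`recall_stub`, `precision_stub`) in place of the scikit-learn functions.

```python
def recall_stub(y_true, y_pred, pos_label=1, zero_division=0.0, **_ignored):
    """Stand-in for a recall metric: TP / (TP + FN)."""
    tp = sum(t == pos_label and p == pos_label for t, p in zip(y_true, y_pred))
    fn = sum(t == pos_label and p != pos_label for t, p in zip(y_true, y_pred))
    return tp / (tp + fn) if (tp + fn) else zero_division


def precision_stub(y_true, y_pred, pos_label=1, zero_division=0.0, **_ignored):
    """Stand-in for a precision metric: TP / (TP + FP)."""
    tp = sum(t == pos_label and p == pos_label for t, p in zip(y_true, y_pred))
    fp = sum(t != pos_label and p == pos_label for t, p in zip(y_true, y_pred))
    return tp / (tp + fp) if (tp + fp) else zero_division


def evaluate(functions, kwargs_by_name, y_true, y_pred):
    """Call each metric with its own keyword arguments."""
    return {
        name: fn(y_true, y_pred, **kwargs_by_name.get(name, {}))
        for name, fn in functions.items()
    }


demo_functions = {"recall_score": recall_stub, "precision_score": precision_stub}
demo_args = {
    "recall_score": {"pos_label": 1, "average": "binary", "zero_division": 0.0},
    "precision_score": {"pos_label": 1, "average": "binary", "zero_division": 0.0},
}
y_true_demo = [1, 1, 0, 0, 1]
y_pred_demo = [1, 0, 1, 0, 1]
results = evaluate(demo_functions, demo_args, y_true_demo, y_pred_demo)
print(results)
```

Keeping arguments in a separate dictionary keyed by metric name lets the same loop serve every metric group without per-metric branching.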

In [22]:
# -----------------------------------------------------------------------------
# SECONDARY METRICS - Good for monitoring and additional insights
# These provide complementary information but shouldn't drive model selection
# -----------------------------------------------------------------------------
secondary_metric_functions = {
    # Accuracy: Overall correctness (TP + TN) / Total
    # CAUTION: Misleading for imbalanced data when used alone!
    # Include for: baseline comparison, sanity checks, stakeholder reporting
    # With ~1% fraud (as in this dataset): predicting all non-fraud = ~99% accuracy (useless!)
    # ALWAYS show alongside balanced_accuracy and recall
    "accuracy_score": metrics.accuracy_score,
    # Balanced Accuracy: Average of recall on each class
    # = (TPR + TNR) / 2 = (Recall_fraud + Recall_non_fraud) / 2
    # Better than accuracy for imbalanced data - penalizes ignoring minority
    "balanced_accuracy_score": metrics.balanced_accuracy_score,
    # Matthews Correlation Coefficient: Most robust single metric
    # Balanced measure, works well with imbalanced classes
    # Range: [-1, +1], 0 = random, +1 = perfect, -1 = inverse
    # Scores high only when all 4 confusion matrix categories are good
    "matthews_corrcoef": metrics.matthews_corrcoef,
    # Cohen's Kappa: Agreement beyond chance
    # Useful for comparing with baseline/random classifier
    # Range: [-1, +1], 0 = no better than chance, +1 = perfect
    "cohen_kappa_score": metrics.cohen_kappa_score,
    # Jaccard Score: Intersection over Union (IoU)
    # TP / (TP + FP + FN) - stricter than F1
    # Ignores TN, focuses only on positive class predictions
    "jaccard_score": metrics.jaccard_score,
}
secondary_metric_args = {
    # Accuracy: normalize=True returns fraction [0, 1]
    # sample_weight=None means equal weight for all samples
    # For imbalanced data: use alongside balanced_accuracy, never alone!
    "accuracy_score": {
        "normalize": True,  # Return fraction (0.0 to 1.0), not count
        # sample_weight: Can be set dynamically to correct for imbalance
        # Example: weight fraud samples higher to penalize missing them
    },
    # Balanced Accuracy: adjusted=False returns [0, 1]; adjusted=True rescales to [-1, 1] for binary
    # adjusted=True: random classifier scores 0, adjusted=False: random scores ~0.5
    "balanced_accuracy_score": {
        "adjusted": False,  # Keep in [0, 1] range for interpretability
    },
    # MCC: No special args, works on y_true vs y_pred
    # Handles imbalanced data well by design
    "matthews_corrcoef": {},
    # Cohen's Kappa: weights=None for unweighted agreement
    # weights='linear' or 'quadratic' for ordinal classification
    "cohen_kappa_score": {
        "weights": None,  # Unweighted (linear/quadratic for ordinal data)
    },
    # Jaccard: binary classification with fraud as positive class
    "jaccard_score": {
        "pos_label": 1,
        "average": "binary",
        "zero_division": 0.0,
    },
}
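
The comments above warn that accuracy is misleading on imbalanced data. A small worked example makes this concrete by computing the metrics directly from confusion-matrix counts. `summarize` is a hypothetical helper and the counts are invented for illustration.

```python
import math


def summarize(tn, fp, fn, tp):
    """Compute accuracy, balanced accuracy (raw and adjusted), and MCC from counts."""
    total = tn + fp + fn + tp
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    balanced = (tpr + tnr) / 2
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / total,
        "balanced_accuracy": balanced,
        # Binary chance level is 0.5, so the adjusted score spans [-1, 1].
        "balanced_accuracy_adjusted": (balanced - 0.5) / 0.5,
        # Treat MCC as 0 when a row or column of the matrix is empty.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical "always predict non-fraud" classifier on 1% fraud data.
always_negative = summarize(tn=9900, fp=0, fn=100, tp=0)
print(always_negative)
```

The trivial classifier reaches 0.99 accuracy, while balanced accuracy (0.5), its adjusted form (0.0), and MCC (0.0) all expose it as no better than chance.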

In [23]:
# -----------------------------------------------------------------------------
# PROBABILISTIC METRICS - Use y_pred_proba (probability scores)
# These measure ranking ability and probability calibration
# -----------------------------------------------------------------------------
probabilistic_metric_functions = {
    # ROC-AUC: Area under the ROC curve; threshold-independent measure of ranking ability
    # Can look optimistic under heavy class imbalance, so read alongside PR-AUC
    "roc_auc_score": metrics.roc_auc_score,
    # Average Precision (PR-AUC): Area under precision-recall curve
    # Better than ROC-AUC for highly imbalanced data
    "average_precision_score": metrics.average_precision_score,
    # Log Loss (Cross-Entropy): Penalizes confident wrong predictions
    # Lower is better, heavily penalizes confident mistakes
    "log_loss": metrics.log_loss,
    # Brier Score: Mean squared error of probability predictions
    # Lower is better, range [0, 1]
    "brier_score_loss": metrics.brier_score_loss,
    # D^2 Log Loss Score: Fraction of log loss explained
    # Similar to R^2, but for log loss; higher is better
    "d2_log_loss_score": metrics.d2_log_loss_score,
    # D^2 Brier Score: Fraction of Brier score explained
    # Similar to R^2, but for Brier score; higher is better
    "d2_brier_score": metrics.d2_brier_score,
}
probabilistic_metric_args = {
    "roc_auc_score": {},
    "average_precision_score": {
        "pos_label": 1,
    },
    "log_loss": {
        "normalize": True,
    },
    "brier_score_loss": {
        "pos_label": 1,
    },
    "d2_log_loss_score": {},
    "d2_brier_score": {
        "pos_label": 1,
    },
}
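
To illustrate how the Brier score and its D^2 variant relate, here is a minimal pure-Python sketch. It assumes the D^2 baseline is a constant prediction at the observed positive rate, which is how the "fraction explained" reading above is usually defined; the labels and probabilities are made up.

```python
def brier(y_true, y_prob):
    """Mean squared error between outcomes (0/1) and predicted probabilities."""
    return sum((p - y) ** 2 for y, p in zip(y_true, y_prob)) / len(y_true)


def d2_brier(y_true, y_prob):
    """1 - Brier(model) / Brier(constant prediction at the observed positive rate)."""
    base_rate = sum(y_true) / len(y_true)
    baseline = brier(y_true, [base_rate] * len(y_true))
    return 1 - brier(y_true, y_prob) / baseline


# Hypothetical labels and predicted fraud probabilities
y_true_demo = [0, 0, 0, 1]
y_prob_demo = [0.1, 0.2, 0.1, 0.8]
print(round(brier(y_true_demo, y_prob_demo), 4))     # 0.025
print(round(d2_brier(y_true_demo, y_prob_demo), 4))  # 0.8667
```

A D^2 of 0 means the model is no better than always predicting the base rate, and negative values mean it is worse.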

In [24]:
# -----------------------------------------------------------------------------
# ANALYSIS/REPORTING METRICS - For detailed analysis and threshold tuning
# These return multiple values or structured outputs
# -----------------------------------------------------------------------------
analysis_metric_functions = {
    # Confusion Matrix: Foundation for many other metrics
    # Returns 2x2 matrix: [[TN, FP], [FN, TP]]
    "confusion_matrix": metrics.confusion_matrix,
    # NEW IN SKLEARN 1.8! Confusion Matrix at Thresholds
    # Returns TN, FP, FN, TP arrays for each threshold
    # CRITICAL for threshold optimization in fraud detection
    "confusion_matrix_at_thresholds": metrics.confusion_matrix_at_thresholds,
    # Classification Report: Text summary of P, R, F1 per class
    # Can return dict with output_dict=True
    "classification_report": metrics.classification_report,
    # Precision-Recall Curve: For threshold analysis
    # Returns (precision, recall, thresholds)
    "precision_recall_curve": metrics.precision_recall_curve,
    # ROC Curve: For threshold analysis
    # Returns (fpr, tpr, thresholds)
    "roc_curve": metrics.roc_curve,
    # DET Curve: Detection Error Tradeoff
    # Returns (fpr, fnr, thresholds) - plots FNR vs FPR
    # Useful for fraud: visualize false alarm vs missed fraud tradeoff
    "det_curve": metrics.det_curve,
    # Class Likelihood Ratios: LR+, LR- for diagnostic testing
    # Returns (positive_lr, negative_lr)
    "class_likelihood_ratios": metrics.class_likelihood_ratios,
    # Precision-Recall-FScore-Support: All in one
    # Returns (precision, recall, fbeta, support); with average="binary" (set in the
    # args below) the first three are scalars and support is None
    "precision_recall_fscore_support": metrics.precision_recall_fscore_support,
    # AUC: General utility to compute area under any curve
    "auc": metrics.auc,
}
analysis_metric_args = {
    # Confusion Matrix: labels=[0, 1] ensures consistent ordering
    # normalize=None keeps raw counts; 'true' would normalize over actual classes (row-wise)
    "confusion_matrix": {
        "labels": [0, 1],  # [non-fraud, fraud]
        "normalize": None,  # Return raw counts; use 'true'/'pred'/'all' for proportions
    },
    # Confusion Matrix at Thresholds: pos_label=1 for fraud
    # Returns (tns, fps, fns, tps, thresholds) arrays
    "confusion_matrix_at_thresholds": {
        "pos_label": 1,  # Fraud is positive class
    },
    # Classification Report: output_dict=True for programmatic access
    "classification_report": {
        "target_names": ["Non-Fraud", "Fraud"],
        "output_dict": True,  # Return dict instead of string
        "zero_division": 0.0,
    },
    # Precision-Recall Curve: pos_label=1 for fraud class
    "precision_recall_curve": {
        "pos_label": 1,
    },
    # ROC Curve: pos_label=1 for fraud class
    "roc_curve": {
        "pos_label": 1,
        "drop_intermediate": True,  # Reduce points for efficiency
    },
    # DET Curve: pos_label=1 for fraud class
    "det_curve": {
        "pos_label": 1,
        "drop_intermediate": True,  # Reduce points for efficiency
    },
    # Class Likelihood Ratios: labels=[non-fraud, fraud] ordering
    "class_likelihood_ratios": {
        "labels": [0, 1],  # [negative_class, positive_class]
    },
    # Precision-Recall-FScore-Support: beta=1.0 for F1
    "precision_recall_fscore_support": {
        "beta": 1.0,
        "pos_label": 1,
        "average": "binary",
        "zero_division": 0.0,
    },
    # AUC: No default args, takes x and y arrays directly
    "auc": {},
}
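
The threshold-oriented entries above (`confusion_matrix_at_thresholds`, `precision_recall_curve`, `roc_curve`, `det_curve`) all rest on the same idea: sweep a decision threshold over the predicted scores and recount the confusion matrix at each step. The sketch below illustrates that idea in plain Python on made-up toy scores. The helpers `counts_at_threshold`, `fbeta_from_counts`, and `best_threshold_by_fbeta` are hypothetical names for illustration, not sklearn functions, and the `score >= threshold` decision rule is an assumption of this sketch.

```python
def counts_at_threshold(y_true, y_score, threshold):
    """Return (tn, fp, fn, tp), predicting fraud (1) when score >= threshold."""
    tn = fp = fn = tp = 0
    for truth, score in zip(y_true, y_score):
        predicted = 1 if score >= threshold else 0
        if predicted == 1 and truth == 1:
            tp += 1
        elif predicted == 1 and truth == 0:
            fp += 1
        elif predicted == 0 and truth == 1:
            fn += 1
        else:
            tn += 1
    return tn, fp, fn, tp


def fbeta_from_counts(tp, fp, fn, beta=1.0):
    """F-beta from raw counts; returns 0.0 when the score is undefined."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def best_threshold_by_fbeta(y_true, y_score, beta=1.0):
    """Try every distinct score as a threshold; return (threshold, fbeta)."""
    best = None
    for threshold in sorted(set(y_score)):
        _, fp, fn, tp = counts_at_threshold(y_true, y_score, threshold)
        score = fbeta_from_counts(tp, fp, fn, beta)
        if best is None or score > best[1]:
            best = (threshold, score)
    return best


# Toy data (illustrative only, not taken from the fraud dataset)
toy_y_true = [0, 0, 0, 0, 1, 0, 1, 1]
toy_y_score = [0.05, 0.10, 0.20, 0.40, 0.45, 0.60, 0.80, 0.90]

best_threshold, best_f1 = best_threshold_by_fbeta(toy_y_true, toy_y_score)
print(best_threshold, round(best_f1, 3))
```

In fraud detection, a `beta` above 1 weights recall more heavily, which typically moves the selected threshold lower.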

## Comprehensive Metrics Evaluation

Sklearn metrics are organized to mirror the River ML structure for consistency:

| Category | River (Online) | Sklearn (Batch) |
|----------|----------------|-----------------|
| **Class-based** | `.update(y, pred)` | `func(y_true, y_pred)` |
| **Probability-based** | `.update(y, proba)` | `func(y_true, y_proba)` |
| **Report/Matrix** | `.update(y, pred)` | `func(y_true, y_pred)` |

Key difference: River metrics are incremental (updated one sample at a time via `.update()`), whereas sklearn metrics are batch functions called once on complete arrays.
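
The incremental-versus-batch distinction can be sketched in plain Python. `IncrementalAccuracy` and `batch_accuracy` are hypothetical stand-ins written for this illustration (neither is the River nor the sklearn implementation), and the labels are toy values:

```python
class IncrementalAccuracy:
    """Hypothetical River-style metric: state is updated one sample at a time."""

    def __init__(self):
        self.correct = 0
        self.total = 0

    def update(self, y_true, y_pred):
        self.correct += int(y_true == y_pred)
        self.total += 1

    def get(self):
        return self.correct / self.total if self.total else 0.0


def batch_accuracy(y_true, y_pred):
    """Hypothetical sklearn-style metric: one call over complete sequences."""
    matches = sum(t == p for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


# Toy labels and predictions (illustrative only)
toy_labels = [0, 0, 1, 0, 1, 1, 0, 0]
toy_preds = [0, 1, 1, 0, 0, 1, 0, 0]

online_metric = IncrementalAccuracy()
for label, pred in zip(toy_labels, toy_preds):
    online_metric.update(label, pred)  # streaming: one observation per call

online_result = online_metric.get()
batch_result = batch_accuracy(toy_labels, toy_preds)  # batch: all at once
print(online_result, batch_result)
```

Both paths arrive at the same value; the difference is only when the data must be available.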

In [25]:
# =============================================================================
# COMPUTE ALL METRICS
# =============================================================================
metrics_to_log = {}

# -----------------------------------------------------------------------------
# PRIMARY METRICS (class-based, use y_pred)
# -----------------------------------------------------------------------------
for name, func in primary_metric_functions.items():
    metrics_to_log[name] = func(y_test, y_pred, **primary_metric_args[name])

# -----------------------------------------------------------------------------
# SECONDARY METRICS (class-based, use y_pred)
# -----------------------------------------------------------------------------
for name, func in secondary_metric_functions.items():
    metrics_to_log[name] = func(y_test, y_pred, **secondary_metric_args[name])

# -----------------------------------------------------------------------------
# PROBABILISTIC METRICS (use y_pred_proba)
# -----------------------------------------------------------------------------
for name, func in probabilistic_metric_functions.items():
    metrics_to_log[name] = func(y_test, y_pred_proba, **probabilistic_metric_args[name])

metrics_to_log

{'recall_score': 0.9233964005537609,
 'precision_score': 0.4464524765729585,
 'f1_score': 0.6018950218077906,
 'fbeta_score': 0.7608365019011407,
 'accuracy_score': 0.9878042599853485,
 'balanced_accuracy_score': 0.9559251032348655,
 'matthews_corrcoef': 0.6374838597438147,
 'cohen_kappa_score': 0.5964632472381868,
 'jaccard_score': 0.4305077452667814,
 'roc_auc_score': 0.9933091211885177,
 'average_precision_score': 0.9179367645715938,
 'log_loss': 0.07253448486036858,
 'brier_score_loss': 0.019032686368593052,
 'd2_log_loss_score': -0.29690518159390145,
 'd2_brier_score': -0.9255058228904667}
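
The negative D² values look surprising next to a ROC AUC above 0.99, so it is worth checking them by hand. The sketch below assumes the D² null model is a constant prediction equal to the test-set fraud rate, and uses the class supports printed in the classification report of this notebook (2167 fraud rows out of 217043). Under that assumption it reproduces both reported scores from the reported `log_loss` and `brier_score_loss`.

```python
import math

# Values copied from this notebook's outputs
n_fraud = 2167        # Fraud support in the test set
n_total = 217043      # total test rows
model_log_loss = 0.07253448486036858
model_brier = 0.019032686368593052

prevalence = n_fraud / n_total

# Null model (assumption): predict the fraud rate for every row
null_log_loss = -(
    prevalence * math.log(prevalence)
    + (1 - prevalence) * math.log(1 - prevalence)
)
null_brier = prevalence * (1 - prevalence)

# D^2 = 1 - (model loss / null-model loss)
d2_log_loss = 1 - model_log_loss / null_log_loss
d2_brier = 1 - model_brier / null_brier

print(f"null log loss: {null_log_loss:.6f}, D^2 log loss: {d2_log_loss:.4f}")
print(f"null Brier:    {null_brier:.6f}, D^2 Brier:    {d2_brier:.4f}")
```

With roughly 1% fraud, the constant baseline already achieves a very low log loss and Brier score, so a model whose probabilities are shifted toward the positive class can rank well yet score below that baseline. Class weighting during training is one plausible cause, though this notebook section does not confirm it.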

In [26]:
# Display report metrics (confusion matrix and classification report)
print("Confusion Matrix:")
print(metrics.confusion_matrix(y_test, y_pred, **analysis_metric_args["confusion_matrix"]))
print("\nClassification Report:")
# Arguments are passed explicitly here: analysis_metric_args sets output_dict=True,
# which would return a dict instead of the printable text table.
print(metrics.classification_report(
    y_test, y_pred,
    target_names=["Non-Fraud", "Fraud"],
    zero_division=0.0
))

Confusion Matrix:
[[212395   2481]
 [   166   2001]]

Classification Report:
              precision    recall  f1-score   support

   Non-Fraud       1.00      0.99      0.99    214876
       Fraud       0.45      0.92      0.60      2167

    accuracy                           0.99    217043
   macro avg       0.72      0.96      0.80    217043
weighted avg       0.99      0.99      0.99    217043
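
Most of the class-based metrics logged earlier follow directly from the four confusion matrix counts. As a sanity check, the sketch below recomputes them from the matrix printed above using the standard binary-classification formulas. The variable names are chosen for this illustration only.

```python
import math

# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]]
tn, fp = 212395, 2481
fn, tp = 166, 2001

precision = tp / (tp + fp)
recall = tp / (tp + fn)                # true positive rate
specificity = tn / (tn + fp)           # true negative rate
f1 = 2 * tp / (2 * tp + fp + fn)
accuracy = (tp + tn) / (tp + tn + fp + fn)
balanced_accuracy = (recall + specificity) / 2
jaccard = tp / (tp + fp + fn)
mcc = (tp * tn - fp * fn) / math.sqrt(
    (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
)

for name, value in [
    ("precision", precision),
    ("recall", recall),
    ("f1", f1),
    ("accuracy", accuracy),
    ("balanced_accuracy", balanced_accuracy),
    ("jaccard", jaccard),
    ("mcc", mcc),
]:
    print(f"{name}: {value:.6f}")
```

The counts make the trade-off concrete: the model misses 166 of 2167 frauds while raising 2481 false alarms, which is why recall is high and precision sits below 0.5.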



In [27]:
# Print all model parameters
print("\nModel parameters:")
all_params = model.get_all_params()
for param_name, param_value in all_params.items():
    print(f"  {param_name}: {param_value}")


Model parameters:
  nan_mode: Min
  eval_metric: AUC
  combinations_ctr: ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCount=1:TargetBorderType=MinEntropy:Prior=0/1:Prior=0.5/1:Prior=1/1', 'Counter:CtrBorderCount=15:CtrBorderType=Uniform:Prior=0/1']
  iterations: 1000
  sampling_frequency: PerTree
  fold_permutation_block: 0
  leaf_estimation_method: Newton
  od_pval: 0
  random_score_type: NormalWithModelSizeDecrease
  counter_calc_method: SkipTest
  grow_policy: SymmetricTree
  penalties_coefficient: 1
  boosting_type: Plain
  model_shrink_mode: Constant
  feature_border_type: GreedyLogSum
  ctr_leaf_count_limit: 18446744073709551615
  bayesian_matrix_reg: 0.10000000149011612
  one_hot_max_size: 2
  eval_fraction: 0
  force_unit_auto_pair_weights: False
  l2_leaf_reg: 3
  random_strength: 1
  od_type: Iter
  rsm: 1
  boost_from_average: False
  max_ctr_complexity: 4
  model_size_reg: 0.5
  simple_ctr: ['Borders:CtrBorderCount=15:CtrBorderType=Uniform:TargetBorderCoun