# Fraud model training with MLflow (Phase 3)

This notebook demonstrates how to:

- Generate a small synthetic dataset (reusing the Phase 2 generator)
- Build a fraud training DataFrame
- Train an XGBoost model with Optuna tuning
- Log metrics, parameters, model, and SHAP artifacts to MLflow


In [1]:
from pathlib import Path

import mlflow

from scripts.seed_data import generate_synthetic_data
from common.model_utils import build_fraud_training_dataframe, train_fraud_model




In [2]:
# Generate a modest dataset so the notebook runs quickly.

project_root = Path.cwd()
print(f"Project root: {project_root}")

event_metrics, user_metrics = generate_synthetic_data(
    n_events=50,
    n_users=500,
    n_transactions=5000,
    seed=42,
)

user_metrics.head()

Project root: d:\ai_ws\projects\ibook_ai_ops\notebooks


| | user_id | event_timestamp | lifetime_purchases | fraud_risk_score | preferred_category |
|---|---|---|---|---|---|
| 0 | 1 | 2025-12-31 13:29:25.509613+00:00 | 26 | 0.090909 | cultural |
| 1 | 2 | 2025-02-25 13:29:25.509613+00:00 | 23 | 0.0 | cultural |
| 2 | 3 | 2024-07-12 13:29:25.509613+00:00 | 31 | 0.0 | sports |
| 3 | 4 | 2023-05-10 13:29:25.509613+00:00 | 36 | 0.0 | sports |
| 4 | 5 | 2023-11-18 13:29:25.509613+00:00 | 12 | 0.166667 | sports |


In [3]:
# Build a training DataFrame with a simple binary label derived from fraud_risk_score.

train_df = build_fraud_training_dataframe(user_metrics, fraud_threshold=0.08)
train_df.head()

| | lifetime_purchases | fraud_risk_score | is_fraud_label |
|---|---|---|---|
| 0 | 26 | 0.090909 | 1 |
| 1 | 23 | 0.0 | 0 |
| 2 | 31 | 0.0 | 0 |
| 3 | 36 | 0.0 | 0 |
| 4 | 12 | 0.166667 | 1 |
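
The label column above appears to be a thresholded copy of `fraud_risk_score`. The sketch below reproduces the five labels in the table using plain Python. The function name `label_from_risk_score` and the inclusive `>=` comparison are assumptions for illustration; the actual logic lives in `common.model_utils.build_fraud_training_dataframe` and is not shown in this notebook.

```python
def label_from_risk_score(scores, fraud_threshold=0.08):
    """Hypothetical helper: 1 when the score reaches the threshold, else 0.

    Assumption: the comparison is inclusive (>=); the real helper may differ.
    """
    return [1 if score >= fraud_threshold else 0 for score in scores]


# The five fraud_risk_score values from user_metrics.head() above.
scores = [0.090909, 0.0, 0.0, 0.0, 0.166667]
labels = label_from_risk_score(scores)
print(labels)
```

For these five rows, an exclusive comparison (`>`) would give the same labels, so the table alone does not reveal which one the real helper uses.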


In [4]:
# Optionally override the tracking URI here; otherwise it is read from environment variables or `.env` via common.config.

# mlflow.set_tracking_uri("http://localhost:5000")

result = train_fraud_model(
    df=train_df,
    target_column="is_fraud_label",
    n_trials=5,
    test_size=0.2,
    random_state=123,
)

print("ROC AUC:", result.roc_auc)
print("Accuracy:", result.accuracy)
print("Run ID:", result.run_id)
print("Features:", result.feature_names)


[I 2026-02-11 18:29:25,694] A new study created in memory with name: no-name-8ba6ac4b-f9b6-47a8-b218-eaa4e809fdfa
[I 2026-02-11 18:29:26,123] Trial 0 finished with value: 1.0 and parameters: {'max_depth': 4, 'learning_rate': 0.17139201286561012, 'n_estimators': 54, 'subsample': 0.6110668661095805, 'colsample_bytree': 0.7260343534573334}. Best is trial 0 with value: 1.0.
[I 2026-02-11 18:29:26,281] Trial 1 finished with value: 1.0 and parameters: {'max_depth': 2, 'learning_rate': 0.2578352746215169, 'n_estimators': 55, 'subsample': 0.6642114294351257, 'colsample_bytree': 0.9684128124606703}. Best is trial 0 with value: 1.0.
[I 2026-02-11 18:29:26,423] Trial 2 finished with value: 1.0 and parameters: {'max_depth': 2, 'learning_rate': 0.19009433177832896, 'n_estimators': 40, 'subsample': 0.7342778851381261, 'colsample_bytree': 0.6648639762872156}. Best is trial 0 with value: 1.0.
[I 2026-02-11 18:29:26,558] Trial 3 finished with value: 1.0 and parameters: {'max_depth': 2, 'learning_rate':

ROC AUC: 1.0
Accuracy: 1.0
Run ID: a622112d95024cf8a92ec5e515e59d16
Features: ['lifetime_purchases', 'fraud_risk_score']
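
A ROC AUC and accuracy of exactly 1.0 on every trial deserve a second look. The label here is derived from `fraud_risk_score`, and `fraud_risk_score` also appears in `Features`, so the model can likely recover the label from that single column. The sketch below checks this on the five rows shown earlier, using a hand-rolled pairwise AUC. `pairwise_roc_auc` is a hypothetical helper written for this illustration and is not part of `common.model_utils`.

```python
def pairwise_roc_auc(labels, scores):
    """Hypothetical helper: ROC AUC as the fraction of (positive, negative)
    pairs where the positive example scores higher; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("Both classes must be present to compute ROC AUC.")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# The five rows shown in train_df.head() above.
is_fraud_label = [1, 0, 0, 0, 1]
fraud_risk_score = [0.090909, 0.0, 0.0, 0.0, 0.166667]
lifetime_purchases = [26, 23, 31, 36, 12]

# Using the raw score as the "prediction" already ranks every fraud row first.
auc_from_score = pairwise_roc_auc(is_fraud_label, fraud_risk_score)
auc_from_purchases = pairwise_roc_auc(is_fraud_label, lifetime_purchases)
print("AUC using fraud_risk_score alone:", auc_from_score)
print("AUC using lifetime_purchases alone:", auc_from_purchases)
```

If the score column alone separates the classes, excluding it from the features, or using a label from an independent source, would probably give a more informative evaluation.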


You can now open the MLflow UI (default: `http://localhost:5000`) and select the
`fraud_detection` experiment to inspect the run's parameters, metrics, model
artifact, and SHAP outputs.
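
If the UI shows no runs, the notebook may have logged to a different tracking server. Below is a minimal sketch of resolving the URI, assuming the standard `MLFLOW_TRACKING_URI` environment variable and the local default mentioned above. `resolve_tracking_uri` is a hypothetical helper; how `common.config` actually loads its settings is not shown here.

```python
import os

DEFAULT_TRACKING_URI = "http://localhost:5000"


def resolve_tracking_uri(env=None, default=DEFAULT_TRACKING_URI):
    """Hypothetical helper: prefer MLFLOW_TRACKING_URI, else fall back to default."""
    env = os.environ if env is None else env
    uri = env.get("MLFLOW_TRACKING_URI", "").strip()
    return uri or default


tracking_uri = resolve_tracking_uri()
print(f"Tracking URI: {tracking_uri}")
# In the notebook, this value could be passed to mlflow.set_tracking_uri(tracking_uri).
```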
