# 03 - Baseline Models with MLflow Tracking
## Credit Scoring Model Project

**Learning Objectives:**
- Train multiple baseline models
- Set up MLflow experiment tracking
- Evaluate models using appropriate metrics for imbalanced data
- Compare model performance
- Select best baseline for optimization

**Why Multiple Baselines?**
Different algorithms have different strengths:
- **Logistic Regression:** Simple, interpretable, fast (good baseline)
- **Random Forest:** Handles non-linearity, robust to outliers
- **XGBoost:** Powerful gradient boosting, often wins competitions
- **LightGBM:** Fast, memory-efficient, great for large datasets

**MLflow Tracking:**
We'll log every experiment so the models can be compared systematically. This is standard practice in a professional ML workflow!

Let's build our models! 🚀

## 📦 Import Libraries



In [None]:
# Standard libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from pathlib import Path
import time

# ML models
from sklearn.dummy import DummyClassifier  # Reference baseline
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Evaluation
from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    classification_report, confusion_matrix
)

# MLflow
import mlflow
import mlflow.sklearn

# Our utilities
import sys
sys.path.append('../')
from src.evaluation import (
    evaluate_model,
    plot_roc_curve,
    plot_precision_recall_curve,
    plot_confusion_matrix,
    compare_models,
    plot_feature_importance
)
from src.model_training import train_and_evaluate_model

# Configuration
warnings.filterwarnings('ignore')
RANDOM_STATE = 42

print("[OK] Libraries imported successfully!")
print(f"MLflow version: {mlflow.__version__}")



## 📂 Load Processed Data

Load the data we prepared in the feature engineering notebook.



In [None]:
# Load processed data
data_dir = Path('../data/processed')

print("Loading processed datasets...")
X_train = pd.read_csv(data_dir / 'X_train.csv')
X_val = pd.read_csv(data_dir / 'X_val.csv')
y_train = pd.read_csv(data_dir / 'y_train.csv').squeeze()
y_val = pd.read_csv(data_dir / 'y_val.csv').squeeze()

print(f"[OK] Data loaded!")
print(f"\nDataset shapes:")
print(f"  X_train: {X_train.shape}")
print(f"  X_val: {X_val.shape}")
print(f"  y_train: {y_train.shape}")
print(f"  y_val: {y_val.shape}")

print(f"\nTarget distribution:")
print(f"  Training: {y_train.value_counts(normalize=True).to_dict()}")
print(f"  Validation: {y_val.value_counts(normalize=True).to_dict()}")

print(f"\nFeature count: {X_train.shape[1]}")



## 🔬 Setup MLflow Experiment Tracking

**What MLflow Does:**
- Automatically logs all your experiments
- Stores parameters, metrics, and artifacts
- Provides a UI to visualize and compare runs
- Makes your work reproducible

**To view experiments:**
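```bash
# In a separate terminal, from the project root (the directory that contains mlruns/), run:
mlflow ui --backend-store-uri sqlite:///mlruns/mlflow.db
# Then open: http://localhost:5000
```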



In [None]:
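# Set experiment name
experiment_name = "credit_scoring_baseline_models"

# SQLite does not create missing directories, so make sure ../mlruns exists
Path("../mlruns").mkdir(parents=True, exist_ok=True)
mlflow.set_tracking_uri("sqlite:///../mlruns/mlflow.db")
mlflow.set_experiment(experiment_name)

# Get experiment ID
experiment = mlflow.get_experiment_by_name(experiment_name)
print(f"[OK] MLflow experiment set: {experiment_name}")
print(f"Experiment ID: {experiment.experiment_id}")
print(f"\nArtifacts will be stored in: {experiment.artifact_location}")
# The UI must point at the same SQLite store, otherwise these runs will not appear
print("\nTo view experiments, run from the project root:")
print("  mlflow ui --backend-store-uri sqlite:///mlruns/mlflow.db")
print("Then open: http://localhost:5000")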



## 🚀 Train Baseline Models

We'll train 4 different models, plus a dummy reference classifier, and compare them.

**Model Selection Rationale:**

0. **DummyClassifier (Reference Baseline)**
   - Predicts the majority class (always "no default")
   - No learning: just a floor to beat
   - Shows the minimum acceptable performance
   - **Critical:** Any real model MUST beat this!

1. **Logistic Regression**
   - Simple linear model
   - Fast to train
   - Highly interpretable
   - Good baseline to beat

2. **Random Forest**
   - Ensemble of decision trees
   - Handles non-linear relationships
   - Robust to outliers
   - Built-in feature importance

3. **XGBoost**
   - Gradient boosting
   - Often wins ML competitions
   - Can reweight the positive class for imbalanced data (`scale_pos_weight`)
   - Many hyperparameters to tune

4. **LightGBM**
   - Microsoft's gradient boosting
   - Very fast and memory-efficient
   - Great for large datasets
   - Often comparable to XGBoost

Let's train them all!



In [None]:
# Log feature names and count once for the experiment
with mlflow.start_run(run_name="Experiment_Setup", nested=True):
    # Log feature names as an artifact
    feature_names = X_train.columns.tolist()
    features_path = Path("artifacts") / "feature_names.txt"
    features_path.parent.mkdir(exist_ok=True)
    with open(features_path, "w") as f:
        for feature in feature_names:
            f.write(f"{feature}\n")
    mlflow.log_artifact(features_path, artifact_path="features")
    print(f"[OK] Logged {len(feature_names)} feature names to MLflow artifact 'features/feature_names.txt'")

    # Log number of features as a parameter
    mlflow.log_param("num_features", len(feature_names))
    print(f"[OK] Logged number of features ({len(feature_names)}) to MLflow parameter 'num_features'")


In [None]:
# 0. DUMMY CLASSIFIER (REFERENCE BASELINE)
print("\n\n### 0. DUMMY CLASSIFIER (REFERENCE BASELINE) ###\n")
print("This model always predicts the majority class (no default).")
print("Any real model MUST beat this to be useful!\n")

dummy_params = {
    'strategy': 'most_frequent',  # Always predict majority class
    'random_state': RANDOM_STATE
}

dummy_model = DummyClassifier(**dummy_params)
dummy_metrics, dummy_trained = train_and_evaluate_model(
    dummy_model, "Dummy_Classifier", dummy_params,
    X_train, y_train, X_val, y_val
)

print("\n" + "="*80)
print("⚠️  IMPORTANT: This is the MINIMUM performance threshold!")
print("   Any real model with ROC-AUC < 0.50 is worse than random guessing!")
print("="*80)
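
**Why the dummy scores 0.50 on ROC-AUC:** a majority-class predictor can reach high accuracy on imbalanced data, yet it gives every applicant the same score, so it cannot rank defaulters above non-defaulters. The sketch below illustrates this with a small pure-Python ROC-AUC (the probability that a random positive outranks a random negative, with ties counted as 0.5). The ten labels are made up for illustration, and it assumes the evaluation helper computes ROC-AUC from predicted scores.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as P(score of random positive > score of random negative); ties count 0.5."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


y_true = [0] * 9 + [1]             # illustrative labels: 10% positives
constant_scores = [0.0] * 10       # "always no default": identical score for everyone
perfect_scores = [0.1] * 9 + [0.9]

accuracy = sum(1 for y in y_true if y == 0) / len(y_true)
print(accuracy)                           # 0.9
print(roc_auc(y_true, constant_scores))   # 0.5
print(roc_auc(y_true, perfect_scores))    # 1.0
```

This is why accuracy is a poor yardstick here, and why 0.50 is the floor the other models are measured against.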


In [None]:
# 1. LOGISTIC REGRESSION
print("\n\n### 1. LOGISTIC REGRESSION ###\n")

lr_params = {
    'max_iter': 1000,
    'class_weight': 'balanced',  # Handle imbalance
    'random_state': RANDOM_STATE,
    'n_jobs': -1
}

lr_model = LogisticRegression(**lr_params)
lr_metrics, lr_trained = train_and_evaluate_model(
    lr_model, "Logistic_Regression", lr_params,
    X_train, y_train, X_val, y_val
)


In [None]:
# 2. RANDOM FOREST
print("\n\n### 2. RANDOM FOREST ###\n")

rf_params = {
    'n_estimators': 100,
    'max_depth': 10,
    'min_samples_split': 50,
    'min_samples_leaf': 20,
    'class_weight': 'balanced',
    'random_state': RANDOM_STATE,
    'n_jobs': -1,
    'verbose': 0
}

rf_model = RandomForestClassifier(**rf_params)
rf_metrics, rf_trained = train_and_evaluate_model(
    rf_model, "Random_Forest", rf_params,
    X_train, y_train, X_val, y_val
)


In [None]:
# 3. XGBOOST
print("\n\n### 3. XGBOOST ###\n")

# Calculate scale_pos_weight for class imbalance
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()

xgb_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'scale_pos_weight': scale_pos_weight,  # Handle imbalance
    'random_state': RANDOM_STATE,
    'n_jobs': -1,
    'verbosity': 0
}

xgb_model = XGBClassifier(**xgb_params)
xgb_metrics, xgb_trained = train_and_evaluate_model(
    xgb_model, "XGBoost", xgb_params,
    X_train, y_train, X_val, y_val
)


In [None]:
# 4. LIGHTGBM
print("\n\n### 4. LIGHTGBM ###\n")

lgbm_params = {
    'n_estimators': 100,
    'max_depth': 6,
    'learning_rate': 0.1,
    'subsample': 0.8,
    'subsample_freq': 1,  # LightGBM only applies subsample when subsample_freq > 0
    'colsample_bytree': 0.8,
    'class_weight': 'balanced',
    'random_state': RANDOM_STATE,
    'n_jobs': -1,
    'verbose': -1
}

lgbm_model = LGBMClassifier(**lgbm_params)
lgbm_metrics, lgbm_trained = train_and_evaluate_model(
    lgbm_model, "LightGBM", lgbm_params,
    X_train, y_train, X_val, y_val
)



## 📊 Compare All Models

Let's compare all our baseline models side-by-side.
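
The `compare_models` helper comes from `src.evaluation`, which is not shown in this notebook. A rough sketch of what such a helper might do, assuming it turns a dict of per-model metric dicts into a table sorted by the chosen metric. The name `compare_models_sketch` and the metric values are placeholders for illustration, not results from this project.

```python
import pandas as pd


def compare_models_sketch(results, metric="roc_auc"):
    """Hypothetical stand-in for src.evaluation.compare_models."""
    # One row per model, one column per metric
    table = pd.DataFrame.from_dict(results, orient="index")
    return table.sort_values(metric, ascending=False)


# Placeholder numbers purely for illustration
illustrative_results = {
    "Model_A": {"roc_auc": 0.70, "pr_auc": 0.20},
    "Model_B": {"roc_auc": 0.75, "pr_auc": 0.25},
    "Dummy": {"roc_auc": 0.50, "pr_auc": 0.08},
}

table = compare_models_sketch(illustrative_results)
print(table)
```

Keeping model names in the index is what lets the plotting cell further down use `data.index` as the bar labels.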



In [None]:
# Gather all results
all_results = {
    'Dummy_Classifier': dummy_metrics,
    'Logistic_Regression': lr_metrics,
    'Random_Forest': rf_metrics,
    'XGBoost': xgb_metrics,
    'LightGBM': lgbm_metrics
}

# Compare using our utility function
comparison_df = compare_models(all_results, metric='roc_auc')

# Display comparison
print("\n" + "="*80)
print("DETAILED COMPARISON")
print("="*80)
print(comparison_df.to_string())

# Save comparison
comparison_df.to_csv('model_comparison.csv')
print("\n[OK] Comparison saved to model_comparison.csv")

# Calculate improvement over dummy baseline
dummy_roc = dummy_metrics['roc_auc']
print("\n" + "="*80)
print("IMPROVEMENT OVER DUMMY BASELINE")
print("="*80)
for model_name, metrics in all_results.items():
    if model_name != 'Dummy_Classifier':
        improvement = ((metrics['roc_auc'] - dummy_roc) / dummy_roc) * 100
        # The '+' format flag prints the correct sign even if a model scores below the dummy
        print(f"{model_name:25s}: {improvement:+7.2f}% relative change vs. dummy")


In [None]:
# Visualize model comparison
metrics_to_plot = ['roc_auc', 'pr_auc', 'f1', 'precision', 'recall']

fig, axes = plt.subplots(2, 3, figsize=(18, 10))
axes = axes.ravel()

for idx, metric in enumerate(metrics_to_plot):
    if metric in comparison_df.columns:
        data = comparison_df[metric].sort_values(ascending=False)
        
        axes[idx].barh(range(len(data)), data.values, color='steelblue', alpha=0.8)
        axes[idx].set_yticks(range(len(data)))
        axes[idx].set_yticklabels(data.index)
        axes[idx].set_xlabel(metric.upper().replace('_', '-'), fontsize=12)
        axes[idx].set_title(f'{metric.upper().replace("_", "-")} Comparison', 
                           fontsize=14, fontweight='bold')
        axes[idx].grid(axis='x', alpha=0.3)
        
        # Add value labels
        for i, v in enumerate(data.values):
            axes[idx].text(v + 0.005, i, f'{v:.4f}', 
                          va='center', fontsize=10)

# Remove empty subplot
fig.delaxes(axes[5])

plt.tight_layout()
Path('plots').mkdir(exist_ok=True)  # savefig fails if the target directory is missing
plt.savefig('plots/model_comparison.png', dpi=150, bbox_inches='tight')
plt.show()

print("[OK] Comparison visualization saved!")



## 📝 Baseline Models Summary

### ✅ What We Accomplished

1. **Trained 4 Baseline Models (plus a Dummy Reference)**
   - DummyClassifier (majority-class reference)
   - Logistic Regression (simple baseline)
   - Random Forest (ensemble method)
   - XGBoost (gradient boosting)
   - LightGBM (efficient gradient boosting)

2. **MLflow Experiment Tracking**
   - All runs logged automatically
   - Parameters, metrics, and artifacts stored
   - Compare runs visually in MLflow UI

3. **Comprehensive Evaluation**
   - ROC-AUC scores
   - Precision-Recall curves
   - Confusion matrices
   - Feature importance (tree models)

4. **Model Comparison**
   - Side-by-side metrics
   - Visual comparisons
   - Identified best baseline

### 🏆 Best Performing Model

Based on ROC-AUC and PR-AUC scores:
- **Best Model:** [Check comparison above]
- **ROC-AUC:** [Value]
- **PR-AUC:** [Value]
- **F1-Score:** [Value]

### 💡 Key Insights

1. **Tree-based models** (RF, XGB, LightGBM) often outperform Logistic Regression on tabular data (confirm against the comparison above)
   - They can capture non-linear relationships
   - Better handle feature interactions

2. **Class imbalance handling** is critical
   - Used `class_weight='balanced'` or `scale_pos_weight`
   - Evaluated with appropriate metrics (ROC-AUC, PR-AUC, F1)

3. **Feature importance** highlights candidate key predictors (verify against the logged plots)
   - Debt-to-income ratio is likely important
   - External credit scores likely matter
   - Age and employment features likely contribute

4. **Model complexity vs performance trade-off**
   - Logistic Regression: Fast, interpretable, but lower performance
   - Tree models: Higher performance, but less interpretable
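
To make the imbalance handling in insight 2 concrete, here is a small worked example of the two weighting schemes used above. It relies on the formula scikit-learn documents for `class_weight='balanced'`, `n_samples / (n_classes * count_of_class)`, and on the negative-to-positive ratio used for `scale_pos_weight` in the XGBoost cell. The 8% default rate is an illustrative assumption, not this project's actual class balance.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weights per class: n_samples / (n_classes * count_of_class)."""
    counts = Counter(labels)
    n_samples = len(labels)
    return {cls: n_samples / (len(counts) * count) for cls, count in counts.items()}


labels = [0] * 920 + [1] * 80   # illustrative: 8% positives (defaults)

weights = balanced_class_weights(labels)
scale_pos_weight = labels.count(0) / labels.count(1)

print(weights[1])                # 6.25
print(scale_pos_weight)          # 11.5
print(weights[1] / weights[0])   # same positive-to-negative ratio as scale_pos_weight
```

For a binary target the two schemes upweight positives by the same relative factor, so the models are treated comparably even though the parameter names differ.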

### 🎯 Next Steps

In the next notebook ([04_hyperparameter_optimization.ipynb](04_hyperparameter_optimization.ipynb)), we will:

1. **Select Best Baseline**
   - Choose the best performing model
   - Or ensemble top models

2. **Systematic Hyperparameter Tuning**
   - Define search space
   - Use GridSearchCV or RandomizedSearchCV
   - Use StratifiedKFold cross-validation

3. **Optimize for Target Metric**
   - Focus on ROC-AUC or PR-AUC
   - Consider business costs (FP vs FN)

4. **Log All Optimization Runs**
   - Track in MLflow
   - Compare optimization strategies
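
As a preview of step 2, here is a minimal sketch of `RandomizedSearchCV` combined with `StratifiedKFold`, scored on ROC-AUC. The toy dataset, the single-parameter search space, and the tiny `n_iter` are assumptions chosen to keep the example fast; the next notebook's actual setup may differ.

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Toy imbalanced data (NOT the credit dataset)
X, y = make_classification(n_samples=300, n_features=8, weights=[0.9, 0.1], random_state=42)

# Stratified folds keep the minority-class share roughly equal in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = RandomizedSearchCV(
    estimator=LogisticRegression(max_iter=1000, class_weight="balanced"),
    param_distributions={"C": loguniform(1e-3, 1e2)},  # sample C on a log scale
    n_iter=5,
    scoring="roc_auc",
    cv=cv,
    random_state=42,
)
search.fit(X, y)

print(search.best_params_)
print(round(search.best_score_, 4))
```

Stratification matters with rare positives: a plain random split could leave a fold with very few defaults, which makes the fold's ROC-AUC noisy.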

---

**Excellent work! You now have solid baseline models! 🎉**

### 📊 To View Your Experiments:

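```bash
# In a terminal, from the project root (the directory that contains mlruns/), run:
mlflow ui --backend-store-uri sqlite:///mlruns/mlflow.db

# Then open in your browser: http://localhost:5000
```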

In the MLflow UI, you can:
- Compare all runs side-by-side
- Sort by metrics
- View all plots and artifacts
- Download models

---

**Remember:** These are baselines! We'll try to improve on them in the next notebook through hyperparameter optimization! 🚀

