
# Data Science with MLflow

This notebook demonstrates an end-to-end credit risk modeling workflow with MLflow on Databricks. It loads the feature table, runs exploratory data analysis (EDA) on class balance and feature correlations, encodes categorical variables, imputes missing values, and makes a stratified train-test split. It then trains a LightGBM classifier, logs the parameters, metrics, model, and feature importances to an MLflow run, evaluates the model on the test set, and registers it in Unity Catalog.

In [None]:
CATALOG = 'workspace'
BRONZE_SCHEMA = 'bronze'
SILVER_SCHEMA = 'silver'
GOLD_SCHEMA = 'gold'

USER_EMAIL = 'myname@example.com'
MLFLOW_EXPERIMENT_PATH = f'/Users/{USER_EMAIL}/credit-scoring-experiment'

In [0]:
# Install required libraries
%pip install lightgbm optuna scikit-learn matplotlib seaborn mlflow --quiet

In [0]:
# Restart Python kernel after installation
dbutils.library.restartPython()

In [0]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, roc_auc_score, roc_curve, precision_recall_curve
import lightgbm as lgb
import mlflow
import mlflow.lightgbm
from datetime import datetime

# Set style for visualizations
sns.set_style('whitegrid')
plt.rcParams['figure.figsize'] = (10, 6)

In [0]:
# Load the data from the table
df = spark.table(f"{CATALOG}.{GOLD_SCHEMA}.final_features").toPandas()

print(f"Dataset shape: {df.shape}")
print(f"\nTarget variable distribution:")
print(df['flag_default'].value_counts())
print(f"\nDefault rate: {df['flag_default'].mean():.2%}")

In [0]:
# Display basic information about the dataset
print("Dataset Info:")
print(df.info())
print("\n" + "="*80)
print("\nBasic Statistics:")
display(df.describe())
print("\n" + "="*80)
print("\nMissing Values:")
missing = df.isnull().sum()
missing_pct = (missing / len(df) * 100).round(2)
missing_df = pd.DataFrame({'Missing_Count': missing, 'Percentage': missing_pct})
missing_df = missing_df[missing_df['Missing_Count'] > 0].sort_values('Missing_Count', ascending=False)
if len(missing_df) > 0:
    display(missing_df)
else:
    print("No missing values found!")

In [0]:
# Visualize class distribution
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Count plot
# Sort by class label so index order (0, 1) always matches the hardcoded labels and colors
counts = df['flag_default'].value_counts().sort_index()
counts.plot(kind='bar', ax=axes[0], color=['#2ecc71', '#e74c3c'])
axes[0].set_title('Class Distribution', fontsize=14, fontweight='bold')
axes[0].set_xlabel('Flag Default (0=No Default, 1=Default)', fontsize=12)
axes[0].set_ylabel('Count', fontsize=12)
axes[0].set_xticklabels(['No Default', 'Default'], rotation=0)

# Add count labels on bars
for i, v in enumerate(counts):
    axes[0].text(i, v + 10, str(v), ha='center', fontweight='bold')

# Pie chart
counts.plot(kind='pie', ax=axes[1], autopct='%1.1f%%',
            colors=['#2ecc71', '#e74c3c'], labels=['No Default', 'Default'])
axes[1].set_title('Class Distribution (%)', fontsize=14, fontweight='bold')
axes[1].set_ylabel('')

plt.tight_layout()
plt.show()

print(f"\nClass Imbalance Ratio: 1:{df['flag_default'].value_counts()[0] / df['flag_default'].value_counts()[1]:.2f}")
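
The ratio printed above is the number of non-defaults per default, which is the same quantity the training cell later passes to LightGBM as `scale_pos_weight`. A minimal sketch of that calculation in plain Python follows; the helper name `imbalance_summary` and the toy labels are hypothetical and only serve to illustrate the arithmetic.

```python
from collections import Counter


def imbalance_summary(labels):
    """Return (negatives per positive, positive rate) for binary 0/1 labels."""
    counts = Counter(labels)
    n_neg, n_pos = counts[0], counts[1]
    if n_pos == 0:
        raise ValueError("no positive (default) examples in labels")
    return n_neg / n_pos, n_pos / len(labels)


# Hypothetical labels: 8 non-defaults and 2 defaults
toy_labels = [0, 0, 0, 0, 1, 0, 0, 1, 0, 0]
ratio, default_rate = imbalance_summary(toy_labels)

print(f"Class imbalance ratio: 1:{ratio:.2f}")
print(f"Default rate: {default_rate:.2%}")
```

Because the function counts by label value rather than by position, it does not depend on which class happens to be more frequent.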

In [0]:
# Select numerical features for correlation analysis
numerical_features = df.select_dtypes(include=[np.number]).columns.tolist()
numerical_features.remove('flag_default')  # Remove target

# Calculate correlation with target
correlations = df[numerical_features + ['flag_default']].corr()['flag_default'].drop('flag_default').sort_values(ascending=False)

# Plot the 10 highest and 5 lowest correlations with the target
fig, ax = plt.subplots(figsize=(12, 8))
top_corr = pd.concat([correlations.head(10), correlations.tail(5)])
colors = ['#e74c3c' if x > 0 else '#3498db' for x in top_corr.values]
top_corr.plot(kind='barh', ax=ax, color=colors)
ax.set_title('Top Features Correlated with Default Flag', fontsize=14, fontweight='bold')
ax.set_xlabel('Correlation Coefficient', fontsize=12)
ax.set_ylabel('Features', fontsize=12)
ax.axvline(x=0, color='black', linestyle='--', linewidth=0.8)
plt.tight_layout()
plt.show()

print("\nTop 10 Positive Correlations with Default:")
print(correlations.head(10))
print("\nTop 5 Negative Correlations with Default:")
print(correlations.tail(5))
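
`DataFrame.corr()` computes the Pearson coefficient by default, so each value above measures the linear association between a numeric feature and the 0/1 target. The sketch below spells out that formula in plain Python; the function name `pearson` and the sample values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: average days late versus the default flag
days_late = [0, 2, 1, 15, 30, 25]
defaulted = [0, 0, 0, 1, 1, 1]

r = pearson(days_late, defaulted)
print(f"Correlation with default: {r:.3f}")
```

A coefficient near zero only rules out a linear relationship; tree models such as LightGBM can still use features whose effect is non-linear.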

In [0]:
# Visualize key features by default status
key_features = ['skor_kredit', 'debt_to_income_ratio', 'rasio_pembayaran', 'avg_hari_keterlambatan']

fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

for idx, feature in enumerate(key_features):
    # Remove nulls for visualization
    data_to_plot = df[[feature, 'flag_default']].dropna()
    
    data_to_plot.boxplot(column=feature, by='flag_default', ax=axes[idx])
    axes[idx].set_title(f'{feature} by Default Status', fontsize=12, fontweight='bold')
    axes[idx].set_xlabel('Flag Default (0=No Default, 1=Default)', fontsize=10)
    axes[idx].set_ylabel(feature, fontsize=10)
    axes[idx].get_figure().suptitle('')  # Remove automatic title

plt.tight_layout()
plt.show()

In [0]:
# Prepare features for modeling
print("Preparing features for modeling...\n")

# Separate features and target
X = df.drop(['flag_default', 'loan_id', 'applicant_id'], axis=1)
y = df['flag_default']

# Identify categorical and numerical columns
categorical_cols = X.select_dtypes(include=['object']).columns.tolist()
numerical_cols = X.select_dtypes(include=[np.number]).columns.tolist()

print(f"Categorical features ({len(categorical_cols)}): {categorical_cols}")
print(f"\nNumerical features ({len(numerical_cols)}): {numerical_cols[:10]}... (showing first 10)")

# Encode categorical variables
label_encoders = {}
for col in categorical_cols:
    le = LabelEncoder()
    X[col] = le.fit_transform(X[col].astype(str))
    label_encoders[col] = le

print(f"\nCategorical encoding completed.")

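# Handle missing values (fill with median for numerical)
# Note: medians are computed on the full dataset, before the train-test split
for col in numerical_cols:
    n_missing = X[col].isnull().sum()  # count before filling, otherwise it is always 0
    if n_missing > 0:
        median_val = X[col].median()
        X[col] = X[col].fillna(median_val)  # assign back instead of chained inplace fill
        print(f"Filled {n_missing} missing values in {col} with median: {median_val:.2f}")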

print(f"\nFinal feature matrix shape: {X.shape}")
print(f"Target variable shape: {y.shape}")
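
The cell above computes each median on the full dataset, before the train-test split. A common alternative is to learn the fill value from the training rows only and reuse it on the test rows, so that no test information leaks into preprocessing. The sketch below shows the idea in plain Python, using `None` for missing values (pandas would use `NaN`); the helper names and sample scores are hypothetical.

```python
from statistics import median


def fit_median(values):
    """Median of the observed (non-None) training values."""
    observed = [v for v in values if v is not None]
    return median(observed)


def impute(values, fill_value):
    """Replace None entries with a previously fitted fill value."""
    return [fill_value if v is None else v for v in values]


# Hypothetical credit scores with gaps
train_scores = [600, None, 700, 650, None]
test_scores = [None, 720]

n_missing = sum(v is None for v in train_scores)  # count before filling
fill = fit_median(train_scores)                   # learned from training rows only
train_filled = impute(train_scores, fill)
test_filled = impute(test_scores, fill)           # reuse the training median

print(f"Filled {n_missing} training values with median {fill}")
print(f"Test scores after imputation: {test_filled}")
```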

In [0]:
# Split data into train and test sets (80-20 split)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print(f"Training set size: {X_train.shape[0]} samples")
print(f"Test set size: {X_test.shape[0]} samples")
print(f"\nTraining set default rate: {y_train.mean():.2%}")
print(f"Test set default rate: {y_test.mean():.2%}")
print(f"\nFeature count: {X_train.shape[1]}")
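
Passing `stratify=y` is what keeps the default rate nearly identical in the training and test sets. The sketch below illustrates the idea by sampling the same fraction from each class; it is a simplified illustration, not scikit-learn's actual implementation, and the function name `stratified_split` is hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.2, seed=42):
    """Return (train_idx, test_idx), drawing test_size of every class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train_idx, test_idx = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical labels: 80 non-defaults and 20 defaults
labels = [0] * 80 + [1] * 20
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(f"Train default rate: {train_rate:.2%}, test default rate: {test_rate:.2%}")
```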

In [0]:
# Set MLflow experiment
mlflow.set_experiment(MLFLOW_EXPERIMENT_PATH)

# Start MLflow run
with mlflow.start_run(run_name=f"credit_scoring_{datetime.now().strftime('%Y%m%d_%H%M%S')}") as run:
    
    print("Training LightGBM model with MLflow tracking...\n")
    
    # Define model parameters
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'boosting_type': 'gbdt',
        'num_leaves': 31,
        'learning_rate': 0.05,
        'feature_fraction': 0.8,
        'bagging_fraction': 0.8,
        'bagging_freq': 5,
        'max_depth': 6,
        'min_child_samples': 20,
        'scale_pos_weight': len(y_train[y_train==0]) / len(y_train[y_train==1]),  # Handle class imbalance
        'random_state': 42,
        'verbose': -1
    }
    
    # Log parameters to MLflow
    mlflow.log_params(params)
    mlflow.log_param("train_size", len(X_train))
    mlflow.log_param("test_size", len(X_test))
    mlflow.log_param("n_features", X_train.shape[1])
    
    # Create LightGBM datasets
    train_data = lgb.Dataset(X_train, label=y_train)
    test_data = lgb.Dataset(X_test, label=y_test, reference=train_data)
    
    # Train model
    model = lgb.train(
        params,
        train_data,
        num_boost_round=200,
        valid_sets=[train_data, test_data],
        valid_names=['train', 'test'],
        callbacks=[lgb.early_stopping(stopping_rounds=20), lgb.log_evaluation(period=20)]
    )
    
    # Make predictions
    y_pred_proba = model.predict(X_test, num_iteration=model.best_iteration)
    y_pred = (y_pred_proba > 0.5).astype(int)
    
    # Calculate metrics
    from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
    
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    roc_auc = roc_auc_score(y_test, y_pred_proba)
    
    # Log metrics to MLflow
    mlflow.log_metric("accuracy", accuracy)
    mlflow.log_metric("precision", precision)
    mlflow.log_metric("recall", recall)
    mlflow.log_metric("f1_score", f1)
    mlflow.log_metric("roc_auc", roc_auc)
    mlflow.log_metric("best_iteration", model.best_iteration)
    
    # Log model to MLflow
    mlflow.lightgbm.log_model(model, "model", input_example=X_train.head(5))
    
    # Log feature importance as artifact
    feature_importance = pd.DataFrame({
        'feature': X_train.columns,
        'importance': model.feature_importance(importance_type='gain')
    }).sort_values('importance', ascending=False)
    
    mlflow.log_dict(feature_importance.to_dict(), "feature_importance.json")
    
    print("\n" + "="*80)
    print("MODEL TRAINING COMPLETED")
    print("="*80)
    print(f"\nMLflow Run ID: {run.info.run_id}")
    print(f"\nModel Performance Metrics:")
    print(f"  - Accuracy:  {accuracy:.4f}")
    print(f"  - Precision: {precision:.4f}")
    print(f"  - Recall:    {recall:.4f}")
    print(f"  - F1 Score:  {f1:.4f}")
    print(f"  - ROC AUC:   {roc_auc:.4f}")
    print(f"\nBest iteration: {model.best_iteration}")
    print(f"\n✓ Model and metrics logged to MLflow experiment")

In [0]:
# Plot confusion matrix
from sklearn.metrics import ConfusionMatrixDisplay

fig, ax = plt.subplots(figsize=(8, 6))
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm, display_labels=['No Default', 'Default'])
disp.plot(ax=ax, cmap='Blues', values_format='d')
ax.set_title('Confusion Matrix - Credit Scoring Model', fontsize=14, fontweight='bold')
plt.tight_layout()
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['No Default', 'Default']))
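
The confusion matrix and report above reflect the fixed 0.5 cutoff applied to `y_pred_proba`. Moving that cutoff trades precision against recall, which matters in credit scoring because a missed default and a wrongly rejected applicant carry different costs. The sketch below shows the trade-off on hypothetical scores; the helper name and data are illustrative only.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting 1 for scores above the threshold."""
    preds = [1 if s > threshold else 0 for s in scores]
    tp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(preds, y_true) if p == 1 and t == 0)
    fn = sum(1 for p, t in zip(preds, y_true) if p == 0 and t == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Hypothetical labels and predicted default probabilities
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.20, 0.35, 0.55, 0.45, 0.60, 0.70, 0.90]

for threshold in (0.3, 0.5, 0.7):
    p, r = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy data, lowering the threshold catches every default at the cost of more false alarms, while raising it does the opposite.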

In [0]:
# Plot ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_pred_proba)

fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(fpr, tpr, color='#e74c3c', linewidth=2, label=f'ROC Curve (AUC = {roc_auc:.4f})')
ax.plot([0, 1], [0, 1], color='gray', linestyle='--', linewidth=1, label='Random Classifier')
ax.set_xlabel('False Positive Rate', fontsize=12)
ax.set_ylabel('True Positive Rate', fontsize=12)
ax.set_title('ROC Curve - Credit Scoring Model', fontsize=14, fontweight='bold')
ax.legend(loc='lower right', fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
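
One way to read the AUC in the legend is as the probability that a randomly chosen default receives a higher score than a randomly chosen non-default, with ties counted as half. The sketch below computes it directly from that definition on hypothetical data; it is quadratic in the number of samples and meant for intuition only, not as a replacement for `roc_auc_score`.

```python
def roc_auc_pairwise(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for s, t in zip(scores, y_true) if t == 1]
    neg = [s for s, t in zip(scores, y_true) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted default probabilities
toy_labels = [0, 0, 0, 0, 1, 0, 1, 1]
toy_scores = [0.05, 0.20, 0.35, 0.55, 0.45, 0.60, 0.70, 0.90]

auc = roc_auc_pairwise(toy_labels, toy_scores)
print(f"Pairwise AUC: {auc:.4f}")
```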

In [0]:
# Plot Precision-Recall curve
precision_curve, recall_curve, pr_thresholds = precision_recall_curve(y_test, y_pred_proba)

fig, ax = plt.subplots(figsize=(10, 7))
ax.plot(recall_curve, precision_curve, color='#3498db', linewidth=2, label='Precision-Recall Curve')
ax.set_xlabel('Recall', fontsize=12)
ax.set_ylabel('Precision', fontsize=12)
ax.set_title('Precision-Recall Curve - Credit Scoring Model', fontsize=14, fontweight='bold')
ax.legend(loc='upper right', fontsize=11)
ax.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

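# Average precision is the recall-weighted sum of precisions, not the mean of the curve
from sklearn.metrics import average_precision_score

print(f"\nAverage Precision Score: {average_precision_score(y_test, y_pred_proba):.4f}")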
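
Average precision is usually defined as the sum of precision values weighted by the increase in recall at each threshold, AP = Σ (Rₙ − Rₙ₋₁) · Pₙ, which generally differs from the plain mean of the precision values on the curve. The sketch below is a minimal version of that definition, assuming there are no tied scores; the function name and data are hypothetical.

```python
def average_precision(y_true, scores):
    """Recall-weighted sum of precisions, assuming no tied scores."""
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    n_pos = sum(y_true)
    tp = 0
    ap = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            tp += 1
            # Recall rises by 1/n_pos here; precision at this rank is tp/rank
            ap += (tp / rank) / n_pos
    return ap


# Hypothetical labels and predicted default probabilities
ap_labels = [1, 0, 1, 1, 0, 0]
ap_scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]

ap = average_precision(ap_labels, ap_scores)
print(f"Average precision: {ap:.4f}")
```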

In [0]:
# Plot feature importance
top_n = 15
top_features = feature_importance.head(top_n)

fig, ax = plt.subplots(figsize=(12, 8))
ax.barh(range(len(top_features)), top_features['importance'], color='#9b59b6')
ax.set_yticks(range(len(top_features)))
ax.set_yticklabels(top_features['feature'])
ax.set_xlabel('Importance (Gain)', fontsize=12)
ax.set_ylabel('Features', fontsize=12)
ax.set_title(f'Top {top_n} Most Important Features', fontsize=14, fontweight='bold')
ax.invert_yaxis()
plt.tight_layout()
plt.show()

print(f"\nTop {top_n} Most Important Features:")
display(top_features)

In [0]:
import mlflow

mlflow.set_registry_uri("databricks-uc")
uc_model_name = f"{CATALOG}.{GOLD_SCHEMA}.credit_scoring_model"

result = mlflow.register_model(
    model_uri=f"runs:/{run.info.run_id}/model",
    name=uc_model_name
)

print(f"Model registered to Unity Catalog as: {uc_model_name}")
print(f"Version: {result.version}")

In [0]:
from mlflow import MlflowClient

# Set alias for the registered model version
client = MlflowClient()
client.set_registered_model_alias(
    name=uc_model_name,
    version=result.version,
    alias="production"
)
print(f"Alias 'production' set for model version {result.version}")

# Load model using alias
loaded_model = mlflow.pyfunc.load_model(f"models:/{uc_model_name}@production")
print("Model loaded successfully using alias 'production'")
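
The registration and loading cells rely on two MLflow model URI schemes: `runs:/<run_id>/<artifact_path>` for a model logged in a run, and `models:/<name>@<alias>` for a registered model alias. A third form, `models:/<name>/<version>`, pins a specific version. The helpers below are hypothetical convenience functions that only build these strings; the run ID is a placeholder.

```python
def run_model_uri(run_id, artifact_path="model"):
    """URI of a model logged under a run's artifact path."""
    return f"runs:/{run_id}/{artifact_path}"


def version_model_uri(name, version):
    """URI of a specific registered model version."""
    return f"models:/{name}/{version}"


def alias_model_uri(name, alias):
    """URI of whichever version currently holds the alias."""
    return f"models:/{name}@{alias}"


# Three-level Unity Catalog name, as used in this notebook
model_name = "workspace.gold.credit_scoring_model"

print(run_model_uri("abc123"))  # "abc123" is a placeholder run ID
print(version_model_uri(model_name, 1))
print(alias_model_uri(model_name, "production"))
```

Loading by alias means downstream code keeps working when the alias is moved to a newer version, whereas a version URI always resolves to the same model.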

## 🎯 Credit Scoring Model - Summary

### Model Performance
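Our LightGBM credit scoring model achieved the following results on the 20% held-out test set, using a 0.5 probability threshold. Because the same test set also served as the validation set for early stopping, these figures are likely somewhat optimistic:
* **Accuracy**: 93.65% - correctly classified about 94% of all loans
* **ROC AUC**: 0.9882 - strong separation between default and non-default
* **Recall**: 91.30% - identified about 91% of actual defaults
* **Precision**: 77.78% - when predicting default, correct about 78% of the time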
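
The F1 score is logged to MLflow but not listed above. It can be recovered from the reported precision and recall as their harmonic mean, as this short worked calculation shows; the result is implied by the two rounded figures rather than read from the run.

```python
# Reported test-set figures from the summary above
precision = 0.7778
recall = 0.9130

# F1 is the harmonic mean of precision and recall
f1 = 2 * precision * recall / (precision + recall)
print(f"Implied F1 score: {f1:.4f}")
```

The value comes out at roughly 0.84, sitting between the two inputs and closer to the lower one, as a harmonic mean does.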

### Key Findings
The most important features for predicting loan defaults are:
1. **avg_hari_keterlambatan** (average days late) - strongest predictor
2. **skor_kredit** (credit score) - second most important
3. **total_payments** - payment history matters
4. **total_hari_keterlambatan** - cumulative lateness
5. **total_denda** (total penalties) - indicates payment issues

### 📊 MLflow Capabilities Demonstrated

This notebook showcased several beginner-friendly MLflow features:

#### 1. **Experiment Tracking**
   * Created a dedicated experiment at `MLFLOW_EXPERIMENT_PATH` (`/Users/<USER_EMAIL>/credit-scoring-experiment`)
   * All runs are organized in one place for easy comparison

#### 2. **Parameter Logging**
   * Logged all model hyperparameters (learning_rate, num_leaves, max_depth, etc.)
   * Logged dataset information (train_size, test_size, n_features)
   * Makes it easy to reproduce results and understand what settings were used

#### 3. **Metrics Logging**
   * Logged key performance metrics explicitly with `mlflow.log_metric`:
     - Accuracy, Precision, Recall, F1 Score
     - ROC AUC for model discrimination
     - Best iteration from early stopping
   * Metrics are stored and can be compared across multiple runs

#### 4. **Model Logging**
   * Saved the trained LightGBM model as an MLflow artifact
   * Model can be loaded later for predictions or deployment
   * Registering the model in Unity Catalog adds versioning, lineage tracking, and the `production` alias

#### 5. **Artifact Logging**
   * Saved feature importance as a JSON artifact
   * Can store any additional files (plots, data, configs)
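
As an illustration of the artifact format, the sketch below serializes feature importances as one record per feature, which can be easier to read back than a column-oriented dictionary. With pandas, `feature_importance.to_dict(orient="records")` produces the same shape. The gain values are hypothetical, and wrapping the list under a `features` key is an assumption made here for illustration.

```python
import json

# Feature names from this notebook; gain values are hypothetical
features = ["skor_kredit", "avg_hari_keterlambatan", "total_payments"]
gains = [98.0, 120.5, 40.25]

# One record per feature, sorted by importance (highest first)
records = sorted(
    ({"feature": f, "importance": g} for f, g in zip(features, gains)),
    key=lambda record: record["importance"],
    reverse=True,
)

payload = json.dumps({"features": records}, indent=2)
restored = json.loads(payload)

print(payload)
```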

### 🔍 How to Access Your MLflow Experiment

You can view your experiment in the Databricks UI:
1. Click on **"Experiments"** in the left sidebar
2. Find the experiment: `credit-scoring-experiment`
3. Click on the run to see:
   * All logged parameters and metrics
   * Model artifacts and files
   * Comparison charts across multiple runs
   * Model lineage and versioning

### 💡 Next Steps

To improve this model further, you could:
* Try different algorithms (XGBoost, Random Forest)
* Perform hyperparameter tuning with Optuna
* Add more feature engineering
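* Compare the current `scale_pos_weight` setting against resampling approaches such as SMOTE for class imbalance
* Use a separate validation set for early stopping so the test set stays untouched until final evaluation
* Deploy the model version registered under the `production` alias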

All of these experiments can be tracked with MLflow to compare performance!