# Credit Scoring Model with AutoML

**Generated by:** Databricks Data Science Agent

**Purpose:** Demonstrate AutoML capabilities for building a credit scoring model to predict loan defaults

**Dataset:** `apradana_demo_permata.gold.final_features`

**Target Variable:** `flag_default` (binary classification: 0 = no default, 1 = default)

---

## ⚠️ Important: AutoML Compute Requirements

**AutoML requires a classic cluster** (single-node or multi-node) and **cannot run on serverless compute**.

### To run this notebook:

1. **Switch to a classic cluster:**
   * Click the compute selector in the top-right corner of this notebook
   * Select an existing classic cluster, or
   * Create a new cluster:
     - Click "Create compute"
     - Choose **Single Node** for a quick demo (cheaper and faster to start)
     - Recommended: `i3.xlarge` or `m5d.large` instance type
     - Runtime: DBR 13.3 LTS ML or higher

2. **Why classic cluster?**
   * AutoML needs to install additional libraries
   * Requires persistent compute for experiment tracking
   * Needs access to MLflow experiment management

3. **Once connected to a classic cluster**, run all cells below

---

*Note: This notebook was generated by Databricks Data Science Agent*

In [0]:
# Load the credit data from Unity Catalog
df = spark.table("apradana_demo_permata.gold.final_features")

# Display basic information
print(f"Total records: {df.count()}")
print(f"\nTarget variable distribution:")
df.groupBy("flag_default").count().orderBy("flag_default").show()

# Convert to Pandas for AutoML (AutoML works with Pandas DataFrames)
df_pandas = df.toPandas()

print(f"\nDataset shape: {df_pandas.shape}")
print(f"Features available: {df_pandas.shape[1] - 1}")

In [0]:
%pip install databricks-automl-runtime --quiet
dbutils.library.restartPython()

In [0]:
# Install AutoML runtime (required on some cluster configurations)


from databricks import automl

# Run AutoML for binary classification
# AutoML will automatically:
# - Split data into train/validation/test sets
# - Try multiple algorithms (logistic regression, decision trees, random forest, XGBoost, LightGBM, etc.)
# - Perform hyperparameter tuning
# - Handle feature engineering
# - Log experiments to MLflow
# - Generate a notebook with the best model

summary = automl.classify(
    dataset=df_pandas,
    target_col="flag_default",
    primary_metric="f1",  # Good for imbalanced classification
    timeout_minutes=10,  # Limit runtime for demo purposes
    experiment_name="/Users/aditya.pradana@databricks.com/credit_scoring_automl"
)

print("\n" + "="*50)
print("AutoML Run Complete!")
print("="*50)

[0;31m---------------------------------------------------------------------------[0m
[0;31mImportError[0m                               Traceback (most recent call last)
File [0;32m<command-6224717010893906>, line 4[0m
[1;32m      1[0m [38;5;66;03m# Install AutoML runtime (required on some cluster configurations)[39;00m
[0;32m----> 4[0m [38;5;28;01mfrom[39;00m [38;5;21;01mdatabricks[39;00m [38;5;28;01mimport[39;00m automl
[1;32m      6[0m [38;5;66;03m# Run AutoML for binary classification[39;00m
[1;32m      7[0m [38;5;66;03m# AutoML will automatically:[39;00m
[1;32m      8[0m [38;5;66;03m# - Split data into train/validation/test sets[39;00m
[0;32m   (...)[0m
[1;32m     12[0m [38;5;66;03m# - Log experiments to MLflow[39;00m
[1;32m     13[0m [38;5;66;03m# - Generate a notebook with the best model[39;00m
[1;32m     15[0m summary [38;5;241m=[39m automl[38;5;241m.[39mclassify(
[1;32m     16[0m     dataset[38;5;241m=[39mdf_pandas,
[1;32m    

In [0]:
# Display the best trial information
print(f"Best trial ID: {summary.best_trial_notebook_id}")
print(f"Best model metrics:")
print(f"  - F1 Score: {summary.best_trial_f1_score:.4f}")
print(f"  - Precision: {summary.best_trial_precision_score:.4f}")
print(f"  - Recall: {summary.best_trial_recall_score:.4f}")
print(f"  - Accuracy: {summary.best_trial_accuracy_score:.4f}")

print(f"\nExperiment URL: {summary.experiment.experiment_url}")
print(f"\nBest trial notebook: {summary.best_trial_notebook_url}")
print("\n💡 Click the notebook URL above to see the detailed model code and evaluation!")

## 🚀 AutoML Simplifies the ML Lifecycle

### What AutoML Just Did For You:

1. **Automated Algorithm Selection**
   * Tested multiple algorithms (Logistic Regression, Random Forest, XGBoost, LightGBM, etc.)
   * Compared performance across different model types

2. **Hyperparameter Optimization**
   * Automatically tuned parameters for each algorithm
   * Used intelligent search strategies to find optimal configurations

3. **Feature Engineering**
   * Handled categorical variables (one-hot encoding)
   * Managed missing values
   * Scaled numerical features appropriately

4. **Model Evaluation**
   * Split data into train/validation/test sets
   * Calculated multiple metrics (F1, precision, recall, accuracy, ROC-AUC)
   * Generated confusion matrices and feature importance plots

5. **MLflow Integration**
   * Logged all experiments automatically
   * Tracked parameters, metrics, and artifacts
   * Registered the best model for deployment

6. **Reproducible Notebooks**
   * Generated editable notebooks for each trial
   * Included complete code for the best model
   * Easy to customize and retrain

### Time Saved:
* **Without AutoML**: 2-3 days of manual experimentation
* **With AutoML**: 10 minutes of automated optimization

### Next Steps:
* Review the best model notebook for detailed insights
* Register the model to Unity Catalog for deployment
* Use the model for batch or real-time scoring