# Automated Model Development with AutoML

This demo shows how to start AutoML experiments both through the AutoML UI and programmatically through the AutoML API.  
With the API, we also cover custom functionality such as feature table integration and custom split ratios for the train, validation, and test sets.

## Learning Objectives:

*By the end of this demo, you will be able to:*

- Start an AutoML experiment via the AutoML UI.  
- Start an AutoML experiment via the AutoML API.  
- Open and edit a notebook generated by AutoML.  
- Identify the best model generated by AutoML based on a given metric.  
- Modify the best model generated by AutoML.


## Requirements

Please review the following requirements before starting the lesson:

- To run this notebook, you need to use one of the following Databricks runtime(s): **16.0.x-cpu-ml-scala2.12**

---

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default. Follow these steps to select the classic compute cluster:

1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.
2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:
   - In the drop-down, select **More**.
   - In the **Attach to an existing compute resource** pop-up, select the first drop-down. You will see a unique cluster name in that drop-down. Please select that cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select **Open in new tab**.  
2. Find the triangle icon to the right of your compute cluster name and click it.  
3. Wait a few minutes for the cluster to start.  
4. Once the cluster is running, complete the steps above to select your cluster.

---

## Classroom Setup

Before starting the demo, run the provided classroom setup script. This script will define configuration variables necessary for the demo. Execute the following cell:


In [0]:
%run ../Includes/Classroom-Setup-3.1


## Other Conventions:

Throughout this demo, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:


In [0]:
print(f"Username:         {DA.username}")
print(f"Catalog Name:     {DA.catalog_name}")
print(f"Schema Name:      {DA.schema_name}")
print(f"Working Directory:{DA.paths.working_dir}")
print(f"User DB Location: {DA.paths.datasets}")


## Prepare Data

For this demonstration, we will utilize a fictional dataset from a Telecom Company, which includes customer information. This dataset encompasses **customer demographics**, including gender, as well as internet subscription details such as subscription plans and payment methods.

A table with all features is already created for you.

**Table name:** `customer_churn`

To get started, execute the code block below and review the dataset schema.


In [0]:
churn_data = spark.sql("SELECT * FROM customer_churn")
display(churn_data)


## Visualize Dataset

Let’s preview the `customer_churn` dataset using a SQL query with Spark.


## AutoML Experiment with UI

Databricks AutoML supports experimentation via the UI and the API.  
**In the first section of this demo, we create an experiment using the UI.**  
We then show how to create the same experiment via the API.

### Create AutoML Experiment

Let's initiate an AutoML experiment to construct a baseline model for predicting customer churn.  
The target field for this prediction will be the `Churn` field.

Follow these step-by-step instructions to create an AutoML experiment:

1. Navigate to the **Experiments** section in Databricks.  
2. Click on **Start training** under **Classification**.
3. Choose a cluster to execute the experiment.  
4. Select the **catalog > schema > `customer_churn`** table,  
   which was created in the previous step, as the input training dataset.
5. Specify `Churn` as the prediction target.



## AutoML Experiment with API

In the previous section, we created an AutoML experiment using the user interface (UI) with basic functionalities. AutoML also supports advanced functionalities, such as **feature table integration** and **custom data split ratios**, which can enhance model performance and flexibility.

In this section, we will utilize the AutoML API to create an experiment incorporating these advanced features. By leveraging the API, we gain greater control over the experiment's configuration, enabling the customization of feature inputs and the specification of data splitting strategies.

### Set Features Table

AutoML supports the use of feature tables as input. During setup, a feature table (`customer_churn_features`) is created. In this section, we will utilize this feature table during model training.


In [0]:
features_table_path = f"{DA.catalog_name}.{DA.schema_name}.customer_churn_features"

# View features tables
display(spark.sql(f"SELECT * FROM {features_table_path}"))

# Define the feature store lookups
feature_store_lookups = [
    {
        "table_name": features_table_path,
        "lookup_key": ["CustomerID"]
    }
]


## Set Custom Split - Random Split

If you prefer AutoML to split the dataset with a different ratio than **the default 60:20:20**, you can create a new column in your dataset with the desired split assignments. This column **should contain the values** `"train"`, `"validate"`, or `"test"` to designate each row's role. When invoking the AutoML API, pass this column to the `split_col` parameter.

This approach allows you to define custom data splits tailored to your specific requirements. Ensure that the `custom_split` column accurately reflects the intended distribution of your data into training, validation, and test sets.

> **Example for understanding the code below**:  
> Consider three rows whose random values are 0.5, 0.8, and 0.91.  
> The row with 0.5 becomes a training data point, the row with 0.8 a validation data point, and the row with 0.91 a test data point.  
> In general, values in the interval [0, 0.8) belong to the training set,  
> values in [0.8, 0.9) belong to the validation set,  
> and values in [0.9, 1.0) belong to the test set.
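
To make the threshold logic concrete, here is a minimal plain-Python sketch that mirrors the `when(...)` chain in the Spark cell for a single random value. The function name `assign_split` is hypothetical and used only for illustration; it is not part of AutoML or PySpark.

```python
def assign_split(u: float, train_ratio: float = 0.8, test_ratio: float = 0.1) -> str:
    """Map a uniform random draw u in [0, 1) to a split label."""
    if u < train_ratio:          # [0, train_ratio) -> train
        return "train"
    if u < 1 - test_ratio:       # [train_ratio, 1 - test_ratio) -> validate
        return "validate"
    return "test"                # [1 - test_ratio, 1) -> test


# The three example values from the note above
for u in (0.5, 0.8, 0.91):
    print(u, assign_split(u))
```

Because the comparisons are strict (`<`), a value exactly equal to a threshold falls into the next split, which is why 0.8 is a validation point.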


In [0]:
from pyspark.sql.functions import when, rand

dataset = spark.read.table("customer_churn")

seed = 42  # define your seed here for reproduction
train_ratio, validate_ratio, test_ratio = 0.8, 0.1, 0.1  # define your preferred ratios here

dataset = dataset.withColumn("random", rand(seed=seed))
dataset = dataset.withColumn(
    "custom_split",
    when(dataset.random < train_ratio, "train")
    .when(dataset.random < 1 - test_ratio, "validate")
    .otherwise("test")
)

dataset = dataset.drop("random")
display(dataset)


## Further Reading: Stratified Sampling with AutoML

Stratified sampling ensures that the distribution of a categorical variable (e.g., target labels) is preserved across the training, validation, and test sets. This is particularly useful when dealing with imbalanced datasets.

1. **Identify the Stratification Column** – Choose a categorical variable to maintain proportions across dataset splits.
2. **Compute Class Proportions** – Determine the distribution of each category in the dataset.
3. **Calculate Sample Sizes** – Apply the desired split ratios to compute the exact number of records per class for each split.
4. **Perform Stratified Sampling** – Split each category proportionally into training, validation, and test sets.
5. **Assign Labels and Combine Splits** – Label the subsets accordingly and merge them into the final dataset.
6. **Validate Class Distribution** – Ensure each split maintains the original class proportions.
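
The steps above can be sketched on a small in-memory dataset before turning to Spark. This is only an illustration under stated assumptions: the helper `stratified_split` and the toy rows (60 "Female", 40 "Male") are hypothetical, and the sketch uses exact per-class counts rather than Spark's approximate `sampleBy`.

```python
import random
from collections import Counter, defaultdict


def stratified_split(rows, key, ratios=(0.8, 0.1, 0.1), seed=42):
    """Return (row, split_label) pairs, stratified on row[key]."""
    train_ratio, validate_ratio, _ = ratios
    rng = random.Random(seed)

    # Steps 1-2: group rows by the stratification column
    by_class = defaultdict(list)
    for row in rows:
        by_class[row[key]].append(row)

    labeled = []
    for members in by_class.values():
        members = members[:]      # copy so the caller's list is untouched
        rng.shuffle(members)
        # Step 3: exact sample sizes per class
        n_train = round(len(members) * train_ratio)
        n_validate = round(len(members) * validate_ratio)
        # Steps 4-5: assign labels; the remainder goes to the test set
        for i, row in enumerate(members):
            if i < n_train:
                label = "train"
            elif i < n_train + n_validate:
                label = "validate"
            else:
                label = "test"
            labeled.append((row, label))
    return labeled


# Hypothetical toy data: 60 "Female" rows and 40 "Male" rows
rows = [{"CustomerID": i, "Gender": "Female" if i < 60 else "Male"} for i in range(100)]
labeled = stratified_split(rows, key="Gender")

# Step 6: check that each split keeps the original 60:40 proportion
counts = Counter((row["Gender"], label) for row, label in labeled)
for (gender, label), n in sorted(counts.items()):
    print(gender, label, n)
```

Each class is split independently, so every split preserves the original class proportions up to rounding.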

---

### Sample Code
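```python
# Note: this import shadows Python's built-in round() within this cell
from pyspark.sql.functions import col, count, lit, round

# Load dataset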
dataset = spark.read.table("customer_churn")

# Define stratification column
stratify_col = "Gender"

# Define split ratios
train_ratio, validate_ratio, test_ratio = 0.8, 0.1, 0.1
seed = 42

# Step 1: Compute class counts and original distribution
class_counts = dataset.groupBy(stratify_col).agg(count("*").alias("count"))

original_distribution = (
    class_counts.withColumn("percentage", round((col("count") / dataset.count()) * 100, 2))
                 .withColumn("dataset", lit("original"))
)

# Step 2: Perform stratified sampling
train_df = dataset.sampleBy(stratify_col, {row[stratify_col]: train_ratio for row in class_counts.collect()}, seed)
validate_df = dataset.subtract(train_df).sampleBy(
    stratify_col,
    {row[stratify_col]: validate_ratio / (validate_ratio + test_ratio) for row in class_counts.collect()},
    seed
)
test_df = dataset.subtract(train_df).subtract(validate_df)

# Assign split labels
train_df = train_df.withColumn("custom_split", lit("train"))
validate_df = validate_df.withColumn("custom_split", lit("validate"))
test_df = test_df.withColumn("custom_split", lit("test"))

# Combine datasets efficiently
final_dataset = train_df.unionByName(validate_df).unionByName(test_df)
```

```python
# Step 3: Validate stratification by comparing per-split percentages
def validate_distribution(df, split_name):
    total_split_count = df.count()
    return (
        df.groupBy(stratify_col)
          .agg(count("*").alias("count"))
          .withColumn("dataset", lit(split_name))
          .withColumn("percentage", round((col("count") / total_split_count) * 100, 2))
    )

# Compute distributions
train_dist = validate_distribution(train_df, "train")
validate_dist = validate_distribution(validate_df, "validate")
test_dist = validate_distribution(test_df, "test")

# Ensure schema consistency before union
columns_order = [stratify_col, "count", "percentage", "dataset"]

original_distribution = original_distribution.select(*columns_order)
train_dist = train_dist.select(*columns_order)
validate_dist = validate_dist.select(*columns_order)
test_dist = test_dist.select(*columns_order)

# Combine all distributions (original + splits)
distribution_comparison = original_distribution.unionByName(train_dist) \
                                               .unionByName(validate_dist) \
                                               .unionByName(test_dist)

# Display the final distribution comparison
display(distribution_comparison)
```


## Start an Experiment

Now that we have the **feature lookups** and the **custom split column** ready, we can set up an AutoML experiment.


In [0]:
from databricks import automl
from datetime import datetime

automl_run = automl.classify(
    dataset = dataset,
    target_col = "Churn",
    split_col = "custom_split",
    exclude_cols = ["CustomerID"],  # Exclude columns as needed
    timeout_minutes = 5,
    feature_store_lookups = feature_store_lookups
)


## Search for the Best Run

To search for the best run in this experiment, we first **get the experiment ID** and then **search for the runs** in that experiment.


In [0]:
import mlflow

# Get the experiment path by experiment ID
exp_path = mlflow.get_experiment(automl_run.experiment.experiment_id).name

# Find the most recent experiment in the AutoML folder
filter_string = f"name LIKE '{exp_path}'"
automl_experiment_id = mlflow.search_experiments(
    filter_string=filter_string,
    max_results=1,
    order_by=["last_update_time DESC"]
)[0].experiment_id


In [0]:
from mlflow.entities import ViewType

# List the finished runs, sorted so the highest validation F1 score comes first
automl_runs_pd = mlflow.search_runs(
    experiment_ids=[automl_experiment_id],
    filter_string="attributes.status = 'FINISHED'",
    run_view_type=ViewType.ACTIVE_ONLY,
    order_by=["metrics.val_f1_score DESC"]
)


Print information about the best trial from the AutoML experiment.


In [0]:
print(automl_run.best_trial)
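
If you only need a few fields rather than the full printout, a small helper can collect them into a dictionary. This is a sketch under assumptions: the attribute names below (`mlflow_run_id`, `model_description`, `evaluation_metric_score`, `notebook_path`) follow the `TrialInfo` class as described in the Databricks AutoML Python API reference, so verify them against your runtime version. The helper `summarize_trial` and the stand-in object are hypothetical.

```python
from types import SimpleNamespace


def summarize_trial(trial) -> dict:
    """Collect selected trial attributes; missing attributes become None."""
    fields = (
        "mlflow_run_id",
        "model_description",
        "evaluation_metric_score",
        "notebook_path",
    )
    return {name: getattr(trial, name, None) for name in fields}


# Stand-in object with made-up values so the sketch runs outside Databricks
fake_trial = SimpleNamespace(
    mlflow_run_id="abc123",
    model_description="example model",
    evaluation_metric_score=0.5,
)
print(summarize_trial(fake_trial))
```

In the notebook, you would call `summarize_trial(automl_run.best_trial)` instead of passing the stand-in object.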


## Import notebooks for other runs in AutoML

For classification and regression experiments, the AutoML-generated notebooks for data exploration and for the best trial in your experiment are automatically imported to your workspace. Generated notebooks for the other experiment trials are saved as MLflow artifacts on DBFS instead of being auto-imported into your workspace.

For all trials besides the best trial, the `notebook_path` and `notebook_url` in the TrialInfo Python API are not set. If you need to use these notebooks, you can manually import them into your workspace with the AutoML experiment UI or the `automl.import_notebook` Python API.

📛 **Notice**: `destination_path` takes Workspace as root.


In [0]:
# Create the destination path for the imported trial notebook
destination_path = f"/Users/{DA.username}/imported_notebooks/demo-3.1-{datetime.now().strftime('%Y%m%d%H%M%S')}"

# Import the notebook for one of the trials (index 1 here) and get its path and URL
result = automl.import_notebook(automl_run.trials[1].artifact_uri, destination_path)
print(f"The notebook is imported to: {result.path}")
print(f"The notebook URL           : {result.url}")
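
To find which trials still need a manual import, you can filter for trials whose `notebook_path` is not set, as described above. The helper `trials_without_notebook` is hypothetical, and the stand-in trial objects use made-up values so the sketch runs outside Databricks.

```python
from types import SimpleNamespace


def trials_without_notebook(trials):
    """Return the trials whose notebook has not been imported to the workspace."""
    return [t for t in trials if getattr(t, "notebook_path", None) is None]


# Stand-in trials: only the first one has an imported notebook
fake_trials = [
    SimpleNamespace(artifact_uri="uri-0", notebook_path="/Users/someone/best"),
    SimpleNamespace(artifact_uri="uri-1", notebook_path=None),
    SimpleNamespace(artifact_uri="uri-2", notebook_path=None),
]
pending = trials_without_notebook(fake_trials)
print([t.artifact_uri for t in pending])
```

In the notebook, you would pass `automl_run.trials` and call `automl.import_notebook` with each returned trial's `artifact_uri` and a distinct destination path.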


## Conclusion

In this demo, we showed how to use the AutoML UI and the AutoML API to create a classification model, how to retrieve the best run and access the generated notebook, and how to modify the parameters of the best model.
