
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning">
</div>



# LAB - Batch Deployment

Welcome to the "Batch Deployment" lab! This lab focuses on batch deployment of machine learning models using Databricks. You will work through tasks on model inference and the model registry, and explore performance results for features such as Liquid Clustering using `CLUSTER BY`.

**Learning Objectives:**

By the end of this lab, you will be able to:

+ **Task 1: Load Dataset**
    + Load Dataset
    + Split the dataset to features and response sets

+ **Task 2: Inference with feature table**

    + Create Feature Table
    + Setup Feature Lookups
    + Fit and Register a Model with UC using Feature Table
    + Perform batch inference using Feature Engineering's **`score_batch`** method.

+ **Task 3: Assess Liquid Clustering:**

    + Evaluate the performance results for specific optimization techniques:
        + Liquid Clustering


## REQUIRED - SELECT CLASSIC COMPUTE
Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:
1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.
   
   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Requirements

Please review the following requirements before starting the lab:

* To run this notebook, you need to use one of the following Databricks runtime(s): **16.3.x-cpu-ml-scala2.12**


## Classroom Setup

Before starting the lab, run the provided classroom setup script. This script defines the configuration variables needed for the lab. Execute the following cell:

In [0]:
%run ../Includes/Classroom-Setup-2.2

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


Using catalog dbacademy and schema labuser10817494_1751605583.


**Other Conventions:**

Throughout this Lab, we'll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:

In [0]:
print(f"Username:          {DA.username}")
print(f"Catalog Name:      {DA.catalog_name}")
print(f"Schema Name:       {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"User DB Location:  {DA.paths.datasets}")

Username:          labuser10817494_1751605583@vocareum.com
Catalog Name:      dbacademy
Schema Name:       labuser10817494_1751605583
Working Directory: /Volumes/dbacademy/ops/labuser10817494_1751605583@vocareum_com
User DB Location:  NestedNamespace (telco='/Volumes/dbacademy_telco/v01', cdc_diabetes='/Volumes/dbacademy_cdc_diabetes/v01')


## Task 1: Load Dataset

+ Load a dataset:
  + Define the dataset path
  + Define the primary key (`customerID`), response variable (`Churn`), and feature variables (`SeniorCitizen`, `tenure`, `MonthlyCharges`, `TotalCharges`) for further processing.
  + Read the dataset, casting relevant columns to the correct data types, and drop any rows with missing values
+ Split the dataset into training and testing sets
  + Separate the features and the response for the training set


In [0]:
from pyspark.sql.functions import col
## Load dataset with spark
shared_volume_name = 'telco' # From Marketplace
csv_name = 'telco-customer-churn-missing' # CSV file name
dataset_p_telco = f"{DA.paths.datasets.telco}/{shared_volume_name}/{csv_name}.csv" # Full path

## Features to use
primary_key = "customerID"
response = "Churn"
features = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]

## Read dataset (and drop nan)
telco_df = spark.read.csv(dataset_p_telco, inferSchema=True, header=True, multiLine=True, escape='"')\
            .withColumn("TotalCharges", col("TotalCharges").cast('double'))\
            .withColumn("SeniorCitizen", col("SeniorCitizen").cast('double'))\
            .withColumn("tenure", col("tenure").cast('double'))\
            .na.drop(how='any')

## Split with 80 percent of the data in train_df and 20 percent of the data in test_df
train_df, test_df = telco_df.randomSplit([.8, .2], seed=42)

## Separate features and ground-truth
features_df = train_df.select(primary_key, *features)
response_df = train_df.select(primary_key, response)
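
To make the two data-preparation ideas above concrete without a Spark session, here is a minimal plain-Python sketch. It is not how Spark implements `na.drop` or `randomSplit`; the rows and the helper names `drop_incomplete` and `random_split` are hypothetical and exist only for illustration.

```python
import random

def drop_incomplete(rows):
    """Keep only rows with no missing value, similar in spirit to na.drop(how='any')."""
    return [row for row in rows if all(value is not None for value in row.values())]

def random_split(rows, train_weight=0.8, seed=42):
    """Send each row to train or test using an independent, seeded random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        target = train if rng.random() < train_weight else test
        target.append(row)
    return train, test

# Hypothetical rows; None stands in for a missing value
rows = [
    {"customerID": "C-001", "tenure": 3.0, "TotalCharges": 267.4},
    {"customerID": "C-002", "tenure": None, "TotalCharges": 120.0},
    {"customerID": "C-003", "tenure": 23.0, "TotalCharges": 1849.95},
    {"customerID": "C-004", "tenure": 37.0, "TotalCharges": None},
    {"customerID": "C-005", "tenure": 72.0, "TotalCharges": 6316.2},
]

clean_rows = drop_incomplete(rows)
train_rows, test_rows = random_split(clean_rows)
print(len(clean_rows), len(train_rows), len(test_rows))
```

In this sketch each row is drawn independently, so the split sizes are only approximately 80/20 on small inputs, and reusing the same seed on the same rows reproduces the same split.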

## Task 2: Inference with feature table
In this task, you will perform batch inference using a feature table. Follow the steps below:

+ **Step 1: Create Feature Table**

+ **Step 2: Setup Feature Lookups**

+ **Step 3: Fit and Register a Model with UC using Feature Table**

+ **Step 4: Use the Model for Inference**


**Step 1: Create Feature Table**
  + Begin by creating a feature table that incorporates the relevant features for inference. This involves selecting the appropriate columns, performing any necessary transformations, and storing the resulting data in a feature table.

In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

## Prepare feature set
features_df_all = telco_df.select(primary_key, *features)

## Feature table definition
fe = FeatureEngineeringClient()
feature_table_name = f"{DA.catalog_name}.{DA.schema_name}.features"

## Drop table if exists
try:
    fe.drop_table(name=feature_table_name)
except Exception:
    ## Ignore the error raised when the table does not exist yet
    pass

## Create feature table
fe.create_table(
    name = feature_table_name,
    df = features_df_all,
    primary_keys=[primary_key],
    description= "Features for Churn Prediction",
)

2025/07/04 06:02:54 INFO databricks.ml_features._compute_client._compute_client: Setting columns ['customerID'] of table 'dbacademy.labuser10817494_1751605583.features' to NOT NULL.
2025/07/04 06:02:55 INFO databricks.ml_features._compute_client._compute_client: Setting Primary Keys constraint ['customerID'] on table 'dbacademy.labuser10817494_1751605583.features'.
2025/07/04 06:03:00 INFO databricks.ml_features._compute_client._compute_client: Created feature table 'dbacademy.labuser10817494_1751605583.features'.


<FeatureTable: name='dbacademy.labuser10817494_1751605583.features', table_id='45bead10-e06a-4280-b977-910622e35536', description='Features for Churn Prediction', primary_keys=['customerID'], partition_columns=[], features=['customerID', 'SeniorCitizen', 'tenure', 'MonthlyCharges', 'TotalCharges'], creation_timestamp=1751608974135, online_stores=[], notebook_producers=[], job_producers=[], table_data_sources=[], path_data_sources=[], custom_data_sources=[], timestamp_keys=[], tags={}>
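
The log output above shows the client marking `customerID` as NOT NULL and adding a primary key constraint. Since the primary key is meant to identify each record uniquely, it can help to check the key column before creating the table. The following is a minimal plain-Python sketch of such a check; the rows and the helper `validate_primary_key` are hypothetical and are not part of the Feature Engineering client.

```python
def validate_primary_key(rows, key):
    """Raise ValueError if a key is missing or repeated; return the number of distinct keys."""
    seen = set()
    for row in rows:
        value = row.get(key)
        if value is None:
            raise ValueError(f"{key} must not be null")
        if value in seen:
            raise ValueError(f"duplicate {key}: {value}")
        seen.add(value)
    return len(seen)

# Hypothetical feature rows
feature_rows = [
    {"customerID": "C-001", "tenure": 3.0},
    {"customerID": "C-002", "tenure": 23.0},
]

print(validate_primary_key(feature_rows, "customerID"))  # prints 2
```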

**Step 2: Setup Feature Lookups**
  + Set up a feature lookup to create a training set from the feature table. 
  + Specify the `lookup_key` based on the columns that uniquely identify records in your feature table.

In [0]:
from databricks.feature_engineering import FeatureLookup

fl_handle = FeatureLookup(
    lookup_key= [primary_key],
    table_name = feature_table_name
)

## Create a training set based on feature lookup
training_set_spec = fe.create_training_set(
    df = response_df,
    label = response,
    feature_lookups= [fl_handle],
    exclude_columns=[primary_key],
)

## Load training dataframe based on defined feature-lookup specification
training_df = training_set_spec.load_df()
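
Conceptually, a feature lookup attaches feature columns to each labeled row by matching on the lookup key, and `exclude_columns` then removes columns that should not reach the model. The sketch below mimics that idea with plain dictionaries. It is an illustration under stated assumptions, not the client's implementation: the function `create_training_rows` and all rows are hypothetical, and it assumes every key has a matching feature row.

```python
def create_training_rows(label_rows, feature_table, lookup_key, exclude_columns=()):
    """Attach features to each label row by key, then drop the excluded columns."""
    training_rows = []
    for label_row in label_rows:
        # Sketch assumption: every key in label_rows exists in feature_table
        features = feature_table[label_row[lookup_key]]
        merged = {**label_row, **features}
        training_rows.append(
            {name: value for name, value in merged.items() if name not in exclude_columns}
        )
    return training_rows

# Hypothetical feature table keyed by customerID, and hypothetical labels
feature_table = {
    "C-001": {"tenure": 3.0, "MonthlyCharges": 83.9},
    "C-002": {"tenure": 23.0, "MonthlyCharges": 83.75},
}
label_rows = [
    {"customerID": "C-001", "Churn": "Yes"},
    {"customerID": "C-002", "Churn": "No"},
]

training_rows = create_training_rows(
    label_rows, feature_table, "customerID", exclude_columns=("customerID",)
)
print(training_rows[0])
```

Only the key and the label need to be supplied by the caller; the feature values come from the table, which is why `response_df` above carries just `customerID` and `Churn`.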

**Step 3: Fit and Register a Model with UC using Feature Table**
  + Train a machine learning model on the training set created above.
  + Log the model with the Feature Engineering client and register it in Unity Catalog.

In [0]:
import mlflow
import warnings
from mlflow.types.utils import _infer_schema

## Point to UC model registry
mlflow.set_registry_uri("databricks-uc")
client = mlflow.MlflowClient()

## Helper function that we will use for getting latest version of a model
def get_latest_model_version(model_name):
    """Helper function to get latest model version"""
    model_version_infos = client.search_model_versions("name = '%s'" % model_name)
    return max(int(model_version_info.version) for model_version_info in model_version_infos)

## Train a sklearn Decision Tree Classification model
from sklearn.tree import DecisionTreeClassifier

## Convert data to pandas dataframes
X_train_pdf = training_df.drop(primary_key, response).toPandas()
Y_train_pdf = training_df.select(response).toPandas()
clf = DecisionTreeClassifier(max_depth=3, random_state=42)

## End the active MLflow run before starting a new one
mlflow.end_run()

with mlflow.start_run(run_name="Model-Batch-Deployment-lab-With-FS") as mlflow_run:

    ##Enable automatic logging of input samples, metrics, parameters, and models
    mlflow.sklearn.autolog(
        log_input_examples=True,
        log_models=False,
        log_post_training_metrics=True,
        silent=True)
    
    clf.fit(X_train_pdf, Y_train_pdf)

    ## Infer output schema from predictions on the training features
    try:
        output_schema = _infer_schema(clf.predict(X_train_pdf))
    except Exception as e:
        warnings.warn(f"Could not infer model output schema: {e}")
        output_schema = None

    model_name = f"{DA.catalog_name}.{DA.schema_name}.ml_model"
    
    ## Log using feature engineering client and push to registry
    fe.log_model(
        flavor = mlflow.sklearn,
        model = clf,
        artifact_path= "decision_tree",
        training_set= training_set_spec,
        registered_model_name= model_name,
        output_schema = output_schema
    )

    ## Set model alias (i.e. Champion)
    client.set_registered_model_alias(model_name, "Champion", get_latest_model_version(model_name))


Registered model 'dbacademy.labuser10817494_1751605583.ml_model' already exists. Creating a new version of this model...


Created version '3' of model 'dbacademy.labuser10817494_1751605583.ml_model'.
2025/07/04 06:14:54 INFO mlflow.tracking._tracking_service.client: 🏃 View run Model-Batch-Deployment-lab-With-FS at: dbc-7a815474-411f.cloud.databricks.com/ml/experiments/4054457989793451/runs/01d9a6f308984d76aecf85b812fad009.
2025/07/04 06:14:54 INFO mlflow.tracking._tracking_service.client: 🧪 View experiment at: dbc-7a815474-411f.cloud.databricks.com/ml/experiments/4054457989793451.
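
When picking the latest model version, it matters whether versions are compared as numbers or as text. Depending on the registry backend, version identifiers may come back as strings, and a text comparison ranks `"9"` above `"10"`. The short sketch below shows the difference; the version list and the helper `latest_version` are hypothetical.

```python
def latest_version(versions):
    """Return the highest version number, comparing numerically rather than as text."""
    return max(int(version) for version in versions)

# Hypothetical version identifiers returned as strings
versions = ["1", "2", "9", "10"]

print(max(versions))             # prints 9 (text comparison)
print(latest_version(versions))  # prints 10 (numeric comparison)
```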


**Step 4: Use the Model for Inference**
  + Utilize the feature engineering client's `score_batch()` method for inference.
  + Provide the model URI and a dataframe containing primary key information for the inference.

In [0]:
## Load the model
model_uri = f"models:/{model_name}@champion"

## Define the test dataset
test_features_df = test_df.select("customerID")

## Perform batch inference using Feature Engineering's score_batch method
result_df = fe.score_batch(
  model_uri=model_uri,
  df=test_features_df,
  result_type="string"
)

## Display the inference results
display(result_df)

2025/07/04 06:16:48 INFO mlflow.models.flavor_backend_registry: Selected backend for flavor 'python_function'


customerID,SeniorCitizen,tenure,MonthlyCharges,TotalCharges,prediction
0013-EXCHZ,1.0,3.0,83.9,267.4,Yes
0027-KWYKW,0.0,23.0,83.75,1849.95,No
0048-LUMLS,0.0,37.0,91.2,3247.55,No
0078-XZMHT,0.0,72.0,85.15,6316.2,No
0096-BXERS,0.0,6.0,50.35,314.55,No
0114-IGABW,0.0,71.0,58.25,4145.9,No
0139-IVFJG,0.0,2.0,90.35,190.5,Yes
0188-GWFLE,0.0,2.0,20.05,33.7,No
0218-QNVAS,0.0,71.0,100.55,7113.75,No
0231-LXVAP,0.0,1.0,75.9,75.9,Yes
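
The result above contains the feature columns even though only `customerID` was passed in: the features are looked up by key before the model is applied. A minimal plain-Python sketch of that flow follows. It is an illustration only, not the `score_batch` implementation; `score_batch_sketch`, the rows, and `stand_in_model` are hypothetical, and the stand-in rule is not the trained decision tree.

```python
def score_batch_sketch(key_rows, feature_table, feature_names, predict, key="customerID"):
    """Look up features for each key, apply predict, and append the prediction."""
    results = []
    for key_row in key_rows:
        features = feature_table[key_row[key]]
        # Pass features in the same order the model was trained on
        vector = [features[name] for name in feature_names]
        results.append({**key_row, **features, "prediction": predict(vector)})
    return results

def stand_in_model(vector):
    """Hypothetical rule, NOT the trained decision tree: short tenure predicts churn."""
    tenure = vector[1]
    return "Yes" if tenure < 6.0 else "No"

feature_names = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]

# Hypothetical feature table keyed by customerID
feature_table = {
    "C-001": {"SeniorCitizen": 1.0, "tenure": 3.0, "MonthlyCharges": 80.0, "TotalCharges": 240.0},
    "C-002": {"SeniorCitizen": 0.0, "tenure": 24.0, "MonthlyCharges": 50.0, "TotalCharges": 1200.0},
}

scored = score_batch_sketch(
    [{"customerID": "C-001"}, {"customerID": "C-002"}],
    feature_table,
    feature_names,
    stand_in_model,
)
for row in scored:
    print(row["customerID"], row["prediction"])
```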


## Task 3: Assess Liquid Clustering:

Evaluate the performance results of a specific optimization technique, Liquid Clustering. Follow the step-by-step instructions below:
+ **Step 1:** Create `batch_inference_liquid_clustering` table and import the following columns: `customerID`, `Churn`, `SeniorCitizen`, `tenure`, `MonthlyCharges`, `TotalCharges`, and `prediction`.
+ **Step 2:**  Begin by assessing Liquid Clustering, an optimization technique for improving performance by physically organizing data based on a specified clustering column.
+ **Step 3:**  Optimize the target table for Liquid Clustering.
+ **Step 4:** Specify the `CLUSTER BY` clause with the desired columns (e.g., (customerID, tenure)) to enable Liquid Clustering on the table.

In [0]:
%sql
CREATE OR REPLACE TABLE batch_inference_liquid_clustering(
  customerID string,
  Churn string,
  SeniorCitizen double,
  tenure double,
  MonthlyCharges double,
  TotalCharges double,
  prediction string
  )

In [0]:
%sql
OPTIMIZE batch_inference_liquid_clustering;
ALTER TABLE batch_inference_liquid_clustering
CLUSTER BY (customerID, tenure);
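
As far as the Databricks documentation describes it, `OPTIMIZE` is the command that triggers clustering of data already in a table, so one common ordering is to set the clustering keys first and run `OPTIMIZE` afterwards, once the table holds data. The sketch below only builds the SQL strings in that order; the helper `liquid_clustering_statements` is hypothetical, and running the statements (for example with `spark.sql`) is left to the notebook.

```python
def liquid_clustering_statements(table_name, cluster_columns):
    """Build SQL that sets the clustering keys first, then runs OPTIMIZE."""
    if not cluster_columns:
        raise ValueError("at least one clustering column is required")
    column_list = ", ".join(cluster_columns)
    return [
        f"ALTER TABLE {table_name} CLUSTER BY ({column_list})",
        f"OPTIMIZE {table_name}",
    ]

statements = liquid_clustering_statements(
    "batch_inference_liquid_clustering", ["customerID", "tenure"]
)
for statement in statements:
    # In a Databricks notebook, each string could be passed to spark.sql(statement)
    print(statement)
```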


## Conclusion

This lab gave you hands-on experience in batch deployment, covering model inference, Model Registry usage, and the impact of features like Liquid Clustering on performance. You should now have practical insight into deploying models at scale in a batch-oriented environment.


&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="blank">Apache Software Foundation</a>.<br/>
<br/><a href="https://databricks.com/privacy-policy" target="blank">Privacy Policy</a> | 
<a href="https://databricks.com/terms-of-use" target="blank">Terms of Use</a> | 
<a href="https://help.databricks.com/" target="blank">Support</a>