## Pipeline Deployment

In this demo, we show how to use a model as part of a data pipeline for inference. In the first section, we prepare the data and perform some basic feature engineering. Then we fit a model and register it in the model registry. These two steps are covered in other courses and are not the main focus of this demo. In the last section, which is the main focus, we create a Delta Live Tables (DLT) pipeline and use the registered model as part of it.

### Learning Objectives:

**By the end of this demo, you will be able to:**

- Describe steps for deploying a model within a pipeline.
- Develop a simple Delta Live Tables pipeline that performs batch inference in its final step.



## Requirements

Please review the following requirements before starting the lesson:

- To run this notebook, you need to use one of the following Databricks runtime(s): `{{supported_dbrs}}`

📛 **Prerequisites**:

- **Feature Engineering** and **Feature Store** are not the focus of this lesson. This course expects that you already know these topics. If not, you can check the *Data Preparation for Machine Learning* course.

- Model development with MLflow is not in the scope of this course. If you need to refresh your knowledge about model tracking and logging, you can check the *Machine Learning Model Development* course.


## Classroom Setup

Before starting the demo, run the provided classroom setup script. This script will define configuration variables necessary for the demo. Execute the following cell:


In [0]:
%run ../Includes/Classroom-Setup-01

# Note: you may need to restart the kernel using dbutils.library.restartPython() to use updated packages.

## Data Preparation

For this demonstration, we will utilize a fictional dataset from a Telecom Company, which includes customer information. This dataset encompasses **customer demographics**, including gender, as well as internet subscription details such as subscription plans and payment methods.

After loading the dataset, we will perform simple **data cleaning and feature selection**.

In the final step, we will split the dataset into **features** and **response** sets.


In [0]:
from pyspark.sql.functions import col

# Dataset path
dataset_p_telco = f"{DA.paths.datasets}/telco/telco-customer-churn.csv"

# Dataset specs
primary_key = "customerID"
response = "churn"
features = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]  # Keeping numerical only for simplicity and demo purposes

# Read dataset (and drop nan)
telco_df = spark.read.csv(dataset_p_telco, inferSchema=True, header=True, multiLine=True, escape='"')\
    .withColumn("TotalCharges", col("TotalCharges").cast("double"))\
    .na.drop("any")

# Separate features and ground-truth
features_df = telco_df.select(primary_key, *features)
response_df = telco_df.select(primary_key, response)

# Convert data to pandas DataFrames (the sklearn model is trained on these later)
X_train_pdf = features_df.drop(primary_key).toPandas()
Y_train_pdf = response_df.drop(primary_key).toPandas()

# Cast int32 feature columns to double
# (loop variable is named `column` to avoid shadowing pyspark's `col` imported above)
for column in X_train_pdf.select_dtypes("int32"):
    X_train_pdf[column] = X_train_pdf[column].astype("double")
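
The read step above casts `TotalCharges` to double before dropping incomplete rows. The order matters if the column holds values that cannot be parsed as numbers: a lenient cast turns them into nulls, and the drop then removes those rows. The following is a plain-Python analogy of that idea, not Spark code; the rows and the `to_double` helper are hypothetical, and whether the real dataset contains blank values is an assumption.

```python
from typing import Optional

def to_double(value: Optional[str]) -> Optional[float]:
    """Mimic a lenient cast: unparseable text becomes None instead of raising."""
    if value is None:
        return None
    try:
        return float(value.strip())
    except ValueError:
        return None

# Hypothetical rows; the real dataset is not reproduced here.
rows = [
    {"customerID": "A", "tenure": 1, "TotalCharges": "29.85"},
    {"customerID": "B", "tenure": 0, "TotalCharges": " "},
    {"customerID": "C", "tenure": 5, "TotalCharges": "108.15"},
]

# Step 1: cast the column; the blank value becomes None.
casted = [{**row, "TotalCharges": to_double(row["TotalCharges"])} for row in rows]

# Step 2: drop every row that has at least one missing value.
cleaned = [row for row in casted if all(v is not None for v in row.values())]

kept_ids = [row["customerID"] for row in cleaned]
print(kept_ids)
```

If the drop ran before the cast, the blank string would still count as a value and the row would survive until the cast.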



## Model Preparation

> **Note:** This section is not the main focus of this course. We are just repeating the model development and registration process here.


### Setup Model Registry with UC

Before we start model deployment, we need to fit and register a model. In this demo, **we will log models to Unity Catalog**, which means we first need to set the **MLflow Model Registry URI**.


In [0]:
import mlflow

# Point to UC model registry
mlflow.set_registry_uri("databricks-uc")
client = mlflow.MlflowClient()

def get_latest_model_version(model_name):
    """Helper function to get the latest model version as an integer"""
    model_version_infos = client.search_model_versions("name = '%s'" % model_name)
    # Cast to int so versions are compared numerically, not as text
    return max(int(model_version_info.version) for model_version_info in model_version_infos)
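
Model versions may be returned as strings, depending on the MLflow client and registry backend, so it is worth comparing them numerically when picking the latest one. The following is a minimal sketch in plain Python; the version values and the `latest_version` helper are hypothetical and do not call MLflow.

```python
def latest_version(versions):
    """Return the highest version, comparing numerically rather than as text."""
    return max(int(v) for v in versions)

# Hypothetical version numbers as they might arrive in string form.
versions_as_text = ["1", "2", "9", "10"]

# Text comparison is lexicographic, so "9" sorts after "10".
naive_latest = max(versions_as_text)

# Numeric comparison picks the version that was actually registered last.
numeric_latest = latest_version(versions_as_text)

print(naive_latest, numeric_latest)
```

The difference only shows up once a model has ten or more versions, which makes it easy to miss in a short demo.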


### Fit and Register a Model with UC


In [0]:
from sklearn.tree import DecisionTreeClassifier
from mlflow.models import infer_signature

# Use 3-level namespace for model name
model_name = f"{DA.catalog_name}.{DA.schema_name}.ml_model"

# model to use for classification
clf = DecisionTreeClassifier(max_depth=4, random_state=10)

with mlflow.start_run(run_name="Model-Deployment demo") as mlflow_run:

    # Enable automatic logging of input examples, metrics, and parameters.
    # The model itself is logged explicitly below, hence log_models=False.
    mlflow.sklearn.autolog(
        log_input_examples=True,
        log_models=False,
        log_post_training_metrics=True,
        silent=True
    )

    clf.fit(X_train_pdf, Y_train_pdf)

    # Log model and push to registry
    signature = infer_signature(X_train_pdf, Y_train_pdf)
    mlflow.sklearn.log_model(
        clf,
        artifact_path="decision_tree",
        signature=signature,
        registered_model_name=model_name
    )

# Set model alias
client.set_registered_model_alias(model_name, "DLT", get_latest_model_version(model_name))


## Configure Pipeline to Run Batch Inference

Now that our model is registered and ready, we can move on to the most important part: using the model for inference inside a pipeline.

**Note:** The DLT pipeline is already defined in the `3.1.b` notebook.  
**Note:** If you want to learn more about DLT, see `Data Engineering with Databricks (Data Pipeline with Delta Live Tables)`.


### Config Variables

While defining the DLT pipeline, you will need to use the following variables. Run the code block below first. Then, use the output in the next section while creating the pipeline.


In [0]:
print(f"mlpipeline.bronze_dataset_path: {dataset_p_telco}")
print(f"mlpipeline.model_name: {model_name}")
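
The two printed values are meant to be entered as key-value pairs in the pipeline's configuration, where the pipeline notebook later reads them with `spark.conf.get`. As a rough sketch, the corresponding fragment of the pipeline settings JSON could look like the output below. The path and model name are hypothetical placeholders, and the variable names are illustrative only; use the values printed by the cell above.

```python
import json

# Hypothetical placeholder values; replace with the output printed above.
example_dataset_path = "/path/to/telco-customer-churn.csv"
example_model_name = "my_catalog.my_schema.ml_model"

# Pipeline settings keep custom key-value pairs under "configuration".
pipeline_settings_fragment = {
    "configuration": {
        "mlpipeline.bronze_dataset_path": example_dataset_path,
        "mlpipeline.model_name": example_model_name,
    }
}

print(json.dumps(pipeline_settings_fragment, indent=2))
```

The keys must match the names passed to `spark.conf.get` in the pipeline notebook exactly, including the `mlpipeline.` prefix.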


Go to the Delta Live Tables section and create a pipeline manually from there: select a notebook and select the data. Then we move to the other notebook, which contains the following.

## Inference Pipeline

MLflow-trained models can be used in Delta Live Tables pipelines. MLflow models are treated as transformations in Databricks, meaning they act upon a Spark DataFrame input and return results as a Spark DataFrame. Because Delta Live Tables defines datasets against DataFrames, you can convert Apache Spark workloads that leverage MLflow to Delta Live Tables with just a few lines of code.


## Pipeline configs


In [0]:
bronze_dataset_path = spark.conf.get("mlpipeline.bronze_dataset_path")
model_name = spark.conf.get("mlpipeline.model_name")


## Inference configs


In [0]:
import mlflow

mlflow.set_registry_uri("databricks-uc")
model_uri = f"models:/{model_name}@DLT"
loaded_model_udf = mlflow.pyfunc.spark_udf(spark, model_uri=model_uri, result_type="string")

primary_key = "customerID"
features = ["SeniorCitizen", "tenure", "MonthlyCharges", "TotalCharges"]
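
The model URI above refers to the model by its `DLT` alias rather than by a fixed version number, so the pipeline picks up whichever version currently holds that alias. As a small illustration of the two URI forms, here is a sketch of a helper that builds either one. The `build_model_uri` function and the example model name are hypothetical and not part of MLflow.

```python
from typing import Optional

def build_model_uri(model_name: str,
                    alias: Optional[str] = None,
                    version: Optional[int] = None) -> str:
    """Build a registry model URI from exactly one of an alias or a version."""
    if (alias is None) == (version is None):
        raise ValueError("Provide exactly one of alias or version")
    if alias is not None:
        # Alias form: resolves to whichever version holds the alias right now.
        return f"models:/{model_name}@{alias}"
    # Version form: always resolves to the same pinned version.
    return f"models:/{model_name}/{version}"

# Hypothetical three-level model name.
example_name = "my_catalog.my_schema.ml_model"

alias_uri = build_model_uri(example_name, alias="DLT")
pinned_uri = build_model_uri(example_name, version=3)
print(alias_uri)
print(pinned_uri)
```

Using the alias means a newly registered version can be promoted by reassigning the alias, without editing the pipeline code.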


## DLT Inference code


In [0]:
import dlt
from pyspark.sql.functions import col, struct

@dlt.table(
  name="raw_inputs",
  comment="Raw inputs table",
  table_properties={
    "quality": "bronze"
  }
)
def raw_inputs():
  return spark.read.csv(bronze_dataset_path, inferSchema=True, header=True, multiLine=True, escape='"')

@dlt.table(
  name="features_input",
  comment="Features table",
  table_properties={
    "quality": "silver"
  }
)
def features_input():
  return (
    dlt.read("raw_inputs")
      .select(primary_key, *features)
      .withColumn("TotalCharges", col("TotalCharges").cast("double"))
      .na.drop(how="any")
  )

@dlt.table(
  name="model_predictions",
  comment="Inference table",
  table_properties={
    "quality": "gold"
  }
)
def model_predictions():
  return (
    dlt.read("features_input")
      .withColumn("prediction", loaded_model_udf(struct(features)))
  )
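
The three tables above form a chain: each `dlt.read` call names the table it depends on, and the pipeline runs the steps in dependency order. The following is a conceptual sketch of that ordering using Python's standard `graphlib` module (Python 3.9+). It only illustrates the idea and is not how DLT is implemented; the `dependencies` mapping is written by hand from the code above.

```python
from graphlib import TopologicalSorter

# Each table maps to the set of tables it reads from.
dependencies = {
    "raw_inputs": set(),                      # bronze: reads the CSV file
    "features_input": {"raw_inputs"},         # silver: selects and cleans features
    "model_predictions": {"features_input"},  # gold: applies the model UDF
}

# A topological order lists every table after the tables it depends on.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Because the order comes from the declared reads, the function definitions can appear in any order in the notebook.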
