# Real-time Deployment with Model Serving

In this demo, we focus on real-time deployment of machine learning models. Databricks Model Serving is serverless infrastructure for serving models in real time. It supports models built on both offline and online feature tables, and it performs automatic feature lookups against online tables with no additional endpoint configuration.

## Learning Objectives:

**By the end of this demo, you will be able to:**

- Understand the differences between **offline** and **online** feature tables for Databricks Model Serving.
- Understand how to serve multiple versions of a model simultaneously and set up **A/B testing** for real-time inferencing.
- Utilize **feature lookups** and a **feature function** for online tables for real-time inference.


## Requirements

Please review the following requirements before starting the lesson:

- To run this notebook, you need to use one of the following Databricks runtime(s): `16.2.x-cpu-ml-scala2.12`
- Online Tables must be enabled for the workspace.

## Classroom Setup

Before starting the demo, run the provided classroom setup script. This script will define configuration variables necessary for the demo. Execute the following cell:


In [0]:
%run ../Includes/Classroom-Setup-4.1


### Other Conventions:

Throughout this demo, we’ll refer to the object `DA`. This object, provided by Databricks Academy, contains variables such as your username, catalog name, schema name, working directory, and dataset locations. Run the code block below to view these details:


In [0]:
print(f"Username:       {DA.username}")
print(f"Catalog Name:   {DA.catalog_name}")
print(f"Schema Name:    {DA.schema_name}")
print(f"Working Directory: {DA.paths.working_dir}")
print(f"Dataset Location:  {DA.paths.datasets}")


### Offline vs. Online Feature Tables For Real-Time Inferencing

Let's take a moment to discuss the importance of feature tables with real-time model serving.

We make this distinction to demonstrate real-time model serving *with* and *without* feature lookups, since Model Serving on Databricks handles offline and online tables differently. For real-time serving, you can use **offline** tables *without* feature lookups, or **online** tables *with* feature lookups (in which case Databricks performs the lookup automatically). To use **offline** tables with feature lookups, use batch inference via the `score_batch` method of the Feature Engineering client.

> **Fundamentally**, a **feature table** is a **materialized Delta table with a primary key** that we want to use for model training. Feature lookups must be configured prior to training your model. For further reading and references of offline vs online feature tables, see the Appendix at the end of this demo.
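
To make the idea of an automatic feature lookup concrete, here is a minimal conceptual sketch in plain Python. It is not how Model Serving is implemented: a dictionary stands in for the online table, and the function name `build_model_input`, the sample keys, and the feature values are all hypothetical.

```python
# Conceptual sketch only: plain dictionaries stand in for the online table and
# the request payload. None of these names are Databricks APIs.

# Hypothetical "online table": feature rows keyed by the primary key.
online_features = {
    "0001-A": {"tenure": 12.0, "MonthlyCharges": 70.5},
    "0002-B": {"tenure": 3.0, "MonthlyCharges": 29.9},
}


def build_model_input(request_rows, feature_store, key="customerID"):
    """Join each request row with its stored features, as a lookup would."""
    enriched = []
    for row in request_rows:
        looked_up = feature_store.get(row[key], {})
        # In this sketch, values sent in the request override looked-up values.
        enriched.append({**looked_up, **row})
    return enriched


# With lookups, the caller sends only the primary key.
rows = build_model_input([{"customerID": "0001-A"}], online_features)
print(rows)
```

Without lookups (Part 1 of this demo), the caller must send every feature column in the request; with lookups (Part 2), the primary key is enough.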

---

### Part 1: Real-time Deployment With Offline Feature Tables

Here we consider a scenario where you have already gone through the development process (data preparation and model development) and you're ready to deploy a model with offline features. We will first look at deploying two models that were created as a part of the classroom setup: a champion model and a challenger model with aliases `champion` and `challenger`, respectively.

We will serve our two models using a 50/50 traffic split for A/B Testing. First, let's read in our data and explore its lineage.


### Step 1: Inspect the Offline Feature Table and Model Versions

For this demonstration, we will use a fictional dataset from a Telecom Company, which includes customer information. This dataset encompasses **customer demographics**, including internet subscription details such as subscription plans, monthly charges and payment methods.

As a part of the classroom setup for this course, a feature table was created called **features** that **did not** include feature lookups. This is the table we are reading in during the next step.

---

### Lineage Inspection

- Navigate to the catalog and schema used with this Vocareum environment (see the output from the previous cell).
- Find the table called `features` and model called `ml_model`.
  - Click on **Lineage**.
  - Click on **See lineage graph** and inspect it. This will show the footprint of how the catalog assets were made.

### Step 2: Read in Features and Response Variable from Feature Store

Here we will read in our dataset and split between features and response variables. We will show how this can be performed with the Databricks SDK using the Feature Engineering Client.

---

#### What's the difference between `fe.read_table()` and `spark.read.table()`?

Essentially, we use `fe.read_table()` whenever we are specifically working with feature tables stored within Feature Store and `spark.read.table()` for general-purpose reading.  
Note that `fe.read_table()` is part of the Databricks Feature Engineering API and integrates well with other Feature Store APIs like logging models  
(see *Part 2: Real-Time Deployment with Online Feature Tables*).

On the other hand, `spark.read.table()` is a broader Spark SQL method for reading data from any table within the Spark session.


In [0]:
from databricks.feature_engineering import FeatureEngineeringClient

# Initialize Feature Engineering Client
fe = FeatureEngineeringClient()

# Define primary key
primary_key = "customerID"

# Read in feature table
feature_table_name = f"{DA.catalog_name}.{DA.schema_name}.features"
X_train_df = fe.read_table(name=feature_table_name)
X_train_pdf = X_train_df.drop(primary_key).toPandas()

# Read in response table
response_table_name = f"{DA.catalog_name}.{DA.schema_name}.response"
Y_train_df = spark.read.table(response_table_name)
Y_train_pdf = Y_train_df.drop(primary_key).toPandas()


### Step 3: Real-time A/B Testing with Model Serving

Let's serve the two model versions registered during the classroom setup using Model Serving. Model Serving supports endpoint management via the UI and the API.

Below you will find instructions for using the UI, which is the simpler method compared to the API. **In this demo, we will use the API to configure and create the endpoint.**

**Both the UI and the API support querying created endpoints in real time.** We will use the API to query the endpoint with a sample of records.

---

> **What is A/B Testing?**  
> A/B testing is a method to compare two versions of a model or system by splitting user traffic and measuring performance metrics to determine which version delivers better results.
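
As a rough illustration of traffic splitting, the sketch below simulates weighted routing in plain Python. The route names and the `route_requests` helper are hypothetical, and the random draw is only a stand-in for whatever routing the serving layer actually performs.

```python
import random
from collections import Counter


def route_requests(n_requests, routes, seed=0):
    """Assign each request to a route, with probability proportional to its weight."""
    rng = random.Random(seed)
    names = list(routes)
    weights = [routes[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=n_requests))


# Hypothetical route names with a 50/50 split.
routes = {"ml_model-1": 50, "ml_model-2": 50}

small = route_requests(10, routes)      # few requests: the split can be lopsided
large = route_requests(10_000, routes)  # many requests: the split approaches 50/50
print(small)
print(large)
```

The split is a long-run proportion, not a guarantee per request, which is why a small number of requests can look far from 50/50.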


### Option 1: Serve model(s) using UI

After registering the (new version(s) of the) model to the model registry, follow the steps below to provision a serving endpoint via the UI.

1. In the left sidebar, click **Serving**.
2. To create a new serving endpoint, click **Create serving endpoint**.
   a. In the **Name** field, type a name for the endpoint.  
   b. Click in the **Entity** field. A dialog appears. Go to **My models**, and then select the catalog, schema, and model from the drop-down menus.  
   c. In the **Version** drop-down menu, select the version of the model to use.  
   d. Click **Confirm**.  
   e. In the **Compute Scale-out** drop-down, select Small, Medium, or Large. If you want to use GPU serving, select a GPU type from the **Compute type** drop-down menu.  
   f. _[OPTIONAL]_ To deploy another model (e.g. for A/B testing), click on **+Add Served Entity** and fill the above mentioned details.  
   g. Click **Create**. The endpoint page opens and the endpoint creation process starts.

See the Databricks documentation for details [AWS](https://docs.databricks.com) | [Azure](https://docs.databricks.com).


### Option 2: Serve Model(s) Using the Databricks Python SDK

#### Get Models to Serve

In order to serve the model, we will initialize the MLflow client with `MlflowClient` and the workspace client with `WorkspaceClient`. We will configure MLflow to point to Unity Catalog instead of the workspace registry with `mlflow.set_registry_uri("databricks-uc")`. The workspace client will be used to create the model serving endpoint.


In [0]:
import mlflow
from mlflow.tracking import MlflowClient
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import EndpointTag

# Point to UC model registry
mlflow.set_registry_uri("databricks-uc")

# Initialize MLflow client
client = MlflowClient()

# Initialize workspace client
w = WorkspaceClient()


Define variables that will be used for configuring the endpoint like `model_name`. The output from running the next cell will show version 1 of our model registered as the champion model and version 2 as being the challenger.


In [0]:
# Define model name
model_name = f"{DA.catalog_name}.{DA.schema_name}.ml_model"

# Parse model name from UC namespace
served_model_name = model_name.split(".")[-1]

# Define the endpoint name
endpoint_name = f"ML_AS_03_Demo4_{DA.unique_name('_')}"

# Get version of our model registered to UC as a part of the classroom setup
model_version_champion = client.get_model_version_by_alias(name=model_name, alias="Champion").version  # Get champion version
model_version_challenger = client.get_model_version_by_alias(name=model_name, alias="Challenger").version  # Get challenger version

print(f"Model version Champion: {model_version_champion}")
print(f"Model version Challenger: {model_version_challenger}")


### Configure

Define our model serving endpoint with `endpoint_config`. The configuration below deploys two versions of the same model (`model_version_champion` and `model_version_challenger`) and specifies how traffic is split between them during inference.


In [0]:
from databricks.sdk.service.serving import EndpointCoreConfigInput

endpoint_config_dict = {
    "served_models": [
        {
            "model_name": model_name,
            "model_version": model_version_champion,
            "scale_to_zero_enabled": True,
            "workload_size": "Small"
        },
        {
            "model_name": model_name,
            "model_version": model_version_challenger,
            "scale_to_zero_enabled": True,
            "workload_size": "Small"
        }
    ],
    "traffic_config": {
        "routes": [
            {
                "served_model_name": f"{served_model_name}-{model_version_champion}",
                "traffic_percentage": 50
            },
            {
                "served_model_name": f"{served_model_name}-{model_version_challenger}",
                "traffic_percentage": 50
            }
        ]
    },
    "auto_capture_config": {
        "catalog_name": DA.catalog_name,
        "schema_name": DA.schema_name,
        "table_name_prefix": "db_academy"
    }
}
endpoint_config = EndpointCoreConfigInput.from_dict(endpoint_config_dict)
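
Before creating an endpoint, it can help to sanity-check a configuration dictionary of this shape. The helper below is a hypothetical sketch, not part of the Databricks SDK. It assumes, as the configuration above implies, that each route refers to a served model named `<model>-<version>`, and it uses made-up model names.

```python
import copy


def validate_traffic_config(config):
    """Return a list of problems found in a served-models/traffic config dict."""
    problems = []
    # Assumed naming convention: last part of the model name, a dash, the version.
    served = {
        f"{m['model_name'].split('.')[-1]}-{m['model_version']}"
        for m in config["served_models"]
    }
    routes = config["traffic_config"]["routes"]
    total = sum(r["traffic_percentage"] for r in routes)
    if total != 100:
        problems.append(f"traffic percentages sum to {total}, expected 100")
    for r in routes:
        if r["served_model_name"] not in served:
            problems.append(f"route targets unknown served model {r['served_model_name']}")
    return problems


# Hypothetical configuration mirroring the structure used in this demo.
good_config = {
    "served_models": [
        {"model_name": "main.default.ml_model", "model_version": "1"},
        {"model_name": "main.default.ml_model", "model_version": "2"},
    ],
    "traffic_config": {
        "routes": [
            {"served_model_name": "ml_model-1", "traffic_percentage": 50},
            {"served_model_name": "ml_model-2", "traffic_percentage": 50},
        ]
    },
}

bad_config = copy.deepcopy(good_config)
bad_config["traffic_config"]["routes"][0]["traffic_percentage"] = 60

print(validate_traffic_config(good_config))
print(validate_traffic_config(bad_config))
```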



## Serve the endpoint

Use the configuration just created to serve the model.

> The time to create a model serving endpoint is roughly 10 minutes on average.


In [0]:
try:
    w.serving_endpoints.create_and_wait(
        name=endpoint_name,
        config=endpoint_config,
        tags=[EndpointTag.from_dict({"key": "db_academy", "value": "serve_fs_model_example"})]
    )
    # create_and_wait blocks until the endpoint is ready, so creation has finished here
    print(f"Created endpoint {endpoint_name} with model {model_name} versions {model_version_champion} & {model_version_challenger}")

except Exception as e:
    # Use str(e) so the check also works when e.args is empty or not a string
    if "already exists" in str(e):
        print(f"Endpoint with name {endpoint_name} already exists")
    else:
        raise


## Verify Endpoint Creation

Let's verify that the endpoint is created and ready to be used for inference using the `assert` command, which is used to check whether a given condition is true.


In [0]:
endpoint = w.serving_endpoints.wait_get_serving_endpoint_not_updating(endpoint_name)

assert endpoint.state.config_update.value == "NOT_UPDATING" and endpoint.state.ready.value == "READY", "Endpoint not ready or failed"


## Query the Endpoint and Visualize

Here we will use the training dataset to query our endpoint.

1. Define the dataset to sample from.  
2. Query by batch to highlight model-split traffic.


In [0]:
dataframe_records = X_train_pdf.iloc[:1000].to_dict(orient='records')  # 1k sample records


Here we will query in batches of 100 rows so we can see how traffic is split across batches (the full dataset has around 2000 rows; we sample the first 1000).

To help visualize the A/B testing output, create a visual using the UI (you only need to do this once; rerunning the cell will update the visualization).

1. After running the next cell, select the + sign on the second table and select **Visualization**.  
2. The default visual should represent the Yes/No split per model.

> Since the dataset we're working with is not very large, you might have to run the cell a few times to get a fairly close 50/50 split.


In [0]:
import pandas as pd

print("Inference results:")

batch_size = 100  # Number of records per batch
num_batches = (len(dataframe_records) + batch_size - 1) // batch_size  # Total number of batches

all_predictions = []
all_models = []

# Process data in batches
for i in range(num_batches):
    batch_records = dataframe_records[i * batch_size : (i + 1) * batch_size]  # Slice batch

    # Query the model serving endpoint
    query_response = w.serving_endpoints.query(name=endpoint_name, dataframe_records=batch_records)

    # Collect predictions and model served details
    all_predictions.extend(query_response.predictions)
    all_models.extend([query_response.served_model_name] * len(query_response.predictions))  # Duplicate model name per prediction

# Convert to DataFrame
results_df = pd.DataFrame({
    "prediction": all_predictions,
    "model_served": all_models
})

# Count occurrences of predictions
count_results = results_df['prediction'].value_counts().reset_index()
count_results.columns = ['prediction', 'count']

# Aggregate count of predictions per model
model_count_results = results_df.groupby(["model_served", "prediction"]).size().reset_index(name="count")

# Display results grouped by model and prediction type
display(model_count_results)
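
The batching and tallying logic used above can be tried offline. The sketch below replaces the endpoint call with a hypothetical stub (`fake_query`) that alternates between two made-up served model names and applies an arbitrary rule to produce predictions; only the batching and counting pattern is the point.

```python
from collections import Counter
from itertools import cycle


def make_batches(records, batch_size):
    """Split a list of records into consecutive batches of at most batch_size."""
    return [records[i:i + batch_size] for i in range(0, len(records), batch_size)]


def fake_query(batch, model_cycle):
    """Stand-in for an endpoint call: one served model answers the whole batch."""
    served_by = next(model_cycle)
    # Arbitrary rule used only to produce "Yes"/"No" labels for the sketch.
    predictions = ["Yes" if row["tenure"] < 6 else "No" for row in batch]
    return served_by, predictions


records = [{"tenure": t % 12} for t in range(250)]
batches = make_batches(records, 100)
models = cycle(["ml_model-1", "ml_model-2"])

tally = Counter()
for batch in batches:
    served_by, predictions = fake_query(batch, models)
    # Each prediction in a batch is attributed to the model that served it.
    tally.update((served_by, p) for p in predictions)

print([len(b) for b in batches])
print(tally)
```

Because each request is answered by a single served model, the traffic split is observed per batch rather than per row.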


## Part 2: Real-time Deployment with Online Feature Tables

In the previous section we deployed a model that utilized an offline feature table without utilizing feature lookups. In this section we will build a model that utilizes feature lookups with an online table and serve this model. Here are the steps we will take:

1. Create a feature function that computes the average monthly usage charges per customer.
2. Bundle the feature lookups and feature function into one feature-defining object called `features`.
3. Use the Databricks SDK to create the online feature table using the same feature table from Part 1.
4. Train an ML model using `features`. By creating a model using feature lookups, we will enable automatic feature lookups when deploying the model to a model serving endpoint. This requires no additional configuration with the online feature table.
5. Create a model serving endpoint.
6. Query the model serving endpoint.

---

## Step 1: Create a Feature Function

Here we will create a feature function that uses a Python UDF to create on-demand features.

### On-Demand Features

"On-demand" refers to features whose values are not known ahead of time, but are *calculated at the time of inference*. In this demo, we will calculate the **average monthly charges** on the fly. This is done by defining a `UDF` with SQL and registering it to Unity Catalog. **The function will be registered with the name** `monthly_charges_avrg` using the syntax `CREATE OR REPLACE FUNCTION`.


In [0]:
%sql
CREATE OR REPLACE FUNCTION monthly_charges_avrg (TotalCharges DOUBLE, tenure DOUBLE)
RETURNS DOUBLE
LANGUAGE PYTHON AS
$$
avrg = TotalCharges / tenure
return avrg
$$
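
For local experimentation, the same calculation can be written as an ordinary Python function. This is a sketch rather than the registered UDF: the name `monthly_charges_avg` is hypothetical, and the guard for a zero tenure is an added assumption that the SQL-registered function above does not include.

```python
def monthly_charges_avg(total_charges, tenure):
    """Average monthly charge; returns None when tenure is zero (assumed guard)."""
    if tenure == 0:
        # A brand-new customer has no completed months to average over.
        return None
    return total_charges / tenure


print(monthly_charges_avg(100.0, 4.0))
print(monthly_charges_avg(50.0, 0.0))
```

Whether a zero tenure should yield `None`, `0.0`, or an error is a modeling decision; the important part is that training and inference apply the same rule.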


### Step 2: Define Combined Features

Now that we have both a **feature table** and an **on-demand feature function**, we can combine them and pass them to model training. We bundle the feature lookup and the feature function into a single list called `features`, which is passed to `create_training_set` in Step 4. The Feature Engineering client uses this list to build the feature specification that is packaged with the model.

> Feature lookups and feature functions must be created prior to training the model.


In [0]:
features_for_online_name = f"{DA.catalog_name}.{DA.schema_name}.features"


In [0]:
from databricks.feature_engineering import FeatureLookup, FeatureFunction

fe = FeatureEngineeringClient()

features = [
    FeatureLookup(
        table_name=features_for_online_name,
        lookup_key=primary_key  # the variable defined in Part 1 ("customerID"), not the literal string
    ),
    FeatureFunction(
        udf_name="monthly_charges_avrg",
        output_name="m_charges_avrg",
        input_bindings={
            "TotalCharges": "TotalCharges",
            "tenure": "tenure"
        }
    ),
]


### Step 3: Create an Online Table

In this section, we will create an online table to serve the feature table for real-time inference. When using Model Serving to serve a model that was built using features from Databricks, the model automatically looks up and transforms features for inference requests.

> 🛈 Databricks Online Tables can be created and managed via the UI and the SDK. While we provided instructions for both of these methods, you can pick one option for creating the table.

---

#### OPTION 1: Create Online Table via the UI

You can create an online table from Catalog Explorer. The steps are described below. For more details, see the Databricks documentation ([AWS](https://docs.databricks.com) | [Azure](https://docs.databricks.com)).

In **Catalog Explorer**, navigate to the source table that you want to sync to an online table.

From the kebab menu, select **Create online table**.

- Use the selectors in the dialog to configure the online table.

  - **Name**: Name to use for the online table in Unity Catalog.
  - **Primary Key**: Column(s) in the source table to use as primary key(s) in the online table.
  - **Timeseries Key** *(Optional)*: Column in the source table to use as timeseries key. When specified, the online table includes only the row with the latest timeseries key value for each primary key.
  - **Sync mode**: Select `Snapshot` for Sync mode. Please refer to the documentation for more details about available options.
  - When you are done, click **Confirm**. The online table page appears.

- The new online table is created under the catalog, schema, and name specified in the creation dialog. In Catalog Explorer, the online table is indicated by an online table icon.
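
The effect of the optional timeseries key can be illustrated with a short plain-Python sketch. It is a conceptual illustration with hypothetical column names and data, not the sync mechanism Databricks uses.

```python
def latest_rows(rows, primary_key, timeseries_key):
    """Keep, for each primary key, only the row with the largest timeseries value."""
    latest = {}
    for row in rows:
        key = row[primary_key]
        if key not in latest or row[timeseries_key] > latest[key][timeseries_key]:
            latest[key] = row
    return latest


# Hypothetical source rows; ISO-formatted dates compare correctly as strings.
source_rows = [
    {"customerID": "A", "ts": "2024-01-01", "MonthlyCharges": 50.0},
    {"customerID": "A", "ts": "2024-03-01", "MonthlyCharges": 55.0},
    {"customerID": "B", "ts": "2024-02-01", "MonthlyCharges": 20.0},
]

online_view = latest_rows(source_rows, "customerID", "ts")
print(online_view)
```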


### OPTION 2: Use the Databricks SDK

The first option for creating an online table is using the UI. The alternative is using Databricks' [python-sdk](https://docs.databricks.com).

**🚨 Note:** The workspace must be enabled for using the SDK for creating and managing online tables. You can run the following code blocks if your workspace is enabled for this feature.

> The following code alters your existing feature table using change data feed (CDF). Essentially, this allows tracking of row-level changes between versions of our feature table (any Delta table in general).
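
To give a feel for what "row-level changes between versions" means, the sketch below compares two snapshots of a keyed table in plain Python. This is only an illustration with made-up data: Delta's change data feed records changes as they are written rather than by diffing snapshots, and the function name `row_level_changes` is hypothetical.

```python
def row_level_changes(old, new):
    """List (change_type, key) pairs describing how `new` differs from `old`."""
    changes = []
    for key in sorted(old.keys() | new.keys()):
        if key not in old:
            changes.append(("insert", key))
        elif key not in new:
            changes.append(("delete", key))
        elif old[key] != new[key]:
            changes.append(("update", key))
    return changes


# Two hypothetical versions of a feature table, keyed by customerID.
version_1 = {"A": {"tenure": 1}, "B": {"tenure": 5}, "C": {"tenure": 9}}
version_2 = {"A": {"tenure": 2}, "B": {"tenure": 5}, "D": {"tenure": 1}}

changes = row_level_changes(version_1, version_2)
print(changes)
```

Knowing which rows changed is what lets a downstream copy be refreshed incrementally instead of being rebuilt from scratch.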

---

Define the name for our online table.


In [0]:
from databricks.sdk.service.catalog import (
    OnlineTableSpec,
    OnlineTable,
    OnlineTableSpecTriggeredSchedulingPolicy
)

online_table_name = f"{DA.catalog_name}.{DA.schema_name}.online_features"


Drop the online table if it already exists and enable change data feed for the source feature table (`features_for_online_name`).


In [0]:
try:
    # Drop the online table if it already exists
    w.online_tables.delete(online_table_name)
except Exception:
    # Nothing to delete (for example, the online table does not exist yet)
    pass

# Enable CDF for the table
spark.sql(f"""ALTER TABLE {features_for_online_name} SET TBLPROPERTIES (delta.enableChangeDataFeed = true)""")


### Configure online table initialization


In [0]:
# Create an online table
spec = OnlineTableSpec(
    primary_key_columns=[primary_key],
    source_table_full_name=features_for_online_name,
    run_triggered=OnlineTableSpecTriggeredSchedulingPolicy.from_dict({'triggered': 'true'}),
    perform_full_copy=True
)

online_table = OnlineTable(
    name=online_table_name,
    spec=spec
)


### Create an online table based off `features`

> **Does this mean my original model versions are now using feature lookups?**
> No. Because feature lookups were *not* configured during model training, the model serving endpoint will not "know" to perform automatic feature lookup. This step simply syncs the feature table to an online table.


In [0]:
try:
    online_table_pipeline = w.online_tables.create_and_wait(table=online_table)
except Exception as e:
    if "already exists" in str(e):
        pass
    else:
        raise e

print(w.online_tables.get(online_table_name))


### Step 4: Fit and Log the Model with Online Feature Table

Next, we will use the Feature Engineering client to create our training set. Its `feature_lookups` parameter takes the bundled `features` object defined in Step 2.

The function `fit_and_register_model` is used in the cell below. This function is created as a part of the classroom setup, but we provide the code here for completeness.

```python
def fit_and_register_model(
    feature_df,
    response_df,
    model_name_,
    random_state_,
    model_alias=None,
    training_set_spec_=None,
    is_online=False
):
    """Train and register a Decision Tree model."""
    clf = DecisionTreeClassifier(random_state=random_state_)
    if is_online:
        X = feature_df  # already-materialized features (a pandas DataFrame in this demo)
        y = response_df  # already-materialized labels (a pandas DataFrame in this demo)
    else:
        feature_pdf = feature_df.df.toPandas() # instance of an mlflow.data.Dataset
        response_pdf = response_df.df.toPandas() # instance of an mlflow.data.Dataset
        dataset = feature_pdf.merge(response_pdf, on="customerID", how="inner")
        # Prepare X and y
        X = dataset.drop(columns=["customerID", "Churn"])  # Drop unnecessary columns
        y = dataset["Churn"]

    with mlflow.start_run(run_name=f"Train_DecisionTree_{random_state_}"):
        mlflow.sklearn.autolog(
            log_input_examples=True,
            log_models=False,
            log_post_training_metrics=True,
            silent=True
        )

        clf.fit(X, y) # Fit the model

        # Log model
        if is_online:
            try:
                output_schema = _infer_schema(y)
            except Exception as e:
                warnings.warn(f"Could not infer model output schema: {e}")
                output_schema = None

            fe = FeatureEngineeringClient()
            # Log the original dataset that supports mlflow logging
            fe.log_model(
                model=clf,
                artifact_path="decision_tree",
                flavor=mlflow.sklearn,
                training_set=training_set_spec_,
                output_schema=output_schema,
                registered_model_name=model_name_
            )
        else:
            # Log the original dataset that supports mlflow logging
            mlflow.log_input(feature_df, "training_features")
            mlflow.log_input(response_df, "training_responses")
            input_example = X.iloc[[0]]
            mlflow.sklearn.log_model(
                sk_model=clf,
                artifact_path="ml_model",
                input_example=input_example,
                registered_model_name=model_name_,
            )

        # Assign alias if provided
        if model_alias:
            time.sleep(8)  # Shorter wait time before updating alias
            latest_version = get_latest_model_version(model_name_)
            client.set_registered_model_alias(model_name_, model_alias, latest_version)

    return clf
```



In [0]:
training_set_spec = fe.create_training_set(
    df=Y_train_df,  # response_df
    label="Churn",  # response
    feature_lookups=features,
    exclude_columns=[primary_key]
)

# Load training dataframe based on defined feature-lookup specification
training_df = training_set_spec.load_df()

# Convert data to pandas dataframes
X_train_pdf2 = training_df.drop("Churn").toPandas()
Y_train_pdf2 = training_df.select("Churn").toPandas()

fit_and_register_model(
    X_train_pdf2,
    Y_train_pdf2,
    model_name,
    20,
    model_alias="Online",
    training_set_spec_=training_set_spec,
    is_online=True
)


### Inspect the Lineage

At this point, you can navigate to the registered model within Unity Catalog and inspect the alias and lineage.

> The alias was configured using the MLflow client while the lineage was generated using the Databricks SDK (`FeatureEngineeringClient()`).


## Step 5: Deploy the Model with Online Features

Now that we have a model registered to Unity Catalog, we can deploy the model with Mosaic AI Model Serving and use the online table at the time of inference.


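### Configure the Endpoint

In [0]:
# Look up the model version registered with the "Online" alias in Step 4
fs_model_version = client.get_model_version_by_alias(name=model_name, alias="Online").version

# Endpoint name for the online-features model
# (this exact name is an assumed choice; any unique endpoint name works)
fs_endpoint_name_online = f"ML_AS_03_Demo4_online_{DA.unique_name('_')}"

# Configure the endpoint
fs_endpoint_config_dict = {
    "served_models": [
        {
            "model_name": model_name,
            "model_version": fs_model_version,
            "scale_to_zero_enabled": True,
            "workload_size": "Small"
        }
    ]
}

fs_endpoint_config = EndpointCoreConfigInput.from_dict(fs_endpoint_config_dict)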


### Serve the Endpoint

In [0]:
# Serve the endpoint
try:
    w.serving_endpoints.create_and_wait(
        name=fs_endpoint_name_online,
        config=fs_endpoint_config,
        tags=[EndpointTag.from_dict({"key": "db_academy", "value": "serve_fs_model_example"})]
    )
    # create_and_wait blocks until the endpoint is ready, so creation has finished here
    print(f"Created endpoint {fs_endpoint_name_online} with model {model_name} version {fs_model_version}")

except Exception as e:
    # Use str(e) so the check also works when e.args is empty or not a string
    if "already exists" in str(e):
        print(f"Endpoint with name {fs_endpoint_name_online} already exists")
    else:
        raise


### Step 6: Query the Model Serving Endpoint

We'll now query the served model using the Databricks SDK like we showed in Part 1 with the offline features table.


In [0]:
dataframe_records_lookups_only = X_train_df.select('customerID') \
    .limit(1000) \
    .toPandas() \
    .to_dict(orient='records')


In [0]:
import pandas as pd
from collections import Counter

print("FS Inference results:")
query_response = w.serving_endpoints.query(
    name=fs_endpoint_name_online,
    dataframe_records=dataframe_records_lookups_only
)

# Count occurrences of "Yes" and "No" in predictions from list query_response.predictions
prediction_counts = Counter(query_response.predictions)

# Convert counts to a Pandas DataFrame
df_counts = pd.DataFrame.from_dict(prediction_counts, orient='index', columns=['Count']).reset_index()
df_counts.rename(columns={'index': 'Prediction'}, inplace=True)

# Display the DataFrame
display(df_counts)


## Conclusion

This demonstration showed how to deploy and serve machine learning models in real time using Databricks Model Serving. It covered the differences between offline and online feature tables, configuring a model serving endpoint, and leveraging feature lookups for real-time inference. It also explored on-demand feature computation and A/B testing with real-time model serving.


## Appendix

Below is some additional information regarding offline and online feature tables.

### More on Offline and Online Tables with Real-Time Model Serving

#### Offline

- You can use an existing Delta table in Unity Catalog that includes a primary key constraint as a feature table. If the table does not have a primary key defined, you must update the table using ALTER TABLE DDL statements to add the constraint. See *Use an existing Delta table in Unity Catalog as a feature table*.
- Any streaming table or materialized view in Unity Catalog with a primary key can be a feature table in Unity Catalog, and you can use the Features UI and API with the table.
- You can update a feature table in Unity Catalog by adding new features or by modifying specific rows based on the primary key.

**Additional Reading**: [Working with feature tables in UC](https://www.databricks.com/)

#### Online

- When a scoring request comes in to the model, Model Serving automatically retrieves the published feature values needed by the model. In this way, the most recent feature values are always used for predictions.
- You can create a Python UDF in a notebook or in Databricks SQL.
- When a Python UDF depends on the result of a `FeatureLookup`, the value returned if the requested lookup key is not found depends on the environment. When using `score_batch`, the value returned is `None`. When using online serving, the value returned is `float("nan")`.
- Models packaged with feature metadata can be registered to Unity Catalog. The feature tables used to create the model must be stored in Unity Catalog.
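
Because a missing lookup key surfaces as `None` in one environment and as `float("nan")` in the other, an on-demand function may want to treat both the same way. The sketch below shows one way to do that in plain Python. The helper names are hypothetical, and returning `float("nan")` for missing inputs is an assumed convention, not documented Databricks behavior.

```python
import math


def is_missing(value):
    """True for None and for a float NaN, the two 'not found' markers described above."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def safe_monthly_avg(total_charges, tenure):
    """Average monthly charge that tolerates missing inputs and zero tenure."""
    if is_missing(total_charges) or is_missing(tenure) or tenure == 0:
        return float("nan")  # assumed convention for "cannot compute"
    return total_charges / tenure


print(safe_monthly_avg(60.0, 12.0))
print(safe_monthly_avg(None, 12.0))
print(safe_monthly_avg(float("nan"), 12.0))
```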

**Additional Reading**: [Use features in online workflows](https://www.databricks.com/), [Compute features on demand](https://www.databricks.com/)

### Feature Serving – [Feature Serving Endpoints](https://www.databricks.com/)

When you use Mosaic AI Model Serving to serve a model that was built using features from Databricks, the model automatically looks up and transforms features for inference requests. With Databricks Feature Serving, you can serve structured data for retrieval augmented generation (RAG) applications, as well as features that are required for other applications, such as models served outside of Databricks or any other application that requires features based on data in Unity Catalog.

Databricks Feature Serving provides a single interface that serves pre-materialized and on-demand features. It also includes the following benefits:

- **Simplicity**: Databricks handles the infrastructure. With a single API call, Databricks creates a production-ready serving environment.
- **High availability and scalability**: Feature Serving endpoints automatically scale up and down to adjust to the volume of serving requests.
- **Security**: Endpoints are deployed in a secure network boundary and use dedicated compute that terminates when the endpoint is deleted or scaled to zero.

