# **Machine Learning Model Deployment and Management: A Hands-on Tutorial using MLflow**
MLflow is an open-source platform for managing the end-to-end machine learning lifecycle. It provides tools for tracking experiments, packaging code into reproducible runs, and deploying models. In this tutorial, we will explore the basic functionality of MLflow for experiment tracking, model registration, and model serving.

**PLEASE READ THE README.md BEFORE RUNNING THIS NOTEBOOK**

## **Imports**

Main packages used in the notebook

In [None]:
# Import a demo dataset from sklearn
from sklearn.datasets import fetch_california_housing

# MLFlow related imports, loads the main module and the package for sklearn
import mlflow
import mlflow.sklearn

# For finding the best model
from mlflow.tracking.client import MlflowClient
from mlflow.entities import ViewType

# NumPy is used for array manipulation and random sampling
import numpy as np

# ML related imports
from sklearn.model_selection import train_test_split

# Model imports
from sklearn.ensemble import RandomForestRegressor

# For type annotations
from sklearn.base import BaseEstimator


## **Starting the MLflow UI** 

**Manual Process: You must complete this step manually.**

If you are on a Windows PC:
- And if you installed Anaconda, start the Anaconda Command Prompt, then **browse to the folder that contains this Jupyter notebook** and execute the following code: `mlflow ui`
- If you did not install Anaconda but you are running Python on your own, then start a terminal window, then **browse to the folder that contains this Jupyter notebook** and execute the following: `python -m mlflow ui`

If you are on a Mac: start your terminal, then **browse to the folder that contains this Jupyter notebook** and execute the following code: `mlflow ui`

When this command is run in the active terminal window, **it will freeze the session** and will not stop until a stop (Ctrl+C) command is executed (or the terminal window is closed). **DO NOT CLOSE THIS TERMINAL WINDOW UNTIL THE END OF THE TUTORIAL**

**Manual Process: To access the MLflow UI, open a browser window and go to the following link:** http://localhost:5000

## **0. Data Preparation**

### Loading the Data from sklearn Datasets

The dataset chosen is the California Housing price dataset, originally provided in the StatLib repository:

https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

#### Data Set Characteristics:

In [None]:
print(fetch_california_housing()["DESCR"])

In [None]:
# Load the housing dataset
california_housing_df = fetch_california_housing()

# Separate features (X) and target (y)
X = california_housing_df.data
y = california_housing_df.target

In [None]:
# Limit the size of the dataset for faster execution
# Randomly select 2000 samples from the dataset
np.random.seed(42)
indices = np.random.choice(X.shape[0], size=2000, replace=False)

X_cut = X[indices]
y_cut = y[indices]

### Performing a train-test split

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X_cut, y_cut, random_state=42)
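
As a quick check on what this split produces: `train_test_split` is called here without `test_size`, so scikit-learn falls back to its default of 25% for the test set. The sketch below reproduces that arithmetic in plain Python. The helper name `split_sizes` is hypothetical and only mirrors the rounding applied for a fractional test size (the test set is rounded up, the training set gets the remainder).

```python
import math


def split_sizes(n_samples: int, test_size: float = 0.25) -> tuple[int, int]:
    """Return (n_train, n_test) for a fractional test size."""
    n_test = math.ceil(test_size * n_samples)  # test set is rounded up
    n_train = n_samples - n_test               # training set gets the remainder
    return n_train, n_test


# The tutorial keeps 2000 samples before splitting
n_train, n_test = split_sizes(2000)
print(n_train, n_test)  # 1500 500
```

With the 2000 sampled rows, this gives 1500 training rows and 500 test rows.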

## **1. Logging experiments**

There are several ways to log and compare runs in mlflow. This tutorial will cover the most common options with an example.

### **1.1 Creating an experiment**

_Experiments_ in MLflow allow for efficient comparison between runs.

Use **_mlflow.create_experiment()_** to define a new experiment with a name and as many tags as needed. This can also be done directly from the UI.

An experiment is then made the active one with **_mlflow.set_experiment()_**, passing a string with the name of the experiment. Subsequent runs are logged under it.

In [None]:
# Set the experiment name
experiment_name = "Example Experiment" # It can contain whitespaces or special characters, but it will make code commands harder to perform

# Provide an Experiment description that will appear in the UI
experiment_description = "This is the example of a tag, in this case the description."
experiment_tags = {
    "mlflow.note.content": experiment_description,
}

# Check if the experiment already exists; if not, create it
experiment = mlflow.get_experiment_by_name(experiment_name)
if experiment is None:
    mlflow.create_experiment(name=experiment_name, tags=experiment_tags)

### **1.2 Automatic Logging**

MLflow can log metrics, parameters, and models automatically, without the need for explicit log statements. All that is needed is to call the appropriate `autolog()` method before the training code.

For some libraries it **only logs the train metrics**. Make sure to check the documentation to avoid logging the wrong metrics:

https://mlflow.org/docs/latest/tracking/autolog.html

In [None]:
# Set the experiment to the one that was just created
mlflow.set_experiment(experiment_name=experiment_name)

# Enable automatic logging of scikit-learn models with MLflow
mlflow.sklearn.autolog()

# Create an instance of the model
rfr = RandomForestRegressor(random_state=42)

# Fit the train data, this will be automatically logged with autolog, including TRAIN metrics
rfr.fit(X_train, y_train)

mlflow.end_run() # This lets MLflow know it must not log anything else from this point forward


The results of this run can be seen in the UI:

http://localhost:5000

Users can click on the experiment name, then click on the model run. On the model's "Overview" page, the model's parameters can be observed. Clicking on the "Model metrics" tab shows the training metrics captured for the model. 

Alternatively, users can bring model runs back to this notebook and organize them in a pandas DataFrame (see below).

In [None]:
current_experiment = dict(mlflow.get_experiment_by_name(experiment_name))

experiment_id = current_experiment["experiment_id"] # Extract the experiment ID

search_result_df = mlflow.search_runs(
    experiment_ids=[experiment_id], # Define the scope of the search (a list of experiment IDs)
    run_view_type=ViewType.ALL, # Select the type of run
    output_format="pandas", # Select the output format
)

search_result_df

### **1.3 Manual Logging**

In most situations, it will be necessary to track metrics that are not registered by default. In this case, the MLflow API has an extensive set of methods to customize what is logged (see https://mlflow.org/docs/latest/tracking/tracking-api.html#tracking-logging-functions).

Some of the basic functions are:

**_mlflow.start_run()_**: Starts the run and designates it as the current run for logging metrics and parameters.

**_mlflow.log_metric()_**: Logs a specific custom metric. It is stored as a single key-value pair, where the key is the metric name and the value is the actual value of the metric.

**_mlflow.end_run()_**: As seen before, ends the current run.

### Create a new experiment

While we could still use our Example Experiment, it wouldn't provide much for comparison, since the `mlflow.sklearn.autolog()` method only registers TRAIN metrics by default.

In [None]:
# To begin, we will create a separate experiment for custom logging.
new_experiment_name = "California Housing" 

# Provide an Experiment description that will appear in the UI
new_experiment_description = (
    "This experiment tests the performance of a regression model on the California Housing dataset."
)
new_experiment_tags = {
    "mlflow.note.content": new_experiment_description,
}

# Check if the experiment already exists; if not, create it
experiment = mlflow.get_experiment_by_name(new_experiment_name)
if experiment is None:
    mlflow.create_experiment(name=new_experiment_name, tags=new_experiment_tags)

### Manually log "mean squared error"

In [None]:
# Import the required methods for calculating metrics
from sklearn.metrics import mean_squared_error

# Set the experiment to the one that was just created
mlflow.set_experiment(experiment_name=new_experiment_name)

# Create an instance of the model
model = RandomForestRegressor(n_estimators=1, max_depth=1, max_features=1, random_state=42)

# Create a custom name for the run
test_run_name = "Manual_Run_Test"

# The run must stay open while we log, so MLflow knows where to record; the with statement handles that
with mlflow.start_run(run_name=test_run_name):
    # Train the model, and calculate any variables required for the metrics
    model.fit(X_train, y_train)
    train_y_pred = model.predict(X_train)
    test_y_pred = model.predict(X_test)
    # Calculate and log the metrics
    train_mse = mean_squared_error(y_train, train_y_pred)
    mlflow.log_metric("train_mse", train_mse)
    test_mse = mean_squared_error(y_test, test_y_pred)
    mlflow.log_metric("test_mse", test_mse)

# The with block already ended the run; this explicit call is a harmless safeguard
mlflow.end_run()
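
For reference, the mean squared error logged above is the average of the squared differences between true and predicted values: MSE = (1/n) Σ (y_i − ŷ_i)². The plain-Python sketch below computes it on a tiny made-up example. The helper `mse` is hypothetical and only illustrates the formula; the tutorial itself uses scikit-learn's `mean_squared_error`.

```python
def mse(y_true: list[float], y_pred: list[float]) -> float:
    """Mean of the squared differences between true and predicted values."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared_errors) / len(squared_errors)


# Made-up values: the individual errors are 0.5, 0.0 and -1.0,
# so the result is (0.25 + 0.0 + 1.0) / 3
example_mse = mse([3.0, 2.0, 4.0], [2.5, 2.0, 5.0])
print(example_mse)
```

Because errors are squared, a lower value is better and large individual errors weigh heavily, which is why runs are later ordered by `test_mse` in ascending order.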

The result of this custom run can also be seen in the UI or by using the `search_runs` method with the new experiment ID (as shown below).

In [None]:
current_experiment = dict(mlflow.get_experiment_by_name(new_experiment_name))

experiment_id = current_experiment["experiment_id"] # Extract the experiment ID

search_result_df = mlflow.search_runs(
    experiment_ids=[experiment_id], # Define the scope of the search (a list of experiment IDs)
    run_view_type=ViewType.ALL, # Select the type of run
    output_format="pandas", # Select the output format
)

search_result_df

### Custom Function to Capture Manual Logs

The process of logging these metrics and performing these runs can be encapsulated in a function, such as the one shown here:

In [None]:
def perform_experiment_run(
    model: BaseEstimator,
    run_name: str,
    X_train: np.ndarray,
    X_test: np.ndarray,
    y_train: np.ndarray,
    y_test: np.ndarray
) -> str:
    """
    Train a machine learning model, evaluate its mean squared error (MSE) on both training and testing data,
    and log these performance metrics using MLflow under a run with the given name.

    This function takes a scikit-learn compatible model, fits it on the provided training data, evaluates it on both the training and testing datasets,
    logs the train and test MSE using MLflow, and prints the ID of the run. It does not register the model; the returned run ID can be used to register it afterwards.

    Parameters:
    - model: BaseEstimator
        An instance of a scikit-learn compatible estimator (model) to be trained and evaluated.
    - run_name: str
        A string representing the name of the MLflow run.
    - X_train: np.ndarray
        A numpy array containing the features of the training data.
    - X_test: np.ndarray
        A numpy array containing the features of the test data.
    - y_train: np.ndarray
        A numpy array containing the true values of the training data.
    - y_test: np.ndarray
        A numpy array containing the true values of the test data.

    Returns:
    - run_id: The ID of the MLflow run, in case the model is to be registered
    """
    with mlflow.start_run(run_name=run_name):
        model.fit(X_train, y_train)
        train_y_pred = model.predict(X_train)
        test_y_pred = model.predict(X_test)
        train_mse = mean_squared_error(y_train, train_y_pred)
        mlflow.log_metric("train_mse", train_mse)
        test_mse = mean_squared_error(y_test, test_y_pred)
        mlflow.log_metric("test_mse", test_mse)
        run_id = mlflow.active_run().info.run_id
        print(f"Model run: {run_id}")

    # The with block already ended the run; this explicit call is a harmless safeguard
    mlflow.end_run()
    return run_id # The ID is used to programmatically register a model; while it can be extracted from the search_runs method, it's much simpler to just return it here

To use it, replace the manual run code with a call to the function:

In [None]:
# Set the experiment to the one that was just created
mlflow.set_experiment(experiment_name=new_experiment_name)

# Create an instance of the model
rfr = RandomForestRegressor(n_estimators=1, max_depth=2, max_features=2, random_state=42)

# Create a custom name for the run
run_name = "RFR_California"

model_run_id = perform_experiment_run(model=rfr, run_name=run_name, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test) 

## **2. Registering a model**
Once a model with the desired performance is found, it can be registered in MLflow for future use. While this can be done during an MLflow run by passing `registered_model_name` to `mlflow.sklearn.log_model`, registering a model after logging requires the ID of that particular run, which is passed as part of the URI that locates the model (here, `model` is the artifact path under which autologging saved the model), as shown in the next cell:

In [None]:
model_uri = f"runs:/{model_run_id}/model"
model_name = "california-random-forest-r" # It's important to give the model a name that is shell appropriate (no spaces, no reserved characters)
mlflow.register_model(model_uri, model_name)

Alternatively, a model can be manually promoted from the UI.

## **3. Transitioning models between stages**

Once a good run with a specific model is identified, it is common to promote the model from a development phase through several stages, all the way up to Production. While models were traditionally moved between stages, more recent versions of MLflow replace stages with aliases. A common convention is to call the best model Champion and any candidate replacement Challenger. To do so, the following method can be used:

**_{client}.set_registered_model_alias()_**

Where the main parameters are the name of the registered model, the alias, and the version of the model.

### Make a model the Champion

In [None]:
client = MlflowClient() # Create an instance of the client, that allows to programmatically perform the same operations as in the UI
client.set_registered_model_alias(model_name, "champion", 1)
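
Hard-coding the version number (`1` here) works on a fresh registry, but the number changes if the notebook is re-run without clearing earlier registrations. A possible alternative, sketched below, looks the version up first. The helper `alias_latest_version` is hypothetical; it assumes `client` is an `MlflowClient` and uses its `search_model_versions` and `set_registered_model_alias` methods.

```python
def alias_latest_version(client, model_name: str, alias: str) -> str:
    """Point `alias` at the highest-numbered version of `model_name`."""
    versions = client.search_model_versions(f"name='{model_name}'")
    if not versions:
        raise ValueError(f"No versions registered for {model_name!r}")
    # Version numbers may come back as strings, so compare them as integers
    latest = max(versions, key=lambda mv: int(mv.version))
    client.set_registered_model_alias(model_name, alias, str(latest.version))
    return str(latest.version)
```

With the objects from this notebook, the call would be `alias_latest_version(client, model_name, "champion")`. Note that this always picks the newest version, which is not necessarily the best one.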

### **3.1 Building Alternative Models**

The Random Forest model above serves as a baseline, but some hyperparameter tuning might yield an even better one.

To this end, a Grid Search Cross Validation or another form of comparison is usually performed over some of the most common hyperparameters. There is one caveat when combining it with MLflow: the custom metrics logged in this tutorial cannot be attached to the individual fits inside `GridSearchCV`, so **the search would rely on autolog(), which does not record our custom TEST metrics**. This is why we disable autologging, find the best model, then record its custom metrics.

Define the parameter grid:

In [None]:
# A basic Grid Search Cross Validation is enough to find an improved model from the base one
from sklearn.model_selection import GridSearchCV

# Define the hyperparameters and their possible values
param_grid = {
    "n_estimators": [25, 50, 100, 150],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [3, 6, 9],
    "max_leaf_nodes": [3, 6, 9],
}
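
Before running the search, it helps to estimate how much work this grid implies. The sketch below counts the parameter combinations and the resulting number of model fits, assuming `GridSearchCV` is used with its defaults of 5-fold cross validation and a final refit of the best model (as in the cell that runs the search).

```python
import math

# Same grid as in the tutorial
param_grid = {
    "n_estimators": [25, 50, 100, 150],
    "max_features": ["sqrt", "log2", None],
    "max_depth": [3, 6, 9],
    "max_leaf_nodes": [3, 6, 9],
}

# Every value of each hyperparameter is combined with every value of the others
n_combinations = math.prod(len(values) for values in param_grid.values())

cv_folds = 5  # assumed: GridSearchCV's default when `cv` is not given
n_fits = n_combinations * cv_folds + 1  # +1 for the final refit on all training data

print(n_combinations, n_fits)  # 108 541
```

The 4 × 3 × 3 × 3 = 108 combinations lead to 541 fits in total, which is one reason the dataset was cut down to 2000 samples earlier.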

Temporarily disable autologging

In [None]:
# Temporarily disable autologging
mlflow.sklearn.autolog(disable=True)

Run the grid search

In [None]:
# Grid Search will perform runs with every possible combination
grid_search = GridSearchCV(RandomForestRegressor(random_state=42), param_grid=param_grid) # Fixed seed, so the search is reproducible
grid_search.fit(X_train, y_train)

Enable autologging

In [None]:
# Enable autologging
mlflow.sklearn.autolog()

Extract the best model

In [None]:
# Extract the best model from the grid search
best_gscv_model = grid_search.best_estimator_

# Print the best hyperparameters for the model
print(best_gscv_model)

In order to log custom metrics, the best model must be run manually:

In [None]:
# Set the experiment to the one that was just created
mlflow.set_experiment(experiment_name=new_experiment_name)

# Create a custom name for the run
run_name = "Best_GSCV_Run"

model_run_id = perform_experiment_run(model=best_gscv_model, run_name=run_name, X_train=X_train, X_test=X_test, y_train=y_train, y_test=y_test) 

### **3.2 Identifying the best run**

After all the runs, the best model can be identified by defining the filters and conditions that qualify it as such. Usually, ordering by the metric (or metrics) of interest is enough.

Before that, for ease of use and compatibility, it's best to switch from the experiment name to the experiment ID. This is easy to achieve, as shown in the next cell.

In [None]:
current_experiment = dict(mlflow.get_experiment_by_name(new_experiment_name))

experiment_id = current_experiment["experiment_id"] # Extract the experiment ID

In [None]:
best_model_run = mlflow.search_runs(
    experiment_ids=[experiment_id], # Define the scope of the search (a list of experiment IDs)
    filter_string="", # Add specific filtering conditions
    run_view_type=ViewType.ALL, # Select the type of run
    max_results=1, # Filter the number of results desired, in this case since it's the best model, only one is required
    order_by=["metrics.test_mse ASC"], # Specify an order by condition
    output_format="pandas", # Select the output format
)

In [None]:
best_model_run

In [None]:
best_model_run_id = best_model_run["run_id"][0]

In [None]:
best_model_run_id

### **3.3 Registering the best run**

In [None]:
best_model_uri = f"runs:/{best_model_run_id}/model"
mlflow.register_model(best_model_uri, model_name)

We then alias this new version as a Challenger

In [None]:
client.set_registered_model_alias(model_name, "challenger", 2)

## **4. Using registered models**

Once a model is registered in MLflow, it can be easily loaded for use in other projects via its URI, which references either the version or the alias:

In [None]:
import mlflow.pyfunc

champion_model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}@champion") #Using the alias
challenger_model = mlflow.pyfunc.load_model(model_uri=f"models:/{model_name}@challenger") #Using the alias
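
The cell above loads both models by alias. Loading by version uses a slightly different URI shape, `models:/{name}/{version}`. The sketch below builds both forms; the helper `registered_model_uri` is hypothetical and only formats strings, so it makes no calls to MLflow.

```python
from typing import Optional


def registered_model_uri(
    name: str, *, version: Optional[int] = None, alias: Optional[str] = None
) -> str:
    """Build a registry URI from a model name and either a version or an alias."""
    if (version is None) == (alias is None):
        raise ValueError("Provide exactly one of `version` or `alias`")
    if alias is not None:
        return f"models:/{name}@{alias}"  # alias form
    return f"models:/{name}/{version}"    # version form


print(registered_model_uri("california-random-forest-r", alias="champion"))
# models:/california-random-forest-r@champion
print(registered_model_uri("california-random-forest-r", version=1))
# models:/california-random-forest-r/1
```

Either string can be passed as `model_uri` to `mlflow.pyfunc.load_model`. The alias form keeps working after a new version is promoted, while the version form always pins the same model.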

In [None]:
# Generate a random number to select a random observation from the test set
random_number = np.random.randint(0, len(X_test))

# Select the random observation from the test set
random_observation = X_test[random_number].reshape(1, -1)

In [None]:
champion_model.predict(random_observation)

In [None]:
challenger_model.predict(random_observation)

In [None]:
y_test[random_number]

##  **5. Model Serving**

Registered models can be served using the MLflow command-line interface. To do so, complete the steps below:

**Manual process:**

**1) If you are on Windows: start a new Anaconda PowerShell Prompt window. If you are on a Mac: start a new terminal window.**

**2) Navigate to the "MLflow Tutorial" folder you created for this tutorial**

In this window, execute the following code:

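`mlflow models serve --model-uri models:/california-random-forest-r@champion -p 1234 --env-manager local`

Please note that the template for serving models is as follows (when using the alias convention):

`mlflow models serve --model-uri models:/{model_name}@{alias} -p 1234 --env-manager local`

The `--env-manager local` option serves the model from the current Python environment. MLflow releases before 2.0 used `--no-conda` for this purpose; that flag is not accepted by the releases that support aliases.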

**Here is where the alias method shines the most. By using aliases, the serve command remains the same.**

**Finding the model name**: The model name can be taken from the UI, or from the query performed before. In the MLflow UI, click on "Models" in the upper left corner and select the desired model. After that, replace {model_name} with it and use "champion" for {alias}.
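
The serve command can take a few seconds to start. A small polling helper, sketched below with only the standard library, can confirm that the server is up before any predictions are requested. The helper name `wait_until_ready` is hypothetical, and the sketch assumes the served model exposes the `/ping` health endpoint, as MLflow's local scoring server does in the 2.x releases.

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(url: str, timeout_s: float = 30.0, interval_s: float = 1.0) -> bool:
    """Poll `url` until it answers with HTTP 200 or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # server not reachable yet
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

For the command above, the call would be `wait_until_ready("http://127.0.0.1:1234/ping")`, which returns `True` once the server responds and `False` if it does not come up in time.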

### **5.1 Using the model API**

Once a model is being served on a specific port, sending an API call with some data to that port returns the model's predictions for that data.

In [None]:
import json

# Define some test data in JSON format to pass to the endpoint to make a prediction
data = {"inputs": X_test[:3].tolist()}

# Convert the data dictionary to JSON string
payload = json.dumps(data)

In [None]:
payload
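
The `inputs` key used above sends a plain array. MLflow's scoring server (2.x) also accepts a `dataframe_split` payload, which carries column names alongside the rows. The sketch below builds one with the standard library. The helper `build_split_payload` is hypothetical, the column names are the feature names of the California Housing dataset, and the row values are made up for illustration. Whether a served model accepts named columns depends on the signature it was logged with, so treat this as an alternative to try rather than a drop-in replacement.

```python
import json

# Feature names of the California Housing dataset
feature_names = [
    "MedInc", "HouseAge", "AveRooms", "AveBedrms",
    "Population", "AveOccup", "Latitude", "Longitude",
]

# Made-up rows, one value per feature
rows = [
    [8.3, 41.0, 6.9, 1.0, 322.0, 2.5, 37.88, -122.23],
    [3.1, 20.0, 5.2, 1.1, 1500.0, 3.0, 34.05, -118.24],
]


def build_split_payload(columns: list, data: list) -> str:
    """Serialize rows and column names into a `dataframe_split` JSON payload."""
    for row in data:
        if len(row) != len(columns):
            raise ValueError("Each row needs one value per column")
    body = {"dataframe_split": {"columns": list(columns), "data": [list(r) for r in data]}}
    return json.dumps(body)


split_payload = build_split_payload(feature_names, rows)
print(split_payload[:60])
```

The resulting string would be sent to `/invocations` with the `application/json` content type, in the same way as the `inputs` payload.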

In [None]:
import requests

host = "127.0.0.1"
port = "1234"

url = f"http://{host}:{port}/invocations"

headers = {
    "Content-Type": "application/json",
}

# Perform a POST request to the locally hosted MLflow model server, using the port defined while serving the model
response = requests.post(url=url, headers=headers, data=payload)

print(f"Predictions: {response.text}")
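
The response body is JSON text, so it can be turned into Python numbers before comparing it with the true values. The sketch below assumes the MLflow 2.x response shape, `{"predictions": [...]}`, and falls back to a bare list, which, as far as I recall, is what older servers returned. The helper `extract_predictions` and the sample body are hypothetical.

```python
import json


def extract_predictions(response_text: str) -> list[float]:
    """Parse a scoring-server response body into a list of floats."""
    body = json.loads(response_text)
    # Assumed shapes: {"predictions": [...]} (MLflow 2.x) or a bare list
    predictions = body["predictions"] if isinstance(body, dict) else body
    return [float(p) for p in predictions]


sample_body = '{"predictions": [1.5, 2.25, 0.75]}'  # made-up response body
print(extract_predictions(sample_body))  # [1.5, 2.25, 0.75]
```

With the live server, `extract_predictions(response.text)` would return the three predictions for `X_test[:3]`.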

### **5.2 Updating a Champion**

As explained above, the champion and challenger alias convention makes it possible to change the served model without modifying the serve command. Only the alias needs to point to the new version:

In [None]:
client.set_registered_model_alias(model_name, "champion", 2)

After restarting the serving process, the second version of the model will be the one served.

**End the current serving process (Ctrl+C), then re-run the same command as before. Pressing the Up arrow key should bring the last command back in the terminal.**

In [None]:
new_response = requests.post(url=url, headers=headers, data=payload)
print(f"Predictions: {new_response.text}")

In [None]:
print(f"Initial Champion prediction: {response.text} \n Challenger (New Champion) prediction: {new_response.text} \n Real Value: {y_test[:3]}")

## **NOTE:**
**REMEMBER TO CLOSE ALL TERMINALS AND DELETE THE "mlruns" FOLDER BEFORE RUNNING AGAIN**