<h4 style="font-variant-caps: small-caps;font-size:35pt;">Databricks-ML-professional-S01b-Experiment-Tracking</h4>

<div style='background-color:black;border-radius:5px;border-top:1px solid'></div>
<br/>
<p>This notebook covers the following requirements:</p><br/>
<b>Experiment Tracking:</b>
<ul>
<li>Manually log parameters, models, and evaluation metrics using MLflow</li>
<li>Programmatically access and use data, metadata, and models from MLflow experiments</li>
</ul>
<br/>
<p><b>Download this notebook in ipynb format <a href="Databricks-ML-professional-S01b-Experiment-Tracking.ipynb">here</a>.</b></p>
<br/>
<div style='background-color:black;border-radius:5px;border-top:1px solid'></div>

<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">1. Import libraries</span></div>

In [0]:
import pandas as pd
import seaborn as sns
#
from pyspark.sql.functions import *
#
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
#
import mlflow
#
import logging

In [0]:
logging.getLogger("mlflow").setLevel(logging.FATAL)

<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">2. Load dataset, convert to Spark DataFrame</span></div>

In [0]:
tips_df = sns.load_dataset("tips")
#
tips_sdf = spark.createDataFrame(tips_df)
#
display(tips_sdf.limit(5))

In [0]:
display(tips_sdf.filter("size is null"))

<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">3. Prepare data</span></div>

<p>The dataset is transformed as follows before it is used to train an ML model:</p>
<table border style='border-collapse: collapse;'>
<tr style="background-color:#EDEDED">
    <th>column name</th>
    <th>comment</th>
</tr>
<tr>
    <td><code>tip</code></td>
    <td><b style='color:orangered'>target</b> to predict. Contains numeric values</td>
</tr>
<tr>
    <td><code>total_bill</code></td>
    <td>numeric column to keep as is</td>
</tr>
<tr>
    <td><code>sex</code></td>
    <td>Contains <code>Female</code> and <code>Male</code>, converted to <code>1</code> and <code>0</code></td>
</tr>
<tr>
    <td><code>smoker</code></td>
    <td>Contains <code>Yes</code> and <code>No</code>, converted to <code>1</code> and <code>0</code></td>
</tr>
<tr>
    <td><code>time</code></td>
    <td>Contains <code>Dinner</code> and <code>Lunch</code>, converted to <code>1</code> and <code>0</code></td>
</tr>
<tr>
    <td><code>day</code></td>
    <td>categorical column to <b>One Hot Encode</b></td>
</tr>
<tr>
    <td><code>size</code></td>
    <td>categorical column to <b>One Hot Encode</b></td>
</tr>
</table>

In [0]:
# the seaborn tips dataset stores smoker as 'Yes'/'No'; Spark string comparison is case-sensitive
tips_sdf = tips_sdf.selectExpr("total_bill",
                               "tip",
                               "case when sex = 'Female' then 1 else 0 end as sex",
                               "case when smoker = 'Yes' then 1 else 0 end as smoker",
                               "case when time = 'Dinner' then 1 else 0 end as time",
                               "day",
                               "size")
#
# fix the seed so the split is the same each time the notebook runs on the same data
train_df, test_df = tips_sdf.randomSplit([.8, .2], seed=42)
#
ohe_cols = ["size", "day"]
num_cols = ["total_bill", "sex", "smoker", "time"]
target_col = "tip"
#
string_indexer = StringIndexer(inputCols=ohe_cols, outputCols=[c+"_index" for c in ohe_cols], handleInvalid="skip")
#
ohe = OneHotEncoder()
ohe.setInputCols([c+"_index" for c in ohe_cols])
ohe.setOutputCols([c+"_ohe" for c in ohe_cols])
#
assembler_inputs = [c+"_ohe" for c in ohe_cols] + num_cols
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
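
<p>To make the indexing and one-hot steps concrete, here is a minimal plain-Python sketch of what <code>StringIndexer</code> followed by <code>OneHotEncoder</code> computes, assuming Spark's default settings (<code>stringOrderType="frequencyDesc"</code> and <code>dropLast=True</code>). The helper names <code>fit_string_index</code> and <code>one_hot</code> and the sample values are hypothetical; Spark itself returns sparse vectors rather than Python lists.</p>

```python
from collections import Counter


def fit_string_index(values):
    """Map each distinct value to an index, most frequent first (ties alphabetical)."""
    counts = Counter(str(v) for v in values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_categories, drop_last=True):
    """Return a dense 0/1 list; with drop_last the final category is all zeros."""
    width = num_categories - 1 if drop_last else num_categories
    return [1 if i == index else 0 for i in range(width)]


# made-up sample of the "day" column, for illustration only
days = ["Sun", "Sat", "Sat", "Thur", "Sun", "Sat", "Fri"]
day_index = fit_string_index(days)
encoded = [one_hot(day_index[d], len(day_index)) for d in days]
```

<p>With four distinct days, each row is encoded on three positions. The least frequent category gets the last index and is represented by the all-zeros vector.</p>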

<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">4. Evaluator and model</span></div>

In [0]:
gbt =       GBTRegressor(featuresCol="features", labelCol=target_col, maxIter=5)
evaluator = RegressionEvaluator(labelCol=target_col, predictionCol="prediction", metricName="rmse")
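
<p>The evaluator above reports the root mean squared error (RMSE), which is the square root of the average squared difference between label and prediction. The following plain-Python sketch computes the same quantity on a tiny made-up sample. The function <code>rmse</code> and the values are illustrative only and are not part of the pipeline.</p>

```python
import math


def rmse(labels, predictions):
    """Root mean squared error between two equally long sequences of numbers."""
    if len(labels) != len(predictions) or not labels:
        raise ValueError("labels and predictions must be non-empty and the same length")
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# made-up tips and predictions: only the last prediction is off, by 2
sample_tips = [1.0, 2.0, 3.0, 4.0]
sample_predictions = [1.0, 2.0, 3.0, 6.0]
example_rmse = rmse(sample_tips, sample_predictions)  # sqrt((0 + 0 + 0 + 4) / 4) = 1.0
```

<p>RMSE is expressed in the same unit as the target, so a value of 1.0 here means predictions are off by about one unit of <code>tip</code> on average, with large errors weighted more heavily.</p>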

<a id="manuallylog"></a>
<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">5. Manually log parameters, models, and evaluation metrics using MLflow</span></div>

In [0]:
model_name = "GBT-Regressor"
#
with mlflow.start_run(run_name="Tip-run") as run:
    #
    # define pipeline stages according to model
    stages = [string_indexer, ohe, vec_assembler, gbt]
    #
    # set pipeline
    pipeline = Pipeline(stages=stages)
    #
    # fit pipeline to train set
    model = pipeline.fit(train_df)
    #
    # manually log model to mlflow
    mlflow.spark.log_model(model, model_name)
    #
    # manually log parameter to mlflow
    mlflow.log_param("maxIter", gbt.getMaxIter())
    #
    # predict test set
    pred_df = model.transform(test_df)
    #
    # evaluate prediction
    rmse = evaluator.evaluate(pred_df)
    #
    # manually log metric to mlflow
    mlflow.log_metric("rmse", rmse)
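
<p>When a run has several parameters or metrics, <code>mlflow.log_params</code> and <code>mlflow.log_metrics</code> accept dictionaries and log all entries in one call. The sketch below wraps both in a hypothetical helper, <code>log_run_summary</code>. Its <code>tracker</code> argument is an assumption added for illustration so the helper can be tried with a stand-in object instead of a real tracking server.</p>

```python
def log_run_summary(params, metrics, tracker=None):
    """Log a dict of parameters and a dict of metrics to the active MLflow run.

    `tracker` is a hypothetical injection point; by default it is the mlflow module.
    """
    if tracker is None:
        import mlflow  # imported lazily so the helper can be exercised with a stand-in
        tracker = mlflow
    # each call takes a dict and records every entry it contains
    tracker.log_params(params)
    # metric values are converted to float before logging
    tracker.log_metrics({name: float(value) for name, value in metrics.items()})
```

<p>Inside the <code>with mlflow.start_run(...)</code> block above, it could be called as <code>log_run_summary({"maxIter": gbt.getMaxIter(), "maxDepth": gbt.getMaxDepth()}, {"rmse": rmse})</code>.</p>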

<a id="programmaticallyaccess"></a>
<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">6. Programmatically access and use data, metadata, and models from MLflow experiments</span></div>

<p>This can be done in different ways. One of them is to call the function <code>mlflow.search_runs</code>, which returns a pandas DataFrame with one row per run in the current experiment, including run metadata, parameters, metrics and tags <i>(by default, the current experiment has the name of the current notebook)</i>:</p>

In [0]:
mlflow.search_runs().drop(['tags.mlflow.databricks.workspaceURL',
                           'tags.mlflow.databricks.notebookPath',
                           'tags.mlflow.source.name',
                           'tags.mlflow.user'], axis=1)

<p>Using pandas syntax, the result can be restricted to the columns of interest and sorted:</p>

In [0]:
mlflow.search_runs()[["tags.mlflow.runName", "run_id", "params.maxIter", "metrics.rmse"]].sort_values(by=['metrics.rmse'], ascending=True)

<p>A <b>SQL-like filter</b> can also be applied directly in the <code>mlflow.search_runs()</code> function by using its <code>filter_string</code> parameter. This is particularly useful when there are many runs, because only the matching runs are returned:</p>

In [0]:
mlflow.search_runs(filter_string="tags.mlflow.runName like '%Tip%' and metrics.rmse<=1.5")[["tags.mlflow.runName", "run_id", "params.maxIter", "metrics.rmse"]]
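
<p>When the run-name pattern or the metric threshold changes between queries, the filter string can be assembled by a small helper. The function below is a hypothetical sketch that reproduces the filter syntax used above. Rejecting single quotes is a simplification chosen here, not an MLflow rule.</p>

```python
def build_run_filter(run_name_pattern, max_rmse):
    """Build a filter_string matching runs by name pattern and an RMSE upper bound."""
    if "'" in run_name_pattern:
        # keep the sketch simple: refuse patterns that would break the quoting
        raise ValueError("run_name_pattern must not contain single quotes")
    return f"tags.mlflow.runName like '{run_name_pattern}' and metrics.rmse<={max_rmse}"


tip_filter = build_run_filter("%Tip%", 1.5)
```

<p>The result can then be passed as <code>mlflow.search_runs(filter_string=tip_filter)</code>.</p>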

<p>With this, the run with the lowest RMSE can be identified and its model loaded:</p>

In [0]:
bestModelRunId = mlflow.search_runs().sort_values(by=['metrics.rmse'], ascending=True).head(1)["run_id"].values[0]
#
best_model_path = f"runs:/{bestModelRunId}/{model_name}"
print(f"Best model path is: {best_model_path}")
#
loaded_model = mlflow.spark.load_model(best_model_path)
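
<p>As an alternative to sorting the full DataFrame in pandas, <code>mlflow.search_runs</code> accepts <code>order_by</code> and <code>max_results</code> arguments, so only the best run is returned. The sketch below assumes the metric was logged under the name <code>rmse</code>, as above. The helpers <code>find_best_run_id</code> and <code>model_uri</code> and the <code>search_runs</code> argument are hypothetical names introduced for illustration.</p>

```python
def model_uri(run_id, artifact_path):
    """Build the runs:/ URI of a model logged under artifact_path in a given run."""
    return f"runs:/{run_id}/{artifact_path}"


def find_best_run_id(metric="rmse", search_runs=None):
    """Return the run_id of the run with the lowest value of `metric`.

    `search_runs` is a hypothetical injection point; it defaults to mlflow.search_runs.
    """
    if search_runs is None:
        import mlflow  # imported lazily so the helper can be tried without a tracking server
        search_runs = mlflow.search_runs
    # request a single run, sorted by the metric in ascending order
    runs = search_runs(
        order_by=[f"metrics.{metric} ASC"],
        max_results=1,
        output_format="list",  # list of Run objects instead of a pandas DataFrame
    )
    if not runs:
        raise LookupError("no runs found in the current experiment")
    return runs[0].info.run_id
```

<p>The best model could then be loaded with <code>mlflow.spark.load_model(model_uri(find_best_run_id(), model_name))</code>.</p>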

In [0]:
display(loaded_model.transform(test_df).select("tip", "prediction"))

<img src="https://i.ibb.co/xSdfvyD/mlflow3.png"/>