<h4 style="font-variant-caps: small-caps;font-size:35pt;">Databricks-ML-professional-S01b-Experiment-Tracking</h4>

<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">1. Import libraries</span></div>

In [0]:
import pandas as pd
import seaborn as sns
#
from pyspark.sql.functions import *
#
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml import Pipeline
#
import mlflow



<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">2. Load dataset, convert to Spark DataFrame</span></div>

In [0]:
tips_df = sns.load_dataset("tips")
#
tips_sdf = spark.createDataFrame(tips_df)
#
display(tips_sdf.limit(5))

total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.5,Male,No,Sun,Dinner,3
23.68,3.31,Male,No,Sun,Dinner,2
24.59,3.61,Female,No,Sun,Dinner,4


In [0]:
display(tips_sdf.filter("size is null"))

total_bill,tip,sex,smoker,day,time,size


<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">3. Prepare data</span></div>

<p>The following transformations prepare the dataset for training an ML model:</p>
<table border style='border-collapse: collapse;'>
<tr style="background-color:#EDEDED">
    <th>column name</th>
    <th>comment</th>
</tr>
<tr>
    <td><code>tip</code></td>
    <td><b style='color:orangered'>target</b> to predict; numeric column</td>
</tr>
<tr>
    <td><code>total_bill</code></td>
    <td>numeric column to keep as is</td>
</tr>
<tr>
    <td><code>sex</code></td>
    <td>Contains <code>Female</code> and <code>Male</code>, converted to <code>1</code> and <code>0</code></td>
</tr>
<tr>
    <td><code>smoker</code></td>
    <td>Contains <code>Yes</code> and <code>No</code>, converted to <code>1</code> and <code>0</code></td>
</tr>
<tr>
    <td><code>time</code></td>
    <td>Contains <code>Dinner</code> and <code>Lunch</code>, converted to <code>1</code> and <code>0</code></td>
</tr>
<tr>
    <td><code>day</code></td>
    <td>categorical column to <b>One Hot Encode</b></td>
</tr>
<tr>
    <td><code>size</code></td>
    <td>categorical column to <b>One Hot Encode</b></td>
</tr>
</table>

In [0]:
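# binary columns are encoded as 0/1; the comparisons are case-sensitive,
# so the labels must match the data exactly (e.g. 'Yes', not 'yes')
tips_sdf = tips_sdf.selectExpr("total_bill",
                               "tip",
                               "case when sex = 'Female' then 1 else 0 end as sex",
                               "case when smoker = 'Yes' then 1 else 0 end as smoker",
                               "case when time = 'Dinner' then 1 else 0 end as time",
                               "day",
                               "size")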
#
train_df, test_df = tips_sdf.randomSplit([.8, .2], seed=42)
#
ohe_cols = ["size", "day"]
num_cols = ["total_bill", "sex", "smoker", "time"]
target_col = "tip"
#
string_indexer = StringIndexer(inputCols=ohe_cols, outputCols=[c+"_index" for c in ohe_cols], handleInvalid="skip")
#
ohe = OneHotEncoder()
ohe.setInputCols([c+"_index" for c in ohe_cols])
ohe.setOutputCols([c+"_ohe" for c in ohe_cols])
#
assembler_inputs = [c+"_ohe" for c in ohe_cols] + num_cols
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
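
<p>As a rough illustration of what the <code>StringIndexer</code> and <code>OneHotEncoder</code> stages do to a column such as <code>day</code>, the plain-Python sketch below mimics their behavior assuming Spark's defaults (<code>stringOrderType="frequencyDesc"</code> with alphabetical tie-breaking, and <code>dropLast=True</code>). The helper names <code>string_index</code> and <code>one_hot</code> and the sample list are hypothetical and not part of Spark; the real stages produce sparse vectors rather than Python lists.</p>

```python
from collections import Counter


def string_index(values):
    """Map each label to an index, most frequent label first (ties: alphabetical)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}


def one_hot(index, num_labels, drop_last=True):
    """Return a dense one-hot list; with drop_last the last label is all zeros."""
    size = num_labels - 1 if drop_last else num_labels
    vec = [0.0] * size
    if index < size:
        vec[index] = 1.0
    return vec


# Hypothetical sample of the "day" column, for illustration only.
days = ["Sun", "Sun", "Sat", "Thur", "Sat", "Sun"]

mapping = string_index(days)
encoded = {label: one_hot(idx, len(mapping)) for label, idx in mapping.items()}

for label, vec in encoded.items():
    print(label, mapping[label], vec)
```

<p>Dropping the last category keeps the encoded columns linearly independent, which is why a column with <i>n</i> distinct labels yields a vector of length <i>n - 1</i> under these defaults.</p>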

<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">4. Evaluator and model</span></div>

In [0]:
gbt =       GBTRegressor(featuresCol="features", labelCol=target_col, maxIter=5)
evaluator = RegressionEvaluator(labelCol=target_col, predictionCol="prediction", metricName="rmse")

<a id="manuallylog"></a>
<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">5. Manually log parameters, models, and evaluation metrics using MLflow</span></div>

In [0]:
model_name = "GBT-Regressor"
#
with mlflow.start_run(run_name="Tip-run") as run:
    #
    # define pipeline stages according to model
    stages = [string_indexer, ohe, vec_assembler, gbt]
    #
    # set pipeline
    pipeline = Pipeline(stages=stages)
    #
    # fit pipeline to train set
    model = pipeline.fit(train_df)
    #
    # manually log model to mlflow
    mlflow.spark.log_model(model, model_name)
    #
    # manually log parameter to mlflow (read from the estimator so the logged value matches the model)
    mlflow.log_param("maxIter", gbt.getMaxIter())
    #
    # predict test set
    pred_df = model.transform(test_df)
    #
    # evaluate prediction
    rmse = evaluator.evaluate(pred_df)
    #
    # manually log metric to mlflow
    mlflow.log_metric("rmse", rmse)



<a id="programmaticallyaccess"></a>
<div style='background-color:rgba(30, 144, 255, 0.1);border-radius:5px;padding:2px;'>
<span style="font-variant-caps: small-caps;font-weight:700">6. Programmatically access and use data, metadata, and models from MLflow experiments</span></div>

<p>This can be done in several ways. One of them is the <code>mlflow.search_runs</code> function, which returns a pandas DataFrame with one row per run in the current experiment, including run IDs, parameters, metrics, and tags <i>(by default, the current experiment has the name of the current notebook)</i>:</p>

In [0]:
mlflow.search_runs()

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.rmse,params.maxIter,tags.mlflow.databricks.cluster.id,tags.mlflow.databricks.cluster.libraries.error,tags.mlflow.user,tags.mlflow.databricks.workspaceID,tags.mlflow.databricks.workspaceURL,tags.mlflow.databricks.notebookPath,tags.mlflow.source.name,tags.mlflow.runName,tags.mlflow.databricks.notebookID,tags.mlflow.source.type,tags.mlflow.log-model.history,tags.mlflow.databricks.cluster.info,tags.mlflow.databricks.notebook.commandID,tags.mlflow.databricks.webappURL,tags.sparkDatasourceInfo,tags.mlflow.databricks.notebookRevisionID
0,0a1d747c58df40d7977e830990f540eb,121806328486233,FINISHED,dbfs:/databricks/mlflow-tracking/1218063284862...,2023-11-06 15:56:07.176000+00:00,2023-11-06 15:56:43.332000+00:00,1.498592,5,1103-171254-zkooaj5p,This message class grpc_shaded.com.databricks....,victor.bonnet.mg@gmail.com,2434150836020126,adb-2434150836020126.6.azuredatabricks.net,/Users/victor.bonnet.mg@gmail.com/notebooks-fo...,/Users/victor.bonnet.mg@gmail.com/notebooks-fo...,Tip-run,121806328486233,NOTEBOOK,"[{""artifact_path"":""GBT-Regressor"",""flavors"":{""...","{""cluster_name"":""Victor BONNET's Cluster"",""spa...",9095861010363393222_4903848524339674381_12e67d...,https://eastus-c3.azuredatabricks.net,,
1,6fb42e670d5c449cb0bb598995db82fd,121806328486233,FINISHED,dbfs:/databricks/mlflow-tracking/1218063284862...,2023-11-06 15:34:45.837000+00:00,2023-11-06 15:35:34.651000+00:00,1.06905,30,1103-171254-zkooaj5p,This message class grpc_shaded.com.databricks....,victor.bonnet.mg@gmail.com,2434150836020126,adb-2434150836020126.6.azuredatabricks.net,/Users/victor.bonnet.mg@gmail.com/notebooks-fo...,/Users/victor.bonnet.mg@gmail.com/notebooks-fo...,Tip-run,121806328486233,NOTEBOOK,"[{""artifact_path"":""GBT-Regressor"",""flavors"":{""...","{""cluster_name"":""Victor BONNET's Cluster"",""spa...",7163016224346667493_6187791156394192169_96499e...,https://eastus-c3.azuredatabricks.net,"path=dbfs:/user/hive/warehouse/tips_sdf,versio...",1699284934936.0
2,df0104e271c341979d6db34f5cbbb36b,121806328486233,FINISHED,dbfs:/databricks/mlflow-tracking/1218063284862...,2023-11-06 15:33:48.987000+00:00,2023-11-06 15:34:27.323000+00:00,1.066086,15,1103-171254-zkooaj5p,This message class grpc_shaded.com.databricks....,victor.bonnet.mg@gmail.com,2434150836020126,adb-2434150836020126.6.azuredatabricks.net,/Users/victor.bonnet.mg@gmail.com/notebooks-fo...,/Users/victor.bonnet.mg@gmail.com/notebooks-fo...,Tip-run,121806328486233,NOTEBOOK,"[{""artifact_path"":""GBT-Regressor"",""flavors"":{""...","{""cluster_name"":""Victor BONNET's Cluster"",""spa...",7163016224346667493_7693249504534724776_f5160c...,https://eastus-c3.azuredatabricks.net,"path=dbfs:/user/hive/warehouse/tips_sdf,versio...",1699284867633.0
3,1fca201a147a4e5eb870f5e5c042f854,121806328486233,FINISHED,dbfs:/databricks/mlflow-tracking/1218063284862...,2023-11-06 15:24:24.329000+00:00,2023-11-06 15:25:03.545000+00:00,1.061725,10,1103-171254-zkooaj5p,This message class grpc_shaded.com.databricks....,victor.bonnet.mg@gmail.com,2434150836020126,adb-2434150836020126.6.azuredatabricks.net,/Users/victor.bonnet.mg@gmail.com/notebooks-fo...,/Users/victor.bonnet.mg@gmail.com/notebooks-fo...,Tip-run,121806328486233,NOTEBOOK,"[{""artifact_path"":""GBT-Regressor"",""flavors"":{""...","{""cluster_name"":""Victor BONNET's Cluster"",""spa...",7163016224346667493_7817536333738181060_bbe0ea...,https://eastus-c3.azuredatabricks.net,"path=dbfs:/user/hive/warehouse/tips_sdf,versio...",1699284303884.0


<p>Standard pandas syntax can then be used to keep only the columns of interest and sort the runs:</p>

In [0]:
mlflow.search_runs()[["tags.mlflow.runName", "run_id", "params.maxIter", "metrics.rmse"]].sort_values(by=['metrics.rmse'], ascending=True)

Unnamed: 0,tags.mlflow.runName,run_id,params.maxIter,metrics.rmse
3,Tip-run,1fca201a147a4e5eb870f5e5c042f854,10,1.061725
2,Tip-run,df0104e271c341979d6db34f5cbbb36b,15,1.066086
1,Tip-run,6fb42e670d5c449cb0bb598995db82fd,30,1.06905
0,Tip-run,0a1d747c58df40d7977e830990f540eb,5,1.498592


<p>A <b>SQL-like filter</b> can also be applied directly in the <code>mlflow.search_runs()</code> function through its <code>filter_string</code> parameter. This is particularly useful when there are many runs, because the filtering happens before the results are returned:</p>

In [0]:
mlflow.search_runs(filter_string="tags.mlflow.runName like '%Tip%' and metrics.rmse<=1.069")[["tags.mlflow.runName", "run_id", "params.maxIter", "metrics.rmse"]]

Unnamed: 0,tags.mlflow.runName,run_id,params.maxIter,metrics.rmse
0,Tip-run,df0104e271c341979d6db34f5cbbb36b,15,1.066086
1,Tip-run,1fca201a147a4e5eb870f5e5c042f854,10,1.061725
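
<p>The filter above returns only two of the four runs: the run with <code>maxIter=30</code> has an RMSE of 1.06905, just above the 1.069 threshold. The sketch below reproduces that selection in plain Python on values copied from the outputs shown earlier. It only imitates the two conditions of this particular filter string and is not how MLflow evaluates filters internally; the <code>matches</code> helper is hypothetical.</p>

```python
# Run summaries copied from the search_runs output shown earlier.
# MLflow stores params as strings, so maxIter is kept as text here.
runs = [
    {"run_name": "Tip-run", "run_id": "0a1d747c58df40d7977e830990f540eb", "maxIter": "5", "rmse": 1.498592},
    {"run_name": "Tip-run", "run_id": "6fb42e670d5c449cb0bb598995db82fd", "maxIter": "30", "rmse": 1.06905},
    {"run_name": "Tip-run", "run_id": "df0104e271c341979d6db34f5cbbb36b", "maxIter": "15", "rmse": 1.066086},
    {"run_name": "Tip-run", "run_id": "1fca201a147a4e5eb870f5e5c042f854", "maxIter": "10", "rmse": 1.061725},
]


def matches(run):
    """Imitate: tags.mlflow.runName like '%Tip%' and metrics.rmse<=1.069"""
    return "Tip" in run["run_name"] and run["rmse"] <= 1.069


# Keep the matching runs and order them from lowest to highest RMSE.
selected = sorted((r for r in runs if matches(r)), key=lambda r: r["rmse"])

for run in selected:
    print(run["run_id"], run["maxIter"], run["rmse"])
```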


<p>With this, the model logged by the run with the lowest RMSE can be loaded:</p>

In [0]:
bestModelRunId = mlflow.search_runs().sort_values(by=['metrics.rmse'], ascending=True).head(1)["run_id"].values[0]
#
print(f"Best model run ID is: {bestModelRunId}")
best_model_path = f"runs:/{bestModelRunId}/{model_name}"
#
loaded_model = mlflow.spark.load_model(best_model_path)

2023/11/06 15:56:44 INFO mlflow.spark: 'runs:/1fca201a147a4e5eb870f5e5c042f854/GBT-Regressor' resolved as 'dbfs:/databricks/mlflow-tracking/121806328486233/1fca201a147a4e5eb870f5e5c042f854/artifacts/GBT-Regressor'
Best model run ID is: 1fca201a147a4e5eb870f5e5c042f854
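
<p>The log line above shows how the <code>runs:/</code> URI relates to the stored artifact location. The sketch below rebuilds both strings from their parts, using the IDs from that log line. The helper names are hypothetical, and the resolved layout is simply what this workspace's log shows; it should not be treated as a guaranteed contract.</p>

```python
def runs_uri(run_id, artifact_path):
    """Build the runs:/ URI used to load a model logged under artifact_path."""
    return f"runs:/{run_id}/{artifact_path}"


def resolved_location(experiment_id, run_id, artifact_path,
                      root="dbfs:/databricks/mlflow-tracking"):
    """Rebuild the artifact location as it appears in the log line (assumed layout)."""
    return f"{root}/{experiment_id}/{run_id}/artifacts/{artifact_path}"


# Values copied from the log line above.
experiment_id = "121806328486233"
run_id = "1fca201a147a4e5eb870f5e5c042f854"
artifact_path = "GBT-Regressor"  # the name passed to mlflow.spark.log_model

uri = runs_uri(run_id, artifact_path)
location = resolved_location(experiment_id, run_id, artifact_path)

print(uri)
print(location)
```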


In [0]:
display(loaded_model.transform(test_df).select("tip", "prediction"))

tip,prediction
1.32,1.814391139677373
1.67,1.7628004227316882
1.76,2.116650713735796
3.23,2.247819711105944
2.24,2.65410884549186
3.5,3.07345882528984
3.0,2.996355983784853
3.0,2.956974962442829
3.31,4.063825640809367
3.6,3.302916179808894
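
<p>To make the <code>rmse</code> metric concrete, the sketch below computes a root mean squared error by hand on the ten <code>(tip, prediction)</code> pairs displayed above. Since it covers only these displayed rows, its result is not expected to match the RMSE logged for the full test set. The <code>rmse</code> helper is a hypothetical stand-in for <code>RegressionEvaluator</code>.</p>

```python
import math


def rmse(actual, predicted):
    """Root mean squared error between two equally sized sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# (tip, prediction) pairs copied from the displayed output above.
pairs = [
    (1.32, 1.814391139677373),
    (1.67, 1.7628004227316882),
    (1.76, 2.116650713735796),
    (3.23, 2.247819711105944),
    (2.24, 2.65410884549186),
    (3.5, 3.07345882528984),
    (3.0, 2.996355983784853),
    (3.0, 2.956974962442829),
    (3.31, 4.063825640809367),
    (3.6, 3.302916179808894),
]

tips = [tip for tip, _ in pairs]
predictions = [pred for _, pred in pairs]

sample_rmse = rmse(tips, predictions)
print(f"RMSE on the displayed rows: {sample_rmse:.4f}")
```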


<img src="https://i.ibb.co/xSdfvyD/mlflow3.png"/>