For this hands-on, we will be using the [Power Plant dataset](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant), where the goal is to predict the net hourly electrical energy output (EP, the `PE` column) of a plant.

In [1]:
from datetime import datetime

import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

pd.set_option("display.max_columns", None)

In [2]:
df = pd.read_csv("../data/power_plants.csv")
df.head()

Unnamed: 0,AT,V,AP,RH,PE
0,14.96,41.76,1024.07,73.17,463.26
1,25.18,62.96,1020.04,59.08,444.37
2,5.11,39.4,1012.16,92.14,488.56
3,20.86,57.32,1010.24,76.64,446.48
4,10.82,37.5,1009.23,96.62,473.9


# MLflow Tracking

## Model training

In [3]:
def train_model(train_df, max_depth=2):
    # Split data
    X = train_df[["AT", "V", "AP", "RH"]]
    y = train_df["PE"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit model
    model = RandomForestRegressor(max_depth=max_depth)
    model.fit(X_train, y_train)

    # Evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5  # RMSE; the `squared=False` argument was removed in scikit-learn 1.6
    print(f"Test mse = {mse:.2f}, Test RMSE = {rmse:.2f}, Random forest max depth = {max_depth}")
    return model, mse, rmse

In [4]:
_ = train_model(df, max_depth=2)

Test mse = 37.59, Test RMSE = 6.13, Random forest max depth = 2
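
The RMSE reported above is the square root of the MSE, which puts the error back in the same unit as `PE`. A minimal sketch in plain Python, using made-up values rather than rows from the dataset (the helper names `mse` and `rmse` are illustrative):

```python
import math


def mse(y_true, y_pred):
    """Mean of the squared differences between targets and predictions."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true, y_pred):
    """Square root of the MSE, expressed in the unit of the target."""
    return math.sqrt(mse(y_true, y_pred))


# Illustrative values only
y_true = [460.0, 450.0, 480.0]
y_pred = [457.0, 454.0, 480.0]

print(f"MSE = {mse(y_true, y_pred):.2f}, RMSE = {rmse(y_true, y_pred):.2f}")
```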


- Test with different max depths for the Random forest

In [5]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

Test mse = 37.48, Test RMSE = 6.12, Random forest max depth = 2
Test mse = 19.97, Test RMSE = 4.47, Random forest max depth = 4
Test mse = 14.58, Test RMSE = 3.82, Random forest max depth = 6


## Experiment tracking

### Some vocabulary:
- **run**: a single execution of model training code. Each run can record different kinds of information (model parameters, metrics, tags, artifacts, etc.).
- **experiment**: the primary unit of organization and access control for MLflow runs; all MLflow runs belong to an experiment. Experiments let you visualize, search for, and compare runs, as well as download run artifacts and metadata for analysis in other tools.

In [6]:
!dir

 Volume in drive D is Data
 Volume Serial Number is 122B-6F5D

 Directory of d:\study\DSP\DSP-Tingfen-YU\mlflow_hands_on-master\notebooks

18/11/2022  14:38    <DIR>          .
18/11/2022  15:18    <DIR>          ..
19/11/2022  20:53            55,866 mlflow_tracking_hands_on.ipynb
18/11/2022  14:38    <DIR>          mlruns
               1 File(s)         55,866 bytes
               3 Dir(s)  195,684,974,592 bytes free


In [7]:
experiment_name = "ep_prediction_with_random_forest"
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='file:///D:/study/DSP/DSP-Tingfen-YU/mlflow_hands_on-master/notebooks/mlruns/1', creation_time=1668778706169, experiment_id='1', last_update_time=1668778706169, lifecycle_stage='active', name='ep_prediction_with_random_forest', tags={}>

In [8]:
!dir

 Volume in drive D is Data
 Volume Serial Number is 122B-6F5D

 Directory of d:\study\DSP\DSP-Tingfen-YU\mlflow_hands_on-master\notebooks

18/11/2022  14:38    <DIR>          .
18/11/2022  15:18    <DIR>          ..
19/11/2022  20:53            55,866 mlflow_tracking_hands_on.ipynb
18/11/2022  14:38    <DIR>          mlruns
               1 File(s)         55,866 bytes
               3 Dir(s)  195,684,974,592 bytes free


In [9]:
!tree mlruns

Folder PATH listing for volume Data
Volume serial number is 122B-6F5D
D:\STUDY\DSP\DSP-TINGFEN-YU\MLFLOW_HANDS_ON-MASTER\NOTEBOOKS\MLRUNS
+---.trash
+---0
|   +---775862801ff94a3789fadc378d0181e5
|   |   +---artifacts
|   |   |   +---model
|   |   +---metrics
|   |   +---params
|   |   +---tags
|   +---817881f5b6d045aabd2a90982ca74218
|   |   +---artifacts
|   |   |   +---model
|   |   +---metrics
|   |   +---params
|   |   +---tags
|   +---abf561b94e684d38a5abd679a7e580b8
|       +---artifacts
|       |   +---model
|       +---metrics
|       +---params
|       +---tags
+---1
    +---07c7546a580441dda55647695a9049f9
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +---tags
    +---1866408c6b174d0ab64296bcf906f8ac
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +---tags
    +---797064258bd04f56b8f2d032eabe5bd9
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +-

In [10]:
!cat mlruns/1/meta.yaml

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [11]:
!clip < mlruns/1/meta.yaml
# Copy the content of the meta.yaml file to the Windows clipboard (`cat` is not available in cmd)
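
Since shell commands differ between Windows and Linux, the file can also be read from Python directly. A minimal sketch, assuming the `mlruns/1/meta.yaml` path used in the cells above (`read_text_file` is a hypothetical helper, not part of MLflow):

```python
from pathlib import Path


def read_text_file(path):
    """Return the content of a text file, or None if the file does not exist."""
    file_path = Path(path)
    if not file_path.is_file():
        return None
    return file_path.read_text(encoding="utf-8")


# Path taken from the cells above; adjust the experiment id if yours differs
content = read_text_file("mlruns/1/meta.yaml")
print(content if content is not None else "meta.yaml not found")
```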

### Basic logging
- Log the model hyper-parameters, the metrics and the model itself

In [12]:
def train_model(train_df, max_depth=2):
    with mlflow.start_run():
        # Split data
        X = train_df[["AT", "V", "AP", "RH"]]
        y = train_df["PE"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Fit model
        model = RandomForestRegressor(max_depth=max_depth)
        model.fit(X_train, y_train)
        ## mlflow: log model & its hyper-parameters
        mlflow.log_param("max_depth", max_depth)
        mlflow.sklearn.log_model(model, "model")

        # Evaluate the model
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        rmse = mse ** 0.5  # RMSE; the `squared=False` argument was removed in scikit-learn 1.6
        ## mlflow: log metrics
        mlflow.log_metrics({"testing_mse": mse, "testing_rmse": rmse})
        print(f"Test mse = {mse:.2f}, Test RMSE = {rmse:.2f}, Random forest max depth = {max_depth}")

- Run the function with mlflow tracking

In [13]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)



Test mse = 38.36, Test RMSE = 6.19, Random forest max depth = 2
Test mse = 20.00, Test RMSE = 4.47, Random forest max depth = 4
Test mse = 14.56, Test RMSE = 3.82, Random forest max depth = 6


### Visualize experiments with MLflow tracking UI

To open the [MLflow Tracking UI](https://www.mlflow.org/docs/latest/tracking.html#tracking-ui), either run ```mlflow ui``` from the *notebooks* folder (so that it finds the *mlruns* folder) or start an *MLflow server* (used in the following section).

### Where mlflow saves the data

#### Some vocabulary:
- **Backend store**: for MLflow entities (runs, parameters, metrics, tags, notes, metadata, etc)
- **Artifact store**: for artifacts (files, models, images, in-memory objects, etc)
- For more information, [check the official documentation](https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded)

#### Without prior configuration
- When no prior configuration is set, MLflow creates an *mlruns* folder in the current directory, where the data will be saved

In [14]:
!dir

 Volume in drive D is Data
 Volume Serial Number is 122B-6F5D

 Directory of d:\study\DSP\DSP-Tingfen-YU\mlflow_hands_on-master\notebooks

18/11/2022  14:38    <DIR>          .
18/11/2022  15:18    <DIR>          ..
19/11/2022  20:53            55,866 mlflow_tracking_hands_on.ipynb
18/11/2022  14:38    <DIR>          mlruns
               1 File(s)         55,866 bytes
               3 Dir(s)  195,683,811,328 bytes free


- MLflow created the *mlruns* folder, where it stores the information of the different runs

In [15]:
!tree mlruns

Folder PATH listing for volume Data
Volume serial number is 122B-6F5D
D:\STUDY\DSP\DSP-TINGFEN-YU\MLFLOW_HANDS_ON-MASTER\NOTEBOOKS\MLRUNS
+---.trash
+---0
|   +---775862801ff94a3789fadc378d0181e5
|   |   +---artifacts
|   |   |   +---model
|   |   +---metrics
|   |   +---params
|   |   +---tags
|   +---817881f5b6d045aabd2a90982ca74218
|   |   +---artifacts
|   |   |   +---model
|   |   +---metrics
|   |   +---params
|   |   +---tags
|   +---abf561b94e684d38a5abd679a7e580b8
|       +---artifacts
|       |   +---model
|       +---metrics
|       +---params
|       +---tags
+---1
    +---07c7546a580441dda55647695a9049f9
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +---tags
    +---1866408c6b174d0ab64296bcf906f8ac
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +---tags
    +---4496cf78e7194dda9b6dc43b5423de19
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +-

#### With prior configuration
- Set the **Backend store** to an SQLite database located at */tmp/mlruns.db* and the **Artifact store** to a folder located at */tmp/mlruns*. For more information on the other storage options (S3, blob storage, etc.), check [the official documentation](https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded).
- To run the MLflow server, you need to:
    - stop the execution of the UI (`mlflow ui` command)
    - execute the following command:
        - Linux: ```mlflow server --backend-store-uri sqlite:////tmp/mlruns.db --default-artifact-root /tmp/mlruns```
        - Windows: ```mlflow server --backend-store-uri sqlite:///mlruns.db --default-artifact-root mlruns```
- Set the tracking URI in the notebook with ```mlflow.set_tracking_uri('http://127.0.0.1:5000')```
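
The two `--backend-store-uri` values above differ only in the number of slashes: the `sqlite:///` prefix is followed by the database path, so an absolute POSIX path adds a fourth slash. A small sketch of that rule (`sqlite_backend_uri` is a hypothetical helper, not an MLflow function):

```python
def sqlite_backend_uri(db_path):
    """Build a SQLite backend store URI from a POSIX-style path.

    The prefix is always ``sqlite:///``; an absolute path brings its own
    leading slash, which gives four slashes in total.
    """
    return f"sqlite:///{db_path}"


relative_uri = sqlite_backend_uri("mlruns.db")       # relative to where the server starts
absolute_uri = sqlite_backend_uri("/tmp/mlruns.db")  # absolute path

print(relative_uri)
print(absolute_uri)
```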

In [32]:
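# Set the backend store to a PostgreSQL database (psycopg2 is the driver), e.g. postgresql://mlflow_user:mlflow@localhost/mlflow_db
# Note: this command blocks the cell while the server runs; starting it from a separate terminal keeps the notebook usable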
!mlflow server --backend-store-uri postgresql://postgres:ytf@localhost/TTTest --default-artifact-root file:///D:/study/DSP/DSP-Tingfen-YU/mlflow_hands_on-master/notebooks/mlruns

^C


In [30]:
mlflow.set_tracking_uri('http://127.0.0.1:5000')

In [33]:
# Create the experiment in the new database
experiment_name = "ep_prediction_with_random_forest"
mlflow.set_experiment(experiment_name=experiment_name)

2022/11/20 16:07:49 INFO mlflow.tracking.fluent: Experiment with name 'ep_prediction_with_random_forest' does not exist. Creating a new experiment.


<Experiment: artifact_location='file:///D:/study/DSP/DSP-Tingfen-YU/mlflow_hands_on-master/notebooks/mlruns/1', creation_time=1668956869233, experiment_id='1', last_update_time=1668956869233, lifecycle_stage='active', name='ep_prediction_with_random_forest', tags={}>

### Logging with autolog

- Autolog will log all the model parameters, training metrics, model binary, etc. **BUT not the test metrics**: they need to be logged manually

In [34]:
def train_model(train_df, max_depth=2):
    training_timestamp = datetime.now().strftime('%Y-%m-%d, %H:%M:%S')
    with mlflow.start_run(run_name=f"model_{training_timestamp}"):

        mlflow.autolog()
        
        # Split data
        X = train_df[["AT", "V", "AP", "RH"]]
        y = train_df["PE"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Fit model
        model = RandomForestRegressor(max_depth=max_depth)
        model.fit(X_train, y_train)

        # Evaluate the model
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        rmse = mse ** 0.5  # RMSE; the `squared=False` argument was removed in scikit-learn 1.6
        ## mlflow: log metrics
        mlflow.log_metrics({"testing_mse": mse, "testing_rmse": rmse})
        print(f"Test mse = {mse}, Test RMSE = {rmse}, Random forest max depth = {max_depth}")

In [35]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

2022/11/20 16:08:05 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


Test mse = 37.183492393724556, Test RMSE = 6.097826858293416, Random forest max depth = 2


2022/11/20 16:08:18 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


Test mse = 20.06746768105951, Test RMSE = 4.479672720306642, Random forest max depth = 4


2022/11/20 16:08:49 INFO mlflow.tracking.fluent: Autologging successfully enabled for sklearn.


Test mse = 14.547846381734143, Test RMSE = 3.814163916474244, Random forest max depth = 6


### Search runs

- [In the UI directly](https://www.mlflow.org/docs/latest/search-syntax.html#search)
- [Programmatically with search_runs](https://www.mlflow.org/docs/latest/search-syntax.html#programmatically-searching-runs)

- Get the id of the experiment where we want to search runs

In [21]:
mlflow.get_experiment_by_name(experiment_name)

<Experiment: artifact_location='./mlruns/1', creation_time=1668956060756, experiment_id='1', last_update_time=1668956060756, lifecycle_stage='active', name='ep_prediction_with_random_forest', tags={}>

In [22]:
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
experiment_id

'1'

- Get all runs for the experiment

In [23]:
mlflow.search_runs(experiment_id)

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.testing_mse,metrics.training_rmse,metrics.testing_rmse,metrics.training_mae,metrics.training_score,metrics.training_r2_score,metrics.training_mse,params.max_depth,params.min_samples_split,params.criterion,params.min_samples_leaf,params.n_jobs,params.random_state,params.ccp_alpha,params.bootstrap,params.oob_score,params.max_leaf_nodes,params.min_impurity_decrease,params.verbose,params.n_estimators,params.max_features,params.warm_start,params.max_samples,params.min_weight_fraction_leaf,tags.estimator_class,tags.mlflow.runName,tags.mlflow.user,tags.mlflow.source.type,tags.mlflow.log-model.history,tags.mlflow.source.name,tags.estimator_name
0,38792789ec2a4f42915430f8ce7c4510,1,FINISHED,./mlruns/1/38792789ec2a4f42915430f8ce7c4510/ar...,2022-11-20 14:55:13.781000+00:00,2022-11-20 14:55:32.039000+00:00,14.609063,3.830164,3.82218,2.924678,0.949643,0.949643,14.670158,6,2,squared_error,1,,,0.0,True,False,,0.0,0,100,1.0,False,,0.0,sklearn.ensemble._forest.RandomForestRegressor,unruly-newt-128,tingf,LOCAL,"[{""run_id"": ""38792789ec2a4f42915430f8ce7c4510""...",C:\Users\tingf\AppData\Roaming\Python\Python31...,RandomForestRegressor
1,ebcacc0023ee4d61aa77d193187ba1f3,1,FINISHED,./mlruns/1/ebcacc0023ee4d61aa77d193187ba1f3/ar...,2022-11-20 14:54:57.413000+00:00,2022-11-20 14:55:13.732000+00:00,20.055973,4.480996,4.47839,3.466452,0.931075,0.931075,20.079325,4,2,squared_error,1,,,0.0,True,False,,0.0,0,100,1.0,False,,0.0,sklearn.ensemble._forest.RandomForestRegressor,overjoyed-shoat-923,tingf,LOCAL,"[{""run_id"": ""ebcacc0023ee4d61aa77d193187ba1f3""...",C:\Users\tingf\AppData\Roaming\Python\Python31...,RandomForestRegressor
2,b08b9556f07d43b59e10709abaafc714,1,FINISHED,./mlruns/1/b08b9556f07d43b59e10709abaafc714/ar...,2022-11-20 14:54:42.373000+00:00,2022-11-20 14:54:57.364000+00:00,36.733445,6.040081,6.060812,4.762977,0.874768,0.874768,36.482578,2,2,squared_error,1,,,0.0,True,False,,0.0,0,100,1.0,False,,0.0,sklearn.ensemble._forest.RandomForestRegressor,worried-pug-159,tingf,LOCAL,"[{""run_id"": ""b08b9556f07d43b59e10709abaafc714""...",C:\Users\tingf\AppData\Roaming\Python\Python31...,RandomForestRegressor


- Filter runs by max_depth and mse and order them by mse (more information about the filters can be found [here](https://www.mlflow.org/docs/latest/search-runs.html))

In [24]:
max_depth = 4
mlflow.search_runs(
    experiment_id,
    filter_string=f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= 40",
    order_by=['metrics.testing_mse asc']
)

Unnamed: 0,run_id,experiment_id,status,artifact_uri,start_time,end_time,metrics.testing_mse,metrics.training_rmse,metrics.testing_rmse,metrics.training_mae,metrics.training_score,metrics.training_r2_score,metrics.training_mse,params.max_depth,params.min_samples_split,params.criterion,params.min_samples_leaf,params.n_jobs,params.random_state,params.ccp_alpha,params.bootstrap,params.oob_score,params.max_leaf_nodes,params.min_impurity_decrease,params.verbose,params.n_estimators,params.max_features,params.warm_start,params.max_samples,params.min_weight_fraction_leaf,tags.estimator_class,tags.mlflow.runName,tags.mlflow.user,tags.mlflow.source.type,tags.mlflow.log-model.history,tags.mlflow.source.name,tags.estimator_name
0,ebcacc0023ee4d61aa77d193187ba1f3,1,FINISHED,./mlruns/1/ebcacc0023ee4d61aa77d193187ba1f3/ar...,2022-11-20 14:54:57.413000+00:00,2022-11-20 14:55:13.732000+00:00,20.055973,4.480996,4.47839,3.466452,0.931075,0.931075,20.079325,4,2,squared_error,1,,,0.0,True,False,,0.0,0,100,1.0,False,,0.0,sklearn.ensemble._forest.RandomForestRegressor,overjoyed-shoat-923,tingf,LOCAL,"[{""run_id"": ""ebcacc0023ee4d61aa77d193187ba1f3""...",C:\Users\tingf\AppData\Roaming\Python\Python31...,RandomForestRegressor


### Load a saved model

- [More information on the other formats of model_uri](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.load_model)

#### With the result of search_runs

In [25]:
run = mlflow.search_runs(
    experiment_id,
    filter_string=f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= 40",
    order_by=["metrics.testing_mse asc"]
).iloc[0]
run

run_id                                              ebcacc0023ee4d61aa77d193187ba1f3
experiment_id                                                                      1
status                                                                      FINISHED
artifact_uri                       ./mlruns/1/ebcacc0023ee4d61aa77d193187ba1f3/ar...
start_time                                          2022-11-20 14:54:57.413000+00:00
end_time                                            2022-11-20 14:55:13.732000+00:00
metrics.testing_mse                                                        20.055973
metrics.training_rmse                                                       4.480996
metrics.testing_rmse                                                         4.47839
metrics.training_mae                                                        3.466452
metrics.training_score                                                      0.931075
metrics.training_r2_score                                        

In [26]:
run.artifact_uri

'./mlruns/1/ebcacc0023ee4d61aa77d193187ba1f3/artifacts'

In [27]:
model = mlflow.sklearn.load_model(model_uri=f"{run.artifact_uri}/model")
model

In [28]:
model.predict(df[:5][["AT", "V", "AP", "RH"]])

array([464.27740344, 444.31355022, 485.82923519, 446.9251842 ,
       472.01396046])