This hands-on text refers to the [Power Plant dataset](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant), where the goal is to predict the net hourly electrical energy output (EP) of a plant. The data loaded below, however, is a house price table, so the models in this notebook predict `SalePrice`.

In [5]:
from datetime import datetime

import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

pd.set_option("display.max_columns", None)

In [6]:
df = pd.read_csv("../../data/train.csv")
df.head()

Unnamed: 0,Id,MSSubClass,MSZoning,LotFrontage,LotArea,Street,Alley,LotShape,LandContour,Utilities,LotConfig,LandSlope,Neighborhood,Condition1,Condition2,BldgType,HouseStyle,OverallQual,OverallCond,YearBuilt,YearRemodAdd,RoofStyle,RoofMatl,Exterior1st,Exterior2nd,MasVnrType,MasVnrArea,ExterQual,ExterCond,Foundation,BsmtQual,BsmtCond,BsmtExposure,BsmtFinType1,BsmtFinSF1,BsmtFinType2,BsmtFinSF2,BsmtUnfSF,TotalBsmtSF,Heating,HeatingQC,CentralAir,Electrical,1stFlrSF,2ndFlrSF,LowQualFinSF,GrLivArea,BsmtFullBath,BsmtHalfBath,FullBath,HalfBath,BedroomAbvGr,KitchenAbvGr,KitchenQual,TotRmsAbvGrd,Functional,Fireplaces,FireplaceQu,GarageType,GarageYrBlt,GarageFinish,GarageCars,GarageArea,GarageQual,GarageCond,PavedDrive,WoodDeckSF,OpenPorchSF,EnclosedPorch,3SsnPorch,ScreenPorch,PoolArea,PoolQC,Fence,MiscFeature,MiscVal,MoSold,YrSold,SaleType,SaleCondition,SalePrice
0,1,60,RL,65.0,8450,Pave,,Reg,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2003,2003,Gable,CompShg,VinylSd,VinylSd,BrkFace,196.0,Gd,TA,PConc,Gd,TA,No,GLQ,706,Unf,0,150,856,GasA,Ex,Y,SBrkr,856,854,0,1710,1,0,2,1,3,1,Gd,8,Typ,0,,Attchd,2003.0,RFn,2,548,TA,TA,Y,0,61,0,0,0,0,,,,0,2,2008,WD,Normal,208500
1,2,20,RL,80.0,9600,Pave,,Reg,Lvl,AllPub,FR2,Gtl,Veenker,Feedr,Norm,1Fam,1Story,6,8,1976,1976,Gable,CompShg,MetalSd,MetalSd,,0.0,TA,TA,CBlock,Gd,TA,Gd,ALQ,978,Unf,0,284,1262,GasA,Ex,Y,SBrkr,1262,0,0,1262,0,1,2,0,3,1,TA,6,Typ,1,TA,Attchd,1976.0,RFn,2,460,TA,TA,Y,298,0,0,0,0,0,,,,0,5,2007,WD,Normal,181500
2,3,60,RL,68.0,11250,Pave,,IR1,Lvl,AllPub,Inside,Gtl,CollgCr,Norm,Norm,1Fam,2Story,7,5,2001,2002,Gable,CompShg,VinylSd,VinylSd,BrkFace,162.0,Gd,TA,PConc,Gd,TA,Mn,GLQ,486,Unf,0,434,920,GasA,Ex,Y,SBrkr,920,866,0,1786,1,0,2,1,3,1,Gd,6,Typ,1,TA,Attchd,2001.0,RFn,2,608,TA,TA,Y,0,42,0,0,0,0,,,,0,9,2008,WD,Normal,223500
3,4,70,RL,60.0,9550,Pave,,IR1,Lvl,AllPub,Corner,Gtl,Crawfor,Norm,Norm,1Fam,2Story,7,5,1915,1970,Gable,CompShg,Wd Sdng,Wd Shng,,0.0,TA,TA,BrkTil,TA,Gd,No,ALQ,216,Unf,0,540,756,GasA,Gd,Y,SBrkr,961,756,0,1717,1,0,1,0,3,1,Gd,7,Typ,1,Gd,Detchd,1998.0,Unf,3,642,TA,TA,Y,0,35,272,0,0,0,,,,0,2,2006,WD,Abnorml,140000
4,5,60,RL,84.0,14260,Pave,,IR1,Lvl,AllPub,FR2,Gtl,NoRidge,Norm,Norm,1Fam,2Story,8,5,2000,2000,Gable,CompShg,VinylSd,VinylSd,BrkFace,350.0,Gd,TA,PConc,Gd,TA,Av,GLQ,655,Unf,0,490,1145,GasA,Ex,Y,SBrkr,1145,1053,0,2198,1,0,2,1,4,1,Gd,9,Typ,1,TA,Attchd,2000.0,RFn,3,836,TA,TA,Y,192,84,0,0,0,0,,,,0,12,2008,WD,Normal,250000


# MLflow Tracking

## Model training

In [7]:
def train_model(train_df, max_depth=2):
    # Select the features and the target, then split into train and test sets
    X = train_df[["OverallQual", "GrLivArea", "GarageArea", "TotalBsmtSF"]]
    y = train_df["SalePrice"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit model
    model = RandomForestRegressor(max_depth=max_depth)
    model.fit(X_train, y_train)

    # Evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    # RMSE is the square root of MSE (the `squared=False` argument is deprecated in newer scikit-learn releases)
    rmse = mse ** 0.5
    print(f"Test mse = {mse:.2f}, Test RMSE = {rmse:.2f}, Random forest max depth = {max_depth}")
    return model, mse, rmse

In [8]:
_ = train_model(df, max_depth=2)

Test mse = 2310883773.20, Test RMSE = 48071.65, Random forest max depth = 2
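RMSE is the square root of MSE, so the two numbers printed above carry the same information on different scales. The check below is a small sketch that recomputes the RMSE from the MSE reported in the output; the variable names are illustrative.

```python
import math

# MSE copied from the output of the training cell above (squared price units)
reported_mse = 2310883773.20

# RMSE is the square root of MSE, expressed in the same unit as SalePrice
rmse = math.sqrt(reported_mse)
print(f"RMSE = {rmse:.2f}")
```

The printed value should agree with the `Test RMSE` shown above, up to rounding.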


- Test with different max depths for the Random forest

In [9]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

Test mse = 2305080520.68, Test RMSE = 48011.25, Random forest max depth = 2
Test mse = 1329000917.21, Test RMSE = 36455.46, Random forest max depth = 4
Test mse = 1029549731.68, Test RMSE = 32086.60, Random forest max depth = 6
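Comparing three printed lines by eye is manageable, but it does not scale to many runs. As a minimal sketch, the values from the output above can be stored in a dictionary (a hypothetical structure, not something the notebook builds) and the best depth selected programmatically.

```python
# Test RMSE per max_depth, copied from the output above
rmse_by_depth = {2: 48011.25, 4: 36455.46, 6: 32086.60}

# Select the depth with the smallest test RMSE
best_depth = min(rmse_by_depth, key=rmse_by_depth.get)
print(f"Best max_depth = {best_depth} (RMSE = {rmse_by_depth[best_depth]:.2f})")
```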


## Experiment tracking

### Some vocabulary:
- **run**: a single execution of the model training code. Each run can record different kinds of information (model parameters, metrics, tags, artifacts, etc.).
- **experiment**: the primary unit of organization and access control for MLflow runs; all MLflow runs belong to an experiment. Experiments let you visualize, search for, and compare runs, as well as download run artifacts and metadata for analysis in other tools.
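The relationship between the two terms can be pictured with a toy data model. The classes below (`ToyRun`, `ToyExperiment`) are purely illustrative and are not part of the MLflow API; they only show that an experiment groups runs and that each run carries its own parameters and metrics.

```python
from dataclasses import dataclass, field


@dataclass
class ToyRun:
    """Illustrative stand-in for a run: one execution with its recorded data."""
    params: dict = field(default_factory=dict)
    metrics: dict = field(default_factory=dict)


@dataclass
class ToyExperiment:
    """Illustrative stand-in for an experiment: a named group of runs."""
    name: str
    runs: list = field(default_factory=list)


# One experiment, three runs that differ only by a hyper-parameter
experiment = ToyExperiment(name="ep_prediction_with_random_forest")
for depth in (2, 4, 6):
    experiment.runs.append(ToyRun(params={"max_depth": depth}))

print(f"{experiment.name}: {len(experiment.runs)} runs")
```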

In [10]:
!dir

 Volume in drive D is Data
 Volume Serial Number is 122B-6F5D

 Directory of d:\study\DSP\DSP-Tingfen-YU\mlflow_hands_on-master\notebooks

20/11/2022  16:24    <DIR>          .
20/11/2022  16:24    <DIR>          ..
20/11/2022  16:27            81,526 mlflow_tracking_hands_on-house_price.ipynb
19/11/2022  20:53            55,866 mlflow_tracking_hands_on.ipynb
18/11/2022  14:38    <DIR>          mlruns
               2 File(s)        137,392 bytes
               3 Dir(s)  195,672,543,232 bytes free


In [11]:
experiment_name = "ep_prediction_with_random_forest"
mlflow.set_experiment(experiment_name)

<Experiment: artifact_location='file:///D:/study/DSP/DSP-Tingfen-YU/mlflow_hands_on-master/notebooks/mlruns/1', creation_time=1668778706169, experiment_id='1', last_update_time=1668778706169, lifecycle_stage='active', name='ep_prediction_with_random_forest', tags={}>

In [12]:
!dir

 Volume in drive D is Data
 Volume Serial Number is 122B-6F5D

 Directory of d:\study\DSP\DSP-Tingfen-YU\mlflow_hands_on-master\notebooks

20/11/2022  16:24    <DIR>          .
20/11/2022  16:24    <DIR>          ..
20/11/2022  16:27            81,526 mlflow_tracking_hands_on-house_price.ipynb
19/11/2022  20:53            55,866 mlflow_tracking_hands_on.ipynb
18/11/2022  14:38    <DIR>          mlruns
               2 File(s)        137,392 bytes
               3 Dir(s)  195,672,543,232 bytes free


In [13]:
!tree mlruns

Folder PATH listing for volume Data
Volume serial number is 122B-6F5D
D:\STUDY\DSP\DSP-TINGFEN-YU\MLFLOW_HANDS_ON-MASTER\NOTEBOOKS\MLRUNS
+---.trash
+---0
|   +---775862801ff94a3789fadc378d0181e5
|   |   +---artifacts
|   |   |   +---model
|   |   +---metrics
|   |   +---params
|   |   +---tags
|   +---817881f5b6d045aabd2a90982ca74218
|   |   +---artifacts
|   |   |   +---model
|   |   +---metrics
|   |   +---params
|   |   +---tags
|   +---abf561b94e684d38a5abd679a7e580b8
|       +---artifacts
|       |   +---model
|       +---metrics
|       +---params
|       +---tags
+---1
    +---07c7546a580441dda55647695a9049f9
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +---tags
    +---1866408c6b174d0ab64296bcf906f8ac
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +---tags
    +---38792789ec2a4f42915430f8ce7c4510
    |   +---artifacts
    |       +---model
    +---4496cf78e7194dda9b6dc43b5423de19
    |

In [14]:
!cat mlruns/1/meta.yaml

'cat' is not recognized as an internal or external command,
operable program or batch file.


In [15]:
!clip < mlruns/1/meta.yaml
# Copy the content of meta.yaml to the Windows clipboard (`clip` does not print it; `type mlruns\1\meta.yaml` would display it in the cell output)
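Since `cat` is not available in the Windows command prompt, a portable option is to read the file from Python. The helper below is a minimal sketch; `read_text_file` is a hypothetical name, and the path is the one used in the cells above.

```python
from pathlib import Path


def read_text_file(path):
    """Return the content of a text file; behaves the same on Windows, macOS and Linux."""
    return Path(path).read_text(encoding="utf-8")


# Print the experiment metadata only if the file exists in the current folder
meta_path = Path("mlruns/1/meta.yaml")
if meta_path.exists():
    print(read_text_file(meta_path))
```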

### Basic logging
- Log the model hyper-parameters, the metrics and the model itself

In [16]:
def train_model(train_df, max_depth=2):
    with mlflow.start_run():
        # Split data
        X = train_df[["OverallQual", "GrLivArea", "GarageArea", "TotalBsmtSF"]]
        y = train_df["SalePrice"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Fit model
        model = RandomForestRegressor(max_depth=max_depth)
        model.fit(X_train, y_train)
        ## mlflow: log model & its hyper-parameters
        mlflow.log_param("max_depth", max_depth)
        mlflow.sklearn.log_model(model, "model")

        # Evaluate the model
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        # RMSE is the square root of MSE (the `squared=False` argument is deprecated in newer scikit-learn releases)
        rmse = mse ** 0.5
        ## mlflow: log metrics
        mlflow.log_metrics({"testing_mse": mse, "testing_rmse": rmse})
        print(f"Test mse = {mse:.2f}, Test RMSE = {rmse:.2f}, Random forest max depth = {max_depth}")

- Run the function with mlflow tracking

In [20]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

Test mse = 2310512965.85, Test RMSE = 48067.80, Random forest max depth = 2
Test mse = 1296437429.18, Test RMSE = 36006.07, Random forest max depth = 4
Test mse = 1026876060.66, Test RMSE = 32044.91, Random forest max depth = 6


### Visualize experiments with MLflow tracking UI

To open the [MLflow Tracking UI](https://www.mlflow.org/docs/latest/tracking.html#tracking-ui), either run ```mlflow ui``` (from the *notebooks* folder, where the *mlruns* folder lives) or start an *mlflow server* (used in the following section).

### Where mlflow saves the data

#### Some vocabulary:
- **Backend store**: for MLflow entities (runs, parameters, metrics, tags, notes, metadata, etc)
- **Artifact store**: for artifacts (files, models, images, in-memory objects, etc)
- For more information, [check the official documentation](https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded)

#### Without prior configuration
- When no prior configuration is set, MLflow creates an *mlruns* folder in the current working directory, where the data will be saved

In [18]:
!dir

 Volume in drive D is Data
 Volume Serial Number is 122B-6F5D

 Directory of d:\study\DSP\DSP-Tingfen-YU\mlflow_hands_on-master\notebooks

20/11/2022  16:24    <DIR>          .
20/11/2022  16:24    <DIR>          ..
20/11/2022  16:27            81,526 mlflow_tracking_hands_on-house_price.ipynb
19/11/2022  20:53            55,866 mlflow_tracking_hands_on.ipynb
18/11/2022  14:38    <DIR>          mlruns
               2 File(s)        137,392 bytes
               3 Dir(s)  195,671,478,272 bytes free


- MLflow created a new folder *mlruns* where it stores the information recorded for each run
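The content of that folder can also be inspected from Python, which avoids relying on platform-specific shell commands. The sketch below assumes the layout seen in this notebook (one sub-folder per experiment ID, each containing one sub-folder per run ID); `list_runs` is a hypothetical helper, not an MLflow function.

```python
from pathlib import Path


def list_runs(root="mlruns"):
    """Map each experiment folder name to the sorted run folder names inside it."""
    runs = {}
    root_path = Path(root)
    if not root_path.is_dir():
        return runs
    for exp_dir in sorted(root_path.iterdir()):
        # Skip plain files and hidden folders such as .trash
        if not exp_dir.is_dir() or exp_dir.name.startswith("."):
            continue
        # Keep only sub-folders: files such as meta.yaml are not runs
        runs[exp_dir.name] = sorted(p.name for p in exp_dir.iterdir() if p.is_dir())
    return runs


for experiment_id, run_ids in list_runs().items():
    print(f"experiment {experiment_id}: {len(run_ids)} run(s)")
```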

In [19]:
!tree mlruns

Folder PATH listing for volume Data
Volume serial number is 122B-6F5D
D:\STUDY\DSP\DSP-TINGFEN-YU\MLFLOW_HANDS_ON-MASTER\NOTEBOOKS\MLRUNS
+---.trash
+---0
|   +---775862801ff94a3789fadc378d0181e5
|   |   +---artifacts
|   |   |   +---model
|   |   +---metrics
|   |   +---params
|   |   +---tags
|   +---817881f5b6d045aabd2a90982ca74218
|   |   +---artifacts
|   |   |   +---model
|   |   +---metrics
|   |   +---params
|   |   +---tags
|   +---abf561b94e684d38a5abd679a7e580b8
|       +---artifacts
|       |   +---model
|       +---metrics
|       +---params
|       +---tags
+---1
    +---07c7546a580441dda55647695a9049f9
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +---tags
    +---1866408c6b174d0ab64296bcf906f8ac
    |   +---artifacts
    |   |   +---model
    |   +---metrics
    |   +---params
    |   +---tags
    +---38792789ec2a4f42915430f8ce7c4510
    |   +---artifacts
    |       +---model
    +---4496cf78e7194dda9b6dc43b5423de19
    |

#### With prior configuration
- Set the **Backend store** to a SQLite database located in */tmp/mlruns.db* and the **Artifact store** to a folder located in */tmp/mlruns*. For more information on the other available options (S3, blob storage, etc.), check [the official documentation](https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded).
- To run the MLflow server, you need to:
    - stop the execution of the UI (`mlflow ui` command)
    - execute the following command:
        - Linux: ```mlflow server --backend-store-uri sqlite:////tmp/mlruns.db --default-artifact-root /tmp/mlruns```
        - Windows: ```mlflow server --backend-store-uri sqlite:///mlruns.db --default-artifact-root mlruns```
- Set the tracking URI in the notebook with ```mlflow.set_tracking_uri('http://127.0.0.1:5000')```
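The number of slashes in the two SQLite URIs above is easy to get wrong: the `sqlite:///` prefix is followed by the database path, so an absolute POSIX path (which itself starts with `/`) yields four slashes, while a relative path yields three. The helper below is a small sketch of that rule; `sqlite_uri` is a hypothetical name.

```python
def sqlite_uri(db_path):
    """Prefix a database path with the 'sqlite:///' scheme used in the commands above."""
    return "sqlite:///" + db_path


# Absolute POSIX path: the leading "/" of the path gives four slashes in total
print(sqlite_uri("/tmp/mlruns.db"))
# Relative path: three slashes, resolved from the folder where the server is started
print(sqlite_uri("mlruns.db"))
```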

In [None]:
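# Alternative: use a PostgreSQL database as the backend store (requires the psycopg2 driver), e.g. postgresql://mlflow_user:mlflow@localhost/mlflow_db
# This command blocks the cell while the server is running; launching it from a separate terminal avoids that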
!mlflow server --backend-store-uri postgresql://postgres:ytf@localhost/TTTest --default-artifact-root file:///D:/study/DSP/DSP-Tingfen-YU/mlflow_hands_on-master/notebooks/mlruns

In [None]:
mlflow.set_tracking_uri('http://127.0.0.1:5000')

In [None]:
# Create the experiment in the new database
experiment_name = "ep_prediction_with_random_forest"
mlflow.set_experiment(experiment_name=experiment_name)

### Logging with autolog

- Autolog will log all the model parameters, training metrics, model binary, etc. **BUT not the test metrics**: they need to be logged manually

In [None]:
def train_model(train_df, max_depth=2):
    training_timestamp = datetime.now().strftime('%Y-%m-%d, %H:%M:%S')
    with mlflow.start_run(run_name=f"model_{training_timestamp}"):

        mlflow.autolog()
        
        # Split data
        X = train_df[["OverallQual", "GrLivArea", "GarageArea", "TotalBsmtSF"]]
        y = train_df["SalePrice"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Fit model
        model = RandomForestRegressor(max_depth=max_depth)
        model.fit(X_train, y_train)

        # Evaluate the model
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        # RMSE is the square root of MSE (the `squared=False` argument is deprecated in newer scikit-learn releases)
        rmse = mse ** 0.5
        ## mlflow: log metrics
        mlflow.log_metrics({"testing_mse": mse, "testing_rmse": rmse})
        print(f"Test mse = {mse}, Test RMSE = {rmse}, Random forest max depth = {max_depth}")

In [None]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

### Search runs

- [In the UI directly](https://www.mlflow.org/docs/latest/search-syntax.html#search)
- [Programmatically with search_runs](https://www.mlflow.org/docs/latest/search-syntax.html#programmatically-searching-runs)

- Get the id of the experiment where we want to search runs

In [None]:
mlflow.get_experiment_by_name(experiment_name)

In [None]:
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
experiment_id

- Get all runs for the experiment

In [None]:
# search_runs takes a list of experiment IDs
mlflow.search_runs([experiment_id])

- Filter runs by `max_depth` and test MSE, and order them by test MSE (more information about the filters can be found in the [search documentation](https://www.mlflow.org/docs/latest/search-runs.html))

In [None]:
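max_depth = 4
# The test MSE on this dataset is of the order of 1e9 (see the outputs above),
# so the threshold must be of that order for the filter to match any run
mse_threshold = 1_500_000_000
mlflow.search_runs(
    [experiment_id],
    filter_string=f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= {mse_threshold}",
    order_by=["metrics.testing_mse asc"]
)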
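In the filter string above, the parameter value is quoted because MLflow stores parameters as strings, while the metric is compared as a number. The helper below is a sketch that assembles such a string; `build_filter` is a hypothetical name and the threshold is illustrative, chosen to be of the same order as the MSE values printed earlier.

```python
def build_filter(params=None, max_metrics=None):
    """Build a search filter: params compared as quoted strings, metrics as numbers."""
    clauses = []
    for key, value in (params or {}).items():
        # Parameters are stored as strings, so the value is quoted
        clauses.append(f"params.{key} = '{value}'")
    for key, limit in (max_metrics or {}).items():
        # Metrics are numeric, so the upper bound is left unquoted
        clauses.append(f"metrics.{key} <= {limit}")
    return " AND ".join(clauses)


print(build_filter({"max_depth": 4}, {"testing_mse": 1500000000}))
```

The resulting string can be passed as the `filter_string` argument of `mlflow.search_runs`.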

### Load a saved model

- [More information on the other `model_uri` formats](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.load_model)
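One of those other formats is the `runs:/<run_id>/<artifact_path>` scheme, which refers to a model by run ID instead of by storage location. The sketch below only builds such a URI; `runs_model_uri` is a hypothetical helper, and the `model` artifact path is assumed to match the name passed to `log_model` earlier in this notebook.

```python
def runs_model_uri(run_id, artifact_path="model"):
    """Build a 'runs:/<run_id>/<artifact_path>' URI for a model logged in a run."""
    return f"runs:/{run_id}/{artifact_path}"


# Run ID taken from the mlruns listing shown earlier in this notebook
print(runs_model_uri("07c7546a580441dda55647695a9049f9"))
```

Such a URI can be passed to `mlflow.sklearn.load_model`, provided the tracking URI points to the store that holds the run.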

#### With the result of search_runs

In [None]:
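# Threshold of the same order as the test MSE values printed earlier (about 1e9)
mse_threshold = 1_500_000_000
runs = mlflow.search_runs(
    [experiment_id],
    filter_string=f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= {mse_threshold}",
    order_by=["metrics.testing_mse asc"]
)
# .iloc[0] raises an IndexError when no run matches the filter
run = runs.iloc[0]
run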

In [None]:
run.artifact_uri

In [None]:
model = mlflow.sklearn.load_model(model_uri=f"{run.artifact_uri}/model")
model

In [None]:
model.predict(df[:5][["OverallQual", "GrLivArea", "GarageArea", "TotalBsmtSF"]])