For this hands-on, we will use the [Power Plant dataset](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant), where the goal is to predict the net hourly electrical energy output (PE) of a plant.

In [None]:
from datetime import datetime

import mlflow
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

pd.set_option("display.max_columns", None)

In [None]:
df = pd.read_csv("../data/power_plants.csv")
df.head()

# MLflow Tracking

## Model training

In [None]:
def train_model(train_df, max_depth=2):
    # Split data
    X = train_df[["AT", "V", "AP", "RH"]]
    y = train_df["PE"]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

    # Fit model
    model = RandomForestRegressor(max_depth=max_depth)
    model.fit(X_train, y_train)

    # Evaluate the model
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    rmse = mse ** 0.5  # square root of MSE (the `squared=False` argument was removed in recent scikit-learn versions)
    print(f"Test mse = {mse:.2f}, Test RMSE = {rmse:.2f}, Random forest max depth = {max_depth}")
    return model, mse, rmse

In [None]:
_ = train_model(df, max_depth=2)
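
The two metrics printed by `train_model` are related: the RMSE is the square root of the MSE, which brings the error back to the unit of the target (PE). A minimal sketch of that relationship, using made-up values that are not taken from the power plant data:

```python
import math

# Illustrative values only (not taken from the power plant data)
y_true = [450.0, 460.0, 470.0]
y_pred = [452.0, 457.0, 470.0]

# MSE: mean of the squared residuals
squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
mse = sum(squared_errors) / len(squared_errors)

# RMSE: square root of the MSE, expressed in the same unit as the target
rmse = math.sqrt(mse)

print(f"mse = {mse:.2f}, rmse = {rmse:.2f}")
```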

- Try different values of `max_depth` for the random forest

In [None]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

## Experiment tracking

### Some vocabulary:
- **run**: a single execution of model training code. Each run can record various pieces of information (model parameters, metrics, tags, artifacts, etc.).
- **experiment**: the primary unit of organization and access control for MLflow runs; all MLflow runs belong to an experiment. Experiments let you visualize, search for, and compare runs, as well as download run artifacts and metadata for analysis in other tools.

In [None]:
!ls

In [None]:
experiment_name = "ep_prediction_with_random_forest"
mlflow.set_experiment(experiment_name)

In [None]:
!ls

In [None]:
!tree mlruns

In [None]:
!cat mlruns/1/meta.yaml

### Basic logging
- Log the model hyper-parameters, the metrics and the model itself

In [None]:
def train_model(train_df, max_depth=2):
    with mlflow.start_run():
        # Split data
        X = train_df[["AT", "V", "AP", "RH"]]
        y = train_df["PE"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Fit model
        model = RandomForestRegressor(max_depth=max_depth)
        model.fit(X_train, y_train)
        ## mlflow: log model & its hyper-parameters
        mlflow.log_param("max_depth", max_depth)
        mlflow.sklearn.log_model(model, "model")

        # Evaluate the model
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        rmse = mse ** 0.5  # square root of MSE (the `squared=False` argument was removed in recent scikit-learn versions)
        ## mlflow: log metrics
        mlflow.log_metrics({"testing_mse": mse, "testing_rmse": rmse})
        print(f"Test mse = {mse:.2f}, Test RMSE = {rmse:.2f}, Random forest max depth = {max_depth}")

- Run the function with mlflow tracking

In [None]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

### Visualize experiments with MLflow tracking UI

To open the [MLflow Tracking UI](https://www.mlflow.org/docs/latest/tracking.html#tracking-ui), either run `mlflow ui` (it must be executed from the *notebooks* folder) or start an *mlflow server* (used in the following section).

### Where mlflow saves the data

#### Some vocabulary:
- **Backend store**: for MLflow entities (runs, parameters, metrics, tags, notes, metadata, etc)
- **Artifact store**: for artifacts (files, models, images, in-memory objects, etc)
- For more information, [check the official documentation](https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded)

#### Without prior configuration
- When no prior configuration is set, MLflow creates an *mlruns* folder where the data is saved

In [None]:
!ls

- MLflow created a new *mlruns* folder where it stores the information of the different runs

In [None]:
!tree mlruns

#### With prior configuration

- Let's start by shutting down the `mlflow ui` and removing the `mlruns` folder

In [None]:
!rm -rf mlruns

- Set the **Backend store** to a SQLite database located in */tmp/mlruns.db* and the **Artifact store** to a folder located in */tmp/mlruns*. For more information on the other available options (S3, blob storage, etc.), check [the official documentation](https://www.mlflow.org/docs/latest/tracking.html#where-runs-are-recorded).
- To run the MLflow server, you need to:
    - stop the execution of the UI (`mlflow ui` command)
    - execute the following command:
        - Linux: ```mlflow server --backend-store-uri sqlite:////tmp/mlruns.db --default-artifact-root /tmp/mlruns```
        - Windows: ```mlflow server --backend-store-uri sqlite:///mlruns.db --default-artifact-root mlruns```
- Set the tracking uri in the notebook ```mlflow.set_tracking_uri('http://127.0.0.1:5000')```
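
The number of slashes in the backend store URI matters: the `sqlite:///` prefix is followed by the database path, so an absolute path such as */tmp/mlruns.db* gives four slashes, while a relative path gives three. A small sketch of that rule, assuming POSIX-style paths (the `sqlite_uri` helper is hypothetical and not part of MLflow):

```python
from pathlib import PurePosixPath


def sqlite_uri(db_path: str) -> str:
    """Build a SQLite URI from a POSIX-style path (hypothetical helper)."""
    # The prefix is always "sqlite:///"; an absolute path adds its own leading slash
    return f"sqlite:///{PurePosixPath(db_path)}"


print(sqlite_uri("/tmp/mlruns.db"))  # absolute path
print(sqlite_uri("mlruns.db"))       # relative to the working directory
```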

In [None]:
mlflow.set_tracking_uri('http://127.0.0.1:5000')

In [None]:
# Create the experiment in the new database
experiment_name = "ep_prediction_with_random_forest"
mlflow.set_experiment(experiment_name=experiment_name)

### Logging with autolog

- Autolog logs the model parameters, training metrics, model binary, etc., **but not the test metrics**; they need to be logged manually

In [None]:
def train_model(train_df, max_depth=2):
    training_timestamp = datetime.now().strftime('%Y-%m-%d, %H:%M:%S')
    with mlflow.start_run(run_name=f"model_{training_timestamp}"):

        mlflow.autolog()
        
        # Split data
        X = train_df[["AT", "V", "AP", "RH"]]
        y = train_df["PE"]
        X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

        # Fit model
        model = RandomForestRegressor(max_depth=max_depth)
        model.fit(X_train, y_train)

        # Evaluate the model
        y_pred = model.predict(X_test)
        mse = mean_squared_error(y_test, y_pred)
        rmse = mse ** 0.5  # square root of MSE (the `squared=False` argument was removed in recent scikit-learn versions)
        ## mlflow: log metrics
        mlflow.log_metrics({"testing_mse": mse, "testing_rmse": rmse})
        print(f"Test mse = {mse:.2f}, Test RMSE = {rmse:.2f}, Random forest max depth = {max_depth}")

In [None]:
for max_depth in range(2, 7, 2):
    _ = train_model(df, max_depth=max_depth)

### Search runs

- [In the UI directly](https://www.mlflow.org/docs/latest/search-syntax.html#search)
- [Programmatically with search_runs](https://www.mlflow.org/docs/latest/search-syntax.html#programmatically-searching-runs)

- Get the id of the experiment where we want to search runs

In [None]:
mlflow.get_experiment_by_name(experiment_name)

In [None]:
experiment_id = mlflow.get_experiment_by_name(experiment_name).experiment_id
experiment_id

- Get all runs for the experiment

In [None]:
mlflow.search_runs(experiment_id)

- Filter runs by max_depth and mse and order them by mse (more information about the filters can be found [here](https://www.mlflow.org/docs/latest/search-runs.html))

In [None]:
max_depth = 4
mlflow.search_runs(
    experiment_id,
    filter_string=f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= 40",
    order_by=['metrics.testing_mse asc']
)
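
The quoting in the filter string follows from how MLflow stores values: parameters are stored as strings, so `max_depth` is compared against a quoted value, while metrics are numeric and take an unquoted threshold. A minimal sketch that composes the same filter (the `build_filter` helper is hypothetical and not part of MLflow):

```python
def build_filter(max_depth: int, max_mse: float) -> str:
    """Compose an MLflow search filter (hypothetical helper)."""
    # Parameters are stored as strings, so the value is quoted;
    # metrics are numeric, so the threshold is left unquoted.
    clauses = [
        f"params.max_depth = '{max_depth}'",
        f"metrics.testing_mse <= {max_mse}",
    ]
    return " AND ".join(clauses)


filter_string = build_filter(max_depth=4, max_mse=40)
print(filter_string)
```

The resulting string can be passed as `filter_string` to `mlflow.search_runs`.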

### Load a saved model

- [More information on the other formats of model_uri](https://www.mlflow.org/docs/latest/python_api/mlflow.sklearn.html#mlflow.sklearn.load_model)

#### With the result of search_runs

In [None]:
run = mlflow.search_runs(
    experiment_id,
    filter_string=f"params.max_depth = '{max_depth}' AND metrics.testing_mse <= 40",
    order_by=["metrics.testing_mse asc"]
).iloc[0]
run

In [None]:
run.artifact_uri

In [None]:
model = mlflow.sklearn.load_model(model_uri=f"{run.artifact_uri}/model")
model
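
Another `model_uri` format is `runs:/<run_id>/<artifact_path>`, which refers to the model through its run rather than through the physical artifact location. A sketch of how such a URI could be built (the `runs_uri` helper is hypothetical, and the run id below is a placeholder):

```python
def runs_uri(run_id: str, artifact_path: str = "model") -> str:
    """Build a 'runs:/' model URI (hypothetical helper)."""
    return f"runs:/{run_id}/{artifact_path}"


# Placeholder run id for illustration; in the notebook it would be run.run_id
example_uri = runs_uri("0123456789abcdef")
print(example_uri)
```

In the notebook, `runs_uri(run.run_id)` could then be passed as `model_uri` to `mlflow.sklearn.load_model`.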

In [None]:
model.predict(df[:5][["AT", "V", "AP", "RH"]])

- Loading the model independently from the framework with `mlflow.pyfunc.load_model`

In [None]:
loaded_model = mlflow.pyfunc.load_model(f"{run.artifact_uri}/model")
loaded_model.predict(df[:5][["AT", "V", "AP", "RH"]])

- Clear the backend and artifact stores (for Linux)

In [None]:
!rm -rf /tmp/mlruns /tmp/mlruns.db