# Workshop MLflow

__*Objectives*__

Use **MLflow Tracking** to follow a model's key metrics

__*Steps*__

1. Import Libs
2. Load cleaned data
    * Load CSV
    * Short Data description
3. Understand the data
    * Pandas Profiling
4. Machine Learning
    * Preprocess
    * Metrics
    * Models
    * Results
5. MLflow
    

In [1]:
import pandas as pd
import numpy as np
import os
import pandas_profiling as pdp
import mlflow
import mlflow.sklearn
from matplotlib import pyplot as plt
from sklearn.metrics import classification_report



In [2]:
# Display all the columns of Pandas Dataframe
pd.options.display.max_columns = None

## Import data set

In [3]:
df_all = pd.read_csv('./data/energydata_complete.csv')

print('nb observations: {} - nb features: {}'.format(*df_all.shape))
df_all.head()

nb observations: 19735 - nb features: 29


Unnamed: 0,date,Appliances,lights,T1,RH_1,T2,RH_2,T3,RH_3,T4,RH_4,T5,RH_5,T6,RH_6,T7,RH_7,T8,RH_8,T9,RH_9,T_out,Press_mm_hg,RH_out,Windspeed,Visibility,Tdewpoint,rv1,rv2
0,2016-01-11 17:00:00,60,30,19.89,47.596667,19.2,44.79,19.79,44.73,19.0,45.566667,17.166667,55.2,7.026667,84.256667,17.2,41.626667,18.2,48.9,17.033333,45.53,6.6,733.5,92.0,7.0,63.0,5.3,13.275433,13.275433
1,2016-01-11 17:10:00,60,30,19.89,46.693333,19.2,44.7225,19.79,44.79,19.0,45.9925,17.166667,55.2,6.833333,84.063333,17.2,41.56,18.2,48.863333,17.066667,45.56,6.483333,733.6,92.0,6.666667,59.166667,5.2,18.606195,18.606195
2,2016-01-11 17:20:00,50,30,19.89,46.3,19.2,44.626667,19.79,44.933333,18.926667,45.89,17.166667,55.09,6.56,83.156667,17.2,41.433333,18.2,48.73,17.0,45.5,6.366667,733.7,92.0,6.333333,55.333333,5.1,28.642668,28.642668
3,2016-01-11 17:30:00,50,40,19.89,46.066667,19.2,44.59,19.79,45.0,18.89,45.723333,17.166667,55.09,6.433333,83.423333,17.133333,41.29,18.1,48.59,17.0,45.4,6.25,733.8,92.0,6.0,51.5,5.0,45.410389,45.410389
4,2016-01-11 17:40:00,60,40,19.89,46.333333,19.2,44.53,19.79,45.0,18.89,45.53,17.2,55.09,6.366667,84.893333,17.2,41.23,18.1,48.59,17.0,45.4,6.133333,733.9,92.0,5.666667,47.666667,4.9,10.084097,10.084097


## Information about the data set

Column Name  | Description | Unit
------------ | ----------- | -----------
date | Timestamp (year-month-day hour:minute:second) | 
lights | Energy use of light fixtures in the house | Wh
T1 | Temperature in kitchen area | Celsius
RH_1 | Humidity in kitchen area | %
T2 | Temperature in living room area | Celsius
RH_2 | Humidity in living room area | %
T3 | Temperature in laundry room area | Celsius
RH_3 | Humidity in laundry room area | %
T4 | Temperature in office room | Celsius
RH_4 | Humidity in office room | %
T5 | Temperature in bathroom | Celsius
RH_5 | Humidity in bathroom | %
T6 | Temperature outside the building (north side) | Celsius
RH_6 | Humidity outside the building (north side) | %
T7 | Temperature in ironing room | Celsius
RH_7 | Humidity in ironing room | %
T8 | Temperature in teenager room 2 | Celsius
RH_8 | Humidity in teenager room 2 | %
T9 | Temperature in parents room | Celsius
RH_9 | Humidity in parents room | %
T_out | Temperature outside (from Chievres weather station) | Celsius
Press_mm_hg | Pressure (from Chievres weather station) | mm Hg
RH_out | Humidity outside (from Chievres weather station) | %
Windspeed | Wind speed (from Chievres weather station) | m/s
Visibility | Visibility (from Chievres weather station) | km
Tdewpoint | Dew point temperature (from Chievres weather station) | Celsius
rv1 | Random variable 1 | nondimensional
rv2 | Random variable 2 | nondimensional
Appliances | Energy use of appliances (target) | Wh


We will create a report named `report-all-data.html` in the `./analysis` directory.
This report helps us understand the distributions and correlations in the data set. You can go into that directory and open the report in your browser.

In [4]:
# Drop the timestamp and the two random variables (rv1, rv2 are noise columns, only useful for robustness checks)
df_all.drop(columns=['date', 'rv1', 'rv2'], inplace=True)

## Generate the analysis report: [PandasProfiling](https://github.com/pandas-profiling/pandas-profiling)

In [6]:
# Create the output directory if it does not exist yet
os.makedirs("./analysis", exist_ok=True)

In [7]:
profile = pdp.ProfileReport(df_all)
profile.to_file(outputfile="./analysis/report-all-data.html")

Let's have a look at the created [report](./analysis/report-all-data.html)

## Let's talk about Machine Learning

__*What is the objective of the model?*__

=> Predict the Quantity of Energy used

We will start with a first ML model to see what kind of information we need to record, for example to evaluate the capacity of the model or to detect overfitting or underfitting. From that we will understand why `mlflow` is a great tool for tracking metrics and saving artifacts.

## Preprocess

In [8]:
from sklearn.model_selection import train_test_split

target_column = "Appliances" # "y"

train, test = train_test_split(df_all) # default test size 1/4

train_x = train.drop([target_column], axis=1)
test_x = test.drop([target_column], axis=1)
train_y = train[target_column]
test_y = test[target_column]

## Metrics to evaluate the results

In [25]:
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

def eval_metrics(actual, pred):
    rmse = np.sqrt(mean_squared_error(actual, pred))
    mae = mean_absolute_error(actual, pred)
    r2 = r2_score(actual, pred)
    return rmse, mae, r2


def scatter_plot_result(y_actual, y_pred, model_name):
    plt.scatter(y_actual, y_pred)
    
    plt.ylabel('Target predicted')
    plt.xlabel('True Target')
    plt.title(model_name)
    
    pos_x = y_actual.max() * 0.60
    pos_y = y_pred.max() * 0.90
    
    # Reuse eval_metrics and keep only R^2 in math mode so the label renders cleanly
    rmse, mae, r2 = eval_metrics(y_actual, y_pred)
    plt.text(pos_x, pos_y, r'RMSE=%.2f, $R^2$=%.2f, MAE=%.2f' % (rmse, r2, mae))
    plt.savefig('./scatter_results-{}.png'.format(model_name))
    plt.close()
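
To make the three numbers returned by `eval_metrics` concrete, here is a dependency-free sketch that computes them by hand. The four data points and the function name `eval_metrics_by_hand` are made up for illustration; this is not how scikit-learn implements the metrics, only the same formulas written out.

```python
import math


def eval_metrics_by_hand(actual, pred):
    """Return (rmse, mae, r2) using the textbook formulas."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, pred)]

    # RMSE: square root of the mean squared error
    rmse = math.sqrt(sum(e ** 2 for e in errors) / n)
    # MAE: mean of the absolute errors
    mae = sum(abs(e) for e in errors) / n
    # R^2: 1 - (residual sum of squares / total sum of squares)
    mean_actual = sum(actual) / n
    ss_res = sum(e ** 2 for e in errors)
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    r2 = 1 - ss_res / ss_tot
    return rmse, mae, r2


# Hypothetical toy values, not taken from the energy data set
rmse, mae, r2 = eval_metrics_by_hand([3.0, 5.0, 7.0, 9.0], [2.0, 5.0, 8.0, 9.0])
print('rmse: {:.3f} - mae: {:.3f} - r2: {:.3f}'.format(rmse, mae, r2))
```

RMSE penalises large errors more heavily than MAE, which is why the two can rank models differently. An R² of 1 means perfect predictions, and 0 means no better than always predicting the mean.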

## Build our first model

In [27]:
from sklearn.ensemble import RandomForestRegressor

# Train model
rfp = RandomForestRegressor(random_state=0, n_estimators=100)
model = rfp.fit(train_x, train_y)


pred_test = model.predict(test_x)

print('rmse: {} - mae: {} - r2: {}'.format(*eval_metrics(test_y, pred_test)))
scatter_plot_result(test_y, pred_test, 'RandomForest')

rmse: 69.26025167803729 - mae: 32.868402918524524 - r2: 0.5332219311345148


To see scatter plot result: [see_plot](./scatter_results-RandomForest.png)

-> `Retrain your model with another set of parameters and compare results`

## Build a second model

Often, we use a second model in order to challenge the first one...

In [26]:
from sklearn.linear_model import ElasticNet
from sklearn.preprocessing import QuantileTransformer, quantile_transform


# Train model
lr = ElasticNet(random_state=0, alpha=0.5, l1_ratio=0.2)
model = lr.fit(train_x, train_y)


pred_test = model.predict(test_x)

print('rmse: {} - mae: {} - r2: {}'.format(*eval_metrics(test_y, pred_test)))
scatter_plot_result(test_y, pred_test, 'ElasticNet')

rmse: 93.26987572150462 - mae: 52.84971226579352 - r2: 0.15350361381773747


To see scatter plot result: [see_plot](./scatter_results-ElasticNet.png)

-> `Retrain your model with another set of parameters and compare results`

# Utility of MLflow

At this point you may want to draw more visualizations to compare your models:

* performance
* feature importance
* other metrics

As you can see, we will have to repeat this process **EVERY TIME** we want to compare or analyse a model or a piece of ML code.

Also, **if your data changes**, your metrics can change. It would be great to have the history of the data ATTACHED to the code's history.

__*This is where Tracking with MLflow is useful*__

The same exercise is available in `train.py`.

## 1 - MLflow [Tracking](https://mlflow.org/docs/latest/tracking.html)

In [45]:
from sklearn.compose import TransformedTargetRegressor

# If you wish to try on classification problem
def log_metrics_classification(y_true, y_prediction):
    report = classification_report(y_true, y_prediction, output_dict=True)
    for class_ in ['0', '1']:
        for metric in report[class_]:
            log_name = class_ + '_' + metric
            # insert your code here ~ 1 line
         
        
def log_metrics_regression(y_true, y_prediction):
    rmse, mae, r2 = eval_metrics(y_true, y_prediction)
    # log metrics here ~ 3 lines

def set_mlflow_experiment(experiment_name):
    experiment_name = 'Default' if experiment_name is None else experiment_name
    mlflow.set_experiment(experiment_name)


def run_experiment_elasticnet(df, alpha, l1_ratio, experiment_name=None):
    # set experiment here ~ 1 line
    set_mlflow_experiment(experiment_name)
    
    # Split data
    train, test = train_test_split(df)
    
    train_x = train.drop([target_column], axis=1)
    test_x = test.drop([target_column], axis=1)
    train_y = train[target_column]
    test_y = test[target_column]
    
    with mlflow.start_run():
        print("Running with alpha: {} - l1_ratio: {}".format(alpha, l1_ratio))

        # fit models
        lr = ElasticNet(random_state=0, alpha=alpha, l1_ratio=l1_ratio)
        lr.fit(train_x, train_y)

        prediction_test = lr.predict(test_x)

        # log parameters
        # Your code here ~ 2 lines

        # log artifact
        scatter_name = './scatter_results-ElasticNet.png'
        # save scatter plot as artifact here ~ 2 lines (1 to create the file, 1 to save as artifact)

        # log metrics
        log_metrics_regression(test_y, prediction_test)

        # log sklearn model
        # log the sklearn model here  ~ 1 line
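
One possible way to fill in the blanks above is sketched here. It relies on `mlflow.log_param`, `mlflow.log_metric`, `mlflow.log_artifact` and `mlflow.sklearn.log_model`. The helper name `log_regression_run` and its `tracker` argument are assumptions made for this sketch: the argument only exists so the helper can be tried with a stand-in object when no MLflow installation is at hand. The artifact name `elastic_net` is chosen to match the path used when the model is loaded later in this notebook.

```python
def log_regression_run(params, metrics, artifacts=(), model=None, tracker=None):
    """Log one run's parameters, metrics, files and model to the active MLflow run."""
    if tracker is None:
        # Default to the real MLflow client; imported lazily on purpose
        import mlflow
        import mlflow.sklearn
        tracker = mlflow

    # Parameters: the knobs of this run (e.g. alpha, l1_ratio)
    for name, value in params.items():
        tracker.log_param(name, value)

    # Metrics: the numbers to compare across runs (e.g. rmse, mae, r2)
    for name, value in metrics.items():
        tracker.log_metric(name, value)

    # Artifacts: files that already exist on disk, such as the scatter plot
    for path in artifacts:
        tracker.log_artifact(path)

    # Model: stored under the artifact name "elastic_net"
    if model is not None:
        tracker.sklearn.log_model(model, "elastic_net")
```

Inside the `with mlflow.start_run():` block, the call could look like `log_regression_run({'alpha': alpha, 'l1_ratio': l1_ratio}, {'rmse': rmse, 'mae': mae, 'r2': r2}, artifacts=[scatter_name], model=lr)`, once `scatter_plot_result(test_y, prediction_test, 'ElasticNet')` has written the plot file.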
        

In [46]:
# Play with the parameters yourself
# ! l1_ratio must stay between 0 and 1; alpha only needs to be non-negative !


# Remove the break to see all runs
for alpha in np.arange(0.1, 1, 0.2):
    for l1_ratio in np.arange(0.1, 1, 0.2):
        run_experiment_elasticnet(df_all, alpha, l1_ratio)
    break

Running with alpha: 0.1 - l1_ratio: 0.1
Running with alpha: 0.1 - l1_ratio: 0.30000000000000004
Running with alpha: 0.1 - l1_ratio: 0.5000000000000001
Running with alpha: 0.1 - l1_ratio: 0.7000000000000001
Running with alpha: 0.1 - l1_ratio: 0.9000000000000001
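
The long decimals in the output above (`0.30000000000000004`) come from binary floating-point arithmetic when multiples of the step `0.2` are added to `0.1`. If cleaner parameter values are wanted in the tracking UI, one option is to build the grid from integers and enumerate the pairs with `itertools.product`. This is sketched here in plain Python, with the hypothetical helper name `make_grid`.

```python
from itertools import product


def make_grid(start_tenths=1, stop_tenths=10, step_tenths=2):
    """Return values such as 0.1, 0.3, ... built from integer tenths."""
    return [i / 10 for i in range(start_tenths, stop_tenths, step_tenths)]


alphas = make_grid()
l1_ratios = make_grid()

# Every (alpha, l1_ratio) combination, equivalent to the two nested loops
param_pairs = list(product(alphas, l1_ratios))
print(alphas)            # [0.1, 0.3, 0.5, 0.7, 0.9]
print(len(param_pairs))  # 25
```

Each pair can then be passed to `run_experiment_elasticnet(df_all, alpha, l1_ratio)`.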


## 2 - MLflow [Projects](https://mlflow.org/docs/latest/projects.html)

**Now we will "package" our project as an MLflow project:**
* MLflow Projects are just a convention for organizing and describing your code
* Each project is simply a directory of files, or a Git repository, containing your code
* MLflow can run some projects based on a convention for placing files in this directory, but you can describe your project in more detail by adding an `MLproject` file, which is a YAML-formatted text file

**We will write our `MLproject` file and define:**
* Name
* Entry point (you can define several entry points, e.g. etl -> train -> test, but here we only have the main entry point)
* Environment (optional, see the documentation for more information)


-> open it: [MLproject file](./MLproject)
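
As a rough guide, an `MLproject` file for this workshop could look like the sketch below. The project name is borrowed from the repository URL used later in this section. The parameter names, their defaults, the `conda.yaml` file name and the assumption that `train.py` reads `alpha` and `l1_ratio` as positional command-line arguments are all illustrative, so adapt them to your own script.

```yaml
name: workshop-mlflow

# Optional: conda environment used to run the project (file name is an assumption)
conda_env: conda.yaml

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      l1_ratio: {type: float, default: 0.1}
    # Assumes train.py takes alpha and l1_ratio as positional arguments
    command: "python train.py {alpha} {l1_ratio}"
```

With such a file, parameters can be overridden on the command line, for example `mlflow run . -P alpha=0.4`.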


**Once you have finished your `MLproject` file**

The following command runs your project on your local system by reading your `MLproject` file.

```bash 
~$ mlflow run .
```



**What if you want to change your code, push it to a remote repo and test it with its git hash?**
```bash 
~$ mlflow run git@github.com:thomasOpenvalue/workshop-mlflow.git --version=84295be...
```

**More details:**
```bash
~$ mlflow run --help
```

## 3 - MLflow [Models](https://mlflow.org/docs/latest/models.html)

During your runs, you saved or logged your models. Each model is attached to a run, and you can see it in the MLflow UI by clicking on that run.

You can re-use a given model or serve it. MLflow supports many model flavors, such as sklearn and keras models, and can deploy models to platforms such as SageMaker.
You can also use the Model API as follows.

You will need to choose a **path** (if you used `mlflow.sklearn.save_model`, the path you saved to; if you used `mlflow.sklearn.log_model`, the name you gave when logging) and the **run_id** (pick a run in your UI).

For more details: [MLflow Model API](https://mlflow.org/docs/latest/models.html#model-api)

In [49]:
import mlflow.sklearn

sk_model = mlflow.sklearn.load_model(path="elastic_net", run_id="2cb....")

pred_test = sk_model.predict(test_x)

print('rmse: {} - mae: {} - r2: {}'.format(*eval_metrics(test_y, pred_test)))
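
The `path=` and `run_id=` arguments shown above belong to early MLflow releases. From MLflow 1.0 onwards, `mlflow.sklearn.load_model` takes a single `model_uri`, and a model logged under a run is addressed with a `runs:/<run_id>/<artifact_path>` URI. Here is a minimal sketch, assuming the model was logged under the name `elastic_net`. The helper names and the placeholder run id are hypothetical.

```python
def run_model_uri(run_id, artifact_path="elastic_net"):
    """Build the runs:/ URI of a model logged under a given run."""
    return "runs:/{}/{}".format(run_id, artifact_path)


def load_logged_model(run_id, artifact_path="elastic_net"):
    """Load a scikit-learn model from a run (requires MLflow >= 1.0)."""
    # Imported lazily so the URI helper can be used without MLflow installed
    import mlflow.sklearn
    return mlflow.sklearn.load_model(model_uri=run_model_uri(run_id, artifact_path))


# "<run_id>" is a placeholder: copy a real id from the MLflow UI
print(run_model_uri("<run_id>"))  # runs:/<run_id>/elastic_net
```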

