# Managing your machine learning lifecycle with MLflow and HPE Ezmeral ML Ops - Lab 3
## Build, experiment, train your model, track and compare model performance with MLflow on HPE Ezmeral ML Ops.

### **Lab workflow**

In this lab, using your local Jupyter Notebook sandbox:

* You will develop, train and test your model. 
* You will test your ML algorithm with a variety of parameters to determine the model that yields the best prediction results. 
* You will use **MLflow** to log parameters, metrics and model artifacts when running your machine learning code.  
* You will also train your model against a larger dataset using a remote training engine cluster that offers more computing resources, including GPUs.

>**Note:** _This workshop is not intended to teach you about AI/ML model experimentation and development. 
It is intended to give a use case for data science end-to-end ML lifecycle management with MLflow on HPE Ezmeral ML Ops._

#### About the Dataset
The dataset is based on the 2019 New York City yellow cab trip data (approximately 375,000 trip records from January-June 2019). The dataset has many different properties (aka _"X - features"_) like the pickup time and location, the dropoff time and location, the trip distance, the number of passengers, and several other variables. The goal is to predict the taxi ride duration in NY City (the target aka _"Y - label"_) based on these features. 

>**Note:** _For this workshop, you will be using a premade subset of this dataset that requires little data preprocessing and preparation (i.e.: data cleaning, data transformation)._

### **1- Code versioning integrated with Jupyter Notebook cluster**

As you can see on the left-hand side, as a data scientist, you have all the files (for example notebooks and code scripts) you need to do your work. These files are all pulled from the GitHub source control repository set up by the Operations team for the data science team.

The Jupyter Notebook ML Ops application includes the **Git** plugin that allows data scientists to use Git and do version control of their notebooks and model code scripts directly from within their local Jupyter Notebook.

From within the local Jupyter Notebook, data scientists can run the usual _git status_, _git add_, _git commit_ and _git push_ commands from the _**Notebook terminal**_ provided to push their notebooks to their GitHub repository branch, start versioning their notebooks and code, and collaborate across projects.
If other data scientists have pushed changes to the same GitHub repository branch, you can pull those changes from the terminal using the _git pull_ command. 

<img src="Launcher-terminal.jpg" height="600" width="600">

### Learn more about Git

Want to learn more about Git? Check out the community blog series and the HPE DEV workshop-on-demand:

* [Getting started with Git - blog series](https://developer.hpe.com/blog/get-involved-in-the-open-source-community-part-1-getting-started-with-git/)
* [Git 101 - Get involved in the open source community - Workshop-on-demand](https://hackshack.hpedev.io/workshops)

<font color="red">**Note: While executing code cells, you might see a "deprecation" warning message - please ignore it.**</font>

_/opt/miniconda/lib/python3.7/site-packages/ipykernel/ipkernel.py:287: DeprecationWarning: `should_run_async` will not call `transform_cell` automatically in the future._

### **2- Checking Dataset is accessible**

The Python code here is used to verify the Dataset is accessible from your local Jupyter Notebook. 

In [None]:
userID = "student{{ STDID }}"
#smalldemodataset = True
tinydemodataset = True

import os
import shutil
import time

# Shared dataset across all tenant users, use case directory name and scoring file name
if tinydemodataset is True:
    datasetFile = "demodata-tiny.csv"
else:
    datasetFile = "demodata-small.csv"
    
locationTable = "lookup-ipyheader.csv"
usecaseDirectory = "NYCTaxi"

# Project repo path function - file system mount available to all app containers
def ProjectRepo(path):
    ProjectRepo = "/bd-fs-mnt/TenantShare/repo"
    return str(ProjectRepo + '/' + path)

print ("Verifying the Dataset is accessible:")

# Making sure the input Dataset is loaded and accessible in the shared persistent container storage for your tenant
# Check the dataset file exists in /bd-fs-mnt/TenantShare/repo/data/NYCTaxi folder:
#print (os.listdir(ProjectRepo('data/NYCTaxi')))
datasetFilePath = "data" + '/' + usecaseDirectory + '/' + datasetFile
locationFilePath = "data" + '/' + usecaseDirectory + '/' + locationTable
pathData = ProjectRepo("data" + '/' + usecaseDirectory)

if (not os.path.exists(ProjectRepo(datasetFilePath))):
    print ("Error! Dataset file " + ProjectRepo(datasetFilePath) + " does not exist. Please contact the workshop administrator at hpedev.hackshack@hpe.com before continuing.")
else:
    print ("Demo dataset is " + ProjectRepo(datasetFilePath))
    
if (not os.path.exists(ProjectRepo(locationFilePath))):
    print ("Error! location table file " + ProjectRepo(locationFilePath) + " does not exist. Please contact the workshop administrator at hpedev.hackshack@hpe.com before continuing.")
else:
    print ("Location table file is " + ProjectRepo(locationFilePath))
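
The `ProjectRepo` helper above builds paths by string concatenation. As a minimal alternative sketch, the same path can be assembled with `pathlib`, which avoids doubled or missing separators. The names `REPO_ROOT` and `project_repo` are hypothetical and not part of the lab code; the repository root is the one used in this lab.

```python
from pathlib import PurePosixPath

# Root of the shared project repository, as used in this lab.
REPO_ROOT = PurePosixPath("/bd-fs-mnt/TenantShare/repo")


def project_repo(*parts):
    """Join path components under the shared project repository root."""
    return str(REPO_ROOT.joinpath(*parts))


dataset_path = project_repo("data", "NYCTaxi", "demodata-tiny.csv")
print(dataset_path)
```

Passing a single pre-joined string such as `project_repo("data/NYCTaxi")` also works, so the helper can be swapped in without changing the call sites.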


### **3- Develop the model on local Jupyter, log and track your experiments on HPE Ezmeral ML Ops with MLflow**

In general, data scientists use their local Jupyter Notebook for **experimenting** with several learning algorithms and a variety of parameters. They do so to determine the ML model that works best for the business problem they are trying to address and to develop the model that yields the best prediction results. Then, from within their notebooks, they submit their code to a large-scale training cluster to train and test their full ML models in a reasonable time, typically against a larger training dataset and test dataset. The output of this step is a trained model ready for deployment in production.


#### About the machine learning model development workflow
The gradient boosting supervised machine learning algorithm, as implemented in the **XGBoost** library and used through its **scikit-learn (sklearn) compatible API for Python**, is used here to build a model capable of predicting the duration of taxi trips in New York City. Python libraries such as NumPy, pandas, scikit-learn and XGBoost are used to build the model. 

The machine learning workflow depicted in the diagram below follows the typical supervised machine learning workflow for model development:

<img src="ML-Workflow.jpg" height="600" width="600" align="right">

- After loading the dataset (historical data) from the central project repository, the data is separated into features (the taxi ride properties) and label (the taxi ride duration). 
- Then the dataset is divided into two parts, one for training the model and one for testing the model. The typical split is 80/20.
- The ML algorithm is then defined and the model is built with the training data set to learn from. 
- The resulting model is then run on the test data which was not used to train the model.
- Next, the model accuracy is evaluated by comparing the test predictions to the test labels. Error metrics such as RMSE are used here to evaluate the accuracy of the predictions.
- The trained model is finally ***saved as an artifact on the MLflow tracking server***. The model is ready for deployment in production to serve predictions.
- The next step is then to move your trained model to production to serve your model as a prediction service (Lab Part 4) and make predictions on new data points (Lab Part 5).
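
The steps above can be sketched end to end with nothing but the Python standard library. This is an illustration under stated assumptions, not the lab's model: the data is synthetic (a made-up linear relation between distance and duration), and a simple least-squares line stands in for the gradient boosting model.

```python
import math
import random

random.seed(0)

# Synthetic "historical data": (trip distance in miles, duration in seconds).
# The linear relation plus noise is made up purely for illustration.
rows = []
for _ in range(1000):
    distance = random.uniform(0.5, 20.0)
    duration = 180.0 * distance + random.gauss(0.0, 60.0)
    rows.append((distance, duration))

# 1. Split into a training set (80%) and a test set (20%).
random.shuffle(rows)
cut = int(0.8 * len(rows))
train, test = rows[:cut], rows[cut:]

# 2. "Train": fit a least-squares line duration = a * distance + b.
n = len(train)
mean_x = sum(x for x, _ in train) / n
mean_y = sum(y for _, y in train) / n
covariance = sum((x - mean_x) * (y - mean_y) for x, y in train)
variance = sum((x - mean_x) ** 2 for x, _ in train)
a = covariance / variance
b = mean_y - a * mean_x

# 3. Predict on the held-out test set, then evaluate with RMSE.
preds = [a * x + b for x, _ in test]
rmse = math.sqrt(sum((p - y) ** 2 for p, (_, y) in zip(preds, test)) / len(test))
print(len(train), len(test), round(rmse, 1))
```

The key point is that the RMSE is computed only on rows the model never saw during fitting, which is what makes it a fair estimate of prediction quality.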

### Importing necessary packages and library

In [None]:
print("Importing libraries")
import pandas as pd
import numpy as np
from scipy import stats
import math
import os
import datetime
import xgboost as xgb
import pickle
import matplotlib.pyplot as plt

### Read in Dataset

In [None]:
tinydemodataset = True
usecaseDirectory = "NYCTaxi"

if tinydemodataset is True:
    datasetFile = "demodata-tiny.csv"
else:
    datasetFile = "demodata-small.csv"
    
locationTable = "lookup-ipyheader.csv"
datasetFilePath = "data" + '/' + usecaseDirectory + '/' + datasetFile
locationFilePath = "data" + '/' + usecaseDirectory + '/' + locationTable

# Start time 
print("Start time for " + userID + ": ", datetime.datetime.now())

# Project repo path function
def ProjectRepo(path):
    ProjectRepo = "/bd-fs-mnt/TenantShare/repo"
    return str(ProjectRepo + '/' + path)

# Logger used by the exception handlers below
import logging
logger = logging.getLogger(__name__)

print("Reading in data")
# Reading in dataset table using pandas
dbName = "pqyellowtaxi"
##df = pd.read_csv(ProjectRepo('data/NYCTaxi/demodata.csv'))
try:
    df = pd.read_csv(ProjectRepo(datasetFilePath))
except Exception as e:
    logger.exception(
        "Unable to read the dataset, check with your administrator. Error: %s", e
        )

# Reading in latitude/longitude coordinate lookup table using pandas 
lookupDbName = "pqlookup"
##dflook = pd.read_csv(ProjectRepo('data/NYCTaxi/lookup-ipyheader.csv'))
try:
    dflook = pd.read_csv(ProjectRepo(locationFilePath))
except Exception as e:
    logger.exception(
        "Unable to read the location table, check with your administrator. Error: %s", e
        )
    
print("Done reading in data")

In [None]:
df.head()

In [None]:
dflook.head()

### Data preprocessing and preparation (data cleaning, data transformation)

In [None]:
print ("Data preparation")
# merging dataset and lookup tables on latitudes/coordinates
df = pd.merge(df, dflook[[lookupDbName + '.location_i', lookupDbName + '.long', lookupDbName + '.lat']], how='left', left_on=dbName + '.pulocationid', right_on=lookupDbName + '.location_i')
df.rename(columns = {(lookupDbName + '.long'):(dbName + '.startstationlongitude')}, inplace = True)
df.rename(columns = {(lookupDbName + '.lat'):(dbName + '.startstationlatitude')}, inplace = True)
df = pd.merge(df, dflook[[lookupDbName + '.location_i', lookupDbName + '.long', lookupDbName + '.lat']], how='left', left_on=dbName + '.dolocationid', right_on=lookupDbName + '.location_i')
df.rename(columns = {(lookupDbName + '.long'):(dbName + '.endstationlongitude')}, inplace = True)
df.rename(columns = {(lookupDbName + '.lat'):(dbName + '.endstationlatitude')}, inplace = True)


def fullName(colName):
    return dbName + '.' + colName

# convert string to datetime
df[fullName('tpep_pickup_datetime')] = pd.to_datetime(df[fullName('tpep_pickup_datetime')])
df[fullName('tpep_dropoff_datetime')] = pd.to_datetime(df[fullName('tpep_dropoff_datetime')])
df[fullName('duration')] = (df[fullName("tpep_dropoff_datetime")] - df[fullName("tpep_pickup_datetime")]).dt.total_seconds()

# feature engineering
# Feature engineering is the process of transforming raw data into inputs for a machine learning algorithm
df[fullName("weekday")] = (df[fullName('tpep_pickup_datetime')].dt.dayofweek < 5).astype(float)
df[fullName("hour")] = df[fullName('tpep_pickup_datetime')].dt.hour
df[fullName("work")] = (df[fullName('weekday')] == 1) & (df[fullName("hour")] >= 8) & (df[fullName("hour")] < 18)
df[fullName("month")] = df[fullName('tpep_pickup_datetime')].dt.month
# convert month to a categorical feature using one-hot encoding
df = pd.get_dummies(df, columns=[fullName("month")])

# Filter dataset to remove outliers: keep rides between 20 seconds and 3 hours, and between 0 and 150 miles
df = df[df[fullName('duration')] > 20]
df = df[df[fullName('duration')] < 10800]
df = df[df[fullName('trip_distance')] > 0]
df = df[df[fullName('trip_distance')] < 150]

# drop null rows
df = df.dropna(how='any',axis=0)

# select columns to be used as features
cols = [fullName('work'), fullName('startstationlatitude'), fullName('startstationlongitude'), fullName('endstationlatitude'), fullName('endstationlongitude'), fullName('trip_distance'), fullName('weekday'), fullName('hour')]
cols.extend([fullName('month_' + str(x)) for x in range(1, 7)])
cols.append(fullName('duration'))
dataset = df[cols]

# separate data into features (the taxi ride properties) and label (duration) using .iloc
X = dataset.iloc[:, 0:(len(cols) - 1)].values
y = dataset.iloc[:, (len(cols) - 1)].values
X = X.copy()
y = y.copy()
print (X.shape)
print (y.shape)

#del dataset
#del df

print("Done preparing data")
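
To make the feature engineering step concrete, here is a minimal sketch that derives the same time-based features for a single trip using only the standard library. The function name `time_features` and the example trip are hypothetical; the lab itself computes these columns with pandas over the whole DataFrame.

```python
from datetime import datetime


def time_features(pickup, dropoff):
    """Derive the time-based features used in this lab from two timestamps."""
    weekday = 1.0 if pickup.weekday() < 5 else 0.0  # Monday=0 ... Sunday=6
    hour = pickup.hour
    work = weekday == 1.0 and 8 <= hour < 18  # weekday working hours
    features = {
        "duration": (dropoff - pickup).total_seconds(),
        "weekday": weekday,
        "hour": hour,
        "work": work,
    }
    # One-hot encode the month for January to June (months 1 to 6).
    for month in range(1, 7):
        features["month_%d" % month] = int(pickup.month == month)
    return features


# A made-up trip: Tuesday 5 March 2019, 09:15 to 09:40.
example = time_features(datetime(2019, 3, 5, 9, 15), datetime(2019, 3, 5, 9, 40))
print(example)
```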


#### **Take a quick glance at our prepared dataset. These are the columns (features) that we will be working with. Our target variable is the "duration" column.**

In [None]:
dataset.head()

In [None]:
del dataset
del df

### Train and Test your model using your local Jupyter Notebook sandbox.
Execute the Notebook code cell below to train and test your model on your local Jupyter Notebook sandbox. 

Let the code cell run until completion. Training your model should take a couple of minutes. When it is done, you will see the following message in the output of the code cell: **Training and testing of your model finished!** 

> **Note:** When executing a notebook code cell, a [*] next to the cell means the execution step is still running within the notebook.   
A [digit number] next to the cell means the execution step has completed.


In [None]:
print("Start the training and testing of your model...")

# As we have one dataset, the data is split into a training data set and a test data set. The typical split is 80:20. 
# 80% of the data is used to train the model and 20% is used for testing the model.
print("Load the sklearn module to split the dataset into a training data set and a test data set.")
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


print("Build the model and fit the model on the training data. This will take a few minutes...")

# Define the ML algorithm (that is the learning algorithm to use). 
# We use the gradient boosting algorithm as implemented in the XGBoost library.
# The XGBRegressor class exposes it through a scikit-learn compatible API,
# which we use here to build the model.
# No tree_method is set in this cell, so training runs on the CPU of the local notebook.
xgbr = xgb.XGBRegressor(objective ='reg:squarederror',
                        colsample_bytree = 1,
                        subsample = 1,
                        learning_rate = 0.15,
                        booster = "gbtree",
                        max_depth = 1,
                        eta = 0.5,
                        eval_metric = "rmse",) 

numtrainelements = str(len(X_train))
#print("num train elements: " + str(len(X_train)))
print("number of train elements: " + numtrainelements)

# then train (fit) the model on training data set comprised of inputs (features) and outputs (label)
# we provide our defined ML algorithm with data to learn from
print("Train start time: ", datetime.datetime.now())
tmp = datetime.datetime.now()
xgbr.fit(X_train, y_train)
#after training the model, check the model training score. The closer towards 1, the better the fit.
score = xgbr.score(X_train, y_train)  
print("Model training score. The closer towards 1, the better the fit: ", score)

print("Train end time: ", datetime.datetime.now())
cpu_time = datetime.datetime.now() - tmp
print("Training time on local Jupyter Notebook: %s seconds" % cpu_time.total_seconds())
print("Test the model by making predictions on the test data set where only the features are provided")
y_pred = xgbr.predict(X_test)
y_pred = y_pred.clip(min=0)

print("Evaluating the model accuracy by comparing the test predictions with the test labels. RMSE is used as evaluation metric.")
# evaluating the model accuracy
from sklearn import metrics
from sklearn.metrics import mean_squared_log_error

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
rmsle = np.sqrt(mean_squared_log_error( y_test, y_pred))

print("  Root Mean Squared Error - the lower the value is, the better the fit: %s" % rmse)
print("  Mean Absolute Error: %s" % mae)
print("  Mean Squared Error: %s" % mse)
print("  Root Mean Squared Log Error: %s" % rmsle)
print()
print("Note that for RMSE the lower the value is, the better the fit")
print("")
print("Training and testing of your model finished!")
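
The four error metrics printed by the cell above can be written out by hand, which helps when interpreting them. The sketch below uses only the standard library; the function name `regression_metrics` and the sample durations are made up for illustration, while the formulas follow the standard definitions (RMSLE is taken on `log(1 + value)`).

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MAE, MSE, RMSE and RMSLE for two equal-length sequences."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / n
    mse = sum(e * e for e in errors) / n
    rmse = math.sqrt(mse)
    # RMSLE compares log(1 + value), so it needs non-negative inputs;
    # this is presumably why the lab clips predictions at zero first.
    log_errors = [math.log1p(p) - math.log1p(t) for t, p in zip(y_true, y_pred)]
    rmsle = math.sqrt(sum(e * e for e in log_errors) / n)
    return {"mae": mae, "mse": mse, "rmse": rmse, "rmsle": rmsle}


# Made-up durations in seconds, for illustration only.
m = regression_metrics([600.0, 900.0, 1200.0], [660.0, 840.0, 1290.0])
print(m)
```

MAE and RMSE are in the same unit as the label (seconds here), so they read directly as "how far off is a typical prediction". RMSE penalizes large misses more heavily than MAE.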

### Log and track your model performance with MLflow on HPE Ezmeral ML Ops.

The operations team has deployed an _MLflow tracking server_ instance on HPE Ezmeral ML Ops, with an embedded MinIO S3 store and a SQL database for storing the model metadata (the collection of parameters, metrics, tags, and model artifact URI) and model artifacts (model files) related to the training process of your ML model. You can use the **MLflow Tracking Python API** to log data such as parameters, metrics, tags and model artifacts to the MLflow tracking server when running your ML code, and to query that data afterwards.

MLflow tracking is organized around the concept of ***runs***, which are individual executions of some piece of data science code. You typically organize `runs` into ***experiments*** to group the runs that address the same business problem. 
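
To make the relationship between experiments and runs concrete, here is a plain-Python stand-in for the kind of records a tracking server keeps. It is not the MLflow API: the experiment name, run names and all parameter and metric values are hypothetical.

```python
# Plain-Python stand-in for what a tracking server stores; not the MLflow API.
# Run names and all parameter/metric values are hypothetical.
experiment = {
    "name": "demo-experiment",
    "runs": [
        {"run_name": "run-a", "params": {"max_depth": 1}, "metrics": {"eval_rmse": 420.0}},
        {"run_name": "run-b", "params": {"max_depth": 3}, "metrics": {"eval_rmse": 365.0}},
        {"run_name": "run-c", "params": {"max_depth": 6}, "metrics": {"eval_rmse": 380.0}},
    ],
}

# Comparing runs amounts to ranking them by a logged metric (lower RMSE is better).
ranked = sorted(experiment["runs"], key=lambda run: run["metrics"]["eval_rmse"])
best = ranked[0]
print(best["run_name"], best["params"], best["metrics"]["eval_rmse"])
```

Because every run records both its parameters and its metrics, the best-scoring run also tells you which parameter values produced it.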

You need to tell your Python code script which MLflow tracking server URI and Experiment name to use to log the `runs`. You will do this using the Jupyter Notebook custom ***magic*** functions ***%loadMlflow*** and ***%Setexp*** respectively.

>**Note:** _The Jupyter Notebook ML Ops application used in our solution includes **custom magic** functions to log and track parameters, metrics and models to a remote MLflow tracking server._

#### Set the MLflow environment variables
Set the MLflow tracking server URI:

In [None]:
%loadMlflow

For this workshop, you set an MLflow experiment for your project, named `student<Id>-NYCTaxi-Experiment`, using the custom magic command ***%Setexp***.

In [None]:
experimentName = userID + "-NYCTaxi-Experiment"
print ("Experiment name is: " + experimentName )

%Setexp --name $experimentName

# The %Setexp magic above replaces the two calls below:
# mlflow.set_experiment('experimentName') - sets the given experiment as the "active" experiment into which MLflow 'runs' are grouped/organized.
#            A new experiment is created if it does not exist. It also starts a new MLflow run and sets it as the active run.
# mlflow.set_tag('mlflow.user','studentxxx') - sets a tag under the current run,
#           where mlflow.user is the identifier of the user who created the run.

print ("Set the experiment " + experimentName + " as active experiment.")

#### Launch the training job runs for your experiment

For each training job run, your Python code records a new `run` in the MLflow tracking server to keep track of the metadata (input parameters, metrics, tags, model artifact URI) and the model artifacts (model files) of the generated ML model.

The following API calls are used to start and manage MLflow runs:

* **start_run()** – Starts a new MLflow run, setting it as the active run under which metrics and parameters are logged.
* **log_param()** – Logs a parameter under the current run.
* **log_metric()** – Logs a metric under the current run.
* **set_tags()** - Logs a set of tags under the current run.
* **sklearn.log_model()** – Logs a Scikit-learn model as an MLflow artifact for the current run.

In [None]:
import mlflow
import mlflow.sklearn
from mlflow import log_metric, log_param, log_artifact

#display the tracking server URI
tracking_uri = mlflow.get_tracking_uri()
print("Tracking URI: {}".format(tracking_uri))

#get information about the experiment
experimentName = userID + "-NYCTaxi-Experiment"
experiment_name = mlflow.get_experiment_by_name(experimentName)
TeamExperimentId=experiment_name.experiment_id
print("Experiment_id: {}".format(experiment_name.experiment_id))
print("Name: {}".format(experiment_name.name))
print("Artifact Location: {}".format(experiment_name.artifact_location))
print("Tags: {}".format(experiment_name.tags))
print("Lifecycle_stage: {}".format(experiment_name.lifecycle_stage))
#
print("Your team Experiment Id is: ",TeamExperimentId)

tags = {"team": userID,
        "dataset": "Tiny",
        "Algo-Library": "XGBoost-sklearn",
        "Processor": "CPU"}

# mlflow: stop active runs if any
if mlflow.active_run():
    mlflow.end_run()
    
#Start a new MLflow run, setting it as the active run under which metrics, parameters and model artifacts (model files) will be logged
print("Start a new MLflow run, log ML algorithm parameters, training and test metrics:")
print ("")

MyRunName = "Run-Test-1"
mlflow.start_run(experiment_id=TeamExperimentId,run_name=MyRunName)
mlflow.log_param("learning_rate", 0.15)
mlflow.log_param("max_depth", 1)
mlflow.log_metric("tr_score", score)
mlflow.log_metric("eval_rmse", rmse)
mlflow.set_tags(tags)

# Log the scikit-learn model as an MLflow artifact for the current run.
mlflow.sklearn.log_model(xgbr, "model")
print("Model logged as an MLflow artifact for your current run " + MyRunName)

# mlflow: end tracking
mlflow.end_run()
print("")


### **4- Train the model on the remote shared training cluster**

As described in section 3, once you have experimented locally, you submit your code from within your notebook to a large-scale training cluster to train and test your full ML model in a reasonable time, typically against a larger training dataset and test dataset. The output of this step is a trained model ready for deployment in production.

### Test Connection to the remote tenant-shared training cluster

Next, before training your model on the remote shared training cluster, you may want to test that communication with the shared training cluster is working properly.

> **Note:** _The Jupyter Notebook ML Ops application used in our solution includes **custom magic** functions to handle remotely submitting training code and retrieving results and logs. The Notebook uses these magic functions to make REST API calls to the API server that runs as part of the shared training environment. These calls submit training jobs and get results from within the Notebook session._

The Jupyter Notebook ML Ops app includes the following **line magic** functions: 
* %attachments: Returns a list of connected training environments. 
* %logs --url: Takes the URL returned by the training server load balancer and monitors the status of the training job.

The Jupyter Notebook ML Ops app also includes the following **cell magic** function:
* %%training_cluster_name: Submits the training code in the cell to the shared training environment with that name.

#### Get the list of connected training environments

<b>%attachments</b> is a line magic command that outputs a table with the name(s) of the training cluster(s) available for you to use. The Operations team may have created multiple training clusters for different projects depending on the needs of the model or the size of the data, for example some with GPU nodes and others with CPUs only.

In [None]:
%attachments

#### Submit a notebook code cell to a remote training cluster

To use the training cluster, you need to grab the name of the training cluster you want to use and feed it into another custom cell magic command. 

With the **%%trainingengineshared** magic command specified at the beginning of the code cell, the Jupyter Notebook will submit the entire content of the cell to the training cluster named _trainingengineshared_. If you comment out this magic command, the code will run on your local Jupyter Notebook.

The example cell below will execute a print statement on the training cluster.

In [None]:
%%capture history_url

%%trainingengineshared

import datetime

print('test')
print("Date time: ", datetime.datetime.now())

#### Retrieve the result of the job

The training cluster will send back a unique log URL for the job submitted.   
You can use this _History URL_ with the _"%logs --url"_ custom **line magic** command to track the status of the job in real time. 

A status of "**Finished**" means the execution of the job submitted to the training cluster has completed.

In [None]:
historyurl = history_url.stdout.split(' ')[2]
print(historyurl)
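
The cell above picks the URL out of the captured output by position (`split(' ')[2]`), which depends on the exact wording of the message. As a hedged alternative sketch, a regular expression can locate the URL wherever it appears. The function name `extract_history_url` and the sample output string are hypothetical; the real text printed by the magic command may differ.

```python
import re


def extract_history_url(captured_stdout):
    """Return the first http(s) URL found in the captured cell output, or None."""
    match = re.search(r"https?://\S+", captured_stdout)
    return match.group(0) if match else None


# Hypothetical captured output; the real text printed by the magic may differ.
sample = "History URL: http://training.example.com:10001/history/42\n"
print(extract_history_url(sample))
```

In the notebook, this would be called as `extract_history_url(history_url.stdout)`. Returning `None` when no URL is found makes a failed submission easier to detect than an `IndexError`.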

In [None]:
%logs --url $historyurl

#### Now let's submit the Python code to the remote training engine cluster (with GPU) while logging and tracking parameters, metrics and models with MLflow

In [None]:
%%capture history_url

%%trainingengineshared

userID = "student{{ STDID }}"
smalldemodataset = True

print("Importing libraries")
import pandas as pd
import numpy as np
from scipy import stats
import math
import os
import datetime
import xgboost as xgb
import pickle

import mlflow
import mlflow.sklearn
from mlflow import log_metric, log_param, log_artifact

import matplotlib.pyplot as plt

usecaseDirectory = "NYCTaxi"

if smalldemodataset is True:
    datasetFile = "demodata-small.csv"
else:
    datasetFile = "demodata.csv"
    
locationTable = "lookup-ipyheader.csv"
datasetFilePath = "data" + '/' + usecaseDirectory + '/' + datasetFile
locationFilePath = "data" + '/' + usecaseDirectory + '/' + locationTable
studentRepoModel = "models" + '/' + usecaseDirectory + '/' + userID

# Start time 
print("Start time for " + userID + ": ", datetime.datetime.now())

# Project repo path function
def ProjectRepo(path):
    ProjectRepo = "/bd-fs-mnt/TenantShare/repo"
    return str(ProjectRepo + '/' + path)


print("Reading in data")
# Reading in dataset table using pandas
dbName = "pqyellowtaxi"
##df = pd.read_csv(ProjectRepo('data/NYCTaxi/demodata.csv'))
df = pd.read_csv(ProjectRepo(datasetFilePath))

# Reading in latitude/longitude coordinate lookup table using pandas 
lookupDbName = "pqlookup"
##dflook = pd.read_csv(ProjectRepo('data/NYCTaxi/lookup-ipyheader.csv'))
dflook = pd.read_csv(ProjectRepo(locationFilePath))
print("Done reading in data")


# merging dataset and lookup tables on latitudes/coordinates
df = pd.merge(df, dflook[[lookupDbName + '.location_i', lookupDbName + '.long', lookupDbName + '.lat']], how='left', left_on=dbName + '.pulocationid', right_on=lookupDbName + '.location_i')
df.rename(columns = {(lookupDbName + '.long'):(dbName + '.startstationlongitude')}, inplace = True)
df.rename(columns = {(lookupDbName + '.lat'):(dbName + '.startstationlatitude')}, inplace = True)
df = pd.merge(df, dflook[[lookupDbName + '.location_i', lookupDbName + '.long', lookupDbName + '.lat']], how='left', left_on=dbName + '.dolocationid', right_on=lookupDbName + '.location_i')
df.rename(columns = {(lookupDbName + '.long'):(dbName + '.endstationlongitude')}, inplace = True)
df.rename(columns = {(lookupDbName + '.lat'):(dbName + '.endstationlatitude')}, inplace = True)


def fullName(colName):
    return dbName + '.' + colName

# convert string to datetime
df[fullName('tpep_pickup_datetime')] = pd.to_datetime(df[fullName('tpep_pickup_datetime')])
df[fullName('tpep_dropoff_datetime')] = pd.to_datetime(df[fullName('tpep_dropoff_datetime')])
df[fullName('duration')] = (df[fullName("tpep_dropoff_datetime")] - df[fullName("tpep_pickup_datetime")]).dt.total_seconds()

# feature engineering
# Feature engineering is the process of transforming raw data into inputs for a machine learning algorithm
df[fullName("weekday")] = (df[fullName('tpep_pickup_datetime')].dt.dayofweek < 5).astype(float)
df[fullName("hour")] = df[fullName('tpep_pickup_datetime')].dt.hour
df[fullName("work")] = (df[fullName('weekday')] == 1) & (df[fullName("hour")] >= 8) & (df[fullName("hour")] < 18)
df[fullName("month")] = df[fullName('tpep_pickup_datetime')].dt.month
# convert month to a categorical feature using one-hot encoding
df = pd.get_dummies(df, columns=[fullName("month")])

# Filter dataset to remove outliers: keep rides between 20 seconds and 3 hours, and between 0 and 150 miles
df = df[df[fullName('duration')] > 20]
df = df[df[fullName('duration')] < 10800]
df = df[df[fullName('trip_distance')] > 0]
df = df[df[fullName('trip_distance')] < 150]

# drop null rows
df = df.dropna(how='any',axis=0)

# select columns to be used as features
cols = [fullName('work'), fullName('startstationlatitude'), fullName('startstationlongitude'), fullName('endstationlatitude'), fullName('endstationlongitude'), fullName('trip_distance'), fullName('weekday'), fullName('hour')]
cols.extend([fullName('month_' + str(x)) for x in range(1, 7)])
cols.append(fullName('duration'))
dataset = df[cols]

# separate data into features (the taxi ride properties) and label (duration) using .iloc
X = dataset.iloc[:, 0:(len(cols) - 1)].values
y = dataset.iloc[:, (len(cols) - 1)].values
X = X.copy()
y = y.copy()
print (X.shape)
print (y.shape)
del dataset
del df

print("Done cleaning data")


print("Training and testing...")

# As we have one dataset, the data is split into a training data set and a test data set. The typical split is 80:20. 
# 80% of the data is used to train the model and 20% is used for testing the model.
print("Load the sklearn module to split the dataset into a training data set and a test data set.")
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)


from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)


print("Build the model and fit the model on the training data. This will take a few seconds on GPU based server...")

# Define the ML algorithm (that is the learning algorithm to use). 
# We use the gradient boosting algorithm as implemented in the XGBoost library.
# The XGBRegressor class exposes it through a scikit-learn compatible API,
# which we use here to build the model.
# The hyperparameter tree_method is used to enable training of an XGBoost model using the GPU device. 
xgbr = xgb.XGBRegressor(objective ='reg:squarederror',
                        tree_method = 'gpu_hist',
                        colsample_bytree = 1,
                        subsample = 1,
                        learning_rate = 0.15,
                        booster = "gbtree",
                        max_depth = 3,
                        eta = 0.5,
                        eval_metric = "rmse",) 

print("num train elements: " + str(len(X_train)))

# then train (fit) the model on training data set comprised of inputs (features) and outputs (label)
# we provide our defined ML algorithm with data to learn from
print("Train start time: ", datetime.datetime.now())
tmp = datetime.datetime.now()
xgbr.fit(X_train, y_train)
#after training the model, check the model training score. The closer towards 1, the better the fit.
score = xgbr.score(X_train, y_train)  
print("Model training score. The closer towards 1, the better the fit: ", score)

print("Train end time: ", datetime.datetime.now())
gpu_time = datetime.datetime.now() - tmp
print("GPU Training time: %s seconds" % gpu_time.total_seconds())
print("Test the model by making predictions on the test data set where only the features are provided")
y_pred = xgbr.predict(X_test)
y_pred = y_pred.clip(min=0)

print("Evaluating the model accuracy by comparing the test predictions with the test labels. RMSE is used as evaluation metric.")
# evaluating the model accuracy
from sklearn import metrics
from sklearn.metrics import mean_squared_log_error

mae = metrics.mean_absolute_error(y_test, y_pred)
mse = metrics.mean_squared_error(y_test, y_pred)
rmse = np.sqrt(metrics.mean_squared_error(y_test, y_pred))
rmsle = np.sqrt(mean_squared_log_error( y_test, y_pred))

print("  Root Mean Squared Error - the lower the value is, the better the fit: %s" % rmse)
print("  Mean Absolute Error: %s" % mae)
print("  Mean Squared Error: %s" % mse)
print("  Root Mean Squared Log Error: %s" % rmsle)
print()
print("Note that for RMSE the lower the value is, the better the fit")

# Display the tracking server URI
tracking_uri = mlflow.get_tracking_uri()
print("Tracking URI: {}".format(tracking_uri))

#get information about the experiment
experimentName = userID + "-NYCTaxi-Experiment"
print ("Experiment name is: " + experimentName )
experiment_name = mlflow.get_experiment_by_name(experimentName)
TeamExperimentId=experiment_name.experiment_id
print("Experiment_id: {}".format(experiment_name.experiment_id))
print("Name: {}".format(experiment_name.name))
print("Artifact Location: {}".format(experiment_name.artifact_location))
print("Tags: {}".format(experiment_name.tags))
print("Lifecycle_stage: {}".format(experiment_name.lifecycle_stage))
#
print("Your team Experiment Id is: ",TeamExperimentId)

tags = {"team": userID,
        "dataset": "Small",
        "Algo-Library": "XGBoost-sklearn",
        "Processor": "GPU"}

# mlflow: stop active runs if any
if mlflow.active_run():
    mlflow.end_run()
    
#Start a new MLflow run, setting it as the active run under which metrics, parameters and model artifacts (model files) will be logged
print("Start a new MLflow run, log ML algorithm parameters, training and test metrics")
print ("")

MyRunName = "Run-Training-Remote"
mlflow.start_run(experiment_id=TeamExperimentId,run_name=MyRunName)
mlflow.log_param("learning_rate", 0.15)
mlflow.log_param("max_depth", 3)
mlflow.log_metric("tr_score", score)
mlflow.log_metric("eval_rmse", rmse)
mlflow.set_tags(tags)

# Log the scikit-learn model as an MLflow artifact for the current run.
mlflow.sklearn.log_model(xgbr, "model")
print("Model logged as an MLflow artifact for your current run " + MyRunName)

# mlflow: end tracking
mlflow.end_run()
print("")


# Finish time
print("End time: ", datetime.datetime.now())

#### Monitor the training job

In [None]:
historyurl = history_url.stdout.split(' ')[2]
print(historyurl)

##### Run the cell code below regularly until the job status is marked as "**Finished**"

>**Note:** Depending on the number of concurrent jobs submitted to the training cluster environment (for example, when multiple participants run the workshop at the same time), the model training may take several minutes to complete.

In [None]:
%logs --url $historyurl

### **5- Using the MLflow server UI to evaluate how the model performed** 

#### Navigate to the MLflow tracking server UI 

You can use the MLflow tracking server UI to visualize, compare and search `runs` by Experiments. 

From your browser, go to the MLflow tracking server UI and refresh your browser tab to update the MLflow UI page. 

#### Identify the `runs` for your Experiment

Select the Experiment name for your project: **\<StudentId\>-NYCTaxi-Experiment** on the left pane to visualize the associated set of `runs` grouped under your Experiment.

You will see the date and name of `runs` and the metadata information (input parameters, metrics, tags) logged for each `run`.

You will also see a list of `runs` with metrics showing how your model performed with each set of parameters. Each line in the table represents one of the times you trained the model.

<font color="green">Note: You may see a series of MLflow `runs` from past sessions, sorted in descending order (from the newest to the oldest). Your `runs` are the most recent ones. </font>  

![MLFlow UI visualization of the runs](MLflow-UI-All-Runs.png)

In this part of the lab, you have trained the model twice: the first time from your local Jupyter Notebook with _learning_rate=0.15_ and _max_depth=1_, and the second time from the ML Ops remote training cluster with _learning_rate=0.15_ and _max_depth=3_. Both models were logged under your Experiment. 

You can easily recognize each of your `runs` based on the **most recent date** (column Start Time), their **name** (column Run Name) and their **source** (column Source).

* The source of a model developed and trained in the ML Ops Jupyter Notebook is the following: ***ipykernel_launch*** with user being your ***Student\<ID\>*** 

* The source of a model developed and trained in the ML Ops remote training cluster is the following: ***train\<xx\>*** with user being ***root***.
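
The same identification can also be sketched programmatically. The snippet below is a minimal sketch, assuming the run name (`Run-Training-Remote`) and the tags (`Processor`, `dataset`) logged by the training code in this lab. The helper names `build_filter` and `find_runs` are hypothetical and are not part of MLflow; only `mlflow.search_runs()` is an MLflow call.

```python
def build_filter(run_name=None, tags=None):
    """Build an MLflow search filter string from a run name and simple tag values.

    Assumes tag keys and values contain no quotes, spaces or special characters.
    """
    clauses = []
    if run_name:
        clauses.append(f"tags.mlflow.runName = '{run_name}'")
    for key, value in (tags or {}).items():
        clauses.append(f"tags.{key} = '{value}'")
    return " and ".join(clauses)


def find_runs(experiment_id, run_name=None, tags=None):
    """Return the matching runs as a dataframe, lowest eval_rmse first."""
    import mlflow  # imported here so build_filter() can be tried without a tracking server

    return mlflow.search_runs(
        experiment_ids=[experiment_id],
        filter_string=build_filter(run_name, tags),
        order_by=["metrics.eval_rmse ASC"],
    )


print(build_filter("Run-Training-Remote", {"Processor": "GPU", "dataset": "Small"}))
```

In the notebook, a call such as `find_runs(TeamExperimentId, tags={"Processor": "GPU"})` would be expected to return only the runs trained on the remote GPU cluster, provided those tags were logged as shown earlier.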

#### Determine the best performing model

Using the MLflow tracking server UI, you can easily determine the model that yields the best prediction result by looking at the ***Metrics*** (training score and evaluation RMSE) logged in MLflow for each `run`. 

You can visualize the models in the MLflow tracking server UI and determine the best performing model based on the following criteria: 

* For RMSE the lower the value is, the better the fit.

* For the model training score, the closer towards 1, the better the fit.
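
To make these two criteria concrete, here is a small sketch on made-up numbers (not taken from the lab dataset). It assumes the training score is the R² value that the `score()` method of a scikit-learn style regressor returns. The helper names `rmse` and `r_squared` are illustrative only.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error: typical prediction error, in the units of the target."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r_squared(y_true, y_pred):
    """Coefficient of determination: 1.0 is a perfect fit, 0.0 matches always predicting the mean."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Made-up trip durations in minutes, for illustration only
actual = [10, 20, 30, 40]
close_model = [12, 18, 33, 39]   # predictions near the actual values
mean_model = [25, 25, 25, 25]    # always predicts the average duration

for name, preds in [("close_model", close_model), ("mean_model", mean_model)]:
    print(f"{name}: RMSE={rmse(actual, preds):.2f}  R2={r_squared(actual, preds):.3f}")
```

The model whose predictions stay close to the actual values has the lower RMSE and the R² closer to 1, which is the pattern to look for when comparing `runs` in the UI.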


### **6- Fetch the MLflow model artifact URI** 

Once you have identified your best performing model, you are ready to operationalize it by moving it to production for model serving.

To deploy a model into production and serve predictions, you first need to get the ***URI of the MLflow model artifact*** that is stored in the MinIO S3 bucket on the MLflow tracking server. 

You can do this using either the MLflow tracking server UI or using a programmatic approach with the MLflow tracking Python API.

#### Using the MLflow tracking server UI:

If you **click on the date of the run** for your best performing model (this is the `run` with user "root" and source "train\<xx\>"), you will get the complete details of the metadata information (parameters, metrics, tags) and model artifact for that particular `run`. 

At the top, MLflow shows the name of the `run` and its metadata information. To find the model artifact URI, scroll down to the **Artifacts** section below the metadata information in the selected `run`. 

![image.png](MLflow-UI-Model-Artifacts.png)

You can see the artifacts generated by the `run`:
* an ***MLmodel*** file with metadata that allows MLflow to run the model.
* a ***conda.yaml*** file that can be used to recreate the conda environment (conda is a package and environment management system). 
* a serialized version of the model: ***model.pkl***. This is the model saved in pickle format, in a file that you can use to deploy the model to an HPE Ezmeral ML Ops model serving endpoint.

To get the Model artifact URI, expand the **model** field, **select and copy (CTRL+C)** the ***Full Path*** string value to **your clipboard**. You will need it in the next part of the lab to register the model in the MLflow registry of HPE Ezmeral ML Ops. 

>**Note:** _Do not click the Register Model button in the MLflow tracking server UI. You must register the model through the MLflow registry of HPE Ezmeral ML Ops. You will register your model in the next part of the lab._

#### Using a programmatic approach:

An alternative way to interact with the MLflow tracking server is through its API. The Python code cell below uses the MLflow tracking API for Python to determine the best performing model and get its MLflow model artifact URI.

The following API calls query the MLflow tracking server and fetch metadata information about the `runs` from your active Experiment:

* **search_runs()** – Query metadata information from logged `runs` and return the information that fits the search criteria in a dataframe (a tabular data structure).
* **get_run()** - Fetch metadata information for a particular `run`.

In [None]:
# Search for the best model
print("Best model information:")
best_run_df = mlflow.search_runs(experiment_ids=[TeamExperimentId], max_results=1, order_by=["metrics.eval_rmse ASC"])
print(best_run_df[["metrics.eval_rmse", "tags.mlflow.runName", "run_id"]])
print("")
best_run = mlflow.get_run(best_run_df.at[0, 'run_id'])
best_model_uri = f"{best_run.info.artifact_uri}/model"
# print best run info
#print(f"Run id: {best_run.info.run_id}")
#print(f"Run parameters: {best_run.data.params}")
#print("Run score: RMSE = {:.4f}\n\n".format(best_run.data.metrics['eval_rmse']))
print("")
print(f"Your best model artifact URI: {best_model_uri}")

**Select and copy (CTRL+C)** the model artifact URI above to **your clipboard**. You will need it in the next part of the lab to register the model in the MLflow registry of HPE Ezmeral ML Ops. 
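
The way the model artifact URI is assembled from a `run` can be sketched as follows. This is an illustrative sketch only: the helper names `model_artifact_uri` and `runs_uri` and the example values are hypothetical, and the real URI for your `run` depends on how the tracking server is configured. The second helper builds the `runs:/<run_id>/<path>` form that MLflow model-loading functions accept. The registration step in the next lab still expects the full path.

```python
def model_artifact_uri(run_artifact_uri, artifact_path="model"):
    """Join a run's artifact URI and the model's artifact path without doubling slashes."""
    return f"{run_artifact_uri.rstrip('/')}/{artifact_path.strip('/')}"


def runs_uri(run_id, artifact_path="model"):
    """Build the run-relative URI form that MLflow resolves through the tracking server."""
    return f"runs:/{run_id}/{artifact_path.strip('/')}"


# Hypothetical values, for illustration only
example_artifact_uri = "s3://mlflow/1/0123456789abcdef/artifacts/"
print(model_artifact_uri(example_artifact_uri))
print(runs_uri("0123456789abcdef"))
```

Stripping the trailing slash before joining avoids a doubled `//` in the copied URI when the run's artifact location already ends with a slash.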

### **7- Model registry and deployment**

You are now ready to register your best performing trained model in the MLflow registry of HPE Ezmeral ML Ops, and deploy it to production to serve predictions. 

Continue with Lab 4.

<font color="red">**Go back to your JupyterHub account session to continue the hands-on from Lab 4 for model registry and model deployment:**</font>
<font color="blue">**4-WKSHP-MLOps-MLflow-Register-Model-Deployment.ipynb**.</font>


**======================================================================================**