
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Lab: Grid Search with MLflow

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lab you:<br>
 - Import the housing data
 - Perform grid search using scikit-learn
 - Log the best model on MLflow
 - Load the saved model

In [3]:
%run "./../Includes/Classroom-Setup"

## Data Import

Load the same Airbnb data as before and create a train/test split.

In [5]:
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("/dbfs/mnt/training/airbnb/sf-listings/airbnb-cleaned-mlflow.csv")
X_train, X_test, y_train, y_test = train_test_split(df.drop(["price"], axis=1), df[["price"]].values.ravel(), random_state=42)

## Perform Grid Search using scikit-learn

We want to know which combination of hyperparameter values is the most effective. Fill in the code below to perform <a href="https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html#sklearn.model_selection.GridSearchCV" target="_blank"> grid search using `sklearn`</a> over the 2 hyperparameters we looked at in the 02 notebook, `n_estimators` and `max_depth`.

In [7]:
# TODO
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# dictionary containing hyperparameter names and list of values we want to try
parameters = {'n_estimators': [100, 1000],
              'max_depth': [5, 10]}

rf = RandomForestRegressor()
grid_rf_model = GridSearchCV(rf, parameters, cv=3)
grid_rf_model.fit(X_train, y_train)

best_rf = grid_rf_model.best_estimator_
for p in parameters:
  print("Best '{}': {}".format(p, best_rf.get_params()[p]))
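
To get a feel for how much work the search above does, the grid can be expanded by hand. The following is a minimal sketch using only the standard library, not scikit-learn's own implementation. It assumes `GridSearchCV` keeps its default `refit=True`, so one extra fit on the full training set follows the cross-validated fits. The helper name `expand_grid` is hypothetical.

```python
from itertools import product


def expand_grid(grid):
    """Return every combination of values in a {name: [values]} grid as a list of dicts."""
    names = sorted(grid)
    return [dict(zip(names, values)) for values in product(*(grid[n] for n in names))]


# Same grid and fold count as the lab cell above
parameters = {"n_estimators": [100, 1000], "max_depth": [5, 10]}
cv = 3

candidates = expand_grid(parameters)

# Each candidate is fitted once per fold; the +1 assumes the default refit=True
total_fits = len(candidates) * cv + 1

for candidate in candidates:
    print(candidate)
print("candidates:", len(candidates), "total fits:", total_fits)
```

Adding a hyperparameter multiplies the number of candidates by the number of values tried for it, which is why wider grids get expensive quickly.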

## Log Best Model on MLflow

Log the best model as `random-forest-model`, together with its parameters and its MSE metric, under a run named `RF-Grid-Search` in our new MLflow experiment.

In [9]:
# TODO
from sklearn.metrics import mean_squared_error

with mlflow.start_run(run_name="RF-Grid-Search") as run:
  # Create predictions of X_test using best model
  predictions = best_rf.predict(X_test)
  
  # Log model with name
  mlflow.sklearn.log_model(best_rf, "random-forest-model")
  
  # Log params
  for p in parameters:
    mlflow.log_param(p, best_rf.get_params()[p])
  
  # Create and log MSE metrics using predictions of X_test and its actual value y_test
  mse = mean_squared_error(y_test, predictions)
  mlflow.log_metric("mse", mse)
  print(" mse: {}".format(mse))
  
  runID = run.info.run_uuid
  experimentID = run.info.experiment_id
  print("Inside MLflow Run with id {}".format(runID))

Check in the MLflow UI that the run `RF-Grid-Search` was logged with the best parameter values found by grid search.
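
The `mse` metric logged in this run is the mean of the squared differences between actual and predicted prices. A small worked sketch shows the arithmetic. The numbers are made up for illustration and are not taken from the Airbnb data, and the helper below is a hand-rolled stand-in for `mean_squared_error`, not its implementation.

```python
import math


def mse(actual, predicted):
    """Mean squared error of two equal-length sequences of numbers."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


# Hypothetical prices, for illustration only
y_true = [100.0, 150.0, 200.0]
y_pred = [110.0, 140.0, 230.0]

error = mse(y_true, y_pred)   # (10**2 + 10**2 + 30**2) / 3
rmse = math.sqrt(error)       # same units as the price itself

print("mse:", error)
print("rmse:", rmse)
```

Because the errors are squared, a single large miss dominates the metric, which is worth keeping in mind when comparing runs in the UI.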

## Load the Saved Model

Load the trained and tuned model we just saved. Check that the hyperparameters of this model match those of the best model we found earlier.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Use the `artifactURI` variable declared above.

In [12]:
from mlflow.tracking import MlflowClient

# Build the model location from the experiment's artifact root, the run ID, and the logged model name
artifactURL = MlflowClient().get_experiment(experimentID).artifact_location
modelURL = artifactURL + "/" + runID + "/artifacts/random-forest-model"
model = mlflow.sklearn.load_model(modelURL)
model
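
Concatenating the experiment's artifact location with the run ID relies on the default artifact layout. As a hedged alternative, MLflow also accepts model URIs of the form `runs:/<run_id>/<artifact_path>`, which leave the lookup to the tracking server. The sketch below only builds that string. The helper name `runs_uri` and the run ID are placeholders, and the commented call assumes an MLflow version that supports the `runs:/` scheme.

```python
def runs_uri(run_id, artifact_path):
    """Build a 'runs:/<run_id>/<artifact_path>' model URI string."""
    return "runs:/{}/{}".format(run_id, artifact_path.strip("/"))


# Placeholder run ID, for illustration only; in the lab this would be runID
example_run_id = "0123456789abcdef"
model_uri = runs_uri(example_run_id, "random-forest-model")
print(model_uri)

# In the notebook, the resulting URI could then be passed to the loader:
# model = mlflow.sklearn.load_model(model_uri)
```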

Time permitting, continue to grid search over a wider range of hyperparameters and log the best-performing parameters back to MLflow.

In [14]:
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# dictionary containing hyperparameter names and list of values we want to try
parameters = {'n_estimators': [100, 1000, 1500],
              'max_depth': [10, 20, 50],
              'max_features': [18, 21]}

rf = RandomForestRegressor()
grid_rf_model = GridSearchCV(rf, parameters, cv=3)
grid_rf_model.fit(X_train, y_train)

best_rf = grid_rf_model.best_estimator_
for p in parameters:
  print("Best '{}': {}".format(p, best_rf.get_params()[p]))

In [15]:
from sklearn.metrics import mean_squared_error

with mlflow.start_run(run_name="RF-Wide-Grid-Search") as run:
  # Create predictions of X_test using best model
  predictions = best_rf.predict(X_test)
  
  # Log model with name
  mlflow.sklearn.log_model(best_rf, "best-random-forest-model")
  
  # Log params
  for p in parameters:
    mlflow.log_param(p, best_rf.get_params()[p])
  
  # Create and log MSE metrics using predictions of X_test and its actual value y_test
  mse = mean_squared_error(y_test, predictions)
  mlflow.log_metric("mse", mse)
  print(" mse: {}".format(mse))
  
  runID = run.info.run_uuid
  experimentID = run.info.experiment_id
  print("Inside MLflow Run with id {}".format(runID))

Time permitting, use the `MlflowClient` to interact programmatically with your run.

In [17]:
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs = pd.DataFrame([(run.run_uuid, run.start_time, run.artifact_uri) for run in client.list_run_infos(experimentID)])
runs.columns = ["run_uuid", "start_time", "artifact_uri"]

display(runs)

| run_uuid | start_time | artifact_uri |
| --- | --- | --- |
| 437c3c4111b04b7bb00307d0d2243692 | 1586395567732 | dbfs:/databricks/mlflow/1913157811207672/437c3c4111b04b7bb00307d0d2243692/artifacts |
| f5430e5ae4a540f79206594d4a2bcd3c | 1586394972756 | dbfs:/databricks/mlflow/1913157811207672/f5430e5ae4a540f79206594d4a2bcd3c/artifacts |
| dfac861e74dc4352875347cae414c321 | 1586121775085 | dbfs:/databricks/mlflow/1913157811207672/dfac861e74dc4352875347cae414c321/artifacts |
| 14fc7c35b52d45e98d7f4a9077d0cb3a | 1586120448085 | dbfs:/databricks/mlflow/1913157811207672/14fc7c35b52d45e98d7f4a9077d0cb3a/artifacts |
| a97b30a82a314f4c85ff9967f5474796 | 1586120006048 | dbfs:/databricks/mlflow/1913157811207672/a97b30a82a314f4c85ff9967f5474796/artifacts |
| 60ce2d84685f493c953ae9d686587c5b | 1586119393644 | dbfs:/databricks/mlflow/1913157811207672/60ce2d84685f493c953ae9d686587c5b/artifacts |
| 598866e29af047dd9ede639379a592f6 | 1585964594532 | dbfs:/databricks/mlflow/1913157811207672/598866e29af047dd9ede639379a592f6/artifacts |
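
The `start_time` values in the output above are milliseconds since the Unix epoch. A minimal sketch for turning one of them into a readable UTC timestamp, using only the standard library (the helper name `ms_to_utc` is hypothetical):

```python
from datetime import datetime, timezone


def ms_to_utc(ms):
    """Convert milliseconds since the Unix epoch to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)


# start_time of the most recent run in the table above
started = ms_to_utc(1586395567732)
print(started.isoformat())
```

To convert the whole column at once, `pd.to_datetime(runs["start_time"], unit="ms")` should do the same job on the `runs` DataFrame.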


In [18]:
last_run = runs.sort_values("start_time", ascending=False).iloc[0]

dbutils.fs.ls(last_run["artifact_uri"]+"/best-random-forest-model/")

In [19]:
client.get_run(last_run.run_uuid).data.metrics

In [20]:
dbutils.fs.ls(last_run.artifact_uri)

In [21]:
import mlflow.sklearn

model = mlflow.sklearn.load_model(last_run.artifact_uri + "/best-random-forest-model/")
model.feature_importances_
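
`feature_importances_` is a plain array whose order follows the columns the model was trained on, so it is easier to read once paired with the column names. A minimal sketch follows. The feature names and importance values are invented for illustration; in the lab they would come from `X_train.columns` and `model.feature_importances_`. The helper name `rank_importances` is hypothetical.

```python
def rank_importances(names, importances):
    """Pair each feature name with its importance, sorted from most to least important."""
    if len(names) != len(importances):
        raise ValueError("names and importances must have the same length")
    return sorted(zip(names, importances), key=lambda pair: pair[1], reverse=True)


# Hypothetical values, for illustration only
feature_names = ["bedrooms", "bathrooms", "accommodates"]
importances = [0.5, 0.2, 0.3]

ranked = rank_importances(feature_names, importances)
for name, score in ranked:
    print("{}: {}".format(name, score))
```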

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Next Lesson<br>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the solutions folder for an example solution to this lab.

### [Start the next lesson, Packaging ML Projects.]($../03-Packaging-ML-Projects )

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>