
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Capstone Project: Managing the Machine Learning Lifecycle

Create a workflow that includes pre-processing logic, the optimal ML algorithm and hyperparameters, and post-processing logic.

## Instructions

In this course, we've primarily used Random Forest in `sklearn` to model the Airbnb dataset.  In this exercise, perform the following tasks:
<br><br>
0. Create custom pre-processing logic to featurize the data
0. Try a number of different algorithms and hyperparameters.  Choose the most performant solution
0. Create related post-processing logic
0. Package the results and execute the package as its own run

Run the following cell.

In [4]:
%run "./Includes/Classroom-Setup"

Clear the project directory in case you have lingering files from other runs, then create a fresh directory.  Use this directory throughout the notebook.

In [6]:
project_path = userhome+"/ml-production/Capstone/"

dbutils.fs.rm(project_path, True)
dbutils.fs.mkdirs(project_path)

print("Created directory: {}".format(project_path))

## Pre-processing

Take a look at the dataset and notice that there are plenty of strings and `NaN` values present. Our end goal is to train a sklearn regression model to predict the price of an Airbnb listing.


Before we can start training, we need to pre-process our data to be compatible with sklearn models by making all features purely numerical.

In [8]:
import pandas as pd

airbnbDF = spark.read.parquet("/mnt/training/airbnb/sf-listings/sf-listings-correct-types.parquet").toPandas()
display(airbnbDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
t,moderate,f,1.0,Western Addition,94117.0,37.769310377340766,-122.43385634489,Apartment,Entire home/apt,3.0,1.0,1.0,2.0,Real Bed,1.0,127.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$170.00
f,strict,f,2.0,Bernal Heights,94110.0,37.745112331410034,-122.42101788836888,Apartment,Entire home/apt,5.0,1.0,2.0,3.0,Real Bed,30.0,112.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$235.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.766689597862175,-122.45250461761628,Apartment,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,17.0,85.0,8.0,8.0,9.0,9.0,9.0,8.0,$65.00
t,moderate,t,4.0,Outer Mission,94127.0,37.73074592978503,-122.44840862635226,House,Private room,1.0,2.0,1.0,1.0,Real Bed,3.0,76.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$60.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.76487219421756,-122.45182799146508,House,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,7.0,91.0,9.0,9.0,9.0,9.0,9.0,9.0,$65.00
f,strict,f,2.0,Western Addition,94117.0,37.77524858589268,-122.43637374831292,House,Entire home/apt,5.0,1.5,2.0,2.0,Real Bed,5.0,26.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$575.00
f,moderate,f,1.0,Western Addition,94115.0,37.78470745496073,-122.44555431261594,Apartment,Entire home/apt,7.0,1.0,2.0,1.0,Real Bed,2.0,27.0,88.0,9.0,7.0,10.0,10.0,9.0,9.0,$255.00
t,moderate,f,2.0,Mission,94110.0,37.75918889708064,-122.42236687240562,Apartment,Private room,3.0,1.0,1.0,2.0,Real Bed,1.0,559.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$139.00
f,moderate,f,1.0,Mission,94110.0,37.75174004606522,-122.4094205953428,Apartment,Entire home/apt,4.0,2.5,3.0,3.0,Real Bed,3.0,24.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$285.00
f,strict,t,1.0,Potrero Hill,94107.0,37.76258885144137,-122.40543055237004,House,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,386.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,$135.00


In the following cells we will walk you through the most basic pre-processing steps necessary. Feel free to add additional steps afterwards to improve your model performance.

First, convert the `price` from a string to a float since the regression model will be predicting numerical values.

In [11]:
# Strip the currency symbol and thousands separators, then cast to float
airbnbDF['price'] = airbnbDF['price'].apply(lambda x: x.replace('$', '').replace(',', '')).astype('float')
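
As a quick sanity check of the conversion above, here is a minimal sketch on a toy Series. The sample values are made up for illustration, and the comma thousands separator is an assumption implied by the `replace(',', "")` call rather than something visible in the rows shown earlier.

```python
import pandas as pd

# Toy price strings in the "$1,234.00" style (values are illustrative)
raw_prices = pd.Series(["$170.00", "$1,250.00", "$65.00"])

# Vectorized equivalent of the lambda: drop "$" and "," then cast to float
clean_prices = raw_prices.str.replace("[$,]", "", regex=True).astype(float)

print(clean_prices.tolist())
```

The vectorized `.str.replace` form does the same work as the `apply` call and is one option if you prefer to avoid a Python-level lambda.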

Take a look at our remaining columns with strings (or numbers) and decide if you would like to keep them as features or not.

Remove the features you decide not to keep.

In [13]:
airbnbDF=airbnbDF.drop(['review_scores_accuracy', 'review_scores_cleanliness', 'review_scores_checkin', 'review_scores_communication','review_scores_location','review_scores_value'],axis=1)

For the string columns that you've decided to keep, pick a numerical encoding. Don't forget to deal with the `NaN` entries in those columns first.

In [15]:
airbnb_processed=airbnbDF.copy()

In [16]:
# Assign the filled column back instead of calling fillna(inplace=True) on a column selection
airbnb_processed["review_scores_rating"] = airbnb_processed["review_scores_rating"].fillna(airbnb_processed["review_scores_rating"].mean())
airbnb_processed["host_is_superhost"] = airbnb_processed["host_is_superhost"].fillna('f')
airbnb_processed["host_total_listings_count"] = airbnb_processed["host_total_listings_count"].fillna(1.0)
airbnb_processed["zipcode"] = airbnb_processed["zipcode"].fillna('94110')
airbnb_processed["bathrooms"] = airbnb_processed["bathrooms"].fillna(airbnb_processed["bathrooms"].mean())
airbnb_processed["beds"] = airbnb_processed["beds"].fillna(airbnb_processed["beds"].mean())
airbnb_processed["bed_type"] = airbnb_processed["bed_type"].fillna('Real Bed')
# One-hot encode the remaining string columns, then drop rows that still contain NaN
airbnb_processed = pd.get_dummies(airbnb_processed, drop_first=True)
airbnb_processed.dropna(inplace=True)
# Round coordinates to three decimal places
airbnb_processed["latitude"] = airbnb_processed["latitude"].round(3)
airbnb_processed["longitude"] = airbnb_processed["longitude"].round(3)
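
The fill-then-encode pattern used in this cell can be sketched on a toy frame. The column names mirror the dataset, but the values and the chosen default (`"Private room"`) are illustrative assumptions, not recommendations.

```python
import pandas as pd

# Toy frame with one numeric and one string column, each with a missing value
toy = pd.DataFrame({
    "beds": [1.0, None, 3.0],
    "room_type": ["Private room", None, "Entire home/apt"],
})

# Fill numeric gaps with the column mean and string gaps with a chosen default
toy["beds"] = toy["beds"].fillna(toy["beds"].mean())
toy["room_type"] = toy["room_type"].fillna("Private room")

# One-hot encode the string column; drop_first removes one redundant level
encoded = pd.get_dummies(toy, drop_first=True, dtype=float)

print(list(encoded.columns))
print(encoded.isna().sum().sum())
```

Filling before encoding matters here: by default `get_dummies` does not create a column for missing values, so an unfilled `NaN` in a string column would silently become an all-zero row.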


Before we create a train test split, check that all your columns are numerical. Remember to drop the original string columns after creating numerical representations of them.

Make sure to drop the price column from the training data when doing the train test split.

In [18]:
# TODO
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(airbnb_processed.drop(["price"],axis=1),airbnb_processed[["price"]].values.ravel(),
                                                   random_state=42)
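
One way to check that every column is numerical before splitting is sketched below. `non_numeric_columns` is a hypothetical helper, not part of the course code, and the toy frame is illustrative. Note that boolean columns are also reported by this check, since pandas does not count `bool` as a numeric dtype.

```python
import pandas as pd

def non_numeric_columns(df):
    """Return the names of columns whose dtype is not numeric."""
    return list(df.select_dtypes(exclude="number").columns)

toy = pd.DataFrame({
    "accommodates": [3.0, 5.0],
    "bed_type": ["Real Bed", "Futon"],
    "price": [170.0, 235.0],
})

print(non_numeric_columns(toy))
```

An empty list from this helper on `airbnb_processed` would indicate that no string columns remain.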

## Model

After cleaning our data, we can start creating our model!

First, if there are still `NaN`s in your data, you may want to impute these values instead of dropping those entries entirely. Make sure that any further processing or imputing steps after the train test split are part of a model or pipeline that can be saved.
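
A minimal sketch of keeping imputation inside a saveable object, assuming scikit-learn's `Pipeline` and `SimpleImputer` are available in your environment. The toy data, the step names, and the small `n_estimators` value are illustrative only.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Toy feature matrix with missing entries (values are illustrative)
X_toy = np.array([[1.0, np.nan], [2.0, 3.0], [np.nan, 5.0], [4.0, 7.0]])
y_toy = np.array([10.0, 20.0, 30.0, 40.0])

# The imputer learns its fill values from the training data only,
# and is saved and reloaded together with the regressor
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="mean")),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=42)),
])
pipeline.fit(X_toy, y_toy)

# New rows with missing values can be scored without extra pre-processing
preds = pipeline.predict(np.array([[np.nan, 4.0]]))
print(preds.shape)
```

Because the fitted pipeline is a single estimator, it can be logged the same way as a bare model, and the imputation travels with it.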

In the following cell, create and fit a single sklearn model.

In [21]:
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# dictionary containing hyperparameter names and list of values we want to try
parameters = {"n_estimators": [1000,2000],
              "max_depth": [120,200]}

rf = RandomForestRegressor()
grid_rf_model = GridSearchCV(rf, parameters, cv=3)
grid_rf_model.fit(X_train, y_train)

best_rf = grid_rf_model.best_estimator_
for p in parameters:
  print("Best '{}': {}".format(p, best_rf.get_params()[p]))
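
To get a feel for the cost of this search, the sketch below enumerates the grid with the standard library only. It mirrors the `parameters` dictionary and `cv=3` from the cell above. The count excludes the one extra fit `GridSearchCV` performs when it refits the best candidate on the full training set.

```python
from itertools import product

# Same grid and fold count as the cell above
parameters = {"n_estimators": [1000, 2000], "max_depth": [120, 200]}
cv_folds = 3

# Every combination of the listed values is one candidate model
candidates = [dict(zip(parameters, values)) for values in product(*parameters.values())]
total_fits = len(candidates) * cv_folds

print(len(candidates), total_fits)
```

With forests of 1000 to 2000 trees, each of these fits is expensive, so a smaller grid may be worth trying first.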

Pick and calculate a regression metric for evaluating your model.

In [23]:
from sklearn.metrics import mean_squared_error

best_rf_mse = mean_squared_error(y_test,best_rf.predict(X_test))

best_rf_mse
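
MSE is expressed in squared dollars, which can be hard to interpret. As a sketch, the root mean squared error (RMSE) brings the metric back to the units of `price`. The numbers below are illustrative, not results from the Airbnb model.

```python
import numpy as np

# Illustrative true and predicted nightly prices
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

mse = np.mean((y_true - y_pred) ** 2)   # average squared error
rmse = np.sqrt(mse)                     # back in the units of the target

print(round(mse, 2), round(rmse, 2))
```

Whichever metric you pick, log the same one for every model so the runs are comparable.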

Log your model to MLflow with the same metric you calculated above so we can compare all the different models you have tried! Make sure to also log any hyperparameters that you plan on tuning!

In [25]:
import mlflow.sklearn

with mlflow.start_run(run_name="RF Model Capstone") as run:
  mlflow.sklearn.log_model(best_rf, "model")

  mlflow.log_metric("mse", best_rf_mse)

  # Log the tuned hyperparameters of the best estimator
  for p in parameters:
    mlflow.log_param(p, best_rf.get_params()[p])

  sklearnRunID = run.info.run_id  # run_uuid is deprecated in favor of run_id
  sklearnURI = run.info.artifact_uri
  experimentID = run.info.experiment_id

Change and re-run the three code cells above to log different models and/or models with different hyperparameters until you are satisfied with the performance of at least one of them.

In [27]:
import tensorflow as tf
tf.set_random_seed(42) # For reproducibility

from keras.models import Sequential
from keras.layers import Dense

nn = Sequential([
  Dense(200, input_dim=X_train.shape[1], activation='relu'),  # match the number of feature columns
  Dense(100, activation='relu'),
  Dense(1, activation='linear')
])

nn.compile(optimizer="adam", loss="mse")
nn.fit(X_train,y_train,validation_split=.25, epochs=20, verbose=1)

In [28]:
nn_mse = mean_squared_error(y_test,nn.predict(X_test)) 

nn_mse

In [29]:
import mlflow.keras

with mlflow.start_run(run_name="NN Model Capstone") as run:
  mlflow.keras.log_model(nn, "model")
  mlflow.log_metric("mse", nn_mse)

  kerasRunID = run.info.run_id  # run_uuid is deprecated in favor of run_id
  kerasURI = run.info.artifact_uri

Look through the MLflow UI for the best model. Copy its `URI` so you can load it as a `pyfunc` model.

In [31]:
print(dbutils.fs.head(sklearnURI+"/model/MLmodel"))

In [32]:
import mlflow.pyfunc

rf_pyfunc_capstone_model = mlflow.pyfunc.load_model(model_uri=(sklearnURI+"/model").replace("dbfs:","/dbfs")) 
type(rf_pyfunc_capstone_model)

## Post-processing

Our model currently gives us the predicted price per night for each Airbnb listing. Now we would like our model to tell us what the price per person would be for each listing, assuming the number of renters is equal to the `accommodates` value.

Fill in the following model class to add in a post-processing step which will get us from total price per night to **price per person per night**.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out <a href="https://www.mlflow.org/docs/latest/models.html#id13" target="_blank">the MLflow docs for help.</a>

In [35]:
# TODO

class Airbnb_Model(mlflow.pyfunc.PythonModel):

    def __init__(self, model):
        self.rf = model

    def predict(self, context, model_input):
        # Predict the total nightly price, then divide by the guest capacity
        price = self.rf.predict(model_input)
        accommodates = model_input['accommodates'].to_numpy()
        return price / accommodates
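
The post-processing arithmetic can be checked independently of MLflow. In the sketch below, `StubRegressor` and `price_per_person` are hypothetical names introduced for illustration, and the fixed prices are made up.

```python
import numpy as np
import pandas as pd

class StubRegressor:
    """Hypothetical stand-in for the fitted model; returns fixed prices."""
    def predict(self, X):
        return np.array([170.0, 235.0])

def price_per_person(model, model_input):
    # Predict the total nightly price, then divide by the guest capacity
    total_price = model.predict(model_input)
    return total_price / model_input["accommodates"].to_numpy()

toy_input = pd.DataFrame({"accommodates": [2.0, 5.0]})
per_person = price_per_person(StubRegressor(), toy_input)
print(per_person)
```

Testing the logic with a stub like this makes it easier to tell a post-processing bug apart from a model-loading problem.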

Construct and save the model to the given `final_model_path`.

In [37]:
final_model_path =  project_path.replace("dbfs:", "/dbfs") + "model"
dbutils.fs.rm(project_path + "model", True) # remove folder if it already exists; dbutils takes the DBFS path, not the /dbfs mount path

rf_postprocess_model = Airbnb_Model(model = best_rf)
mlflow.pyfunc.save_model(path=final_model_path, python_model=rf_postprocess_model)

Load the model in `python_function` format and apply it to our test data `X_test` to check that we are getting price per person predictions now.

In [39]:
loaded_postprocess_model = mlflow.pyfunc.load_model(final_model_path)  # load_pyfunc is deprecated in favor of load_model
loaded_postprocess_model.predict(X_test)

## Packaging your Model

Now we would like to package our completed model!

First save your testing data at `test_data_path` so we can test the packaged model.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** When using `.to_csv` make sure to set `index=False` so you don't end up with an extra index column in your saved dataframe.

In [42]:
# TODO
#save the testing data 
test_data_path = project_path.replace("dbfs:", "/dbfs") + "test_data.csv"
dbutils.fs.rm(project_path + "test_data.csv", True) # Removes the file if it already exists; dbutils takes the DBFS path, not the /dbfs mount path
X_test.to_csv(test_data_path,index=False)

prediction_path = project_path.replace("dbfs:", "/dbfs") + "predictions.csv"
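
The effect of `index=False` can be seen with an in-memory round trip. This is a small sketch with an illustrative frame; it writes to a string buffer rather than to DBFS.

```python
import io
import pandas as pd

toy = pd.DataFrame({"accommodates": [3.0, 5.0], "beds": [2.0, 3.0]})

# With index=False the saved columns match the original feature columns
without_index = pd.read_csv(io.StringIO(toy.to_csv(index=False)))

# With the default index=True an extra unnamed column appears on reload
with_index = pd.read_csv(io.StringIO(toy.to_csv()))

print(list(without_index.columns))
print(list(with_index.columns))
```

An extra column like this would change the feature set the model sees at prediction time, which is why the hint matters.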

Next, decide what the project script should do. Fill out the `model_predict` function to load the trained model you just saved (at `final_model_path`) and make price-per-person predictions on the data at `test_data_path`. Then save those predictions under `prediction_path` for the user to access later.

Run the cell to check that your function behaves correctly and that predictions are saved at `demo_prediction_path`.

In [44]:
# TODO
import click
import mlflow.pyfunc
import pandas as pd

@click.command()
@click.option("--final_model_path", default="", type=str)
@click.option("--test_data_path", default="", type=str)
@click.option("--prediction_path", default="", type=str)
def model_predict(final_model_path, test_data_path, prediction_path):
  with mlflow.start_run() as run:
    loaded_postprocess_model = mlflow.pyfunc.load_model(final_model_path)
    test_data = pd.read_csv(test_data_path)
    predicted_array=loaded_postprocess_model.predict(test_data)
    pd.DataFrame(predicted_array).to_csv(prediction_path,index=False)


# test model_predict function    
demo_prediction_path = project_path.replace("dbfs:", "/dbfs") + "demo_predictions.csv"

from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(model_predict, ['--final_model_path', final_model_path, 
                                       '--test_data_path', test_data_path,
                                       '--prediction_path', demo_prediction_path], catch_exceptions=True)

assert result.exit_code == 0, "Code failed" # Check to see that it worked
print("Price per person predictions: ")
print(pd.read_csv(demo_prediction_path))
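
When the assertion on `exit_code` fails, the `Result` object returned by `CliRunner.invoke` carries the details. The sketch below uses a hypothetical toy command, `write_predictions`, to show where the output and the caught exception end up. It assumes `click` is installed, as in the cell above.

```python
import click
from click.testing import CliRunner

@click.command()
@click.option("--prediction_path", default="", type=str)
def write_predictions(prediction_path):
    # Toy command: fail loudly when the path is missing
    if not prediction_path:
        raise ValueError("prediction_path must not be empty")
    click.echo("would write to {}".format(prediction_path))

runner = CliRunner()
ok = runner.invoke(write_predictions, ["--prediction_path", "/tmp/demo.csv"])
failed = runner.invoke(write_predictions, [])

print(ok.exit_code, ok.output.strip())
print(failed.exit_code, repr(failed.exception))
```

Printing `result.exception` before the assertion is one way to see why a run failed instead of only learning that it did.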

Next, we will create an MLproject file and put it under our `project_path`. Complete the parameters and command of the file.

In [46]:
# TODO
dbutils.fs.put(project_path + "MLproject", 
'''
name: Capstone-Project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      final_model_path: {type: string, default: "/dbfs/user/shashank.rao@rhsmith.umd.edu/ml-production/Capstone/model"}
      test_data_path: {type: string, default: "/dbfs/user/shashank.rao@rhsmith.umd.edu/ml-production/Capstone/test_data.csv"}
      prediction_path: {type: string, default: "/dbfs/user/shashank.rao@rhsmith.umd.edu/ml-production/Capstone/predictions.csv"}
    command:  "python predict.py --final_model_path {final_model_path} --test_data_path {test_data_path} --prediction_path {prediction_path}"
'''.strip(), overwrite=True)
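
Hard-coding user-specific paths into the file makes it harder to reuse. As a sketch, the defaults could be filled in from variables instead. `render_mlproject` and `MLPROJECT_TEMPLATE` are hypothetical names, the paths below are illustrative, and the parameter type is written as `string`, one of the type names listed in the MLflow Projects documentation.

```python
# Doubled braces survive str.format as literal braces
MLPROJECT_TEMPLATE = '''
name: Capstone-Project
conda_env: conda.yaml
entry_points:
  main:
    parameters:
      final_model_path: {{type: string, default: "{model}"}}
      test_data_path: {{type: string, default: "{data}"}}
      prediction_path: {{type: string, default: "{predictions}"}}
    command: "python predict.py --final_model_path {{final_model_path}} --test_data_path {{test_data_path}} --prediction_path {{prediction_path}}"
'''.strip()

def render_mlproject(model, data, predictions):
    """Hypothetical helper: fill the three default paths into the template."""
    return MLPROJECT_TEMPLATE.format(model=model, data=data, predictions=predictions)

# Illustrative paths; in the notebook these would be the path variables defined earlier
mlproject_text = render_mlproject(
    "/dbfs/tmp/capstone/model",
    "/dbfs/tmp/capstone/test_data.csv",
    "/dbfs/tmp/capstone/predictions.csv",
)
print(mlproject_text)
```

The rendered text could then be passed to `dbutils.fs.put` in place of the literal string.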

We then create a `conda.yaml` file to list the dependencies needed to run our script.

In [48]:
dbutils.fs.put(project_path + "conda.yaml", 
'''
name: Capstone
channels:
  - defaults
dependencies:
  - cloudpickle=0.8.0
  - numpy=1.16.2
  - pandas=0.24.2
  - scikit-learn=0.20.3
  - pip:
    - mlflow==1.5.0
'''.strip(), overwrite=True)

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You can check that the versions match your current environment using the following cell.

In [50]:
import cloudpickle
print("cloudpickle: " + cloudpickle.__version__)
import numpy
print("numpy: " + numpy.__version__)
import pandas
print("pandas: " + pandas.__version__)
import sklearn
print("sklearn: " + sklearn.__version__)
import mlflow
print("mlflow: " + mlflow.__version__)
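
Rather than comparing the printed versions by eye, the check can be automated. This is a sketch only: `find_mismatches` is a hypothetical helper, the pins are copied from the `conda.yaml` above, and the "installed" versions are made up to show a mismatch.

```python
def find_mismatches(pinned, installed):
    """Return {package: (pinned, installed)} for every version that differs."""
    return {
        name: (version, installed.get(name))
        for name, version in pinned.items()
        if installed.get(name) != version
    }

# Pins copied from conda.yaml above; the installed versions are illustrative
pinned = {"cloudpickle": "0.8.0", "numpy": "1.16.2", "pandas": "0.24.2",
          "scikit-learn": "0.20.3", "mlflow": "1.5.0"}
installed = dict(pinned, numpy="1.17.0")

print(find_mismatches(pinned, installed))
```

In the notebook, `installed` would be built from the `__version__` attributes printed in the cell above.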

Now we will put the `predict.py` script into our project package. Complete the `.py` file by copying and placing the `model_predict` function you defined above.

In [52]:
# TODO
dbutils.fs.put(project_path + "predict.py", 
'''
import click
import mlflow.pyfunc
import pandas as pd
import numpy as np
import traceback

@click.command()
@click.option("--final_model_path", default="", type=str)
@click.option("--test_data_path", default="", type=str)
@click.option("--prediction_path", default="", type=str)
def model_predict(final_model_path, test_data_path, prediction_path):
  with mlflow.start_run() as run:
    loaded_postprocess_model = mlflow.pyfunc.load_model(final_model_path)
    test_data = pd.read_csv(test_data_path)
    predicted_array = loaded_postprocess_model.predict(test_data)
    pd.DataFrame(predicted_array).to_csv(prediction_path,index=False)
     
if __name__ == "__main__":
   model_predict()
'''.strip(), overwrite=True)

Let's double check that all the files we've created are in the `project_path` folder. You should have at least the following three files:
* `MLproject`
* `conda.yaml`
* `predict.py`

In [54]:
dbutils.fs.ls(project_path)

Under `project_path` is your completely packaged project. Run the project to use the model saved at `final_model_path` to predict the price per person of each Airbnb listing in `test_data_path` and save those predictions under `prediction_path`.

In [56]:
# TODO
import mlflow

mlflow.projects.run(uri=project_path.replace("dbfs:","/dbfs"),
  parameters={
    "final_model_path": final_model_path,
    "test_data_path": test_data_path,
    "prediction_path": prediction_path
})

Run the following cell to check that your model's predictions are there!

In [58]:
print("Price per person predictions: ")
print(pd.read_csv(prediction_path))

Run the following command to clear the project and data files from your directory.

In [60]:
dbutils.fs.rm(project_path, True)

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>