
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Capstone Project: Managing the Machine Learning Lifecycle

Create a workflow that includes pre-processing logic, the optimal ML algorithm and hyperparameters, and post-processing logic.

## Instructions

In this course, we've primarily used Random Forest in `sklearn` to model the Airbnb dataset.  In this exercise, perform the following tasks:
<br><br>
0. Create custom pre-processing logic to featurize the data
0. Try a number of different algorithms and hyperparameters.  Choose the most performant solution
0. Create related post-processing logic
0. Package the results and execute it as its own run

Run the following cell.

In [0]:
%run "./Includes/Classroom-Setup"

Clear the project directory in case you have lingering files from other runs.  Create a fresh directory.  Use this throughout this notebook.

In [0]:
project_path = userhome+"/ml-production/Capstone/"

dbutils.fs.rm(project_path, True)
dbutils.fs.mkdirs(project_path)

print("Created directory: {}".format(project_path))

## Pre-processing

Take a look at the dataset and notice that there are plenty of strings and `NaN` values present. Our end goal is to train a sklearn regression model to predict the price of an Airbnb listing.


Before we can start training, we need to pre-process our data to be compatible with sklearn models by making all features purely numerical.
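
As a rough sketch of how to find what still needs work, the snippet below lists the non-numeric columns and the number of missing values per column. The `toy` DataFrame and its values are made up for illustration; on the real data you would run the same calls on `airbnbDF`.

```python
import pandas as pd

# Toy stand-in for the listings table (values are illustrative only).
toy = pd.DataFrame({
    "room_type": ["Private room", "Entire home/apt", None],
    "beds": [1.0, None, 2.0],
    "price": ["$60.00", "$170.00", "$1,250.00"],
})

# Columns that are not numeric and therefore need encoding or conversion.
non_numeric_cols = toy.select_dtypes(exclude="number").columns.tolist()

# Number of missing values in each column.
missing_counts = toy.isna().sum().to_dict()

print(non_numeric_cols)
print(missing_counts)
```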

In [0]:
import pandas as pd

airbnbDF = spark.read.parquet("/mnt/training/airbnb/sf-listings/sf-listings-correct-types.parquet").toPandas()
display(airbnbDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
t,moderate,f,1.0,Western Addition,94117.0,37.769310377340766,-122.43385634489,Apartment,Entire home/apt,3.0,1.0,1.0,2.0,Real Bed,1.0,127.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$170.00
f,strict,f,2.0,Bernal Heights,94110.0,37.745112331410034,-122.42101788836888,Apartment,Entire home/apt,5.0,1.0,2.0,3.0,Real Bed,30.0,112.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$235.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.766689597862175,-122.45250461761628,Apartment,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,17.0,85.0,8.0,8.0,9.0,9.0,9.0,8.0,$65.00
t,moderate,t,4.0,Outer Mission,94127.0,37.73074592978503,-122.44840862635226,House,Private room,1.0,2.0,1.0,1.0,Real Bed,3.0,76.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$60.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.76487219421756,-122.45182799146508,House,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,7.0,91.0,9.0,9.0,9.0,9.0,9.0,9.0,$65.00
f,strict,f,2.0,Western Addition,94117.0,37.77524858589268,-122.43637374831292,House,Entire home/apt,5.0,1.5,2.0,2.0,Real Bed,5.0,26.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$575.00
f,moderate,f,1.0,Western Addition,94115.0,37.78470745496073,-122.44555431261594,Apartment,Entire home/apt,7.0,1.0,2.0,1.0,Real Bed,2.0,27.0,88.0,9.0,7.0,10.0,10.0,9.0,9.0,$255.00
t,moderate,f,2.0,Mission,94110.0,37.75918889708064,-122.42236687240562,Apartment,Private room,3.0,1.0,1.0,2.0,Real Bed,1.0,559.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$139.00
f,moderate,f,1.0,Mission,94110.0,37.75174004606522,-122.4094205953428,Apartment,Entire home/apt,4.0,2.5,3.0,3.0,Real Bed,3.0,24.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$285.00
f,strict,t,1.0,Potrero Hill,94107.0,37.76258885144137,-122.40543055237004,House,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,386.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,$135.00


In the following cells we will walk you through the most basic pre-processing steps necessary. Feel free to add additional steps afterwards to improve your model performance.

First, convert the `price` from a string to a float since the regression model will be predicting numerical values.
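
A minimal sketch of this conversion on a made-up Series, assuming a pandas version that supports the `regex` argument of `str.replace`. Passing `regex=False` makes it explicit that `$` is meant as a literal character and not as the regular-expression end-of-string anchor.

```python
import pandas as pd

# Made-up price strings in the same format as the `price` column.
raw_prices = pd.Series(["$170.00", "$1,250.50", "$65.00"])

# Strip the currency symbol and thousands separator as literal characters,
# then convert the remaining digits to float.
clean_prices = (
    raw_prices.str.replace("$", "", regex=False)
              .str.replace(",", "", regex=False)
              .astype(float)
)

print(clean_prices.tolist())
```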

In [0]:
airbnbDF.shape #check shape of airbnb dataset

In [0]:
airbnbDF['price'].apply(type).value_counts() #check data type

In [0]:
airbnbDF['price'] = airbnbDF['price'].str.replace("$","", regex=False) #regex=False treats $ as a literal character, not a regex anchor
airbnbDF['price'] = airbnbDF['price'].str.replace(",","", regex=False) #strip $ and , from string before converting to float

In [0]:
airbnbDF['price']=airbnbDF['price'].astype(float) #convert string to float

In [0]:
airbnbDF['price'].apply(type).value_counts() #check new data type

Take a look at our remaining columns with strings (or numbers) and decide if you would like to keep them as features or not.

Remove the features you decide not to keep.

In [0]:
airbnbDF = airbnbDF.drop(['host_total_listings_count','zipcode','latitude','longitude'], axis=1) #drop columns we dont need

In [0]:
airbnbDF.dtypes #see data types 

For the string columns that you've decided to keep, pick a numerical encoding. Don't forget to deal with the `NaN` entries in those columns first.
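
One possible way to handle both steps is sketched below on made-up data: missing categories are replaced with an explicit `"unknown"` label (an assumption for this example, not something the dataset defines) and the string columns are then one-hot encoded with `pd.get_dummies`.

```python
import pandas as pd

# Made-up rows; the None entry stands in for a missing category.
toy = pd.DataFrame({
    "host_is_superhost": ["t", "f", None],
    "room_type": ["Private room", "Entire home/apt", "Private room"],
    "beds": [1.0, 2.0, 1.0],
})

# Give missing values their own label so they get their own dummy column.
toy["host_is_superhost"] = toy["host_is_superhost"].fillna("unknown")

# One-hot encode the string columns; the original columns are dropped automatically.
encoded = pd.get_dummies(toy, columns=["host_is_superhost", "room_type"]).astype(float)

print(encoded.columns.tolist())
```

Without the `fillna` step, `pd.get_dummies` leaves rows with a missing category as all zeros across that column's dummies, which may or may not be what you want.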

In [0]:
airbnbDF["host_is_superhost"].isnull().value_counts() #count of null values for string

In [0]:
airbnbDF['host_is_superhost'] = pd.Categorical(airbnbDF['host_is_superhost']) #convert string to pandas categorical object
airbnbDF["cancellation_policy"] = pd.Categorical(airbnbDF["cancellation_policy"])
airbnbDF["instant_bookable"] = pd.Categorical(airbnbDF["instant_bookable"])
airbnbDF["neighbourhood_cleansed"] = pd.Categorical(airbnbDF["neighbourhood_cleansed"])
airbnbDF["property_type"] = pd.Categorical(airbnbDF["property_type"])
airbnbDF["room_type"] = pd.Categorical(airbnbDF["room_type"])
airbnbDF["bed_type"] = pd.Categorical(airbnbDF["bed_type"])

In [0]:
airbnbDF_dummies_1 = pd.get_dummies(airbnbDF['host_is_superhost'], prefix = 'host_is_superhost') #convert categorical variables to dummy variables
airbnbDF_dummies_2 = pd.get_dummies(airbnbDF["cancellation_policy"], prefix = 'cancellation_policy')
airbnbDF_dummies_3 = pd.get_dummies(airbnbDF["instant_bookable"], prefix = 'instant_bookable')
airbnbDF_dummies_4 = pd.get_dummies(airbnbDF["neighbourhood_cleansed"], prefix = 'neighbourhood_cleansed')
airbnbDF_dummies_5 = pd.get_dummies(airbnbDF["property_type"], prefix = 'property_type')
airbnbDF_dummies_6 = pd.get_dummies(airbnbDF["room_type"], prefix = 'room_type')
airbnbDF_dummies_7 = pd.get_dummies(airbnbDF["bed_type"], prefix = 'bed_type')

In [0]:
airbnbDF = pd.concat([airbnbDF, airbnbDF_dummies_1, airbnbDF_dummies_2, airbnbDF_dummies_3, airbnbDF_dummies_4, airbnbDF_dummies_5, airbnbDF_dummies_6, airbnbDF_dummies_7], axis=1 ) #concat dummies to df

In [0]:
airbnbDF.drop(["host_is_superhost","cancellation_policy",'instant_bookable','neighbourhood_cleansed','property_type','room_type','bed_type'],inplace =True, axis=1) #drop string cols

In [0]:
airbnbDF.describe() #see summary statistics

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,host_is_superhost_f,host_is_superhost_t,cancellation_policy_flexible,cancellation_policy_moderate,cancellation_policy_strict,cancellation_policy_super_strict_30,cancellation_policy_super_strict_60,instant_bookable_f,instant_bookable_t,neighbourhood_cleansed_Bayview,neighbourhood_cleansed_Bernal Heights,neighbourhood_cleansed_Castro/Upper Market,neighbourhood_cleansed_Chinatown,neighbourhood_cleansed_Crocker Amazon,neighbourhood_cleansed_Diamond Heights,neighbourhood_cleansed_Downtown/Civic Center,neighbourhood_cleansed_Excelsior,neighbourhood_cleansed_Financial District,neighbourhood_cleansed_Glen Park,neighbourhood_cleansed_Golden Gate Park,neighbourhood_cleansed_Haight Ashbury,neighbourhood_cleansed_Inner Richmond,neighbourhood_cleansed_Inner Sunset,neighbourhood_cleansed_Lakeshore,neighbourhood_cleansed_Marina,neighbourhood_cleansed_Mission,...,neighbourhood_cleansed_Seacliff,neighbourhood_cleansed_South of Market,neighbourhood_cleansed_Twin Peaks,neighbourhood_cleansed_Visitacion Valley,neighbourhood_cleansed_West of Twin Peaks,neighbourhood_cleansed_Western Addition,property_type_Aparthotel,property_type_Apartment,property_type_Barn,property_type_Bed and breakfast,property_type_Boat,property_type_Boutique hotel,property_type_Bungalow,property_type_Cabin,property_type_Castle,property_type_Condominium,property_type_Dorm,property_type_Guest suite,property_type_Guesthouse,property_type_Hostel,property_type_Hotel,property_type_House,property_type_In-law,property_type_Loft,property_type_Other,property_type_Resort,property_type_Serviced apartment,property_type_Timeshare,property_type_Tiny house,property_type_Townhouse,property_type_Treehouse,property_type_Vacation home,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,bed_type_Airbed,bed_type_Couch,bed_type_Futon,bed_type_Pull-out Sofa,bed_type_Real Bed
count,4804.0,4781.0,4804.0,4798.0,4804.0,4804.0,4370.0,4369.0,4370.0,4368.0,4369.0,4368.0,4367.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,...,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0
mean,3.429017,1.378896,1.457952,1.899125,20823.92,49.912781,95.787872,9.794232,9.673684,9.89011,9.88304,9.631639,9.496909,222.415903,0.574105,0.420067,0.158618,0.396128,0.440883,0.002914,0.001457,0.635928,0.364072,0.025604,0.06224,0.062656,0.010616,0.007286,0.003331,0.05995,0.022481,0.013114,0.011032,0.001041,0.04413,0.036636,0.021649,0.005412,0.025396,0.108659,...,0.003747,0.04975,0.0102,0.009575,0.020608,0.077435,0.000208,0.405079,0.000208,0.012281,0.000624,0.032057,0.002082,0.001457,0.001457,0.075978,0.005204,0.037885,0.006037,0.004163,0.000416,0.34617,0.016861,0.014988,0.008326,0.00458,0.004371,0.00458,0.000416,0.013947,0.000416,0.000208,0.575146,0.405079,0.019775,0.002082,0.001873,0.006453,0.006037,0.983555
std,2.249982,1.26548,1.423788,1.58973,1442775.0,68.268786,5.354162,0.499087,0.626834,0.396126,0.384341,0.617341,0.645238,375.445072,0.494529,0.493621,0.365357,0.489143,0.496545,0.053911,0.038148,0.481219,0.481219,0.157966,0.241616,0.242369,0.102497,0.085053,0.057621,0.237419,0.148258,0.113775,0.104465,0.032248,0.205405,0.187886,0.145549,0.073376,0.15734,0.311244,...,0.061103,0.217451,0.100488,0.097394,0.142082,0.267309,0.014428,0.490958,0.014428,0.11015,0.024984,0.176169,0.045582,0.038148,0.038148,0.264991,0.071958,0.190938,0.077469,0.064395,0.020402,0.475798,0.128764,0.121515,0.090878,0.067524,0.065978,0.067524,0.020402,0.117282,0.020402,0.014428,0.494372,0.490958,0.139241,0.045582,0.043247,0.080079,0.077469,0.127191
min,1.0,0.0,0.0,0.0,1.0,0.0,20.0,2.0,2.0,2.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,1.0,1.0,1.0,1.0,6.0,94.0,10.0,9.0,10.0,10.0,9.0,9.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,2.0,1.0,1.0,1.0,2.0,23.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,4.0,1.5,2.0,2.0,4.0,66.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
max,16.0,15.0,15.0,15.0,100000000.0,568.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,10000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [0]:
#impute nulls in other non-string columns with median or mean
#assign the result back rather than calling fillna(inplace=True) on a column selection, which newer pandas versions may not apply to the DataFrame
for col in ["bathrooms", "beds"]:
    airbnbDF[col] = airbnbDF[col].fillna(airbnbDF[col].median())

review_cols = ["review_scores_rating", "review_scores_accuracy", "review_scores_cleanliness",
               "review_scores_checkin", "review_scores_communication", "review_scores_location",
               "review_scores_value"]
for col in review_cols:
    airbnbDF[col] = airbnbDF[col].fillna(airbnbDF[col].mean())


In [0]:
airbnbDF.describe() #see summary statistics again

Unnamed: 0,accommodates,bathrooms,bedrooms,beds,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,host_is_superhost_f,host_is_superhost_t,cancellation_policy_flexible,cancellation_policy_moderate,cancellation_policy_strict,cancellation_policy_super_strict_30,cancellation_policy_super_strict_60,instant_bookable_f,instant_bookable_t,neighbourhood_cleansed_Bayview,neighbourhood_cleansed_Bernal Heights,neighbourhood_cleansed_Castro/Upper Market,neighbourhood_cleansed_Chinatown,neighbourhood_cleansed_Crocker Amazon,neighbourhood_cleansed_Diamond Heights,neighbourhood_cleansed_Downtown/Civic Center,neighbourhood_cleansed_Excelsior,neighbourhood_cleansed_Financial District,neighbourhood_cleansed_Glen Park,neighbourhood_cleansed_Golden Gate Park,neighbourhood_cleansed_Haight Ashbury,neighbourhood_cleansed_Inner Richmond,neighbourhood_cleansed_Inner Sunset,neighbourhood_cleansed_Lakeshore,neighbourhood_cleansed_Marina,neighbourhood_cleansed_Mission,...,neighbourhood_cleansed_Seacliff,neighbourhood_cleansed_South of Market,neighbourhood_cleansed_Twin Peaks,neighbourhood_cleansed_Visitacion Valley,neighbourhood_cleansed_West of Twin Peaks,neighbourhood_cleansed_Western Addition,property_type_Aparthotel,property_type_Apartment,property_type_Barn,property_type_Bed and breakfast,property_type_Boat,property_type_Boutique hotel,property_type_Bungalow,property_type_Cabin,property_type_Castle,property_type_Condominium,property_type_Dorm,property_type_Guest suite,property_type_Guesthouse,property_type_Hostel,property_type_Hotel,property_type_House,property_type_In-law,property_type_Loft,property_type_Other,property_type_Resort,property_type_Serviced apartment,property_type_Timeshare,property_type_Tiny house,property_type_Townhouse,property_type_Treehouse,property_type_Vacation home,room_type_Entire home/apt,room_type_Private room,room_type_Shared room,bed_type_Airbed,bed_type_Couch,bed_type_Futon,bed_type_Pull-out Sofa,bed_type_Real Bed
count,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,...,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0,4804.0
mean,3.429017,1.377082,1.457952,1.898002,20823.92,49.912781,95.787872,9.794232,9.673684,9.89011,9.88304,9.631639,9.496909,222.415903,0.574105,0.420067,0.158618,0.396128,0.440883,0.002914,0.001457,0.635928,0.364072,0.025604,0.06224,0.062656,0.010616,0.007286,0.003331,0.05995,0.022481,0.013114,0.011032,0.001041,0.04413,0.036636,0.021649,0.005412,0.025396,0.108659,...,0.003747,0.04975,0.0102,0.009575,0.020608,0.077435,0.000208,0.405079,0.000208,0.012281,0.000624,0.032057,0.002082,0.001457,0.001457,0.075978,0.005204,0.037885,0.006037,0.004163,0.000416,0.34617,0.016861,0.014988,0.008326,0.00458,0.004371,0.00458,0.000416,0.013947,0.000416,0.000208,0.575146,0.405079,0.019775,0.002082,0.001873,0.006453,0.006037,0.983555
std,2.249982,1.262717,1.423788,1.589054,1442775.0,68.268786,5.106534,0.47595,0.597844,0.377719,0.366524,0.588655,0.615185,375.445072,0.494529,0.493621,0.365357,0.489143,0.496545,0.053911,0.038148,0.481219,0.481219,0.157966,0.241616,0.242369,0.102497,0.085053,0.057621,0.237419,0.148258,0.113775,0.104465,0.032248,0.205405,0.187886,0.145549,0.073376,0.15734,0.311244,...,0.061103,0.217451,0.100488,0.097394,0.142082,0.267309,0.014428,0.490958,0.014428,0.11015,0.024984,0.176169,0.045582,0.038148,0.038148,0.264991,0.071958,0.190938,0.077469,0.064395,0.020402,0.475798,0.128764,0.121515,0.090878,0.067524,0.065978,0.067524,0.020402,0.117282,0.020402,0.014428,0.494372,0.490958,0.139241,0.045582,0.043247,0.080079,0.077469,0.127191
min,1.0,0.0,0.0,0.0,1.0,0.0,20.0,2.0,2.0,2.0,2.0,3.0,2.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,2.0,1.0,1.0,1.0,1.0,6.0,95.0,9.794232,9.673684,10.0,10.0,9.0,9.0,100.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
50%,2.0,1.0,1.0,1.0,2.0,23.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,150.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
75%,4.0,1.5,2.0,2.0,4.0,66.0,99.0,10.0,10.0,10.0,10.0,10.0,10.0,250.0,1.0,1.0,0.0,1.0,1.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0
max,16.0,15.0,15.0,15.0,100000000.0,568.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,10000.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,...,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


Before we create a train test split, check that all your columns are numerical. Remember to drop the original string columns after creating numerical representations of them.

Make sure to drop the price column from the training data when doing the train test split.
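
The two checks above can be sketched as follows on a made-up DataFrame: first confirm that every column is numeric, then separate the `price` target from the features before splitting. The toy values are illustrative only.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype
from sklearn.model_selection import train_test_split

# Made-up, already-numeric data standing in for the cleaned listings table.
toy = pd.DataFrame({
    "accommodates": [1, 2, 3, 4, 5, 6, 7, 8],
    "beds": [1, 1, 2, 2, 3, 3, 4, 4],
    "price": [60.0, 80.0, 120.0, 150.0, 200.0, 230.0, 300.0, 340.0],
})

# Fail early if any string column slipped through pre-processing.
non_numeric = [c for c in toy.columns if not is_numeric_dtype(toy[c])]
assert not non_numeric, "Encode or drop these columns first: {}".format(non_numeric)

# Features exclude the target; the target is a flat 1-D array.
X = toy.drop(columns=["price"])
y = toy["price"].to_numpy()

X_train_toy, X_test_toy, y_train_toy, y_test_toy = train_test_split(X, y, random_state=42)
```

With the default `test_size` of 0.25, the eight toy rows split into six training rows and two test rows.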

In [0]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(airbnbDF.drop(["price"], axis=1), airbnbDF[["price"]].values.ravel(), random_state=42)

## Model

After cleaning our data, we can start creating our model!

First, if there are still `NaN`s in your data, you may want to impute these values instead of dropping those entries entirely. Make sure that any further processing or imputing steps after the train test split are part of a model or pipeline that can be saved.
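
One way to keep imputation inside a saveable object is a scikit-learn `Pipeline`. The sketch below is an illustration on made-up arrays, not the approach used in the cells that follow; it assumes scikit-learn 0.20 or later, where `SimpleImputer` is available.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline

# Made-up training data with missing values in both feature columns.
X_train_toy = np.array([[1.0, 2.0], [np.nan, 4.0], [3.0, np.nan], [5.0, 8.0]])
y_train_toy = np.array([50.0, 90.0, 120.0, 200.0])

# The imputer learns its medians from the training data only and is
# stored together with the model, so both are saved and loaded as one object.
pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("rf", RandomForestRegressor(n_estimators=10, random_state=42)),
])
pipeline.fit(X_train_toy, y_train_toy)

# New rows with missing values are imputed with the training medians before prediction.
X_new = np.array([[np.nan, 3.0]])
prediction = pipeline.predict(X_new)
```

Because the fitted pipeline is itself an sklearn estimator, it can be passed to `mlflow.sklearn.log_model` in place of a bare model.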

In the following cell, create and fit a single sklearn model.

In [0]:
#model 1 - random forest over a wide range of Hyperparameters
import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# dictionary containing hyperparameter names and list of values we want to try
parameters = {"n_estimators": [100, 1000],
              "max_depth": [5, 10],
              "max_leaf_nodes": [10, 20],
              "min_samples_leaf": [3, 8]}

rf = RandomForestRegressor()
grid_rf_model = GridSearchCV(rf, parameters, cv=3)
grid_rf_model.fit(X_train, y_train)

best_rf = grid_rf_model.best_estimator_
for p in parameters:
  print("Best '{}': {}".format(p, best_rf.get_params()[p]))


Pick and calculate a regression metric for evaluating your model.
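
For reference, a few common regression metrics can be computed directly with NumPy, as in this sketch on made-up numbers. `sklearn.metrics` provides equivalent functions such as `mean_squared_error`.

```python
import numpy as np

# Made-up true prices and predictions.
y_true = np.array([100.0, 200.0, 300.0])
y_pred = np.array([110.0, 190.0, 330.0])

errors = y_true - y_pred

mse = np.mean(errors ** 2)        # mean squared error
rmse = np.sqrt(mse)               # root mean squared error, in the same units as price
mae = np.mean(np.abs(errors))     # mean absolute error
r2 = 1.0 - np.sum(errors ** 2) / np.sum((y_true - y_true.mean()) ** 2)  # coefficient of determination

print("mse={:.2f} rmse={:.2f} mae={:.2f} r2={:.3f}".format(mse, rmse, mae, r2))
```

Since the price column contains extreme values (the maximum is 10000), RMSE or MAE may be easier to interpret than raw MSE.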

In [0]:
#metric calculated and logged in below command

Log your model on MLflow with the same metric you calculated above so we can compare all the different models you have tried! Make sure to also log any hyperparameters that you plan on tuning!

In [0]:
import mlflow.sklearn

from sklearn.metrics import mean_squared_error

with mlflow.start_run(run_name="RF-Wider-Grid-Search") as run:
  # Create predictions of X_test using best model
  predictions = best_rf.predict(X_test)
  
  # Log model with name
  mlflow.sklearn.log_model(best_rf, "wider-grid-random-forest-model")
  
  # Log params
  for p in parameters:
    mlflow.log_param(p, best_rf.get_params()[p])
  
  # Create and log MSE metrics using predictions of X_test and its actual value y_test
  mse = mean_squared_error(y_test, predictions)
  print(" mse: {}".format(mse))
  
  mlflow.log_metric("mse",mse)
  
  runID = run.info.run_id  # run_uuid is the deprecated name for run_id
  experimentID = run.info.experiment_id
  print("Inside MLflow Run with id {}".format(runID))
  print("Inside MLflow experiment with id {}".format(experimentID))

Change and re-run the above 3 code cells to log different models and/or models with different hyperparameters until you are satisfied with the performance of at least 1 of them.

In [0]:
#random forest model 2 with narrower hyper parameters

import mlflow.sklearn
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# dictionary containing hyperparameter names and list of values we want to try
parameters = {"n_estimators": [100, 1000],
              "max_depth": [5, 10] }

rf2 = RandomForestRegressor()
grid_rf2_model = GridSearchCV(rf2, parameters, cv=3)
grid_rf2_model.fit(X_train, y_train)

best_rf2 = grid_rf2_model.best_estimator_
for p in parameters:
  print("Best '{}': {}".format(p, best_rf2.get_params()[p]))

In [0]:
#metric calculated and logged with model 2 and its parameters

import mlflow.sklearn

from sklearn.metrics import mean_squared_error

with mlflow.start_run(run_name="RF-Narrower-Grid-Search") as run:
  # Create predictions of X_test using best model
  predictions = best_rf2.predict(X_test)
  
  # Log model with name
  mlflow.sklearn.log_model(best_rf2, "narrower-grid-random-forest-model")
  
  # Log params
  for p in parameters:
    mlflow.log_param(p, best_rf2.get_params()[p])
  
  # Create and log MSE metrics using predictions of X_test and its actual value y_test
  mse = mean_squared_error(y_test, predictions)
  print(" mse: {}".format(mse))
  
  mlflow.log_metric("mse",mse)
  
  runID = run.info.run_id  # run_uuid is the deprecated name for run_id
  experimentID = run.info.experiment_id
  print("Inside MLflow Run with id {}".format(runID))
  print("Inside MLflow experiment with id {}".format(experimentID))

In [0]:
#model 2 selected as it has lower mse

Look through the MLflow UI for the best model. Copy its `URI` so you can load it as a `pyfunc` model.

In [0]:
import mlflow.pyfunc
from  mlflow.tracking import MlflowClient

client = MlflowClient()
# Take the most recent run in the experiment, which is the narrower grid search logged above
rf_run = sorted(client.list_run_infos(experimentID), key=lambda r: r.start_time, reverse=True)[0]
rf_path = rf_run.artifact_uri+"/narrower-grid-random-forest-model/"

rf_pyfunc_model = mlflow.pyfunc.load_model(rf_path.replace("dbfs:", "/dbfs"))

## Post-processing

Our model currently gives us the predicted price per night for each Airbnb listing. Now we would like our model to tell us what the price per person would be for each listing, assuming the number of renters is equal to the `accommodates` value.
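
The post-processing arithmetic itself is a single element-wise division. The sketch below shows it outside of MLflow, using a hypothetical `FixedPriceModel` as a stand-in for the fitted regressor and made-up listings.

```python
import numpy as np
import pandas as pd

class FixedPriceModel:
    """Hypothetical stand-in for a fitted regressor; always returns the same prices."""
    def predict(self, model_input):
        return np.array([170.0, 235.0, 60.0])

def predict_price_per_person(model, model_input):
    # Total price per night as predicted by the underlying model.
    total_price = np.asarray(model.predict(model_input), dtype=float)
    # Number of guests, assumed equal to the `accommodates` value of each listing.
    guests = model_input["accommodates"].to_numpy(dtype=float)
    return pd.Series(total_price / guests, index=model_input.index, name="price_per_person")

listings = pd.DataFrame({"accommodates": [2, 5, 1]})
per_person = predict_price_per_person(FixedPriceModel(), listings)
print(per_person.tolist())
```

In this dataset the minimum `accommodates` value is 1, so the division is safe; data that could contain 0 would need a guard.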

Fill in the following model class to add in a post-processing step which will get us from total price per night to **price per person per night**.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Check out <a href="https://www.mlflow.org/docs/latest/models.html#id13" target="_blank">the MLflow docs for help.</a>

In [0]:

class Airbnb_Model(mlflow.pyfunc.PythonModel):

    def __init__(self, model):
        self.best_rf = model
    
    def postprocess_result(self, results, model_input):
        import numpy as np
        # divide the total nightly price by the number of guests to get price per person
        return np.divide(results, model_input["accommodates"])
      
    def predict(self, context, model_input):
        processed_model_input = model_input.copy()
        results = self.best_rf.predict(processed_model_input)
        return self.postprocess_result(results, model_input)

Construct and save the model to the given `final_model_path`.

In [0]:
# Construct and save the model
final_model_path =  userhome + "/ml-production/Airbnb_Model/"

dbutils.fs.rm(final_model_path, True) # remove folder if already exists

# wrap model 2 (best_rf2), which was selected above for its lower MSE
rf_postprocess_model = Airbnb_Model(model = best_rf2)
mlflow.pyfunc.save_model(path=final_model_path.replace("dbfs:", "/dbfs"), python_model=rf_postprocess_model)


Load the model in `python_function` format and apply it to our test data `X_test` to check that we are getting price per person predictions now.

In [0]:
# Load the model in `python_function` format
loaded_postprocess_model = mlflow.pyfunc.load_model(final_model_path.replace("dbfs:", "/dbfs"))  # load_pyfunc is deprecated in favor of load_model

# Apply the model
loaded_postprocess_model.predict(X_test) #on all 1201 test instances

## Packaging your Model

Now we would like to package our completed model!

First save your testing data at `test_data_path` so we can test the packaged model.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** When using `.to_csv` make sure to set `index=False` so you don't end up with an extra index column in your saved dataframe.
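
The effect of `index=False` can be seen in a small round trip through an in-memory buffer. The DataFrame below is made up for illustration.

```python
import io
import pandas as pd

# Made-up features with a non-default index, as you get after a train test split.
features = pd.DataFrame({"accommodates": [2, 4], "beds": [1.0, 2.0]}, index=[10, 11])

# Default behaviour: the index is written as an unnamed first column.
with_index = io.StringIO()
features.to_csv(with_index)
with_index.seek(0)
cols_with_index = pd.read_csv(with_index).columns.tolist()

# With index=False only the real feature columns are written.
without_index = io.StringIO()
features.to_csv(without_index, index=False)
without_index.seek(0)
cols_without_index = pd.read_csv(without_index).columns.tolist()

print(cols_with_index)
print(cols_without_index)
```

The extra `Unnamed: 0` column would be passed to the model as a feature it was never trained on, which is why the hint matters.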

In [0]:
# TODO
#save the testing data 
test_data_path = project_path.replace("dbfs:", "/dbfs") + "test_data.csv"
X_test.to_csv(path_or_buf=test_data_path,index=False)
print(test_data_path)
prediction_path = project_path.replace("dbfs:", "/dbfs") + "predictions.csv"

print(prediction_path)
print(final_model_path)

First we will determine what the project script should do. Fill out the `model_predict` function to load the trained model you just saved (at `final_model_path`) and make price per person predictions on the data at `test_data_path`. Those predictions should then be saved under `prediction_path` for the user to access later.

Run the cell to check that your function is behaving correctly and that you have predictions saved at `demo_prediction_path`.

In [0]:
# TODO
import click
import mlflow.pyfunc
import pandas as pd

@click.command()
@click.option("--final_model_path", default="/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Airbnb_Model/", type=str)
@click.option("--test_data_path", default="/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Capstone/test_data.csv", type=str)
@click.option("--prediction_path", default="/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Capstone/predictions.csv", type=str)
def model_predict(final_model_path, test_data_path, prediction_path):
    # Load the model in `python_function` format
    loaded_postprocess_model = mlflow.pyfunc.load_model(final_model_path.replace("dbfs:", "/dbfs"))

    # Apply the model to the saved test data (all 1201 test instances)
    df = pd.read_csv(test_data_path)
    result = loaded_postprocess_model.predict(df)

    # Save the price per person predictions to the path passed in by the caller
    pd.Series(result, name="price_per_person").to_csv(prediction_path, index=False, header=True)


# test model_predict function
demo_prediction_path = project_path.replace("dbfs:", "/dbfs") + "demo_predictions.csv"

from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(model_predict, ['--final_model_path', final_model_path, 
                                       '--test_data_path', test_data_path,
                                       '--prediction_path', demo_prediction_path], catch_exceptions=True)

assert result.exit_code == 0, "Code failed" # Check to see that it worked
print("Price per person predictions: ")
print(pd.read_csv(demo_prediction_path))


Next, we will create an MLproject file and put it under our `project_path`. Complete the parameters and command of the file.
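
To see how the `command` template and its `parameters` relate, the sketch below fills the placeholders with plain Python string formatting. This only illustrates the substitution idea and is not MLflow's actual implementation; the paths are hypothetical examples.

```python
import shlex

# The same placeholder style as the `command` entry of an MLproject file.
command_template = (
    "python predict.py --final_model_path {final_model_path} "
    "--test_data_path {test_data_path} --prediction_path {prediction_path}"
)

# Hypothetical example paths, not the ones used in this notebook.
params = {
    "final_model_path": "/dbfs/tmp/example/model",
    "test_data_path": "/dbfs/tmp/example/test_data.csv",
    "prediction_path": "/dbfs/tmp/example/predictions.csv",
}

# Quote each value for the shell, then substitute it into the matching placeholder.
rendered = command_template.format(**{k: shlex.quote(v) for k, v in params.items()})
print(rendered)
```

Each placeholder name in the command must match a key under `parameters`, and each of those in turn must match an option that `predict.py` accepts.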

In [0]:
# TODO
dbutils.fs.put(project_path + "MLproject", 
'''
name: Capstone-Project

conda_env: conda.yaml

entry_points:
  main:
    parameters:
      final_model_path: {type: str, default: "/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Airbnb_Model/"}
      test_data_path: {type: str, default: "/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Capstone/test_data.csv"}
      prediction_path: {type: str, default: "/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Capstone/predictions.csv"}
    command:  "python predict.py --final_model_path {final_model_path} --test_data_path {test_data_path} --prediction_path {prediction_path}"
'''.strip(), overwrite=True)

We then create a `conda.yaml` file to list the dependencies needed to run our script.

In [0]:
dbutils.fs.put(project_path + "conda.yaml", 
'''
name: Capstone
channels:
  - defaults
dependencies:
  - cloudpickle=0.8.0
  - numpy=1.16.2
  - pandas=0.24.2
  - scikit-learn=0.20.3
  - pip:
    - mlflow==1.5.0
'''.strip(), overwrite=True)

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> You can check the versions match your current environment using the following cell.
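
As an alternative sketch, installed versions can be looked up without importing each package, using the standard-library `importlib.metadata` module. This assumes Python 3.8 or later; `installed_version` and `conda_pin` are hypothetical helper names introduced for this example.

```python
from importlib import metadata

def installed_version(package_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None

def conda_pin(package_name, version):
    """Format one conda dependency line, for example 'pandas=0.24.2'."""
    return "{}={}".format(package_name, version)

# Print a pin for each dependency listed in conda.yaml, if it is installed here.
for name in ["cloudpickle", "numpy", "pandas", "scikit-learn", "mlflow"]:
    version = installed_version(name)
    print(conda_pin(name, version) if version else "{}: not installed".format(name))
```

Note that pip dependencies in `conda.yaml` use `==` rather than `=`, as in the `mlflow==1.5.0` entry.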

In [0]:
import cloudpickle
print("cloudpickle: " + cloudpickle.__version__)
import numpy
print("numpy: " + numpy.__version__)
import pandas
print("pandas: " + pandas.__version__)
import sklearn
print("sklearn: " + sklearn.__version__)
import mlflow
print("mlflow: " + mlflow.__version__)

Now we will put the `predict.py` script into our project package. Complete the `.py` file by copying and placing the `model_predict` function you defined above.

In [0]:
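# TODO
dbutils.fs.put(project_path + "predict.py", 
'''
import click
import mlflow.pyfunc
import pandas as pd

@click.command()
@click.option("--final_model_path", default="/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Airbnb_Model/", type=str)
@click.option("--test_data_path", default="/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Capstone/test_data.csv", type=str)
@click.option("--prediction_path", default="/dbfs/user/vivek.sivalingam@rhsmith.umd.edu/ml-production/Capstone/predictions.csv", type=str)
def model_predict(final_model_path, test_data_path, prediction_path):
    # Load the saved model in `python_function` format
    model = mlflow.pyfunc.load_model(final_model_path.replace("dbfs:", "/dbfs"))
    # Predict price per person for the test data and save the result
    predictions = model.predict(pd.read_csv(test_data_path))
    pd.Series(predictions, name="price_per_person").to_csv(prediction_path, index=False, header=True)

if __name__ == "__main__":
  model_predict()

'''.strip(), overwrite=True)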

Let's double check all the files we've created are in the `project_path` folder. You should have at least the following 3 files:
* `MLproject`
* `conda.yaml`
* `predict.py`

In [0]:
dbutils.fs.ls(project_path)

Under `project_path` is your completely packaged project. Run the project to use the model saved at `final_model_path` to predict the price per person of each Airbnb listing in `test_data_path` and save those predictions under `prediction_path`.

In [0]:
import mlflow
# TODO
mlflow.projects.run(uri = project_path.replace("dbfs:", "/dbfs"),
    # pass the paths defined earlier in this notebook instead of hard-coded copies
    parameters= {
      "final_model_path": final_model_path.replace("dbfs:", "/dbfs"),
      "test_data_path": test_data_path,
      "prediction_path": prediction_path
}
)

Run the following cell to check that your model's predictions are there!

In [0]:
print("Price per person predictions: ")
print(pd.read_csv(prediction_path))

Run the following command to clear the project and data files from your directory.

In [0]:
dbutils.fs.rm(project_path, True)

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>