
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Featurization

Cleaning data and adding features creates the inputs for machine learning models, which are only as strong as the data they are fed.  This lesson examines the process of featurization including common tasks such as handling categorical features and normalization, imputing missing data, and creating a pipeline of featurization steps.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
* Differentiate Spark transformers, estimators, and pipelines
* One-hot encode categorical features
* Impute missing data
* Combine different featurization stages into a pipeline

<iframe  
src="//fast.wistia.net/embed/iframe/9j0djq95kk?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/9j0djq95kk?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Transformers, Estimators, and Pipelines

Spark's machine learning library, `MLlib`, has three main abstractions:<br><br>

1. A **transformer** takes a DataFrame as an input and returns a new DataFrame with one or more columns appended to it.  
  - Transformers implement a `.transform()` method.  
2. An **estimator** takes a DataFrame as an input and returns a model, which itself is a transformer.
  - Estimators implement a `.fit()` method.
3. A **pipeline** chains transformers and estimators together so that a multi-step workflow can be treated as a single unit.
  - Pipelines implement a `.fit()` method.

These basic building blocks form the machine learning process in Spark from featurization through model training and deployment.  
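
The contract between estimators and transformers can be sketched in plain Python, without Spark. The classes `ToyIndexer` and `ToyIndexerModel` below are hypothetical names invented for this illustration, and the rows are made-up data; this is only a rough picture of the `.fit()`/`.transform()` pattern, not how `MLlib` is implemented.

```python
from collections import Counter


class ToyIndexerModel:
    """A fitted model: a transformer holding the learned label -> index map."""

    def __init__(self, input_col, output_col, mapping):
        self.input_col = input_col
        self.output_col = output_col
        self.mapping = mapping

    def transform(self, rows):
        # Return new rows with one column appended; the input rows are left unchanged.
        return [{**row, self.output_col: self.mapping[row[self.input_col]]} for row in rows]


class ToyIndexer:
    """An estimator: it learns from the data in fit() and returns a model."""

    def __init__(self, input_col, output_col):
        self.input_col = input_col
        self.output_col = output_col

    def fit(self, rows):
        counts = Counter(row[self.input_col] for row in rows)
        # The most frequent label gets index 0.0; ties are broken alphabetically
        # (a choice made for this sketch only).
        ordered = sorted(counts, key=lambda label: (-counts[label], label))
        mapping = {label: float(i) for i, label in enumerate(ordered)}
        return ToyIndexerModel(self.input_col, self.output_col, mapping)


# Made-up rows standing in for a DataFrame
rows = [
    {"room_type": "Entire home/apt"},
    {"room_type": "Private room"},
    {"room_type": "Entire home/apt"},
    {"room_type": "Shared room"},
]

model = ToyIndexer("room_type", "room_type_index").fit(rows)  # estimator -> model
indexed = model.transform(rows)                               # model acts as a transformer
print(indexed[1])
```

The estimator holds only configuration; everything learned from the data lives in the model that `.fit()` returns.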

Machine learning models are only as strong as the data they see and can only work on numerical data.  **Featurization is the process of creating this input data for a model.**  There are a number of common featurization approaches:<br><br>

* Encoding categorical variables
* Normalizing
* Creating new features
* Handling missing values
* Binning/discretizing

This lesson builds a pipeline of transformers and estimators in order to featurize a dataset.

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-1/pipeline.jpg" style="height: 400px; margin: 20px"/></div>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> `MLlib` can refer to either the general machine learning library in Spark or the RDD-specific API.  `SparkML` refers to the DataFrame-specific API, which is preferred over working on RDDs wherever possible.

Run the following cell to set up our environment.

In [0]:
%run "./Includes/Classroom-Setup"

### Categorical Features and One-Hot Encoding

Categorical features refer to a discrete number of groups.  In the case of the AirBnB dataset we'll use in this lesson, one categorical variable is room type.  There are three types of rooms: `Private room`, `Entire home/apt`, and `Shared room`.

A machine learning model does not know how to handle these room types.  Instead, we must first *encode* each unique string into a number.  Second, we must *one-hot encode* each of those values to a location in an array.  This allows our machine learning algorithms to model effects of each category.

| Room type       | Room type index | One-hot encoded room type index |
|-----------------|-----------------|---------------------------------|
| Private room    | 0               | [1, 0]                          |
| Entire home/apt | 1               | [0, 1]                          |
| Shared room     | 2               | [0, 0]                          |

Import the AirBnB dataset.

In [0]:
airbnbDF = spark.read.parquet("/mnt/training/airbnb/sf-listings/sf-listings-correct-types.parquet")

display(airbnbDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
t,moderate,f,1.0,Western Addition,94117.0,37.769310377340766,-122.43385634489,Apartment,Entire home/apt,3.0,1.0,1.0,2.0,Real Bed,1.0,127.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$170.00
f,strict,f,2.0,Bernal Heights,94110.0,37.745112331410034,-122.42101788836888,Apartment,Entire home/apt,5.0,1.0,2.0,3.0,Real Bed,30.0,112.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$235.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.766689597862175,-122.45250461761628,Apartment,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,17.0,85.0,8.0,8.0,9.0,9.0,9.0,8.0,$65.00
t,moderate,t,4.0,Outer Mission,94127.0,37.73074592978503,-122.44840862635226,House,Private room,1.0,2.0,1.0,1.0,Real Bed,3.0,76.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$60.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.76487219421756,-122.45182799146508,House,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,7.0,91.0,9.0,9.0,9.0,9.0,9.0,9.0,$65.00
f,strict,f,2.0,Western Addition,94117.0,37.77524858589268,-122.43637374831292,House,Entire home/apt,5.0,1.5,2.0,2.0,Real Bed,5.0,26.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$575.00
f,moderate,f,1.0,Western Addition,94115.0,37.78470745496073,-122.44555431261594,Apartment,Entire home/apt,7.0,1.0,2.0,1.0,Real Bed,2.0,27.0,88.0,9.0,7.0,10.0,10.0,9.0,9.0,$255.00
t,moderate,f,2.0,Mission,94110.0,37.75918889708064,-122.42236687240562,Apartment,Private room,3.0,1.0,1.0,2.0,Real Bed,1.0,559.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$139.00
f,moderate,f,1.0,Mission,94110.0,37.75174004606522,-122.4094205953428,Apartment,Entire home/apt,4.0,2.5,3.0,3.0,Real Bed,3.0,24.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$285.00
f,strict,t,1.0,Potrero Hill,94107.0,37.76258885144137,-122.40543055237004,House,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,386.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,$135.00


Take the unique values of `room_type` and index each one to a numerical value.  Fit the `StringIndexer` estimator to the unique room types by calling its `.fit()` method on the data.

The trained `StringIndexer` model is a transformer.  Call its `.transform()` method on the data to append the index column.

In [0]:
from pyspark.ml.feature import StringIndexer

uniqueTypesDF = airbnbDF.select("room_type").distinct() # Use distinct values to demonstrate how StringIndexer works

indexer = StringIndexer(inputCol="room_type",outputCol="room_type_index") # Set input column and new output column
indexerModel = indexer.fit(uniqueTypesDF) # Fit the indexer to learn room type/index pairs
indexedDF = indexerModel.transform(uniqueTypesDF) # Append a new column with the index

display(indexedDF)

room_type,room_type_index
Shared room,2.0
Entire home/apt,1.0
Private room,0.0


Now each room type has a unique numerical value assigned.  While we could pass the new `room_type_index` into a machine learning model, it would assume that `Shared room` is twice as much as `Entire home/apt`, which is not the case.  Instead, we need to change these values to binary yes/no values indicating whether a listing is for a shared room, entire home, or private room.

Do this by fitting the `OneHotEncoderEstimator`, which only operates on numerical values (this is why we needed to use `StringIndexer` first).  In Spark 3.0 and later, this class is named `OneHotEncoder`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Certain models, such as random forest, do not need one-hot encoding (and can actually be negatively affected by the process).  The models we'll explore in this course, however, do need this process.

In [0]:
from pyspark.ml.feature import OneHotEncoderEstimator

# Create the OneHotEncoderEstimator
encoder = OneHotEncoderEstimator(inputCols=["room_type_index"], outputCols=["encoded_room_type"])

# Fit the encoder to the indexed data
encoderModel = encoder.fit(indexedDF)

# Transform the indexed data to append the encoded column
encodedDF = encoderModel.transform(indexedDF)

display(encodedDF)

room_type,room_type_index,encoded_room_type
Shared room,2.0,"List(0, 2, List(), List())"
Entire home/apt,1.0,"List(0, 2, List(1), List(1.0))"
Private room,0.0,"List(0, 2, List(0), List(1.0))"


The new column `encoded_room_type` is a vector.  Sparse and dense vectors differ in whether Spark records the zero values.  In a sparse vector, like we see here, Spark saves space by only recording the places where the vector has a non-zero value.  The value of 0 in the first position indicates that it's a sparse vector.  The second value indicates the length of the vector.  The two lists that follow hold the indices of the non-zero entries and the values stored at those indices.

Here's how to read the mapping above:<br><br>

* `Shared room` maps to the vector `[0, 0]`
* `Entire home/apt` maps to the vector `[0, 1]`
* `Private room` maps to the vector `[1, 0]`
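
As a rough illustration of how to read these vectors, the plain-Python sketch below expands a sparse `(size, indices, values)` triple into a dense list. The helper names `to_dense` and `one_hot_drop_last` are hypothetical and written for this example; the sketch assumes the encoder's default behavior of dropping the last category (`dropLast=True`), which is why three room types need only two positions.

```python
def to_dense(size, indices, values):
    """Expand a sparse (size, indices, values) triple into a dense list."""
    dense = [0.0] * size
    for position, value in zip(indices, values):
        dense[position] = value
    return dense


def one_hot_drop_last(index, num_categories):
    """One-hot encode a category index, leaving the last category as all zeros."""
    size = num_categories - 1
    if index < size:
        return (size, [index], [1.0])
    return (size, [], [])  # the last category has no non-zero entry


room_types = [("Private room", 0), ("Entire home/apt", 1), ("Shared room", 2)]
for room_type, index in room_types:
    sparse = one_hot_drop_last(index, num_categories=3)
    print(room_type, sparse, to_dense(*sparse))
```

Dropping one category loses no information: a row of all zeros can only mean the remaining category.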

### Imputing Null or Missing Data

Null values refer to unknown or missing data as well as irrelevant responses. Strategies for dealing with this scenario include:<br><br>

* **Dropping these records:** Works when you do not need to use the information for downstream workloads
* **Adding a placeholder (e.g. `-1`):** Allows you to see missing data later on without violating a schema
* **Basic imputing:** Allows you to have a "best guess" of what the data could have been, often by using the mean of non-missing data
* **Advanced imputing:** Determines the "best guess" of what data should be using more advanced strategies such as clustering machine learning algorithms or oversampling techniques <a href="https://jair.org/index.php/jair/article/view/10302" target="_blank">such as SMOTE.</a>

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Try to determine why a value is null.  This can provide information that can be helpful to the model.
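
The first three strategies can be compared on a small, made-up column using only the Python standard library. The values in `beds` below are invented for illustration and `None` stands in for a null; this is a sketch of the ideas, not of Spark's API.

```python
from statistics import mean, median

# A made-up column with two missing values and one large outlier
beds = [2.0, 3.0, None, 1.0, None, 1.0, 8.0]
observed = [v for v in beds if v is not None]

# Dropping: keep only the records that have a value
dropped = observed

# Placeholder: mark missing values with -1 so they remain visible later
placeholder = [-1.0 if v is None else v for v in beds]

# Basic imputing: fill with the mean or the median of the observed values
mean_imputed = [mean(observed) if v is None else v for v in beds]
median_imputed = [median(observed) if v is None else v for v in beds]

print("dropped:       ", dropped)
print("placeholder:   ", placeholder)
print("mean imputed:  ", mean_imputed)
print("median imputed:", median_imputed)
```

In this example the outlier pulls the mean up to 3.0 while the median stays at 2.0, which is one reason the median is often chosen for skewed columns.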

Describe the dataset and take a look at the `count` values.  There's a fair amount of missing data in this dataset.

In [0]:
display(airbnbDF.describe())

summary,host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
count,4776,4804,4804,4776.0,4804,4774,4804.0,4804.0,4804,4804,4804.0,4781.0,4804.0,4798.0,4804,4804.0,4804.0,4370.0,4369.0,4370.0,4368.0,4369.0,4368.0,4367.0,4804
mean,,,,6.08500837520938,,94114.98575319505,37.76354552280585,-122.43260551327457,,,3.429017485428809,1.3788956285295964,1.4579517069109076,1.8991246352646936,,20823.923397169023,49.91278101582015,95.78787185354692,9.79423208972305,9.673684210526316,9.89010989010989,9.883039597161822,9.631639194139192,9.496908632928784,
stddev,,,,20.54665224944482,,10.772635297473283,0.0228748676898511,0.0268041290509439,,,2.2499818382467858,1.2654799285506535,1.423787841854443,1.589729932989645,,1442774.5278940618,68.2687864462308,5.354162269377868,0.499086768724072,0.6268344438946888,0.3961264087727749,0.3843412952850329,0.617341089982517,0.6452384295034256,
min,f,flexible,f,0.0,Bayview,-- default zip code --,37.7047830305603,-122.51149998987212,Aparthotel,Entire home/apt,1.0,0.0,0.0,0.0,Airbed,1.0,0.0,20.0,2.0,2.0,2.0,2.0,3.0,2.0,$0.00
max,t,super_strict_60,t,964.0,Western Addition,94158,37.81030633452129,-122.37042748617776,Vacation home,Shared room,16.0,15.0,15.0,15.0,Real Bed,100000000.0,568.0,100.0,10.0,10.0,10.0,10.0,10.0,10.0,$995.00


Try dropping missing values.

In [0]:
countWithoutDropping = airbnbDF.count()

# drop missing values
countWithDropping = airbnbDF.na.drop(subset=["zipcode","host_is_superhost"]).count()

print("Count without dropping nulls:\t", countWithoutDropping)
print("Count with dropping nulls:\t", countWithDropping)

Another common option for working with missing data is to impute the missing values with a best guess for their value.  Try imputing a list of columns with their median.

In [0]:
from pyspark.ml.feature import Imputer

imputeCols = [
  "host_total_listings_count",
  "bathrooms",
  "beds", 
  "review_scores_rating",
  "review_scores_accuracy",
  "review_scores_cleanliness",
  "review_scores_checkin",
  "review_scores_communication",
  "review_scores_location",
  "review_scores_value"
]

# Create an Imputer that fills missing values with each column's median
imputer = Imputer(strategy="median", inputCols=imputeCols, outputCols=imputeCols)
imputerModel = imputer.fit(airbnbDF)  # Fit the model to learn each column's median
imputedDF = imputerModel.transform(airbnbDF)  # Get the imputed DataFrame

display(imputedDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price
t,moderate,f,1.0,Western Addition,94117.0,37.769310377340766,-122.43385634489,Apartment,Entire home/apt,3.0,1.0,1.0,2.0,Real Bed,1.0,127.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$170.00
f,strict,f,2.0,Bernal Heights,94110.0,37.745112331410034,-122.42101788836888,Apartment,Entire home/apt,5.0,1.0,2.0,3.0,Real Bed,30.0,112.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$235.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.766689597862175,-122.45250461761628,Apartment,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,17.0,85.0,8.0,8.0,9.0,9.0,9.0,8.0,$65.00
t,moderate,t,4.0,Outer Mission,94127.0,37.73074592978503,-122.44840862635226,House,Private room,1.0,2.0,1.0,1.0,Real Bed,3.0,76.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$60.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.76487219421756,-122.45182799146508,House,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,7.0,91.0,9.0,9.0,9.0,9.0,9.0,9.0,$65.00
f,strict,f,2.0,Western Addition,94117.0,37.77524858589268,-122.43637374831292,House,Entire home/apt,5.0,1.5,2.0,2.0,Real Bed,5.0,26.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$575.00
f,moderate,f,1.0,Western Addition,94115.0,37.78470745496073,-122.44555431261594,Apartment,Entire home/apt,7.0,1.0,2.0,1.0,Real Bed,2.0,27.0,88.0,9.0,7.0,10.0,10.0,9.0,9.0,$255.00
t,moderate,f,2.0,Mission,94110.0,37.75918889708064,-122.42236687240562,Apartment,Private room,3.0,1.0,1.0,2.0,Real Bed,1.0,559.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$139.00
f,moderate,f,1.0,Mission,94110.0,37.75174004606522,-122.4094205953428,Apartment,Entire home/apt,4.0,2.5,3.0,3.0,Real Bed,3.0,24.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$285.00
f,strict,t,1.0,Potrero Hill,94107.0,37.76258885144137,-122.40543055237004,House,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,386.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,$135.00


### Creating a Pipeline

Passing around estimator objects, trained estimators, and transformed DataFrames quickly becomes cumbersome.  Spark follows the convention established by `scikit-learn` of combining these steps into a single pipeline.  We can now do the same with the indexer, encoder, and imputer created above.

In [0]:
from pyspark.ml import Pipeline

# create a pipeline using the objects created above
pipeline = Pipeline(stages = [
  indexer,
  encoder,
  imputer
])

The pipeline is itself now an estimator.  Train the model with its `.fit()` method and then transform the original dataset.  We've now combined all of our featurization steps into one pipeline with three stages.
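
Conceptually, fitting a pipeline walks through the stages in order: each estimator is fit on the data produced so far, and the resulting model transforms the data before the next stage sees it. The plain-Python sketch below illustrates that idea with hypothetical toy classes (`AddOne`, `CenterEstimator`, `ToyPipeline`, and so on) operating on a list of numbers; it is an assumption-laden illustration, not Spark's `Pipeline` implementation.

```python
class AddOne:
    """A plain transformer: it has transform() but nothing to learn."""

    def transform(self, values):
        return [v + 1 for v in values]


class CenterModel:
    """A fitted model that subtracts the mean learned during fit()."""

    def __init__(self, learned_mean):
        self.learned_mean = learned_mean

    def transform(self, values):
        return [v - self.learned_mean for v in values]


class CenterEstimator:
    """An estimator: fit() learns the mean and returns a CenterModel."""

    def fit(self, values):
        return CenterModel(sum(values) / len(values))


class ToyPipelineModel:
    """The fitted pipeline: every stage is now a transformer."""

    def __init__(self, stages):
        self.stages = stages

    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values


class ToyPipeline:
    """An estimator that fits its stages in order."""

    def __init__(self, stages):
        self.stages = stages

    def fit(self, values):
        fitted = []
        for stage in self.stages:
            if hasattr(stage, "fit"):
                stage = stage.fit(values)  # estimators become models
            fitted.append(stage)
            values = stage.transform(values)  # pass transformed data to the next stage
        return ToyPipelineModel(fitted)


pipeline = ToyPipeline([AddOne(), CenterEstimator()])
model = pipeline.fit([1.0, 2.0, 3.0])
print(model.transform([1.0, 2.0, 3.0]))
print(model.transform([10.0]))
```

The second call reuses the mean learned during fitting, which is the point of separating `.fit()` from `.transform()`: whatever was learned from the training data is applied unchanged to new data.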

In [0]:
# Fit the pipeline to create the pipeline model
pipelineModel = pipeline.fit(airbnbDF)

# Transform the original dataset with the fitted model
transformedDF = pipelineModel.transform(airbnbDF)

display(transformedDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,room_type_index,encoded_room_type
t,moderate,f,1.0,Western Addition,94117.0,37.769310377340766,-122.43385634489,Apartment,Entire home/apt,3.0,1.0,1.0,2.0,Real Bed,1.0,127.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$170.00,0.0,"List(0, 2, List(0), List(1.0))"
f,strict,f,2.0,Bernal Heights,94110.0,37.745112331410034,-122.42101788836888,Apartment,Entire home/apt,5.0,1.0,2.0,3.0,Real Bed,30.0,112.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$235.00,0.0,"List(0, 2, List(0), List(1.0))"
f,strict,f,10.0,Haight Ashbury,94117.0,37.766689597862175,-122.45250461761628,Apartment,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,17.0,85.0,8.0,8.0,9.0,9.0,9.0,8.0,$65.00,1.0,"List(0, 2, List(1), List(1.0))"
t,moderate,t,4.0,Outer Mission,94127.0,37.73074592978503,-122.44840862635226,House,Private room,1.0,2.0,1.0,1.0,Real Bed,3.0,76.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$60.00,1.0,"List(0, 2, List(1), List(1.0))"
f,strict,f,10.0,Haight Ashbury,94117.0,37.76487219421756,-122.45182799146508,House,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,7.0,91.0,9.0,9.0,9.0,9.0,9.0,9.0,$65.00,1.0,"List(0, 2, List(1), List(1.0))"
f,strict,f,2.0,Western Addition,94117.0,37.77524858589268,-122.43637374831292,House,Entire home/apt,5.0,1.5,2.0,2.0,Real Bed,5.0,26.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,$575.00,0.0,"List(0, 2, List(0), List(1.0))"
f,moderate,f,1.0,Western Addition,94115.0,37.78470745496073,-122.44555431261594,Apartment,Entire home/apt,7.0,1.0,2.0,1.0,Real Bed,2.0,27.0,88.0,9.0,7.0,10.0,10.0,9.0,9.0,$255.00,0.0,"List(0, 2, List(0), List(1.0))"
t,moderate,f,2.0,Mission,94110.0,37.75918889708064,-122.42236687240562,Apartment,Private room,3.0,1.0,1.0,2.0,Real Bed,1.0,559.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,$139.00,1.0,"List(0, 2, List(1), List(1.0))"
f,moderate,f,1.0,Mission,94110.0,37.75174004606522,-122.4094205953428,Apartment,Entire home/apt,4.0,2.5,3.0,3.0,Real Bed,3.0,24.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,$285.00,0.0,"List(0, 2, List(0), List(1.0))"
f,strict,t,1.0,Potrero Hill,94107.0,37.76258885144137,-122.40543055237004,House,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,386.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,$135.00,1.0,"List(0, 2, List(1), List(1.0))"


## Exercise: Finish Featurizing the Dataset

One common way of handling continuous data is to divide it into bins, a process known as discretizing.  For instance, the dataset contains a number of rating scores that can be translated into a value of `1` if they belong to a highly rated host or `0` if not.

Finish featurizing the dataset by binning the review scores rating into high versus low rated hosts.  Also filter the extreme values and clean the column `price`.

### Step 1: Binning `review_scores_rating`

Divide the hosts by whether their `review_scores_rating` is above 97.  Do this using the transformer `Binarizer` with the output column `high_rating`.  This should create the objects `binarizer` and the transformed DataFrame `transformedBinnedDF`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Note that `Binarizer` is a transformer, so it does not have a `.fit()` method<br>
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** See the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=binarizer#pyspark.ml.feature.Binarizer" target="_blank">Binarizer Docs</a> for more details.
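
The thresholding rule itself is simple enough to sketch in plain Python. The function `binarize` below is a hypothetical helper written for this illustration: values strictly greater than the threshold map to `1.0`, and values equal to or below it map to `0.0`, which is the rule `Binarizer` documents.

```python
def binarize(value, threshold):
    """Return 1.0 when value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0


# Made-up ratings, including one that sits exactly on the threshold
ratings = [97.0, 98.0, 85.0]
high_rating = [binarize(rating, threshold=97.0) for rating in ratings]
print(high_rating)
```

Note that a rating of exactly 97 is not counted as high under this rule.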

In [0]:
from pyspark.ml.feature import Binarizer

binarizer = Binarizer(threshold=97.0, inputCol="review_scores_rating", outputCol="high_rating")
transformedBinnedDF = binarizer.transform(airbnbDF)

In [0]:
# TEST - Run this cell to test your solution
from pyspark.ml.feature import Binarizer

dbTest("ML1-P-05-01-01", True, type(binarizer) == type(Binarizer()))
dbTest("ML1-P-05-01-02", True, binarizer.getInputCol() == 'review_scores_rating')
dbTest("ML1-P-05-01-03", True, binarizer.getOutputCol() == 'high_rating')
dbTest("ML1-P-05-01-04", True, "high_rating" in transformedBinnedDF.columns)

print("Tests passed!")

### Step 2: Regular Expressions on Strings

Clean the column `price` by creating two new columns:<br><br>

1. `price`: a new column that contains a cleaned version of price.  This can be done using the regular expression replacement of `"[\$,]"` with `""`.  Cast the column as a decimal.
2. `price_raw`: the column `price` in its current form

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** See the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regexp_replace#pyspark.sql.functions.regexp_replace" target="_blank">`regexp_replace` Docs</a> for more details.
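
The same pattern can be tried outside Spark with Python's `re` module, which may help when checking what the regular expression matches. The helper `clean_price` and the sample strings below are assumptions made for this sketch.

```python
import re
from decimal import Decimal


def clean_price(raw):
    """Remove dollar signs and thousands separators, then parse as a decimal."""
    return Decimal(re.sub(r"[\$,]", "", raw))


# Made-up price strings in the same format as the `price` column
raw_prices = ["$170.00", "$1,250.00", "$65.00"]
prices = [clean_price(raw) for raw in raw_prices]
print(prices)
```

Python's `Decimal` keeps the cents here, whereas the cleaned `price` column in this lesson's output appears as whole numbers after the cast.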

In [0]:
from pyspark.sql.functions import col, regexp_replace
from pyspark.sql.types import DecimalType

transformedBinnedRegexDF = (transformedBinnedDF
                            .withColumn("price_raw", col("price"))  # Keep the original string
                            .withColumn("price", regexp_replace("price", r"[\$,]", "").cast(DecimalType()))  # Strip "$" and ",", then cast
                           )
display(transformedBinnedRegexDF)

host_is_superhost,cancellation_policy,instant_bookable,host_total_listings_count,neighbourhood_cleansed,zipcode,latitude,longitude,property_type,room_type,accommodates,bathrooms,bedrooms,beds,bed_type,minimum_nights,number_of_reviews,review_scores_rating,review_scores_accuracy,review_scores_cleanliness,review_scores_checkin,review_scores_communication,review_scores_location,review_scores_value,price,high_rating,price_raw
t,moderate,f,1.0,Western Addition,94117.0,37.769310377340766,-122.43385634489,Apartment,Entire home/apt,3.0,1.0,1.0,2.0,Real Bed,1.0,127.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,170,0.0,$170.00
f,strict,f,2.0,Bernal Heights,94110.0,37.745112331410034,-122.42101788836888,Apartment,Entire home/apt,5.0,1.0,2.0,3.0,Real Bed,30.0,112.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,235,1.0,$235.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.766689597862175,-122.45250461761628,Apartment,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,17.0,85.0,8.0,8.0,9.0,9.0,9.0,8.0,65,0.0,$65.00
t,moderate,t,4.0,Outer Mission,94127.0,37.73074592978503,-122.44840862635226,House,Private room,1.0,2.0,1.0,1.0,Real Bed,3.0,76.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,60,0.0,$60.00
f,strict,f,10.0,Haight Ashbury,94117.0,37.76487219421756,-122.45182799146508,House,Private room,2.0,4.0,1.0,1.0,Real Bed,32.0,7.0,91.0,9.0,9.0,9.0,9.0,9.0,9.0,65,0.0,$65.00
f,strict,f,2.0,Western Addition,94117.0,37.77524858589268,-122.43637374831292,House,Entire home/apt,5.0,1.5,2.0,2.0,Real Bed,5.0,26.0,97.0,10.0,10.0,10.0,10.0,10.0,10.0,575,0.0,$575.00
f,moderate,f,1.0,Western Addition,94115.0,37.78470745496073,-122.44555431261594,Apartment,Entire home/apt,7.0,1.0,2.0,1.0,Real Bed,2.0,27.0,88.0,9.0,7.0,10.0,10.0,9.0,9.0,255,0.0,$255.00
t,moderate,f,2.0,Mission,94110.0,37.75918889708064,-122.42236687240562,Apartment,Private room,3.0,1.0,1.0,2.0,Real Bed,1.0,559.0,98.0,10.0,10.0,10.0,10.0,10.0,9.0,139,1.0,$139.00
f,moderate,f,1.0,Mission,94110.0,37.75174004606522,-122.4094205953428,Apartment,Entire home/apt,4.0,2.5,3.0,3.0,Real Bed,3.0,24.0,95.0,9.0,9.0,10.0,10.0,9.0,9.0,285,0.0,$285.00
f,strict,t,1.0,Potrero Hill,94107.0,37.76258885144137,-122.40543055237004,House,Private room,2.0,1.0,1.0,1.0,Real Bed,1.0,386.0,93.0,9.0,9.0,10.0,10.0,9.0,9.0,135,0.0,$135.00


In [0]:
# TEST - Run this cell to test your solution
from pyspark.sql.types import DecimalType

dbTest("ML1-P-05-02-01", True, type(transformedBinnedRegexDF.schema["price"].dataType) == type(DecimalType()))
dbTest("ML1-P-05-02-02", True, "price_raw" in transformedBinnedRegexDF.columns)
dbTest("ML1-P-05-02-03", True, "price" in transformedBinnedRegexDF.columns)

print("Tests passed!")

### Step 3: Filter Extremes

The dataset contains extreme values, including negative prices and minimum stays of over one year.  Filter out all prices of $0 or less and all `minimum_nights` greater than 365.  Save the results to `transformedBinnedRegexFilteredDF`.

In [0]:
from pyspark.sql.functions import col

transformedBinnedRegexFilteredDF = (transformedBinnedRegexDF
                                    .filter(col("price") > 0)
                                    .filter(col("minimum_nights") <= 365)
                                   )

In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-05-03-01", 4789, transformedBinnedRegexFilteredDF.count())

print("Tests passed!")

## Review

**Question:** What's the difference between a transformer, estimator, and pipeline?  
**Answer:** The Spark machine learning API and `feature` library are based on these main abstractions:
1. *Transformers* transform your data by appending a new column to a DataFrame.
2. *Estimators* learn something about your data and implement the `.fit()` method.  A trained estimator then becomes a transformer.
3. *Pipelines* link together transformers and estimators into a single object for convenience.<br>

**Question:** How do you handle categorical features?  
**Answer:** Categorical features are a robust subject, so much so that there is a field dedicated to their study: discrete mathematics.  The most common way of handling categorical features is to one-hot encode them, where each unique value is translated to a position in an array.  There are a host of other techniques as well.  For instance, high cardinality features are categorical features with many unique values.  In this case, one-hot encoding that many values would create too many dimensions.  One alternative is to bin the values to reduce the number of features while still contributing some information to the machine learning model.
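
One simple way to bin a high cardinality feature is to group rare values into a single catch-all category. The plain-Python sketch below uses a hypothetical helper, `bin_rare`, a made-up list of neighbourhoods, and an arbitrary cutoff of two occurrences; it illustrates the idea and is not a recommendation for a particular threshold.

```python
from collections import Counter


def bin_rare(values, min_count, other="Other"):
    """Replace values seen fewer than min_count times with a catch-all label."""
    counts = Counter(values)
    return [v if counts[v] >= min_count else other for v in values]


# Made-up values for illustration
neighbourhoods = [
    "Mission", "Mission", "Mission",
    "Bayview",
    "Western Addition", "Western Addition",
    "Potrero Hill",
]

binned = bin_rare(neighbourhoods, min_count=2)
print(binned)
print(len(set(neighbourhoods)), "unique values before,", len(set(binned)), "after")
```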

**Question:** What's the best way to handle null values?  
**Answer:** The answer depends largely on what you hope to do with your data moving forward. You can drop null values or impute them with a number of different techniques.  For instance, clustering your data to fill null values with the values of nearby neighbors often gives more insight to machine learning models than using a simple mean.

## Next Steps

Start the next lesson, [Regression Modeling]($./06-Regression-Modeling ).

## Additional Topics & Resources

**Q:** Where can I find out more information on featurizing using Spark?  
**A:** Check out <a href="http://spark.apache.org/docs/latest/ml-features.html" target="_blank">the Apache Spark website for a more thorough treatment</a>.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>