## Exercise: Finish Featurizing the Dataset

One common way of handling continuous numeric data is to divide it into bins, a process technically known as discretizing.  For instance, the dataset contains a number of rating scores, each of which can be translated into a value of `1` if the host is highly rated or `0` if not.

Finish featurizing the dataset by binning `review_scores_rating` into high- versus low-rated hosts.  Also filter the extreme values and clean the column `price`.

Run the following cell to set up our environment.

This notebook walks through steps to complete a data preprocessing pipeline for an Airbnb dataset. 

### Step-by-Step Walkthrough

#### Step 1: Setting Up the Environment
The initial cell (`%run "./Includes/Classroom-Setup"`) runs a setup script that loads necessary libraries and mounts the Databricks dataset directory.

#### Step 2: Load and Prepare Data
In this step, the Airbnb dataset is loaded, and preliminary transformations are defined:

1. **Data Loading**: The dataset is loaded from a `.parquet` file with:
   ```python
   airbnbDF = spark.read.parquet("/mnt/training/airbnb/sf-listings/sf-listings-correct-types.parquet")
   ```
2. **Encoding Categorical Data**:
   - `StringIndexer` is used to convert the `room_type` column into numerical indices. By default, indices are assigned by descending label frequency, so the most common room type gets `0.0`.
   - `OneHotEncoder` then transforms these indices into a one-hot encoded format.
   
3. **Handling Missing Data**:
   - An `Imputer` is used to fill missing values in numeric columns (`host_total_listings_count`, `bathrooms`, etc.) with the median values.

4. **Pipeline Creation**:
   - A `Pipeline` is created to chain these transformations (indexing, encoding, and imputing) into a single step.
   - `pipeline.fit()` fits the pipeline to the data and returns `pipelineModel`; its `transform()` method is then applied to generate `transformedDF`, which contains all of the transformed columns.
   
   ```python
   pipelineModel = pipeline.fit(airbnbDF)
   transformedDF = pipelineModel.transform(airbnbDF)
   ```
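
To make the three stages above concrete, here is a minimal plain-Python sketch of what indexing, one-hot encoding, and median imputation do to a handful of values. It is an illustration only, not Spark code: the helper names (`string_index`, `one_hot`, `impute_median`) and the sample values are hypothetical. It also keeps every category in the one-hot output, whereas Spark's `OneHotEncoder` drops the last category by default and returns sparse vectors.

```python
from collections import Counter
from statistics import median

def string_index(values):
    """Map each label to an index, most frequent first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def one_hot(index, size):
    """Return a dense one-hot list of length `size` with 1.0 at position `index`."""
    return [1.0 if i == int(index) else 0.0 for i in range(size)]

def impute_median(values):
    """Replace None entries with the median of the observed values."""
    fill = median(v for v in values if v is not None)
    return [fill if v is None else v for v in values]

# Hypothetical sample values, not taken from the real dataset
room_types = ["Entire home/apt", "Private room", "Entire home/apt", "Shared room"]
mapping = string_index(room_types)
encoded = [one_hot(mapping[r], len(mapping)) for r in room_types]
beds = impute_median([1.0, None, 3.0, 2.0])

print(mapping)
print(encoded)
print(beds)
```

Chaining the real transformers in a `Pipeline` gives the same sequence of operations, with the fitted statistics (label order, medians) stored in `pipelineModel` so they can be reused on new data.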

#### Step 3: Binarizing Review Scores
To classify listings into high- and low-rated hosts:

1. **Binarizer Setup**:
   - `Binarizer` is used to convert `review_scores_rating` values into a binary label.
   - A threshold of `97.0` is set (the exercise asks for ratings above 97), so ratings strictly greater than 97 are classified as 1 (high rating), and all others as 0.
   
   ```python
   # Binarizer maps values strictly greater than the threshold to 1.0, all others to 0.0
   binarizer = Binarizer(threshold=97.0, inputCol="review_scores_rating", outputCol="high_rating")
   transformedBinnedDF = binarizer.transform(transformedDF)
   ```

2. **Validation**:
   - The `dbTest()` function checks if the transformations are correctly applied by validating column types and values.
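
The boundary behavior of a binarizer is easy to get wrong, so here is a small plain-Python sketch of the rule. The `binarize` helper and the sample ratings are hypothetical; the cut-off of 97 comes from the exercise instructions. A value exactly equal to the threshold falls into the `0.0` class.

```python
def binarize(value, threshold):
    """Return 1.0 if value is strictly greater than threshold, else 0.0."""
    return 1.0 if value > threshold else 0.0

# Hypothetical ratings chosen to show the boundary case at exactly 97
ratings = [100.0, 98.0, 97.0, 80.0]
labels = [binarize(r, threshold=97.0) for r in ratings]
print(labels)  # [1.0, 1.0, 0.0, 0.0]
```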

#### Step 4: Cleaning the `price` Column
To clean and convert the `price` column:

1. **Removing Symbols**:
   - Using `regexp_replace`, the code removes `$` and `,` symbols, converting the price to `decimal(10,2)` type.
   
   ```python
   transformedBinnedRegexDF = transformedBinnedDF.withColumn("price_raw", col("price")).withColumn("price", regexp_replace(col("price"), "[$,]", "").cast("decimal(10,2)"))
   ```

2. **Validation**:
   - Another `dbTest()` validates that the column is converted correctly and includes a backup `price_raw` column.
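
As a rough illustration of the cleaning step, the following plain-Python sketch applies the same character-class replacement to a few hypothetical price strings and keeps the original string alongside the cleaned value. The `clean_price` helper, the sample rows, and the rounding mode are assumptions of this sketch, not part of the notebook.

```python
import re
from decimal import Decimal, ROUND_HALF_UP

def clean_price(raw):
    """Strip '$' and ',' from a price string and return a two-place Decimal."""
    if raw is None:
        return None
    stripped = re.sub(r"[$,]", "", raw)
    # Rounding mode is an assumption made for this sketch
    return Decimal(stripped).quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)

# Hypothetical rows; the real column holds strings such as "$1,250.00"
rows = [{"price": "$1,250.00"}, {"price": "$85.00"}, {"price": None}]
cleaned = [{"price_raw": r["price"], "price": clean_price(r["price"])} for r in rows]

for row in cleaned:
    print(row)
```

Keeping `price_raw` next to the cleaned column makes it possible to spot-check the conversion later without reloading the source data.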

#### Step 5: Filtering Extreme Values
To address extreme values in the dataset:

1. **Removing Outliers**:
   - This step filters out rows where `price` is less than or equal to $0 or where `minimum_nights` is 365 or higher. It also drops rows whose `price_raw` is null.
   
   ```python
   transformedBinnedRegexFilteredDF = transformedBinnedRegexDF.filter((col("price_raw").isNotNull()) & (col("price") > 0) & (col("minimum_nights") < 365))
   ```

2. **Validation**:
   - The final `dbTest()` checks that the correct number of rows remains after filtering.
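
The filter condition can be read as a single predicate applied to each row. The sketch below restates it in plain Python over a few hypothetical rows; `keep_row` and the sample data are illustrative names, not part of the notebook.

```python
from decimal import Decimal

def keep_row(row):
    """Keep rows with a non-null raw price, a positive price, and a stay under 365 nights."""
    return (
        row["price_raw"] is not None
        and row["price"] > 0
        and row["minimum_nights"] < 365
    )

# Hypothetical rows covering each rejection reason
rows = [
    {"price_raw": "$120.00", "price": Decimal("120.00"), "minimum_nights": 2},
    {"price_raw": "$0.00", "price": Decimal("0.00"), "minimum_nights": 1},
    {"price_raw": "$95.00", "price": Decimal("95.00"), "minimum_nights": 365},
    {"price_raw": None, "price": None, "minimum_nights": 3},
]
kept = [r for r in rows if keep_row(r)]
print(len(kept))  # 1
```

Only the first row survives: the second has a price of zero, the third has a minimum stay of exactly 365 nights, and the fourth has no price at all.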

This completes the data preprocessing, preparing the dataset for machine learning tasks by ensuring that the data is clean, standardized, and properly formatted for modeling.

In [0]:
%run "./Includes/Classroom-Setup"

**Restore the Dataset from the Featurization module**

In [0]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer
from pyspark.ml.feature import OneHotEncoder
from pyspark.ml.feature import Imputer

airbnbDF = spark.read.parquet("/mnt/training/airbnb/sf-listings/sf-listings-correct-types.parquet")

indexer = StringIndexer(inputCol="room_type", outputCol="room_type_index")
encoder = OneHotEncoder(inputCols=["room_type_index"], outputCols=["encoded_room_type"])
imputeCols = [
  "host_total_listings_count",
  "bathrooms",
  "beds", 
  "review_scores_rating",
  "review_scores_accuracy",
  "review_scores_cleanliness",
  "review_scores_checkin",
  "review_scores_communication",
  "review_scores_location",
  "review_scores_value"
]
imputer = Imputer(strategy="median", inputCols=imputeCols, outputCols=imputeCols)

pipeline = Pipeline(stages=[
  indexer, 
  encoder, 
  imputer
])

pipelineModel = pipeline.fit(airbnbDF)
transformedDF = pipelineModel.transform(airbnbDF)

display(transformedDF)

### Step 1: Binning `review_scores_rating`

Divide the hosts by whether their `review_scores_rating` is above 97.  Do this using the transformer `Binarizer` with the output column `high_rating`.  This should create the objects `binarizer` and the transformed DataFrame `transformedBinnedDF`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Note that `Binarizer` is a transformer, so it does not have a `.fit()` method.<br>
<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** See the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=binarizer#pyspark.ml.feature.Binarizer" target="_blank">Binarizer Docs</a> for more details.

In [0]:
# TODO
from pyspark.ml.feature import Binarizer

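# The exercise asks for ratings above 97; Binarizer maps values strictly greater than the threshold to 1.0
binarizer = Binarizer(threshold=97.0, inputCol="review_scores_rating", outputCol="high_rating")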
transformedBinnedDF = binarizer.transform(transformedDF)

In [0]:
# TEST - Run this cell to test your solution
from pyspark.ml.feature import Binarizer

dbTest("ML1-P-05-01-01", True, type(binarizer) == type(Binarizer()))
dbTest("ML1-P-05-01-02", True, binarizer.getInputCol() == 'review_scores_rating')
dbTest("ML1-P-05-01-03", True, binarizer.getOutputCol() == 'high_rating')
dbTest("ML1-P-05-01-04", True, "high_rating" in transformedBinnedDF.columns)

print("Tests passed!")

### Step 2: Regular Expressions on Strings

Clean the column `price` by creating two new columns:<br><br>

1. `price`: a new column that contains a cleaned version of price.  This can be done using the regular expression replacement of `"[\$,]"` with `""`.  Cast the column as a decimal.
2. `price_raw`: the column `price` in its current form

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** See the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=regexp_replace#pyspark.sql.functions.regexp_replace" target="_blank">`regexp_replace` Docs</a> for more details.
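
The pattern can be written with or without a backslash before `$`. Inside a character class, `$` is a literal character rather than an end-of-line anchor, so both forms match the same text. The quick check below uses Python's `re` module as a stand-in for the Java regular expressions that Spark uses; the sample string is hypothetical.

```python
import re

price = "$1,250.00"  # hypothetical sample value

escaped = re.sub(r"[\$,]", "", price)   # "$" escaped inside the character class
unescaped = re.sub(r"[$,]", "", price)  # "$" left unescaped

print(escaped, unescaped, escaped == unescaped)  # 1250.00 1250.00 True
```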

In [0]:
# TODO
from pyspark.sql.functions import col, regexp_replace
transformedBinnedRegexDF = transformedBinnedDF.withColumn("price_raw", col("price")).withColumn("price", regexp_replace(col("price"), "[$,]", "").cast("decimal(10,2)"))

In [0]:
# TEST - Run this cell to test your solution
from pyspark.sql.types import DecimalType

dbTest("ML1-P-05-02-01", True, type(transformedBinnedRegexDF.schema["price"].dataType) == type(DecimalType()))
dbTest("ML1-P-05-02-02", True, "price_raw" in transformedBinnedRegexDF.columns)
dbTest("ML1-P-05-02-03", True, "price" in transformedBinnedRegexDF.columns)

print("Tests passed!")

### Step 3: Filter Extremes

The dataset contains extreme values, including negative prices and minimum stays of over one year.  Filter out all prices of $0 or less and all `minimum_nights` of 365 or higher.  Save the results to `transformedBinnedRegexFilteredDF`.

In [0]:
# TODO
transformedBinnedRegexFilteredDF = transformedBinnedRegexDF.filter((col("price_raw").isNotNull()) & (col("price") > 0) & (col("minimum_nights") < 365))

In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-05-03-01", 4789, transformedBinnedRegexFilteredDF.count())

print("Tests passed!")