In [1]:
from pyspark.sql import SparkSession

# Create a SparkSession
spark = SparkSession.builder \
    .appName("example") \
    .getOrCreate()

# Which MLlib Module?

PySpark MLlib has many modules and algorithms available for machine learning. Each of them solves a specific subset of problem types. Given the problem we are trying to solve, which module is the correct one for continuous value prediction?

<center><img src="images/04.01.png"  style="width: 400px; height: 300px;"/></center>

- `ml.regression`

# Creating Time Splits

Often with time series, you acquire new data as it becomes available and will want to retrain your model on the newest data. In the video, we showed how to do a percentage split for test and training sets, but suppose you wish to train on all available data except for the last 45 days, which you want to use as a test set.

In this exercise, we will create a function to find the split date so that the last 45 days of data are used for testing and the rest for training.

In [4]:
df = spark.read.csv("dataset/2017_StPaul_MN_Real_Estate.csv", header=True)
# Cast the numeric columns, which are read in as strings, to double
numeric_cols = ['ACRES', 'SalesClosePrice', 'FOUNDATIONSIZE', 'LISTPRICE',
                'AssessedValuation', 'Taxes', 'Bedrooms', 'BATHSTOTAL',
                'SQFTBELOWGROUND', 'SQFTABOVEGROUND']
for name in numeric_cols:
  df = df.withColumn(name, df[name].cast('double'))

# Import needed functions
from pyspark.sql.functions import to_date, dayofweek, to_timestamp, col
# Parse the date strings into timestamps
df = df.withColumn("LISTDATE", to_timestamp("LISTDATE", 'M/d/yyyy H:mm'))
df = df.withColumn("offmarketdate", to_timestamp("offmarketdate", 'M/d/yyyy H:mm'))
# sorted(df.columns)
df.show(3)


In [5]:
from datetime import timedelta
def train_test_split_date(df, split_col, test_days=45):
  """Calculate the date to split test and training sets"""
  # Find the most recent date in the split column
  max_date = df.agg({split_col: 'max'}).collect()[0][0]
  # Subtract an integer number of days from the last date in dataset
  split_date = max_date - timedelta(days=test_days)
  return split_date

# Find the date to use in splitting test and train
split_date = train_test_split_date(df, 'offmarketdate')

# Create Sequential Test and Training Sets
train_df = df.where(df['offmarketdate'] < split_date) 
test_df = df.where(df['offmarketdate'] >= split_date).where(df['LISTDATE'] <= split_date) 
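
The date arithmetic behind `train_test_split_date` can be checked without Spark. The following is a minimal plain-Python sketch that applies the same filters as the cell above; the `sales` rows and the `split_date_for` helper are made up for illustration and are not taken from the dataset.

```python
from datetime import datetime, timedelta

# Hypothetical (LISTDATE, offmarketdate) pairs, for illustration only
sales = [
    (datetime(2017, 1, 5), datetime(2017, 2, 1)),
    (datetime(2017, 5, 20), datetime(2017, 7, 4)),
    (datetime(2017, 10, 1), datetime(2017, 12, 20)),
    (datetime(2017, 12, 28), datetime(2018, 1, 15)),
]

def split_date_for(rows, test_days=45):
    """Return the latest off-market date minus test_days."""
    max_date = max(off for _, off in rows)
    return max_date - timedelta(days=test_days)

split_date = split_date_for(sales)

# Training rows went off market before the split date
train = [r for r in sales if r[1] < split_date]
# Test rows went off market on or after the split date
# and were already listed by then
test = [r for r in sales if r[1] >= split_date and r[0] <= split_date]

print(split_date.date(), len(train), len(test))
```

Note that a home listed after the split date, like the last hypothetical row, lands in neither set, because it could not have been scored on the split date.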

# Adjusting Time Features

We have mentioned throughout this course some of the dangers of leaking information to your model during training. Data leakage will cause your model to report very optimistic accuracy metrics, but once real data is run through it the results are often very disappointing.

In this exercise, we are going to ensure that DAYSONMARKET reflects only the information we have at the time of predicting the value. That is, if the house is still on the market, we don't know how many more days it will stay on the market. We need to adjust our test_df to reflect what information we have as of 2017-12-10.

In [7]:
from pyspark.sql.functions import datediff, to_date, lit

split_date = to_date(lit('2017-12-10'))
# Create Sequential Test set
test_df = df.where(df['offmarketdate'] >= split_date).where(df['LISTDATE'] <= split_date)

# Create a copy of DAYSONMARKET to review later
test_df = test_df.withColumn('DAYSONMARKET_Original', test_df['DAYSONMARKET'])

# Recalculate DAYSONMARKET from what we know on our split date
test_df = test_df.withColumn('DAYSONMARKET', datediff(split_date, 'LISTDATE'))

# Review the difference
test_df[['LISTDATE', 'offmarketdate', 'DAYSONMARKET_Original', 'DAYSONMARKET']].show(5)

+-------------------+-------------------+---------------------+------------+
|           LISTDATE|      offmarketdate|DAYSONMARKET_Original|DAYSONMARKET|
+-------------------+-------------------+---------------------+------------+
|2017-10-06 00:00:00|2018-01-24 00:00:00|                  110|          65|
|2017-09-18 00:00:00|2017-12-12 00:00:00|                   82|          83|
|2017-11-07 00:00:00|2017-12-12 00:00:00|                   35|          33|
|2017-10-30 00:00:00|2017-12-11 00:00:00|                   42|          41|
|2017-07-14 00:00:00|2017-12-19 00:00:00|                  158|         149|
+-------------------+-------------------+---------------------+------------+
only showing top 5 rows
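
The recalculated values can be verified by hand, since `datediff(end, start)` returns the number of days from `start` to `end`. A small plain-Python check against the first and third rows shown above (the `days_on_market_at` helper is a hypothetical name used only here):

```python
from datetime import date

split_date = date(2017, 12, 10)

def days_on_market_at(split, list_date):
    """Days between the listing date and the split date."""
    return (split - list_date).days

# LISTDATE values from the first and third rows of the output above
first = days_on_market_at(split_date, date(2017, 10, 6))
third = days_on_market_at(split_date, date(2017, 11, 7))
print(first, third)
```

In the first row the home did not go off market until 2018-01-24, so the original value of 110 includes days that had not yet happened on the split date.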



# Feature Engineering For Random Forests

It is important to consider which steps you need to take to preprocess your data before running a machine learning algorithm; otherwise you could get invalid results. Which of the following preprocessing techniques are needed for Random Forest Regression?

- Perform value replacement for missing values and encode categorical text features to numeric.

# Dropping Columns with Low Observations

After doing a lot of feature engineering, it's a good idea to take a step back and look at what you've created. If you've used some automation techniques on your categorical features, like exploding or one-hot encoding, you may find that you now have hundreds of new binary features. While feature selection is material for a whole other course, there are some quick steps you can take to reduce the dimensionality of your data set.

In this exercise, we are going to remove columns that have fewer than 30 observations. 30 is a commonly cited minimum number of observations for statistical significance. With fewer than that, an apparent relationship may be sheer coincidence and cause overfitting!

In [8]:
# # PotentialShortSale
# obs_threshold = 30
# cols_to_remove = list()
# # Inspect first 10 binary columns in list
# for col in binary_cols[0:10]:
#   # Count the number of 1 values in the binary column
#   obs_count = df.agg({col:'sum'}).collect()[0][0]
#   # If less than our observation threshold, remove
#   if obs_count < obs_threshold:
#     cols_to_remove.append(col)
    
# # Drop columns and print starting and ending dataframe shapes
# new_df = df.drop(*cols_to_remove)

# print('Rows: ' + str(df.count()) + ' Columns: ' + str(len(df.columns)))
# print('Rows: ' + str(new_df.count()) + ' Columns: ' + str(len(new_df.columns)))
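
The thresholding logic itself does not depend on Spark. A minimal sketch in plain Python, using a hypothetical dictionary of binary columns (the column names and values are invented for illustration):

```python
obs_threshold = 30

# Hypothetical binary feature columns: name -> list of 0/1 values
binary_columns = {
    'has_garage': [1] * 120 + [0] * 80,
    'has_pool': [1] * 4 + [0] * 196,
    'is_corner_lot': [1] * 29 + [0] * 171,
}

# The sum of a 0/1 column is the number of positive observations
cols_to_remove = [name for name, values in binary_columns.items()
                  if sum(values) < obs_threshold]

# Keep only the columns that meet the threshold
kept = {name: values for name, values in binary_columns.items()
        if name not in cols_to_remove}

print(cols_to_remove)
print(sorted(kept))
```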

# Naively Handling Missing and Categorical Values

Random Forest Regression is robust enough to allow us to ignore many of the more time-consuming and tedious data preparation steps. While some implementations of Random Forest handle missing and categorical values automatically, PySpark's does not. The math remains the same, however, so we can get away with some naive value replacements.

For missing values, since our data is strictly positive, we will assign -1. The random forest will split on this value and handle it differently from the rest of the values in the same feature.

For categorical values, we can just map the text values to numbers, and again the random forest will handle them appropriately by splitting on them. In this example, we will dust off pipelines from Introduction to PySpark to write our code more concisely. Please note that the exercise starts by displaying the dtypes of the columns in the dataframe; compare them to the results at the end of this exercise.

In [9]:
# from pyspark.ml import Pipeline
# from pyspark.ml.feature import StringIndexer, OneHotEncoder, VectorAssembler
# # Replace missing values
# df = df.fillna(-1, subset=['WALKSCORE' , 'BIKESCORE'])

# # Create list of StringIndexers using list comprehension
# indexers = [StringIndexer(inputCol=col, outputCol=col+"_IDX")\
#             .setHandleInvalid("keep") for col in categorical_cols]
# # Create pipeline of indexers
# indexer_pipeline = Pipeline(stages=indexers)
# # Fit and Transform the pipeline to the original data
# df_indexed = indexer_pipeline.fit(df).transform(df)

# # Clean up redundant columns
# df_indexed = df_indexed.drop(*categorical_cols)
# # Inspect data transformations
# print(df_indexed.dtypes)
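
To make the two replacements concrete, here is a plain-Python sketch of what they do. It mimics the default behaviour of `StringIndexer`, which assigns index 0 to the most frequent label, and of `handleInvalid="keep"`, which places unseen labels in one extra index. The `fit_indexer` and `transform` helpers and the sample values are hypothetical and are not part of PySpark.

```python
from collections import Counter

def fit_indexer(values):
    """Map each label to an index, most frequent label first."""
    counts = Counter(values)
    # Sort by descending frequency, then alphabetically for stable ties
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: float(i) for i, label in enumerate(ordered)}

def transform(values, mapping):
    """Index labels; unseen labels share one extra 'keep' index."""
    unseen_index = float(len(mapping))
    return [mapping.get(v, unseen_index) for v in values]

# Hypothetical categorical column
city = ['St Paul', 'Woodbury', 'St Paul', 'Oakdale', 'St Paul', 'Woodbury']
mapping = fit_indexer(city)
indexed = transform(city + ['Maplewood'], mapping)

# Missing numeric values are replaced with -1, outside the valid range
walkscore = [72.0, None, 55.0]
filled = [-1.0 if v is None else v for v in walkscore]

print(mapping)
print(indexed)
print(filled)
```

The resulting indices carry no real ordering, which a tree model tolerates because it only needs thresholds to separate groups of values.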

# Building a Regression Model

One of the great things about the PySpark ML module is that most algorithms can be tried and tested without changing much code. Random Forest Regression is a fairly simple ensemble model that uses bagging to fit. Another tree-based ensemble model is Gradient Boosted Trees, which uses a different approach called boosting. In this exercise, let's train a `GBTRegressor`.

In [11]:
# list.remove() works in place and returns None, so filter the list instead
features = [c for c in df.columns if c != 'SalesClosePrice']

In [14]:
# from pyspark.ml.regression import GBTRegressor

# # Train a Gradient Boosted Trees (GBT) model.
# gbt = GBTRegressor(featuresCol='features',
#                            labelCol='SalesClosePrice',
#                            predictionCol="Prediction_Price",
#                            seed=42
#                            )

# # Train model.
# model = gbt.fit(train_df)

# Evaluating & Comparing Algorithms

Now that we've created a new model with GBTRegressor, it's time to compare it against our baseline of RandomForestRegressor. To do this, we will compare the predictions of both models to the actual data and calculate RMSE and R^2.

In [15]:
# from pyspark.ml.evaluation import RegressionEvaluator

# # Select columns to compute test error
# evaluator = RegressionEvaluator(labelCol='SalesClosePrice', 
#                                 predictionCol='Prediction_Price')
# # Dictionary of model predictions to loop over
# models = {'Gradient Boosted Trees': gbt_predictions, 'Random Forest Regression': rfr_predictions}
# for key, preds in models.items():
#   # Create evaluation metrics
#   rmse = evaluator.evaluate(preds, {evaluator.metricName: "rmse"})
#   r2 = evaluator.evaluate(preds, {evaluator.metricName: "r2"})
  
#   # Print Model Metrics
#   print(key + ' RMSE: ' + str(rmse))
#   print(key + ' R^2: ' + str(r2))

# Understanding Metrics

Recall that R^2 and RMSE are both metrics to evaluate the performance of regression models. Both provide a different way to interpret the fit of our model. Which of the following statements is TRUE regarding R^2 or RMSE?


- RMSE is comparable across predictions looking at the same dependent variable.
- R^2 is comparable across predictions regardless of dependent variable.
- RMSE is a measure of unexplained variance in the dependent variable.
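
The definitions behind these statements can be written out directly, which also makes the scale behaviour of each metric visible. The following plain-Python sketch uses made-up prices and predictions, and the `rmse` and `r2` helpers are illustrative stand-ins rather than the `RegressionEvaluator` implementation. Rescaling the values from dollars to thousands of dollars changes RMSE by the same factor but leaves R^2 unchanged.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error, in the units of the label."""
    n = len(actual)
    return sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)

def r2(actual, predicted):
    """Share of the label's variance explained by the predictions."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot

# Hypothetical sale prices and predictions, in dollars
actual = [200000.0, 250000.0, 300000.0, 350000.0]
predicted = [210000.0, 240000.0, 310000.0, 340000.0]

# The same values expressed in thousands of dollars
actual_k = [a / 1000 for a in actual]
predicted_k = [p / 1000 for p in predicted]

print(rmse(actual, predicted), rmse(actual_k, predicted_k))
print(r2(actual, predicted), r2(actual_k, predicted_k))
```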

# Interpreting Results

It is almost always important to know which features are influencing your prediction the most. Perhaps the result is counterintuitive, and that's an insight? Perhaps a handful of features account for most of the accuracy of your model and you don't need to spend time acquiring or massaging other features.

In [16]:
# import pandas as pd

# # Convert feature importances to a pandas column
# fi_df = pd.DataFrame(importances, columns=['importances'])

# # Convert list of feature names to pandas column
# fi_df['feature'] = pd.Series(feature_cols)

# # Sort the data based on feature importance
# fi_df.sort_values(by=['importances'], ascending=False, inplace=True)

# # Inspect Results
# fi_df.head(10)
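
The same ranking can be sketched without pandas. In this plain-Python example the feature names come from the columns cast earlier, but the importance values are invented for illustration and are not model output.

```python
# Hypothetical importances, for illustration only
feature_cols = ['Bedrooms', 'SQFTABOVEGROUND', 'ACRES', 'Taxes']
importances = [0.05, 0.62, 0.12, 0.21]

# Pair each feature with its importance and sort, largest first
ranked = sorted(zip(feature_cols, importances),
                key=lambda pair: pair[1], reverse=True)

# Track the cumulative share of importance covered so far
running = 0.0
for name, score in ranked:
    running += score
    print(f'{name:<16} {score:.2f} {running:.2f}')
```

The running total shows how quickly a handful of features accounts for most of the total importance.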

# Saving & Loading Models

Often you may find yourself going back to a previous model to see what assumptions or settings were used when diagnosing where your prediction errors are coming from. Perhaps there was something wrong with the data? Maybe you need to incorporate a new feature to capture an unusual event that occurred?

In this example, you will practice saving and loading a model.

In [17]:
# from pyspark.ml.regression import RandomForestRegressionModel

# # Save model
# model.save('rfr_no_listprice')

# # Load model (the loader class must match the type of the saved model)
# loaded_model = RandomForestRegressionModel.load('rfr_no_listprice')