# Regression (Notebook for Databricks)

We'll build our model in the following steps:

1. Use the features **`bedrooms`**, **`bathrooms`**, **`bathrooms_na`**, **`minimum_nights`**, and **`number_of_reviews`** as input to the VectorAssembler.
2. Build a linear regression model.
3. Evaluate the **`RMSE`** and the **`R2`** on the test set.
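
For reference, the two metrics named in step 3 can be written out in plain Python. This is a minimal sketch on made-up numbers, not Spark's implementation; the helper names `rmse` and `r2` and the sample values are hypothetical.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: sqrt(mean((y - y_hat)^2))."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1 - ss_res / ss_tot

# Hypothetical prices and predictions, for illustration only.
labels = [100.0, 150.0, 200.0, 250.0]
predictions = [110.0, 140.0, 210.0, 240.0]

print(f"RMSE is {rmse(labels, predictions)}")
print(f"R2 is {r2(labels, predictions)}")
```

RMSE is expressed in the units of the label (here, price), while an R2 close to 0 means the model does little better than always predicting the mean label.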

In [5]:
import os
import findspark
findspark.init()
from pyspark.sql import SparkSession

# from pyspark import SparkConf, SparkContext
from datetime import datetime, date, timedelta
from dateutil import relativedelta
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *
from pyspark.sql import DataFrame
from pyspark.sql.functions import *
from pyspark.sql.functions import to_timestamp, to_date
from pyspark.sql import functions as F
from pyspark.sql.functions import collect_list, collect_set, concat, first, array_distinct, col, size, expr
import random
import warnings
warnings.filterwarnings('ignore')

In [6]:
# Start the Spark session (not required when the notebook runs directly in Databricks, which already provides `spark`)
spark = SparkSession.builder \
    .appName("Airbnb Price Regression in Spark") \
    .getOrCreate()

## Load Dataset and Train Model

In [7]:
file_path = "./cleaned_listings.csv"
# airbnb_df = spark.read.format("delta").load(file_path)

#Read the cleaned csv file 
airbnb_df = spark.read.csv(file_path, header="true", inferSchema="true", multiLine="true", escape='"')
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

In [5]:
# Assemble the input features (the independent variables) into a single vector column

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression

vec_assembler = VectorAssembler(inputCols=["bedrooms","bathrooms","bathrooms_na","minimum_nights","number_of_reviews"],outputCol="features")

vtrain_df = vec_assembler.transform(train_df)
vtest_df = vec_assembler.transform(test_df)

lr_model = LinearRegression(labelCol="price").fit(vtrain_df)

24/05/01 18:50:21 WARN Instrumentation: [a14b8fd5] regParam is zero, which might cause numerical instability and overfitting.
24/05/01 18:50:22 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/05/01 18:50:22 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


In [6]:
pred_df = lr_model.transform(vtest_df)

regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")
rmse = regression_evaluator.evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)

print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 419.0121578376188
R2 is 0.07456171275982248


In [7]:
# The out-of-sample (OOS) R-squared is low. Note that this notebook is only meant as practice with Spark ML.

In [8]:
# Use "feature" rather than "col" so the loop does not shadow pyspark.sql.functions.col
for feature, coef in zip(vec_assembler.getInputCols(), lr_model.coefficients):
    print(feature, coef)

print(f"intercept: {lr_model.intercept}")

bedrooms 114.17777449189113
bathrooms -5.8636569331674835
bathrooms_na -93.46199646696445
minimum_nights 0.11479885115899408
number_of_reviews -0.2841304691298576
intercept: 89.84420157032639


## Additional notes on Spark Distributed Computing

### Distributed Setting

Although we can quickly solve for the parameters in closed form when the data is small, the closed-form solution doesn't scale well to large datasets.

Spark uses the following approach to solve a linear regression problem:

* First, Spark tries to solve the problem with a matrix decomposition (the warnings logged in this notebook mention a Cholesky solver).
* If it fails, Spark then uses <a href="https://spark.apache.org/docs/latest/ml-advanced.html#limited-memory-bfgs-l-bfgs" target="_blank">L-BFGS</a> to solve for the parameters. L-BFGS is a limited-memory version of BFGS that is particularly suited to problems with very large numbers of variables. The <a href="https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm" target="_blank">BFGS</a> method belongs to <a href="https://en.wikipedia.org/wiki/Quasi-Newton_method" target="_blank">quasi-Newton methods</a>, which are used to either find zeroes or local maxima and minima of functions iteratively. 
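
To make the "closed form" concrete, here is a minimal sketch that solves the normal equations for a single feature plus an intercept in plain Python. It is not Spark's solver; the function name `fit_one_feature` and the data are hypothetical. It also shows one way the closed form can fail: when the feature carries no information beyond the intercept, the matrix is singular and an iterative method would be needed instead.

```python
def fit_one_feature(xs, ys):
    """Solve the 2x2 normal equations (X^T X) w = X^T y for y = intercept + slope * x."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    # Determinant of X^T X = [[n, sx], [sx, sxx]]
    det = n * sxx - sx * sx
    if det == 0:
        # Singular case: the feature column is constant, so it is collinear with the intercept.
        raise ValueError("singular X^T X; an iterative solver would be needed")
    slope = (n * sxy - sx * sy) / det
    intercept = (sxx * sy - sx * sxy) / det
    return intercept, slope

# Hypothetical data generated from y = 1 + 2x.
intercept, slope = fit_one_feature([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])
print(f"intercept: {intercept}, slope: {slope}")
```

With many features the same idea requires building and factorizing a features-by-features matrix, which is why collinear columns (for example, redundant one-hot encoded categories) can make the decomposition fail.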


## Improving the Model

In [28]:
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

#### One Hot Encoding

In [29]:
#Categorical Variables 

#One Hot Encoding
from pyspark.ml.feature import OneHotEncoder, StringIndexer

categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
index_output_cols = [x + "Index" for x in categorical_cols]
ohe_output_cols = [x + "OHE" for x in categorical_cols]

string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")
ohe_encoder = OneHotEncoder(inputCols=index_output_cols, outputCols=ohe_output_cols)
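
To see what these two stages produce, the following plain-Python sketch mimics their default behavior as I understand it: categories are indexed by descending frequency, and the one-hot vector drops the last category. It is an illustration, not the Spark implementation; the helper names and the sample room types are hypothetical, the alphabetical tie-break is an assumption, and Spark emits sparse vectors rather than Python lists.

```python
from collections import Counter

def fit_string_index(values):
    """Map each category to an index, most frequent first (ties broken alphabetically, an assumption)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda v: (-counts[v], v))
    return {category: index for index, category in enumerate(ordered)}

def one_hot(index, num_categories):
    """Encode an index as a vector of length num_categories - 1; the last category is all zeros."""
    vector = [0.0] * (num_categories - 1)
    if index < num_categories - 1:
        vector[index] = 1.0
    return vector

# Hypothetical categorical column, for illustration only.
room_types = ["Entire home", "Private room", "Entire home",
              "Shared room", "Entire home", "Private room"]

mapping = fit_string_index(room_types)
encoded = [one_hot(mapping[v], len(mapping)) for v in room_types]

print(mapping)
print(encoded[:4])
```

Dropping one category keeps the encoded columns from summing to a constant, which would otherwise be collinear with the intercept.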

#### Vector Assembler 

In [30]:
from pyspark.ml.feature import VectorAssembler

numeric_cols = [field for (field, dataType) in train_df.dtypes if dataType == "double" and field != "price"]
assembler_inputs = ohe_output_cols + numeric_cols
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

#### Linear Regression

In [31]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(labelCol="price", featuresCol="features")

#### Pipeline

In [1]:
from pyspark.ml import Pipeline

stages = [string_indexer, ohe_encoder, vec_assembler, lr]
pipeline = Pipeline(stages=stages)

pipeline_model = pipeline.fit(train_df)

#### Saving Models

In [33]:
pipeline_model.write().overwrite().save('./model')

#### Loading Models

In [34]:
from pyspark.ml import PipelineModel

saved_pipeline_model = PipelineModel.load('./model')

#### Model Testing

In [35]:
pred_df = saved_pipeline_model.transform(test_df)

display(pred_df.select("features", "price", "prediction"))

DataFrame[features: vector, price: double, prediction: double]

In [36]:
# pred_df.take(1)

#### Model Evaluation

In [37]:
# from pyspark.ml.evaluation import RegressionEvaluator

# regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

# rmse = regression_evaluator.evaluate(pred_df)
# r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
# print(f"RMSE is {rmse}")
# print(f"R2 is {r2}")

## Using RFormula 

Instead of manually telling the StringIndexer and OneHotEncoder which columns are categorical, we can let RFormula do it for us: it automatically indexes and one-hot encodes string columns.

In [8]:
from pyspark.ml import Pipeline
from pyspark.ml.feature import RFormula
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

r_formula = RFormula(formula="price~.",
                     featuresCol="features",
                     handleInvalid="skip",
                     labelCol="price")
lr = LinearRegression(labelCol="price")

In [9]:
r_formula.explainParams()

"featuresCol: features column name. (default: features, current: features)\nforceIndexLabel: Force to index label whether it is numeric or string (default: False)\nformula: R model formula (current: price~.)\nhandleInvalid: how to handle invalid entries. Options are 'skip' (filter out rows with invalid values), 'error' (throw an error), or 'keep' (put invalid data in a special additional bucket, at index numLabels). (default: error, current: skip)\nlabelCol: label column name. (default: label, current: price)\nstringIndexerOrderType: How to order categories of a string feature column used by StringIndexer. The last category after ordering is dropped when encoding strings. Supported options: frequencyDesc, frequencyAsc, alphabetDesc, alphabetAsc. The default value is frequencyDesc. When the ordering is set to alphabetDesc, RFormula drops the same category as R when encoding strings. (default: frequencyDesc)"

In [10]:
pipeline = Pipeline(stages=[r_formula,lr])
pipeline_model = pipeline.fit(train_df)
pred_df = pipeline_model.transform(test_df)

regression_evaluator = RegressionEvaluator(labelCol="price")

rmse = regression_evaluator.setMetricName("rmse").evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

24/05/02 10:10:21 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/05/02 10:10:26 WARN Instrumentation: [9d20d039] regParam is zero, which might cause numerical instability and overfitting.
24/05/02 10:10:27 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/05/02 10:10:27 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK
24/05/02 10:10:27 WARN Instrumentation: [9d20d039] Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver.


RMSE is 405.6176153007995
R2 is 0.13422393436448787


### Use Log Transformation for Price

Since price is heavily right-skewed, the log-transformed price will have a much more symmetric distribution.
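
The effect of the log transform on skewness can be checked with a small plain-Python sketch. The prices below are made up for illustration and are not taken from the dataset; `skewness` is a hypothetical helper computing the population skewness.

```python
import math

def skewness(values):
    """Population skewness: third central moment divided by the cubed standard deviation."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Hypothetical nightly prices with one very expensive listing.
prices = [50, 60, 75, 80, 100, 120, 150, 200, 400, 2000]
log_prices = [math.log(p) for p in prices]

print(f"skewness of price:      {skewness(prices):.2f}")
print(f"skewness of log(price): {skewness(log_prices):.2f}")
```

On this sample the log-transformed values are noticeably less skewed than the raw prices, although a single extreme listing still leaves some right skew.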



In [12]:
from pyspark.sql.functions import col, log

log_train_df = train_df.withColumn("logprice",log("price"))
log_test_df = test_df.withColumn("logprice",log("price"))

r_formula = RFormula(formula="logprice ~ . - price", handleInvalid="skip")  # "skip" filters out rows with invalid values
lr.setLabelCol("logprice")
pipeline = Pipeline(stages=[r_formula, lr])
pipeline_model = pipeline.fit(log_train_df)
pred_df = pipeline_model.transform(log_test_df)

24/05/02 10:20:17 WARN Instrumentation: [2d12f4b4] regParam is zero, which might cause numerical instability and overfitting.
24/05/02 10:20:17 WARN Instrumentation: [2d12f4b4] Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver.


In [14]:
# In order to interpret our RMSE, we need to convert our predictions back from logarithmic scale.
from pyspark.sql.functions import exp

exp_df = pred_df.withColumn("prediction",exp("prediction"))

rmse = regression_evaluator.setMetricName("rmse").evaluate(exp_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(exp_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 408.09168053567373
R2 is 0.12363011935789525
