# Regression (Notebook for Databricks)

We'll build our model in three steps:

1. Use the features **`bedrooms`**, **`bathrooms`**, **`bathrooms_na`**, **`minimum_nights`**, and **`number_of_reviews`** as inputs to a VectorAssembler.
2. Build a linear regression model.
3. Evaluate the **`RMSE`** and the **`R2`** on the test set.

In [1]:
import os
import findspark
findspark.init()
from pyspark.sql import SparkSession

# from pyspark import SparkConf, SparkContext
from datetime import datetime, date, timedelta
from dateutil import relativedelta
from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *
from pyspark.sql import DataFrame
from pyspark.sql.functions import *
from pyspark.sql.functions import to_timestamp, to_date
from pyspark.sql import functions as F
from pyspark.sql.functions import collect_list, collect_set, concat, first, array_distinct, col, size, expr
import random
import warnings
warnings.filterwarnings('ignore')

In [2]:
# Start the Spark session (not required when the notebook runs directly in Databricks, which already provides `spark`)
spark = SparkSession.builder \
    .appName("Airbnb Price Regression in Spark") \
    .getOrCreate()



## Load Dataset and Train Model

In [3]:
file_path = "./cleaned_listings.csv"
# airbnb_df = spark.read.format("delta").load(file_path)

#Read the cleaned csv file 
airbnb_df = spark.read.csv(file_path, header="true", inferSchema="true", multiLine="true", escape='"')
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

In [5]:
# Assemble the independent variables (features) into a single vector column

from pyspark.ml.feature import VectorAssembler
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.regression import LinearRegression

vec_assembler = VectorAssembler(inputCols=["bedrooms","bathrooms","bathrooms_na","minimum_nights","number_of_reviews"],outputCol="features")

vtrain_df = vec_assembler.transform(train_df)
vtest_df = vec_assembler.transform(test_df)

lr_model = LinearRegression(labelCol="price").fit(vtrain_df)

24/05/01 18:50:21 WARN Instrumentation: [a14b8fd5] regParam is zero, which might cause numerical instability and overfitting.
24/05/01 18:50:22 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS
24/05/01 18:50:22 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.lapack.JNILAPACK


In [6]:
pred_df = lr_model.transform(vtest_df)

regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")
rmse = regression_evaluator.evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)

print(f"RMSE is {rmse}")
print(f"R2 is {r2}")

RMSE is 419.0121578376188
R2 is 0.07456171275982248
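
To make the two metrics concrete, here is a minimal pure-Python sketch of the standard RMSE and R2 formulas. The helper names (`rmse_score`, `r2_score`) and the toy prices are made up for illustration and are not part of the notebook or of Spark; it is assumed that `RegressionEvaluator` uses these usual definitions.

```python
import math

def rmse_score(labels, predictions):
    """Root mean squared error: square root of the average squared residual."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(labels))

def r2_score(labels, predictions):
    """R2 = 1 - SS_res / SS_tot, where SS_tot is the error of always predicting the mean."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

# Hypothetical toy data (not taken from the Airbnb dataset)
toy_prices = [100.0, 150.0, 200.0, 250.0]
toy_predictions = [110.0, 140.0, 210.0, 240.0]

toy_rmse = rmse_score(toy_prices, toy_predictions)
toy_r2 = r2_score(toy_prices, toy_predictions)

# Always predicting the mean price gives R2 = 0 by construction
baseline_r2 = r2_score(toy_prices, [175.0] * len(toy_prices))

print(f"toy RMSE: {toy_rmse}, toy R2: {toy_r2}, mean-baseline R2: {baseline_r2}")
```

Read this way, the R2 of about 0.07 above says that the model's squared error on the test set is only about 7% lower than that of always predicting the mean test price.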


In [7]:
# The out-of-sample (OOS) R-squared is low (about 0.07), so this model explains little of the variance in price. This notebook is intended only as Spark ML practice.

In [8]:
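# Use a loop variable other than `col` so it doesn't shadow pyspark.sql.functions.col imported above
for feature_name, coef in zip(vec_assembler.getInputCols(), lr_model.coefficients):
    print(feature_name, coef)
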
print(f"intercept: {lr_model.intercept}")

bedrooms 114.17777449189113
bathrooms -5.8636569331674835
bathrooms_na -93.46199646696445
minimum_nights 0.11479885115899408
number_of_reviews -0.2841304691298576
intercept: 89.84420157032639
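
As a quick illustration of how to read these numbers, the sketch below plugs the printed coefficients and intercept into the linear formula `prediction = intercept + sum(coef * feature)`. The example listing is hypothetical, and `predict_price` is a helper written for this note, not a Spark function.

```python
# Coefficients and intercept copied from the output above
coefficients = {
    "bedrooms": 114.17777449189113,
    "bathrooms": -5.8636569331674835,
    "bathrooms_na": -93.46199646696445,
    "minimum_nights": 0.11479885115899408,
    "number_of_reviews": -0.2841304691298576,
}
intercept = 89.84420157032639

def predict_price(listing):
    """Linear prediction: intercept plus the weighted sum of the feature values."""
    return intercept + sum(coefficients[name] * listing[name] for name in coefficients)

# Hypothetical listing, chosen only for illustration
example_listing = {
    "bedrooms": 2,
    "bathrooms": 1,
    "bathrooms_na": 0,
    "minimum_nights": 3,
    "number_of_reviews": 10,
}
predicted = predict_price(example_listing)

# Holding everything else fixed, one extra bedroom shifts the prediction by the bedrooms coefficient
one_more_bedroom = dict(example_listing, bedrooms=3)
delta = predict_price(one_more_bedroom) - predicted

print(f"predicted price: {predicted:.2f}, effect of one more bedroom: {delta:.2f}")
```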


## Additional notes on Spark Distributed Computing

## Distributed Setting

Although the parameters can be solved for quickly in closed form when the data is small, the closed-form solution doesn't scale well to large datasets.

Spark takes the following approach to a linear regression problem:

* First, Spark tries to solve the problem directly with a matrix decomposition (the training log later in this notebook shows that this is a Cholesky solver).
* If that fails, for example because the covariance matrix is singular, Spark falls back to <a href="https://spark.apache.org/docs/latest/ml-advanced.html#limited-memory-bfgs-l-bfgs" target="_blank">L-BFGS</a> to solve for the parameters. L-BFGS is a limited-memory version of BFGS that is particularly suited to problems with very large numbers of variables. The <a href="https://en.wikipedia.org/wiki/Broyden%E2%80%93Fletcher%E2%80%93Goldfarb%E2%80%93Shanno_algorithm" target="_blank">BFGS</a> method belongs to the family of <a href="https://en.wikipedia.org/wiki/Quasi-Newton_method" target="_blank">quasi-Newton methods</a>, which iteratively find zeros or local maxima and minima of functions.
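
The "try a direct solve, then fall back to an iterative solver" idea can be sketched on a one-feature toy problem. This is only an illustration under stated assumptions: the function names are invented for this note, the data is made up, and plain gradient descent stands in for L-BFGS purely to keep the code short. It is not what Spark runs internally.

```python
def fit_closed_form(xs, ys):
    """Direct least-squares solution for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    if var_x == 0:
        # A constant feature makes the normal equations singular
        raise ValueError("singular system: the feature has no variance")
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = cov_xy / var_x
    return mean_y - slope * mean_x, slope

def fit_iterative(xs, ys, learning_rate=0.05, num_iters=5000):
    """Gradient descent on the mean squared error (a simple stand-in for L-BFGS)."""
    intercept, slope = 0.0, 0.0
    n = len(xs)
    for _ in range(num_iters):
        errors = [intercept + slope * x - y for x, y in zip(xs, ys)]
        grad_intercept = 2 * sum(errors) / n
        grad_slope = 2 * sum(e * x for e, x in zip(errors, xs)) / n
        intercept -= learning_rate * grad_intercept
        slope -= learning_rate * grad_slope
    return intercept, slope

def fit_with_fallback(xs, ys):
    """Try the direct solve first; fall back to the iterative solver if it fails."""
    try:
        return fit_closed_form(xs, ys), "closed-form"
    except ValueError:
        return fit_iterative(xs, ys), "iterative"

# Well-posed toy data generated from y = 1 + 2x
params_ok, method_ok = fit_with_fallback([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Constant feature: the direct solve fails, so the iterative solver is used
params_singular, method_singular = fit_with_fallback([2.0, 2.0, 2.0, 2.0], [3.0, 5.0, 7.0, 9.0])

print(method_ok, params_ok)
print(method_singular, params_singular)
```

In the singular case there are infinitely many equally good solutions; the iterative solver simply returns one of them, which is why the fallback can still produce a usable model.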


## Improving the Model

In [9]:
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=42)

#### One Hot Encoding

In [10]:
# Categorical variables: index each string column, then one-hot encode the indices
from pyspark.ml.feature import OneHotEncoder, StringIndexer

categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
index_output_cols = [x + "Index" for x in categorical_cols]
ohe_output_cols = [x + "OHE" for x in categorical_cols]

string_indexer = StringIndexer(inputCols=categorical_cols, outputCols=index_output_cols, handleInvalid="skip")
ohe_encoder = OneHotEncoder(inputCols=index_output_cols, outputCols=ohe_output_cols)
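
For intuition, here is a small pure-Python sketch of what the two stages do conceptually. It assumes the Spark defaults (StringIndexer orders labels by descending frequency, OneHotEncoder drops the last category); the helper names and the `room_type` values are hypothetical and chosen only for illustration.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index, most frequent label first (ties broken alphabetically)."""
    counts = Counter(values)
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: idx for idx, label in enumerate(ordered)}

def one_hot(index, num_categories, drop_last=True):
    """Encode an index as a 0/1 vector; with drop_last the final category becomes all zeros."""
    size = num_categories - 1 if drop_last else num_categories
    vector = [0.0] * size
    if index < size:
        vector[index] = 1.0
    return vector

# Hypothetical categorical column (not read from the dataset)
room_types = [
    "Entire home/apt", "Private room", "Entire home/apt",
    "Shared room", "Private room", "Entire home/apt",
]

label_to_index = fit_string_index(room_types)
encoded = [one_hot(label_to_index[label], len(label_to_index)) for label in room_types]

print(label_to_index)
print(encoded[0], encoded[3])
```

Dropping the last category avoids a redundant column: its value is implied when all the others are zero. With `handleInvalid="skip"`, rows whose label was not seen during fitting are dropped rather than encoded.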

#### Vector Assembler 

In [11]:
from pyspark.ml.feature import VectorAssembler

numeric_cols = [field for (field, dataType) in train_df.dtypes if dataType == "double" and field != "price"]
assembler_inputs = ohe_output_cols + numeric_cols
vec_assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")

#### Linear Regression

In [12]:
from pyspark.ml.regression import LinearRegression

lr = LinearRegression(labelCol="price", featuresCol="features")

#### Pipeline

In [13]:
from pyspark.ml import Pipeline

stages = [string_indexer, ohe_encoder, vec_assembler, lr]
pipeline = Pipeline(stages=stages)

pipeline_model = pipeline.fit(train_df)

24/05/01 19:06:10 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
24/05/01 19:06:11 WARN Instrumentation: [f6d3f813] regParam is zero, which might cause numerical instability and overfitting.
24/05/01 19:06:12 WARN Instrumentation: [f6d3f813] Cholesky solver failed due to singular covariance matrix. Retrying with Quasi-Newton solver.
                                                                                

#### Saving Models

In [23]:
pipeline_model.write().overwrite().save('./model')

#### Loading Models

In [24]:
from pyspark.ml import PipelineModel

saved_pipeline_model = PipelineModel.load('./model')

#### Model Testing

In [21]:
pred_df = saved_pipeline_model.transform(test_df)

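# display() renders an interactive table in Databricks; elsewhere it only prints the DataFrame schema, so use .show() there instead
display(pred_df.select("features", "price", "prediction"))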

DataFrame[features: vector, price: double, prediction: double]

In [26]:
# pred_df.take(1)

#### Model Evaluation

In [27]:
from pyspark.ml.evaluation import RegressionEvaluator

# Evaluate the pipeline's predictions on the held-out test set
regression_evaluator = RegressionEvaluator(predictionCol="prediction", labelCol="price", metricName="rmse")

rmse = regression_evaluator.evaluate(pred_df)
r2 = regression_evaluator.setMetricName("r2").evaluate(pred_df)
print(f"RMSE is {rmse}")
print(f"R2 is {r2}")