## DATA INGESTION AND EXPLORATION

In [1]:
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession
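# Stop any SparkContext left over from a previous run, then start a fresh local one.
sc = SparkContext.getOrCreate()
sc.stop()
sc = SparkContext('local')
# The session is named `pyspark` here and is used under that name in later cells.
pyspark = SparkSession(sc)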

In [2]:
file_path = "../../datasets/data_sf-airbnb/sf-airbnb-clean.parquet/"
airbnb_df = pyspark.read.parquet(file_path)
airbnb_df.select("neighbourhood_cleansed", "room_type", "bedrooms", "bathrooms",
                "number_of_reviews", "price").show(5)

+----------------------+---------------+--------+---------+-----------------+-----+
|neighbourhood_cleansed|      room_type|bedrooms|bathrooms|number_of_reviews|price|
+----------------------+---------------+--------+---------+-----------------+-----+
|      Western Addition|Entire home/apt|     1.0|      1.0|            180.0|170.0|
|        Bernal Heights|Entire home/apt|     2.0|      1.0|            111.0|235.0|
|        Haight Ashbury|   Private room|     1.0|      4.0|             17.0| 65.0|
|        Haight Ashbury|   Private room|     1.0|      4.0|              8.0| 65.0|
|      Western Addition|Entire home/apt|     2.0|      1.5|             27.0|785.0|
+----------------------+---------------+--------+---------+-----------------+-----+
only showing top 5 rows



We will predict the price per night of a rental property from the given features.

### CREATE TRAINING AND TEST DATASETS

The train/test split ratio depends on the size of the dataset, so there is no single ratio that suits every case.
- In this case we will split the dataset 80/20 (80% for training, 20% for testing).

In [3]:
seed = 42

In [4]:
train_df, test_df = airbnb_df.randomSplit([.8, .2], seed=seed)
print("train_df: {}".format(train_df.count()))
print("test_df: {}".format(test_df.count()))

train_df: 5780
test_df: 1366


In Spark, the **Catalyst Optimizer** determines the optimal way to partition the data as a function of the cluster resources and the size of the data set. Data in a Spark DataFrame is row-partitioned and each worker performs its split independently of the others, so if the data in the partitions changes (for example, because the cluster resources change), the random split will not give us the same result, even with the same seed.
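The dependence on row order can be illustrated with a small plain-Python sketch. It is not how Spark implements `randomSplit`; the function `random_split` and the integer "rows" are hypothetical stand-ins. Each row gets one seeded random draw, so the same seed reproduces the split only as long as the rows arrive in the same order.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test with one seeded random draw per row."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # The n-th draw decides the fate of the n-th row that arrives.
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(1000))  # hypothetical stand-in for DataFrame rows

train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)
print(train_a == train_b)  # same seed and same row order: identical split

# Same seed, but the rows arrive in a different order, as can happen when
# the partitioning changes: different rows end up in the training set.
train_c, test_c = random_split(list(reversed(rows)), 0.8, seed=42)
print(sorted(train_c) == train_a)
```

Because every row is assigned independently, the resulting sizes are only approximately 80/20, which matches the counts printed above.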

**BEST PRACTICE:** Since the training data is accessed again and again during model training, it is best to cache it for better performance.

### Preparing Features with Transformers

We will predict the price with linear regression. Linear regression in Spark requires all input features to be in a single vector column of the DataFrame, so we need to transform the data accordingly.<br>
The **VectorAssembler transformer** takes a list of input columns and creates a new DataFrame with one additional column (here called `features`) that combines those input columns into a single vector.

In [5]:
from pyspark.ml.feature import VectorAssembler
vec_assembler = VectorAssembler(inputCols=["bedrooms"], outputCol="features")
vec_train_df = vec_assembler.transform(train_df)
vec_train_df.select("bedrooms", "features", "price").show(10)

+--------+--------+-----+
|bedrooms|features|price|
+--------+--------+-----+
|     1.0|   [1.0]|200.0|
|     1.0|   [1.0]|130.0|
|     1.0|   [1.0]| 95.0|
|     1.0|   [1.0]|250.0|
|     3.0|   [3.0]|250.0|
|     1.0|   [1.0]|115.0|
|     1.0|   [1.0]|105.0|
|     1.0|   [1.0]| 86.0|
|     1.0|   [1.0]|100.0|
|     2.0|   [2.0]|220.0|
+--------+--------+-----+
only showing top 10 rows



There is also multiple linear regression, in which there are multiple independent variables (features) rather than just one.

## USING ESTIMATORS TO BUILD MODELS

In [6]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="features", labelCol="price")
lr_model = lr.fit(vec_train_df)

`lr_model` is a transformer: the output of an estimator's `fit()` method is a transformer. Once the estimator has learned its parameters, the resulting transformer can apply them to new data points to generate predictions.

In [7]:
m = round(lr_model.coefficients[0], 2)
b = round(lr_model.intercept, 2)
print("m,b {},{}".format(m,b))

m,b 123.68,47.51
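To make the meaning of these two numbers concrete, here is a minimal plain-Python sketch that applies the fitted line by hand. It uses the rounded values printed above, so results can differ slightly from Spark's predictions; `predict_price` is a hypothetical helper, not part of the Spark API.

```python
# Coefficient (slope) and intercept as printed above, rounded to two decimals.
m, b = 123.68, 47.51

def predict_price(bedrooms):
    """Apply the fitted line: price = m * bedrooms + b."""
    return m * bedrooms + b

for bedrooms in (1, 2, 3):
    print(bedrooms, round(predict_price(bedrooms), 2))
```

Each additional bedroom adds about 123.68 to the predicted nightly price, and a one-bedroom listing comes out at roughly 171.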


# Create a pipeline

**If we want to apply our model to our test set, then we need to prepare that data in the same way as the training set (i.e., pass it through the vector assembler).**
- Oftentimes data preparation pipelines will have multiple steps, and it becomes cumbersome to remember not only which steps to apply, but also the ordering of the steps.<br>
- This is the motivation for the Pipeline API: you simply specify the stages you want your data to pass through, in order, and Spark takes care of the processing for you.
- **They provide the user with better code reusability and organization. In Spark, Pipelines are estimators, whereas PipelineModels—fitted Pipelines—are transformers.**

**NOTE IN SIMPLE TERMS:** A Pipeline keeps the code organized. Without it, the steps for building and tuning a model tend to be run in an ad hoc order, especially in notebooks.
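The estimator/transformer relationship can be sketched in plain Python. This is a toy illustration of the pattern, not Spark's implementation; all class names below (`MeanCenterer`, `Doubler`, `ToyPipeline`, and so on) are made up for the example.

```python
class Doubler:
    """Toy transformer: nothing to learn, it only transforms."""
    def transform(self, values):
        return [2 * v for v in values]

class MeanCenterer:
    """Toy estimator: fit() learns the mean and returns a transformer."""
    def fit(self, values):
        return MeanCentererModel(sum(values) / len(values))

class MeanCentererModel:
    """Toy transformer produced by MeanCenterer.fit()."""
    def __init__(self, mean):
        self.mean = mean
    def transform(self, values):
        return [v - self.mean for v in values]

class ToyPipeline:
    """Toy pipeline: itself an estimator, fit() returns a fitted pipeline."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, values):
        fitted = []
        for stage in self.stages:
            # Estimators are fitted first; transformers are used as they are.
            model = stage.fit(values) if hasattr(stage, "fit") else stage
            values = model.transform(values)  # feed the next stage
            fitted.append(model)
        return ToyPipelineModel(fitted)

class ToyPipelineModel:
    """Fitted pipeline: a transformer that applies every stage in order."""
    def __init__(self, stages):
        self.stages = stages
    def transform(self, values):
        for stage in self.stages:
            values = stage.transform(values)
        return values

pipeline = ToyPipeline(stages=[Doubler(), MeanCenterer()])
pipeline_model = pipeline.fit([1.0, 2.0, 3.0])  # learns the mean of [2, 4, 6]
print(pipeline_model.transform([5.0]))          # [6.0]
```

The point of the design is that new data only ever goes through `transform()`, so it is prepared with exactly the same steps, in the same order, as the training data.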

In [8]:
from pyspark.ml import Pipeline
pipeline = Pipeline(stages=[vec_assembler, lr])
pipeline_model = pipeline.fit(train_df)

Since the pipeline model is a transformer, it is straightforward to apply it to the test data.

In [9]:
pred_df = pipeline_model.transform(test_df)

In [10]:
pred_df.select("bedrooms", "features", "price", "prediction").show(10)

+--------+--------+------+------------------+
|bedrooms|features| price|        prediction|
+--------+--------+------+------------------+
|     1.0|   [1.0]|  85.0|171.18598011578285|
|     1.0|   [1.0]|  45.0|171.18598011578285|
|     1.0|   [1.0]|  70.0|171.18598011578285|
|     1.0|   [1.0]| 128.0|171.18598011578285|
|     1.0|   [1.0]| 159.0|171.18598011578285|
|     2.0|   [2.0]| 250.0|294.86172649777757|
|     1.0|   [1.0]|  99.0|171.18598011578285|
|     1.0|   [1.0]|  95.0|171.18598011578285|
|     1.0|   [1.0]| 100.0|171.18598011578285|
|     1.0|   [1.0]|2010.0|171.18598011578285|
+--------+--------+------+------------------+
only showing top 10 rows



###  NOW BUILDING A MULTI FEATURE MODEL PIPELINE FOR THE SAME PROBLEM

Note: we will transform the categorical columns with one-hot encoding (OHE).

"Dog" = [ 1, 0, 0]<br>
"Cat" = [ 0, 1, 0]<br>
"Fish" = [0, 0, 1]<br><br>If we had a zoo of 300 animals, would OHE massively increase consumption of memory/compute resources? Not with Spark! Spark internally uses a SparseVector when the majority of the entries are 0, as is often the case after OHE, so it does not waste space storing 0 values.<br><br>DenseVector(0, 0, 0, 7, 0, 2, 0, 0, 0, 0)<br>
SparseVector(10, [3, 5], [7, 2])<br><br>The DenseVector in this example contains 10 values, all but 2 of which are 0. 
In a SparseVector we only track the non-zero values by their indices and values; the rest are considered 0.
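The two representations can be converted into each other, as the following plain-Python sketch shows. It only mirrors the idea; `to_sparse` and `to_dense` are hypothetical helpers, not Spark functions.

```python
def to_sparse(dense):
    """Return (size, indices, values), keeping only the non-zero entries."""
    indices = [i for i, v in enumerate(dense) if v != 0]
    values = [dense[i] for i in indices]
    return len(dense), indices, values

def to_dense(size, indices, values):
    """Rebuild the full list; every position not listed is 0."""
    dense = [0] * size
    for i, v in zip(indices, values):
        dense[i] = v
    return dense

dense = [0, 0, 0, 7, 0, 2, 0, 0, 0, 0]
sparse = to_sparse(dense)
print(sparse)                      # (10, [3, 5], [7, 2])
print(to_dense(*sparse) == dense)  # True
```

The sparse form stores the size plus two short lists instead of ten values, and the saving grows with the number of zeros.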

After we have created our category indices, we can pass them as input to OneHotEncoder (OneHotEncoderEstimator in Spark 2.3/2.4).

| Name | Spark 2.3/2.4 | Spark 3.0 |
|---|---|---|
| StringIndexer | Single column as input/output | Multiple columns as input/output |
| OneHotEncoder | Deprecated | Multiple columns as input/output |
| OneHotEncoderEstimator | Multiple columns as input/output | N/A |

In this problem we treat every string-typed column as a categorical feature, but sometimes there are numeric features that need to be treated as categorical, or vice versa. We must carefully determine which features are numeric and which are categorical.
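Conceptually, indexing and one-hot encoding a string column work as in the plain-Python sketch below. It is an illustration under assumptions: the sample `room_types` values are made up, the helpers `fit_string_index` and `one_hot` are hypothetical, and labels are ordered by descending frequency, which is StringIndexer's default ordering.

```python
from collections import Counter

def fit_string_index(values):
    """Map each label to an index; the most frequent label gets index 0."""
    counts = Counter(values)
    # Ties are broken alphabetically so the mapping is deterministic.
    ordered = sorted(counts, key=lambda label: (-counts[label], label))
    return {label: index for index, label in enumerate(ordered)}

def one_hot(index, num_categories):
    """Return a list with a single 1 at the given index."""
    vector = [0] * num_categories
    vector[index] = 1
    return vector

# Made-up sample values, for illustration only.
room_types = ["Entire home/apt", "Private room", "Entire home/apt",
              "Shared room", "Private room", "Entire home/apt"]

label_to_index = fit_string_index(room_types)
print(label_to_index)
# {'Entire home/apt': 0, 'Private room': 1, 'Shared room': 2}

for label in ["Private room", "Shared room"]:
    print(label, one_hot(label_to_index[label], len(label_to_index)))
```

Spark's OneHotEncoder additionally drops the last category by default (`dropLast=True`), so its vectors are one entry shorter than in this sketch.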

In [16]:
from pyspark.ml.feature import OneHotEncoder, StringIndexer

categorical_cols = [field for (field, dataType) in train_df.dtypes if dataType == "string"]
index_output_cols = [x + "Index" for x in categorical_cols]
ohe_output_cols = [x + "OHE" for x in categorical_cols]

string_indexer = StringIndexer(inputCols=categorical_cols,
                              outputCols=index_output_cols,
                              handleInvalid="skip")
ohe_encoder = OneHotEncoder(inputCols=index_output_cols,
                            outputCols=ohe_output_cols)

numeric_cols = [field for (field, dataType) in train_df.dtypes if dataType == "double" and field != "price"]
assemble_inputs = ohe_output_cols + numeric_cols

vec_assembler = VectorAssembler(inputCols=assemble_inputs,
                               outputCol="features")

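StringIndexer handles invalid data (labels it did not see during fitting, or null values) according to the `handleInvalid` parameter. The options are:
- skip: filter out rows with invalid data
- error: throw an error (the default)
- keep: put invalid data in a special additional bucket, at index numLabels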
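The three options can be mimicked in plain Python as follows. This is a sketch of the behaviour only, with a hypothetical helper `index_labels` and a made-up unseen label; it is not Spark code.

```python
def index_labels(values, label_to_index, handle_invalid="error"):
    """Index labels, treating labels unseen during fitting as invalid."""
    indexed = []
    for value in values:
        if value in label_to_index:
            indexed.append(label_to_index[value])
        elif handle_invalid == "skip":
            continue  # drop the row entirely
        elif handle_invalid == "keep":
            # One extra bucket after the known labels, at index numLabels.
            indexed.append(len(label_to_index))
        else:
            raise ValueError(f"Unseen label: {value!r}")
    return indexed

label_to_index = {"Entire home/apt": 0, "Private room": 1}
# "Hotel room" is a made-up label that was not present when the index was built.
test_values = ["Private room", "Hotel room", "Entire home/apt"]

print(index_labels(test_values, label_to_index, "skip"))  # [1, 0]
print(index_labels(test_values, label_to_index, "keep"))  # [1, 2, 0]
```

With `"skip"` the output has fewer rows than the input, which is worth remembering when comparing row counts before and after the pipeline.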

We need to explicitly tell StringIndexer which features to treat as categorical. Another way is to use VectorIndexer, which detects categorical features automatically but is computationally expensive, as it goes through every record to find the distinct values in each column. The user can also set `maxCategories`, but a sensible value is hard to determine.

<br>**A more concise approach** is **RFormula**<br>
In RFormula we provide label and which features we want to include. It supports a limited subset of the R operators, including ~, ., :, +, and -. For example, you might specify formula = "y ~ bedrooms + bathrooms", which means to predict y given just bedrooms and bathrooms, or formula = "y ~ .", which means to use all of the available features (and automatically excludes y from the features). 

RFormula will automatically StringIndex and OHE all of your string columns, convert your numeric columns to double type, and combine all of these into a single vector using VectorAssembler under the hood. Thus, we can replace all of the preceding code with a single line, and we will get the same result:

In [17]:
from pyspark.ml.feature import RFormula

r_formula = RFormula(formula="price ~ .",
                    featuresCol="features",
                    labelCol="price",
                    handleInvalid="skip")

#### RFormula downside 
The downside of RFormula automatically combining the StringIndexer and OneHotEncoder is that one-hot encoding is not required or recommended for all algorithms. For example, tree-based algorithms can handle categorical variables directly if you just use the StringIndexer for the categorical features. You do not need to one-hot encode categorical features for tree-based methods, and it will often make your tree-based models worse. Unfortunately, **there is no one-size-fits-all solution for feature engineering, and the ideal approach is closely related to the downstream algorithms** you plan to apply to your data set.

##### NOTE
If someone else performs the feature engineering for you, make sure they document how they generated those features.

Here we will put feature preparation and model building into the pipeline and apply it to our dataset

In [20]:
lr = LinearRegression(labelCol="price", featuresCol="features")
pipeline = Pipeline(stages=[string_indexer, ohe_encoder, vec_assembler, lr])

pipeline_model = pipeline.fit(train_df)
pred_df = pipeline_model.transform(test_df)
pred_df.select("features", "price", "prediction").show(5)

+--------------------+-----+------------------+
|            features|price|        prediction|
+--------------------+-----+------------------+
|(98,[0,3,6,22,43,...| 85.0| 55.24365707389188|
|(98,[0,3,6,22,43,...| 45.0|23.357685914717877|
|(98,[0,3,6,22,43,...| 70.0|28.474464479034395|
|(98,[0,3,6,12,42,...|128.0| -91.6079079594947|
|(98,[0,3,6,12,43,...|159.0| 95.05688229945372|
+--------------------+-----+------------------+
only showing top 5 rows



We can see that the features column is represented as a SparseVector: it shows the vector size (98 features after one-hot encoding), followed by the indices of the non-zero entries and then the values themselves.<br>
**Some of our predictions are negative. A rental price can never be negative, so the model needs to be improved.**

## EVALUATING MODELS 

In Spark we have classification, regression, clustering and ranking evaluators. Since this is a regression problem, we will use root-mean-square error (RMSE) and R squared to evaluate the model's performance.

**RMSE ranges from 0 to infinity; the closer to 0, the better the model.**
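For reference, RMSE is the square root of the mean squared difference between labels and predictions. A minimal plain-Python sketch (the helper `rmse` and the sample numbers are made up for illustration; Spark computes the metric with `RegressionEvaluator`):

```python
import math

def rmse(labels, predictions):
    """Root-mean-square error between true labels and predictions."""
    squared_errors = [(y - y_hat) ** 2 for y, y_hat in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Perfect predictions give an RMSE of 0.
print(rmse([100.0, 200.0], [100.0, 200.0]))  # 0.0

# Errors of 30 and 40 give sqrt((900 + 1600) / 2).
print(rmse([100.0, 200.0], [130.0, 160.0]))
```

Because the errors are squared before averaging, a few large misses raise the RMSE much more than many small ones.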

In [22]:
from pyspark.ml.evaluation import RegressionEvaluator
regression_evaluator = RegressionEvaluator(
    predictionCol="prediction",
    labelCol="price",
    metricName="rmse")
rmse = regression_evaluator.evaluate(pred_df)
print(f"RMSE is {rmse:.1f}")

RMSE is 220.6


### Interpreting the value of RMSE

So how do we know if 220.6 is a good value for the RMSE? **There are various ways to interpret this value, one of which is to build a simple baseline model and compute its RMSE to compare against. A common baseline model for regression tasks is to compute the average value of the label on the training set ȳ (pronounced y-bar), then predict ȳ for every record in the test data set and compute the resulting RMSE** (it is implemented elsewhere). If you try this, you will see that our baseline model has an RMSE of 240.7, so we beat our baseline. If you don’t beat the baseline, then something probably went wrong in your model building process.<br>
**ROLE OF UNIT**<br>
Keep in mind that the unit of your label directly impacts your RMSE. For example, if your label is height, then your RMSE will be higher if you use centimeters rather than meters as your unit of measurement. You could arbitrarily decrease the RMSE by using a different unit, which is why it is important to compare your RMSE against a baseline.
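Both points, the mean baseline and the role of the unit, can be sketched in a few lines of plain Python. The heights are made-up numbers and `rmse` and `baseline_rmse` are hypothetical helpers; this is not the baseline code referred to above.

```python
import math

def rmse(labels, predictions):
    """Root-mean-square error between true labels and predictions."""
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / len(labels))

def baseline_rmse(train_labels, test_labels):
    """Predict the training-set mean for every test record and score it."""
    mean_label = sum(train_labels) / len(train_labels)
    return rmse(test_labels, [mean_label] * len(test_labels))

# Made-up heights in meters, for illustration only.
train_heights_m = [1.60, 1.70, 1.80]
test_heights_m = [1.65, 1.75, 1.90]

rmse_m = baseline_rmse(train_heights_m, test_heights_m)
rmse_cm = baseline_rmse([h * 100 for h in train_heights_m],
                        [h * 100 for h in test_heights_m])

# The same data in centimeters gives an RMSE about 100 times larger.
print(rmse_m, rmse_cm)
```

Since the baseline is affected by the unit in exactly the same way as the model, comparing the two remains meaningful whatever unit is used.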

### R-squared, another evaluation metric