# Advanced Certification Program in Computational Data Science
## A Program by IISc and TalentSprint
### Additional Notebook (ungraded) on PySpark ML


## Learning Objectives

At the end of the experiment, you will be able to

* understand the concept of machine learning using PySpark
* explore and visualize the California housing dataset
* understand the code implementation for performing machine learning using PySpark

### Introduction

### Machine Learning using PySpark

PySpark MLlib is Spark's machine-learning library. It acts as a wrapper over PySpark Core for running machine-learning algorithms on data. It works on distributed systems and is scalable. It supports classification, clustering, linear regression, and other machine-learning algorithms.

* It has a number of advantages; for example, it is faster than earlier approaches such as MapReduce.
* It offers additional functionality, such as running distributed SQL queries.

To learn more about PySpark's ML pipeline, click [here](https://spark.apache.org/docs/2.3.1/api/python/pyspark.ml.html#module-pyspark.ml.classification).


**Problem Statement:** Predicting House Prices using California Housing Dataset

In this section, we'll make use of the California Housing data set. Note that this data set is actually small; the purpose of this notebook is to give you an idea of how we can use PySpark to build a machine learning model.

**Dataset Description**: The California Housing data set appeared in a 1997 paper titled Sparse Spatial Autoregressions, written by R. Kelley Pace and Ronald Barry and published in the Statistics and Probability Letters journal. The researchers built this data set using the 1990 California census data.

The data contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people). In this sample a block group on average includes 1425.5 individuals living in a geographically compact area.

These spatial data contain 20,640 observations on housing prices with 9 economic variables:

`Longitude`: refers to the angular distance of a geographic place east or west of the prime meridian for each block group

`Latitude`: refers to the angular distance of a geographic place north or south of the earth’s equator for each block group

`Housing Median Age`: is the median age of the houses within a block group. Note that the median is the value that lies at the midpoint of a frequency distribution of observed values

`Total Rooms`: is the total number of rooms in the houses per block group

`Total Bedrooms`: is the total number of bedrooms in the houses per block group

`Population`: is the number of inhabitants of a block group

`Households`: refers to units of houses and their occupants per block group

`Median Income`: is the median income of the people that belong to a block group

`Median House Value`: is the dependent variable and refers to the median house value per block group


The Median house value is the dependent variable and will be assigned the role of the target variable in our ML model.

In [None]:
#@title Run this cell to download the dataset
from IPython import get_ipython
ipython = get_ipython()
ipython.run_line_magic("sx", "wget https://cdn.iisc.talentsprint.com/DLFA/Experiment_related_data/cal_housing.data")

### Importing the required libraries and packages

In [None]:
!pip install pyspark

In [None]:
import pandas as pd
import numpy as np

from pyspark import SparkConf, SparkContext
from pyspark.sql import SparkSession, SQLContext

from pyspark.sql.types import StructType
from pyspark.sql.types import StructField
from pyspark.sql.types import FloatType

import pyspark.sql.functions as F
from pyspark.sql.functions import udf, col

from pyspark.ml.regression import LinearRegression
from pyspark.mllib.evaluation import RegressionMetrics

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator, CrossValidatorModel
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.evaluation import RegressionEvaluator
import seaborn as sns
import matplotlib.pyplot as plt

In [None]:
# Visualization
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"

pd.set_option('display.max_columns', 200)
pd.set_option('display.max_colwidth', 400)

from matplotlib import rcParams
sns.set(context='notebook', style='whitegrid', rc={'figure.figsize': (18,4)})
rcParams['figure.figsize'] = 18,4

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

In [None]:
# Set the random seed for reproducibility.
# np.random.seed is a function and must be called; assigning to it would overwrite it.
rnd_seed = 23
np.random.seed(rnd_seed)

#### Creating the Spark Session

In [None]:
spark = SparkSession.builder.master("local[2]").appName("Linear-Regression-California-Housing").getOrCreate()

In [None]:
spark

Creating Spark Context

In [None]:
sc = spark.sparkContext
sc

Creating SQL Context

In [None]:
sqlContext = SQLContext(spark.sparkContext)
sqlContext

#### Load The Data From the File

In [None]:
HOUSING_DATA = '/content/cal_housing.data'

Specifying the schema when loading data into a DataFrame will give better performance than schema inference.

In [None]:
# Define the schema, corresponding to a line in the csv data file.
schema = StructType([
    StructField("long", FloatType(), nullable=True),
    StructField("lat", FloatType(), nullable=True),
    StructField("medage", FloatType(), nullable=True),
    StructField("totrooms", FloatType(), nullable=True),
    StructField("totbdrms", FloatType(), nullable=True),
    StructField("pop", FloatType(), nullable=True),
    StructField("houshlds", FloatType(), nullable=True),
    StructField("medinc", FloatType(), nullable=True),
    StructField("medhv", FloatType(), nullable=True)]
)

In [None]:
# Load housing data
housing_df = spark.read.csv(path=HOUSING_DATA, schema=schema).cache()

In [None]:
# Inspect first five rows
housing_df.take(5)

In [None]:
# Display first five rows
housing_df.show(5)

In [None]:
# Show the dataframe columns
housing_df.columns

In [None]:
# Show the schema of the dataframe
housing_df.printSchema()

### Data Exploration

In [None]:
# Run a sample selection
housing_df.select('pop','totbdrms').show(10)

### Distribution of the housing median age

In [None]:
# Group by housing median age and see the distribution
result_df = housing_df.groupBy("medage").count().sort("medage", ascending=False)

In [None]:
result_df.show(10)

In [None]:
result_df.toPandas().plot.bar(x='medage',figsize=(14, 6))

Note that `medage` is the median age of the houses in a block group, not of its residents, so this plot shows how old the housing stock is across block groups.
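To make the grouping step concrete, here is a minimal plain-Python sketch of what `groupBy("medage").count().sort("medage", ascending=False)` computes. The values in `medage_values` are hypothetical and are not taken from the data set; Spark performs the same counting in a distributed fashion.

```python
from collections import Counter

# Hypothetical housing-median-age values, one per block group (not real data).
medage_values = [52.0, 41.0, 52.0, 21.0, 41.0, 52.0]

# Equivalent of groupBy("medage").count(): number of rows per distinct value.
counts = Counter(medage_values)

# Equivalent of sort("medage", ascending=False): order by the key, largest first.
result = sorted(counts.items(), key=lambda kv: kv[0], reverse=True)

for age, n in result:
    print(age, n)
```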

#### Summary Statistics
Spark DataFrames include some built-in functions for statistical processing. The describe() function performs summary statistics calculations on all numeric columns and returns them as a DataFrame.

In [None]:
(housing_df.describe().select(
                    "summary",
                    F.round("medage", 4).alias("medage"),
                    F.round("totrooms", 4).alias("totrooms"),
                    F.round("totbdrms", 4).alias("totbdrms"),
                    F.round("pop", 4).alias("pop"),
                    F.round("houshlds", 4).alias("houshlds"),
                    F.round("medinc", 4).alias("medinc"),
                    F.round("medhv", 4).alias("medhv"))
                    .show())

Look at the minimum and maximum values of all the (numerical) attributes. We see that multiple attributes have a wide range of values, so we will need to standardize the dataset.
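As a rough illustration of what describe() reports for a single numeric column, the sketch below computes the same five statistics in plain Python. The list `values` is hypothetical. The standard deviation shown is the sample standard deviation (dividing by n - 1), which is what Spark's `stddev` corresponds to.

```python
import statistics

# Hypothetical values for one numeric column (not taken from the dataset).
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

summary = {
    "count": len(values),
    "mean": statistics.mean(values),
    # Sample standard deviation: divides by n - 1 rather than n.
    "stddev": statistics.stdev(values),
    "min": min(values),
    "max": max(values),
}

for name, value in summary.items():
    print(name, value)
```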

### Data Preprocessing

* Standardize the data, as we have seen that the range of minimum and maximum values is quite big.

* There are possibly some additional attributes that we could add, such as a feature that registers the number of bedrooms per room or the rooms per household.

* The dependent variable is large in value. To make it easier to work with, we will rescale its values.

#### Preprocessing The Target Values

First, let's start with medianHouseValue, the dependent variable, which is stored in the `medhv` column. To make the target values easier to work with, we will express the house values in units of 100,000. That means that a target such as 452600.000000 should become 4.526.

In [None]:
# Adjust the values of `medhv` (medianHouseValue)
housing_df = housing_df.withColumn("medhv", col("medhv")/100000)

In [None]:
# Show the first 2 rows of `housing_df`
housing_df.show(2)

We can clearly see that the values have been adjusted correctly when we look at the result of the show() method.

### Feature Engineering

Now that we have adjusted the values in medianHouseValue, we will add the following columns to the data set:

*   Rooms per household, which refers to the number of rooms in households per block group;
*   Population per household, which gives us an indication of how many people live in households per block group;
*   Bedrooms per room, which gives us an idea of how many rooms are bedrooms per block group.

As we are working with DataFrames, we use the withColumn() method to add each new column, computed from the `totrooms`, `totbdrms`, `pop`, and `houshlds` columns. We also need to indicate that we are working with columns by wrapping each column name in the col() function. Otherwise, we won't be able to do element-wise operations like the division steps ahead.




In [None]:
housing_df.columns

In [None]:
# Add the new columns to `housing_df`
housing_df = (housing_df.withColumn("rms_per_hh", F.round(col("totrooms")/col("houshlds"), 2))
                       .withColumn("pop_per_hh", F.round(col("pop")/col("houshlds"), 2))
                       .withColumn("bdrms_per_rm", F.round(col("totbdrms")/col("totrooms"), 2)))

In [None]:
# Inspect the result
housing_df.show(5)

We can see that, for the first row, there are about 6.98 rooms per household, the households in the block group consist of about 2.5 people, and the ratio of bedrooms to rooms is quite low at 0.14.
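The three ratios are simple per-row divisions. The sketch below reproduces them in plain Python for one hypothetical block group; the helper `add_ratio_features` and the input values are illustrative only and are not part of the notebook or the data set. Note that Python's round() and Spark's F.round() can differ on exact ties, so treat this as an approximation of the Spark expression.

```python
def add_ratio_features(totrooms, totbdrms, pop, houshlds):
    """Return the three ratio features for one block group, rounded to 2 decimals."""
    return {
        "rms_per_hh": round(totrooms / houshlds, 2),    # rooms per household
        "pop_per_hh": round(pop / houshlds, 2),         # people per household
        "bdrms_per_rm": round(totbdrms / totrooms, 2),  # share of rooms that are bedrooms
    }

# Hypothetical block group (values chosen for easy arithmetic, not from the dataset).
features = add_ratio_features(totrooms=1000.0, totbdrms=200.0, pop=600.0, houshlds=250.0)
print(features)
```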

We do not want to standardize our target values, so we should make sure to isolate them in our data set. We will also leave out variables that we do not want to consider in our analysis, such as longitude, latitude, housingMedianAge and totalRooms (the `long`, `lat`, `medage` and `totrooms` columns).

To do this, we will use the select() method and pass the column names in the order that is most convenient. The target variable medianHouseValue (`medhv`) is put first and is kept out of the feature list, so it will not be affected by the standardization.

In [None]:
# Re-order and select columns
housing_df = housing_df.select("medhv",
                              "totbdrms",
                              "pop",
                              "houshlds",
                              "medinc",
                              "rms_per_hh",
                              "pop_per_hh",
                              "bdrms_per_rm")

#### Feature Extraction

Now that the data is re-ordered, we are ready to standardize it. First, we choose the features to be standardized.

In [None]:
featureCols = ["totbdrms", "pop", "houshlds", "medinc", "rms_per_hh", "pop_per_hh", "bdrms_per_rm"]

**Use a VectorAssembler to put features into a feature vector column**

In [None]:
# Put features into a feature vector column
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")

In [None]:
assembled_df = assembler.transform(housing_df)

In [None]:
assembled_df.show(10, truncate=False)

All the features have been combined into a single dense vector column.



#### Standardization

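Next, we can finally scale the data using StandardScaler. The input column is the "features" vector, and the output column with the rescaled values, which will be included in scaled_df, will be named "features_scaled". With its default settings (withStd=True, withMean=False), StandardScaler scales each feature to unit standard deviation without centering it.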

In [None]:
# Initialize the `standardScaler`
standardScaler = StandardScaler(inputCol="features", outputCol="features_scaled")

In [None]:
# Fit the scaler to the DataFrame, then transform it
scaled_df = standardScaler.fit(assembled_df).transform(assembled_df)

In [None]:
# Inspect the result
scaled_df.select("features", "features_scaled").show(10, truncate=False)
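To see what the scaling step does to each feature, here is a minimal plain-Python sketch, assuming StandardScaler's default settings (withStd=True, withMean=False): every feature is divided by its sample standard deviation and is not centered. The rows below are hypothetical.

```python
import statistics

# Hypothetical feature rows: each inner list is one row's feature vector.
rows = [
    [1.0, 100.0],
    [2.0, 300.0],
    [3.0, 500.0],
]

# Sample standard deviation of each feature (one value per vector position).
feature_columns = list(zip(*rows))
stds = [statistics.stdev(column) for column in feature_columns]

# Scale to unit standard deviation without subtracting the mean.
scaled_rows = [[x / s for x, s in zip(row, stds)] for row in rows]

for original, scaled in zip(rows, scaled_rows):
    print(original, scaled)
```

After scaling, both features have a standard deviation of 1 even though their original ranges differed by two orders of magnitude.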

#### Building A Machine Learning Model With Spark ML

With all the preprocessing done, it's finally time to start building our Linear Regression model! First, split the data into training and test sets using the randomSplit() method:

In [None]:
# Split the data into train and test sets
train_data, test_data = scaled_df.randomSplit([.8,.2], seed=rnd_seed)

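We pass in a list of two weights that give the proportions we want the training and test sets to have, along with a seed for reproducibility. Because rows are assigned at random, the resulting sizes are only approximately 80% and 20% of the data.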
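The idea behind a weighted random split can be sketched in plain Python as follows. This is a conceptual illustration only, not Spark's actual implementation: the helper `random_split` is hypothetical, and it simply assigns each row independently to a split with probability proportional to its weight.

```python
import random

def random_split(rows, weights, seed):
    """Assign each row independently to a split, with probability proportional to its weight."""
    total = sum(weights)
    # Normalize the weights so they sum to 1, then build cumulative boundaries.
    bounds = []
    running = 0.0
    for w in weights:
        running += w / total
        bounds.append(running)

    rng = random.Random(seed)
    splits = [[] for _ in weights]
    for row in rows:
        u = rng.random()
        for i, bound in enumerate(bounds):
            # The last split catches any leftover from floating-point rounding.
            if u < bound or i == len(bounds) - 1:
                splits[i].append(row)
                break
    return splits

train, test = random_split(list(range(1000)), [0.8, 0.2], seed=23)
print(len(train), len(test))
```

The two parts always cover every row exactly once, but their sizes vary around the requested proportions from seed to seed.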

Note that the argument elasticNetParam corresponds to α, the elastic-net mixing parameter (α = 0 gives a pure L2 or ridge penalty, α = 1 gives a pure L1 or lasso penalty), and that regParam, the regularization parameter, corresponds to λ.

In [None]:
train_data.columns

**Create an ElasticNet model**

ElasticNet is a linear regression model trained with both L1 and L2 penalties as regularizers. Elastic-net is useful when there are multiple features that are correlated with one another: Lasso is likely to pick one of these at random, while elastic-net is likely to pick both.
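For reference, the objective that elastic-net linear regression minimizes can be written as below. This is a sketch of the form described in Spark's documentation for linear methods; the symbols n (number of training rows), x_i (feature vector), y_i (label), w (coefficients) and b (intercept) are notation introduced here, λ stands for regParam and α for elasticNetParam.

```latex
\min_{w,\,b}\;
\frac{1}{2n}\sum_{i=1}^{n}\left(w^{\top}x_i + b - y_i\right)^2
+ \lambda\left(\alpha\,\lVert w\rVert_1 + \frac{1-\alpha}{2}\,\lVert w\rVert_2^2\right),
\qquad \alpha\in[0,1],\;\lambda\ge 0
```

With regParam=0.3 and elasticNetParam=0.8, the L1 term is weighted by λα = 0.24 and the squared L2 term by λ(1 − α)/2 = 0.03, so the penalty is dominated by the L1 (lasso) part.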

In [None]:
# Initialize `lr`
lr = (LinearRegression(featuresCol='features_scaled', labelCol="medhv", predictionCol='predmedhv',
                               maxIter=10, regParam=0.3, elasticNetParam=0.8, standardization=False))

In [None]:
# Fit the model to the training data
linearModel = lr.fit(train_data)

### Evaluating the Model

We can now generate predictions for our test data by using the transform() method to predict the labels for our test_data. Then, we can use select() to extract the predictions as well as the true labels from the resulting DataFrame.

#### Inspect the Model Co-efficients

In [None]:
# Coefficients for the model
linearModel.coefficients

In [None]:
featureCols

In [None]:
# Intercept for the model
linearModel.intercept

In [None]:
coeff_df = pd.DataFrame({"Feature": ["Intercept"] + featureCols, "Co-efficients": np.insert(linearModel.coefficients.toArray(), 0, linearModel.intercept)})
coeff_df = coeff_df[["Feature", "Co-efficients"]]

In [None]:
coeff_df

#### Generating Predictions

In [None]:
# Generate predictions
predictions = linearModel.transform(test_data)

In [None]:
# Extract the predictions and the "known" correct labels
predandlabels = predictions.select("predmedhv", "medhv")

In [None]:
predandlabels.show()

#### Inspect the Metrics

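We will now inspect the metrics using the LinearRegressionModel.summary attribute to pull up the rootMeanSquaredError, the meanAbsoluteError, and the r2 score. Note that summary describes the fit on the training data, not on the test set.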

In [None]:
# Get the RMSE
print("RMSE: {0}".format(linearModel.summary.rootMeanSquaredError))

In [None]:
print("MAE: {0}".format(linearModel.summary.meanAbsoluteError))

In [None]:
# Get the R2
print("R2: {0}".format(linearModel.summary.r2))
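The three metrics printed above follow standard definitions, which the plain-Python sketch below computes by hand. The helper `regression_metrics` and the label and prediction lists are hypothetical and are not model output.

```python
import math

def regression_metrics(labels, predictions):
    """Compute RMSE, MAE and R2 from paired true labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    ss_res = sum(e * e for e in errors)                 # residual sum of squares
    mean_label = sum(labels) / n
    ss_tot = sum((y - mean_label) ** 2 for y in labels)  # total sum of squares
    return {
        "rmse": math.sqrt(ss_res / n),
        "mae": sum(abs(e) for e in errors) / n,
        "r2": 1.0 - ss_res / ss_tot,
    }

# Hypothetical house values in units of 100,000 (not taken from the model).
labels = [1.0, 2.0, 3.0, 4.0]
predictions = [1.5, 2.0, 2.5, 4.0]
metrics_by_hand = regression_metrics(labels, predictions)
print(metrics_by_hand)
```

Because the target is expressed in units of 100,000, an RMSE of 0.35 in this toy example would correspond to a typical error of about 35,000 in the original units.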

**Using the RegressionEvaluator from pyspark.ml package**

In [None]:
evaluator = RegressionEvaluator(predictionCol="predmedhv", labelCol='medhv', metricName='rmse')
print("RMSE: {0}".format(evaluator.evaluate(predandlabels)))

In [None]:
evaluator = RegressionEvaluator(predictionCol="predmedhv", labelCol='medhv', metricName='mae')
print("MAE: {0}".format(evaluator.evaluate(predandlabels)))

In [None]:
evaluator = RegressionEvaluator(predictionCol="predmedhv", labelCol='medhv', metricName='r2')
print("R2: {0}".format(evaluator.evaluate(predandlabels)))

**Using the RegressionMetrics from pyspark.mllib package**



In [None]:
# pyspark.mllib is the older RDD-based API, so RegressionMetrics takes an RDD of (prediction, label) pairs
metrics = RegressionMetrics(predandlabels.rdd)

In [None]:
print("RMSE: {0}".format(metrics.rootMeanSquaredError))

In [None]:
print("MAE: {0}".format(metrics.meanAbsoluteError))

In [None]:
print("R2: {0}".format(metrics.r2))

The model still needs improvement. Try experimenting with the parameters that we passed to the model and with the variables that we included in the original DataFrame.

In [None]:
spark.stop()