<font size=5>

Regression with boston_data.csv, a dataset downloaded from Kaggle, to predict Boston housing prices. The columns are described in the table below.



</font>

| Code   | Description   |
|:---|:---|
|**CRIM** | per capita crime rate by town |
|**ZN**  | proportion of residential land zoned for lots over 25,000 sq.ft. | 
|**INDUS**  | proportion of non-retail business acres per town | 
|**CHAS**  | Charles River dummy variable (= 1 if tract bounds river; 0 otherwise) | 
|**NOX**  | nitric oxides concentration (parts per 10 million) | 
|**RM**  | average number of rooms per dwelling | 
|**AGE**  | proportion of owner-occupied units built prior to 1940 | 
|**DIS**  | weighted distances to five Boston employment centres | 
|**RAD**  | index of accessibility to radial highways | 
|**TAX**  | full-value property-tax rate per $10,000 | 
|**PTRATIO**  | pupil-teacher ratio by town | 
|**B**  | 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town | 
|**LSTAT**  | % lower status of the population | 
|**MEDV**  | Median value of owner-occupied homes in \$1000's | 



<font size=5>medv is the label; all other columns are features. </font>

<font size=5> Import PySpark libraries, create SparkContext and SQL context, then load the csv data file. </font>

In [1]:
import findspark
findspark.init()  # locate the local Spark installation and add it to sys.path
from pyspark import SparkConf, SparkContext
from pyspark.sql import SQLContext

sc = SparkContext()
sqlContext = SQLContext(sc)
# Read the CSV with a header row and let Spark infer the column types
boston_house_df = sqlContext.read.format('csv').options(header='true', inferschema='true')\
.load('BostonHousing.csv')

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/02/20 23:15:28 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
                                                                                

<font size=5> Show statistics of each column, including feature columns and label column (medv)  </font>

In [2]:
boston_house_df.describe().toPandas().transpose()

24/02/20 23:15:40 WARN SparkStringUtils: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.
                                                                                

| summary | count | mean | stddev | min | max |
|:---|:---|:---|:---|:---|:---|
| crim | 506 | 3.6135235573122535 | 8.601545105332491 | 0.00632 | 88.9762 |
| zn | 506 | 11.363636363636363 | 23.32245299451514 | 0.0 | 100.0 |
| indus | 506 | 11.136778656126504 | 6.860352940897589 | 0.46 | 27.74 |
| chas | 506 | 0.0691699604743083 | 0.2539940413404101 | 0 | 1 |
| nox | 506 | 0.5546950592885372 | 0.11587767566755584 | 0.385 | 0.871 |
| rm | 506 | 6.284634387351787 | 0.7026171434153232 | 3.561 | 8.78 |
| age | 506 | 68.57490118577078 | 28.148861406903595 | 2.9 | 100.0 |
| dis | 506 | 3.795042687747034 | 2.10571012662761 | 1.1296 | 12.1265 |
| rad | 506 | 9.549407114624506 | 8.707259384239366 | 1 | 24 |


<font size=5>

We need to find the correlation between each feature column and the label medv. The correlation coefficient lies between -1 and 1. The closer it is to -1 or 1, the more strongly the feature column is negatively or positively correlated with medv. The closer it is to 0, the weaker the linear relationship between the feature column and medv.

   
    
</font>
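To make the -1 to 1 range concrete, here is a minimal sketch of the Pearson correlation coefficient in plain Python. `stat.corr` computes the Pearson coefficient; the helper name `pearson_corr` and the sample values are made up for illustration and are not rows from the dataset.

```python
import math

def pearson_corr(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations (numerator) and sums of squared deviations
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical values: price rises exactly linearly with the number of rooms
rooms = [5.0, 6.0, 7.0, 8.0]
price = [15.0, 20.0, 25.0, 30.0]

print(pearson_corr(rooms, price))                # 1.0  (perfect positive)
print(pearson_corr(rooms, [-p for p in price]))  # -1.0 (perfect negative)
```

Real columns fall somewhere in between, and a value near 0 only says there is little linear relationship.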

In [4]:
!pip install matplotlib

import pandas as pd
import matplotlib.pyplot as plt 
%matplotlib inline


In [5]:
import six
for i in boston_house_df.columns:
    if not( isinstance(boston_house_df.select(i).take(1)[0][0], six.string_types)):
        print( "Correlation to medv for ", i, boston_house_df.stat.corr('medv',i))

Correlation to medv for  crim -0.38830460858681154
Correlation to medv for  zn 0.3604453424505433
Correlation to medv for  indus -0.4837251600283728
Correlation to medv for  chas 0.1752601771902987
Correlation to medv for  nox -0.4273207723732821
Correlation to medv for  rm 0.6953599470715401
Correlation to medv for  age -0.3769545650045961
Correlation to medv for  dis 0.249928734085904
Correlation to medv for  rad -0.38162623063977735
Correlation to medv for  tax -0.46853593356776674
Correlation to medv for  ptratio -0.5077866855375622
Correlation to medv for  b 0.3334608196570661
Correlation to medv for  lstat -0.7376627261740145
Correlation to medv for  medv 1.0


<font size=5>

Spark ML requires the features of a dataset to be assembled into a single vector column before the dataset can be fit to an ML model.
VectorAssembler combines the listed input columns of a Spark DataFrame into one vector column (named features here).

</font>
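As a rough picture of what the assembling step does to one row, the sketch below packs the chosen columns into a single list and keeps the label beside it. This is plain Python, not the Spark implementation; the `assemble` helper is hypothetical, and only two of the thirteen feature columns are used to keep it short.

```python
def assemble(row, input_cols, label_col):
    """Return (features, label): the chosen columns packed into one list, plus the label."""
    features = [float(row[c]) for c in input_cols]
    return features, float(row[label_col])

# crim, zn and medv values of the first row of the dataset
row = {"crim": 0.00632, "zn": 18.0, "medv": 24.0}

features, label = assemble(row, ["crim", "zn"], "medv")
print(features, label)  # [0.00632, 18.0] 24.0
```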

In [6]:
from pyspark.ml.feature import VectorAssembler
vectorAssembler = VectorAssembler(inputCols = ['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age', 'dis', 'rad', 'tax', 'ptratio', 'b', 'lstat'], outputCol = 'features')
#vectorAssembler = VectorAssembler(inputCols = ['rm'], outputCol = 'features')
vector_house_df = vectorAssembler.transform(boston_house_df)
vector_house_df = vector_house_df.select(['features', 'medv'])
vector_house_df.show(2)

+--------------------+----+
|            features|medv|
+--------------------+----+
|[0.00632,18.0,2.3...|24.0|
|[0.02731,0.0,7.07...|21.6|
+--------------------+----+
only showing top 2 rows



<font size=5>  

Now randomly split the vectorized DataFrame (dataset) into training data (about 70%) and testing data (about 30%).
    
    
</font>
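A small sketch of why the split sizes are only approximately 70/30. It assumes that each row is assigned independently at random, which is how I understand `randomSplit` to behave. The `random_split` helper and the seed value are made up for illustration. `randomSplit` also accepts a `seed` argument, which makes the split repeatable between runs.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train with probability train_fraction, otherwise to test."""
    rng = random.Random(seed)  # fixed seed: the same split on every run
    train, test = [], []
    for row in rows:
        (train if rng.random() < train_fraction else test).append(row)
    return train, test

rows = list(range(506))  # stand-ins for the 506 rows of the dataset
train, test = random_split(rows, 0.7, seed=42)

# The sizes add up to 506, but the ratio is only close to 70/30
print(len(train), len(test))
```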

In [7]:
splits = vector_house_df.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]
print(test_df.count())

165


In [8]:
train_df.show(2)

+--------------------+----+
|            features|medv|
+--------------------+----+
|[0.00906,90.0,2.9...|32.2|
|[0.01096,55.0,2.2...|22.0|
+--------------------+----+
only showing top 2 rows



<font size=5>

Let's start with Linear Regression: fit the Linear Regression model on train_df.
    
</font>

In [9]:
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol = 'features', labelCol='medv', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))

24/02/20 23:17:55 WARN InstanceBuilder: Failed to load implementation from:dev.ludovic.netlib.blas.JNIBLAS


Coefficients: [0.0,0.009427524950167965,-0.0022258321413950273,1.9659202130436537,-8.894033462775255,4.466720468019413,0.0,-0.7418278967344886,0.0,-0.0011827394930907188,-0.777257990411827,0.00894098372282634,-0.5326790627828858]
Intercept: 20.27208952119553


<font size=5>
Linear Regression produces slope coefficients and an intercept:

$y = a_1 x_1 + a_2 x_2 + \dots + a_n x_n + b$

$a_1, a_2, \dots, a_n$ are the coefficients of the corresponding features $x_1, x_2, \dots, x_n$

$b$ is the intercept

$x_1, x_2, \dots, x_n$ are the independent variables (the feature columns)

</font>
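The formula can be checked by hand with a few lines of Python. The coefficients, intercept and feature values below are hypothetical round numbers chosen so the arithmetic is easy to follow; they are not the fitted values printed above.

```python
def predict(coefficients, intercept, features):
    """y = a1*x1 + a2*x2 + ... + an*xn + b"""
    return sum(a * x for a, x in zip(coefficients, features)) + intercept

coefficients = [2.0, -0.5]  # hypothetical a1, a2
intercept = 10.0            # hypothetical b
features = [6.0, 4.0]       # hypothetical x1, x2

print(predict(coefficients, intercept, features))  # 2*6 - 0.5*4 + 10 = 20.0
```

A coefficient of 0.0 simply removes that feature from the sum.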

In [10]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

RMSE: 4.774136
r2: 0.727668


In [11]:
train_df.describe().show()

+-------+------------------+
|summary|              medv|
+-------+------------------+
|  count|               341|
|   mean|22.625513196480945|
| stddev|  9.16185635492107|
|    min|               5.6|
|    max|              50.0|
+-------+------------------+



<font size=5>

Test the model with test_df. Testing produces metrics, RMSE and the R2 score, that evaluate the performance of the regressor.

  
    
</font>
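For reference, here is a minimal sketch of the two metrics, assuming the usual definitions: RMSE is the square root of the mean squared error, and R2 is one minus the ratio of the residual sum of squares to the total sum of squares. The labels and predictions are made-up numbers, not model output.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error: typical size of a prediction error, in label units."""
    n = len(labels)
    return math.sqrt(sum((y - p) ** 2 for y, p in zip(labels, predictions)) / n)

def r2(labels, predictions):
    """Share of the label variance explained by the predictions."""
    mean_y = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_y) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

labels = [20.0, 22.0, 24.0]       # hypothetical true values
predictions = [21.0, 22.0, 23.0]  # hypothetical model output

print("RMSE: %g" % rmse(labels, predictions))
print("R2: %g" % r2(labels, predictions))  # R2: 0.75
```

Lower RMSE is better and is expressed in the units of medv (\$1000's). Higher R2 is better, with 1.0 meaning a perfect fit.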

In [12]:
lr_predictions = lr_model.transform(test_df)
lr_predictions.select("prediction","medv","features").show(5)
from pyspark.ml.evaluation import RegressionEvaluator
lr_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="r2")
print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
|30.640060838337625|24.0|[0.00632,18.0,2.3...|
|30.745344753034924|29.1|[0.01439,60.0,2.9...|
| 27.35987906871954|24.5|[0.01501,80.0,2.0...|
| 26.13167288247681|30.1|[0.01709,90.0,2.0...|
|31.149432949240776|32.9|[0.01778,95.0,1.4...|
+------------------+----+--------------------+
only showing top 5 rows

R Squared (R2) on test data = 0.698078


In [13]:
test_result = lr_model.evaluate(test_df)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_result.rootMeanSquaredError)

Root Mean Squared Error (RMSE) on test data = 5.09163


In [14]:
print("numIterations: %d" % trainingSummary.totalIterations)
print("objectiveHistory: %s" % str(trainingSummary.objectiveHistory))
trainingSummary.residuals.show(2)

numIterations: 10
objectiveHistory: [0.49999999999999956, 0.4314949120493033, 0.23473980547247938, 0.21191207840803963, 0.18162256736074042, 0.17839841909088772, 0.17740732906544376, 0.17641092742974027, 0.17475058692648102, 0.17416984301717134, 0.17405392110913726]
+------------------+
|         residuals|
+------------------+
|1.2857752089491719|
|-5.627171751581095|
+------------------+
only showing top 2 rows



In [15]:
predictions = lr_model.transform(test_df)
predictions.select("prediction","medv","features").show(5)

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
|30.640060838337625|24.0|[0.00632,18.0,2.3...|
|30.745344753034924|29.1|[0.01439,60.0,2.9...|
| 27.35987906871954|24.5|[0.01501,80.0,2.0...|
| 26.13167288247681|30.1|[0.01709,90.0,2.0...|
|31.149432949240776|32.9|[0.01778,95.0,1.4...|
+------------------+----+--------------------+
only showing top 5 rows



<font size=5>
    
Now try the Gradient-Boosted Tree Regressor with the same train_df and test_df.
    
    
</font>

In [16]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import GBTRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator


In [17]:
gbt = GBTRegressor(featuresCol="features",labelCol='medv', maxIter=10)
gbt_model = gbt.fit(train_df)


In [18]:
gbt_predictions = gbt_model.transform(test_df)
gbt_predictions.select("prediction","medv","features").show(5)

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
| 19.90443776764149|24.0|[0.00632,18.0,2.3...|
|29.744096476599704|29.1|[0.01439,60.0,2.9...|
| 28.34085483804861|24.5|[0.01501,80.0,2.0...|
| 30.84232164628737|30.1|[0.01709,90.0,2.0...|
| 33.50761840332213|32.9|[0.01778,95.0,1.4...|
+------------------+----+--------------------+
only showing top 5 rows



<font size=5>

Test the model with test_df. Testing produces metrics, RMSE and the R2 score, that evaluate the performance of the regressor.

On this split, the metrics of the Gradient-Boosted Tree Regressor (R2 0.843, RMSE 3.67) are better than those of Linear Regression (R2 0.698, RMSE 5.09).
    
    
</font>

In [19]:
from pyspark.ml.evaluation import RegressionEvaluator
gbt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="r2")
print("R Squared (R2) on test data = %g" % gbt_evaluator.evaluate(gbt_predictions))

R Squared (R2) on test data = 0.84279


In [20]:
gbt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="rmse")

In [21]:
print("RMSE on test data = %g" % gbt_evaluator.evaluate(gbt_predictions))

RMSE on test data = 3.67409


<font size=5>

Now try Random Forest Regressor with the same train_df and test_df
    
</font>

In [22]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import RandomForestRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator


In [23]:

rf = RandomForestRegressor(featuresCol="features",labelCol='medv', maxDepth=3)
rf_model = rf.fit(train_df)

In [24]:
rf_predictions = rf_model.transform(test_df)
rf_predictions.select("prediction","medv","features").show(5)

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
| 25.67269168097973|24.0|[0.00632,18.0,2.3...|
| 27.13433297860046|29.1|[0.01439,60.0,2.9...|
|25.535747261295093|24.5|[0.01501,80.0,2.0...|
|30.456809311689433|30.1|[0.01709,90.0,2.0...|
|33.463759352761414|32.9|[0.01778,95.0,1.4...|
+------------------+----+--------------------+
only showing top 5 rows



<font size=5>
    
Test the model with test_df. Testing produces metrics, RMSE and the R2 score, that evaluate the performance of the regressor.

On this split, the R2 of Random Forest (0.787) is better than that of Linear Regression (0.698), but lower than that of the Gradient-Boosted Tree Regressor (0.843).
    
</font>

In [25]:
from pyspark.ml.evaluation import RegressionEvaluator
rf_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="r2")
print("R Squared (R2) on test data = %g" % rf_evaluator.evaluate(rf_predictions))

R Squared (R2) on test data = 0.786508


In [26]:
rf_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="rmse")

In [27]:
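# Evaluate the Random Forest predictions with rf_evaluator. The output recorded below was
# produced with gbt_evaluator and gbt_predictions, so it repeats the GBT RMSE.
print("RMSE on test data = %g" % rf_evaluator.evaluate(rf_predictions))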

RMSE on test data = 3.67409


<font size=5>

Finally, try Decision Tree regressor with the same train_df and test_df
    
    
</font>

In [28]:
from pyspark.ml import Pipeline
from pyspark.ml.regression import DecisionTreeRegressor
from pyspark.ml.feature import VectorIndexer
from pyspark.ml.evaluation import RegressionEvaluator


In [29]:
dt = DecisionTreeRegressor(featuresCol="features",labelCol='medv', maxDepth=3)
dt_model = dt.fit(train_df)

In [30]:
dt_predictions = dt_model.transform(test_df)
dt_predictions.select("prediction","medv","features").show(5)

+------------------+----+--------------------+
|        prediction|medv|            features|
+------------------+----+--------------------+
|24.857142857142858|24.0|[0.00632,18.0,2.3...|
|24.857142857142858|29.1|[0.01439,60.0,2.9...|
|24.857142857142858|24.5|[0.01501,80.0,2.0...|
|24.857142857142858|30.1|[0.01709,90.0,2.0...|
|33.781481481481485|32.9|[0.01778,95.0,1.4...|
+------------------+----+--------------------+
only showing top 5 rows



<font size=5>
    
Test the model with test_df. Testing produces metrics, RMSE and the R2 score, that evaluate the performance of the regressor.

On this split, the metrics of the Decision Tree Regressor (R2 0.799, RMSE 4.16) are better than those of Linear Regression (R2 0.698, RMSE 5.09), but not as good as those of the Gradient-Boosted Tree Regressor (R2 0.843, RMSE 3.67). Its R2 is marginally higher than that of Random Forest (0.787).

</font>

In [31]:
from pyspark.ml.evaluation import RegressionEvaluator
dt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="r2")
print("R Squared (R2) on test data = %g" % dt_evaluator.evaluate(dt_predictions))

R Squared (R2) on test data = 0.798582


In [32]:
dt_evaluator = RegressionEvaluator(predictionCol="prediction", \
                 labelCol="medv",metricName="rmse")

In [33]:
print("RMSE on test data = %g" % dt_evaluator.evaluate(dt_predictions))

RMSE on test data = 4.15871


<font size=5>

This concludes the testing of Spark ML regressors

</font>
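To see the four regressors side by side, the recorded test metrics can be collected and ranked by R2. This is only a summary sketch of the numbers printed in the cells above. The Random Forest RMSE is left as `None` because the recorded RMSE output in that section came from the GBT evaluator. The split was not seeded, so a rerun will give somewhat different numbers.

```python
# (model, test R2, test RMSE) as recorded in the notebook outputs
results = [
    ("Linear Regression", 0.698078, 5.09163),
    ("Gradient-Boosted Trees", 0.84279, 3.67409),
    ("Random Forest", 0.786508, None),  # no valid RMSE was recorded
    ("Decision Tree", 0.798582, 4.15871),
]

# Highest R2 first
ranked = sorted(results, key=lambda r: r[1], reverse=True)

for name, r2_score, rmse_score in ranked:
    rmse_text = "n/a" if rmse_score is None else "%.2f" % rmse_score
    print("%-24s R2=%.3f  RMSE=%s" % (name, r2_score, rmse_text))
```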