# PySpark Linear Regression

This exercise applies PySpark to the well-known Boston housing [Kaggle dataset](https://www.kaggle.com/schirmerchad/bostonhoustingmlnd) and adapts [this article by Susan Li](https://towardsdatascience.com/building-a-linear-regression-with-pyspark-and-mllib-d065c3ba246a).

From Kaggle:

The Boston Housing Dataset

The Boston Housing Dataset is derived from information collected by the U.S. Census Service concerning housing in the area of Boston, MA. The following describes the dataset columns:

* CRIM - per capita crime rate by town
* ZN - proportion of residential land zoned for lots over 25,000 sq.ft.
* INDUS - proportion of non-retail business acres per town.
* CHAS - Charles River dummy variable (1 if tract bounds river; 0 otherwise)
* NOX - nitric oxides concentration (parts per 10 million)
* RM - average number of rooms per dwelling
* AGE - proportion of owner-occupied units built prior to 1940
* DIS - weighted distances to five Boston employment centres
* RAD - index of accessibility to radial highways
* TAX - full-value property-tax rate per \$10,000
* PTRATIO - pupil-teacher ratio by town
* B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
* LSTAT - % lower status of the population
* MEDV - Median value of owner-occupied homes in \$1000's __TARGET__

In [None]:
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('adpro').getOrCreate()

In [None]:
# Spark 2.0+ ships a built-in CSV reader, so the external com.databricks.spark.csv package is not needed
house_df = spark.read.csv('../sparkfiles/boston.csv', header=True, inferSchema=True)

In [None]:
house_df.printSchema()

In [None]:
house_df.show(5)

All our columns are numeric and already tidied up! Let's go!

In [None]:
from pyspark.ml.feature import VectorAssembler

In [None]:
# Combine the 13 predictor columns into a single vector column named 'features'
vectorAssembler = VectorAssembler(
    inputCols=['crim', 'zn', 'indus', 'chas', 'nox', 'rm', 'age',
               'dis', 'rad', 'tax', 'ptratio', 'black', 'lstat'],
    outputCol='features')

In [None]:
vhouse_df = vectorAssembler.transform(house_df)
vhouse_df = vhouse_df.select(['features', 'medv'])
vhouse_df.show(3)

Let's split the data into training and test sets. Linear regression is often used to explain relationships rather than to predict, but a train-test split still shows how well the fitted model generalizes to data it has not seen.

In [None]:
train_df, test_df = vhouse_df.randomSplit([0.7, 0.3])  # 70-30 split
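
`randomSplit` assigns each row to a split independently at random, so the 70-30 proportions are only approximate and, without a seed, differ from run to run. The plain-Python sketch below illustrates the idea outside Spark; `random_split`, its arguments, and the sample rows are hypothetical names made up for illustration, not part of the PySpark API.

```python
import random

def random_split(rows, train_weight=0.7, seed=None):
    """Assign each row to train or test with an independent random draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_weight.
        (train if rng.random() < train_weight else test).append(row)
    return train, test

rows = list(range(100))  # stand-in for the rows of a DataFrame
train_rows, test_rows = random_split(rows, train_weight=0.7, seed=42)
print(len(train_rows), len(test_rows))  # close to 70 and 30, but not guaranteed
```

In Spark itself, `randomSplit` accepts an optional `seed` argument, for example `vhouse_df.randomSplit([0.7, 0.3], seed=42)`, which helps make the split repeatable across runs on the same data.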

In [None]:
train_df  # Spark is lazy: this only displays the DataFrame's schema; no rows have been computed yet

In [None]:
from pyspark.ml.regression import LinearRegression

In [None]:
lr = LinearRegression(featuresCol = 'features', labelCol='medv', maxIter=10, regParam=0.3, elasticNetParam=0.8)
lr_model = lr.fit(train_df)
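
The `regParam` and `elasticNetParam` arguments control an elastic-net penalty added to the squared-error loss. As described in the Spark ML documentation, writing `regParam` as λ and `elasticNetParam` as α, the penalty is λ(α‖w‖₁ + (1 − α)/2 ‖w‖₂²), so α = 0 gives ridge (L2) and α = 1 gives lasso (L1). The helper below is a hypothetical plain-Python illustration of that formula with made-up coefficients; it ignores Spark's default feature standardization and is not how Spark computes the penalty internally.

```python
def elastic_net_penalty(weights, reg_param, elastic_net_param):
    """Return reg_param * (alpha * ||w||_1 + (1 - alpha) / 2 * ||w||_2^2)."""
    l1 = sum(abs(w) for w in weights)
    l2_squared = sum(w * w for w in weights)
    alpha = elastic_net_param
    return reg_param * (alpha * l1 + (1.0 - alpha) / 2.0 * l2_squared)

weights = [1.0, -2.0, 0.5]  # hypothetical coefficients, not from this model
penalty = elastic_net_penalty(weights, reg_param=0.3, elastic_net_param=0.8)
print(penalty)
```

With `elasticNetParam=0.8`, as in the cell above, most of the penalty weight goes to the L1 term, which tends to push some coefficients to exactly zero.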

In [None]:
print("Coefficients: " + str(lr_model.coefficients))
print("Intercept: " + str(lr_model.intercept))
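
The coefficients and intercept printed above define the fitted model: a prediction is the intercept plus the dot product of the coefficient vector with the feature vector. Here is a minimal sketch with made-up numbers; the `predict` helper and all values are hypothetical and are not output from this model.

```python
def predict(coefficients, intercept, features):
    """Linear model prediction: intercept + sum(coef_i * x_i)."""
    if len(coefficients) != len(features):
        raise ValueError("coefficients and features must have the same length")
    return intercept + sum(c * x for c, x in zip(coefficients, features))

coefficients = [0.5, -1.0, 2.0]  # hypothetical values
intercept = 10.0
features = [4.0, 3.0, 1.5]
prediction = predict(coefficients, intercept, features)
print(prediction)
```

A coefficient of exactly zero means the corresponding feature does not contribute to the prediction, which is a common outcome under an L1-heavy penalty.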

Check how the "training" went.

In [None]:
trainingSummary = lr_model.summary
print("RMSE: %f" % trainingSummary.rootMeanSquaredError)
print("r2: %f" % trainingSummary.r2)

In [None]:
train_df.describe().show()

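The RMSE looks pretty good when set against the `medv` mean and standard deviation in the summary above: an RMSE well below the standard deviation means the model does better than always predicting the mean.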
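
To make the two training metrics concrete, here is a plain-Python sketch of how RMSE and R² are defined, using made-up labels and predictions (the function names and numbers are illustrative assumptions, not values from this dataset). RMSE is expressed in the units of the label, here thousands of dollars, while R² compares the model's squared error against that of always predicting the mean label.

```python
import math

def rmse(labels, predictions):
    """Root mean squared error between labels and predictions."""
    squared_errors = [(y - p) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def r2(labels, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_label = sum(labels) / len(labels)
    ss_res = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    return 1.0 - ss_res / ss_tot

labels = [20.0, 25.0, 30.0, 35.0]       # hypothetical true values
predictions = [22.0, 24.0, 29.0, 37.0]  # hypothetical model output
rmse_value = rmse(labels, predictions)
r2_value = r2(labels, predictions)
print("RMSE: %f" % rmse_value)
print("r2: %f" % r2_value)
```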

In [None]:
lr_predictions = lr_model.transform(test_df)
lr_predictions.select("prediction","medv","features").show(5)

In [None]:
from pyspark.ml.evaluation import RegressionEvaluator

lr_evaluator = RegressionEvaluator(predictionCol="prediction",
                                   labelCol="medv", metricName="r2")

print("R Squared (R2) on test data = %g" % lr_evaluator.evaluate(lr_predictions))

In [None]:
test_result = lr_model.evaluate(test_df)
print("Root Mean Squared Error (RMSE) on test data = %g" % test_result.rootMeanSquaredError)

In [None]:
spark.stop()