## Evaluating a Regression Model

In this exercise, you will create a pipeline for a linear regression model, and then test and evaluate the model.

### Prepare the Data

First, import the libraries you will need and prepare the training and test data:

In [1]:
# Import Spark SQL and Spark ML libraries
from pyspark.sql.types import *
from pyspark.sql.functions import *

from pyspark.ml import Pipeline
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Load the source data
csv = spark.read.csv('wasb:///data/flights.csv', inferSchema=True, header=True)

# Select features and label
data = csv.select("DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay", col("ArrDelay").alias("label"))

# Split the data
splits = data.randomSplit([0.7, 0.3])
train = splits[0]
test = splits[1].withColumnRenamed("label", "trueLabel")

SparkSession available as 'spark'.
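
`randomSplit` assigns rows at random, so the resulting proportions are only approximately 70/30, and without a seed the split changes from run to run (the method accepts an optional `seed` argument). The following is a minimal pure-Python sketch of that idea, not Spark's actual implementation; the `random_split` helper, the seed value, and the stand-in rows are all hypothetical.

```python
import random

def random_split(rows, train_fraction=0.7, seed=42):
    """Assign each row independently to train or test."""
    rng = random.Random(seed)  # fixed seed (hypothetical value) makes the split repeatable
    train, test = [], []
    for row in rows:
        # Each row lands in the training set with probability train_fraction
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))  # stand-in for the DataFrame rows
train_rows, test_rows = random_split(rows)
print(len(train_rows), len(test_rows))
```

Because every row is drawn independently, the two counts add up to the total but rarely land on exactly 700 and 300.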


### Define the Pipeline and Train the Model
Now define a pipeline that creates a feature vector and trains a regression model:

In [2]:
# Define the pipeline
assembler = VectorAssembler(inputCols = ["DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay"], outputCol="features")
lr = LinearRegression(labelCol="label",featuresCol="features", maxIter=10, regParam=0.3)
pipeline = Pipeline(stages=[assembler, lr])

# Train the model
pipelineModel = pipeline.fit(train)

### Test the Model
Now you're ready to apply the model to the test data.

In [3]:
prediction = pipelineModel.transform(test)
predicted = prediction.select("features", "prediction", "trueLabel")
predicted.show()

+--------------------+-------------------+---------+
|            features|         prediction|trueLabel|
+--------------------+-------------------+---------+
|[1.0,1.0,10140.0,...| -5.563127660479272|      -18|
|[1.0,1.0,10140.0,...| -5.563127660479272|      -17|
|[1.0,1.0,10140.0,...|-3.5678037165765453|      -12|
|[1.0,1.0,10140.0,...| -5.759115702461106|      -14|
|[1.0,1.0,10140.0,...| -3.763791758558379|      -12|
|[1.0,1.0,10140.0,...|  17.18710965242025|       19|
|[1.0,1.0,10140.0,...|  20.18009556827434|       14|
|[1.0,1.0,10140.0,...|  31.15437725973933|       41|
|[1.0,1.0,10140.0,...| -5.766618724949969|       -5|
|[1.0,1.0,10140.0,...| -4.768956752998605|       -6|
|[1.0,1.0,10140.0,...|0.21935310675821018|        6|
|[1.0,1.0,10140.0,...|  2.214677050660937|       13|
|[1.0,1.0,10140.0,...|  37.13284606895864|       38|
|[1.0,1.0,10140.0,...|-13.749278686467942|      -13|
|[1.0,1.0,10140.0,...| -12.75161671451658|      -12|
|[1.0,1.0,10140.0,...| -10.75629277061385|    
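
The `features` column in this output is the vector that `VectorAssembler` builds from the five input columns, and each `prediction` is a linear function of that vector. As a rough sketch of those two steps in plain Python (the `assemble` and `predict` helpers, the example row, and the weights are hypothetical and are not taken from the fitted model):

```python
FEATURE_COLS = ["DayofMonth", "DayOfWeek", "OriginAirportID", "DestAirportID", "DepDelay"]

def assemble(row, input_cols=FEATURE_COLS):
    """Collect the named columns of a row into one list of floats."""
    return [float(row[c]) for c in input_cols]

def predict(features, coefficients, intercept):
    """Linear model: the intercept plus the weighted sum of the features."""
    return intercept + sum(w * x for w, x in zip(coefficients, features))

# Hypothetical row and weights, for illustration only
row = {"DayofMonth": 1, "DayOfWeek": 1, "OriginAirportID": 10140,
       "DestAirportID": 10397, "DepDelay": 5, "label": 3}
example_weights = [0.0, 0.0, 0.0, 0.0, 1.0]

features = assemble(row)
print(features)
print(predict(features, example_weights, intercept=-4.0))
```

In the real pipeline, the coefficients and intercept are learned by `LinearRegression` when `fit` runs on the training data.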

### Examine the Predicted and Actual Values
You can plot the predicted values against the actual values to see how accurately the model predicts. For a perfect model, the scatter plot would form a straight diagonal line, with each predicted value identical to the actual value. In practice, some scatter around that line is to be expected.
Run the cells below to create a temporary view from the **predicted** DataFrame and then retrieve the predicted and actual label values using SQL. You can then display the results as a scatter plot, specifying **-** as the function to show the unaggregated values.

In [4]:
predicted.createOrReplaceTempView("regressionPredictions")

In [5]:
%%sql
SELECT trueLabel, prediction FROM regressionPredictions

### Retrieve the Root Mean Square Error (RMSE)
Several metrics measure the difference between predicted and actual values. Of these, the root mean square error (RMSE) is commonly used because it is expressed in the same units as the predicted and actual values. In this case, the RMSE indicates roughly how many minutes a predicted flight delay typically differs from the actual delay, with large errors weighted more heavily than small ones. You can use the **RegressionEvaluator** class to retrieve the RMSE.


In [6]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(labelCol="trueLabel", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(prediction)
# format() keeps this line valid under both Python 2 and Python 3 kernels
print("Root Mean Square Error (RMSE): {}".format(rmse))

Root Mean Square Error (RMSE): 13.2646903332
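
To make the metric concrete, RMSE is the square root of the mean of the squared differences between actual and predicted values. The sketch below computes it by hand on a few toy numbers (not flight data) and compares it with the mean absolute error; the `rmse_by_hand` and `mae_by_hand` helpers are hypothetical names and are not part of Spark.

```python
import math

def rmse_by_hand(actuals, predictions):
    """Square root of the mean of the squared differences."""
    if not actuals or len(actuals) != len(predictions):
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actuals, predictions)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def mae_by_hand(actuals, predictions):
    """Mean of the absolute differences, shown for comparison."""
    return sum(abs(a - p) for a, p in zip(actuals, predictions)) / float(len(actuals))

# Toy values: three errors of 1 and one error of 9
toy_actuals = [10, 20, 30, 40]
toy_predictions = [11, 19, 31, 49]

print(rmse_by_hand(toy_actuals, toy_predictions))  # sqrt((1 + 1 + 1 + 81) / 4) = sqrt(21)
print(mae_by_hand(toy_actuals, toy_predictions))   # (1 + 1 + 1 + 9) / 4 = 3.0
```

The single large error pulls the RMSE well above the mean absolute error, which is why an RMSE can look high when most predictions are close but a few are far off.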