# Linear Regression using ![LOGO](http://demo.epigno.systems/python_spark.png)

In this notebook we will use a simple linear regression model to predict the energy output of a power plant. The dataset comes from the [UC Irvine Machine Learning Repository](http://mlr.cs.umass.edu/ml/datasets/Combined+Cycle+Power+Plant). It contains 9568 data points collected from a Combined Cycle Power Plant over 6 years (2006-2011).

**Data information** as described on the site above:

Features consist of hourly average ambient variables:
- Temperature (T) in the range 1.81°C to 37.11°C
- Ambient Pressure (AP) in the range 992.89-1033.30 millibar
- Relative Humidity (RH) in the range 25.56% to 100.16%
- Exhaust Vacuum (V) in the range 25.36-81.56 cm Hg
- Net hourly electrical energy output (EP) in the range 420.26-495.76 MW

The averages are taken from various sensors located around the plant that record the ambient variables every second. The variables are given without normalization.

The original headers were renamed as below:
- T  -> temperature
- V  -> exhaust\_vacuum
- AP -> ambient\_pressure
- RH -> relative\_humidity
- EP -> energy\_output

Our goal is to predict the `energy_output` (label) based on the other four features.
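
As a rough sketch of what this means, a linear regression model assumes the label is a weighted sum of the features plus an intercept. The coefficients $\beta_0, \dots, \beta_4$ below are placeholders introduced here for illustration; their values are whatever the fitted model estimates from the training data:

$$
\widehat{EP} = \beta_0 + \beta_1\,T + \beta_2\,V + \beta_3\,AP + \beta_4\,RH
$$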

Alternative data source: [CAISO Today's Outlook](http://www.caiso.com/Pages/TodaysOutlook.aspx#SupplyandDemand)

In [2]:
# importing the necessary libraries

from pyspark.ml.regression import LinearRegression as LR
from pyspark.ml.feature import VectorAssembler as VA
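# import only what is used; a wildcard import would shadow Python builtins such as sum, min and max
from pyspark.sql.functions import col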

In [3]:
# the data
file_name = "/FileStore/tables/6zm535q61494044083775/data.csv"

In [4]:
# loading the data
data = sqlContext.read.options(header='true', inferSchema='true').format('csv').load(file_name)

In [5]:
# cache the data, then check the column types
data.cache()
print(data.dtypes)

In [6]:
# simple data description
display(data.describe())
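
For readers without a Spark cluster at hand, here is a minimal plain-Python sketch of the five statistics that `describe()` reports for a numeric column (count, mean, stddev, min, max). The function name and the sample values are hypothetical and are not taken from the dataset; the standard deviation is computed as the sample standard deviation, which is an assumption about how the summary is defined.

```python
import statistics


def describe_column(values):
    """Return count, mean, sample standard deviation, min and max of a list."""
    return {
        "count": len(values),
        "mean": statistics.mean(values),
        "stddev": statistics.stdev(values),  # sample standard deviation (divides by n - 1)
        "min": min(values),
        "max": max(values),
    }


# Hypothetical temperature readings, for illustration only
sample_temperatures = [10.0, 20.0, 30.0]
summary = describe_column(sample_temperatures)
print(summary)
```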

We need a transformer to combine all the features into a single vector column. In Spark this is done with the `VectorAssembler` transformer. [VectorAssembler APIs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.feature.VectorAssembler)

Here are some examples for Scala, Java & Python. [Details](https://spark.apache.org/docs/latest/ml-features.html#vectorassembler)
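
Conceptually, assembling means packing several numeric columns of a row into one ordered vector. The following plain-Python sketch mimics that idea on a single row; it is not the Spark implementation, and the `assemble` helper and the row values are hypothetical.

```python
def assemble(row, input_cols, output_col="features"):
    """Return a copy of `row` with the chosen columns packed into one list."""
    assembled = dict(row)  # leave the original row untouched
    assembled[output_col] = [float(row[name]) for name in input_cols]
    return assembled


# Hypothetical row; the values are illustrative, not taken from the dataset
row = {
    "temperature": 20.0,
    "exhaust_vacuum": 50.0,
    "ambient_pressure": 1010.0,
    "relative_humidity": 70.0,
    "energy_output": 450.0,
}
input_cols = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"]

assembled_row = assemble(row, input_cols)
print(assembled_row["features"])
```

The order of `input_cols` determines the order of the values in the vector, so the same list should be used for the training and the test data.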

In [8]:
# define the features into a list
features = ["temperature", "exhaust_vacuum", "ambient_pressure", "relative_humidity"]

In [9]:
# prepare the data
lr_data = data.select(col("energy_output").alias("label"), *features)
lr_data.printSchema()

In [10]:
# split the dataset into training and test
(training, test) = lr_data.randomSplit([.7, .3], seed = 196)
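
A seeded random split assigns each row to one of the splits at random, so the resulting sizes are only approximately 70% and 30%, while the same seed reproduces the same assignment. The sketch below illustrates the idea in plain Python under that assumption; `random_split` is a hypothetical helper, not Spark's algorithm, and it will not reproduce the rows Spark selects for seed 196.

```python
import random


def random_split(rows, weights, seed):
    """Assign each row to one split by drawing a uniform random number."""
    total = sum(weights)
    # Cumulative upper boundaries, e.g. [0.7, 1.0] for weights [0.7, 0.3]
    boundaries = []
    running = 0.0
    for weight in weights:
        running += weight / total
        boundaries.append(running)

    rng = random.Random(seed)  # a fixed seed makes the split reproducible
    splits = [[] for _ in weights]
    for row in rows:
        draw = rng.random()
        for index, upper in enumerate(boundaries):
            # The last split catches any draw left over by rounding error
            if draw < upper or index == len(boundaries) - 1:
                splits[index].append(row)
                break
    return splits


rows = list(range(1000))
train_rows, test_rows = random_split(rows, [0.7, 0.3], seed=196)
print(len(train_rows), len(test_rows))
```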

In [11]:
# A vector is what the ML algorithm reads to train a model
assembler = VA(inputCols=features, outputCol="features")
training_vector = assembler.transform(training).select("label", "features")
test_vector     = assembler.transform(test).select("label", "features")

In [12]:
# Create a Linear Regression Model object
lr = LR()

# Fit the model to the data
model = lr.fit(training_vector)

# explainParams() lists the parameters we can set, with their current values
# print(lr.explainParams())
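
To give an intuition for what fitting does, here is a minimal sketch of ordinary least squares for a single feature, written in plain Python. It assumes an unregularized fit, which is what the default `LinearRegression()` settings are understood to give; the notebook's model uses four features, and the function name and data below are hypothetical.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (slope, intercept) minimizing the squared error of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data lying exactly on the line y = 2x + 1
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept = fit_simple_linear_regression(xs, ys)
print(slope, intercept)
```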

In [13]:
# run the model on the test data
results = model.transform(test_vector)

# results.show()

In [14]:
# evaluate the model
from pyspark.ml.evaluation import RegressionEvaluator as RE

# Root Mean Squared Error
# (the evaluator is not named `eval`, which would shadow the Python builtin)
evaluator = RE(labelCol="label", predictionCol="prediction", metricName="rmse")
rmse = evaluator.evaluate(results)
print("RMSE: %.3f" % rmse)

# Mean Squared Error
mse = evaluator.evaluate(results, {evaluator.metricName: "mse"})
print("MSE: %.3f" % mse)

# Mean Absolute Error
mae = evaluator.evaluate(results, {evaluator.metricName: "mae"})
print("MAE: %.3f" % mae)

# Coefficient of determination (R^2)
r2 = evaluator.evaluate(results, {evaluator.metricName: "r2"})
print("r2: %.3f" % r2)
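
The four metrics above follow standard definitions. As a sanity check on what they measure, the sketch below computes them in plain Python from a list of labels and predictions. The helper name and the numbers are hypothetical and are not results from this notebook.

```python
import math


def regression_metrics(labels, predictions):
    """Compute RMSE, MSE, MAE and R^2 from paired labels and predictions."""
    n = len(labels)
    errors = [y - p for y, p in zip(labels, predictions)]
    squared_error_sum = sum(e * e for e in errors)
    mse = squared_error_sum / n
    mae = sum(abs(e) for e in errors) / n
    mean_label = sum(labels) / n
    total_sum_of_squares = sum((y - mean_label) ** 2 for y in labels)
    r2 = 1.0 - squared_error_sum / total_sum_of_squares
    return {"rmse": math.sqrt(mse), "mse": mse, "mae": mae, "r2": r2}


# Hypothetical energy outputs (MW) and predictions, for illustration only
labels = [440.0, 450.0, 460.0, 470.0]
predictions = [442.0, 448.0, 463.0, 469.0]
metrics = regression_metrics(labels, predictions)
print(metrics)
```

RMSE and MAE are in the same unit as the label (MW here), MSE is in squared units, and R^2 is unitless, with values closer to 1 indicating a better fit.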