#Power Plant ML Pipeline Application
This is an end-to-end example of using a number of different machine learning algorithms to solve a supervised regression problem.

###Table of Contents
- *Step 1: Business Understanding*
- *Step 2: Extract-Transform-Load (ETL) Your Data*
- *Step 3: Explore Your Data*
- *Step 4: Visualize Your Data*
- *Step 5: Data Preparation*
- *Step 6: Data Modeling*


*We are trying to predict power output given a set of readings from various sensors in a gas-fired power generation plant.  Power generation is a complex process, and understanding and predicting power output is an important element in managing a plant and its connection to the power grid.*

More information about Peaker or Peaking Power Plants can be found on Wikipedia: https://en.wikipedia.org/wiki/Peaking_power_plant


Given this business problem, we need to translate it to a Machine Learning task.  The ML task is regression since the label (or target) we are trying to predict is numeric.


The example data is provided by UCI at [UCI Machine Learning Repository Combined Cycle Power Plant Data Set](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant)

You can read the background on the UCI page, but in summary we have collected a number of readings from sensors at a Gas Fired Power Plant (also called a Peaker Plant), and now we want to use those sensor readings to predict how much power the plant will generate.


More information about Machine Learning with Spark can be found in the [SparkML Guide](https://spark.apache.org/docs/latest/mllib-guide.html).


*Please note this example only works with Spark version 1.4 or higher*

In [2]:
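# Compare (major, minor) as integers; stripping the dots would misread versions such as 1.3.10.
assert tuple(int(part) for part in sc.version.split(".")[:2]) >= (1, 4), "Spark 1.4.0+ is required to run this notebook. Please attach it to a Spark 1.4.0+ cluster."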

##Step 1: Business Understanding
The first step in any machine learning task is to understand the business need. 

As described in the overview, we are trying to predict power output given a set of readings from various sensors in a gas-fired power generation plant.

This is a regression problem, since the label (or target) we are trying to predict is numeric.

##Step 2: Extract-Transform-Load (ETL) Your Data

Now that we understand what we are trying to do, the first step is to load our data into a format we can query and use.  This is known as ETL or "Extract-Transform-Load".  We will load our file from Amazon S3.

Note: Alternatively we could upload our data using "Databricks Menu > Tables > Create Table", assuming we had the raw files on our local computer.

Our data is available on Amazon S3 at the following path:  
`dbfs:/databricks-datasets/power-plant/data`

**ToDo:** Let's start by printing the first 5 lines of the file.  
*Hint*: To read the file into an RDD use `sc.textFile("dbfs:/databricks-datasets/power-plant/data")`  
*Hint*: Then you will need to figure out how to `take` and print the first 5 lines of the RDD.

In [5]:
rawTextRdd = sc.textFile("dbfs:/databricks-datasets/power-plant/data")
for line in rawTextRdd.take(5):
    print(line)


The file is a .tsv (Tab Separated Values) file of floating point numbers.  

Our schema definition from UCI appears below:

- AT = Atmospheric Temperature in °C
- V = Exhaust Vacuum in cm Hg
- AP = Atmospheric Pressure in millibar
- RH = Relative Humidity in %
- PE = Power Output in MW.  This is the value we are trying to predict given the measurements above.


**ToDo:** Transform the RDD so that each row is a tuple of float values.  Then print the first 5 rows.  
*Hint:* Use filter to exclude lines that start with AT to remove the header.  
*Hint:* Use map to transform each line into a PowerPlantRow of data fields.  
*Hint:* Use Python's `str.split` to break up each line into individual fields.

In [7]:
from collections import namedtuple

# One record per sensor reading; field names follow the UCI schema.
PowerPlantRow = namedtuple("PowerPlantRow", ["AT", "V", "AP", "RH", "PE"])

# Split each line on tabs, drop the header row, then convert the five fields to float.
rawDataRdd = (rawTextRdd
  .map(lambda line: line.split("\t"))
  .filter(lambda fields: fields[0] != "AT")
  .map(lambda fields: PowerPlantRow(*[float(field) for field in fields[:5]])))
rawDataRdd.take(5)
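
The parsing logic can also be checked on a plain Python list before it is run on the cluster. The sketch below is a minimal illustration: `is_header`, `parse_line`, and the sample lines are hypothetical names and made-up values, not part of the original notebook or the dataset.

```python
from collections import namedtuple

PowerPlantRow = namedtuple("PowerPlantRow", ["AT", "V", "AP", "RH", "PE"])


def is_header(line):
    """Return True for the header line, which starts with the column name AT."""
    return line.startswith("AT")


def parse_line(line):
    """Split one tab-separated line and convert every field to float."""
    fields = line.split("\t")
    return PowerPlantRow(*[float(field) for field in fields])


# Made-up sample lines for illustration only (not taken from the dataset).
sample_lines = [
    "AT\tV\tAP\tRH\tPE",
    "10.0\t40.0\t1010.0\t80.0\t470.0",
    "25.5\t60.5\t1005.5\t55.5\t440.5",
]

rows = [parse_line(line) for line in sample_lines if not is_header(line)]
for row in rows:
    print(row)
```

The same two functions could be applied to the RDD with `rawTextRdd.filter(lambda line: not is_header(line)).map(parse_line)`.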

##Step 3: Explore Your Data
Now that your data is loaded, let's explore it, verify it, and do some basic analysis and visualizations.

**ToDo:** Transform your `rawDataRdd` into a DataFrame named `powerPlant`.  Then use the `display(powerPlant)` function to visualize it.

In [10]:
powerPlant=None
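
One possible way to complete this step, assuming the notebook's predefined `sqlContext` and the `rawDataRdd` built in Step 2, is `createDataFrame`, which infers column names and types from the namedtuple fields. The helper name `to_data_frame` is hypothetical and only wraps that single call.

```python
def to_data_frame(sql_context, row_rdd):
    """Build a DataFrame from an RDD of namedtuples.

    Column names and types are inferred from the namedtuple fields.
    """
    return sql_context.createDataFrame(row_rdd)


# In the notebook (assumes the predefined sqlContext and rawDataRdd from Step 2):
# powerPlant = to_data_frame(sqlContext, rawDataRdd)
# display(powerPlant)
```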

Next, let's register our dataframe as an SQL table.  Because this lab may be run many times, we'll take the precaution of removing any existing tables first.

**ToDo:** Execute the prepared code in the following cell...

In [12]:
sqlContext.sql("DROP TABLE IF EXISTS power_plant")
dbutils.fs.rm("dbfs:/user/hive/warehouse/power_plant", True)
None

**ToDo:** Register your `powerPlant` dataframe as the table named `power_plant`
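
A minimal sketch of the registration step, assuming the Spark 1.x `SQLContext` API this notebook targets: `registerDataFrameAsTable` registers a DataFrame as a temporary table under the given name. The helper name `register_table` is hypothetical.

```python
def register_table(sql_context, data_frame, table_name="power_plant"):
    """Register data_frame as a temporary SQL table and return the table name."""
    sql_context.registerDataFrameAsTable(data_frame, table_name)
    return table_name


# In the notebook (assumes the predefined sqlContext and the powerPlant DataFrame):
# register_table(sqlContext, powerPlant)
```

Once registered, the table can be queried from a `%sql` cell or with `sqlContext.sql(...)`.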

**ToDo:** Perform the query `SELECT * FROM power_plant`

In [16]:
%sql 

**ToDo:** Use the `desc power_plant` SQL command to describe the schema

In [18]:
%sql 

**Schema Definition**

Our schema definition from UCI appears below:

- AT = Atmospheric Temperature in °C
- V = Exhaust Vacuum in cm Hg
- AP = Atmospheric Pressure in millibar
- RH = Relative Humidity in %
- PE = Power Output in MW

PE is our label or target. This is the value we are trying to predict given the measurements.

*Reference [UCI Machine Learning Repository Combined Cycle Power Plant Data Set](https://archive.ics.uci.edu/ml/datasets/Combined+Cycle+Power+Plant)*

**ToDo:** Display summary statistics for the columns.  
*Hint:* To access the table from Python use `sqlContext.table("power_plant")`  
*Hint:* Calling `describe` with no parameters returns basic statistics for each column: count, mean, standard deviation, min, and max. `describe` is a DataFrame method. More information can be found in the [Spark API docs](https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.DataFrame)
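
The two hints combine into a single expression. The sketch below assumes the `power_plant` table has already been registered; the helper name `summarize` is hypothetical.

```python
def summarize(sql_context, table_name="power_plant"):
    """Return a DataFrame of summary statistics for every column of the table."""
    return sql_context.table(table_name).describe()


# In the notebook (assumes the predefined sqlContext):
# display(summarize(sqlContext))
```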

##Step 4: Visualize Your Data

To understand our data, we will look for correlations between features and the label.  This can be important when choosing a model.  E.g., if features and a label are linearly correlated, a linear model like Linear Regression can do well; if the relationship is very non-linear, more complex models such as Decision Trees can be better. We use Databricks' built-in visualization to view each of our predictors against the label column as a scatter plot, which shows the correlation between each predictor and the label.
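
To make "linearly correlated" concrete, the sketch below computes the Pearson correlation coefficient in plain Python. It is an illustration rather than part of the original notebook: the function name `pearson` and the toy inputs are assumptions chosen to show one perfectly linear relationship and one purely non-linear one.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 points")
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Toy data: y falls by 2 for every unit of x (perfectly linear, decreasing).
linear = pearson([1, 2, 3, 4], [8, 6, 4, 2])

# Toy data: y = x^2 on a range symmetric around zero (strong pattern, no linear trend).
parabola = pearson([-2, -1, 0, 1, 2], [4, 1, 0, 1, 4])

print(linear, parabola)
```

The first pair has a coefficient of -1 and the second a coefficient of 0, even though both follow an exact rule. This is why a scatter plot is worth inspecting alongside any single correlation number. On the registered table, Spark 1.4+ exposes the same statistic as `DataFrame.stat.corr(col1, col2)`.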

**ToDo:** Do a scatter plot of Power (PE) as a function of Temperature (AT).  
*Bonus:* Name the y-axis "Power" and the x-axis "Temperature"
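
One way to approach the plot and the bonus is to select the two columns under the desired names and pick the scatter plot type in the plot options of `display`. The sketch below assumes that the plot takes its axis labels from the column aliases; the helper name `scatter_query` is hypothetical.

```python
def scatter_query(feature_col, feature_label, table_name="power_plant"):
    """Build a query returning one feature and the label under readable names."""
    return "SELECT {0} AS {1}, PE AS Power FROM {2}".format(
        feature_col, feature_label, table_name)


print(scatter_query("AT", "Temperature"))

# In the notebook (assumes the predefined sqlContext and display):
# display(sqlContext.sql(scatter_query("AT", "Temperature")))
```

The same helper covers the remaining plots by passing `"V"`, `"AP"`, or `"RH"` with the matching label.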

Notice that there appears to be a strong linear correlation between Temperature and Power Output.

**ToDo:** Do a scatter plot of Power (PE) as a function of ExhaustVacuum (V).  
*Bonus:* Name the y-axis "Power" and the x-axis "ExhaustVacuum"

The linear correlation between Exhaust Vacuum and Power Output is not as strong, but there is some semblance of a pattern.

**ToDo:** Do a scatter plot of Power (PE) as a function of Pressure (AP).  
*Bonus:* Name the y-axis "Power" and the x-axis "Pressure"

**ToDo:** Do a scatter plot of Power (PE) as a function of Humidity (RH).  
*Bonus:* Name the y-axis "Power" and the x-axis "Humidity"

...and Atmospheric Pressure and Relative Humidity seem to have little to no linear correlation with Power Output.