# Population vs. Median Home Prices
#### *Linear Regression with Single Variable*
 *Raúl - RDD API patch*


*Note, this notebook requires Spark 1.6+*

In [3]:
%scala if (org.apache.spark.BuildInfo.sparkBranch < "1.6") sys.error("Attach this notebook to a cluster running Spark 1.6+")

### Load and parse the data

In [5]:
# Use the Spark CSV datasource with options specifying:
#  - First line of file is a header
#  - Automatically infer the schema of the data
data = sqlContext.read.format("com.databricks.spark.csv")\
  .option("header", "true")\
  .option("inferSchema", "true")\
  .load("/databricks-datasets/samples/population-vs-price/data_geo.csv")
data.cache()  # Cache data for faster reuse
data.count()

In [6]:
display(data)

In [7]:
data = data.dropna()  # drop rows with missing values
data.count()

In [8]:
# This will let us access the table from our SQL notebook!
data.registerTempTable("data_geo")

## Limit data to Population vs. Price
(for our ML example)

We also use `LabeledPoint` to pair each label (price) with its feature (population) wrapped in a Vector type, which is the input format the ML algorithms expect.

In [10]:
from pyspark.mllib.regression import LabeledPoint  # convenience for specifying schema
data = data.select("2014 Population estimate", "2015 median sales price")\
  .rdd.map(lambda r: LabeledPoint(r[1], [r[0]]))\
  .toDF()
display(data)

## Scatterplot of the data using ggplot

In [12]:
import numpy as np
import matplotlib.pyplot as plt

x = data.rdd.map(lambda p: (p.features[0])).collect()
y = data.rdd.map(lambda p: (p.label)).collect()

from pandas import *
from ggplot import *
pydf = DataFrame({'pop':x,'price':y})
p = ggplot(pydf, aes('pop','price')) + \
    geom_point(color='blue') 
display(p)

## Linear Regression

**Goal**
* Predict y = 2015 Median Housing Price
* Using feature x = 2014 Population Estimate

**References**
* [MLlib LinearRegression user guide](http://spark.apache.org/docs/latest/ml-classification-regression.html#linear-regression)
* [PySpark LinearRegression API](http://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.regression.LinearRegression)
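To make the goal concrete, here is a minimal sketch of what fitting a line to one feature means, written in plain Python rather than Spark. The helper `fit_simple_ols` and the toy numbers are illustrative assumptions; they are not part of the notebook or of `data_geo.csv`.

```python
# Ordinary least squares with a single feature, in plain Python.
# Toy values only; these are NOT taken from data_geo.csv.

def fit_simple_ols(xs, ys):
    """Return (intercept, slope) minimising the sum of squared residuals."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    # Covariance-like and variance-like sums (the 1/n factors cancel).
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

toy_pop = [1.0, 2.0, 3.0, 4.0]
toy_price = [3.0, 5.0, 7.0, 9.0]  # constructed so that price = 1 + 2 * pop

intercept, slope = fit_simple_ols(toy_pop, toy_price)
print("intercept: %r, slope: %r" % (intercept, slope))
```

With one feature, the fitted model has just two numbers, which correspond to the `intercept` and `coefficients[0]` reported by the Spark model.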

In [14]:
# Import LinearRegression class
from pyspark.ml.regression import LinearRegression

In [15]:
# Define LinearRegression algorithm
lr = LinearRegression()

In [16]:
modelA = lr.fit(data)

In [17]:
# Fit 2 models, using different regularization parameters
modelA = lr.fit(data, {lr.regParam:0.0})
modelB = lr.fit(data, {lr.regParam:100.0})

In [18]:
print(">>>> ModelA intercept: %r, coefficient: %r" % (modelA.intercept, modelA.coefficients[0]))

In [19]:
print(">>>> ModelB intercept: %r, coefficient: %r" % (modelB.intercept, modelB.coefficients[0]))
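The two models differ only in `regParam`. As a rough sketch of why a larger value pulls the coefficient toward zero, the plain-Python example below minimises a one-feature L2-penalised objective, `(1/(2n)) * sum((y - b - w*x)^2) + (reg_param/2) * w^2`, in closed form. The helper `fit_ridge_1d` and the toy data are assumptions made for illustration. Spark's `LinearRegression` also standardizes features by default, so its fitted values for a given `regParam` will generally not match this simplified formula.

```python
# One-feature ridge regression in closed form (illustrative sketch only).
# Assumes an unpenalised intercept and no feature standardization.

def fit_ridge_1d(xs, ys, reg_param):
    """Return (intercept, slope) for the L2-penalised least-squares fit."""
    n = len(xs)
    mean_x = sum(xs) / float(n)
    mean_y = sum(ys) / float(n)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    # The penalty adds n * reg_param to the denominator, shrinking the slope.
    slope = sxy / (sxx + n * reg_param)
    intercept = mean_y - slope * mean_x
    return intercept, slope

toy_pop = [1.0, 2.0, 3.0, 4.0]
toy_price = [3.0, 5.0, 7.0, 9.0]  # toy values, not from the dataset

intercept_a, slope_a = fit_ridge_1d(toy_pop, toy_price, 0.0)   # no penalty
intercept_b, slope_b = fit_ridge_1d(toy_pop, toy_price, 1.25)  # with penalty
print("no penalty:   intercept=%r slope=%r" % (intercept_a, slope_a))
print("with penalty: intercept=%r slope=%r" % (intercept_b, slope_b))
```

With `reg_param` set to 0.0 this reduces to ordinary least squares; any positive value gives a slope of smaller magnitude and an intercept that moves toward the mean of the labels.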

## Make predictions

Calling `transform()` on data adds a new column of predictions.

In [21]:
# Make predictions
predictionsA = modelA.transform(data)
display(predictionsA)
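For a single-feature linear model, each value in the new prediction column is `intercept + coefficient * feature`. The plain-Python sketch below mimics that step on a list of dictionaries. The function `add_predictions`, the row layout, and the numbers are hypothetical stand-ins for the Spark DataFrame, not Spark APIs.

```python
# Mimic "transform adds a prediction column" with plain Python rows.
# Rows, intercept, and coefficient are toy values chosen for illustration.

def add_predictions(rows, intercept, coefficient):
    """Return new rows, each with an extra 'prediction' key."""
    return [
        dict(row, prediction=intercept + coefficient * row["features"][0])
        for row in rows
    ]

rows = [
    {"label": 3.0, "features": [1.0]},
    {"label": 5.5, "features": [2.0]},
]

scored = add_predictions(rows, intercept=1.0, coefficient=2.0)
for row in scored:
    print(row)
```

As with `transform()`, the input rows are left unchanged and a new collection carrying the extra column is returned.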

## Evaluate the Model
#### Predicted vs. True label

In [23]:
from pyspark.ml.evaluation import RegressionEvaluator
evaluator = RegressionEvaluator(metricName="rmse")
RMSE = evaluator.evaluate(predictionsA)
print("ModelA: Root Mean Squared Error = " + str(RMSE))

In [24]:
predictionsB = modelB.transform(data)
RMSE = evaluator.evaluate(predictionsB)
print("ModelB: Root Mean Squared Error = " + str(RMSE))
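The metric reported above can be checked by hand. The sketch below computes root mean squared error in plain Python; the function name `rmse` and the toy labels and predictions are assumptions for illustration, not values from the notebook.

```python
import math

# RMSE: square root of the mean of squared (prediction - label) differences.

def rmse(labels, predictions):
    """Return the root mean squared error between two equal-length lists."""
    n = len(labels)
    squared_errors = [(p - y) ** 2 for y, p in zip(labels, predictions)]
    return math.sqrt(sum(squared_errors) / float(n))

toy_labels = [3.0, 5.0, 7.0, 9.0]
toy_predictions = [2.0, 5.0, 8.0, 9.0]  # two predictions are off by 1.0

value = rmse(toy_labels, toy_predictions)
print("RMSE = " + str(value))
```

RMSE is expressed in the same units as the label, here dollars of median sales price. Note that the notebook evaluates both models on the same rows they were fitted on, so these figures describe fit to the training data rather than performance on unseen data.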

# Linear Regression Plots

In [26]:
import numpy as np
from pandas import *
from ggplot import *

# Go through .rdd before map(), matching the earlier scatterplot cell
pop = data.rdd.map(lambda p: (p.features[0])).collect()
price = data.rdd.map(lambda p: (p.label)).collect()
predA = predictionsA.select("prediction").rdd.map(lambda r: r[0]).collect()
predB = predictionsB.select("prediction").rdd.map(lambda r: r[0]).collect()

pydf = DataFrame({'pop':pop,'price':price,'predA':predA, 'predB':predB})

## View the Python Pandas DataFrame (pydf)

In [28]:
pydf

## ggplot figure
Now that we have the Python Pandas DataFrame (pydf), use ggplot to display the scatterplot and the two regression models.

In [30]:
p = ggplot(pydf, aes('pop','price')) + \
    geom_point(color='blue') + \
    geom_line(pydf, aes('pop','predA'), color='red') + \
    geom_line(pydf, aes('pop','predB'), color='green') + \
    scale_x_log10() + scale_y_log10()
display(p)