
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Regression Modeling

Linear regression is one of the most commonly used machine learning models because it is highly interpretable and well studied.  It is often the first model data scientists try when modeling continuous variables.  This lesson trains simple and multivariate regression models and interprets the results.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
* Motivate the use of linear regression
* Train a simple regression model
* Interpret regression models
* Train a multivariate regression model

<iframe  
src="//fast.wistia.net/embed/iframe/xfemo2c5fn?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/xfemo2c5fn?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### Lines through Data

Take the example of Boston housing data, where we have the median home value for a number of neighborhoods along with variables such as the number of rooms, per capita crime, and economic status of residents.  We might have a number of questions about this data, including:<br><br>

1. *Is there a relationship* between our features and median home value?
2. If there is a relationship, *how strong is that relationship?*
3. *Which of the features* affect median home value?
4. *How accurately can we estimate* the effect of each feature on home value?
5. *How accurately can we predict* on unseen data?
6. Is the relationship between our features and home value *linear*?
7. Are there *interaction effects* (e.g. value goes up when an area is not industrial and has more rooms on average) between the features?

Generally speaking, machine learning models either allow us to infer something about our data or create accurate predictions.  **There is a trade-off between model accuracy and interpretability.**  Linear regression is a highly interpretable model, allowing us to infer the answers to the questions above.  The predictive power of this model is somewhat limited, however, so if we're concerned about how our model will work on unseen data, we might choose a different model.

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-1/rm-vs-mdv.png" style="height: 600px; margin: 20px"/></div>

At a high level, **linear regression can be thought of as lines put through data.**  The line plotted above uses a linear regression model to create a best guess for the relationship between average number of rooms in a home and home value.

In [0]:
%run "./Includes/Classroom-Setup"

### Simple Linear Regression

Simple linear regression looks to predict a response `Y` using a single input variable `X`.  In the case of the image above, we're predicting median home value, or `Y`, based on the average number of rooms.  More technically, linear regression is estimating the following equation:

&nbsp;&nbsp;&nbsp;&nbsp;`Y ≈ β<sub>0</sub> + β<sub>1</sub>X`

In this case, `β<sub>0</sub>` and `β<sub>1</sub>` are our **coefficients** where `β<sub>0</sub>` represents the line's intercept with the Y axis and `β<sub>1</sub>` represents the number we multiply by X in order to attain a prediction.  **A simple linear regression model will try to fit our data as closely as possible by estimating these coefficients,** putting a line through the data.
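
As a minimal sketch of how these coefficients can be estimated, the closed-form least-squares solution for a single feature fits in a few lines of plain Python.  The function name `fit_simple_regression` and the toy data points are illustrative assumptions; they are not part of the lesson's dataset or of Spark's API.

```python
def fit_simple_regression(xs, ys):
    """Return (beta0, beta1) for the least-squares line y ≈ beta0 + beta1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: how x and y vary together, divided by how much x varies.
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    # Intercept: the fitted line always passes through (mean_x, mean_y).
    beta0 = mean_y - beta1 * mean_x
    return beta0, beta1


# Hypothetical toy data that roughly follows y = 2x.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

beta0, beta1 = fit_simple_regression(xs, ys)
print("β0 (intercept): {:.2f}".format(beta0))
print("β1 (slope): {:.2f}".format(beta1))
```

Spark's `LinearRegression` estimates the same kind of coefficients, but it does so in a distributed fashion over a DataFrame rather than over Python lists.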

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> In the case of inferential statistics where we're interested in learning about the relationship between our input features and outputs, it's common to skip the train/test split step, as you'll see in this lesson.

Import the Boston dataset.

In [0]:
bostonDF = (spark.read
  .option("HEADER", True)
  .option("inferSchema", True)
  .csv("/mnt/training/bostonhousing/bostonhousing/bostonhousing.csv")
  .drop("_c0")
)

display(bostonDF)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9


Create a column `features` that has a single input variable `rm` by using `VectorAssembler`.

In [0]:
from pyspark.ml.feature import VectorAssembler

# Use only rm (average number of rooms) as the feature
featureCol = ["rm"]

# Create the VectorAssembler
assembler = VectorAssembler(inputCols=featureCol, outputCol="features")

bostonFeaturizedDF = assembler.transform(bostonDF)

display(bostonFeaturizedDF)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv,features
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,"List(1, 1, List(), List(6.575))"
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,"List(1, 1, List(), List(6.421))"
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,"List(1, 1, List(), List(7.185))"
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,"List(1, 1, List(), List(6.998))"
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,"List(1, 1, List(), List(7.147))"
0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,"List(1, 1, List(), List(6.43))"
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,"List(1, 1, List(), List(6.012))"
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,"List(1, 1, List(), List(6.172))"
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,"List(1, 1, List(), List(5.631))"
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,"List(1, 1, List(), List(6.004))"


Fit a linear regression model.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> See the <a href="http://spark.apache.org/docs/latest/api/python/pyspark.ml.html?highlight=vectorassembler#pyspark.ml.regression.LinearRegression" target="_blank">LinearRegression</a> documentation for more details.

In [0]:
from pyspark.ml.regression import LinearRegression

# create a Linear Regression on the features columns
lr = LinearRegression(featuresCol="features",labelCol="medv")

# generate the model
lrModel = lr.fit(bostonFeaturizedDF)

### Model Interpretation

Interpreting a linear model entails answering a number of questions:<br><br>

1. What did the model estimate my coefficients to be?
2. Are my coefficients statistically significant?
3. How accurate was my model?

Recalling that our model looks like `Y ≈ β<sub>0</sub> + β<sub>1</sub>X`, take a look at the model.

In [0]:
print("β0 (intercept): {}".format(lrModel.intercept))
print("β1 (coefficient for rm): {}".format(*lrModel.coefficients))

For a home with 5 rooms, our model would predict `-34.7 + (9.1 * 5) = 10.8`.  Since `medv` is recorded in thousands of dollars, that corresponds to about `$10,800`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The intercept of `-34.7` doesn't make a lot of sense on its own since it would imply that a home with zero rooms is worth negative dollars.  Also, we don't have any homes with only 1 or 2 rooms in our dataset, so the model will perform poorly on data in this range.
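
The arithmetic behind such predictions can be sketched in plain Python.  The constants below are the rounded coefficients quoted in this lesson (`-34.7` and `9.1`), and the helper name `predict_medv` is an illustrative assumption.  `medv` is recorded in thousands of dollars.

```python
# Rounded coefficients as quoted in the text; the fitted values have more digits.
INTERCEPT = -34.7
RM_COEFFICIENT = 9.1


def predict_medv(rooms):
    """Predicted median home value (in $1000s) for an average room count."""
    return INTERCEPT + RM_COEFFICIENT * rooms


for rooms in [0, 4, 6, 8]:
    print("rm = {}: predicted medv = {:.1f}".format(rooms, predict_medv(rooms)))

# Room count at which the prediction crosses zero; below it the model
# predicts negative home values.
break_even_rooms = -INTERCEPT / RM_COEFFICIENT
print("prediction is zero at rm ≈ {:.2f}".format(break_even_rooms))
```

Predictions far outside the range of `rm` observed in the data, such as `rm = 0`, are extrapolations and should not be trusted.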

In order to determine whether our coefficients are statistically significant, we need to quantify the likelihood of seeing the association by chance.  One way of doing this is using a p-value.  As a general rule of thumb, a p-value under .05 indicates statistical significance: if there were no real relationship, we would expect to see an association this strong less than 1 time in 20.

Do this using the `summary` attribute of `lrModel`.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> The t-statistic can be used instead of p-values.  <a href="https://en.wikipedia.org/wiki/P-value" target="_blank">Read more about p-values here.</a>

In [0]:
# get summary from the model
summary = lrModel.summary

summary.pValues

These small p-values indicate that it is highly unlikely to see the correlation of the number of rooms to housing price by chance.  The first value in the list is the p-value for the `rm` feature and the second is that for the intercept.
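
To build intuition for what a p-value measures, the sketch below runs a permutation test on the ten rows displayed earlier in this lesson.  This is a different technique from the one behind `summary.pValues` and is meant only as an illustration: it shuffles `medv` many times and counts how often a shuffled dataset yields a slope at least as extreme as the observed one.  The function names, the number of shuffles, and the random seed are arbitrary assumptions.

```python
import random

# rm and medv from the ten rows displayed earlier in this lesson.
rm = [6.575, 6.421, 7.185, 6.998, 7.147, 6.43, 6.012, 6.172, 5.631, 6.004]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]


def slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    return s_xy / s_xx


def permutation_p_value(xs, ys, n_shuffles=2000, seed=42):
    """Estimate how often shuffled labels give a slope as extreme as observed."""
    rng = random.Random(seed)
    observed = abs(slope(xs, ys))
    shuffled = list(ys)
    extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(shuffled)  # break any real link between xs and ys
        if abs(slope(xs, shuffled)) >= observed:
            extreme += 1
    # Add one to numerator and denominator so the estimate is never exactly 0.
    return (extreme + 1) / (n_shuffles + 1)


p_value = permutation_p_value(rm, medv)
print("observed slope: {:.2f}".format(slope(rm, medv)))
print("permutation p-value: {:.4f}".format(p_value))
```

A small value here means that shuffled data rarely reproduces a slope as steep as the real one, which is the same intuition behind the p-values Spark reports.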

Finally, we need a way to quantify how accurate our model is.  **R<sup>2</sup> is a measure of the proportion of variance in the label (here, `medv`) explained by the model.**  With R<sup>2</sup>, a higher number is better.

In [0]:
# get R-square
summary.r2

This indicates that 48% of the variability in home value can be explained using `rm` and the intercept.  While this isn't too high, it's not too bad considering that we're training a model using only one variable.
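
As a rough sketch of what R<sup>2</sup> computes, the snippet below fits a line to the ten rows displayed earlier in this lesson and compares the model's squared errors with those of always predicting the mean.  The helper name `r_squared` is an illustrative assumption, and the value it prints is not expected to match `summary.r2`, which is computed on the full dataset.

```python
# rm and medv from the ten rows displayed earlier in this lesson.
rm = [6.575, 6.421, 7.185, 6.998, 7.147, 6.43, 6.012, 6.172, 5.631, 6.004]
medv = [24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9]


def r_squared(xs, ys):
    """R² of the least-squares line through the points (xs, ys)."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    beta0 = mean_y - beta1 * mean_x
    # Unexplained variation: squared errors of the fitted line.
    ss_res = sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))
    # Total variation: squared errors of always predicting the mean.
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


r2 = r_squared(rm, medv)
print("R2 on the ten sample rows: {:.3f}".format(r2))
```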

Finally, take a look at the `summary` attribute of `lrModel` to see other ways of summarizing model performance.

In [0]:
[attr for attr in dir(summary) if attr[0] != "_"]

### Multivariate Regression

While simple linear regression involves just a single input feature, multivariate regression takes an arbitrary number of input features.  The same principles apply that we explored in the simple regression example.  The equation for multivariate regression looks like the following, where each of the `p` features has its own coefficient:

&nbsp;&nbsp;&nbsp;&nbsp;`Y ≈ β<sub>0</sub> + β<sub>1</sub>X<sub>1</sub> + β<sub>2</sub>X<sub>2</sub> + ... + β<sub>p</sub>X<sub>p</sub>`

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Our ability to visually explain how our model is performing becomes more limited as our number of features goes up since we can only intuitively visualize data in two, possibly three dimensions.  With multivariate regression, we're still putting lines through data, but this is happening in a higher-dimensional space.
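
The equation above amounts to a dot product plus an intercept.  A minimal sketch in plain Python follows; the function name `predict` and the coefficient values are hypothetical and chosen only to make the arithmetic easy to follow.  They are not the values Spark estimates for this dataset.

```python
def predict(intercept, coefficients, features):
    """Return β0 + β1*x1 + ... + βp*xp for one observation."""
    if len(coefficients) != len(features):
        raise ValueError("need exactly one coefficient per feature")
    return intercept + sum(b * x for b, x in zip(coefficients, features))


# Hypothetical coefficients for three features (not fitted values).
intercept = 1.0
coefficients = [2.0, -0.5, 0.25]

# One observation with three feature values.
features = [3.0, 4.0, 8.0]

print(predict(intercept, coefficients, features))  # 1 + 6 - 2 + 2 = 7.0
```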

Train a multivariate regression model using `rm`, `crim`, and `lstat` as the input features.

In [0]:
from pyspark.ml.feature import VectorAssembler

# Train a multivariate regression model using rm, crim, and lstat
featureCols = ["rm","crim","lstat"]

# Create VectorAssembler
assemblerMultivariate = VectorAssembler(inputCols=featureCols,outputCol="features")

# get the transform dataframe
bostonFeaturizedMultivariateDF = assemblerMultivariate.transform(bostonDF)

display(bostonFeaturizedMultivariateDF)

crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv,features
0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0,"List(1, 3, List(), List(6.575, 0.00632, 4.98))"
0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,"List(1, 3, List(), List(6.421, 0.02731, 9.14))"
0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,"List(1, 3, List(), List(7.185, 0.02729, 4.03))"
0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,"List(1, 3, List(), List(6.998, 0.03237, 2.94))"
0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2,"List(1, 3, List(), List(7.147, 0.06905, 5.33))"
0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7,"List(1, 3, List(), List(6.43, 0.02985, 5.21))"
0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9,"List(1, 3, List(), List(6.012, 0.08829, 12.43))"
0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1,"List(1, 3, List(), List(6.172, 0.14455, 19.15))"
0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5,"List(1, 3, List(), List(5.631, 0.21124, 29.93))"
0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9,"List(1, 3, List(), List(6.004, 0.17004, 17.1))"


Train the model.

In [0]:
from pyspark.ml.regression import LinearRegression

lrMultivariate = (LinearRegression()
  .setLabelCol("medv")
  .setFeaturesCol("features")
)

# Train the model
lrModelMultivariate = lrMultivariate.fit(bostonFeaturizedMultivariateDF)

# Get the summary
summaryMultivariate = lrModelMultivariate.summary

Take a look at the coefficients and R<sup>2</sup> score.

In [0]:
print("β0 (intercept): {}".format(lrModelMultivariate.intercept))
for i, (col, coef) in enumerate(zip(featureCols, lrModelMultivariate.coefficients)):
  print("β{} (coefficient for {}): {}".format(i+1, col, coef))
  
print("\nR2 score: {}".format(lrModelMultivariate.summary.r2))

Our R<sup>2</sup> score improved from 48% to 64%, indicating that our new model can explain more of the variance in the data.

## Exercise: Improve on our Model

Improve on the model trained above by adding features and interpreting the results.

### Step 1: Prepare the Features for a New Model

Prepare a new column `allFeatures` for a new model that uses all of the features in `bostonDF` except for the label `medv`.  Create the following variables:<br><br>

1. `allFeatures`: a list of all the feature column names (every column except `medv`)
2. `assemblerAllFeatures`: A `VectorAssembler` that uses `allFeatures` to create the output column `allFeatures`
3. `bostonFeaturizedAllFeaturesDF`: The transformed `bostonDF`

In [0]:
from pyspark.ml.feature import VectorAssembler

allFeatures = ["crim","zn","indus","chas","nox","rm","age","dis","rad","tax","ptratio","black","lstat"]
assemblerAllFeatures = VectorAssembler(inputCols=allFeatures, outputCol="allFeatures")

bostonFeaturizedAllFeaturesDF = assemblerAllFeatures.transform(bostonDF)

In [0]:
# TEST - Run this cell to test your solution
from pyspark.ml.feature import VectorAssembler

_features = ['crim',
  'zn',
  'indus',
  'chas',
  'nox',
  'rm',
  'age',
  'dis',
  'rad',
  'tax',
  'ptratio',
  'black',
  'lstat'
]

dbTest("ML1-P-06-01-01", _features, allFeatures)
dbTest("ML1-P-06-01-02", True, type(assemblerAllFeatures) == type(VectorAssembler()))
dbTest("ML1-P-06-01-03", True, assemblerAllFeatures.getOutputCol() == 'allFeatures')
dbTest("ML1-P-06-01-04", True, "allFeatures" in bostonFeaturizedAllFeaturesDF.columns)

print("Tests passed!")

### Step 2: Train the Model

Create a linear regression model `lrAllFeatures`.  Save the trained model to `lrModelAllFeatures`.

In [0]:
from pyspark.ml.regression import LinearRegression

lrAllFeatures = LinearRegression(featuresCol="allFeatures", labelCol="medv")
lrModelAllFeatures = lrAllFeatures.fit(bostonFeaturizedAllFeaturesDF)

In [0]:
# TEST - Run this cell to test your solution
from pyspark.ml.regression import LinearRegression

dbTest("ML1-P-06-02-01", True, type(lrAllFeatures) == type(LinearRegression()))
dbTest("ML1-P-06-02-02", True, lrAllFeatures.getLabelCol() == 'medv')
dbTest("ML1-P-06-02-03", True, lrAllFeatures.getFeaturesCol() == 'allFeatures')
dbTest("ML1-P-06-02-04", True, "LinearRegressionModel" in str(type(lrModelAllFeatures)))

print("Tests passed!")

### Step 3: Interpret the Coefficients and Variance Explained

Take a look at the coefficients and variance explained.  What do these mean?

In [0]:
print("β0 (intercept): {}".format(lrModelAllFeatures.intercept))
for i, (col, coef) in enumerate(zip(allFeatures, lrModelAllFeatures.coefficients)):
  print("β{} (coefficient for {}): {}".format(i+1, col, coef))
  
print("\nR2 score: {}".format(lrModelAllFeatures.summary.r2))

### Step 4: Interpret the Statistical Significance of the Coefficients

Print out the p-values associated with each coefficient and the intercept.  Which were statistically significant?

In [0]:
summary = lrModelAllFeatures.summary

# pValues holds one entry per feature (in the order of allFeatures), followed by the intercept
for i,j in enumerate(summary.pValues):
  print("p-value for β{}: {}".format(i+1, j))

At a significance level of 5% (0.05), all coefficients are statistically significant except β3 (`indus`) and β7 (`age`).  Note that β14 in this output is the p-value of the intercept, since Spark lists the intercept last.
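
A small helper can make this comparison explicit.  The sketch below is plain Python; the function name `flag_significance` and the p-values passed to it are hypothetical placeholders rather than output from the model above.  With a fitted model, the names would be the feature list followed by the intercept, in the same order as `summary.pValues`.

```python
def flag_significance(names, p_values, alpha=0.05):
    """Map each name to True when its p-value is below alpha."""
    return {name: p < alpha for name, p in zip(names, p_values)}


# Hypothetical p-values, for illustration only.
names = ["feature_a", "feature_b", "intercept"]
p_values = [0.001, 0.42, 0.03]

for name, significant in flag_significance(names, p_values).items():
    label = "significant" if significant else "not significant"
    print("{}: {}".format(name, label))
```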

## Review

**Question:** What are the pros and cons of linear regression?  
**Answer:** Linear regression is an excellent tool for first getting to know your data, bridging the gap between data exploration and prediction.  It is a highly interpretable model that offers a sense for which of our features are statistically significant and how much they influence the final model.  There are two main drawbacks to this model, however.  The first is that it does not have strong predictive power.  Other models such as random forests or neural networks are able to find more complex relationships that linear regression struggles to model.  The second limitation is that it assumes a linear relationship between features and outcomes.  This assumption often works well for a first model; however, more precise models demand a way of understanding more complex relationships.

**Question:** When should I use simple regression and how does it work?  
**Answer:** In practice, simple regression is not used often since data scientists normally model many features rather than just one.  It works by estimating the line that best fits the data.  Underneath the hood, the simple regression algorithm is trying to minimize the distance between the line and the observed data, measured as the sum of squared vertical distances (residuals).  In practice, it will never reduce this distance to zero, but the algorithm will have found the best fit when it cannot reduce the distance any further.
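
That idea can be sketched in plain Python: measure the distance as the sum of squared errors and check that nudging the fitted line in any direction makes it larger.  The data points and function names are illustrative assumptions.

```python
def sum_squared_errors(beta0, beta1, xs, ys):
    """Sum of squared vertical distances between the line and the data."""
    return sum((y - (beta0 + beta1 * x)) ** 2 for x, y in zip(xs, ys))


def least_squares_line(xs, ys):
    """Closed-form intercept and slope that minimize the squared errors."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    beta1 = s_xy / s_xx
    return mean_y - beta1 * mean_x, beta1


# Hypothetical toy data.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

best_b0, best_b1 = least_squares_line(xs, ys)
best_sse = sum_squared_errors(best_b0, best_b1, xs, ys)
print("best fit SSE: {:.4f}".format(best_sse))

# Shift the intercept or the slope slightly and recompute the error.
nudges = [(0.5, 0.0), (-0.5, 0.0), (0.0, 0.1), (0.0, -0.1)]
nudged_sses = [
    sum_squared_errors(best_b0 + d0, best_b1 + d1, xs, ys) for d0, d1 in nudges
]
for (d0, d1), sse in zip(nudges, nudged_sses):
    print("intercept {:+.1f}, slope {:+.1f}: SSE = {:.4f}".format(d0, d1, sse))
```

The best-fit error stays above zero because the points do not lie exactly on a line, yet every nudged line does worse.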

**Question:** What am I looking for when I interpret a model and how do I see this in the results?  
**Answer:** Interpreting a model looks at a number of factors including (but not limited to):
1. *Coefficients:* what did the model estimate the coefficients to be?
2. *Statistical Significance:* was each of the coefficients statistically significant or should some be removed?
3. *Accuracy:* how well did my model explain the signal in the dataset?<br>

**Question:** Does Spark standardize my data?  
**Answer:** Yes.  Spark's `LinearRegression` standardizes each feature before fitting by default (its `standardization` parameter defaults to `True`), so the user does not need to take this pre-processing step.
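
As an illustration of what standardizing a feature means, the sketch below rescales a column to zero mean and unit standard deviation using only the standard library.  It uses the `rm` values from the ten rows displayed earlier in this lesson; the function name `standardize` is an assumption, and this is not a reproduction of Spark's internal implementation.

```python
import statistics

# rm from the ten rows displayed earlier in this lesson.
rm = [6.575, 6.421, 7.185, 6.998, 7.147, 6.43, 6.012, 6.172, 5.631, 6.004]


def standardize(values):
    """Rescale values to zero mean and unit (sample) standard deviation."""
    mean = statistics.mean(values)
    std = statistics.stdev(values)
    return [(v - mean) / std for v in values]


rm_scaled = standardize(rm)
print([round(v, 2) for v in rm_scaled])
```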

## Next Steps

Start the next lesson, [Classification]($./07-Classification).

## Additional Topics & Resources

**Q:** Where can I find out more information on machine learning using Spark?  
**A:** Check out <a href="https://spark.apache.org/docs/latest/ml-guide.html" target="_blank">the Apache Spark website for more information</a>.

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>