
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px; height: 163px">
</div>

# Machine Learning Workflows

Machine learning practitioners generally follow an iterative workflow.  This lesson walks through that workflow at a high level before exploring train/test splits, a baseline model, and evaluation.

## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) In this lesson you:<br>
* Define the data analytics development cycle
* Motivate and perform a split between training and test data
* Train a baseline model
* Evaluate a baseline model's performance and improve it

<iframe  
src="//fast.wistia.net/embed/iframe/qimsc8jn4a?videoFoam=true"
style="border:1px solid #1cb1c2;"
allowtransparency="true" scrolling="no" class="wistia_embed"
name="wistia_embed" allowfullscreen mozallowfullscreen webkitallowfullscreen
oallowfullscreen msallowfullscreen width="640" height="360" ></iframe>
<div>
<a target="_blank" href="https://fast.wistia.net/embed/iframe/qimsc8jn4a?seo=false">
  <img alt="Opens in new tab" src="https://files.training.databricks.com/static/images/external-link-icon-16x16.png"/>&nbsp;Watch full-screen.</a>
</div>

### The Development Cycle

Data scientists follow an iterative workflow that keeps their work closely aligned to both business problems and their data.  This cycle begins with a thorough understanding of the business problem and of the data itself; investigating the data is a process called _exploratory data analysis_.  Once the motivating business question and data are understood, the next step is preparing the data for modeling.  This includes removing or imputing missing values and outliers as well as creating features to train the model on.  The majority of a data scientist's time is spent in these earlier steps.

After preparing the features in a way that the model can benefit from, the modeling stage uses those features to determine the best way to represent the data.  The various models are then evaluated and this whole process is repeated until the best solution is developed and deployed into production.

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-1/CRISP-DM.png" style="height: 400px; margin: 20px"/></div>

The above model addresses the high-level development cycle of data products.  This lesson addresses how to implement this at a more practical level.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> <a href="https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining" target="_blank">See the Cross-Industry Standard Process for Data Mining</a> for details on the method above.

Run the following cell to set up our environment.

In [0]:
%run "./Includes/Classroom-Setup"

### Train/Test Split

To implement the development cycle detailed above, data scientists first divide their data randomly into two subsets.  This allows for the evaluation of the model on unseen data.<br><br>

1. The **training set** is used to train the model
2. The **test set** is used to test how well the model performs on unseen data

This split guards against the memorization of data, known as **overfitting**.  Overfitting occurs when our model learns patterns caused by random chance rather than true signal.  By evaluating our model's performance on unseen data, we can detect overfitting instead of being misled by a low training error.
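
To make memorization concrete, here is a minimal plain-Python sketch (not Spark code).  The toy data and the `memorizer` model are hypothetical: the model stores every training example and falls back to the training average for inputs it has not seen.

```python
# Hypothetical toy data: (feature, label) pairs.
train = [(1, 10.0), (2, 12.0), (3, 9.0), (4, 13.0)]
test = [(5, 11.0), (6, 12.0)]

# A "model" that memorizes the training labels exactly.
lookup = dict(train)
fallback = sum(label for _, label in train) / len(train)

def memorizer(feature):
    # Return the memorized label, or the training average for unseen inputs.
    return lookup.get(feature, fallback)

def average_squared_error(rows, predict):
    return sum((label - predict(feature)) ** 2 for feature, label in rows) / len(rows)

train_error = average_squared_error(train, memorizer)
test_error = average_squared_error(test, memorizer)

print(train_error)  # 0.0
print(test_error)   # 0.5
```

The training error is zero only because the model has seen those exact rows; the error on held-out rows is what reveals how it would behave on new data.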

Splitting training and test data should be done so that the amount of data in the test set is a good sample of the overall data.  **A split of 80% of your data in the training set and 20% in the test set is a good place to start.**
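
The idea of a seeded random split can be sketched in plain Python.  This is only an analogy for a DataFrame split, not Spark's implementation: the `random_split` helper and the 1,000 row ids are assumptions made for illustration.  Each row is assigned independently, so the resulting sizes are close to 80/20 rather than exact.

```python
import random

def random_split(rows, train_fraction, seed):
    """Assign each row to train or test using an independent seeded draw."""
    rng = random.Random(seed)
    train, test = [], []
    for row in rows:
        if rng.random() < train_fraction:
            train.append(row)
        else:
            test.append(row)
    return train, test

rows = list(range(1000))  # hypothetical row ids

train_a, test_a = random_split(rows, 0.8, seed=42)
train_b, test_b = random_split(rows, 0.8, seed=42)

print(len(train_a), len(test_a))  # roughly 800 and 200
print(train_a == train_b)         # True: the same seed gives the same split
```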

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-1/train-test-split.png" style="height: 400px; margin: 20px"/></div>

Import the Boston dataset.

In [0]:
bostonDF = (spark.read
  .option("header", True)
  .option("inferSchema", True)
  .csv("/mnt/training/bostonhousing/bostonhousing/bostonhousing.csv")
)

display(bostonDF)

_c0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv
1,0.00632,18.0,2.31,0,0.538,6.575,65.2,4.09,1,296,15.3,396.9,4.98,24.0
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6
3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7
4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4
5,0.06905,0.0,2.18,0,0.458,7.147,54.2,6.0622,3,222,18.7,396.9,5.33,36.2
6,0.02985,0.0,2.18,0,0.458,6.43,58.7,6.0622,3,222,18.7,394.12,5.21,28.7
7,0.08829,12.5,7.87,0,0.524,6.012,66.6,5.5605,5,311,15.2,395.6,12.43,22.9
8,0.14455,12.5,7.87,0,0.524,6.172,96.1,5.9505,5,311,15.2,396.9,19.15,27.1
9,0.21124,12.5,7.87,0,0.524,5.631,100.0,6.0821,5,311,15.2,386.63,29.93,16.5
10,0.17004,12.5,7.87,0,0.524,6.004,85.9,6.5921,5,311,15.2,386.71,17.1,18.9


Split the dataset into two DataFrames:<br><br>

1. `trainDF`: our training data
2. `testDF`: our test data

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Using a seed ensures that the random split we conduct will be the same split if we rerun the code.  Reproducible experiments are a hallmark of good science.<br>
<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Conventions using other machine learning tools often entail creating 4 objects: `X_train`, `y_train`, `X_test`, and `y_test` where your features `X` are separate from your label `y`.  Since Spark is distributed, the Spark convention keeps the features and labels together when the split is performed.

In [0]:
# Use the randomSplit function with a fixed seed
trainDF, testDF = bostonDF.randomSplit([0.8, 0.2], seed=42)

print("We have {} training examples and {} test examples.".format(trainDF.count(), testDF.count()))

### Baseline Model

A **baseline model** offers an educated best guess to improve upon as different models are trained and evaluated.  It represents the simplest model we can create.  A common choice is to predict the center of the data.  In the case of regression, this could involve predicting the average of the outcome regardless of the features the model sees.  In the case of classification, the center of the data is the mode, or the most common class.  

A baseline model could also be a random value or a preexisting model.  Through each new model, we can track improvements with respect to this baseline.
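
As a rough illustration of "predicting the center of the data," the sketch below builds a mean baseline for regression and a mode baseline for classification in plain Python.  The helper names and the toy labels are hypothetical and are not part of the lesson's Spark code.

```python
from collections import Counter
from statistics import mean

def mean_baseline(train_labels):
    """Regression baseline: always predict the training average."""
    center = mean(train_labels)
    return lambda features: center

def mode_baseline(train_labels):
    """Classification baseline: always predict the most common training class."""
    most_common = Counter(train_labels).most_common(1)[0][0]
    return lambda features: most_common

# Hypothetical toy labels.
regressor = mean_baseline([20.0, 30.0, 25.0, 25.0])
classifier = mode_baseline(["yes", "no", "yes", "yes"])

# The features are ignored entirely, which is what makes these baselines naive.
print(regressor({"rm": 6.5}))   # 25.0
print(classifier({"rm": 6.5}))  # yes
```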

Create a baseline model by calculating the average housing value in the training dataset.

In [0]:
from pyspark.sql.functions import avg

# find the average of "medv"
trainAvg = trainDF.select(avg("medv")).first()[0]
print("Average home value: {}".format(trainAvg))

Take the average calculated on the training dataset and append it as the column `prediction` on the test dataset.

In [0]:
from pyspark.sql.functions import lit

# set the average to be the prediction
testPredictionDF = testDF.withColumn("prediction",lit(trainAvg))

display(testPredictionDF)

_c0,crim,zn,indus,chas,nox,rm,age,dis,rad,tax,ptratio,black,lstat,medv,prediction
2,0.02731,0.0,7.07,0,0.469,6.421,78.9,4.9671,2,242,17.8,396.9,9.14,21.6,22.61536585365856
3,0.02729,0.0,7.07,0,0.469,7.185,61.1,4.9671,2,242,17.8,392.83,4.03,34.7,22.61536585365856
4,0.03237,0.0,2.18,0,0.458,6.998,45.8,6.0622,3,222,18.7,394.63,2.94,33.4,22.61536585365856
14,0.62976,0.0,8.14,0,0.538,5.949,61.8,4.7075,4,307,21.0,396.9,8.26,20.4,22.61536585365856
17,1.05393,0.0,8.14,0,0.538,5.935,29.3,4.4986,4,307,21.0,386.85,6.58,23.1,22.61536585365856
36,0.06417,0.0,5.96,0,0.499,5.933,68.2,3.3603,5,279,19.2,396.9,9.68,18.9,22.61536585365856
51,0.08873,21.0,5.64,0,0.439,5.963,45.7,6.8147,4,243,16.8,395.56,13.45,19.7,22.61536585365856
52,0.04337,21.0,5.64,0,0.439,6.115,63.0,6.8147,4,243,16.8,393.97,9.43,20.5,22.61536585365856
54,0.04981,21.0,5.64,0,0.439,5.998,21.4,6.8147,4,243,16.8,396.9,8.43,23.4,22.61536585365856
60,0.10328,25.0,5.13,0,0.453,5.927,47.2,6.932,8,284,19.7,396.9,9.22,19.6,22.61536585365856


### Evaluation and Improvement

Evaluation offers a way of measuring how well predictions match the observed data.  In other words, an evaluation metric measures how close the predicted responses are to the true responses.

There are a number of different evaluation metrics.  The most common evaluation metric in regression tasks is **mean squared error (MSE)**.  This is calculated by subtracting each predicted response from the corresponding true response, squaring the result, and averaging those squared differences over all examples.  Squaring ensures that each term is non-negative, so errors above and below the true value cannot cancel out.  The lower the MSE, the better the model is performing.  Technically:

<div><img src="https://files.training.databricks.com/images/eLearning/ML-Part-1/mse.png" style="height: 100px; margin: 20px"/></div>
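
For readers who cannot view the image, the standard definition of MSE can be written out as follows.  The symbols are chosen here and may differ from the notation in the image: \\(n\\) is the number of examples, \\(y_i\\) is the true response, and \\(\hat{y}_i\\) is the predicted response.

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$$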

Since we care about how our model performs on unseen data, we are more concerned about the test error, or the MSE calculated on the unseen data.
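
The calculation can be sketched by hand in plain Python.  The example below uses only the first three rows of the test predictions displayed earlier (`medv` against the constant `prediction`), so its result is not the MSE of the full test set; the function and variable names are illustrative.

```python
import math

def mean_squared_error(actual, predicted):
    """Average of the squared differences between true and predicted values."""
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)

# First three test rows from the table above: true medv and the baseline prediction.
actual = [21.6, 34.7, 33.4]
predicted = [22.61536585365856] * 3

mse = mean_squared_error(actual, predicted)
rmse = math.sqrt(mse)  # back in the units of medv

print("MSE: {:.2f}, RMSE: {:.2f}".format(mse, rmse))
```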

Define the evaluator with the prediction column, label column, and MSE metric.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> We'll explore various model parameters in later lessons.

In [0]:
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(predictionCol="prediction",labelCol="medv",metricName="mse")

Evaluate `testPredictionDF` using the `.evaluate()` method.

In [0]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Check Baseline Model error
testError = evaluator.evaluate(testPredictionDF)
print("Error on the test set for the baseline model: {}".format(testError))


# Check Linear Regression error
featureCols = ["rm", "crim", "lstat"]
assembler = VectorAssembler(inputCols=featureCols, outputCol="features")
trainFeaturizedDF = assembler.transform(trainDF)

lr = LinearRegression(labelCol="medv", featuresCol="features")
lrModel = lr.fit(trainFeaturizedDF)

testFeaturizedDF = assembler.transform(testDF)
testLRResultDF = lrModel.transform(testFeaturizedDF)
testLRError = evaluator.evaluate(testLRResultDF)
print("Error on the test set for the Linear Regression model: {}".format(testLRError))


The baseline score indicates that the average squared distance between the true home value and the prediction of the baseline is about 79.  Taking the square root of that number gives us the error in the units of the quantity being estimated.  Because `medv` is recorded in thousands of dollars, the square root of 79 (about 8.89) corresponds to a typical error of about $8,890.  That's not great, but it's also not too bad for a naive approach.
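
The unit conversion can be checked with a few lines of plain Python.  This sketch assumes the rounded baseline MSE of 79 quoted in the text and that `medv` is recorded in thousands of dollars; the variable names are illustrative.

```python
import math

baseline_mse = 79.0               # rounded test MSE of the baseline, as quoted in the text
baseline_rmse = math.sqrt(baseline_mse)

# Assumption: medv is in thousands of dollars, so scale by 1,000.
error_in_dollars = baseline_rmse * 1000

print("RMSE: {:.2f} (about ${:,.0f})".format(baseline_rmse, error_in_dollars))
```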

## Exercise: Train/Test Split and Baseline Model

Do a train/test split on a dataset, create a baseline model, and evaluate the result.  Optionally, try to beat this baseline model by training a linear regression model.

### Step 1: Train/Test Split

Import the bike sharing dataset and take a look at what's in it.  This dataset contains the number of bikes rented (`cnt`) by season, year, month, and hour, along with a number of weather conditions.

In [0]:
bikeDF = (spark
  .read
  .option("header", True)
  .option("inferSchema", True)
  .csv("/mnt/training/bikeSharing/data-001/hour.csv")
  .drop("instant", "dteday", "casual", "registered", "holiday", "weekday") # Drop unnecessary features
)

display(bikeDF)

season,yr,mnth,hr,workingday,weathersit,temp,atemp,hum,windspeed,cnt
1,0,1,0,0,1,0.24,0.2879,0.81,0.0,16
1,0,1,1,0,1,0.22,0.2727,0.8,0.0,40
1,0,1,2,0,1,0.22,0.2727,0.8,0.0,32
1,0,1,3,0,1,0.24,0.2879,0.75,0.0,13
1,0,1,4,0,1,0.24,0.2879,0.75,0.0,1
1,0,1,5,0,2,0.24,0.2576,0.75,0.0896,1
1,0,1,6,0,1,0.22,0.2727,0.8,0.0,2
1,0,1,7,0,1,0.2,0.2576,0.86,0.0,3
1,0,1,8,0,1,0.24,0.2879,0.75,0.0,8
1,0,1,9,0,1,0.32,0.3485,0.76,0.0,14


Perform a train/test split.  Put 70% of the data into `trainBikeDF` and 30% into `testBikeDF`.  Use a seed of `42` so you have the same split every time you perform the operation.

In [0]:
# TODO
trainBikeDF, testBikeDF = bikeDF.randomSplit([0.7, 0.3],seed=42)

In [0]:
# TEST - Run this cell to test your solution
_traincount = trainBikeDF.count()
_testcount = testBikeDF.count()

dbTest("ML1-P-03-01-01", True, _traincount < 13000 and _traincount > 12000)
dbTest("ML1-P-03-01-02", True, _testcount < 5500 and _testcount > 4800)

print("Tests passed!")

### Step 2: Create a Baseline Model

Calculate the average of the column `cnt` in the training data and save it to the variable `avgTrainCnt`.  Then create a new DataFrame `bikeTestPredictionDF` that appends to the test data a new column `prediction` holding the value of `avgTrainCnt`.

In [0]:
from pyspark.sql.functions import avg, lit

avgTrainCnt = trainBikeDF.select(avg("cnt")).first()[0]
bikeTestPredictionDF = testBikeDF.withColumn("prediction",lit(avgTrainCnt))

In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-03-02-01", True, avgTrainCnt < 195 and avgTrainCnt > 180)
dbTest("ML1-P-03-02-02", True, "prediction" in bikeTestPredictionDF.columns)

print("Tests passed!")

### Step 3: Evaluate the Result

Evaluate the result using `mse` as the error metric.  Save the result to `testError`.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Your baseline prediction will not be very accurate.  Be sure to take the square root of the MSE to return the results to the proper units (that is, bike counts).

In [0]:
# TODO
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(predictionCol="prediction",labelCol="cnt",metricName="mse")
testError = evaluator.evaluate(bikeTestPredictionDF)
print(testError)


In [0]:
# TEST - Run this cell to test your solution
dbTest("ML1-P-03-03-01", True, testError > 33000 and testError < 35000)

print("Tests passed!")

### Step 4 (Optional): Beat the Baseline

Use a linear regression model (explored in the previous lesson) to beat the baseline model score.

In [0]:
from pyspark.ml.regression import LinearRegression
from pyspark.ml.feature import VectorAssembler

# Check Baseline Model error
testError = evaluator.evaluate(bikeTestPredictionDF)
print("Error on the test set for the baseline model: {}".format(testError))


# Check Linear Regression error
featureCols = ["temp","atemp","hum","windspeed","season","yr","mnth","hr","workingday","weathersit"]
assembler = VectorAssembler(inputCols=featureCols,outputCol="features")
trainBikeFeaturizedDF = assembler.transform(trainBikeDF)

lr = LinearRegression(labelCol="cnt", featuresCol="features")
lrModel = lr.fit(trainBikeFeaturizedDF)

testBikeFeaturizedDF = assembler.transform(testBikeDF)
testLRResultDF = lrModel.transform(testBikeFeaturizedDF)
testLRError = evaluator.evaluate(testLRResultDF)
print("Error on the test set for the Linear Regression model: {}".format(testLRError))

## Review
**Question:** What does a data scientist's workflow look like?  
**Answer:** Data scientists employ an iterative workflow that includes the following steps:
1. *Business and Data Understanding:* ensures a rigorous understanding of both the business problem and the available data
2. *Data Preparation:* involves cleaning data so that it can be fed into algorithms and creating new features
3. *Modeling:* entails training many models and many combinations of parameters for a given model
4. *Evaluation:* compares model performance and chooses the best option
5. *Deployment:* launches a model into production where it's used to inform business decision-making<br>

**Question:** Why should I divide all my data into two subsets?  
**Answer:** It's important to gauge how a model performs on unseen data.  Without this check, there is no way to tell whether the model has "overfit," meaning it has memorized both the signal and the noise in the dataset rather than just learning the true signal.  In practice, many models are trained on the training dataset and tested on the test dataset.  Before launching into production, the final model is often retrained on all of the data since the more data a model sees, the better it generally performs.

**Question:** What does a baseline model do?  
**Answer:** Since training machine learning models is an iterative process, using a naive baseline model offers a reference point for what a basic solution would offer.  Baseline models are sometimes arbitrary (such as using a random prediction), but that reference point grounds future hypotheses.

**Question:** How do I evaluate the performance of a regression model?  
**Answer:** There are a number of ways of evaluating regression models.  The most common is Mean Squared Error (MSE), which calculates the average squared distance between the predicted value and the true value.  Because the error is squared, each term is non-negative, so this evaluation metric does not care whether the prediction is above or below the true value.  There are many alternatives, including Root Mean Squared Error (RMSE).  RMSE is a helpful metric because, by taking the square root of the MSE, the error has the same units as the dependent variable.

## Next Steps

Start the next lesson, [Featurization]($./05-Featurization).

&copy; 2019 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>