
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Cross-Validation Lab

**Objective**: *Assess your ability to apply cross-validated hyperparameter tuning to a model.*

In this lab, you will apply what you've learned in this lesson. When you are finished, use your answers to the exercises to complete the accompanying quiz in Coursera.

In [0]:
%run "../../Includes/Classroom-Setup"


## Exercise 1

In this exercise, you will create an enhanced user-level table to try to better predict whether or not each user takes at least *8,000* steps in a day. For this exercise, assume we only have access to heart rate information.

Fill in the blanks in the cell below to create the `adsda.ht_user_metrics_cv_lab` table.

<img alt="Side Note" title="Side Note" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.05em; transform:rotate(15deg)" src="https://files.training.databricks.com/static/images/icon-note.webp"/> Note that this lab is focused on predicting whether users take 8,000 steps per day rather than 10,000 steps per day.

In [0]:
%sql
-- ANSWER
CREATE OR REPLACE TABLE adsda.ht_user_metrics_cv_lab
USING DELTA LOCATION "/adsda/ht-user-metrics-cv-lab" AS (
  SELECT min(resting_heartrate) AS min_resting_heartrate,
         avg(resting_heartrate) AS avg_resting_heartrate,
         max(resting_heartrate) AS max_resting_heartrate,
         max(resting_heartrate) - min(resting_heartrate) AS resting_heartrate_change,
         min(active_heartrate) AS min_active_heartrate,
         avg(active_heartrate) AS avg_active_heartrate,
         max(active_heartrate) AS max_active_heartrate,
         max(active_heartrate) - min(active_heartrate) AS active_heartrate_change,
         CASE WHEN avg(steps) >= 8000 THEN 1 ELSE 0 END AS steps_8000
  FROM adsda.ht_daily_metrics
  GROUP BY device_id
)

**Coursera Quiz:** How many users in `adsda.ht_user_metrics_cv_lab` take, on average, at least 8,000 steps per day?

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer back to the previous lab for guidance on how to answer this question.
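
As a rough illustration of one way to count the users flagged by a 0/1 label column, the sketch below uses a small, made-up Pandas DataFrame. The name `toy_df` and all of its values are hypothetical; they are not taken from `adsda.ht_user_metrics_cv_lab`.

```python
import pandas as pd

# Hypothetical stand-in for the user-level table; every value is made up.
toy_df = pd.DataFrame({
    "avg_resting_heartrate": [61.2, 70.5, 58.9, 66.0, 73.4],
    "steps_8000": [1, 0, 1, 1, 0],
})

# The label is 0/1, so summing the column counts the users flagged as 1.
n_positive = int(toy_df["steps_8000"].sum())

# value_counts() reports the size of both classes at once.
class_counts = toy_df["steps_8000"].value_counts()

print(n_positive)
print(class_counts)
```

The same idea should carry over to the lab table once it has been loaded into a DataFrame.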

## Exercise 2

In this exercise, you will split your data into a cross-validation set (`cross_val_df`) and test set (`test_df`).

Fill in the blanks below to split your data.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer to the previous demo for guidance.
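
The sketch below shows how an 80/20 split behaves on a synthetic DataFrame with 100 rows. The names `toy_df`, `toy_cross_val_df`, and `toy_test_df` and the random data are assumptions made for illustration only; the lab itself uses the table created in Exercise 1.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic data: 100 rows, two features, and a binary label.
rng = np.random.default_rng(0)
toy_df = pd.DataFrame({
    "feature_a": rng.normal(size=100),
    "feature_b": rng.normal(size=100),
    "label": rng.integers(0, 2, size=100),
})

# Hold out 20% of the rows as a test set; the other 80% is kept for cross-validation.
toy_cross_val_df, toy_test_df = train_test_split(
    toy_df, train_size=0.80, test_size=0.20, random_state=42
)

print(toy_cross_val_df.shape, toy_test_df.shape)
```

Fixing `random_state` makes the split reproducible, which matters when quiz answers depend on the exact rows in each set.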

In [0]:
# ANSWER
from sklearn.model_selection import train_test_split

ht_user_metrics_pd_df = spark.table("adsda.ht_user_metrics_cv_lab").toPandas()

# Hold out 20% of users as the test set; the remaining 80% is used for cross-validation
cross_val_df, test_df = train_test_split(ht_user_metrics_pd_df, train_size=0.80, test_size=0.20, random_state=42)

**Coursera Quiz:** How many rows are in the `cross_val_df` DataFrame?

Fill in the blanks below to answer the question.

In [0]:
# ANSWER
cross_val_df.shape

## Exercise 3

In this exercise, you will prepare your random forest classifier.

Fill in the blanks below to complete the task.

In [0]:
# ANSWER
from sklearn.ensemble import RandomForestClassifier

rfc = RandomForestClassifier(random_state=42)

## Exercise 4

In this exercise, you will create a hyperparameter grid to use during the grid search process.

Use the following hyperparameter values:

1. `max_depth`: 5, 8, 20
1. `n_estimators`: 25, 50, 100
1. `min_samples_split`: 2, 4
1. `max_features`: 3, 4
1. `max_samples`: 0.6, 0.8

Fill in the blanks below to create the grid.

In [0]:
# ANSWER
parameter_grid = {
  "max_depth": [5, 8, 20],
  "n_estimators": [25, 50, 100],
  "min_samples_split": [2, 4],
  "max_features": [3, 4],
  "max_samples": [0.6, 0.8]
}

**Coursera Quiz:** How many total unique combinations of hyperparameters are there in `parameter_grid`?

Use the cell below to determine the answer to the above question.

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Refer to the previous lesson's lab for guidance.
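
One possible way to count grid combinations is scikit-learn's `ParameterGrid`, which enumerates every combination in a dictionary of candidate values. The sketch below uses a deliberately small, hypothetical grid (`toy_grid`) rather than the lab's `parameter_grid`.

```python
from sklearn.model_selection import ParameterGrid

# Hypothetical grid: 2 values for max_depth and 3 values for n_estimators.
toy_grid = {
    "max_depth": [2, 4],
    "n_estimators": [10, 20, 30],
}

# ParameterGrid expands the grid into every unique combination.
toy_combinations = list(ParameterGrid(toy_grid))
n_combinations = len(toy_combinations)

# The count equals the product of the list lengths: 2 * 3.
print(n_combinations)
print(toy_combinations[0])
```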

In [0]:
# ANSWER
# Multiply the number of candidate values for each hyperparameter
len(parameter_grid["max_depth"]) * len(parameter_grid["n_estimators"]) * len(parameter_grid["min_samples_split"]) * len(parameter_grid["max_features"]) * len(parameter_grid["max_samples"])

## Exercise 5

In this exercise, you will create the cross-validated grid-search object that you will use to optimize your hyperparameter values while using cross-validation.

Fill in the blanks below to create the object.

**Note:** Please use 4-fold cross-validation.

In [0]:
# ANSWER
from sklearn.model_selection import GridSearchCV

grid_search = GridSearchCV(estimator=rfc, cv=4, param_grid=parameter_grid)

## Exercise 6

In this exercise, you will fit the grid search process.

Fill in the blanks below to perform the grid search process.

In [0]:
# ANSWER
# Features are every column except the label; the label is steps_8000
grid_search.fit(cross_val_df.drop("steps_8000", axis=1), cross_val_df["steps_8000"])

**Coursera Quiz:** How many unique models are being trained by the cross-validated grid search process?

* 4
* 289
* 21
* 288

<img alt="Hint" title="Hint" style="vertical-align: text-bottom; position: relative; height:1.75em; top:0.3em" src="https://files.training.databricks.com/static/images/icon-light-bulb.svg"/>&nbsp;**Hint:** Consider the number of unique hyperparameter combinations, the number of cross-validation folds, and the final retraining of the model on the entire cross-validation set.
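
To make the counting in the hint concrete, the sketch below fits a small grid search on synthetic data and tallies the model fits. The grid, the 3-fold setting, and the names `toy_grid` and `toy_search` are assumptions chosen to keep the example fast; they differ from the lab's settings on purpose.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data.
X, y = make_classification(n_samples=120, n_features=6, random_state=42)

# Hypothetical grid: 2 * 2 = 4 hyperparameter combinations.
toy_grid = {"max_depth": [2, 4], "n_estimators": [5, 10]}

toy_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid=toy_grid,
    cv=3,  # 3 folds here, only to keep the sketch quick
)
toy_search.fit(X, y)

# One model is fit per combination per fold, plus one final refit
# on all of the data passed to fit() (refit=True is the default).
n_candidates = len(toy_search.cv_results_["params"])
n_folds = toy_search.n_splits_
total_fits = n_candidates * n_folds + 1

print(n_candidates, n_folds, total_fits)
```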

## Exercise 7

In this exercise, you will return a Pandas DataFrame of the `grid_search` results.

Fill in the blanks below to return the DataFrame.

In [0]:
# ANSWER
import pandas as pd
pd.DataFrame(grid_search.cv_results_)
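
The full `cv_results_` table has many columns, so it can help to keep only a few and sort by rank. The sketch below does this for a small, hypothetical search on synthetic data; `toy_search` and `summary_df` are illustrative names, not part of the lab.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic data and a small hypothetical grid, used only for illustration.
X, y = make_classification(n_samples=120, n_features=6, random_state=42)
toy_search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=42),
    param_grid={"max_depth": [2, 4], "n_estimators": [5, 10]},
    cv=3,
)
toy_search.fit(X, y)

# One row per hyperparameter combination.
results_df = pd.DataFrame(toy_search.cv_results_)

# Keep the most informative columns and put the best-ranked combination first.
summary_df = results_df[
    ["params", "mean_test_score", "std_test_score", "rank_test_score"]
].sort_values("rank_test_score")

print(summary_df)
```

The top row's `mean_test_score` should match `toy_search.best_score_`, which is the mean cross-validated score of the best combination.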

## Exercise 8

In this exercise, you will identify the optimal hyperparameter values.

Fill in the blanks below to identify the optimal hyperparameter values.

In [0]:
# ANSWER
grid_search.best_params_

**Coursera Quiz:** What is the optimal hyperparameter value for `max_depth` according to the cross-validated grid search process?

## Exercise 9

In this exercise, you will identify the test accuracy achieved by the final, refit model.

Fill in the blanks below to identify the test accuracy.

In [0]:
# ANSWER
from sklearn.metrics import accuracy_score

# grid_search.predict uses the best estimator, refit on the full cross-validation set
accuracy_score(
  test_df["steps_8000"],
  grid_search.predict(test_df.drop("steps_8000", axis=1))
)

**Coursera Quiz:** What is the test set accuracy?
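
As a reminder of what accuracy measures, the sketch below compares `accuracy_score` against a manual calculation on made-up labels. The arrays `y_true` and `y_pred` are hypothetical and unrelated to the lab's test set.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up true labels and predictions for eight users.
y_true = np.array([1, 0, 1, 1, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# accuracy_score takes the true labels first, then the predictions.
acc = accuracy_score(y_true, y_pred)

# Accuracy is the fraction of predictions that match the true labels.
manual_acc = float(np.mean(y_true == y_pred))

print(acc, manual_acc)
```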

Congratulations! That concludes our lesson on cross-validated hyperparameter optimization and our course!

Be sure to submit your quiz answers to Coursera to fully complete the course!

&copy; 2020 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>