
Commit

minor slide updates and created student script
bradleyboehmke committed Feb 21, 2019
1 parent 523945d commit ff1a689
Showing 4 changed files with 127 additions and 8 deletions.
Binary file modified .DS_Store
Binary file not shown.
8 changes: 4 additions & 4 deletions docs/03-supervised-modeling-process.Rmd
@@ -93,7 +93,7 @@ knitr::include_graphics("images/modeling_process.png")
]

---
# Prerequisites
# Prerequisites .red[`r anicon::faa("hand-point-right", color = "red", animate = "horizontal")` code chunk 1]

.pull-left[

@@ -169,7 +169,7 @@ knitr::include_graphics("images/nope.png")
- as $p \geq n$ (where *p* represents the number of features), larger sample sizes are often required to identify consistent signals in the features

---
# Mechanics of data splitting
# Mechanics of data splitting .red[`r anicon::faa("hand-point-right", color = "red", animate = "horizontal")` code chunk 2]

The two most common ways of splitting data are:

@@ -219,7 +219,7 @@ ggplot(train, aes(x = price)) +

---
class: yourturn
# Your Turn!
# Your Turn! .red[`r anicon::faa("hand-point-right", color = "red", animate = "horizontal")` code chunk 3]

1. Use __rsample__ to split the Ames housing data (`ames`) and the Employee attrition data (`churn`) using stratified sampling with an 80% split.

@@ -970,7 +970,7 @@ Let's put these pieces together and analyze the Ames housing data:


---
# Putting the process together
# Putting the process together .red[`r anicon::faa("hand-point-right", color = "red", animate = "horizontal")` code chunk 4]

.scrollable90[
.pull-left[
8 changes: 4 additions & 4 deletions docs/03-supervised-modeling-process.html
@@ -73,7 +73,7 @@
]

---
# Prerequisites
# Prerequisites .red[<span>&lt;i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"&gt;&lt;/i&gt;</span> code chunk 1]

.pull-left[

@@ -149,7 +149,7 @@
- as `\(p \geq n\)` (where *p* represents the number of features), larger sample sizes are often required to identify consistent signals in the features

---
# Mechanics of data splitting
# Mechanics of data splitting .red[<span>&lt;i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"&gt;&lt;/i&gt;</span> code chunk 2]

The two most common ways of splitting data are:

@@ -187,7 +187,7 @@

---
class: yourturn
# Your Turn!
# Your Turn! .red[<span>&lt;i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"&gt;&lt;/i&gt;</span> code chunk 3]

1. Use __rsample__ to split the Ames housing data (`ames`) and the Employee attrition data (`churn`) using stratified sampling with an 80% split.

@@ -807,7 +807,7 @@


---
# Putting the process together
# Putting the process together .red[<span>&lt;i class="fas fa-hand-point-right faa-horizontal animated " style=" color:red;"&gt;&lt;/i&gt;</span> code chunk 4]

.scrollable90[
.pull-left[
119 changes: 119 additions & 0 deletions student-scripts/03-supervised-modeling-process.Rmd
@@ -0,0 +1,119 @@
---
title: "Supervised Modeling Process"
output: html_notebook
---

# Prerequisites

```{r slide-4}
# Packages required
library(rsample)
library(caret)
library(tidyverse)
# Data required
## ames data
ames <- AmesHousing::make_ames()
## attrition data
churn <- rsample::attrition
```
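
A quick sanity check of the two data sets can follow here; this is a minimal sketch, assuming the packages above loaded without error (`glimpse()` comes from the tidyverse).

```{r}
# dimensions of each data set
dim(ames)
dim(churn)

# peek at the structure of the attrition data
glimpse(churn)
```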


# Mechanics of data splitting

The two most common ways of splitting data are:

* simple random sampling: randomly select observations
* stratified sampling: preserving distributions
   - classification: sampling within the classes to preserve the distribution of the outcome in the training and test sets
   - regression: determine the quartiles of the data set and sample within those artificial groups

```{r slide-8}
set.seed(123) # for reproducibility
split <- initial_split(diamonds, strata = "price", prop = 0.7)
train <- training(split)
test <- testing(split)
# Do the distributions line up?
ggplot(train, aes(x = price)) + 
  geom_line(stat = "density", trim = TRUE) + 
  geom_line(data = test, stat = "density", trim = TRUE, col = "red")
```
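
For comparison, here is a minimal sketch of the simple random sampling approach listed above: the same `initial_split()` call, just without the `strata` argument (the `diamonds` data comes from ggplot2, loaded via the tidyverse; the `_srs` object names are only illustrative).

```{r}
# simple random sampling: omit the strata argument
set.seed(123)  # for reproducibility
split_srs <- initial_split(diamonds, prop = 0.7)
train_srs <- training(split_srs)
test_srs  <- testing(split_srs)

# roughly 70% of the rows should land in the training set
nrow(train_srs) / nrow(diamonds)
```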


# Your Turn!

1. Use __rsample__ to split the Ames housing data (`ames`) and the Employee attrition data (`churn`) using stratified sampling with an 80% split.
2. Verify that the distributions of the training and test sets are similar.

```{r slide-9}
# ames data
set.seed(123)
ames_split <- initial_split(ames, prop = _____, strata = "Sale_Price")
ames_train <- training(_____)
ames_test <- testing(_____)
# attrition data
set.seed(123)
churn_split <- initial_split(churn, prop = _____, strata = "Attrition")
churn_train <- training(_____)
churn_test <- testing(_____)
```
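
One possible completion is sketched below, kept separate so the blanks above stay intact for the exercise; it assumes the 80% split stated in the prompt and reuses the density-plot check from the splitting chunk above.

```{r}
# ames data: 80% stratified split on Sale_Price
set.seed(123)
ames_split <- initial_split(ames, prop = 0.8, strata = "Sale_Price")
ames_train <- training(ames_split)
ames_test  <- testing(ames_split)

# attrition data: 80% stratified split on Attrition
set.seed(123)
churn_split <- initial_split(churn, prop = 0.8, strata = "Attrition")
churn_train <- training(churn_split)
churn_test  <- testing(churn_split)

# do the Sale_Price distributions line up?
ggplot(ames_train, aes(x = Sale_Price)) + 
  geom_line(stat = "density", trim = TRUE) + 
  geom_line(data = ames_test, stat = "density", trim = TRUE, col = "red")
```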


# Putting the process together

Let's put these pieces together and analyze the Ames housing data:

1. Split into training vs testing data
2. Specify a resampling procedure
3. Create our hyperparameter grid
4. Execute grid search
5. Evaluate performance

___This grid search takes ~2 min___

```{r slide-35}
# 1. stratified sampling with the rsample package
set.seed(123)
split <- initial_split(ames, prop = 0.7, strata = "Sale_Price")
ames_train <- training(split)
ames_test <- testing(split)
# 2. create a resampling method
cv <- trainControl(
  method = "repeatedcv", 
  number = 10, 
  repeats = 5
)
# 3. create a hyperparameter grid search
hyper_grid <- expand.grid(k = seq(2, 26, by = 2))
# 4. execute grid search with knn model
# use RMSE as preferred metric
knn_fit <- train(
  Sale_Price ~ ., 
  data = ames_train, 
  method = "knn", 
  trControl = cv, 
  tuneGrid = hyper_grid,
  metric = "RMSE"
)
# 5. evaluate results
# print model results
knn_fit
# plot cross validation results
ggplot(knn_fit$results, aes(k, RMSE)) + 
  geom_line() + 
  geom_point() +
  scale_y_continuous(labels = scales::dollar)
```
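
As a follow-up, here is a minimal sketch of checking how the tuned model generalizes to the held-out test set; it assumes the `knn_fit`, `ames_train`, and `ames_test` objects created in the chunk above and uses caret's `RMSE()` helper.

```{r}
# score the held-out test set with the tuned k-NN model
pred <- predict(knn_fit, newdata = ames_test)

# test set RMSE, to compare against the cross-validated RMSE above
RMSE(pred, ames_test$Sale_Price)
```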
