diff --git a/_freeze/archive/2023-07-nyr/01-introduction/execute-results/html.json b/_freeze/archive/2023-07-nyr/01-introduction/execute-results/html.json new file mode 100644 index 00000000..b2c733c7 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/01-introduction/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "3ea97871f74836d15d22db5ec0939940", + "result": { + "markdown": "---\ntitle: \"1 - Introduction\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n---\n\n\n\n\n::: r-fit-text\nWelcome!\n:::\n\n## Who are you?\n\n- You can use the magrittr `%>%` or base R `|>` pipe\n\n- You are familiar with functions from dplyr, tidyr, ggplot2\n\n- You have exposure to basic statistical concepts\n\n- You do **not** need intermediate or expert familiarity with modeling or ML\n\n## Who are tidymodels?\n\n- Simon Couch\n- Hannah Frick\n- Emil Hvitfeldt\n- Max Kuhn\n\n. . .\n\nMany thanks to Davis Vaughan, Julia Silge, David Robinson, Julie Jung, Alison Hill, and DesirΓ©e De Leon for their role in creating these materials!\n\n## Asking for help\n\n. . .\n\nπŸŸͺ \"I'm stuck and need help!\"\n\n. . .\n\n🟩 \"I finished the exercise\"\n\n\n## πŸ‘€ {.annotation}\n\n![](images/pointing.svg){.absolute top=\"0\" right=\"0\"}\n\n## Tentative plan for this workshop\n\n::: columns\n::: {.column width=\"50%\"}\n- *Today:* \n\n - Your data budget\n - What makes a model\n - Evaluating models\n:::\n::: {.column width=\"50%\"}\n- *Tomorrow:*\n \n - Feature engineering\n - Tuning hyperparameters\n - Racing methods\n - Iterative search methods\n:::\n:::\n\n## {.center}\n\n### Introduce yourself to your neighbors πŸ‘‹\n\n

\n\nCheck Slack (`#ml-ws-2023`) for an RStudio Cloud link.\n\n## What is machine learning?\n\n![](https://imgs.xkcd.com/comics/machine_learning.png){fig-align=\"center\"}\n\n::: footer\n\n:::\n\n## What is machine learning?\n\n![](images/what_is_ml.jpg){fig-align=\"center\"}\n\n::: footer\nIllustration credit: \n:::\n\n## What is machine learning?\n\n![](images/ml_illustration.jpg){fig-align=\"center\"}\n\n::: footer\nIllustration credit: \n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n. . .\n\n*How are statistics and machine learning related?*\n\n*How are they similar? Different?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n03:00\n
\n```\n:::\n:::\n\n\n::: notes\nthe \"two cultures\"\n\nmodel first vs. data first\n\ninference vs. prediction\n:::\n\n## What is tidymodels? ![](hexes/tidymodels.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n#> ── Attaching packages ──────────────────────────── tidymodels 1.1.0 ──\n#> βœ” broom 1.0.5 βœ” rsample 1.1.1.9000\n#> βœ” dials 1.2.0 βœ” tibble 3.2.1 \n#> βœ” dplyr 1.1.2 βœ” tidyr 1.3.0 \n#> βœ” infer 1.0.4 βœ” tune 1.1.1.9001\n#> βœ” modeldata 1.1.0 βœ” workflows 1.1.3 \n#> βœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \n#> βœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n#> βœ” recipes 1.0.6\n#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──\n#> βœ– purrr::discard() masks scales::discard()\n#> βœ– dplyr::filter() masks stats::filter()\n#> βœ– dplyr::lag() masks stats::lag()\n#> βœ– recipes::step() masks stats::step()\n#> β€’ Use tidymodels_prefer() to resolve common conflicts.\n```\n:::\n\n\n## {background-image=\"images/tm-org.png\" background-size=\"contain\"}\n\n## The whole game\n\nPart of any modelling process is\n\n* Splitting your data into training and test set\n* Using a resampling scheme\n* Fitting models\n* Assessing performance\n* Choosing a model\n* Fitting and assessing the final model\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-split.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-model-1.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n:::notes\nStress that we are **not** fitting a model on the entire training set other than for illustrative purposes in deck 2.\n:::\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-model-n.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-resamples.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-select.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-final-fit.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-final-performance.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n\n## Let's install some packages\n\nIf you are using your own laptop instead of RStudio Cloud:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"pak\")\n\npkgs <- c(\"bonsai\", \"doParallel\", \"embed\", \"finetune\", \"lightgbm\", \"lme4\", \n \"parallelly\", \"plumber\", \"probably\", \"ranger\", \"rpart\", \"rpart.plot\", \n \"stacks\", \"textrecipes\", \"tidymodels\", \"tidymodels/modeldatatoo\", \n \"vetiver\")\npak::pak(pkgs)\n```\n:::\n\n\n. . 
.\n\nCheck Slack (`#ml-ws-2023`) for an RStudio Cloud link.\n\n\n## Our versions\n\n\n::: {.cell}\n\n:::\n\n\nbonsai (0.2.1.9000, Github (tidymodels/bonsai@aab79), broom (1.0.5, local), dials (1.2.0, CRAN), doParallel (1.0.17, CRAN), dplyr (1.1.2, CRAN), embed (1.0.0, CRAN), finetune (1.1.0.9000, Github (tidymodels/finetune@52d), ggplot2 (3.4.2, CRAN), lightgbm (3.3.5, CRAN), lme4 (1.1-33, CRAN), modeldata (1.1.0, CRAN), modeldatatoo (0.1.0.9000, Github (tidymodels/modeldatatoo), parallelly (1.36.0, CRAN), parsnip (1.1.0.9003, Github (tidymodels/parsnip@e627), plumber (1.2.1, CRAN), probably (1.0.2, CRAN), purrr (1.0.1, CRAN), ranger (0.15.1, CRAN), recipes (1.0.6, CRAN), rpart (4.1.19, CRAN), rpart.plot (3.1.1, CRAN), rsample (1.1.1.9000, Github (tidymodels/rsample@afc4), scales (1.2.1, CRAN), stacks (1.0.2.9000, local), textrecipes (1.0.2, CRAN), tibble (3.2.1, CRAN), tidymodels (1.1.0, CRAN), tidyr (1.3.0, CRAN), tune (1.1.1.9001, Github (tidymodels/tune@fea8b02), vetiver (0.2.0, CRAN), workflows (1.1.3, CRAN), workflowsets (1.0.1, CRAN), yardstick (1.2.0.9001, Github (tidymodels/yardstick@6c), and Quarto (1.3.433)\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/02-data-budget/execute-results/html.json b/_freeze/archive/2023-07-nyr/02-data-budget/execute-results/html.json new file mode 100644 index 00000000..3dd24596 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/02-data-budget/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "1bdca46c14f3060519ecc1b8ebabc623", + "result": { + "markdown": "---\ntitle: \"2 - Your data budget\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n## {background-image=\"https://media.giphy.com/media/Lr3UeH9tYu3qJtsSUg/giphy.gif\" background-size=\"40%\"}\n\n\n## Data on Chicago taxi trips\n\n::: columns\n::: {.column width=\"60%\"}\n- The city of Chicago releases anonymized trip-level data on taxi trips in the city.\n- We pulled a sample of 10,000 rides occurring in early 2022.\n- Type `?modeldatatoo::data_taxi()` to learn more about this dataset, including references.\n:::\n\n::: {.column width=\"40%\"}\n![](images/taxi_spinning.svg)\n:::\n\n:::\n\n::: footer\nCredit: \n:::\n\n## Which of these variables can we use?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\n\ntaxi <- data_taxi()\n\nnames(taxi)\n#> [1] \"tip\" \"id\" \"duration\" \"distance\" \"fare\" \n#> [6] \"tolls\" \"extras\" \"total_cost\" \"payment_type\" \"company\" \n#> [11] \"local\" \"dow\" \"month\" \"hour\"\n```\n:::\n\n\n## Checklist for predictors\n\n- Is it ethical to use this variable? 
(Or even legal?)\n\n- Will this variable be available at prediction time?\n\n- Does this variable contribute to explainability?\n\n\n## Data on Chicago taxi trips\n\nWe are using a slightly modified version from the modeldatatoo data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi <- taxi %>%\n mutate(month = factor(month, levels = c(\"Jan\", \"Feb\", \"Mar\", \"Apr\"))) %>% \n select(-c(id, duration, fare, tolls, extras, total_cost, payment_type)) %>% \n drop_na()\n```\n:::\n\n\n## Data on Chicago taxi trips\n\n::: columns\n::: {.column width=\"60%\"}\n- `N = 10,000`\n- A nominal outcome, `tip`, with levels `\"yes\"` and `\"no\"`\n- 6 other variables\n - `company`, `local`, and `dow`, and `month` are **nominal** predictors\n - `distance` and `hours` are **numeric** predictors\n:::\n\n::: {.column width=\"40%\"}\n![](images/taxi.png)\n:::\n:::\n\n::: footer\nCredit: \n:::\n\n:::notes\n`tip`: Whether the rider left a tip. A factor with levels \"yes\" and \"no\".\n\n`distance`: The trip distance, in odometer miles.\n\n`company`: The taxi company, as a factor. Companies that occurred few times were binned as \"other\".\n\n`local`: Whether the trip started in the same community area as it began. See the source data for community area values.\n\n`dow`: The day of the week in which the trip began, as a factor.\n\n`month`: The month in which the trip began, as a factor.\n\n`hour`: The hour of the day in which the trip began, as a numeric.\n\n:::\n\n## Data on Chicago taxi trips\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi\n#> # A tibble: 8,807 Γ— 7\n#> tip distance company local dow month hour\n#> \n#> 1 yes 1.24 Sun Taxi no Thu Feb 13\n#> 2 no 5.39 Flash Cab no Sat Mar 12\n#> 3 yes 3.01 City Service no Wed Feb 17\n#> 4 no 18.4 Sun Taxi no Sat Apr 6\n#> 5 yes 1.76 Sun Taxi no Sun Jan 15\n#> 6 yes 13.6 Sun Taxi no Mon Feb 17\n#> 7 yes 3.71 City Service no Mon Mar 21\n#> 8 yes 4.8 other no Tue Mar 9\n#> 9 yes 18.0 City Service no Fri Jan 19\n#> 10 no 17.5 other yes Thu Apr 12\n#> # β„Ή 8,797 more rows\n```\n:::\n\n\n\n## Data splitting and spending\n\nFor machine learning, we typically split data into training and test sets:\n\n. . .\n\n- The **training set** is used to estimate model parameters.\n- The **test set** is used to find an independent assessment of model performance.\n\n. . .\n\nDo not 🚫 use the test set during training.\n\n## Data splitting and spending\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/test-train-split-1.svg)\n:::\n:::\n\n\n# The more data
we spend πŸ€‘, the better estimates
we'll get.\n\n## Data splitting and spending\n\n- Spending too much data in **training** prevents us from computing a good assessment of predictive **performance**.\n\n. . .\n\n- Spending too much data in **testing** prevents us from computing a good estimate of model **parameters**.\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*When is a good time to split your data?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n03:00\n
\n```\n:::\n:::\n\n\n# The testing data is precious πŸ’Ž\n\n## The initial split ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_split <- initial_split(taxi)\ntaxi_split\n#> \n#> <6605/2202/8807>\n```\n:::\n\n\n:::notes\nHow much data in training vs testing?\nThis function uses a good default, but this depends on your specific goal/data\nWe will talk about more powerful ways of splitting, like stratification, later\n:::\n\n## Accessing the data ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_train <- training(taxi_split)\ntaxi_test <- testing(taxi_split)\n```\n:::\n\n\n## The training set![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_train\n#> # A tibble: 6,605 Γ— 7\n#> tip distance company local dow month hour\n#> \n#> 1 yes 4.54 City Service no Sat Mar 16\n#> 2 no 10.2 Flash Cab no Mon Feb 8\n#> 3 yes 12.4 other no Sun Apr 15\n#> 4 yes 15.3 Sun Taxi no Mon Apr 18\n#> 5 no 6.41 Flash Cab no Wed Apr 14\n#> 6 yes 1.56 other no Tue Jan 13\n#> 7 yes 3.13 Flash Cab no Sun Apr 12\n#> 8 yes 7.54 other no Tue Apr 8\n#> 9 yes 6.98 Flash Cab no Tue Apr 5\n#> 10 yes 0.7 Taxi Affiliation Services no Tue Jan 9\n#> # β„Ή 6,595 more rows\n```\n:::\n\n\n## The test set ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nπŸ™ˆ\n\n. . .\n\nThere are 2202 rows and 7 columns in the test set.\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Split your data so 20% is held out for the test set.*\n\n*Try out different values in `set.seed()` to see how the results change.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n05:00\n
\n```\n:::\n:::\n\n\n## Data splitting and spending ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_split <- initial_split(taxi, prop = 0.8)\ntaxi_train <- training(taxi_split)\ntaxi_test <- testing(taxi_split)\n\nnrow(taxi_train)\n#> [1] 7045\nnrow(taxi_test)\n#> [1] 1762\n```\n:::\n\n\n# What about a validation set?\n\n## {background-color=\"white\" background-image=\"https://www.tmwr.org/premade/validation.svg\" background-size=\"50%\"}\n\n:::notes\nWe will use this tomorrow\n:::\n\n## {background-color=\"white\" background-image=\"https://www.tmwr.org/premade/validation-alt.svg\" background-size=\"40%\"}\n\n# Exploratory data analysis for ML 🧐\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Explore the `taxi_train` data on your own!*\n\n* *What's the distribution of the outcome, tip?*\n* *What's the distribution of numeric variables like distance?*\n* *How does tip differ across the categorical variables?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n08:00\n
\n```\n:::\n:::\n\n\n::: notes\nMake a plot or summary and then share with neighbor\n:::\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n ggplot(aes(x = tip)) +\n geom_bar()\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-counts-1.svg){fig-align='center'}\n:::\n:::\n\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n ggplot(aes(x = tip, fill = local)) +\n geom_bar() +\n scale_fill_viridis_d(end = .5)\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-by-local-1.svg){fig-align='center'}\n:::\n:::\n\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n mutate(tip = forcats::fct_rev(tip)) %>% \n ggplot(aes(x = hour, fill = tip)) +\n geom_bar()\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-by-hour-1.svg){fig-align='center'}\n:::\n:::\n\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n mutate(tip = forcats::fct_rev(tip)) %>% \n ggplot(aes(x = hour, fill = tip)) +\n geom_bar(position = \"fill\")\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-by-hour-fill-1.svg){fig-align='center'}\n:::\n:::\n\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n mutate(tip = forcats::fct_rev(tip)) %>% \n ggplot(aes(x = distance)) +\n geom_histogram(bins = 100) +\n facet_grid(vars(tip))\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-by-distance-1.svg){fig-align='center'}\n:::\n:::\n\n\n# Split smarter\n\n##\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/taxi-tip-pct-1.svg)\n:::\n:::\n\n\nStratified sampling would split within response values\n\n:::notes\nBased on our EDA, we know that the source data contains fewer `\"no\"` tip values than `\"yes\"`. 
We want to make sure we allot equal proportions of those responses so that both the training and testing data have enough of each to give accurate estimates.\n:::\n\n## Stratification\n\nUse `strata = tip`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_split <- initial_split(taxi, prop = 0.8, strata = tip)\ntaxi_split\n#> \n#> <7045/1762/8807>\n```\n:::\n\n\n## Stratification\n\nStratification often helps, with very little downside\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/taxi-tip-pct-by-split-1.svg)\n:::\n:::\n\n\n## The whole game - status update\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-split.jpg){fig-align='center' width=3543}\n:::\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/03-what-makes-a-model/execute-results/html.json b/_freeze/archive/2023-07-nyr/03-what-makes-a-model/execute-results/html.json new file mode 100644 index 00000000..7b494928 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/03-what-makes-a-model/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "f763c4ce44cf1dec8e9b54c63d259ee8", + "result": { + "markdown": "---\ntitle: \"3 - What makes a model?\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*How do you fit a linear model in R?*\n\n*How many different ways can you think of?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n03:00\n
\n```\n:::\n:::\n\n\n. . .\n\n- `lm` for linear model\n\n- `glm` for generalized linear model (e.g. logistic regression)\n\n- `glmnet` for regularized regression\n\n- `keras` for regression using TensorFlow\n\n- `stan` for Bayesian regression\n\n- `spark` for large data sets\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n. . .\n\n::: columns\n::: {.column width=\"40%\"}\n- Choose a [model]{.underline}\n- Specify an engine\n- Set the mode\n:::\n\n::: {.column width=\"60%\"}\n![](images/taxi_spinning.svg)\n:::\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlogistic_reg()\n#> Logistic Regression Model Specification (classification)\n#> \n#> Computational engine: glm\n```\n:::\n\n\n\n:::notes\nModels have default engines\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"40%\"}\n- Choose a model\n- Specify an [engine]{.underline}\n- Set the mode\n:::\n\n::: {.column width=\"60%\"}\n![](images/taxi_spinning.svg)\n:::\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogistic_reg() %>%\n set_engine(\"glmnet\")\n#> Logistic Regression Model Specification (classification)\n#> \n#> Computational engine: glmnet\n```\n:::\n\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogistic_reg() %>%\n set_engine(\"stan\")\n#> Logistic Regression Model Specification (classification)\n#> \n#> Computational engine: stan\n```\n:::\n\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"40%\"}\n- Choose a model\n- Specify an engine\n- Set the [mode]{.underline}\n:::\n\n::: {.column width=\"60%\"}\n![](images/taxi_spinning.svg)\n:::\n:::\n\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndecision_tree()\n#> Decision Tree Model Specification (unknown mode)\n#> \n#> Computational engine: rpart\n```\n:::\n\n\n:::notes\nSome models have a default mode\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndecision_tree() %>% \n set_mode(\"classification\")\n#> Decision Tree Model Specification (classification)\n#> \n#> Computational engine: rpart\n```\n:::\n\n\n. . .\n\n

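Putting the three choices together, a minimal sketch of a complete specification (assuming the tidymodels meta-package is attached; rpart is the engine shown above):\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\n# choose a model, specify an engine, set the mode\ntree_spec <-\n  decision_tree() %>%\n  set_engine(\"rpart\") %>%\n  set_mode(\"classification\")\n\ntree_spec\n```\n:::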
\n\n::: r-fit-text\nAll available models are listed at \n:::\n\n## {background-iframe=\"https://www.tidymodels.org/find/parsnip/\"}\n\n::: footer\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"40%\"}\n- Choose a [model]{.underline}\n- Specify an [engine]{.underline}\n- Set the [mode]{.underline}\n:::\n\n::: {.column width=\"60%\"}\n![](images/taxi_spinning.svg)\n:::\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run the `tree_spec` chunk in your `.qmd`.*\n\n*Edit this code to use a different model.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n05:00\n
\n```\n:::\n:::\n\n\n
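One possible edit for this exercise is sketched below, swapping in the random forest model that appears later in this workshop (the ranger engine package is only needed once you fit):\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\n# swap the decision tree for a random forest specification\nrf_spec <-\n  rand_forest(trees = 1000) %>%\n  set_engine(\"ranger\") %>%\n  set_mode(\"classification\")\n\nrf_spec\n```\n:::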

\n\n::: r-fit-text\nAll available models are listed at \n:::\n\n## Models we'll be using today\n\n* Logistic regression\n* Decision trees\n\n## Logistic regression\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-10-1.svg)\n:::\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n:::\n:::\n\n## Logistic regression\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-11-1.svg)\n:::\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n:::\n:::\n\n## Logistic regression\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-12-1.svg)\n:::\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n\n- Logit of outcome probability modeled as linear combination of predictors:\n\n$log(\\frac{p}{1 - p}) = \\beta_0 + \\beta_1\\cdot \\text{distance}$\n\n- Find a sigmoid line that separates the two classes\n\n:::\n:::\n\n## Decision trees\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-14-1.svg){fig-align='center'}\n:::\n:::\n\n\n:::\n\n::: {.column width=\"50%\"}\n:::\n:::\n\n## Decision trees\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-15-1.svg){fig-align='center'}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n- Series of splits or if/then statements based on predictors\n\n- First the tree *grows* until some condition is met (maximum depth, no more data)\n\n- Then the tree is *pruned* to reduce its complexity\n:::\n:::\n\n## Decision trees\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-16-1.svg){fig-align='center'}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-17-1.svg)\n:::\n:::\n\n:::\n:::\n\n## All models are wrong, but some are useful!\n\n::: columns\n::: {.column width=\"50%\"}\n### Logistic regression\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-18-1.svg)\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n### Decision trees\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-19-1.svg)\n:::\n:::\n\n:::\n:::\n\n# A model workflow\n\n## Workflows bind preprocessors and models\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/good_workflow.png){fig-align='center' width=70%}\n:::\n:::\n\n\n:::notes\nExplain that PCA that is a preprocessor / dimensionality reduction, used to decorrelate data\n:::\n\n\n## What is wrong with this? {.annotation}\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/bad_workflow.png){fig-align='center' width=70%}\n:::\n:::\n\n\n## Why a `workflow()`? ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n. . .\n\n- Workflows handle new data better than base R tools in terms of new factor levels\n\n. . .\n\n- You can use other preprocessors besides formulas (more on feature engineering tomorrow!)\n\n. . .\n\n- They can help organize your work when working with multiple models\n\n. . 
.\n\n- [Most importantly]{.underline}, a workflow captures the entire modeling process: `fit()` and `predict()` apply to the preprocessing steps in addition to the actual model fit\n\n::: notes\nTwo ways workflows handle levels better than base R:\n\n- Enforces that new levels are not allowed at prediction time (this is an optional check that can be turned off)\n\n- Restores missing levels that were present at fit time, but happen to be missing at prediction time (like, if your \"new\" data just doesn't have an instance of that level)\n:::\n\n## A model workflow ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\ntree_spec %>% \n fit(tip ~ ., data = taxi_train) \n#> parsnip model object\n#> \n#> n= 7045 \n#> \n#> node), split, n, loss, yval, (yprob)\n#> * denotes terminal node\n#> \n#> 1) root 7045 2069 yes (0.70631654 0.29368346) \n#> 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n#> 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n#> 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n#> 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n#> 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n#> 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n#> 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n#> 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n#> 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n#> 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n#> 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n#> 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n#> 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n#> 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n\n\n## A model workflow ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\nworkflow() %>%\n add_formula(tip ~ .) 
%>%\n add_model(tree_spec) %>%\n fit(data = taxi_train) \n#> ══ Workflow [trained] ════════════════════════════════════════════════\n#> Preprocessor: Formula\n#> Model: decision_tree()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> tip ~ .\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> n= 7045 \n#> \n#> node), split, n, loss, yval, (yprob)\n#> * denotes terminal node\n#> \n#> 1) root 7045 2069 yes (0.70631654 0.29368346) \n#> 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n#> 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n#> 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n#> 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n#> 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n#> 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n#> 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n#> 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n#> 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n#> 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n#> 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n#> 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n#> 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n#> 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n\n\n## A model workflow ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\nworkflow(tip ~ ., tree_spec) %>% \n fit(data = taxi_train) \n#> ══ Workflow [trained] ════════════════════════════════════════════════\n#> Preprocessor: Formula\n#> Model: decision_tree()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> tip ~ .\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> n= 7045 \n#> \n#> node), split, n, loss, yval, (yprob)\n#> * denotes terminal node\n#> \n#> 1) root 7045 2069 yes (0.70631654 0.29368346) \n#> 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n#> 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n#> 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n#> 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n#> 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n#> 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n#> 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n#> 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n#> 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n#> 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n#> 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n#> 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n#> 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n#> 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run the `tree_wflow` chunk in your 
`.qmd`.*\n\n*Edit this code to make a workflow with your own model of choice.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n05:00\n
\n```\n:::\n:::\n\n\n## Predict with your model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nHow do you use your new `tree_fit` model?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\ntree_fit <-\n workflow(tip ~ ., tree_spec) %>% \n fit(data = taxi_train) \n```\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run:*\n\n`predict(tree_fit, new_data = taxi_test)`\n\n*What do you get?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n03:00\n
\n```\n:::\n:::\n\n\n## Your turn\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run:*\n\n`augment(tree_fit, new_data = taxi_test)`\n\n*What do you get?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n03:00\n
\n```\n:::\n:::\n\n\n# The tidymodels prediction guarantee!\n\n. . .\n\n- The predictions will always be inside a **tibble**\n- The column names and types are **unsurprising** and **predictable**\n- The number of rows in `new_data` and the output **are the same**\n\n## Understand your model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nHow do you **understand** your new `tree_fit` model?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-27-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Understand your model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nHow do you **understand** your new `tree_fit` model?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rpart.plot)\ntree_fit %>%\n extract_fit_engine() %>%\n rpart.plot(roundint = FALSE)\n```\n:::\n\n\nYou can `extract_*()` several components of your fitted workflow.\n\n::: notes\n`roundint = FALSE` is only to quiet a warning\n:::\n\n\n## Understand your model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nHow do you **understand** your new `tree_fit` model?\n\n. . .\n\nYou can use your fitted workflow for model and/or prediction explanations:\n\n. . .\n\n- overall variable importance, such as with the [vip](https://koalaverse.github.io/vip/) package\n\n. . .\n\n- flexible model explainers, such as with the [DALEXtra](https://dalex.drwhy.ai/) package\n\n. . .\n\nLearn more at \n\n## {background-iframe=\"https://hardhat.tidymodels.org/reference/hardhat-extract.html\"}\n\n::: footer\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Extract the model engine object from your fitted workflow.*\n\n⚠️ *Never `predict()` with any extracted components!*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n05:00\n
\n```\n:::\n:::\n\n\n:::notes\nAfterward, ask what kind of object people got from the extraction, and what they did with it (e.g. give it to `summary()`, `plot()`, `broom::tidy()` ). Live code along\n:::\n\n# Deploy your model ![](hexes/vetiver.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n## {background-image=\"https://vetiver.rstudio.com/images/ml_ops_cycle.png\" background-size=\"contain\"}\n\n## Deploying a model ![](hexes/vetiver.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nHow do you use your new `tree_fit` model in **production**?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(vetiver)\nv <- vetiver_model(tree_fit, \"taxi\")\nv\n#> \n#> ── taxi ─ model for deployment \n#> A rpart classification modeling workflow using 6 features\n```\n:::\n\n\nLearn more at \n\n## Deploy your model ![](hexes/vetiver.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nHow do you use your new model `tree_fit` in **production**?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(plumber)\npr() %>%\n vetiver_api(v)\n#> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.\n#> # Use `pr_run()` on this object to start the API.\n#> β”œβ”€β”€[queryString]\n#> β”œβ”€β”€[body]\n#> β”œβ”€β”€[cookieParser]\n#> β”œβ”€β”€[sharedSecret]\n#> β”œβ”€β”€/logo\n#> β”‚ β”‚ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/vetiver\n#> β”œβ”€β”€/ping (GET)\n#> └──/predict (POST)\n```\n:::\n\n\nLearn more at \n\n:::notes\nLive-code making a prediction\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run the `vetiver` chunk in your `.qmd`.*\n\n*Check out the automated visual documentation.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n05:00\n
\n```\n:::\n:::\n\n\n## The whole game - status update\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-model-1.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n:::notes\nStress that fitting a model on the entire training set was only for illustrating how to fit a model\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/04-evaluating-models/execute-results/html.json b/_freeze/archive/2023-07-nyr/04-evaluating-models/execute-results/html.json new file mode 100644 index 00000000..325a6fd2 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/04-evaluating-models/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "385a477f21d39283e40c50ca0fb00615", + "result": { + "markdown": "---\ntitle: \"4 - Evaluating models\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n## Looking at predictions\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n relocate(tip, .pred_class, .pred_yes, .pred_no)\n#> # A tibble: 7,045 Γ— 10\n#> tip .pred_class .pred_yes .pred_no distance company local dow month hour\n#> \n#> 1 no no 0.0625 0.937 5.39 Flash … no Sat Mar 12\n#> 2 no yes 0.924 0.0758 18.4 Sun Ta… no Sat Apr 6\n#> 3 no no 0.391 0.609 5.8 other no Tue Jan 10\n#> 4 no no 0.112 0.888 6.85 Flash … no Fri Apr 8\n#> 5 no no 0.129 0.871 9.5 City S… no Wed Jan 7\n#> 6 no no 0.326 0.674 12 other no Fri Apr 11\n#> 7 no no 0.0917 0.908 8.9 Taxi A… no Mon Feb 14\n#> 8 no yes 0.902 0.0980 1.38 other no Fri Apr 16\n#> 9 no no 0.0917 0.908 9.12 Flash … no Wed Apr 9\n#> 10 no yes 0.933 0.0668 2.28 City S… no Thu Apr 16\n#> # β„Ή 7,035 more rows\n```\n:::\n\n\n## Confusion matrix ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/confusion-matrix.png)\n\n## Confusion matrix ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n conf_mat(truth = tip, estimate = .pred_class)\n#> Truth\n#> Prediction yes no\n#> yes 4639 660\n#> no 337 1409\n```\n:::\n\n\n## Confusion matrix ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n conf_mat(truth = tip, estimate = .pred_class) %>%\n autoplot(type = \"heatmap\")\n```\n\n::: {.cell-output-display}\n![](figures/unnamed-chunk-5-1.svg)\n:::\n:::\n\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n accuracy(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n```\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n![](images/confusion-matrix-accuracy.png)\n:::\n:::\n\n## Dangers of accuracy 
![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe need to be careful of using `accuracy()` since it can give \"good\" performance by only predicting one way with imbalanced data\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n mutate(.pred_class = factor(\"yes\", levels = c(\"yes\", \"no\"))) %>%\n accuracy(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.706\n```\n:::\n\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n sensitivity(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 sensitivity binary 0.932\n```\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n![](images/confusion-matrix-sensitivity.png)\n:::\n:::\n\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3-6\"}\naugment(taxi_fit, new_data = taxi_train) %>%\n sensitivity(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 sensitivity binary 0.932\n```\n:::\n\n\n
\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n specificity(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 specificity binary 0.681\n```\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n![](images/confusion-matrix-specificity.png)\n:::\n:::\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe can use `metric_set()` to combine multiple calculations into one\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_metrics <- metric_set(accuracy, specificity, sensitivity)\n\naugment(taxi_fit, new_data = taxi_train) %>%\n taxi_metrics(truth = tip, estimate = .pred_class)\n#> # A tibble: 3 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n#> 2 specificity binary 0.681\n#> 3 sensitivity binary 0.932\n```\n:::\n\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_metrics <- metric_set(accuracy, specificity, sensitivity)\n\naugment(taxi_fit, new_data = taxi_train) %>%\n group_by(local) %>%\n taxi_metrics(truth = tip, estimate = .pred_class)\n#> # A tibble: 6 Γ— 4\n#> local .metric .estimator .estimate\n#> \n#> 1 yes accuracy binary 0.840\n#> 2 no accuracy binary 0.862\n#> 3 yes specificity binary 0.346\n#> 4 no specificity binary 0.719\n#> 5 yes sensitivity binary 0.969\n#> 6 no sensitivity binary 0.925\n```\n:::\n\n\n## Two class data\n\nThese metrics assume that we know the threshold for converting \"soft\" probability predictions into \"hard\" class predictions.\n\n. . .\n\nIs a 50% threshold good? \n\nWhat happens if we say that we need to be 80% sure to declare an event?\n\n- sensitivity ⬇️, specificity ⬆️\n\n. . .\n\nWhat happens for a 20% threshold?\n\n- sensitivity ⬆️, specificity ⬇️\n\n## Varying the threshold\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/thresholds-1.svg)\n:::\n:::\n\n\n## ROC curves\n\nTo make an ROC (receiver operator characteristic) curve, we:\n\n- calculate the sensitivity and specificity for all possible thresholds\n\n- plot false positive rate (x-axis) versus true positive rate (y-axis)\n\ngiven that sensitivity is the true positive rate, and specificity is the true negative rate. Hence `1 - specificity` is the false positive rate.\n\n. . 
.\n\nWe can use the area under the ROC curve as a classification metric: \n\n- ROC AUC = 1 πŸ’― \n- ROC AUC = 1/2 😒\n\n:::notes\nROC curves are insensitive to class imbalance.\n:::\n\n## ROC curves ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Assumes _first_ factor level is event; there are options to change that\naugment(taxi_fit, new_data = taxi_train) %>% \n roc_curve(truth = tip, .pred_yes) %>%\n slice(1, 20, 50)\n#> # A tibble: 3 Γ— 3\n#> .threshold specificity sensitivity\n#> \n#> 1 -Inf 0 1 \n#> 2 0.25 0.486 0.972\n#> 3 0.6 0.705 0.920\n\naugment(taxi_fit, new_data = taxi_train) %>% \n roc_auc(truth = tip, .pred_yes)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 roc_auc binary 0.868\n```\n:::\n\n\n## ROC curve plot ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell output-location='column'}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>% \n roc_curve(truth = tip, .pred_yes) %>%\n autoplot()\n```\n\n::: {.cell-output-display}\n![](figures/roc-curve-1.svg)\n:::\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Compute and plot an ROC curve for your current model.*\n\n*What data are being used for this ROC curve plot?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n05:00\n
\n```\n:::\n:::\n\n\n## {background-iframe=\"https://yardstick.tidymodels.org/reference/index.html\"}\n\n::: footer\n:::\n\n# ⚠️ DANGERS OF OVERFITTING ⚠️\n\n## Dangers of overfitting ⚠️\n\n![](https://raw.githubusercontent.com/topepo/2022-nyr-workshop/main/images/tuning-overfitting-train-1.svg)\n\n## Dangers of overfitting ⚠️\n\n![](https://raw.githubusercontent.com/topepo/2022-nyr-workshop/main/images/tuning-overfitting-test-1.svg)\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train)\n#> # A tibble: 7,045 Γ— 10\n#> tip distance company local dow month hour .pred_class .pred_yes .pred_no\n#> \n#> 1 no 5.39 Flash … no Sat Mar 12 no 0.0625 0.937 \n#> 2 no 18.4 Sun Ta… no Sat Apr 6 yes 0.924 0.0758\n#> 3 no 5.8 other no Tue Jan 10 no 0.391 0.609 \n#> 4 no 6.85 Flash … no Fri Apr 8 no 0.112 0.888 \n#> 5 no 9.5 City S… no Wed Jan 7 no 0.129 0.871 \n#> 6 no 12 other no Fri Apr 11 no 0.326 0.674 \n#> 7 no 8.9 Taxi A… no Mon Feb 14 no 0.0917 0.908 \n#> 8 no 1.38 other no Fri Apr 16 yes 0.902 0.0980\n#> 9 no 9.12 Flash … no Wed Apr 9 no 0.0917 0.908 \n#> 10 no 2.28 City S… no Thu Apr 16 yes 0.933 0.0668\n#> # β„Ή 7,035 more rows\n```\n:::\n\n\nWe call this \"resubstitution\" or \"repredicting the training set\"\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n accuracy(tip, .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n```\n:::\n\n\nWe call this a \"resubstitution estimate\"\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n accuracy(tip, .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n```\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n:::\n:::\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n accuracy(tip, .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n```\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_test) %>%\n accuracy(tip, .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.795\n```\n:::\n\n:::\n:::\n\n. . .\n\n⚠️ Remember that we're demonstrating overfitting \n\n. . .\n\n⚠️ Don't use the test set until the *end* of your modeling analysis\n\n\n## {background-image=\"https://media.giphy.com/media/55itGuoAJiZEEen9gg/giphy.gif\" background-size=\"70%\"}\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute bottom=\"0\" left=\"0\" width=\"150\" height=\"150\"}\n\n*Use `augment()` and and a metric function to compute a classification metric like `brier_class()`.*\n\n*Compute the metrics for both training and testing data to demonstrate overfitting!*\n\n*Notice the evidence of overfitting!* ⚠️\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n05:00\n
\n```\n:::\n:::\n\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n brier_class(tip, .pred_yes)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 brier_class binary 0.113\n```\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_test) %>%\n brier_class(tip, .pred_yes)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 brier_class binary 0.152\n```\n:::\n\n:::\n:::\n\n. . .\n\nWhat if we want to compare more models?\n\n. . .\n\nAnd/or more model configurations?\n\n. . .\n\nAnd we want to understand if these are important differences?\n\n# The testing data are precious πŸ’Ž\n\n# How can we use the *training* data to compare and evaluate different models? πŸ€”\n\n## {background-color=\"white\" background-image=\"https://www.tmwr.org/premade/resampling.svg\" background-size=\"80%\"}\n\n## Cross-validation\n\n![](https://www.tmwr.org/premade/three-CV.svg)\n\n## Cross-validation\n\n![](https://www.tmwr.org/premade/three-CV-iter.svg)\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*If we use 10 folds, what percent of the training data*\n\n- *ends up in analysis*\n- *ends up in assessment*\n\n*for* **each** *fold?*\n\n![](images/taxi_spinning.svg){width=\"300\"}\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n03:00\n
\n```\n:::\n:::\n\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvfold_cv(taxi_train) # v = 10 is default\n#> # 10-fold cross-validation \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Fold01\n#> 2 Fold02\n#> 3 Fold03\n#> 4 Fold04\n#> 5 Fold05\n#> 6 Fold06\n#> 7 Fold07\n#> 8 Fold08\n#> 9 Fold09\n#> 10 Fold10\n```\n:::\n\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWhat is in this?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_folds <- vfold_cv(taxi_train)\ntaxi_folds$splits[1:3]\n#> [[1]]\n#> \n#> <6340/705/7045>\n#> \n#> [[2]]\n#> \n#> <6340/705/7045>\n#> \n#> [[3]]\n#> \n#> <6340/705/7045>\n```\n:::\n\n\n::: notes\nTalk about a list column, storing non-atomic types in dataframe\n:::\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvfold_cv(taxi_train, v = 5)\n#> # 5-fold cross-validation \n#> # A tibble: 5 Γ— 2\n#> splits id \n#> \n#> 1 Fold1\n#> 2 Fold2\n#> 3 Fold3\n#> 4 Fold4\n#> 5 Fold5\n```\n:::\n\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvfold_cv(taxi_train, strata = tip)\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Fold01\n#> 2 Fold02\n#> 3 Fold03\n#> 4 Fold04\n#> 5 Fold05\n#> 6 Fold06\n#> 7 Fold07\n#> 8 Fold08\n#> 9 Fold09\n#> 10 Fold10\n```\n:::\n\n\n. . .\n\nStratification often helps, with very little downside\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe'll use this setup:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_folds <- vfold_cv(taxi_train, v = 10, strata = tip)\ntaxi_folds\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Fold01\n#> 2 Fold02\n#> 3 Fold03\n#> 4 Fold04\n#> 5 Fold05\n#> 6 Fold06\n#> 7 Fold07\n#> 8 Fold08\n#> 9 Fold09\n#> 10 Fold10\n```\n:::\n\n\n. . .\n\nSet the seed when creating resamples\n\n# We are equipped with metrics and resamples!\n\n## Fit our model to the resamples\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res <- fit_resamples(taxi_wflow, taxi_folds)\ntaxi_res\n#> # Resampling results\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 4\n#> splits id .metrics .notes \n#> \n#> 1 Fold01 \n#> 2 Fold02 \n#> 3 Fold03 \n#> 4 Fold04 \n#> 5 Fold05 \n#> 6 Fold06 \n#> 7 Fold07 \n#> 8 Fold08 \n#> 9 Fold09 \n#> 10 Fold10 \n```\n:::\n\n\n## Evaluating model performance ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res %>%\n collect_metrics()\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 accuracy binary 0.793 10 0.00293 Preprocessor1_Model1\n#> 2 roc_auc binary 0.809 10 0.00461 Preprocessor1_Model1\n```\n:::\n\n\n. . 
.\n\nWe can reliably measure performance using only the **training** data πŸŽ‰\n\n## Comparing metrics ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nHow do the metrics from resampling compare to the metrics from training and testing?\n\n\n::: {.cell}\n\n:::\n\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res %>%\n collect_metrics() %>% \n select(.metric, mean, n)\n#> # A tibble: 2 Γ— 3\n#> .metric mean n\n#> \n#> 1 accuracy 0.793 10\n#> 2 roc_auc 0.809 10\n```\n:::\n\n:::\n\n::: {.column width=\"50%\"}\nThe ROC AUC previously was\n\n- 0.87 for the training set\n- 0.81 for test set\n:::\n:::\n\n. . .\n\nRemember that:\n\n⚠️ the training set gives you overly optimistic metrics\n\n⚠️ the test set is precious\n\n## Evaluating model performance ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Save the assessment set results\nctrl_taxi <- control_resamples(save_pred = TRUE)\ntaxi_res <- fit_resamples(taxi_wflow, taxi_folds, control = ctrl_taxi)\n\ntaxi_preds <- collect_predictions(taxi_res)\ntaxi_preds\n#> # A tibble: 7,045 Γ— 7\n#> id .pred_yes .pred_no .row .pred_class tip .config \n#> \n#> 1 Fold01 0.936 0.0638 10 yes no Preprocessor1_Model1\n#> 2 Fold01 0.898 0.102 20 yes no Preprocessor1_Model1\n#> 3 Fold01 0.898 0.102 47 yes no Preprocessor1_Model1\n#> 4 Fold01 0.101 0.899 51 no no Preprocessor1_Model1\n#> 5 Fold01 0.871 0.129 59 yes no Preprocessor1_Model1\n#> 6 Fold01 0.0815 0.918 60 no no Preprocessor1_Model1\n#> 7 Fold01 0.162 0.838 92 no no Preprocessor1_Model1\n#> 8 Fold01 0.26 0.74 97 no no Preprocessor1_Model1\n#> 9 Fold01 0.274 0.726 98 no no Preprocessor1_Model1\n#> 10 Fold01 0.804 0.196 104 yes no Preprocessor1_Model1\n#> # β„Ή 7,035 more rows\n```\n:::\n\n\n## Evaluating model performance ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_preds %>% \n group_by(id) %>%\n taxi_metrics(truth = tip, estimate = .pred_class)\n#> # A tibble: 30 Γ— 4\n#> id .metric .estimator .estimate\n#> \n#> 1 Fold01 accuracy binary 0.793\n#> 2 Fold02 accuracy binary 0.8 \n#> 3 Fold03 accuracy binary 0.786\n#> 4 Fold04 accuracy binary 0.804\n#> 5 Fold05 accuracy binary 0.796\n#> 6 Fold06 accuracy binary 0.789\n#> 7 Fold07 accuracy binary 0.793\n#> 8 Fold08 accuracy binary 0.808\n#> 9 Fold09 accuracy binary 0.783\n#> 10 Fold10 accuracy binary 0.780\n#> # β„Ή 20 more rows\n```\n:::\n\n\n## Where are the fitted models? ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res\n#> # Resampling results\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 5\n#> splits id .metrics .notes .predictions\n#> \n#> 1 Fold01 \n#> 2 Fold02 \n#> 3 Fold03 \n#> 4 Fold04 \n#> 5 Fold05 \n#> 6 Fold06 \n#> 7 Fold07 \n#> 8 Fold08 \n#> 9 Fold09 \n#> 10 Fold10 \n```\n:::\n\n\n. . 
.\n\nπŸ—‘οΈ\n\n# Alternate resampling schemes\n\n## Bootstrapping\n\n![](https://www.tmwr.org/premade/bootstraps.svg)\n\n## Bootstrapping ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(3214)\nbootstraps(taxi_train)\n#> # Bootstrap sampling \n#> # A tibble: 25 Γ— 2\n#> splits id \n#> \n#> 1 Bootstrap01\n#> 2 Bootstrap02\n#> 3 Bootstrap03\n#> 4 Bootstrap04\n#> 5 Bootstrap05\n#> 6 Bootstrap06\n#> 7 Bootstrap07\n#> 8 Bootstrap08\n#> 9 Bootstrap09\n#> 10 Bootstrap10\n#> # β„Ή 15 more rows\n```\n:::\n\n\n## {background-iframe=\"https://rsample.tidymodels.org/reference/index.html\"}\n\n::: footer\n:::\n\n## The whole game - status update\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-resamples.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Create:*\n\n- *Monte Carlo Cross-Validation sets*\n- *validation set*\n\n(use the reference guide to find the function)\n\n*Don't forget to set a seed when you resample!*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n## Monte Carlo Cross-Validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(322)\nmc_cv(taxi_train, times = 10)\n#> # Monte Carlo cross-validation (0.75/0.25) with 10 resamples \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Resample01\n#> 2 Resample02\n#> 3 Resample03\n#> 4 Resample04\n#> 5 Resample05\n#> 6 Resample06\n#> 7 Resample07\n#> 8 Resample08\n#> 9 Resample09\n#> 10 Resample10\n```\n:::\n\n\n## Validation set ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(853)\nvalidation_split(taxi_train, strata = tip)\n#> # Validation Set Split (0.75/0.25) using stratification \n#> # A tibble: 1 Γ— 2\n#> splits id \n#> \n#> 1 validation\n```\n:::\n\n\n. . .\n\nA validation set is just another type of resample\n\n# Decision tree 🌳\n\n# Random forest 🌳🌲🌴🌡🌴🌳🌳🌴🌲🌡🌴🌲🌳🌴🌳🌡🌡🌴🌲🌲🌳🌴🌳🌴🌲🌴🌡🌴🌲🌴🌡🌲🌡🌴🌲🌳🌴🌡🌳🌴🌳\n\n## Random forest 🌳🌲🌴🌡🌳🌳🌴🌲🌡🌴🌳🌡\n\n- Ensemble many decision tree models\n\n- All the trees vote! πŸ—³οΈ\n\n- Bootstrap aggregating + random predictor sampling\n\n. . .\n\n- Often works well without tuning hyperparameters (more on this tomorrow!), as long as there are enough trees\n\n## Create a random forest model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrf_spec <- rand_forest(trees = 1000, mode = \"classification\")\nrf_spec\n#> Random Forest Model Specification (classification)\n#> \n#> Main Arguments:\n#> trees = 1000\n#> \n#> Computational engine: ranger\n```\n:::\n\n\n## Create a random forest model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrf_wflow <- workflow(tip ~ ., rf_spec)\nrf_wflow\n#> ══ Workflow ══════════════════════════════════════════════════════════\n#> Preprocessor: Formula\n#> Model: rand_forest()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> tip ~ .\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> Random Forest Model Specification (classification)\n#> \n#> Main Arguments:\n#> trees = 1000\n#> \n#> Computational engine: ranger\n```\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Use `fit_resamples()` and `rf_wflow` to:*\n\n- *keep predictions*\n- *compute metrics*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n08:00\n
\n```\n:::\n:::\n\n\n## Evaluating model performance ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nctrl_taxi <- control_resamples(save_pred = TRUE)\n\n# Random forest uses random numbers so set the seed first\n\nset.seed(2)\nrf_res <- fit_resamples(rf_wflow, taxi_folds, control = ctrl_taxi)\ncollect_metrics(rf_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 accuracy binary 0.813 10 0.00305 Preprocessor1_Model1\n#> 2 roc_auc binary 0.832 10 0.00513 Preprocessor1_Model1\n```\n:::\n\n\n## How can we compare multiple model workflows at once?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/taxi_spinning.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Evaluate a workflow set\n\n\n::: {.cell}\n\n```{.r .cell-code}\nworkflow_set(list(tip ~ .), list(tree_spec, rf_spec))\n#> # A workflow set/tibble: 2 Γ— 4\n#> wflow_id info option result \n#> \n#> 1 formula_decision_tree \n#> 2 formula_rand_forest \n```\n:::\n\n\n## Evaluate a workflow set\n\n\n::: {.cell}\n\n```{.r .cell-code}\nworkflow_set(list(tip ~ .), list(tree_spec, rf_spec)) %>%\n workflow_map(\"fit_resamples\", resamples = taxi_folds)\n#> # A workflow set/tibble: 2 Γ— 4\n#> wflow_id info option result \n#> \n#> 1 formula_decision_tree \n#> 2 formula_rand_forest \n```\n:::\n\n\n## Evaluate a workflow set\n\n\n::: {.cell}\n\n```{.r .cell-code}\nworkflow_set(list(tip ~ .), list(tree_spec, rf_spec)) %>%\n workflow_map(\"fit_resamples\", resamples = taxi_folds) %>%\n rank_results()\n#> # A tibble: 4 Γ— 9\n#> wflow_id .config .metric mean std_err n preprocessor model rank\n#> \n#> 1 formula_rand_for… Prepro… accura… 0.813 0.00339 10 formula rand… 1\n#> 2 formula_rand_for… Prepro… roc_auc 0.833 0.00528 10 formula rand… 1\n#> 3 formula_decision… Prepro… accura… 0.793 0.00293 10 formula deci… 2\n#> 4 formula_decision… Prepro… roc_auc 0.809 0.00461 10 formula deci… 2\n```\n:::\n\n\nThe first metric of the metric set is used for ranking. Use `rank_metric` to change that.\n\n. . .\n\nLots more available with workflow sets, like `collect_metrics()`, `autoplot()` methods, and more!\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*When do you think a workflow set would be useful?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n## The whole game - status update\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-select.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n## The final fit ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} \n\nSuppose that we are happy with our random forest model.\n\nLet's fit the model on the training set and verify our performance using the test set.\n\n. . .\n\nWe've shown you `fit()` and `predict()` (+ `augment()`) but there is a shortcut:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# taxi_split has train + test info\nfinal_fit <- last_fit(rf_wflow, taxi_split) \n\nfinal_fit\n#> # Resampling results\n#> # Manual resampling \n#> # A tibble: 1 Γ— 6\n#> splits id .metrics .notes .predictions .workflow \n#> \n#> 1 train/test split \n```\n:::\n\n\n## What is in `final_fit`? ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(final_fit)\n#> # A tibble: 2 Γ— 4\n#> .metric .estimator .estimate .config \n#> \n#> 1 accuracy binary 0.810 Preprocessor1_Model1\n#> 2 roc_auc binary 0.817 Preprocessor1_Model1\n```\n:::\n\n\n. . .\n\nThese are metrics computed with the **test** set\n\n## What is in `final_fit`? ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_predictions(final_fit)\n#> # A tibble: 1,762 Γ— 7\n#> id .pred_yes .pred_no .row .pred_class tip .config \n#> \n#> 1 train/test split 0.732 0.268 10 yes no Preprocessor1_Mo…\n#> 2 train/test split 0.827 0.173 29 yes yes Preprocessor1_Mo…\n#> 3 train/test split 0.899 0.101 35 yes yes Preprocessor1_Mo…\n#> 4 train/test split 0.914 0.0856 42 yes yes Preprocessor1_Mo…\n#> 5 train/test split 0.911 0.0889 47 yes no Preprocessor1_Mo…\n#> 6 train/test split 0.848 0.152 54 yes yes Preprocessor1_Mo…\n#> 7 train/test split 0.580 0.420 59 yes yes Preprocessor1_Mo…\n#> 8 train/test split 0.912 0.0876 62 yes yes Preprocessor1_Mo…\n#> 9 train/test split 0.810 0.190 63 yes yes Preprocessor1_Mo…\n#> 10 train/test split 0.960 0.0402 69 yes yes Preprocessor1_Mo…\n#> # β„Ή 1,752 more rows\n```\n:::\n\n\n## What is in `final_fit`? ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nextract_workflow(final_fit)\n#> ══ Workflow [trained] ════════════════════════════════════════════════\n#> Preprocessor: Formula\n#> Model: rand_forest()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> tip ~ .\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) \n#> \n#> Type: Probability estimation \n#> Number of trees: 1000 \n#> Sample size: 7045 \n#> Number of independent variables: 6 \n#> Mtry: 2 \n#> Target node size: 10 \n#> Variable importance mode: none \n#> Splitrule: gini \n#> OOB prediction error (Brier s.): 0.1373147\n```\n:::\n\n\n. . 
.\n\nUse this for **prediction** on new data, such as when deploying the model.\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-final-performance.jpg){fig-align='center' width=3543}\n:::\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*End of the day discussion!*\n\n*Which model do you think you would decide to use?*\n\n*What surprised you the most?*\n\n*What is one thing you are looking forward to for tomorrow?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/08-wrapping-up/execute-results/html.json b/_freeze/archive/2023-07-nyr/08-wrapping-up/execute-results/html.json new file mode 100644 index 00000000..6db5567b --- /dev/null +++ b/_freeze/archive/2023-07-nyr/08-wrapping-up/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "f2fd12a89707ee8d5ce342b615bf79ff", + "result": { + "markdown": "---\ntitle: \"8 - Wrapping up\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n---\n\n\n\n\n::: r-fit-text\nWe made it!\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n\n*What is one thing you learned that surprised you?*\n\n*What is one thing you learned that you plan to use?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n\n## Resources to keep learning\n\n. . .\n\n- \n\n. . .\n\n- \n\n. . .\n\n- \n\n. . .\n\n- \n\n. . .\n\nFollow us on Twitter and at the tidyverse blog for updates!\n\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/advanced-01-feature-engineering/execute-results/html.json b/_freeze/archive/2023-07-nyr/advanced-01-feature-engineering/execute-results/html.json new file mode 100644 index 00000000..2725ad80 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/advanced-01-feature-engineering/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "d8628a825c608e29655e7837aacd529a", + "result": { + "markdown": "---\ntitle: \"1 - Feature Engineering\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n\n## Working with our predictors\n\nWe might want to modify our predictors columns for a few reasons: \n\n::: {.incremental}\n- The model requires them in a different format (e.g. dummy variables for linear regression).\n- The model needs certain data qualities (e.g. same units for K-NN).\n- The outcome is better predicted when one or more columns are transformed in some way (a.k.a \"feature engineering\"). \n:::\n\n. . .\n\nThe first two reasons are fairly predictable ([next page](https://www.tmwr.org/pre-proc-table.html#tab:preprocessing)).\n\nThe last one depends on your modeling problem. \n\n\n## {background-iframe=\"https://www.tmwr.org/pre-proc-table.html#tab:preprocessing\"}\n\n::: footer\n:::\n\n\n## What is feature engineering?\n\nThink of a feature as some *representation* of a predictor that will be used in a model.\n\n. . .\n\nExample representations:\n\n- Interactions\n- Polynomial expansions/splines\n- Principal component analysis (PCA) feature extraction\n\nThere are a lot of examples in [_Feature Engineering and Selection_](https://bookdown.org/max/FES/) (FES).\n\n\n\n## Example: Dates\n\nHow can we represent date columns for our model?\n\n. . .\n\nWhen we use a date column in its native format, most models in R convert it to an integer.\n\n. . .\n\nWe can re-engineer it as:\n\n- Days since a reference date\n- Day of the week\n- Month\n- Year\n- Indicators for holidays\n\n::: notes\nThe main point is that we try to maximize performance with different versions of the predictors. 
\n\nMention that, for the Chicago data, the day or the week features are usually the most important ones in the model.\n:::\n\n## General definitions ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n- *Data preprocessing* steps allow your model to fit.\n\n- *Feature engineering* steps help the model do the least work to predict the outcome as well as possible.\n\nThe recipes package can handle both!\n\n::: notes\nThese terms are often used interchangeably in the ML community but we want to distinguish them.\n:::\n\n\n## Hotel Data ![](hexes/tidymodels.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dplyr.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nWe'll use [data on hotels](https://www.sciencedirect.com/science/article/pii/S2352340918315191) to predict the cost of a room. \n\nThe [data](https://modeldatatoo.tidymodels.org/dev/reference/data_hotel_rates.html) are in the modeldatatoo package. We'll sample down the data and refactor some columns: \n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Data splitting strategy\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/initial-split.svg){fig-align='center' width=20%}\n:::\n:::\n\n\n\n## Data Spending ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nLet's split the data into a training set (75%) and testing set (25%):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n```\n:::\n\n\n\n\n## Your turn {transition=\"slide-in\"}\n\nLet's take some time and investigate the _training data_. The outcome is `avg_price_per_room`. \n\nAre there any interesting characteristics of the data?\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n10:00\n
\n```\n:::\n:::\n\n\n## Resampling Strategy\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/10-Fold-CV.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n\n## Resampling Strategy ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe'll use simple 10-fold cross-validation (stratified sampling):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\nhotel_rs\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Fold01\n#> 2 Fold02\n#> 3 Fold03\n#> 4 Fold04\n#> 5 Fold05\n#> 6 Fold06\n#> 7 Fold07\n#> 8 Fold08\n#> 9 Fold09\n#> 10 Fold10\n```\n:::\n\n\n\n## Prepare your data for modeling ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n- The recipes package is an extensible framework for pipeable sequences of preprocessing and feature engineering steps.\n\n. . .\n\n- Statistical parameters for the steps can be _estimated_ from an initial data set and then _applied_ to other data sets.\n\n. . .\n\n- The resulting processed output can be used as inputs for statistical or machine learning models.\n\n## A first recipe ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr)\n```\n:::\n\n\n. . .\n\n- The `recipe()` function assigns columns to roles of \"outcome\" or \"predictor\" using the formula\n\n## A first recipe ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(hotel_rec)\n#> # A tibble: 28 Γ— 4\n#> variable type role source \n#> \n#> 1 lead_time predictor original\n#> 2 arrival_date_day_of_month predictor original\n#> 3 stays_in_weekend_nights predictor original\n#> 4 stays_in_week_nights predictor original\n#> 5 adults predictor original\n#> 6 children predictor original\n#> 7 babies predictor original\n#> 8 meal predictor original\n#> 9 country predictor original\n#> 10 market_segment predictor original\n#> # β„Ή 18 more rows\n```\n:::\n\n\nThe `type` column contains information on the variables\n\n\n## Your turn {transition=\"slide-in\"}\n\nWhat do you think are in the `type` vectors for the `lead_time` and `country` columns?\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n02:00\n
\n```\n:::\n:::\n\n\n\n\n## Create indicator variables ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors())\n```\n:::\n\n\n. . .\n\n- For any factor or character predictors, make binary indicators.\n\n- There are *many* recipe steps that can convert categorical predictors to numeric columns.\n\n- `step_dummy()` records the levels of the categorical predictors in the training set. \n\n## Filter out constant columns ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors())\n```\n:::\n\n\n. . .\n\nIn case there is a factor level that was never observed in the training data (resulting in a column of all `0`s), we can delete any *zero-variance* predictors that have a single unique value.\n\n:::notes\nNote that the selector chooses all columns with a role of \"predictor\"\n:::\n\n\n## Normalization ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors())\n```\n:::\n\n\n. . .\n\n- This centers and scales the numeric predictors.\n\n\n- The recipe will use the _training_ set to estimate the means and standard deviations of the data.\n\n. . .\n\n- All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).\n\n## Reduce correlation ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_corr(all_numeric_predictors(), threshold = 0.9)\n```\n:::\n\n\n. . .\n\nTo deal with highly correlated predictors, find the minimum set of predictor columns that make the pairwise correlations less than the threshold.\n\n## Other possible steps ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_pca(all_numeric_predictors())\n```\n:::\n\n\n. . . \n\nPCA feature extraction...\n\n## Other possible steps ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/embed.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n embed::step_umap(all_numeric_predictors(), outcome = avg_price_per_room)\n```\n:::\n\n\n. . . 
\n\nA fancy machine learning supervised dimension reduction technique...\n\n:::notes\nNote that this uses the outcome, and it is from an extension package\n:::\n\n\n## Other possible steps ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_spline_natural(year_day, deg_free = 10)\n```\n:::\n\n\n. . . \n\nNonlinear transforms like natural splines, and so on!\n\n## {background-iframe=\"https://recipes.tidymodels.org/reference/index.html\"}\n\n::: footer\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Create a `recipe()` for the hotel data to:*\n\n- *use a Yeo-Johnson (YJ) transformation on `lead_time`*\n- *convert factors to indicator variables*\n- *remove zero-variance variables*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n\n## Minimal recipe ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_indicators <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_YeoJohnson(lead_time) %>%\n step_dummy(all_nominal_predictors()) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n\n## Measuring Performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe'll compute two measures: mean absolute error and the coefficient of determination (a.k.a $R^2$). \n\n\\begin{align}\nMAE &= \\frac{1}{n}\\sum_{i=1}^n |y_i - \\hat{y}_i| \\notag \\\\\nR^2 &= cor(y_i, \\hat{y}_i)^2\n\\end{align}\n\nThe focus will be on MAE for parameter optimization. We'll use a metric set to compute these: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_metrics <- metric_set(mae, rsq)\n```\n:::\n\n\n\n## Using a workflow ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(9)\n\nhotel_lm_wflow <-\n workflow() %>%\n add_recipe(hotel_indicators) %>%\n add_model(linear_reg())\n \nctrl <- control_resamples(save_pred = TRUE)\nhotel_lm_res <-\n hotel_lm_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n\ncollect_metrics(hotel_lm_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 mae standard 17.3 10 0.199 Preprocessor1_Model1\n#> 2 rsq standard 0.874 10 0.00400 Preprocessor1_Model1\n```\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Use `fit_resamples()` to fit your workflow with a recipe.*\n\n*Collect the predictions from the results.*\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n\n## Holdout predictions ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Since we used `save_pred = TRUE`\nlm_val_pred <- collect_predictions(hotel_lm_res)\nlm_val_pred %>% slice(1:7)\n#> # A tibble: 7 Γ— 5\n#> id .pred .row avg_price_per_room .config \n#> \n#> 1 Fold01 62.1 20 40 Preprocessor1_Model1\n#> 2 Fold01 48.0 28 54 Preprocessor1_Model1\n#> 3 Fold01 64.6 45 50 Preprocessor1_Model1\n#> 4 Fold01 45.8 49 42 Preprocessor1_Model1\n#> 5 Fold01 45.8 61 49 Preprocessor1_Model1\n#> 6 Fold01 30.0 66 40 Preprocessor1_Model1\n#> 7 Fold01 38.8 88 49 Preprocessor1_Model1\n```\n:::\n\n\n\n## Calibration Plot ![](hexes/probably.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(probably)\n\ncal_plot_regression(hotel_lm_res, alpha = 1 / 5)\n```\n\n::: {.cell-output-display}\n![](figures/lm-cal-plot-1.svg){fig-align='center' width=40%}\n:::\n:::\n\n\n\n\n## What do we do with the agent and company data? \n\nThere are 98 unique agent values and 100 unique companies in our training set. How can we include this information in our model?\n\n. . .\n\nWe could:\n\n- make the full set of indicator variables 😳\n\n- lump agents and companies that rarely occur into an \"other\" group\n\n- use [feature hashing](https://www.tmwr.org/categorical.html#feature-hashing) to create a smaller set of indicator variables\n\n- use effect encoding to replace the `agent` and `company` columns with the estimated effect of that predictor (in the extra materials)\n\n\n\n\n\n\n\n## Per-agent statistics \n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-freq-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-adr-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n:::\n\n## Collapsing factor levels ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nThere is a recipe step that will redefine factor levels based on their frequency in the training set: \n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4|\"}\nhotel_other_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_YeoJohnson(lead_time) %>%\n step_other(agent, threshold = 0.001) %>%\n step_dummy(all_nominal_predictors()) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n\n\nUsing this code, 34 agents (out of 98) were collapsed into \"other\" based on the training set.\n\nWe _could_ try to optimize the threshold for collapsing (see the next set of slides on model tuning).\n\n## Does othering help? 
![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3|\"}\nhotel_other_wflow <-\n hotel_lm_wflow %>%\n update_recipe(hotel_other_rec)\n\nhotel_other_res <-\n hotel_other_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n\ncollect_metrics(hotel_other_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 mae standard 17.4 10 0.205 Preprocessor1_Model1\n#> 2 rsq standard 0.874 10 0.00417 Preprocessor1_Model1\n```\n:::\n\n\nAbout the same MAE and much faster to complete. \n\nNow let's look at a more sophisticated tool called feature hashing. \n\n## Feature Hashing\n\nBetween `agent` and `company`, simple dummy variables would create 198 new columns (that are mostly zeros).\n\nAnother option is to have a binary indicator that combines some levels of these variables.\n\nFeature hashing (for more see [_FES_](https://bookdown.org/max/FES/encoding-predictors-with-many-categories.html), [_SMLTAR_](https://smltar.com/mlregression.html#case-study-feature-hashing), and [_TMwR_](https://www.tmwr.org/categorical.html#feature-hashing)): \n\n- uses the character values of the levels \n- converts them to integer hash values\n- uses the integers to assign them to a specific indicator column. \n\n## Feature Hashing\n\nSuppose we want to use 32 indicator variables for `agent`. \n\nFor an agent with value \"`Max_Kuhn`\", a hashing function converts it to an integer (say 210397726). \n\nTo assign it to one of the 32 columns, we use modular arithmetic: \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# For \"Max_Kuhn\" put a '1' in column: \n210397726 %% 32\n#> [1] 30\n```\n:::\n\n\n[Hash functions](https://www.metamorphosite.com/one-way-hash-encryption-sha1-data-software) are meant to _emulate_ randomness. \n\n\n## Feature Hashing Pros\n\n\n- The procedure will automatically work on new values of the predictors.\n- It is fast. \n- \"Signed\" hashes add a sign to help avoid aliasing. \n\n## Feature Hashing Cons\n\n- There is no real logic behind which factor levels are combined. \n- We don't know how many columns to add (more in the next section).\n- Some columns may have all zeros. \n- If an indicator column is important to the model, we can't easily determine why. 
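\n\nTo make the level-to-column mapping concrete, here is a small sketch. The helper name `hash_column()` is made up and `rlang::hash()` is used purely for illustration; it is not necessarily the hash function that `step_dummy_hash()` uses internally, so the exact column assignments will differ from the example above.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhash_column <- function(values, num_terms = 32) {\n  # hash each label to a hex string, keep the first 7 hex digits so the\n  # integer fits in R's integer range, then map it to a column via modulo\n  hex <- vapply(values, rlang::hash, character(1))\n  strtoi(substr(hex, 1, 7), base = 16L) %% num_terms\n}\n\nhash_column(\"Max_Kuhn\")\nhash_column(c(\"agent_1\", \"agent_99\", \"some_new_agent\"))\n```\n:::\n\n\nIn practice `step_dummy_hash()` handles all of this (including the signed variant); the sketch is only meant to show how collisions, where different levels land in the same column, can arise.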
\n\n:::notes\nThe signed hash make it slightly more possible to differentiate between confounded levels\n:::\n\n\n## Feature Hashing in recipes ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"}\n\nThe textrecipes package has a step that can be added to the recipe: \n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6-8|\"}\nlibrary(textrecipes)\n\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n # Defaults to 32 signed indicator columns\n step_dummy_hash(agent) %>%\n step_dummy_hash(company) %>%\n # Regular indicators for the others\n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors())\n\nhotel_hash_wflow <-\n hotel_lm_wflow %>%\n update_recipe(hash_rec)\n```\n:::\n\n\n\n## Feature Hashing in recipes ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_hash_res <-\n hotel_hash_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n\ncollect_metrics(hotel_hash_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 mae standard 17.5 10 0.256 Preprocessor1_Model1\n#> 2 rsq standard 0.872 10 0.00395 Preprocessor1_Model1\n```\n:::\n\n\nAbout the same performance but now we can handle new values. \n\n\n## Debugging a recipe\n\n- Typically, you will want to use a workflow to estimate and apply a recipe.\n\n. . .\n\n- If you have an error and need to debug your recipe, the original recipe object (e.g. `hash_rec`) can be estimated manually with a function called `prep()`. It is analogous to `fit()`. See [TMwR section 16.4](https://www.tmwr.org/dimensionality.html#recipe-functions)\n\n. . .\n\n- Another function (`bake()`) is analogous to `predict()`, and gives you the processed data back.\n\n. . .\n\n- The `tidy()` function can be used to get specific results from the recipe.\n\n## Example ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/broom.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhash_rec_fit <- prep(hash_rec)\n\n# Get the transformation coefficient\ntidy(hash_rec_fit, number = 1)\n\n# Get the processed data\nbake(hash_rec_fit, hotel_tr %>% slice(1:3), contains(\"_agent_\"))\n```\n:::\n\n\n## More on recipes\n\n- Once `fit()` is called on a workflow, changing the model does not re-fit the recipe.\n\n. . .\n\n- A list of all known steps is at .\n\n. . .\n\n- Some steps can be [skipped](https://recipes.tidymodels.org/articles/Skipping.html) when using `predict()`.\n\n. . 
.\n\n- The [order](https://recipes.tidymodels.org/articles/Ordering.html) of the steps matters.\n\n\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/advanced-02-tuning-hyperparameters/execute-results/html.json b/_freeze/archive/2023-07-nyr/advanced-02-tuning-hyperparameters/execute-results/html.json new file mode 100644 index 00000000..e2efdc76 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/advanced-02-tuning-hyperparameters/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "0e183a03b8c38cce1cf1df5e00d08e01", + "result": { + "markdown": "---\ntitle: \"2 - Tuning Hyperparameters\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n\n## Previously - Setup ![](hexes/tidymodels.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n\nreg_metrics <- metric_set(mae, rsq)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Previously - Data Usage ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n```\n:::\n\n\n\n## Previously - Feature engineering ![](hexes/textrecipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(textrecipes)\n\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n # Defaults to 32 signed indicator columns\n step_dummy_hash(agent) %>%\n step_dummy_hash(company) %>%\n # Regular indicators for the others\n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors())\n```\n:::\n\n\n# Optimizing Models via Tuning Parameters\n\n## Tuning parameters\n\nSome model or preprocessing parameters cannot be estimated directly from the data.\n\n. . 
.\n\nSome examples:\n\n- Tree depth in decision trees\n- Number of neighbors in a K-nearest neighbor model\n\n# Activation function in neural networks?\n\nSigmoidal functions, ReLu, etc.\n\n::: fragment\nYes, it is a tuning parameter.\nβœ…\n:::\n\n# Number of feature hashing columns to generate?\n\n::: fragment\nYes, it is a tuning parameter.\nβœ…\n:::\n\n# Bayesian priors for model parameters?\n\n::: fragment\nHmmmm, probably not.\nThese are based on prior belief.\n❌\n:::\n\n# Covariance/correlation matrix structure in mixed models?\n\n::: fragment\nYes, but it is unlikely to affect performance.\n:::\n\n::: fragment\nIt will impact inference though.\nπŸ€”\n:::\n\n\n\n# Is the random seed a tuning parameter?\n\n::: fragment\nNope. It is not. \n❌\n:::\n\n## Optimize tuning parameters\n\n- Try different values and measure their performance.\n\n. . .\n\n- Find good values for these parameters.\n\n. . .\n\n- Once the value(s) of the parameter(s) are determined, a model can be finalized by fitting the model to the entire training set.\n\n\n## Tagging parameters for tuning ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWith tidymodels, you can mark the parameters that you want to optimize with a value of `tune()`. \n\n
\n\nThe function itself just returns... itself: \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntune()\n#> tune()\nstr(tune())\n#> language tune()\n\n# optionally add a label\ntune(\"I hope that the workshop is going well\")\n#> tune(\"I hope that the workshop is going well\")\n```\n:::\n\n\n. . . \n\nFor example...\n\n## Optimizing the hash features ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"}\n\nOur new recipe is: \n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4-5|\"}\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n
\n\nWe will be using a tree-based model in a minute. \n\n - The other categorical predictors are left as-is.\n - That's why there is no `step_dummy()`. \n\n\n## Boosted Trees\n\nThese are popular ensemble methods that build a _sequence_ of tree models. \n\n
\n\nEach tree uses the results of the previous tree to better predict samples, especially those that have been poorly predicted. \n\n
\n\nEach tree in the ensemble is saved, and new samples are predicted using a weighted average of the votes from every tree. \n\n
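\n\nAs a rough illustration of that idea, here is a minimal boosting loop for regression: each new tree is fit to the residuals of the current ensemble, and its predictions are added back after being shrunken by the learning rate. This is only a sketch, not lightgbm's actual algorithm; the function name, the `rpart` settings, and the use of `mtcars` as a stand-in data set are all just for illustration.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rpart)\n\nboost_sketch <- function(x, y, n_trees = 50, learn_rate = 0.1) {\n  preds <- rep(mean(y), length(y))   # start from a constant prediction\n  trees <- vector(\"list\", n_trees)\n  for (i in seq_len(n_trees)) {\n    # fit a shallow tree to what the current ensemble still gets wrong\n    fit_data <- data.frame(x, .resid = y - preds)\n    trees[[i]] <- rpart(.resid ~ ., data = fit_data, maxdepth = 3, cp = 0)\n    # add its (shrunken) predictions to the ensemble\n    preds <- preds + learn_rate * predict(trees[[i]], newdata = x)\n  }\n  list(trees = trees, fitted = preds)\n}\n\ndemo_fit <- boost_sketch(mtcars[, -1], mtcars$mpg)\nhead(demo_fit$fitted)\n```\n:::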
\n\nWe'll focus on the popular lightgbm implementation. \n\n## Boosted Tree Tuning Parameters\n\nSome _possible_ parameters: \n\n* `mtry`: The number of predictors randomly sampled at each split (in $[1, ncol(x)]$ or $(0, 1]$).\n* `trees`: The number of trees ($[1, \\infty]$, but usually up to thousands)\n* `min_n`: The number of samples needed to further split ($[1, n]$).\n* `learn_rate`: The rate that each tree adapts from previous iterations ($(0, \\infty]$, usual maximum is 0.1).\n* `stop_iter`: The number of iterations of boosting where _no improvement_ was shown before stopping ($[1, trees]$)\n\n## Boosted Tree Tuning Parameters\n\nTBH it is usually not difficult to optimize these models. \n\n
\n\nOften, there are multiple _candidate_ tuning parameter combinations that have very good results. \n\n
\n\nTo demonstrate simple concepts, we'll look at optimizing the number of trees in the ensemble (between 1 and 100) and the learning rate ($10^{-5}$ to $10^{-1}$).\n\n## Boosted Tree Tuning Parameters ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/bonsai.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"}\n\nWe'll need to load the bonsai package. This has the information needed to use lightgbm\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(bonsai)\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hash_rec, lgbm_spec)\n```\n:::\n\n\n\n\n\n## Optimize tuning parameters\n\nThe main two strategies for optimization are:\n\n. . .\n\n- **Grid search** πŸ’  which tests a pre-defined set of candidate values\n\n- **Iterative search** πŸŒ€ which suggests/estimates new values of candidate parameters to evaluate\n\n\n## Grid search\n\nA small grid of points trying to minimize the error via learning rate: \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/small_init.svg){fig-align='center' width=60%}\n:::\n:::\n\n\n\n## Grid search\n\nIn reality we would probably sample the space more densely: \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/grid_points.svg){fig-align='center' width=60%}\n:::\n:::\n\n\n\n## Iterative Search\n\nWe could start with a few points and search the space:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](animations/anime_seq.gif){fig-align='center' width=60%}\n:::\n:::\n\n\n# Grid Search\n\n## Parameters\n\n- The tidymodels framework provides pre-defined information on tuning parameters (such as their type, range, transformations, etc).\n\n- The `extract_parameter_set_dials()` function extracts these tuning parameters and the info.\n\n::: fragment\n#### Grids\n\n- Create your grid manually or automatically.\n\n- The `grid_*()` functions can make a grid.\n:::\n\n::: notes\nMost basic (but very effective) way to tune models\n:::\n\n## Create a grid ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlgbm_wflow %>% \n extract_parameter_set_dials()\n#> Collection of 4 parameters for tuning\n#> \n#> identifier type object\n#> trees trees nparam[+]\n#> learn_rate learn_rate nparam[+]\n#> agent hash num_terms nparam[+]\n#> company hash num_terms nparam[+]\n\n# Individual functions: \ntrees()\n#> # Trees (quantitative)\n#> Range: [1, 2000]\nlearn_rate()\n#> Learning Rate (quantitative)\n#> Transformer: log-10 [1e-100, Inf]\n#> Range (transformed scale): [-10, -1]\n```\n:::\n\n\n::: fragment\nA parameter set can be updated (e.g. 
to change the ranges).\n:::\n\n## Create a grid ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"65%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(12)\ngrid <- \n lgbm_wflow %>% \n extract_parameter_set_dials() %>% \n grid_latin_hypercube(size = 25)\n\ngrid\n#> # A tibble: 25 Γ— 4\n#> trees learn_rate `agent hash` `company hash`\n#> \n#> 1 1629 0.00000440 524 1454\n#> 2 1746 0.0000000751 1009 2865\n#> 3 53 0.0000180 2313 367\n#> 4 442 0.000000445 347 460\n#> 5 1413 0.0000000208 3232 553\n#> 6 1488 0.0000578 3692 639\n#> 7 906 0.000385 602 332\n#> 8 1884 0.00000000101 1127 567\n#> 9 1812 0.0239 961 1183\n#> 10 393 0.000000117 487 1783\n#> # β„Ή 15 more rows\n```\n:::\n\n:::\n\n::: {.column width=\"35%\"}\n::: fragment\n- A *space-filling design* tends to perform better than random grids.\n- Space-filling designs are also usually more efficient than regular grids.\n:::\n:::\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Create a grid for our tunable workflow.*\n\n*Try creating a regular grid.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n## Create a regular grid ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5\"}\nset.seed(12)\ngrid <- \n lgbm_wflow %>% \n extract_parameter_set_dials() %>% \n grid_regular(levels = 4)\n\ngrid\n#> # A tibble: 256 Γ— 4\n#> trees learn_rate `agent hash` `company hash`\n#> \n#> 1 1 0.0000000001 256 256\n#> 2 667 0.0000000001 256 256\n#> 3 1333 0.0000000001 256 256\n#> 4 2000 0.0000000001 256 256\n#> 5 1 0.0000001 256 256\n#> 6 667 0.0000001 256 256\n#> 7 1333 0.0000001 256 256\n#> 8 2000 0.0000001 256 256\n#> 9 1 0.0001 256 256\n#> 10 667 0.0001 256 256\n#> # β„Ή 246 more rows\n```\n:::\n\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n
\n\n*What advantage would a regular grid have?* \n\n\n\n## Update parameter ranges ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4-5|\"}\nlgbm_param <- \n lgbm_wflow %>% \n extract_parameter_set_dials() %>% \n update(trees = trees(c(1L, 100L)),\n learn_rate = learn_rate(c(-5, -1)))\n\nset.seed(712)\ngrid <- \n lgbm_param %>% \n grid_latin_hypercube(size = 25)\n\ngrid\n#> # A tibble: 25 Γ— 4\n#> trees learn_rate `agent hash` `company hash`\n#> \n#> 1 75 0.000312 2991 1250\n#> 2 4 0.0000337 899 3088\n#> 3 15 0.0295 520 1578\n#> 4 8 0.0997 1256 3592\n#> 5 80 0.000622 419 258\n#> 6 70 0.000474 2499 1089\n#> 7 35 0.000165 287 2376\n#> 8 64 0.00137 389 359\n#> 9 58 0.0000250 616 881\n#> 10 84 0.0639 2311 2635\n#> # β„Ή 15 more rows\n```\n:::\n\n\n\n## The results ![](hexes/ggplot2.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\ngrid %>% \n ggplot(aes(trees, learn_rate)) +\n geom_point(size = 4) +\n scale_y_log10()\n```\n\n::: {.cell-output-display}\n![](figures/sfd-1.svg){fig-align='center'}\n:::\n:::\n\n\nNote that the learning rates are uniform on the log-10 scale. \n\n\n# Use the `tune_*()` functions to tune models\n\n\n## Choosing tuning parameters ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"} ![](hexes/bonsai.png){.absolute top=-20 right=256 width=\"64\" height=\"74.24\"}\n\nLet's take our previous model and tune more parameters:\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2,12-13|\"}\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hash_rec, lgbm_spec)\n\n# Update the feature hash ranges (log-2 units)\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\n\n## Grid Search ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} \n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2|\"}\nset.seed(9)\nctrl <- control_grid(save_pred = TRUE)\n\nlgbm_res <-\n lgbm_wflow %>%\n tune_grid(\n resamples = hotel_rs,\n grid = 25,\n # The options below are not required by default\n param_info = lgbm_param, \n control = ctrl,\n metrics = reg_metrics\n )\n```\n:::\n\n\n::: notes\n- `tune_grid()` is representative of tuning function syntax\n- similar to `fit_resamples()`\n:::\n\n\n\n## Grid Search ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlgbm_res \n#> # Tuning results\n#> # 10-fold cross-validation using 
stratification \n#> # A tibble: 10 Γ— 5\n#> splits id .metrics .notes .predictions \n#> \n#> 1 Fold01 \n#> 2 Fold02 \n#> 3 Fold03 \n#> 4 Fold04 \n#> 5 Fold05 \n#> 6 Fold06 \n#> 7 Fold07 \n#> 8 Fold08 \n#> 9 Fold09 \n#> 10 Fold10 \n```\n:::\n\n\n\n## Grid results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_res)\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-1.svg){fig-align='center' width=80%}\n:::\n:::\n\n\n## Tuning results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(lgbm_res)\n#> # A tibble: 50 Γ— 11\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config \n#> \n#> 1 298 19 4.15e- 9 222 36 mae standard 53.2 10 0.427 Preprocessor01_Model1\n#> 2 298 19 4.15e- 9 222 36 rsq standard 0.811 10 0.00785 Preprocessor01_Model1\n#> 3 1394 5 5.82e- 6 28 21 mae standard 52.9 10 0.424 Preprocessor02_Model1\n#> 4 1394 5 5.82e- 6 28 21 rsq standard 0.810 10 0.00857 Preprocessor02_Model1\n#> 5 774 12 4.41e- 2 27 95 mae standard 10.5 10 0.175 Preprocessor03_Model1\n#> 6 774 12 4.41e- 2 27 95 rsq standard 0.939 10 0.00381 Preprocessor03_Model1\n#> 7 1342 7 6.84e-10 71 17 mae standard 53.2 10 0.427 Preprocessor04_Model1\n#> 8 1342 7 6.84e-10 71 17 rsq standard 0.810 10 0.00903 Preprocessor04_Model1\n#> 9 669 39 8.62e- 7 141 145 mae standard 53.2 10 0.426 Preprocessor05_Model1\n#> 10 669 39 8.62e- 7 141 145 rsq standard 0.808 10 0.00661 Preprocessor05_Model1\n#> # β„Ή 40 more rows\n```\n:::\n\n\n## Tuning results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(lgbm_res, summarize = FALSE)\n#> # A tibble: 500 Γ— 10\n#> id trees min_n learn_rate `agent hash` `company hash` .metric .estimator .estimate .config \n#> \n#> 1 Fold01 298 19 0.00000000415 222 36 mae standard 51.8 Preprocessor01_Model1\n#> 2 Fold01 298 19 0.00000000415 222 36 rsq standard 0.834 Preprocessor01_Model1\n#> 3 Fold02 298 19 0.00000000415 222 36 mae standard 52.1 Preprocessor01_Model1\n#> 4 Fold02 298 19 0.00000000415 222 36 rsq standard 0.801 Preprocessor01_Model1\n#> 5 Fold03 298 19 0.00000000415 222 36 mae standard 52.2 Preprocessor01_Model1\n#> 6 Fold03 298 19 0.00000000415 222 36 rsq standard 0.784 Preprocessor01_Model1\n#> 7 Fold04 298 19 0.00000000415 222 36 mae standard 51.7 Preprocessor01_Model1\n#> 8 Fold04 298 19 0.00000000415 222 36 rsq standard 0.828 Preprocessor01_Model1\n#> 9 Fold05 298 19 0.00000000415 222 36 mae standard 55.2 Preprocessor01_Model1\n#> 10 Fold05 298 19 0.00000000415 222 36 rsq standard 0.850 Preprocessor01_Model1\n#> # β„Ή 490 more rows\n```\n:::\n\n\n## Choose a parameter combination ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshow_best(lgbm_res, metric = \"rsq\")\n#> # A tibble: 5 Γ— 11\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config \n#> \n#> 1 1890 10 0.0159 115 174 rsq standard 0.940 10 0.00369 Preprocessor12_Model1\n#> 2 774 12 0.0441 27 95 rsq standard 0.939 10 0.00381 Preprocessor03_Model1\n#> 3 1638 36 0.0409 15 120 rsq standard 0.938 10 0.00346 Preprocessor16_Model1\n#> 4 963 23 0.00556 157 13 rsq standard 0.930 10 0.00358 Preprocessor06_Model1\n#> 5 590 5 0.00320 85 73 rsq standard 0.905 10 0.00505 Preprocessor24_Model1\n```\n:::\n\n\n## Choose a 
parameter combination ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nCreate your own tibble for final parameters or use one of the `tune::select_*()` functions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlgbm_best <- select_best(lgbm_res, metric = \"mae\")\nlgbm_best\n#> # A tibble: 1 Γ— 6\n#> trees min_n learn_rate `agent hash` `company hash` .config \n#> \n#> 1 774 12 0.0441 27 95 Preprocessor03_Model1\n```\n:::\n\n\n## Checking Calibration ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/probably.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell output-location='column'}\n\n```{.r .cell-code}\nlibrary(probably)\nlgbm_res %>%\n collect_predictions(\n parameters = lgbm_best\n ) %>%\n cal_plot_regression(\n truth = avg_price_per_room,\n estimate = .pred,\n alpha = 1 / 3\n )\n```\n\n::: {.cell-output-display}\n![](figures/lgb-cal-plot-1.svg){width=90%}\n:::\n:::\n\n\n\n## Running in parallel\n\n::: columns\n::: {.column width=\"60%\"}\n- Grid search, combined with resampling, requires fitting a lot of models!\n\n- These models don't depend on one another and can be run in parallel.\n\nWe can use a *parallel backend* to do this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncores <- parallelly::availableCores(logical = FALSE)\ncl <- parallel::makePSOCKcluster(cores)\ndoParallel::registerDoParallel(cl)\n\n# Now call `tune_grid()`!\n\n# Shut it down with:\nforeach::registerDoSEQ()\nparallel::stopCluster(cl)\n```\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/resample-times-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n:::\n:::\n\n## Running in parallel\n\nSpeed-ups are fairly linear up to the number of physical cores (10 here).\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/parallel-speedup-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\n:::notes\nFaceted on the expensiveness of preprocessing used.\n:::\n\n\n## Early stopping for boosted trees {.annotation}\n\nWe have directly optimized the number of trees as a tuning parameter. \n\nInstead we could \n \n - Set the number of trees to a single large number.\n - Stop adding trees when performance gets worse. \n \nThis is known as \"early stopping\" and there is a parameter for that: `stop_iter`.\n\nEarly stopping has a potential to decrease the tuning time. \n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n
\n\n\n*Set `trees = 2000` and tune the `stop_iter` parameter.* \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n10:00\n
\n```\n:::\n:::\n\n::: {.cell}\n\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/advanced-03-racing/execute-results/html.json b/_freeze/archive/2023-07-nyr/advanced-03-racing/execute-results/html.json new file mode 100644 index 00000000..d77f512d --- /dev/null +++ b/_freeze/archive/2023-07-nyr/advanced-03-racing/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "93a0d1db000d6a20b3292980b0c3be5d", + "result": { + "markdown": "---\ntitle: \"3 - Grid Search via Racing\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n## Previously - Setup ![](hexes/tidymodels.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n\nreg_metrics <- metric_set(mae, rsq)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Previously - Data Usage ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n```\n:::\n\n\n## Previously - Boosting Model ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/bonsai.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hotel_rec, lgbm_spec)\n\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\n\n## Making Grid Search More Efficient\n\nIn the last section, we evaluated 250 models (25 candidates times 10 resamples).\n\nWe can make this go faster using 
parallel processing. \n\nAlso, for some models, we can _fit_ far fewer models than the number that are being evaluated. \n \n * For boosting, a model with `X` trees can often predict on candidates with less than `X` trees. \n \nBoth of these methods can lead to enormous speed-ups. \n\n\n## Model Racing \n\n[_Racing_](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=+Hoeffding+racing) is an old tool that we can use to go even faster. \n\n1. Evaluate all of the candidate models but only for a few resamples. \n1. Determine which candidates have a low probability of being selected.\n1. Eliminate poor candidates.\n1. Repeat with next resample (until no more resamples remain) \n\nThis can result in fitting a small number of models. \n\n\n## Discarding Candidates\n\nHow do we eliminate tuning parameter combinations? \n\nThere are a few methods to do so. We'll use one based on analysis of variance (ANOVA). \n\n_However_... there is typically a large difference between resamples in the results. \n\n## Resampling Results (Non-Racing)\n\n:::: {.columns}\n\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\nHere are some realistic (but simulated) examples of two candidate models. \n\nAn error estimate is measured for each of 10 resamples. \n\n - The lines connect resamples. \n\nThere is usually a significant resample-to-resample effect (rank corr: 0.83). \n\n:::\n\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/race-data-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n:::\n\n::::\n\n\n## Are Candidates Different?\n\nOne way to evaluate these models is to do a paired t-test\n \n - or a t-test on their differences matched by resamples\n\nWith $n = 10$ resamples, the confidence interval is (0.99, 2.8), indicating that candidate number 2 has smaller error. \n\nWhat if we were to compare each model candidate to the current best at each resample? \n\nOne shows superiority when 4 resamples have been evaluated.\n\n\n## Evaluating Differences in Candidates\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/race-ci-1.svg){fig-align='center' width=70%}\n:::\n:::\n\n\n## Interim Analysis of Results\n\nOne version of racing uses a _mixed model ANOVA_ to construct one-sided confidence intervals for each candidate versus the current best. \n\nAny candidates whose bound does not include zero are discarded. [Here](https://www.tmwr.org/race_results.mp4) is an animation.\n\nThe resamples are analyzed in a random order.\n\n
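To make the interim analysis a little more concrete, here is a conceptual sketch of the filtering step. It is **not** the finetune internals; `interim_mae` is a hypothetical data frame with one row per candidate (`.config`) per completed resample (`id`), and the current best candidate is assumed to be the reference level of `.config`.

```r
# Conceptual sketch of the interim ANOVA filter (not the finetune internals)
library(lme4)

# mixed model: candidates as fixed effects, resamples as a random effect
interim_fit <- lmer(mae ~ .config + (1 | id), data = interim_mae)

# interval estimates for each candidate's difference from the reference level
# (the current best); candidates whose bound excludes zero would be dropped
confint(interim_fit, method = "Wald")
```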
\n\n[Kuhn (2014)](https://arxiv.org/abs/1405.6974) has examples and simulations to show that the method works. \n\nThe [finetune](https://finetune.tidymodels.org/) package has functions `tune_race_anova()` and `tune_race_win_loss()`. \n\n\n## Racing ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/finetune.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"} ![](hexes/bonsai.png){.absolute top=-20 right=256 width=\"64\" height=\"74.24\"}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"1,8|\"}\n# Let's use a larger grid\nset.seed(8945)\nlgbm_grid <- \n lgbm_param %>% \n grid_latin_hypercube(size = 50)\n\nlibrary(finetune)\n\nset.seed(9)\nlgbm_race_res <-\n lgbm_wflow %>%\n tune_race_anova(\n resamples = hotel_rs,\n grid = lgbm_grid, \n metrics = reg_metrics\n )\n```\n:::\n\n\nThe syntax and helper functions are extremely similar to those shown for `tune_grid()`. \n\n\n## Racing Results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshow_best(lgbm_race_res, metric = \"mae\")\n#> # A tibble: 2 Γ— 11\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config \n#> \n#> 1 1014 5 0.0791 35 181 mae standard 10.3 10 0.202 Preprocessor06_Model1\n#> 2 1516 7 0.0421 176 12 mae standard 10.4 10 0.200 Preprocessor42_Model1\n```\n:::\n\n\n## Racing Results ![](hexes/finetune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n:::: {.columns}\n\n::: {.column width=\"50%\"}\nOnly 170 models were fit (out of 500). \n\n`select_best()` never considers candidate models that did not get to the end of the race. \n\nThere is a helper function to see how candidate models were removed from consideration. \n\n:::\n\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nplot_race(lgbm_race_res) + \n scale_x_continuous(breaks = pretty_breaks())\n```\n\n::: {.cell-output-display}\n![](figures/plot-race-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n:::\n\n::::\n\n\n## Your turn {transition=\"slide-in\"}\n\n- *Run `tune_race_anova()` with a different seed.*\n- *Did you get the same or similar results?*\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n10:00\n
\n```\n:::\n:::\n\n::: {.cell}\n\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/advanced-04-iterative/execute-results/html.json b/_freeze/archive/2023-07-nyr/advanced-04-iterative/execute-results/html.json new file mode 100644 index 00000000..19ce73d7 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/advanced-04-iterative/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "8835ce5fc026395ffb3d1a28b06c4c9f", + "result": { + "markdown": "---\ntitle: \"4 - Iterative Search\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n## Previously - Setup\n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Previously - Data Usage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n```\n:::\n\n\n## Previously - Boosting Model\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hotel_rec, lgbm_spec)\n\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\n## Iterative Search\n\nInstead of pre-defining a grid of candidate points, we can model our current results to predict what the next candidate point should be. \n\n
\n\nSuppose that we are only tuning the learning rate in our boosted tree. \n\n
\n\nWe could do something like: \n\n```r\nmae_pred <- lm(mae ~ learn_rate, data = resample_results)\n```\n\nand use this to predict and rank new learning rate candidates. \n\n\n## Iterative Search\n\nA linear model probably isn't the best choice though (more in a minute). \n\nTo illustrate the process, we resampled a large grid of learning rate values for our data to show what the relationship is between MAE and learning rate. \n\nNow suppose that we used a grid of three points in the parameter range for learning rate...\n\n\n## A Large Grid\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/grid-large-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## A Three Point Grid\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/grid-large-sampled-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n## Gaussian Processes and Optimization\n\nWe can make a \"meta-model\" with a small set of historical performance results. \n\n[Gaussian Processes](https://gaussianprocess.org/gpml/) (GP) models are a good choice to model performance. \n\n- It is a Bayesian model so we are using **Bayesian Optimization (BO)**.\n- For regression, we can assume that our data are multivariate normal. \n- We also define a _covariance_ function for the variance relationship between data points. A common one is:\n\n$$\\operatorname{cov}(\\boldsymbol{x}_i, \\boldsymbol{x}_j) = \\exp\\left(-\\frac{1}{2}|\\boldsymbol{x}_i - \\boldsymbol{x}_j|^2\\right) + \\sigma^2_{ij}$$\n\n\n:::notes\nGPs are good because \n\n- they are flexible regression models (in the sense that splines are flexible). \n- we need to get mean and variance predictions (and they are Bayesian)\n- their variability is based on spatial distances.\n\nSome people use random forests (with conformal variance estimates) or other methods but GPs are most popular.\n:::\n\n\n## Predicting Candidates\n\nThe GP model can take candidate tuning parameter combinations as inputs and make predictions for performance (e.g. MAE)\n\n - The _mean_ performance\n - The _variance_ of performance \n \nThe variance is mostly driven by spatial variability (the previous equation). \n\nThe predicted variance is zero at locations of actual data points and becomes very high when far away from any observed data. \n\n\n## Your turn {transition=\"slide-in\"}\n\n:::: {.columns}\n\n::: {.column width=\"50%\"}\n\n*Your GP makes predictions on two new candidate tuning parameters.* \n\n*We want to minimize MAE.* \n\n*Which should we choose?*\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/two-candidates-1.svg){width=100%}\n:::\n:::\n\n:::\n\n::::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n\n\n## GP Fit (ribbon is mean +/- 1SD)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/gp-iter-0-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Choosing New Candidates\n\nThis isn't a very good fit but we can still use it.\n\nHow can we use the outputs to choose the next point to measure?\n\n
\n\n[_Acquisition functions_](https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html) take the predicted mean and variance and use them to balance: \n\n - _exploration_: new candidates should explore new areas.\n - _exploitation_: new candidates must stay near existing values. \n\nExploration focuses on the variance, exploitation is about the mean. \n\n## Acquisition Functions\n\nWe'll use an acquisition function to select a new candidate.\n\nThe most popular method appears to be _expected improvement_ ([EI](https://arxiv.org/pdf/1911.12809.pdf)) above the current best results. \n \n - Zero at existing data points. \n - The _expected_ improvement is integrated over all possible improvement (\"expected\" in the probability sense). \n\nWe would probably pick the point with the largest EI as the next point. \n\n(There are other functions beyond EI.)\n\n## Expected Improvement\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/gp-iter-0-ei-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n## Iteration\n\nOnce we pick the candidate point, we measure performance for it (e.g. resampling). \n\n
\n\nAnother GP is fit, EI is recomputed, and so on. \n\n
\n\nWe stop when we have completed the allowed number of iterations _or_ if we don't see any improvement after a pre-set number of attempts. \n\n\n## GP Fit with four points\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/gp-iter-1-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Expected Improvement\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/gp-iter-1-ei-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## GP Evolution\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](animations/anime_gp.gif){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Expected Improvement Evolution\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](animations/anime_improvement.gif){fig-align='center' width=50%}\n:::\n:::\n\n\n## BO in tidymodels\n\nWe'll use a function called `tune_bayes()` that has very similar syntax to `tune_grid()`. \n\n
\n\nIt has an additional `initial` argument for the initial set of performance estimates and parameter combinations for the GP model. \n\n## Initial grid points\n\n`initial` can be the results of another `tune_*()` function or an integer (in which case `tune_grid()` is used under to hood to make such an initial set of results).\n \n - We'll run the optimization more than once, so let's make an initial grid of results to serve as the substrate for the BO. \n\n - I suggest at least the number of tuning parameters plus two as the initial grid for BO. \n\n## An Initial Grid\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_metrics <- metric_set(mae, rsq)\n\nset.seed(12)\ninit_res <-\n lgbm_wflow %>%\n tune_grid(\n resamples = hotel_rs,\n grid = nrow(lgbm_param) + 2,\n param_info = lgbm_param,\n metrics = reg_metrics\n )\n\nshow_best(init_res, metric = \"mae\")\n#> # A tibble: 5 Γ— 11\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config \n#> \n#> 1 390 10 0.0139 13 62 mae standard 11.9 10 0.208 Preprocessor1_Model1\n#> 2 718 31 0.00112 72 25 mae standard 29.1 10 0.325 Preprocessor4_Model1\n#> 3 1236 22 0.0000261 11 17 mae standard 51.8 10 0.416 Preprocessor7_Model1\n#> 4 1044 25 0.00000832 34 12 mae standard 52.8 10 0.424 Preprocessor5_Model1\n#> 5 1599 7 0.0000000402 254 179 mae standard 53.2 10 0.427 Preprocessor6_Model1\n```\n:::\n\n\n## BO using tidymodels\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4,6-8|\"}\nset.seed(15)\nlgbm_bayes_res <-\n lgbm_wflow %>%\n tune_bayes(\n resamples = hotel_rs,\n initial = init_res, # <- initial results\n iter = 20,\n param_info = lgbm_param,\n metrics = reg_metrics\n )\n\nshow_best(lgbm_bayes_res, metric = \"mae\")\n#> # A tibble: 5 Γ— 12\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config .iter\n#> \n#> 1 1665 2 0.0593 12 59 mae standard 10.1 10 0.173 Iter13 13\n#> 2 1179 2 0.0552 161 121 mae standard 10.2 10 0.147 Iter7 7\n#> 3 1609 6 0.0592 186 192 mae standard 10.2 10 0.195 Iter17 17\n#> 4 1352 6 0.0799 217 46 mae standard 10.3 10 0.211 Iter4 4\n#> 5 1647 4 0.0819 12 240 mae standard 10.3 10 0.198 Iter20 20\n```\n:::\n\n\n\n## Plotting BO Results\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\")\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-marginals-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Plotting BO Results\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\", type = \"parameters\")\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-param-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Plotting BO Results\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\", type = \"performance\")\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-perf-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## ENHANCE\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\", type = \"performance\") +\n ylim(c(9.5, 14))\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-perf-zoomed-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Your turn {transition=\"slide-in\"}\n\n*Let's try a different acquisition function: `conf_bound(kappa)`.*\n\n*We'll use the `objective` argument to set it.*\n\n*Choose your own `kappa` value:*\n\n - *Larger values will explore the space more.* \n - *\"Large\" values are usually 
less than one.*\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n10:00\n
\n```\n:::\n:::\n\n\n## Notes\n\n- Stopping `tune_bayes()` will return the current results. \n\n- Parallel processing can still be used to more efficiently measure each candidate point. \n\n- There are [a lot of other iterative methods](https://github.com/topepo/Optimization-Methods-for-Tuning-Predictive-Models) that you can use. \n\n- The finetune package also has functions for [simulated annealing](https://www.tmwr.org/iterative-search.html#simulated-annealing) search. \n\n## Finalizing the Model\n\nLet's say that we've tried a lot of different models and we like our lightgbm model the most. \n\nWhat do we do now? \n\n * Finalize the workflow by choosing the values for the tuning parameters. \n * Fit the model on the entire training set. \n * Verify performance using the test set. \n * Document and publish the model(?)\n \n## Locking Down the Tuning Parameters\n\nWe can take the results of the Bayesian optimization and accept the best results: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbest_param <- select_best(lgbm_bayes_res, metric = \"mae\")\nfinal_wflow <- \n lgbm_wflow %>% \n finalize_workflow(best_param)\nfinal_wflow\n#> ══ Workflow ══════════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: boost_tree()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> 4 Recipe Steps\n#> \n#> β€’ step_YeoJohnson()\n#> β€’ step_dummy_hash()\n#> β€’ step_dummy_hash()\n#> β€’ step_zv()\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> Boosted Tree Model Specification (regression)\n#> \n#> Main Arguments:\n#> trees = 1665\n#> min_n = 2\n#> learn_rate = 0.0592557571004946\n#> \n#> Computational engine: lightgbm\n```\n:::\n\n\n## The Final Fit\n\nWe can use individual functions: \n\n```r\nfinal_fit <- final_wflow %>% fit(data = hotel_tr)\n\n# then predict() or augment() \n# then compute metrics\n```\n\n
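One way to spell out those last two comments, using the objects created earlier in these slides (a sketch that assumes `final_fit` from the block above):

```r
# predict on the test set and append the predictions
test_pred <- augment(final_fit, new_data = hotel_te)

# compute the same metrics used during tuning
test_pred %>% reg_metrics(truth = avg_price_per_room, estimate = .pred)
```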
\n\nRemember that there is also a convenience function to do all of this: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(3893)\nfinal_res <- final_wflow %>% last_fit(hotel_split, metrics = reg_metrics)\nfinal_res\n#> # Resampling results\n#> # Manual resampling \n#> # A tibble: 1 Γ— 6\n#> splits id .metrics .notes .predictions .workflow \n#> \n#> 1 train/test split \n```\n:::\n\n\n## Test Set Results\n\n:::: {.columns}\n\n::: {.column width=\"65%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\nfinal_res %>% \n collect_predictions() %>% \n cal_plot_regression(\n truth = avg_price_per_room, \n estimate = .pred, \n alpha = 1 / 4)\n```\n:::\n\n\nTest set performance: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinal_res %>% collect_metrics()\n#> # A tibble: 2 Γ— 4\n#> .metric .estimator .estimate .config \n#> \n#> 1 mae standard 10.5 Preprocessor1_Model1\n#> 2 rsq standard 0.937 Preprocessor1_Model1\n```\n:::\n\n:::\n\n::: {.column width=\"35%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/test-cal-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n:::\n\n::::\n\n\n::: {.cell}\n\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/annotations/execute-results/html.json b/_freeze/archive/2023-07-nyr/annotations/execute-results/html.json new file mode 100644 index 00000000..8f3b7f42 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/annotations/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "34abff9197cbe142abc26d11f45a3f71", + "result": { + "markdown": "---\ntitle: \"Annotations\"\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n
\n\n# 01 - Introduction\n\n## πŸ‘€\n\nThis page contains _annotations_ for selected slides. \n\nThere's a lot that we want to tell you. We don't want people to have to frantically scribble down things that we say that are not on the slides. \n\nWe've added sections to this document with longer explanations and links to other resources. \n\n
\n\n# 02 - Data Budget\n\n## The initial split\n\nWhat does `set.seed()` do? \n\nWe’ll use pseudo-random numbers (PRN) to partition the data into training and testing. PRN are numbers that emulate truly random numbers (but really are not truly random). \n\nThink of PRN as a box that takes a starting value (the \"seed\") that produces random numbers using that starting value as an input into its process. \n\nIf we know a seed value, we can reproduce our \"random\" numbers. To use a different set of random numbers, choose a different seed value. \n\nFor example: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nrunif(3)\n#> [1] 0.2655087 0.3721239 0.5728534\n\n# Get a new set of random numbers:\nset.seed(2)\nrunif(3)\n#> [1] 0.1848823 0.7023740 0.5733263\n\n# We can reproduce the old ones with the same seed\nset.seed(1)\nrunif(3)\n#> [1] 0.2655087 0.3721239 0.5728534\n```\n:::\n\n\nIf we _don’t_ set the seed, R uses the clock time and the process ID to create a seed. This isn’t reproducible. \n\nSince we want our code to be reproducible, we set the seeds before random numbers are used. \n\nIn theory, you can set the seed once at the start of a script. However, if we do interactive data analysis, we might unwittingly use random numbers while coding. In that case, the stream is not the same and we don’t get reproducible results. \n\nThe value of the seed is an integer and really has no meaning. Max has a script to generate random integers to use as seeds to \"spread the randomness around\". It is basically:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncat(paste0(\"set.seed(\", sample.int(10000, 5), \")\", collapse = \"\\n\"))\n#> set.seed(9725)\n#> set.seed(8462)\n#> set.seed(4050)\n#> set.seed(8789)\n#> set.seed(1301)\n```\n:::\n\n\n
\n\n# 03 - What Makes A Model?\n\n## What is wrong with this? \n\nIf we treat the preprocessing as a separate task, it raises the risk that we might accidentally overfit to the data at hand. \n\nFor example, someone might estimate something from the entire data set (such as the principle components) and treat that data as if it were known (and not estimated). Depending on the what was done with the data, consequences in doing that could be:\n\n* Your performance metrics are slightly-to-moderately optimistic (e.g. you might think your accuracy is 85% when it is actually 75%)\n* A consequential component of the analysis is not right and the model just doesn’t work. \n\nThe big issue here is that you won’t be able to figure this out until you get a new piece of data, such as the test set. \n\nA really good example of this is in [β€˜Selection bias in gene extraction on the basis of microarray gene-expression data’](https://pubmed.ncbi.nlm.nih.gov/11983868/). The authors re-analyze a previous publication and show that the original researchers did not include feature selection in the workflow. Because of that, their performance statistics were extremely optimistic. In one case, they could do the original analysis on complete noise and still achieve zero errors. \n\nGenerally speaking, this problem is referred to as [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)). Some other references: \n\n * [Overfitting to Predictors and External Validation](https://bookdown.org/max/FES/selection-overfitting.html)\n * [Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/757b505cfd34c64c85ca5b5690ee5293-Abstract-round2.html)\n * [Navigating the pitfalls of applying machine learning in genomics](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Navigating+the+pitfalls+of+applying+machine+learning+in+genomics&btnG=)\n * [A review of feature selection techniques in bioinformatics](https://academic.oup.com/bioinformatics/article/23/19/2507/185254)\n * [On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation](https://www.jmlr.org/papers/volume11/cawley10a/cawley10a.pdf)\n\n
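To make the safer pattern concrete, here is a minimal sketch. The object names (`train_data`, `outcome`) are placeholders; the point is that estimation steps such as PCA live inside a recipe in a workflow, so they are re-estimated within every resample instead of once on the full data set.

```r
library(tidymodels)

# preprocessing is bundled with the model...
pca_wflow <-
  workflow() %>%
  add_recipe(
    recipe(outcome ~ ., data = train_data) %>%
      step_normalize(all_numeric_predictors()) %>%
      step_pca(all_numeric_predictors(), num_comp = 5)
  ) %>%
  add_model(linear_reg())

# ...so fit_resamples() re-runs the recipe inside each analysis set and the
# PCA loadings never see the corresponding assessment set
set.seed(101)
pca_res <- fit_resamples(pca_wflow, resamples = vfold_cv(train_data, v = 10))
```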
\n\n# 04 - Evaluating Models\n\n## Where are the fitted models?\n\nThe primary purpose of resampling is to estimate model performance. The models are almost never needed again. \n\nAlso, if the data set is large, the model objects may require a lot of memory to save, so we don't keep them by default. \n\nFor more advanced use cases, you can extract and save them. See:\n\n * \n * (an example)\n\n\n## Validation set\n\nThe upcoming version of the rsample package (1.2.0) will have a new set of functions specific to validation sets. They will allow you to make an initial _three-way split_ and still use a validation set with the tune package. \n\n
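Returning to "Where are the fitted models?" above: if you do need them, a minimal sketch (with a hypothetical workflow `wflow` and resample object `folds`) passes an `extract` function through the control object:

```r
# keep the parsnip fit from each resample
ctrl <- control_resamples(extract = function(x) extract_fit_parsnip(x))

res <- fit_resamples(wflow, resamples = folds, control = ctrl)

# a list column with one extracted result per resample
res$.extracts
```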
\n\n# 06 - Tuning Hyperparameters\n\n## Update parameter ranges\n\nIn about 90% of the cases, the dials function that you use to update the parameter range has the same name as the argument. For example, if you were to update the `mtry` parameter in a random forests model, the code would look like\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparameter_object %>% \n update(mtry = mtry(c(1, 100)))\n```\n:::\n\n\nThere are some cases where the parameter function, or its associated values, are different from the argument name. \n\nFor example, with `step_spline_naturall()`, we might want to tune the `deg_free` argument (for the degrees of freedom of a spline function. ). In this case, the argument name is `deg_free` but we update it with `spline_degree()`. \n\n`deg_free` represents the general concept of degrees of freedom and could be associated with many different things. For example, if we ever had an argument that was the number of degrees of freedom for a $t$ distribution, we would call that argument `deg_free`. \n\nFor splines, we probably want a wider range for the degrees of freedom. We made a specialized function called `spline_degree()` to be used in these cases. \n\nHow can you tell when this happens? There is a helper function called `tunable()` and that gives information on how we make the default ranges for parameters. There is a column in these objects names `call_info`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nns_tunable <- \n recipe(mpg ~ ., data = mtcars) %>% \n step_spline_natural(dis, deg_free = tune()) %>% \n tunable()\n\nns_tunable\n#> # A tibble: 1 Γ— 5\n#> name call_info source component component_id \n#> \n#> 1 deg_free recipe step_spline_natural spline_natural_P1Tjg\nns_tunable$call_info\n#> [[1]]\n#> [[1]]$pkg\n#> [1] \"dials\"\n#> \n#> [[1]]$fun\n#> [1] \"spline_degree\"\n#> \n#> [[1]]$range\n#> [1] 2 15\n```\n:::\n\n\n\n## Early stopping for boosted trees\n\nWhen deciding on the number of boosting iterations, there are two main strategies:\n\n * Directly tune it (`trees = tune()`)\n \n * Set it to one value and tune the number of early stopping iterations (`trees = 500`, `stop_iter = tune()`).\n\nEarly stopping is when we monitor the performance of the model. If the model doesn't make any improvements for `stop_iter` iterations, training stops. \n\nHere's an example where, after eleven iterations, performance starts to get worse. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/early-stopping-1.svg)\n:::\n:::\n\n\nThis is likely due to over-fitting so we stop the model at eleven boosting iterations. \n\nEarly stopping usually has good results and takes far less time. \n\nWe _could_ an engine argument called `validation` here. That's not an argument to any function in the lightgbm package. \n\nbonsai has its own wrapper around (`lightgbm::lgb.train()`) called `bonsai::train_lightgbm()`. We use that here and it has a `validation` argument.\n\nHow would you know that? There are a few different ways:\n\n * Look at the documentation in `?boost_tree` and click on the `lightgbm` entry in the engine list. \n * Check out the pkgdown reference website \n * Run the `translate()` function on the parsnip specification object. \n\nThe first two options are best since they tell you a lot more about the particularities of each model engine (there are a lot for lightgbm). \n\n
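A minimal sketch of that second strategy, using the arguments described above (the recipe and workflow around the model are omitted, and the 10% validation proportion is just an example value):

```r
library(tidymodels)
library(bonsai)

# fix the number of trees, tune the early-stopping patience, and hold out
# 10% of the analysis set to monitor performance via bonsai's `validation`
lgbm_stop_spec <-
  boost_tree(trees = 500, stop_iter = tune(), learn_rate = tune()) %>%
  set_mode("regression") %>%
  set_engine("lightgbm", validation = 0.1)

# translate() shows how these arguments map to bonsai::train_lightgbm()
translate(lgbm_stop_spec)
```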
\n\n# Extras - Effect Encodings\n\n## Per-agent statistics\n\nThe effect encoding method essentially takes the effect of a variable, like agent, and makes a data column for that effect. In our example, affect of the agent on the ADR is quantified by a model and then added as a data column to be used in the model. \n\nSuppose agent Max has a single reservation in the data and it had an ADR of €200. If we used a naive estimate for Max’s effect, the model is being told that Max should always produce an effect of €200. That’s a very poor estimate since it is from a single data point. \n\nContrast this with seasoned agent Davis, who has taken 250 reservations with an average ADR of €100. Davis’s mean is more predictive because it is estimated with better data (i.e., more total reservations). \nPartial pooling leverages the entire data set and can borrow strength from all of the agents. It is a common tool in Bayesian estimation and non-Bayesian mixed models. If a agent’s data is of good quality, the partial pooling effect estimate is closer to the raw mean. Max’s data is not great and is \"shrunk\" towards the center of the overall average. Since there is so little known about Max’s reservation history, this is a better effect estimate (until more data is available for him). \n\nThe Stan documentation has a pretty good vignette on this: \n\nAlso, _Bayes Rules!_ has a nice section on this: \n\nSince this example has a numeric outcome, partial pooling is very similar to the James–Stein estimator: \n\n## Agent effects\n\nEffect encoding might result in a somewhat circular argument: the column is more likely to be important to the model since it is the output of a separate model. The risk here is that we might over-fit the effect to the data. For this reason, it is super important to make sure that we verify that we aren’t overfitting by checking with resampling (or a validation set). \n\nPartial pooling somewhat lowers the risk of overfitting since it tends to correct for agents with small sample sizes. It can’t correct for improper data usage or data leakage though. 
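To make the shrinkage described under "Per-agent statistics" concrete: schematically, for a simple Gaussian random-intercept model, the partially pooled estimate for agent $j$ is a weighted average of that agent's mean and the overall mean, with a weight that grows with the agent's number of reservations $n_j$:

$$\hat{\mu}_j \approx \lambda_j \, \bar{y}_j + (1 - \lambda_j) \, \bar{y}, \qquad \lambda_j = \frac{n_j}{n_j + \sigma^2/\tau^2}$$

where $\bar{y}_j$ is agent $j$'s mean ADR, $\bar{y}$ is the overall mean ADR, $\sigma^2$ is the within-agent variance, and $\tau^2$ is the between-agent variance. With $n_j = 1$ (Max), $\lambda_j$ is small and the estimate is pulled strongly toward $\bar{y}$; with $n_j = 250$ (Davis), the estimate stays close to $\bar{y}_j$.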
\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/extras-effect-encodings/execute-results/html.json b/_freeze/archive/2023-07-nyr/extras-effect-encodings/execute-results/html.json new file mode 100644 index 00000000..d81c9f22 --- /dev/null +++ b/_freeze/archive/2023-07-nyr/extras-effect-encodings/execute-results/html.json @@ -0,0 +1,18 @@ +{ + "hash": "8ce23d67b651258eabce1cb1cb8b0cfc", + "result": { + "markdown": "---\ntitle: \"Extras - Effect Encodings\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n## Previously - Setup\n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Previously - Data Usage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n```\n:::\n\n\n\n## What do we do with the agent and company data? \n\nThere are 98 unique agent values and 100 companies in our training set. How can we include this information in our model?\n\n. . .\n\nWe could:\n\n- make the full set of indicator variables 😳\n\n- lump agents and companies that rarely occur into an \"other\" group\n\n- use [feature hashing](https://www.tmwr.org/categorical.html#feature-hashing) to create a smaller set of indicator variables\n\n- use effect encoding to replace the `agent` and `company` columns with the estimated effect of that predictor\n\n\n\n\n\n\n\n\n## Per-agent statistics {.annotation}\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-freq-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-adr-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n:::\n\n\n## What is an effect encoding?\n\nWe replace the qualitative’s predictor data with their _effect on the outcome_. 
\n\n::: columns\n::: {.column width=\"50%\"}\nData before:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbefore\n#> # A tibble: 7 Γ— 3\n#> avg_price_per_room agent .row\n#> \n#> 1 52.7 cynthia_worsley 1\n#> 2 51.8 carlos_bryant 2\n#> 3 53.8 lance_hitchcock 3\n#> 4 51.8 lance_hitchcock 4\n#> 5 46.8 cynthia_worsley 5\n#> 6 54.7 charles_najera 6\n#> 7 46.8 cynthia_worsley 7\n```\n:::\n\n\n:::\n\n::: {.column width=\"50%\"}\n\nData after:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nafter\n#> # A tibble: 7 Γ— 3\n#> avg_price_per_room agent .row\n#> \n#> 1 52.7 88.5 1\n#> 2 51.8 89.5 2\n#> 3 53.8 79.8 3\n#> 4 51.8 79.8 4\n#> 5 46.8 88.5 5\n#> 6 54.7 109. 6\n#> 7 46.8 88.5 7\n```\n:::\n\n\n:::\n:::\n\nThe `agent` column is replaced with an estimate of the ADR. \n\n\n## Per-agent statistics again \n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-again-1.svg){fig-align='center' width=90%}\n:::\n\n::: {.cell-output-display}\n![](figures/effects-again-2.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n- Good statistical methods for estimating these means use *partial pooling*.\n\n\n- Pooling borrows strength across agents and shrinks extreme values towards the mean for agents with very few transations\n\n\n- The embed package has recipe steps for effect encodings.\n\n:::\n:::\n\n\n:::notes\nPartial pooling gives better estimates for agents with fewer reservations by shrinking the estimate to the overall ADR mean\n\n\n:::\n\n## Partial pooling\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effect-compare-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Agent effects ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/embed.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"1,6|\"}\nlibrary(embed)\n\nhotel_effect_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_YeoJohnson(lead_time) %>%\n step_lencode_mixed(agent, company, outcome = vars(avg_price_per_room)) %>%\n step_dummy(all_nominal_predictors()) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n. . 
.\n\nIt is very important to appropriately validate the effect encoding step to make sure that we are not overfitting.\n\n## Effect encoding results ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/embed.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4|\"}\nhotel_effect_wflow <-\n workflow() %>%\n add_model(linear_reg()) %>% \n update_recipe(hotel_effect_rec)\n\nreg_metrics <- metric_set(mae, rsq)\n\nhotel_effect_res <-\n hotel_effect_wflow %>%\n fit_resamples(hotel_rs, metrics = reg_metrics)\n\ncollect_metrics(hotel_effect_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 mae standard 17.8 10 0.236 Preprocessor1_Model1\n#> 2 rsq standard 0.867 10 0.00377 Preprocessor1_Model1\n```\n:::\n\n\nSlightly worse but it can handle new agents (if they occur).\n\n\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_freeze/archive/2023-07-nyr/extras-transit-case-study/execute-results/html.json b/_freeze/archive/2023-07-nyr/extras-transit-case-study/execute-results/html.json new file mode 100644 index 00000000..8479e73b --- /dev/null +++ b/_freeze/archive/2023-07-nyr/extras-transit-case-study/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "ab0cf007fdfc7dcac22e6319151e49ee", + "result": { + "markdown": "---\ntitle: \"Case Study on Transportation\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n## Chicago L-Train data\n\nSeveral years worth of pre-pandemic data were assembled to try to predict the daily number of people entering the Clark and Lake elevated (\"L\") train station in Chicago. \n\n\nMore information: \n\n- Several Chapters in _Feature Engineering and Selection_. \n\n - Start with [Section 4.1](https://bookdown.org/max/FES/chicago-intro.html) \n - See [Section 1.3](https://bookdown.org/max/FES/a-more-complex-example.html)\n\n- Video: [_The Global Pandemic Ruined My Favorite Data Set_](https://www.youtube.com/watch?v=KkpKSqbGnBA)\n\n\n## Predictors\n\n- the 14-day lagged ridership at this and other stations (units: thousands of rides/day)\n- weather data\n- home/away game schedules for Chicago teams\n- the date\n\nThe data are in `modeldata`. See `?Chicago`. \n\n\n## L Train Locations\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n:::\n\n\n## Your turn: Explore the Data\n\n*Take a look at these data for a few minutes and see if you can find any interesting characteristics in the predictors or the outcome.* \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(rules)\ndata(\"Chicago\")\ndim(Chicago)\n#> [1] 5698 50\nstations\n#> [1] \"Austin\" \"Quincy_Wells\" \"Belmont\" \"Archer_35th\" \n#> [5] \"Oak_Park\" \"Western\" \"Clark_Lake\" \"Clinton\" \n#> [9] \"Merchandise_Mart\" \"Irving_Park\" \"Washington_Wells\" \"Harlem\" \n#> [13] \"Monroe\" \"Polk\" \"Ashland\" \"Kedzie\" \n#> [17] \"Addison\" \"Jefferson_Park\" \"Montrose\" \"California\"\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n\n## Splitting with Chicago data ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nLet's put the last two weeks of data into the test set. `initial_time_split()` can be used for this purpose:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(Chicago)\n\nchi_split <- initial_time_split(Chicago, prop = 1 - (14/nrow(Chicago)))\nchi_split\n#> \n#> <5684/14/5698>\n\nchi_train <- training(chi_split)\nchi_test <- testing(chi_split)\n\n## training\nnrow(chi_train)\n#> [1] 5684\n \n## testing\nnrow(chi_test)\n#> [1] 14\n```\n:::\n\n\n## Time series resampling \n\nOur Chicago data is over time. Regular cross-validation, which uses random sampling, may not be the best idea. \n\nWe can emulate our training/test split by making similar resamples. \n\n* Fold 1: Take the first X years of data as the analysis set, the next 2 weeks as the assessment set.\n\n* Fold 2: Take the first X years + 2 weeks of data as the analysis set, the next 2 weeks as the assessment set.\n\n* and so on\n\n## Rolling forecast origin resampling \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/rolling.svg){fig-align='center' width=70%}\n:::\n:::\n\n\n:::notes\nThis image shows overlapping assessment sets. We will use non-overlapping data but it could be done wither way.\n:::\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n\n\n\n\n )\n```\n:::\n\n\nUse the `date` column to find the date data. \n\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n period = \"week\",\n\n\n\n )\n```\n:::\n\n\nOur units will be weeks. \n\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n period = \"week\",\n lookback = 52 * 15 \n \n \n )\n```\n:::\n\n\nEvery analysis set has 15 years of data\n\n\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"7|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n period = \"week\",\n lookback = 52 * 15,\n assess_stop = 2,\n\n )\n```\n:::\n\n\nEvery assessment set has 2 weeks of data\n\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"8|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n period = \"week\",\n lookback = 52 * 15,\n assess_stop = 2,\n step = 2 \n )\n```\n:::\n\n\nIncrement by 2 weeks so that there are no overlapping assessment sets. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_rs$splits[[1]] %>% assessment() %>% pluck(\"date\") %>% range()\n#> [1] \"2016-01-07\" \"2016-01-20\"\nchi_rs$splits[[2]] %>% assessment() %>% pluck(\"date\") %>% range()\n#> [1] \"2016-01-21\" \"2016-02-03\"\n```\n:::\n\n\n\n## Our resampling object ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"45%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_rs\n#> # Sliding period resampling \n#> # A tibble: 16 Γ— 2\n#> splits id \n#> \n#> 1 Slice01\n#> 2 Slice02\n#> 3 Slice03\n#> 4 Slice04\n#> 5 Slice05\n#> 6 Slice06\n#> 7 Slice07\n#> 8 Slice08\n#> 9 Slice09\n#> 10 Slice10\n#> 11 Slice11\n#> 12 Slice12\n#> 13 Slice13\n#> 14 Slice14\n#> 15 Slice15\n#> 16 Slice16\n```\n:::\n\n\n:::\n\n::: {.column width=\"5%\"}\n\n:::\n\n::: {.column width=\"50%\"}\n\nWe will fit 16 models on 16 slightly different analysis sets. \n\nEach will produce a separate performance metrics. \n\nWe will average the 16 metrics to get the resampling estimate of that statistic. \n\n:::\n:::\n\n\n## Feature engineering with recipes ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_rec <- \n recipe(ridership ~ ., data = chi_train)\n```\n:::\n\n\nBased on the formula, the function assigns columns to roles of \"outcome\" or \"predictor\"\n\n## A recipe\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(chi_rec)\n#> # A tibble: 50 Γ— 4\n#> variable type role source \n#> \n#> 1 Austin predictor original\n#> 2 Quincy_Wells predictor original\n#> 3 Belmont predictor original\n#> 4 Archer_35th predictor original\n#> 5 Oak_Park predictor original\n#> 6 Western predictor original\n#> 7 Clark_Lake predictor original\n#> 8 Clinton predictor original\n#> 9 Merchandise_Mart predictor original\n#> 10 Irving_Park predictor original\n#> # β„Ή 40 more rows\n```\n:::\n\n\n\n\n## A recipe - work with dates ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3|\"}\nchi_rec <- \n recipe(ridership ~ ., data = chi_train) %>% \n step_date(date, features = c(\"dow\", \"month\", \"year\")) \n```\n:::\n\n\nThis creates three new columns in the data based on the date. Note that the day-of-the-week column is a factor.\n\n\n## A recipe - work with dates ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4|\"}\nchi_rec <- \n recipe(ridership ~ ., data = chi_train) %>% \n step_date(date, features = c(\"dow\", \"month\", \"year\")) %>% \n step_holiday(date) \n```\n:::\n\n\nAdd indicators for major holidays. Specific holidays, especially those non-USA, can also be generated. \n\nAt this point, we don't need `date` anymore. Instead of deleting it (there is a step for that) we will change its _role_ to be an identification variable. 
\n\n:::notes\nWe might want to change the role (instead of removing the column) because it will stay in the data set (even when resampled) and might be useful for diagnosing issues.\n:::\n\n\n## A recipe - work with dates ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5,6|\"}\nchi_rec <- \n recipe(ridership ~ ., data = chi_train) %>% \n step_date(date, features = c(\"dow\", \"month\", \"year\")) %>% \n step_holiday(date) %>% \n update_role(date, new_role = \"id\") %>%\n update_role_requirements(role = \"id\", bake = TRUE)\n```\n:::\n\n\n`date` is still in the data set but tidymodels knows not to treat it as an analysis column. \n\n`update_role_requirements()` is needed to make sure that this column is required when making new data points. \n\n## A recipe - remove constant columns ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"7|\"}\nchi_rec <- \n recipe(ridership ~ ., data = chi_train) %>% \n step_date(date, features = c(\"dow\", \"month\", \"year\")) %>% \n step_holiday(date) %>% \n update_role(date, new_role = \"id\") %>%\n update_role_requirements(role = \"id\", bake = TRUE) %>% \n step_zv(all_nominal_predictors()) \n```\n:::\n\n\n\n## A recipe - handle correlations ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nThe station columns have a very high degree of correlation. \n\nWe might want to decorrelated them with principle component analysis to help the model fits go more easily. \n\nThe vector `stations` contains all station names and can be used to identify all the relevant columns.\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"7|\"}\nchi_pca_rec <- \n chi_rec %>% \n step_normalize(all_of(!!stations)) %>% \n step_pca(all_of(!!stations), num_comp = tune())\n```\n:::\n\n\nWe'll tune the number of PCA components for (default) values of one to four.\n\n## Make some models ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/rules.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=256 width=\"64\" height=\"74.24\"}\n\nLet's try three models. The first one requires the `rules` package (loaded earlier).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncb_spec <- cubist_rules(committees = 25, neighbors = tune())\nmars_spec <- mars(prod_degree = tune()) %>% set_mode(\"regression\")\nlm_spec <- linear_reg()\n\nchi_set <- \n workflow_set(\n list(pca = chi_pca_rec, basic = chi_rec), \n list(cubist = cb_spec, mars = mars_spec, lm = lm_spec)\n ) %>% \n # Evaluate models using mean absolute errors\n option_add(metrics = metric_set(mae))\n```\n:::\n\n\n\n:::notes\nBriefly talk about Cubist being a (sort of) boosted rule-based model and MARS being a nonlinear regression model. Both incorporate feature selection nicely. \n:::\n\n## Process them on the resamples\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Set up some objects for stacking ensembles (in a few slides)\ngrid_ctrl <- control_grid(save_pred = TRUE, save_workflow = TRUE)\n\nchi_res <- \n chi_set %>% \n workflow_map(\n resamples = chi_rs,\n grid = 10,\n control = grid_ctrl,\n verbose = TRUE,\n seed = 12\n )\n```\n:::\n\n\n## How do the results look? 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrank_results(chi_res)\n#> # A tibble: 31 Γ— 9\n#> wflow_id .config .metric mean std_err n preprocessor model rank\n#> \n#> 1 pca_cubist Preprocessor1_Model1 mae 0.798 0.104 16 recipe cubis… 1\n#> 2 pca_cubist Preprocessor3_Model3 mae 0.978 0.110 16 recipe cubis… 2\n#> 3 pca_cubist Preprocessor4_Model2 mae 0.983 0.122 16 recipe cubis… 3\n#> 4 pca_cubist Preprocessor4_Model1 mae 0.991 0.127 16 recipe cubis… 4\n#> 5 pca_cubist Preprocessor3_Model2 mae 0.991 0.113 16 recipe cubis… 5\n#> 6 pca_cubist Preprocessor2_Model2 mae 1.02 0.118 16 recipe cubis… 6\n#> 7 pca_cubist Preprocessor1_Model3 mae 1.05 0.134 16 recipe cubis… 7\n#> 8 basic_cubist Preprocessor1_Model8 mae 1.07 0.115 16 recipe cubis… 8\n#> 9 basic_cubist Preprocessor1_Model7 mae 1.07 0.112 16 recipe cubis… 9\n#> 10 basic_cubist Preprocessor1_Model6 mae 1.07 0.114 16 recipe cubis… 10\n#> # β„Ή 21 more rows\n```\n:::\n\n\n## Plot the results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(chi_res)\n```\n\n::: {.cell-output-display}\n![](figures/set-results-1.svg){fig-align='center'}\n:::\n:::\n\n\n## Pull out specific results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nWe can also pull out the specific tuning results and look at them: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nchi_res %>% \n extract_workflow_set_result(\"pca_cubist\") %>% \n autoplot()\n```\n\n::: {.cell-output-display}\n![](figures/cubist-autoplot-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n_Model stacks_ generate predictions that are informed by several models.\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_01.png)\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_02.png)\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_03.png)\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_04.png)\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_05.png)\n\n## Building a model stack ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(stacks)\n```\n:::\n\n\n1) Define candidate members\n2) Initialize a data stack object\n3) Add candidate ensemble members to the data stack\n4) Evaluate how to combine their predictions\n5) Fit candidate ensemble members with non-zero stacking coefficients\n6) Predict on new data!\n\n\n## Start the stack and add members ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nCollect all of the resampling results for all model configurations. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_stack <- \n stacks() %>% \n add_candidates(chi_res)\n```\n:::\n\n\n\n## Estimate weights for each candidate ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWhich configurations should be retained? Uses a penalized linear model: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(122)\nchi_stack_res <- blend_predictions(chi_stack)\n\nchi_stack_res\n#> # A tibble: 5 Γ— 3\n#> member type weight\n#> \n#> 1 pca_cubist_1_1 cubist_rules 0.343\n#> 2 pca_cubist_3_2 cubist_rules 0.236\n#> 3 basic_cubist_1_4 cubist_rules 0.189\n#> 4 pca_lm_4_1 linear_reg 0.163\n#> 5 pca_cubist_3_3 cubist_rules 0.109\n```\n:::\n\n\n## How did it do? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nThe overall results of the penalized model: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(chi_stack_res)\n```\n\n::: {.cell-output-display}\n![](figures/stack-autoplot-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n\n## What does it use? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(chi_stack_res, type = \"weights\")\n```\n\n::: {.cell-output-display}\n![](figures/stack-members-1.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Fit the required candidate models![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nFor each model we retain in the stack, we need their model fit on the entire training set. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_stack_res <- fit_members(chi_stack_res)\n```\n:::\n\n\n\n## The test set: best Cubist model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nWe can pull out the results and the workflow to fit the single best cubist model. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbest_cubist <- \n chi_res %>% \n extract_workflow_set_result(\"pca_cubist\") %>% \n select_best()\n\ncubist_res <- \n chi_res %>% \n extract_workflow(\"pca_cubist\") %>% \n finalize_workflow(best_cubist) %>% \n last_fit(split = chi_split, metrics = metric_set(mae))\n```\n:::\n\n\n## The test set: stack ensemble![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe don't have `last_fit()` for stacks (yet) so we manually make predictions. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstack_pred <- \n predict(chi_stack_res, chi_test) %>% \n bind_cols(chi_test)\n```\n:::\n\n\n## Compare the results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/stacks.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nSingle best versus the stack:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(cubist_res)\n#> # A tibble: 1 Γ— 4\n#> .metric .estimator .estimate .config \n#> \n#> 1 mae standard 0.670 Preprocessor1_Model1\n\nstack_pred %>% mae(ridership, .pred)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 mae standard 0.689\n```\n:::\n\n\n\n## Plot the test set ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code}\nlibrary(probably)\ncubist_res %>% \n collect_predictions() %>% \n ggplot(aes(ridership, .pred)) + \n geom_point(alpha = 1 / 2) + \n geom_abline(lty = 2, col = \"green\") + \n coord_obs_pred()\n```\n\n::: {.cell-output-display}\n![](figures/obs-pred-1.svg){fig-align='center'}\n:::\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n\n\n\n\n\n\n\n\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/_quarto.yml b/_quarto.yml index ff8a1b21..4010893c 100644 --- a/_quarto.yml +++ b/_quarto.yml @@ -5,10 +5,12 @@ project: - "slides/*qmd" - "archive/2022-07-RStudio-conf/*qmd" - "archive/2022-08-Reykjavik-City/*qmd" + - "archive/2023-07-nyr/*qmd" - "!classwork/" - "!CODE_OF_CONDUCT.md" resources: - "archive/2022-08-Reykjavik-City/classwork/*qmd" + - "archive/2023-07-nyr/classwork/*qmd" output-dir: docs website: diff --git a/archive/2023-07-nyr/.gitignore b/archive/2023-07-nyr/.gitignore new file mode 100644 index 00000000..075b2542 --- /dev/null +++ b/archive/2023-07-nyr/.gitignore @@ -0,0 +1 @@ +/.quarto/ diff --git a/archive/2023-07-nyr/01-introduction.qmd b/archive/2023-07-nyr/01-introduction.qmd new file mode 100644 index 00000000..f39ee08e --- /dev/null +++ b/archive/2023-07-nyr/01-introduction.qmd @@ -0,0 +1,273 @@ +--- +title: "1 - Introduction" +subtitle: "Machine learning with tidymodels" +format: + revealjs: + slide-number: true + footer: + include-before-body: header.html + include-after-body: footer-annotations.html + theme: [default, tidymodels.scss] + width: 1280 + height: 720 +knitr: + opts_chunk: + echo: true + collapse: true + comment: "#>" +--- + +```{r} +#| include: false +#| file: setup.R +``` + +::: r-fit-text +Welcome! +::: + +## Who are you? + +- You can use the magrittr `%>%` or base R `|>` pipe + +- You are familiar with functions from dplyr, tidyr, ggplot2 + +- You have exposure to basic statistical concepts + +- You do **not** need intermediate or expert familiarity with modeling or ML + +## Who are tidymodels? + +- Simon Couch +- Hannah Frick +- Emil Hvitfeldt +- Max Kuhn + +. . . + +Many thanks to Davis Vaughan, Julia Silge, David Robinson, Julie Jung, Alison Hill, and DesirΓ©e De Leon for their role in creating these materials! + +## Asking for help + +. . . + +πŸŸͺ "I'm stuck and need help!" + +. . . 
+ +🟩 "I finished the exercise" + + +## `r emo::ji("eyes")` {.annotation} + +![](images/pointing.svg){.absolute top="0" right="0"} + +## Tentative plan for this workshop + +::: columns +::: {.column width="50%"} +- *Today:* + + - Your data budget + - What makes a model + - Evaluating models +::: +::: {.column width="50%"} +- *Tomorrow:* + + - Feature engineering + - Tuning hyperparameters + - Racing methods + - Iterative search methods +::: +::: + +## {.center} + +### Introduce yourself to your neighbors πŸ‘‹ + +

+ +Check Slack (`#ml-ws-2023`) for an RStudio Cloud link. + +## What is machine learning? + +![](https://imgs.xkcd.com/comics/machine_learning.png){fig-align="center"} + +::: footer + +::: + +## What is machine learning? + +![](images/what_is_ml.jpg){fig-align="center"} + +::: footer +Illustration credit: +::: + +## What is machine learning? + +![](images/ml_illustration.jpg){fig-align="center"} + +::: footer +Illustration credit: +::: + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +. . . + +*How are statistics and machine learning related?* + +*How are they similar? Different?* + +```{r} +#| echo: false +countdown::countdown(minutes = 3, id = "statistics-vs-ml") +``` + +::: notes +the "two cultures" + +model first vs. data first + +inference vs. prediction +::: + +## What is tidymodels? `r hexes("tidymodels")` + +```{r} +#| message: true +library(tidymodels) +``` + +## {background-image="images/tm-org.png" background-size="contain"} + +## The whole game + +Part of any modelling process is + +* Splitting your data into training and test set +* Using a resampling scheme +* Fitting models +* Assessing performance +* Choosing a model +* Fitting and assessing the final model + + +## The whole game + +```{r diagram-split, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-split.jpg") +``` + +## The whole game + +```{r diagram-model-1, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-model-1.jpg") +``` + +:::notes +Stress that we are **not** fitting a model on the entire training set other than for illustrative purposes in deck 2. +::: + +## The whole game + +```{r diagram-model-n, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-model-n.jpg") +``` + +## The whole game + +```{r diagram-resamples, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-resamples.jpg") +``` + +## The whole game + +```{r diagram-select, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-select.jpg") +``` + +## The whole game + +```{r diagram-final-fit, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-final-fit.jpg") +``` + +## The whole game + +```{r diagram-final-performance, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-final-performance.jpg") +``` + + +## Let's install some packages + +If you are using your own laptop instead of RStudio Cloud: + +```{r} +#| eval: false + +install.packages("pak") + +pkgs <- c("bonsai", "doParallel", "embed", "finetune", "lightgbm", "lme4", + "parallelly", "plumber", "probably", "ranger", "rpart", "rpart.plot", + "stacks", "textrecipes", "tidymodels", "tidymodels/modeldatatoo", + "vetiver") +pak::pak(pkgs) +``` + +. . . + +Check Slack (`#ml-ws-2023`) for an RStudio Cloud link. 
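+
+. . .
+
+If you want a quick sanity check that everything installed, something like this should work (a sketch; `basename()` is only there to turn GitHub slugs such as `tidymodels/modeldatatoo` into plain package names):
+
+```{r}
+#| eval: false
+# report any packages that fail to load
+pkg_names <- basename(pkgs)
+pkg_names[!vapply(pkg_names, requireNamespace, logical(1), quietly = TRUE)]
+```
+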
+ + +## Our versions + +```{r pkg-list, echo = FALSE} +deps <- c("bonsai", "doParallel", "embed", "finetune", "lightgbm", "lme4", + "parallelly", "plumber", "probably", "ranger", "rpart", "rpart.plot", + "stacks", "textrecipes", "tidymodels", "modeldatatoo", + "vetiver") +loaded <- purrr::map(deps, ~ library(.x, character.only = TRUE)) +excl <- c("iterators", "emo", "countdown", "stats", "graphics", + "grDevices", "utils", "datasets", "methods", "base", "forcats", + "infer", "foreach", "Matrix", "R6", "parallel", "devtools", "usethis") +loaded <- loaded[[length(loaded)]] +loaded <- loaded[!(loaded %in% excl)] +pkgs <- + sessioninfo::package_info(loaded, dependencies = FALSE) %>% + select(-date) +df <- tibble::tibble( + package = pkgs$package, + version = pkgs$ondiskversion, + source = ifelse(grepl("CRAN", pkgs$source), "CRAN", pkgs$source) +) %>% + mutate( + source = gsub(" (R 4.2.0)", "", source, fixed = TRUE), + source = substr(source, 1, 31), + info = paste0(package, " (", version, ", ", source, ")") + ) +quarto_info <- paste0("Quarto (", system("quarto --version", intern = TRUE), ")") +version_info <- knitr::combine_words(c(df$info, quarto_info)) +``` + +`r version_info` diff --git a/archive/2023-07-nyr/02-data-budget.qmd b/archive/2023-07-nyr/02-data-budget.qmd new file mode 100644 index 00000000..1cf6c5f1 --- /dev/null +++ b/archive/2023-07-nyr/02-data-budget.qmd @@ -0,0 +1,382 @@ +--- +title: "2 - Your data budget" +subtitle: "Machine learning with tidymodels" +format: + revealjs: + slide-number: true + footer: + include-before-body: header.html + include-after-body: footer-annotations.html + theme: [default, tidymodels.scss] + width: 1280 + height: 720 +knitr: + opts_chunk: + echo: true + collapse: true + comment: "#>" + fig.path: "figures/" +--- + +```{r} +#| include: false +#| file: setup.R +``` + +## {background-image="https://media.giphy.com/media/Lr3UeH9tYu3qJtsSUg/giphy.gif" background-size="40%"} + + +## Data on Chicago taxi trips + +::: columns +::: {.column width="60%"} +- The city of Chicago releases anonymized trip-level data on taxi trips in the city. +- We pulled a sample of 10,000 rides occurring in early 2022. +- Type `?modeldatatoo::data_taxi()` to learn more about this dataset, including references. +::: + +::: {.column width="40%"} +![](images/taxi_spinning.svg) +::: + +::: + +::: footer +Credit: +::: + +## Which of these variables can we use? + +```{r} +library(tidymodels) +library(modeldatatoo) + +taxi <- data_taxi() + +names(taxi) +``` + +## Checklist for predictors + +- Is it ethical to use this variable? (Or even legal?) + +- Will this variable be available at prediction time? + +- Does this variable contribute to explainability? + + +## Data on Chicago taxi trips + +We are using a slightly modified version from the modeldatatoo data. + +```{r} +taxi <- taxi %>% + mutate(month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr"))) %>% + select(-c(id, duration, fare, tolls, extras, total_cost, payment_type)) %>% + drop_na() +``` + +## Data on Chicago taxi trips + +::: columns +::: {.column width="60%"} +- `N = 10,000` +- A nominal outcome, `tip`, with levels `"yes"` and `"no"` +- 6 other variables + - `company`, `local`, and `dow`, and `month` are **nominal** predictors + - `distance` and `hours` are **numeric** predictors +::: + +::: {.column width="40%"} +![](images/taxi.png) +::: +::: + +::: footer +Credit: +::: + +:::notes +`tip`: Whether the rider left a tip. A factor with levels "yes" and "no". + +`distance`: The trip distance, in odometer miles. 
+ +`company`: The taxi company, as a factor. Companies that occurred few times were binned as "other". + +`local`: Whether the trip started in the same community area as it began. See the source data for community area values. + +`dow`: The day of the week in which the trip began, as a factor. + +`month`: The month in which the trip began, as a factor. + +`hour`: The hour of the day in which the trip began, as a numeric. + +::: + +## Data on Chicago taxi trips + +```{r} +taxi +``` + + +## Data splitting and spending + +For machine learning, we typically split data into training and test sets: + +. . . + +- The **training set** is used to estimate model parameters. +- The **test set** is used to find an independent assessment of model performance. + +. . . + +Do not 🚫 use the test set during training. + +## Data splitting and spending + +```{r test-train-split} +#| echo: false +#| fig.width: 12 +#| fig.height: 3 +#| +set.seed(123) +library(forcats) +one_split <- slice(taxi, 1:30) %>% + initial_split() %>% + tidy() %>% + add_row(Row = 1:30, Data = "Original") %>% + mutate(Data = case_when( + Data == "Analysis" ~ "Training", + Data == "Assessment" ~ "Testing", + TRUE ~ Data + )) %>% + mutate(Data = factor(Data, levels = c("Original", "Training", "Testing"))) +all_split <- + ggplot(one_split, aes(x = Row, y = fct_rev(Data), fill = Data)) + + geom_tile(color = "white", + size = 1) + + scale_fill_manual(values = splits_pal, guide = "none") + + theme_minimal() + + theme(axis.text.y = element_text(size = rel(2)), + axis.text.x = element_blank(), + legend.position = "top", + panel.grid = element_blank()) + + coord_equal(ratio = 1) + + labs(x = NULL, y = NULL) +all_split +``` + +# The more data
we spend πŸ€‘

the better estimates
we'll get. + +## Data splitting and spending + +- Spending too much data in **training** prevents us from computing a good assessment of predictive **performance**. + +. . . + +- Spending too much data in **testing** prevents us from computing a good estimate of model **parameters**. + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*When is a good time to split your data?* + +```{r} +#| echo: false +countdown(minutes = 3, id = "when-to-split") +``` + +# The testing data is precious πŸ’Ž + +## The initial split `r hexes("rsample")` {.annotation} + +```{r} +set.seed(123) +taxi_split <- initial_split(taxi) +taxi_split +``` + +:::notes +How much data in training vs testing? +This function uses a good default, but this depends on your specific goal/data +We will talk about more powerful ways of splitting, like stratification, later +::: + +## Accessing the data `r hexes("rsample")` + +```{r} +taxi_train <- training(taxi_split) +taxi_test <- testing(taxi_split) +``` + +## The training set`r hexes("rsample")` + +```{r} +taxi_train +``` + +## The test set `r hexes("rsample")` + +πŸ™ˆ + +. . . + +There are `r nrow(taxi_test)` rows and `r ncol(taxi_test)` columns in the test set. + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Split your data so 20% is held out for the test set.* + +*Try out different values in `set.seed()` to see how the results change.* + +```{r} +#| echo: false +countdown(minutes = 5, id = "try-splitting") +``` + +## Data splitting and spending `r hexes("rsample")` + +```{r} +set.seed(123) +taxi_split <- initial_split(taxi, prop = 0.8) +taxi_train <- training(taxi_split) +taxi_test <- testing(taxi_split) + +nrow(taxi_train) +nrow(taxi_test) +``` + +# What about a validation set? 
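+
+One option is to carve a third piece out of the training data with rsample; a minimal sketch (you'll get to try this yourself in the resampling exercises later):
+
+```{r}
+#| eval: false
+set.seed(123)
+# hold out part of the training data for validation
+validation_split(taxi_train, prop = 0.8)
+```
+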
+ +## {background-color="white" background-image="https://www.tmwr.org/premade/validation.svg" background-size="50%"} + +:::notes +We will use this tomorrow +::: + +## {background-color="white" background-image="https://www.tmwr.org/premade/validation-alt.svg" background-size="40%"} + +# Exploratory data analysis for ML 🧐 + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Explore the `taxi_train` data on your own!* + +* *What's the distribution of the outcome, tip?* +* *What's the distribution of numeric variables like distance?* +* *How does tip differ across the categorical variables?* + +```{r} +#| echo: false +countdown(minutes = 8, id = "explore-taxi") +``` + +::: notes +Make a plot or summary and then share with neighbor +::: + +## + +```{r taxi-tip-counts} +#| fig-align: 'center' +taxi_train %>% + ggplot(aes(x = tip)) + + geom_bar() +``` + +## + +```{r taxi-tip-by-local} +#| fig-align: 'center' +taxi_train %>% + ggplot(aes(x = tip, fill = local)) + + geom_bar() + + scale_fill_viridis_d(end = .5) +``` + +## + +```{r taxi-tip-by-hour} +#| fig-align: 'center' +taxi_train %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot(aes(x = hour, fill = tip)) + + geom_bar() +``` + +## + +```{r taxi-tip-by-hour-fill} +#| fig-align: 'center' +taxi_train %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot(aes(x = hour, fill = tip)) + + geom_bar(position = "fill") +``` + +## + +```{r taxi-tip-by-distance} +#| fig-align: 'center' +taxi_train %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot(aes(x = distance)) + + geom_histogram(bins = 100) + + facet_grid(vars(tip)) +``` + +# Split smarter + +## + +```{r taxi-tip-pct, echo = FALSE} +taxi %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot(aes(x = "", fill = tip)) + + geom_bar(position = "fill") + + labs(x = "") +``` + +Stratified sampling would split within response values + +:::notes +Based on our EDA, we know that the source data contains fewer `"no"` tip values than `"yes"`. We want to make sure we allot equal proportions of those responses so that both the training and testing data have enough of each to give accurate estimates. +::: + +## Stratification + +Use `strata = tip` + +```{r} +set.seed(123) +taxi_split <- initial_split(taxi, prop = 0.8, strata = tip) +taxi_split +``` + +## Stratification + +Stratification often helps, with very little downside + +```{r taxi-tip-pct-by-split, echo = FALSE} +bind_rows( + taxi_train %>% mutate(split = "train"), + taxi_test %>% mutate(split = "test") +) %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot(aes(x = split, fill = tip)) + + geom_bar(position = "fill") +``` + +## The whole game - status update + +```{r diagram-split, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-split.jpg") +``` diff --git a/archive/2023-07-nyr/03-what-makes-a-model.qmd b/archive/2023-07-nyr/03-what-makes-a-model.qmd new file mode 100644 index 00000000..0e0c39e4 --- /dev/null +++ b/archive/2023-07-nyr/03-what-makes-a-model.qmd @@ -0,0 +1,694 @@ +--- +title: "3 - What makes a model?" 
+subtitle: "Machine learning with tidymodels" +format: + revealjs: + slide-number: true + footer: + include-before-body: header.html + include-after-body: footer-annotations.html + theme: [default, tidymodels.scss] + width: 1280 + height: 720 +knitr: + opts_chunk: + echo: true + collapse: true + comment: "#>" + fig.path: "figures/" +--- + +```{r} +#| include: false +#| file: setup.R +``` + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*How do you fit a linear model in R?* + +*How many different ways can you think of?* + +```{r} +#| echo: false +countdown(minutes = 3, id = "how-to-fit-linear-model") +``` + +. . . + +- `lm` for linear model + +- `glm` for generalized linear model (e.g. logistic regression) + +- `glmnet` for regularized regression + +- `keras` for regression using TensorFlow + +- `stan` for Bayesian regression + +- `spark` for large data sets + +## To specify a model `r hexes("parsnip")` + +. . . + +::: columns +::: {.column width="40%"} +- Choose a [model]{.underline} +- Specify an engine +- Set the mode +::: + +::: {.column width="60%"} +![](images/taxi_spinning.svg) +::: +::: + +## To specify a model `r hexes("parsnip")` + +```{r} +#| echo: false +library(tidymodels) +library(modeldatatoo) + +taxi <- data_taxi() + +taxi <- taxi %>% + mutate(month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr"))) %>% + select(-c(id, duration, fare, tolls, extras, total_cost, payment_type)) %>% + drop_na() + +set.seed(123) + +taxi_split <- initial_split(taxi, prop = 0.8, strata = tip) +taxi_train <- training(taxi_split) +taxi_test <- testing(taxi_split) +``` + +```{r} +logistic_reg() +``` + + +:::notes +Models have default engines +::: + +## To specify a model `r hexes("parsnip")` + +::: columns +::: {.column width="40%"} +- Choose a model +- Specify an [engine]{.underline} +- Set the mode +::: + +::: {.column width="60%"} +![](images/taxi_spinning.svg) +::: +::: + +## To specify a model `r hexes("parsnip")` + +```{r} +logistic_reg() %>% + set_engine("glmnet") +``` + +## To specify a model `r hexes("parsnip")` + +```{r} +logistic_reg() %>% + set_engine("stan") +``` + +## To specify a model `r hexes("parsnip")` + +::: columns +::: {.column width="40%"} +- Choose a model +- Specify an engine +- Set the [mode]{.underline} +::: + +::: {.column width="60%"} +![](images/taxi_spinning.svg) +::: +::: + + +## To specify a model `r hexes("parsnip")` + +```{r} +decision_tree() +``` + +:::notes +Some models have a default mode +::: + +## To specify a model `r hexes("parsnip")` + +```{r} +decision_tree() %>% + set_mode("classification") +``` + +. . . + +
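+
+A full specification chains all three choices together; for example (one possible spec, using the ranger engine from the package list):
+
+```{r}
+#| eval: false
+rand_forest(trees = 1000) %>% 
+  set_engine("ranger") %>% 
+  set_mode("classification")
+```
+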

+ +::: r-fit-text +All available models are listed at +::: + +## {background-iframe="https://www.tidymodels.org/find/parsnip/"} + +::: footer +::: + +## To specify a model `r hexes("parsnip")` + +::: columns +::: {.column width="40%"} +- Choose a [model]{.underline} +- Specify an [engine]{.underline} +- Set the [mode]{.underline} +::: + +::: {.column width="60%"} +![](images/taxi_spinning.svg) +::: +::: + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Run the `tree_spec` chunk in your `.qmd`.* + +*Edit this code to use a different model.* + +```{r} +#| echo: false +countdown(minutes = 5, id = "explore-tree-spec") +``` + +

+ +::: r-fit-text +All available models are listed at +::: + +## Models we'll be using today + +* Logistic regression +* Decision trees + +## Logistic regression + +::: columns +::: {.column width="60%"} +```{r} +#| echo: false +#| fig.width: 8 +#| fig.height: 7 +taxi_test %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot() + + geom_histogram(aes(distance, fill = tip), position = "fill") + + labs(y = "") + + theme_bw(base_size = 18) +``` +::: + +::: {.column width="40%"} +::: +::: + +## Logistic regression + +::: columns +::: {.column width="60%"} +```{r} +#| echo: false +#| fig.width: 8 +#| fig.height: 7 +logistic_preds <- + logistic_reg() %>% + fit(tip ~ distance, data = taxi_train) %>% + augment(new_data = taxi_test) + +logistic_preds %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot() + + geom_histogram(aes(distance, fill = tip), position = "fill") + + geom_line(aes(x = distance, y = .pred_yes), size = 2, alpha = 0.8, color = data_color) + + labs(y = "") + + theme_bw(base_size = 18) +``` +::: + +::: {.column width="40%"} +::: +::: + +## Logistic regression + +::: columns +::: {.column width="60%"} +```{r} +#| echo: false +#| fig.width: 8 +#| fig.height: 7 +logistic_preds %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot() + + geom_histogram(aes(distance, fill = tip), position = "fill") + + geom_line(aes(x = distance, y = .pred_yes), size = 2, alpha = 0.8, color = data_color) + + labs(y = "") + + theme_bw(base_size = 18) +``` +::: + +::: {.column width="40%"} + +- Logit of outcome probability modeled as linear combination of predictors: + +$log(\frac{p}{1 - p}) = \beta_0 + \beta_1\cdot \text{distance}$ + +- Find a sigmoid line that separates the two classes + +::: +::: + +## Decision trees + +::: columns +::: {.column width="50%"} +```{r} +#| echo: false +#| fig.width: 8 +#| fig.height: 7 + +tree_fit <- + decision_tree(cost_complexity = 0.1, mode = "classification") %>% + fit(tip ~ distance, data = taxi_train) + +tree_preds <- + tree_fit %>% + augment(new_data = taxi_test) +``` + +```{r} +#| echo: false +#| fig-align: center +library(rpart.plot) +tree_fit %>% + extract_fit_engine() %>% + rpart.plot(roundint = FALSE) +``` + +::: + +::: {.column width="50%"} +::: +::: + +## Decision trees + +::: columns +::: {.column width="50%"} +```{r} +#| echo: false +#| fig-align: center +library(rpart.plot) +tree_fit %>% + extract_fit_engine() %>% + rpart.plot(roundint = FALSE) +``` +::: + +::: {.column width="50%"} +- Series of splits or if/then statements based on predictors + +- First the tree *grows* until some condition is met (maximum depth, no more data) + +- Then the tree is *pruned* to reduce its complexity +::: +::: + +## Decision trees + +::: columns +::: {.column width="50%"} +```{r} +#| echo: false +#| fig.width: 8 +#| fig.height: 7 +#| fig-align: center +library(rpart.plot) +tree_fit %>% + extract_fit_engine() %>% + rpart.plot(roundint = FALSE) +``` +::: + +::: {.column width="50%"} +```{r} +#| echo: false +#| fig.width: 8 +#| fig.height: 7 + +tree_preds %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot() + + geom_histogram(aes(distance, fill = tip), position = "fill") + + geom_line(aes(x = distance, y = .pred_yes), size = 2, alpha = 0.8, color = data_color) + + labs(y = "") + + theme_bw(base_size = 18) +``` +::: +::: + +## All models are wrong, but some are useful! 
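+
+For reference, the two fits compared on this slide look roughly like this (a sketch; the deck's actual plotting chunks are hidden with `echo: false`, and the object names here are only illustrative):
+
+```{r}
+#| eval: false
+logistic_fit <- logistic_reg() %>% 
+  fit(tip ~ distance, data = taxi_train)
+
+cart_fit <- decision_tree(cost_complexity = 0.1, mode = "classification") %>% 
+  fit(tip ~ distance, data = taxi_train)
+```
+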
+ +::: columns +::: {.column width="50%"} +### Logistic regression +```{r} +#| echo: false +#| fig.width: 8 +#| fig.height: 7 + +logistic_preds %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot() + + geom_histogram(aes(distance, fill = tip), position = "fill") + + geom_line(aes(x = distance, y = .pred_yes), size = 2, alpha = 0.8, color = data_color) + + labs(y = "") + + theme_bw(base_size = 18) +``` +::: + +::: {.column width="50%"} +### Decision trees +```{r} +#| echo: false +#| fig.width: 8 +#| fig.height: 7 + +tree_preds %>% + mutate(tip = forcats::fct_rev(tip)) %>% + ggplot() + + geom_histogram(aes(distance, fill = tip), position = "fill") + + geom_line(aes(x = distance, y = .pred_yes), size = 2, alpha = 0.8, color = data_color) + + labs(y = "") + + theme_bw(base_size = 18) +``` +::: +::: + +# A model workflow + +## Workflows bind preprocessors and models + +```{r good-workflow} +#| echo: false +#| out-width: '70%' +#| fig-align: 'center' +knitr::include_graphics("images/good_workflow.png") +``` + +:::notes +Explain that PCA that is a preprocessor / dimensionality reduction, used to decorrelate data +::: + + +## What is wrong with this? {.annotation} + +```{r bad-workflow} +#| echo: false +#| out-width: '70%' +#| fig-align: 'center' +knitr::include_graphics("images/bad_workflow.png") +``` + +## Why a `workflow()`? `r hexes("workflows")` + +. . . + +- Workflows handle new data better than base R tools in terms of new factor levels + +. . . + +- You can use other preprocessors besides formulas (more on feature engineering tomorrow!) + +. . . + +- They can help organize your work when working with multiple models + +. . . + +- [Most importantly]{.underline}, a workflow captures the entire modeling process: `fit()` and `predict()` apply to the preprocessing steps in addition to the actual model fit + +::: notes +Two ways workflows handle levels better than base R: + +- Enforces that new levels are not allowed at prediction time (this is an optional check that can be turned off) + +- Restores missing levels that were present at fit time, but happen to be missing at prediction time (like, if your "new" data just doesn't have an instance of that level) +::: + +## A model workflow `r hexes("parsnip", "workflows")` + +```{r} +tree_spec <- + decision_tree() %>% + set_mode("classification") + +tree_spec %>% + fit(tip ~ ., data = taxi_train) +``` + +## A model workflow `r hexes("parsnip", "workflows")` + +```{r} +tree_spec <- + decision_tree() %>% + set_mode("classification") + +workflow() %>% + add_formula(tip ~ .) %>% + add_model(tree_spec) %>% + fit(data = taxi_train) +``` + +## A model workflow `r hexes("parsnip", "workflows")` + +```{r} +tree_spec <- + decision_tree() %>% + set_mode("classification") + +workflow(tip ~ ., tree_spec) %>% + fit(data = taxi_train) +``` + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Run the `tree_wflow` chunk in your `.qmd`.* + +*Edit this code to make a workflow with your own model of choice.* + +```{r} +#| echo: false +countdown(minutes = 5, id = "explore-tree-workflow") +``` + +## Predict with your model `r hexes("parsnip", "workflows")` + +How do you use your new `tree_fit` model? 
+ +```{r} +tree_spec <- + decision_tree() %>% + set_mode("classification") + +tree_fit <- + workflow(tip ~ ., tree_spec) %>% + fit(data = taxi_train) +``` + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Run:* + +`predict(tree_fit, new_data = taxi_test)` + +*What do you get?* + +```{r} +#| echo: false +countdown(minutes = 3, id = "predict-tree-fit") +``` + +## Your turn + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Run:* + +`augment(tree_fit, new_data = taxi_test)` + +*What do you get?* + +```{r} +#| echo: false +countdown(minutes = 3, id = "augment-tree-fit") +``` + +# The tidymodels prediction guarantee! + +. . . + +- The predictions will always be inside a **tibble** +- The column names and types are **unsurprising** and **predictable** +- The number of rows in `new_data` and the output **are the same** + +## Understand your model `r hexes("parsnip", "workflows")` + +How do you **understand** your new `tree_fit` model? + +```{r} +#| echo: false +#| fig-align: center +library(rpart.plot) +tree_fit %>% + extract_fit_engine() %>% + rpart.plot(roundint = FALSE) +``` + +## Understand your model `r hexes("parsnip", "workflows")` + +How do you **understand** your new `tree_fit` model? + +```{r} +#| eval: false +library(rpart.plot) +tree_fit %>% + extract_fit_engine() %>% + rpart.plot(roundint = FALSE) +``` + +You can `extract_*()` several components of your fitted workflow. + +::: notes +`roundint = FALSE` is only to quiet a warning +::: + + +## Understand your model `r hexes("parsnip", "workflows")` + +How do you **understand** your new `tree_fit` model? + +. . . + +You can use your fitted workflow for model and/or prediction explanations: + +. . . + +- overall variable importance, such as with the [vip](https://koalaverse.github.io/vip/) package + +. . . + +- flexible model explainers, such as with the [DALEXtra](https://dalex.drwhy.ai/) package + +. . . + +Learn more at + +## {background-iframe="https://hardhat.tidymodels.org/reference/hardhat-extract.html"} + +::: footer +::: + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Extract the model engine object from your fitted workflow.* + +⚠️ *Never `predict()` with any extracted components!* + +```{r} +#| echo: false +countdown(minutes = 5, id = "extract-methods") +``` + +:::notes +Afterward, ask what kind of object people got from the extraction, and what they did with it (e.g. give it to `summary()`, `plot()`, `broom::tidy()` ). Live code along +::: + +# Deploy your model `r hexes("vetiver")` + +## {background-image="https://vetiver.rstudio.com/images/ml_ops_cycle.png" background-size="contain"} + +## Deploying a model `r hexes("vetiver")` + +How do you use your new `tree_fit` model in **production**? + +```{r} +library(vetiver) +v <- vetiver_model(tree_fit, "taxi") +v +``` + +Learn more at + +## Deploy your model `r hexes("vetiver")` + +How do you use your new model `tree_fit` in **production**? 
+ +```{r} +library(plumber) +pr() %>% + vetiver_api(v) +``` + +Learn more at + +:::notes +Live-code making a prediction +::: + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Run the `vetiver` chunk in your `.qmd`.* + +*Check out the automated visual documentation.* + +```{r} +#| echo: false +countdown(minutes = 5, id = "vetiver") +``` + +## The whole game - status update + +```{r diagram-model-1, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-model-1.jpg") +``` + +:::notes +Stress that fitting a model on the entire training set was only for illustrating how to fit a model +::: diff --git a/archive/2023-07-nyr/04-evaluating-models.qmd b/archive/2023-07-nyr/04-evaluating-models.qmd new file mode 100644 index 00000000..a99ddf72 --- /dev/null +++ b/archive/2023-07-nyr/04-evaluating-models.qmd @@ -0,0 +1,795 @@ +--- +title: "4 - Evaluating models" +subtitle: "Machine learning with tidymodels" +format: + revealjs: + slide-number: true + footer: + include-before-body: header.html + include-after-body: footer-annotations.html + theme: [default, tidymodels.scss] + width: 1280 + height: 720 +knitr: + opts_chunk: + echo: true + collapse: true + comment: "#>" + fig.path: "figures/" +--- + +```{r} +#| include: false +#| file: setup.R +``` + +## Looking at predictions + +```{r} +#| echo: false +library(countdown) +library(tidymodels) +library(modeldatatoo) +taxi <- data_taxi() %>% + mutate(month = factor(month, levels = c("Jan", "Feb", "Mar", "Apr"))) %>% + select(-c(id, duration, fare, tolls, extras, total_cost, payment_type)) %>% + drop_na() + +set.seed(123) +taxi_split <- initial_split(taxi, prop = 0.8, strata = tip) +taxi_train <- training(taxi_split) +taxi_test <- testing(taxi_split) + +tree_spec <- decision_tree(cost_complexity = 0.0001, mode = "classification") +taxi_wflow <- workflow(tip ~ ., tree_spec) +taxi_fit <- fit(taxi_wflow, taxi_train) +``` + +```{r} +augment(taxi_fit, new_data = taxi_train) %>% + relocate(tip, .pred_class, .pred_yes, .pred_no) +``` + +## Confusion matrix `r hexes("yardstick")` + +![](images/confusion-matrix.png) + +## Confusion matrix `r hexes("yardstick")` + +```{r} +augment(taxi_fit, new_data = taxi_train) %>% + conf_mat(truth = tip, estimate = .pred_class) +``` + +## Confusion matrix `r hexes("yardstick")` + +```{r} +augment(taxi_fit, new_data = taxi_train) %>% + conf_mat(truth = tip, estimate = .pred_class) %>% + autoplot(type = "heatmap") +``` + +## Metrics for model performance `r hexes("yardstick")` + +::: columns +::: {.column width="60%"} +```{r} +augment(taxi_fit, new_data = taxi_train) %>% + accuracy(truth = tip, estimate = .pred_class) +``` +::: + +::: {.column width="40%"} +![](images/confusion-matrix-accuracy.png) +::: +::: + +## Dangers of accuracy `r hexes("yardstick")` + +We need to be careful of using `accuracy()` since it can give "good" performance by only predicting one way with imbalanced data + +```{r} +augment(taxi_fit, new_data = taxi_train) %>% + mutate(.pred_class = factor("yes", levels = c("yes", "no"))) %>% + accuracy(truth = tip, estimate = .pred_class) +``` + +## Metrics for model performance `r hexes("yardstick")` + +::: columns +::: {.column width="60%"} +```{r} +augment(taxi_fit, new_data = taxi_train) %>% + sensitivity(truth = tip, estimate = .pred_class) +``` +::: + +::: {.column width="40%"} +![](images/confusion-matrix-sensitivity.png) +::: +::: + + +## Metrics for model performance `r hexes("yardstick")` + +::: 
columns +::: {.column width="60%"} +```{r} +#| code-line-numbers: "3-6" +augment(taxi_fit, new_data = taxi_train) %>% + sensitivity(truth = tip, estimate = .pred_class) +``` + +
+ +```{r} +augment(taxi_fit, new_data = taxi_train) %>% + specificity(truth = tip, estimate = .pred_class) +``` +::: + +::: {.column width="40%"} +![](images/confusion-matrix-specificity.png) +::: +::: + +## Metrics for model performance `r hexes("yardstick")` + +We can use `metric_set()` to combine multiple calculations into one + +```{r} +taxi_metrics <- metric_set(accuracy, specificity, sensitivity) + +augment(taxi_fit, new_data = taxi_train) %>% + taxi_metrics(truth = tip, estimate = .pred_class) +``` + +## Metrics for model performance `r hexes("yardstick")` + +```{r} +taxi_metrics <- metric_set(accuracy, specificity, sensitivity) + +augment(taxi_fit, new_data = taxi_train) %>% + group_by(local) %>% + taxi_metrics(truth = tip, estimate = .pred_class) +``` + +## Two class data + +These metrics assume that we know the threshold for converting "soft" probability predictions into "hard" class predictions. + +. . . + +Is a 50% threshold good? + +What happens if we say that we need to be 80% sure to declare an event? + +- sensitivity ⬇️, specificity ⬆️ + +. . . + +What happens for a 20% threshold? + +- sensitivity ⬆️, specificity ⬇️ + +## Varying the threshold + +```{r} +#| label: thresholds +#| echo: false + +augment(taxi_fit, new_data = taxi_train) %>% + roc_curve(truth = tip, .pred_yes) %>% + filter(is.finite(.threshold)) %>% + pivot_longer(c(specificity, sensitivity), names_to = "statistic", values_to = "value") %>% + rename(`event threshold` = .threshold) %>% + ggplot(aes(x = `event threshold`, y = value, col = statistic, group = statistic)) + + geom_line() + + scale_color_brewer(palette = "Dark2") + + labs(y = NULL) + + coord_equal() + + theme(legend.position = "top") +``` + +## ROC curves + +To make an ROC (receiver operator characteristic) curve, we: + +- calculate the sensitivity and specificity for all possible thresholds + +- plot false positive rate (x-axis) versus true positive rate (y-axis) + +given that sensitivity is the true positive rate, and specificity is the true negative rate. Hence `1 - specificity` is the false positive rate. + +. . . + +We can use the area under the ROC curve as a classification metric: + +- ROC AUC = 1 πŸ’― +- ROC AUC = 1/2 😒 + +:::notes +ROC curves are insensitive to class imbalance. 
+::: + +## ROC curves `r hexes("yardstick")` + +```{r} +# Assumes _first_ factor level is event; there are options to change that +augment(taxi_fit, new_data = taxi_train) %>% + roc_curve(truth = tip, .pred_yes) %>% + slice(1, 20, 50) + +augment(taxi_fit, new_data = taxi_train) %>% + roc_auc(truth = tip, .pred_yes) +``` + +## ROC curve plot `r hexes("yardstick")` + +```{r roc-curve} +#| fig-width: 6 +#| fig-height: 6 +#| output-location: "column" + +augment(taxi_fit, new_data = taxi_train) %>% + roc_curve(truth = tip, .pred_yes) %>% + autoplot() +``` + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Compute and plot an ROC curve for your current model.* + +*What data are being used for this ROC curve plot?* + +```{r} +#| echo: false +countdown(minutes = 5, id = "roc-curve") +``` + +## {background-iframe="https://yardstick.tidymodels.org/reference/index.html"} + +::: footer +::: + +# ⚠️ DANGERS OF OVERFITTING ⚠️ + +## Dangers of overfitting ⚠️ + +![](https://raw.githubusercontent.com/topepo/2022-nyr-workshop/main/images/tuning-overfitting-train-1.svg) + +## Dangers of overfitting ⚠️ + +![](https://raw.githubusercontent.com/topepo/2022-nyr-workshop/main/images/tuning-overfitting-test-1.svg) + +## Dangers of overfitting ⚠️ `r hexes("yardstick")` + +```{r} +taxi_fit %>% + augment(taxi_train) +``` + +We call this "resubstitution" or "repredicting the training set" + +## Dangers of overfitting ⚠️ `r hexes("yardstick")` + +```{r} +taxi_fit %>% + augment(taxi_train) %>% + accuracy(tip, .pred_class) +``` + +We call this a "resubstitution estimate" + +## Dangers of overfitting ⚠️ `r hexes("yardstick")` + +::: columns +::: {.column width="50%"} +```{r} +taxi_fit %>% + augment(taxi_train) %>% + accuracy(tip, .pred_class) +``` +::: + +::: {.column width="50%"} +::: +::: + +## Dangers of overfitting ⚠️ `r hexes("yardstick")` + +::: columns +::: {.column width="50%"} +```{r} +taxi_fit %>% + augment(taxi_train) %>% + accuracy(tip, .pred_class) +``` +::: + +::: {.column width="50%"} +```{r} +taxi_fit %>% + augment(taxi_test) %>% + accuracy(tip, .pred_class) +``` +::: +::: + +. . . + +⚠️ Remember that we're demonstrating overfitting + +. . . + +⚠️ Don't use the test set until the *end* of your modeling analysis + + +## {background-image="https://media.giphy.com/media/55itGuoAJiZEEen9gg/giphy.gif" background-size="70%"} + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute bottom="0" left="0" width="150" height="150"} + +*Use `augment()` and and a metric function to compute a classification metric like `brier_class()`.* + +*Compute the metrics for both training and testing data to demonstrate overfitting!* + +*Notice the evidence of overfitting!* ⚠️ + +```{r} +#| echo: false +countdown(minutes = 5, id = "augment-metrics") +``` + +## Dangers of overfitting ⚠️ `r hexes("yardstick")` + +::: columns +::: {.column width="50%"} +```{r} +taxi_fit %>% + augment(taxi_train) %>% + brier_class(tip, .pred_yes) +``` +::: + +::: {.column width="50%"} +```{r} +taxi_fit %>% + augment(taxi_test) %>% + brier_class(tip, .pred_yes) +``` +::: +::: + +. . . + +What if we want to compare more models? + +. . . + +And/or more model configurations? + +. . . + +And we want to understand if these are important differences? + +# The testing data are precious πŸ’Ž + +# How can we use the *training* data to compare and evaluate different models? 
πŸ€” + +## {background-color="white" background-image="https://www.tmwr.org/premade/resampling.svg" background-size="80%"} + +## Cross-validation + +![](https://www.tmwr.org/premade/three-CV.svg) + +## Cross-validation + +![](https://www.tmwr.org/premade/three-CV-iter.svg) + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*If we use 10 folds, what percent of the training data* + +- *ends up in analysis* +- *ends up in assessment* + +*for* **each** *fold?* + +![](images/taxi_spinning.svg){width="300"} + +```{r} +#| echo: false +countdown(minutes = 3, id = "percent-in-folds") +``` + +## Cross-validation `r hexes("rsample")` + +```{r} +vfold_cv(taxi_train) # v = 10 is default +``` + +## Cross-validation `r hexes("rsample")` + +What is in this? + +```{r} +taxi_folds <- vfold_cv(taxi_train) +taxi_folds$splits[1:3] +``` + +::: notes +Talk about a list column, storing non-atomic types in dataframe +::: + +## Cross-validation `r hexes("rsample")` + +```{r} +vfold_cv(taxi_train, v = 5) +``` + +## Cross-validation `r hexes("rsample")` + +```{r} +vfold_cv(taxi_train, strata = tip) +``` + +. . . + +Stratification often helps, with very little downside + +## Cross-validation `r hexes("rsample")` + +We'll use this setup: + +```{r} +set.seed(123) +taxi_folds <- vfold_cv(taxi_train, v = 10, strata = tip) +taxi_folds +``` + +. . . + +Set the seed when creating resamples + +# We are equipped with metrics and resamples! + +## Fit our model to the resamples + +```{r} +taxi_res <- fit_resamples(taxi_wflow, taxi_folds) +taxi_res +``` + +## Evaluating model performance `r hexes("tune")` + +```{r} +taxi_res %>% + collect_metrics() +``` + +. . . + +We can reliably measure performance using only the **training** data πŸŽ‰ + +## Comparing metrics `r hexes("yardstick")` + +How do the metrics from resampling compare to the metrics from training and testing? + +```{r} +#| echo: false +taxi_training_roc_auc <- + taxi_fit %>% + augment(taxi_train) %>% + roc_auc(tip, .pred_yes) %>% + pull(.estimate) %>% + round(digits = 2) + +taxi_testing_roc_auc <- + taxi_fit %>% + augment(taxi_test) %>% + roc_auc(tip, .pred_yes) %>% + pull(.estimate) %>% + round(digits = 2) +``` + +::: columns +::: {.column width="50%"} +```{r} +taxi_res %>% + collect_metrics() %>% + select(.metric, mean, n) +``` +::: + +::: {.column width="50%"} +The ROC AUC previously was + +- `r taxi_training_roc_auc` for the training set +- `r taxi_testing_roc_auc` for test set +::: +::: + +. . . + +Remember that: + +⚠️ the training set gives you overly optimistic metrics + +⚠️ the test set is precious + +## Evaluating model performance `r hexes("tune")` + +```{r} +# Save the assessment set results +ctrl_taxi <- control_resamples(save_pred = TRUE) +taxi_res <- fit_resamples(taxi_wflow, taxi_folds, control = ctrl_taxi) + +taxi_preds <- collect_predictions(taxi_res) +taxi_preds +``` + +## Evaluating model performance `r hexes("tune")` + +```{r} +taxi_preds %>% + group_by(id) %>% + taxi_metrics(truth = tip, estimate = .pred_class) +``` + +## Where are the fitted models? `r hexes("tune")` {.annotation} + +```{r} +taxi_res +``` + +. . . 
+ +πŸ—‘οΈ + +# Alternate resampling schemes + +## Bootstrapping + +![](https://www.tmwr.org/premade/bootstraps.svg) + +## Bootstrapping `r hexes("rsample")` + +```{r} +set.seed(3214) +bootstraps(taxi_train) +``` + +## {background-iframe="https://rsample.tidymodels.org/reference/index.html"} + +::: footer +::: + +## The whole game - status update + +```{r diagram-resamples, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-resamples.jpg") +``` + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Create:* + +- *Monte Carlo Cross-Validation sets* +- *validation set* + +(use the reference guide to find the function) + +*Don't forget to set a seed when you resample!* + +```{r} +#| echo: false +countdown(minutes = 5, id = "try-rsample") +``` + +## Monte Carlo Cross-Validation `r hexes("rsample")` + +```{r} +set.seed(322) +mc_cv(taxi_train, times = 10) +``` + +## Validation set `r hexes("rsample")` {.annotation} + +```{r} +set.seed(853) +validation_split(taxi_train, strata = tip) +``` + +. . . + +A validation set is just another type of resample + +# Decision tree 🌳 + +# Random forest 🌳🌲🌴🌡🌴🌳🌳🌴🌲🌡🌴🌲🌳🌴🌳🌡🌡🌴🌲🌲🌳🌴🌳🌴🌲🌴🌡🌴🌲🌴🌡🌲🌡🌴🌲🌳🌴🌡🌳🌴🌳 + +## Random forest 🌳🌲🌴🌡🌳🌳🌴🌲🌡🌴🌳🌡 + +- Ensemble many decision tree models + +- All the trees vote! πŸ—³οΈ + +- Bootstrap aggregating + random predictor sampling + +. . . + +- Often works well without tuning hyperparameters (more on this tomorrow!), as long as there are enough trees + +## Create a random forest model `r hexes("parsnip")` + +```{r} +rf_spec <- rand_forest(trees = 1000, mode = "classification") +rf_spec +``` + +## Create a random forest model `r hexes("workflows")` + +```{r} +rf_wflow <- workflow(tip ~ ., rf_spec) +rf_wflow +``` + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*Use `fit_resamples()` and `rf_wflow` to:* + +- *keep predictions* +- *compute metrics* + +```{r} +#| echo: false +countdown(minutes = 8, id = "try-fit-resamples") +``` + +## Evaluating model performance `r hexes("tune")` + +```{r} +ctrl_taxi <- control_resamples(save_pred = TRUE) + +# Random forest uses random numbers so set the seed first + +set.seed(2) +rf_res <- fit_resamples(rf_wflow, taxi_folds, control = ctrl_taxi) +collect_metrics(rf_res) +``` + +## How can we compare multiple model workflows at once? + +```{r taxi-spinning, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/taxi_spinning.svg") +``` + + +## Evaluate a workflow set + +```{r} +workflow_set(list(tip ~ .), list(tree_spec, rf_spec)) +``` + +## Evaluate a workflow set + +```{r} +workflow_set(list(tip ~ .), list(tree_spec, rf_spec)) %>% + workflow_map("fit_resamples", resamples = taxi_folds) +``` + +## Evaluate a workflow set + +```{r} +workflow_set(list(tip ~ .), list(tree_spec, rf_spec)) %>% + workflow_map("fit_resamples", resamples = taxi_folds) %>% + rank_results() +``` + +The first metric of the metric set is used for ranking. Use `rank_metric` to change that. + +. . . + +Lots more available with workflow sets, like `collect_metrics()`, `autoplot()` methods, and more! 
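+
+For example (a sketch; `wf_set_res` is just an illustrative name for the fitted workflow set from the previous slide):
+
+```{r}
+#| eval: false
+wf_set_res <- 
+  workflow_set(list(tip ~ .), list(tree_spec, rf_spec)) %>% 
+  workflow_map("fit_resamples", resamples = taxi_folds)
+
+# resampling metrics for every workflow in the set
+collect_metrics(wf_set_res)
+
+# plot the ranked results
+autoplot(wf_set_res)
+```
+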
+ +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*When do you think a workflow set would be useful?* + +```{r} +#| echo: false +countdown(minutes = 3, id = "discuss-workflow-sets") +``` + +## The whole game - status update + +```{r diagram-select, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-select.jpg") +``` + +## The final fit `r hexes("tune")` + +Suppose that we are happy with our random forest model. + +Let's fit the model on the training set and verify our performance using the test set. + +. . . + +We've shown you `fit()` and `predict()` (+ `augment()`) but there is a shortcut: + +```{r} +# taxi_split has train + test info +final_fit <- last_fit(rf_wflow, taxi_split) + +final_fit +``` + +## What is in `final_fit`? `r hexes("tune")` + +```{r} +collect_metrics(final_fit) +``` + +. . . + +These are metrics computed with the **test** set + +## What is in `final_fit`? `r hexes("tune")` + +```{r} +collect_predictions(final_fit) +``` + +## What is in `final_fit`? `r hexes("tune")` + +```{r} +extract_workflow(final_fit) +``` + +. . . + +Use this for **prediction** on new data, like for deploying + +## The whole game + +```{r diagram-final-performance, echo = FALSE} +#| fig-align: "center" + +knitr::include_graphics("images/whole-game-final-performance.jpg") +``` + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + +*End of the day discussion!* + +*Which model do you think you would decide to use?* + +*What surprised you the most?* + +*What is one thing you are looking forward to for tomorrow?* + +```{r} +#| echo: false +countdown(minutes = 5, id = "discuss-which-model") +``` + diff --git a/archive/2023-07-nyr/08-wrapping-up.qmd b/archive/2023-07-nyr/08-wrapping-up.qmd new file mode 100644 index 00000000..2b0d23ee --- /dev/null +++ b/archive/2023-07-nyr/08-wrapping-up.qmd @@ -0,0 +1,65 @@ +--- +title: "8 - Wrapping up" +subtitle: "Machine learning with tidymodels" +format: + revealjs: + slide-number: true + footer: + include-before-body: header.html + theme: [default, tidymodels.scss] + width: 1280 + height: 720 +knitr: + opts_chunk: + echo: true + collapse: true + comment: "#>" +--- + +```{r} +#| include: false +#| file: setup.R +``` + +::: r-fit-text +We made it! +::: + +## Your turn {transition="slide-in"} + +![](images/parsnip-flagger.jpg){.absolute top="0" right="0" width="150" height="150"} + + +*What is one thing you learned that surprised you?* + +*What is one thing you learned that you plan to use?* + +```{r} +#| echo: false +countdown(minutes = 5, id = "statistics-vs-ml") +``` + + +## Resources to keep learning + +. . . + +- + +. . . + +- + +. . . + +- + +. . . + +- + +. . . + +Follow us on Twitter and at the tidyverse blog for updates! 
+ + diff --git a/archive/2023-07-nyr/_freeze/01-introduction/execute-results/html.json b/archive/2023-07-nyr/_freeze/01-introduction/execute-results/html.json new file mode 100644 index 00000000..dc692f63 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/01-introduction/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "3ea97871f74836d15d22db5ec0939940", + "result": { + "markdown": "---\ntitle: \"1 - Introduction\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n---\n\n\n\n\n::: r-fit-text\nWelcome!\n:::\n\n## Who are you?\n\n- You can use the magrittr `%>%` or base R `|>` pipe\n\n- You are familiar with functions from dplyr, tidyr, ggplot2\n\n- You have exposure to basic statistical concepts\n\n- You do **not** need intermediate or expert familiarity with modeling or ML\n\n## Who are tidymodels?\n\n- Simon Couch\n- Hannah Frick\n- Emil Hvitfeldt\n- Max Kuhn\n\n. . .\n\nMany thanks to Davis Vaughan, Julia Silge, David Robinson, Julie Jung, Alison Hill, and DesirΓ©e De Leon for their role in creating these materials!\n\n## Asking for help\n\n. . .\n\nπŸŸͺ \"I'm stuck and need help!\"\n\n. . .\n\n🟩 \"I finished the exercise\"\n\n\n## πŸ‘€ {.annotation}\n\n![](images/pointing.svg){.absolute top=\"0\" right=\"0\"}\n\n## Tentative plan for this workshop\n\n::: columns\n::: {.column width=\"50%\"}\n- *Today:* \n\n - Your data budget\n - What makes a model\n - Evaluating models\n:::\n::: {.column width=\"50%\"}\n- *Tomorrow:*\n \n - Feature engineering\n - Tuning hyperparameters\n - Racing methods\n - Iterative search methods\n:::\n:::\n\n## {.center}\n\n### Introduce yourself to your neighbors πŸ‘‹\n\n

\n\nCheck Slack (`#ml-ws-2023`) for an RStudio Cloud link.\n\n## What is machine learning?\n\n![](https://imgs.xkcd.com/comics/machine_learning.png){fig-align=\"center\"}\n\n::: footer\n\n:::\n\n## What is machine learning?\n\n![](images/what_is_ml.jpg){fig-align=\"center\"}\n\n::: footer\nIllustration credit: \n:::\n\n## What is machine learning?\n\n![](images/ml_illustration.jpg){fig-align=\"center\"}\n\n::: footer\nIllustration credit: \n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n. . .\n\n*How are statistics and machine learning related?*\n\n*How are they similar? Different?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n::: notes\nthe \"two cultures\"\n\nmodel first vs. data first\n\ninference vs. prediction\n:::\n\n## What is tidymodels? ![](hexes/tidymodels.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n#> ── Attaching packages ──────────────────────────── tidymodels 1.1.0 ──\n#> βœ” broom 1.0.5 βœ” rsample 1.1.1.9000\n#> βœ” dials 1.2.0 βœ” tibble 3.2.1 \n#> βœ” dplyr 1.1.2 βœ” tidyr 1.3.0 \n#> βœ” infer 1.0.4 βœ” tune 1.1.1.9001\n#> βœ” modeldata 1.1.0 βœ” workflows 1.1.3 \n#> βœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \n#> βœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n#> βœ” recipes 1.0.6\n#> ── Conflicts ─────────────────────────────── tidymodels_conflicts() ──\n#> βœ– purrr::discard() masks scales::discard()\n#> βœ– dplyr::filter() masks stats::filter()\n#> βœ– dplyr::lag() masks stats::lag()\n#> βœ– recipes::step() masks stats::step()\n#> β€’ Dig deeper into tidy modeling with R at https://www.tmwr.org\n```\n:::\n\n\n## {background-image=\"images/tm-org.png\" background-size=\"contain\"}\n\n## The whole game\n\nPart of any modelling process is\n\n* Splitting your data into training and test set\n* Using a resampling scheme\n* Fitting models\n* Assessing performance\n* Choosing a model\n* Fitting and assessing the final model\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-split.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-model-1.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n:::notes\nStress that we are **not** fitting a model on the entire training set other than for illustrative purposes in deck 2.\n:::\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-model-n.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-resamples.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-select.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-final-fit.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-final-performance.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n\n## Let's install some packages\n\nIf you are using your own laptop instead of RStudio Cloud:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ninstall.packages(\"pak\")\n\npkgs <- c(\"bonsai\", \"doParallel\", \"embed\", \"finetune\", \"lightgbm\", \"lme4\", \n \"parallelly\", \"plumber\", \"probably\", \"ranger\", \"rpart\", \"rpart.plot\", \n \"stacks\", \"textrecipes\", \"tidymodels\", \"tidymodels/modeldatatoo\", \n \"vetiver\")\npak::pak(pkgs)\n```\n:::\n\n\n. . 
.\n\nCheck Slack (`#ml-ws-2023`) for an RStudio Cloud link.\n\n\n## Our versions\n\n\n::: {.cell}\n\n:::\n\n\nbonsai (0.2.1.9000, Github (tidymodels/bonsai@aab79), broom (1.0.5, local), dials (1.2.0, CRAN), doParallel (1.0.17, CRAN), dplyr (1.1.2, CRAN), embed (1.0.0, CRAN), finetune (1.1.0.9000, Github (tidymodels/finetune@52d), ggplot2 (3.4.2, CRAN), lightgbm (3.3.5, CRAN), lme4 (1.1-33, CRAN), modeldata (1.1.0, CRAN), modeldatatoo (0.1.0.9000, Github (tidymodels/modeldatatoo), parallelly (1.36.0, CRAN), parsnip (1.1.0.9003, Github (tidymodels/parsnip@e627), plumber (1.2.1, CRAN), probably (1.0.2, CRAN), purrr (1.0.1, CRAN), ranger (0.15.1, CRAN), recipes (1.0.6, CRAN), rpart (4.1.19, CRAN), rpart.plot (3.1.1, CRAN), rsample (1.1.1.9000, Github (tidymodels/rsample@afc4), scales (1.2.1, CRAN), stacks (1.0.2.9000, local), textrecipes (1.0.2, CRAN), tibble (3.2.1, CRAN), tidymodels (1.1.0, CRAN), tidyr (1.3.0, CRAN), tune (1.1.1.9001, Github (tidymodels/tune@fea8b02), vetiver (0.2.0, CRAN), workflows (1.1.3, CRAN), workflowsets (1.0.1, CRAN), yardstick (1.2.0.9001, Github (tidymodels/yardstick@6c), and Quarto (1.3.433)\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + 
border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + 
started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/01-introduction/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/02-data-budget/execute-results/html.json b/archive/2023-07-nyr/_freeze/02-data-budget/execute-results/html.json new file mode 100644 index 00000000..7f41affd --- /dev/null +++ b/archive/2023-07-nyr/_freeze/02-data-budget/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "1bdca46c14f3060519ecc1b8ebabc623", + "result": { + "markdown": "---\ntitle: \"2 - Your data budget\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n## {background-image=\"https://media.giphy.com/media/Lr3UeH9tYu3qJtsSUg/giphy.gif\" background-size=\"40%\"}\n\n\n## Data on Chicago taxi trips\n\n::: columns\n::: {.column width=\"60%\"}\n- The city of Chicago releases anonymized trip-level data on taxi trips in the city.\n- We pulled a sample of 10,000 rides occurring in early 2022.\n- Type `?modeldatatoo::data_taxi()` to learn more about this dataset, including references.\n:::\n\n::: {.column width=\"40%\"}\n![](images/taxi_spinning.svg)\n:::\n\n:::\n\n::: footer\nCredit: \n:::\n\n## Which of these variables can we use?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\n\ntaxi <- data_taxi()\n\nnames(taxi)\n#> [1] \"tip\" \"id\" 
\"duration\" \"distance\" \"fare\" \n#> [6] \"tolls\" \"extras\" \"total_cost\" \"payment_type\" \"company\" \n#> [11] \"local\" \"dow\" \"month\" \"hour\"\n```\n:::\n\n\n## Checklist for predictors\n\n- Is it ethical to use this variable? (Or even legal?)\n\n- Will this variable be available at prediction time?\n\n- Does this variable contribute to explainability?\n\n\n## Data on Chicago taxi trips\n\nWe are using a slightly modified version from the modeldatatoo data.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi <- taxi %>%\n mutate(month = factor(month, levels = c(\"Jan\", \"Feb\", \"Mar\", \"Apr\"))) %>% \n select(-c(id, duration, fare, tolls, extras, total_cost, payment_type)) %>% \n drop_na()\n```\n:::\n\n\n## Data on Chicago taxi trips\n\n::: columns\n::: {.column width=\"60%\"}\n- `N = 10,000`\n- A nominal outcome, `tip`, with levels `\"yes\"` and `\"no\"`\n- 6 other variables\n - `company`, `local`, and `dow`, and `month` are **nominal** predictors\n - `distance` and `hours` are **numeric** predictors\n:::\n\n::: {.column width=\"40%\"}\n![](images/taxi.png)\n:::\n:::\n\n::: footer\nCredit: \n:::\n\n:::notes\n`tip`: Whether the rider left a tip. A factor with levels \"yes\" and \"no\".\n\n`distance`: The trip distance, in odometer miles.\n\n`company`: The taxi company, as a factor. Companies that occurred few times were binned as \"other\".\n\n`local`: Whether the trip started in the same community area as it began. See the source data for community area values.\n\n`dow`: The day of the week in which the trip began, as a factor.\n\n`month`: The month in which the trip began, as a factor.\n\n`hour`: The hour of the day in which the trip began, as a numeric.\n\n:::\n\n## Data on Chicago taxi trips\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi\n#> # A tibble: 8,807 Γ— 7\n#> tip distance company local dow month hour\n#> \n#> 1 yes 1.24 Sun Taxi no Thu Feb 13\n#> 2 no 5.39 Flash Cab no Sat Mar 12\n#> 3 yes 3.01 City Service no Wed Feb 17\n#> 4 no 18.4 Sun Taxi no Sat Apr 6\n#> 5 yes 1.76 Sun Taxi no Sun Jan 15\n#> 6 yes 13.6 Sun Taxi no Mon Feb 17\n#> 7 yes 3.71 City Service no Mon Mar 21\n#> 8 yes 4.8 other no Tue Mar 9\n#> 9 yes 18.0 City Service no Fri Jan 19\n#> 10 no 17.5 other yes Thu Apr 12\n#> # β„Ή 8,797 more rows\n```\n:::\n\n\n\n## Data splitting and spending\n\nFor machine learning, we typically split data into training and test sets:\n\n. . .\n\n- The **training set** is used to estimate model parameters.\n- The **test set** is used to find an independent assessment of model performance.\n\n. . .\n\nDo not 🚫 use the test set during training.\n\n## Data splitting and spending\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/test-train-split-1.svg){width=1152}\n:::\n:::\n\n\n# The more data
we spend πŸ€‘

the better estimates
we'll get.\n\n## Data splitting and spending\n\n- Spending too much data in **training** prevents us from computing a good assessment of predictive **performance**.\n\n. . .\n\n- Spending too much data in **testing** prevents us from computing a good estimate of model **parameters**.\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*When is a good time to split your data?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n03:00\n
\n```\n:::\n:::\n\n\n# The testing data is precious πŸ’Ž\n\n## The initial split ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_split <- initial_split(taxi)\ntaxi_split\n#> \n#> <6605/2202/8807>\n```\n:::\n\n\n:::notes\nHow much data in training vs testing?\nThis function uses a good default, but this depends on your specific goal/data\nWe will talk about more powerful ways of splitting, like stratification, later\n:::\n\n## Accessing the data ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_train <- training(taxi_split)\ntaxi_test <- testing(taxi_split)\n```\n:::\n\n\n## The training set![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_train\n#> # A tibble: 6,605 Γ— 7\n#> tip distance company local dow month hour\n#> \n#> 1 yes 4.54 City Service no Sat Mar 16\n#> 2 no 10.2 Flash Cab no Mon Feb 8\n#> 3 yes 12.4 other no Sun Apr 15\n#> 4 yes 15.3 Sun Taxi no Mon Apr 18\n#> 5 no 6.41 Flash Cab no Wed Apr 14\n#> 6 yes 1.56 other no Tue Jan 13\n#> 7 yes 3.13 Flash Cab no Sun Apr 12\n#> 8 yes 7.54 other no Tue Apr 8\n#> 9 yes 6.98 Flash Cab no Tue Apr 5\n#> 10 yes 0.7 Taxi Affiliation Services no Tue Jan 9\n#> # β„Ή 6,595 more rows\n```\n:::\n\n\n## The test set ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nπŸ™ˆ\n\n. . .\n\nThere are 2202 rows and 7 columns in the test set.\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Split your data so 20% is held out for the test set.*\n\n*Try out different values in `set.seed()` to see how the results change.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n05:00\n
\n```\n:::\n:::\n\n\n## Data splitting and spending ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_split <- initial_split(taxi, prop = 0.8)\ntaxi_train <- training(taxi_split)\ntaxi_test <- testing(taxi_split)\n\nnrow(taxi_train)\n#> [1] 7045\nnrow(taxi_test)\n#> [1] 1762\n```\n:::\n\n\n# What about a validation set?\n\n## {background-color=\"white\" background-image=\"https://www.tmwr.org/premade/validation.svg\" background-size=\"50%\"}\n\n:::notes\nWe will use this tomorrow\n:::\n\n## {background-color=\"white\" background-image=\"https://www.tmwr.org/premade/validation-alt.svg\" background-size=\"40%\"}\n\n# Exploratory data analysis for ML 🧐\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Explore the `taxi_train` data on your own!*\n\n* *What's the distribution of the outcome, tip?*\n* *What's the distribution of numeric variables like distance?*\n* *How does tip differ across the categorical variables?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n08:00\n
\n```\n:::\n:::\n\n\n::: notes\nMake a plot or summary and then share with neighbor\n:::\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n ggplot(aes(x = tip)) +\n geom_bar()\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-counts-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n ggplot(aes(x = tip, fill = local)) +\n geom_bar() +\n scale_fill_viridis_d(end = .5)\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-by-local-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n mutate(tip = forcats::fct_rev(tip)) %>% \n ggplot(aes(x = hour, fill = tip)) +\n geom_bar()\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-by-hour-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n mutate(tip = forcats::fct_rev(tip)) %>% \n ggplot(aes(x = hour, fill = tip)) +\n geom_bar(position = \"fill\")\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-by-hour-fill-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n## \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntaxi_train %>% \n mutate(tip = forcats::fct_rev(tip)) %>% \n ggplot(aes(x = distance)) +\n geom_histogram(bins = 100) +\n facet_grid(vars(tip))\n```\n\n::: {.cell-output-display}\n![](figures/taxi-tip-by-distance-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n# Split smarter\n\n##\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/taxi-tip-pct-1.svg){width=960}\n:::\n:::\n\n\nStratified sampling would split within response values\n\n:::notes\nBased on our EDA, we know that the source data contains fewer `\"no\"` tip values than `\"yes\"`. 
We want to make sure we allot equal proportions of those responses so that both the training and testing data have enough of each to give accurate estimates.\n:::\n\n## Stratification\n\nUse `strata = tip`\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_split <- initial_split(taxi, prop = 0.8, strata = tip)\ntaxi_split\n#> \n#> <7045/1762/8807>\n```\n:::\n\n\n## Stratification\n\nStratification often helps, with very little downside\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/taxi-tip-pct-by-split-1.svg){width=960}\n:::\n:::\n\n\n## The whole game - status update\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-split.jpg){fig-align='center' width=1772}\n:::\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: 
#2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? 
currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/02-data-budget/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/03-what-makes-a-model/execute-results/html.json b/archive/2023-07-nyr/_freeze/03-what-makes-a-model/execute-results/html.json new file mode 100644 index 00000000..3ef93b29 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/03-what-makes-a-model/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "f763c4ce44cf1dec8e9b54c63d259ee8", + "result": { + "markdown": "---\ntitle: \"3 - What makes a model?\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*How do you fit a linear model in R?*\n\n*How many different ways can you think of?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n03:00\n
\n```\n:::\n:::\n\n\n. . .\n\n- `lm` for linear model\n\n- `glm` for generalized linear model (e.g. logistic regression)\n\n- `glmnet` for regularized regression\n\n- `keras` for regression using TensorFlow\n\n- `stan` for Bayesian regression\n\n- `spark` for large data sets\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n. . .\n\n::: columns\n::: {.column width=\"40%\"}\n- Choose a [model]{.underline}\n- Specify an engine\n- Set the mode\n:::\n\n::: {.column width=\"60%\"}\n![](images/taxi_spinning.svg)\n:::\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlogistic_reg()\n#> Logistic Regression Model Specification (classification)\n#> \n#> Computational engine: glm\n```\n:::\n\n\n\n:::notes\nModels have default engines\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"40%\"}\n- Choose a model\n- Specify an [engine]{.underline}\n- Set the mode\n:::\n\n::: {.column width=\"60%\"}\n![](images/taxi_spinning.svg)\n:::\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogistic_reg() %>%\n set_engine(\"glmnet\")\n#> Logistic Regression Model Specification (classification)\n#> \n#> Computational engine: glmnet\n```\n:::\n\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlogistic_reg() %>%\n set_engine(\"stan\")\n#> Logistic Regression Model Specification (classification)\n#> \n#> Computational engine: stan\n```\n:::\n\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"40%\"}\n- Choose a model\n- Specify an engine\n- Set the [mode]{.underline}\n:::\n\n::: {.column width=\"60%\"}\n![](images/taxi_spinning.svg)\n:::\n:::\n\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndecision_tree()\n#> Decision Tree Model Specification (unknown mode)\n#> \n#> Computational engine: rpart\n```\n:::\n\n\n:::notes\nSome models have a default mode\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndecision_tree() %>% \n set_mode(\"classification\")\n#> Decision Tree Model Specification (classification)\n#> \n#> Computational engine: rpart\n```\n:::\n\n\n. . .\n\n
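For instance, all three steps can be chained in a single pipeline. A minimal sketch (assuming the ranger package is installed):\n\n```{.r}\n# illustration only: choose a model, then its engine, then its mode\nrand_forest(trees = 1000) %>%\n  set_engine(\"ranger\") %>%\n  set_mode(\"classification\")\n```\n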

\n\n::: r-fit-text\nAll available models are listed at \n:::\n\n## {background-iframe=\"https://www.tidymodels.org/find/parsnip/\"}\n\n::: footer\n:::\n\n## To specify a model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"40%\"}\n- Choose a [model]{.underline}\n- Specify an [engine]{.underline}\n- Set the [mode]{.underline}\n:::\n\n::: {.column width=\"60%\"}\n![](images/taxi_spinning.svg)\n:::\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run the `tree_spec` chunk in your `.qmd`.*\n\n*Edit this code to use a different model.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n05:00\n
\n```\n:::\n:::\n\n\n

\n\n::: r-fit-text\nAll available models are listed at \n:::\n\n## Models we'll be using today\n\n* Logistic regression\n* Decision trees\n\n## Logistic regression\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-10-1.svg){width=768}\n:::\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n:::\n:::\n\n## Logistic regression\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-11-1.svg){width=768}\n:::\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n:::\n:::\n\n## Logistic regression\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-12-1.svg){width=768}\n:::\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n\n- Logit of outcome probability modeled as linear combination of predictors:\n\n$log(\\frac{p}{1 - p}) = \\beta_0 + \\beta_1\\cdot \\text{distance}$\n\n- Find a sigmoid line that separates the two classes\n\n:::\n:::\n\n## Decision trees\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n:::\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-14-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n:::\n\n::: {.column width=\"50%\"}\n:::\n:::\n\n## Decision trees\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-15-1.svg){fig-align='center' width=960}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n- Series of splits or if/then statements based on predictors\n\n- First the tree *grows* until some condition is met (maximum depth, no more data)\n\n- Then the tree is *pruned* to reduce its complexity\n:::\n:::\n\n## Decision trees\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-16-1.svg){fig-align='center' width=768}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-17-1.svg){width=768}\n:::\n:::\n\n:::\n:::\n\n## All models are wrong, but some are useful!\n\n::: columns\n::: {.column width=\"50%\"}\n### Logistic regression\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-18-1.svg){width=768}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n### Decision trees\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-19-1.svg){width=768}\n:::\n:::\n\n:::\n:::\n\n# A model workflow\n\n## Workflows bind preprocessors and models\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/good_workflow.png){fig-align='center' width=70%}\n:::\n:::\n\n\n:::notes\nExplain that PCA that is a preprocessor / dimensionality reduction, used to decorrelate data\n:::\n\n\n## What is wrong with this? {.annotation}\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/bad_workflow.png){fig-align='center' width=70%}\n:::\n:::\n\n\n## Why a `workflow()`? ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n. . .\n\n- Workflows handle new data better than base R tools in terms of new factor levels\n\n. . .\n\n- You can use other preprocessors besides formulas (more on feature engineering tomorrow!)\n\n. . .\n\n- They can help organize your work when working with multiple models\n\n. . 
.\n\n- [Most importantly]{.underline}, a workflow captures the entire modeling process: `fit()` and `predict()` apply to the preprocessing steps in addition to the actual model fit\n\n::: notes\nTwo ways workflows handle levels better than base R:\n\n- Enforces that new levels are not allowed at prediction time (this is an optional check that can be turned off)\n\n- Restores missing levels that were present at fit time, but happen to be missing at prediction time (like, if your \"new\" data just doesn't have an instance of that level)\n:::\n\n## A model workflow ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\ntree_spec %>% \n fit(tip ~ ., data = taxi_train) \n#> parsnip model object\n#> \n#> n= 7045 \n#> \n#> node), split, n, loss, yval, (yprob)\n#> * denotes terminal node\n#> \n#> 1) root 7045 2069 yes (0.70631654 0.29368346) \n#> 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n#> 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n#> 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n#> 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n#> 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n#> 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n#> 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n#> 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n#> 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n#> 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n#> 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n#> 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n#> 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n#> 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n\n\n## A model workflow ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\nworkflow() %>%\n add_formula(tip ~ .) 
%>%\n add_model(tree_spec) %>%\n fit(data = taxi_train) \n#> ══ Workflow [trained] ════════════════════════════════════════════════\n#> Preprocessor: Formula\n#> Model: decision_tree()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> tip ~ .\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> n= 7045 \n#> \n#> node), split, n, loss, yval, (yprob)\n#> * denotes terminal node\n#> \n#> 1) root 7045 2069 yes (0.70631654 0.29368346) \n#> 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n#> 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n#> 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n#> 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n#> 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n#> 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n#> 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n#> 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n#> 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n#> 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n#> 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n#> 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n#> 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n#> 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n\n\n## A model workflow ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\nworkflow(tip ~ ., tree_spec) %>% \n fit(data = taxi_train) \n#> ══ Workflow [trained] ════════════════════════════════════════════════\n#> Preprocessor: Formula\n#> Model: decision_tree()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> tip ~ .\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> n= 7045 \n#> \n#> node), split, n, loss, yval, (yprob)\n#> * denotes terminal node\n#> \n#> 1) root 7045 2069 yes (0.70631654 0.29368346) \n#> 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n#> 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n#> 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n#> 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n#> 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n#> 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n#> 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n#> 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n#> 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n#> 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n#> 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n#> 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n#> 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n#> 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run the `tree_wflow` chunk in your 
`.qmd`.*\n\n*Edit this code to make a workflow with your own model of choice.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n05:00\n
\n```\n:::\n:::\n\n\n## Predict with your model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nHow do you use your new `tree_fit` model?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\ntree_fit <-\n workflow(tip ~ ., tree_spec) %>% \n fit(data = taxi_train) \n```\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run:*\n\n`predict(tree_fit, new_data = taxi_test)`\n\n*What do you get?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n03:00\n
\n```\n:::\n:::\n\n\n## Your turn\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run:*\n\n`augment(tree_fit, new_data = taxi_test)`\n\n*What do you get?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n03:00\n
\n```\n:::\n:::\n\n\n# The tidymodels prediction guarantee!\n\n. . .\n\n- The predictions will always be inside a **tibble**\n- The column names and types are **unsurprising** and **predictable**\n- The number of rows in `new_data` and the output **are the same**\n\n## Understand your model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nHow do you **understand** your new `tree_fit` model?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/unnamed-chunk-27-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n## Understand your model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nHow do you **understand** your new `tree_fit` model?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rpart.plot)\ntree_fit %>%\n extract_fit_engine() %>%\n rpart.plot(roundint = FALSE)\n```\n:::\n\n\nYou can `extract_*()` several components of your fitted workflow.\n\n::: notes\n`roundint = FALSE` is only to quiet a warning\n:::\n\n\n## Understand your model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nHow do you **understand** your new `tree_fit` model?\n\n. . .\n\nYou can use your fitted workflow for model and/or prediction explanations:\n\n. . .\n\n- overall variable importance, such as with the [vip](https://koalaverse.github.io/vip/) package\n\n. . .\n\n- flexible model explainers, such as with the [DALEXtra](https://dalex.drwhy.ai/) package\n\n. . .\n\nLearn more at \n\n## {background-iframe=\"https://hardhat.tidymodels.org/reference/hardhat-extract.html\"}\n\n::: footer\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Extract the model engine object from your fitted workflow.*\n\n⚠️ *Never `predict()` with any extracted components!*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n05:00\n
\n```\n:::\n:::\n\n\n:::notes\nAfterward, ask what kind of object people got from the extraction, and what they did with it (e.g. give it to `summary()`, `plot()`, `broom::tidy()` ). Live code along\n:::\n\n# Deploy your model ![](hexes/vetiver.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n## {background-image=\"https://vetiver.rstudio.com/images/ml_ops_cycle.png\" background-size=\"contain\"}\n\n## Deploying a model ![](hexes/vetiver.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nHow do you use your new `tree_fit` model in **production**?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(vetiver)\nv <- vetiver_model(tree_fit, \"taxi\")\nv\n#> \n#> ── taxi ─ model for deployment \n#> A rpart classification modeling workflow using 6 features\n```\n:::\n\n\nLearn more at \n\n## Deploy your model ![](hexes/vetiver.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nHow do you use your new model `tree_fit` in **production**?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(plumber)\npr() %>%\n vetiver_api(v)\n#> # Plumber router with 2 endpoints, 4 filters, and 1 sub-router.\n#> # Use `pr_run()` on this object to start the API.\n#> β”œβ”€β”€[queryString]\n#> β”œβ”€β”€[body]\n#> β”œβ”€β”€[cookieParser]\n#> β”œβ”€β”€[sharedSecret]\n#> β”œβ”€β”€/logo\n#> β”‚ β”‚ # Plumber static router serving from directory: /Library/Frameworks/R.framework/Versions/4.2-arm64/Resources/library/vetiver\n#> β”œβ”€β”€/ping (GET)\n#> └──/predict (POST)\n```\n:::\n\n\nLearn more at \n\n:::notes\nLive-code making a prediction\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Run the `vetiver` chunk in your `.qmd`.*\n\n*Check out the automated visual documentation.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n05:00\n
\n```\n:::\n:::\n\n\n## The whole game - status update\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-model-1.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n:::notes\nStress that fitting a model on the entire training set was only for illustrating how to fit a model\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: 
translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? 
currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/03-what-makes-a-model/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/04-evaluating-models/execute-results/html.json b/archive/2023-07-nyr/_freeze/04-evaluating-models/execute-results/html.json new file mode 100644 index 00000000..7828ad35 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/04-evaluating-models/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "385a477f21d39283e40c50ca0fb00615", + "result": { + "markdown": "---\ntitle: \"4 - Evaluating models\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n## Looking at predictions\n\n\n::: {.cell}\n\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n relocate(tip, .pred_class, .pred_yes, .pred_no)\n#> # A tibble: 7,045 Γ— 10\n#> tip .pred_class .pred_yes .pred_no distance company local dow month hour\n#> \n#> 1 no no 0.0625 0.937 5.39 Flash … no Sat Mar 12\n#> 2 no yes 0.924 0.0758 18.4 Sun Ta… no Sat Apr 6\n#> 3 no no 0.391 0.609 5.8 other no Tue Jan 10\n#> 4 no no 0.112 0.888 6.85 Flash … no Fri Apr 8\n#> 5 no no 0.129 0.871 9.5 City S… no Wed Jan 7\n#> 6 no no 0.326 0.674 12 other no Fri Apr 11\n#> 7 no no 0.0917 0.908 8.9 Taxi A… no Mon Feb 14\n#> 8 no yes 0.902 0.0980 1.38 other no Fri 
Apr 16\n#> 9 no no 0.0917 0.908 9.12 Flash … no Wed Apr 9\n#> 10 no yes 0.933 0.0668 2.28 City S… no Thu Apr 16\n#> # β„Ή 7,035 more rows\n```\n:::\n\n\n## Confusion matrix ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/confusion-matrix.png)\n\n## Confusion matrix ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n conf_mat(truth = tip, estimate = .pred_class)\n#> Truth\n#> Prediction yes no\n#> yes 4639 660\n#> no 337 1409\n```\n:::\n\n\n## Confusion matrix ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n conf_mat(truth = tip, estimate = .pred_class) %>%\n autoplot(type = \"heatmap\")\n```\n\n::: {.cell-output-display}\n![](figures/unnamed-chunk-5-1.svg){width=960}\n:::\n:::\n\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n accuracy(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n```\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n![](images/confusion-matrix-accuracy.png)\n:::\n:::\n\n## Dangers of accuracy ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe need to be careful of using `accuracy()` since it can give \"good\" performance by only predicting one way with imbalanced data\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n mutate(.pred_class = factor(\"yes\", levels = c(\"yes\", \"no\"))) %>%\n accuracy(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.706\n```\n:::\n\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n sensitivity(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 sensitivity binary 0.932\n```\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n![](images/confusion-matrix-sensitivity.png)\n:::\n:::\n\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"60%\"}\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3-6\"}\naugment(taxi_fit, new_data = taxi_train) %>%\n sensitivity(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 sensitivity binary 0.932\n```\n:::\n\n\n
\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n specificity(truth = tip, estimate = .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 specificity binary 0.681\n```\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n![](images/confusion-matrix-specificity.png)\n:::\n:::\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe can use `metric_set()` to combine multiple calculations into one\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_metrics <- metric_set(accuracy, specificity, sensitivity)\n\naugment(taxi_fit, new_data = taxi_train) %>%\n taxi_metrics(truth = tip, estimate = .pred_class)\n#> # A tibble: 3 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n#> 2 specificity binary 0.681\n#> 3 sensitivity binary 0.932\n```\n:::\n\n\n## Metrics for model performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_metrics <- metric_set(accuracy, specificity, sensitivity)\n\naugment(taxi_fit, new_data = taxi_train) %>%\n group_by(local) %>%\n taxi_metrics(truth = tip, estimate = .pred_class)\n#> # A tibble: 6 Γ— 4\n#> local .metric .estimator .estimate\n#> \n#> 1 yes accuracy binary 0.840\n#> 2 no accuracy binary 0.862\n#> 3 yes specificity binary 0.346\n#> 4 no specificity binary 0.719\n#> 5 yes sensitivity binary 0.969\n#> 6 no sensitivity binary 0.925\n```\n:::\n\n\n## Two class data\n\nThese metrics assume that we know the threshold for converting \"soft\" probability predictions into \"hard\" class predictions.\n\n. . .\n\nIs a 50% threshold good? \n\nWhat happens if we say that we need to be 80% sure to declare an event?\n\n- sensitivity ⬇️, specificity ⬆️\n\n. . .\n\nWhat happens for a 20% threshold?\n\n- sensitivity ⬆️, specificity ⬇️\n\n## Varying the threshold\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/thresholds-1.svg){width=960}\n:::\n:::\n\n\n## ROC curves\n\nTo make an ROC (receiver operator characteristic) curve, we:\n\n- calculate the sensitivity and specificity for all possible thresholds\n\n- plot false positive rate (x-axis) versus true positive rate (y-axis)\n\ngiven that sensitivity is the true positive rate, and specificity is the true negative rate. Hence `1 - specificity` is the false positive rate.\n\n. . 
.\n\nWe can use the area under the ROC curve as a classification metric: \n\n- ROC AUC = 1 πŸ’― \n- ROC AUC = 1/2 😒\n\n:::notes\nROC curves are insensitive to class imbalance.\n:::\n\n## ROC curves ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Assumes _first_ factor level is event; there are options to change that\naugment(taxi_fit, new_data = taxi_train) %>% \n roc_curve(truth = tip, .pred_yes) %>%\n slice(1, 20, 50)\n#> # A tibble: 3 Γ— 3\n#> .threshold specificity sensitivity\n#> \n#> 1 -Inf 0 1 \n#> 2 0.25 0.486 0.972\n#> 3 0.6 0.705 0.920\n\naugment(taxi_fit, new_data = taxi_train) %>% \n roc_auc(truth = tip, .pred_yes)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 roc_auc binary 0.868\n```\n:::\n\n\n## ROC curve plot ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell output-location='column'}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>% \n roc_curve(truth = tip, .pred_yes) %>%\n autoplot()\n```\n\n::: {.cell-output-display}\n![](figures/roc-curve-1.svg){width=576}\n:::\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Compute and plot an ROC curve for your current model.*\n\n*What data are being used for this ROC curve plot?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n## {background-iframe=\"https://yardstick.tidymodels.org/reference/index.html\"}\n\n::: footer\n:::\n\n# ⚠️ DANGERS OF OVERFITTING ⚠️\n\n## Dangers of overfitting ⚠️\n\n![](https://raw.githubusercontent.com/topepo/2022-nyr-workshop/main/images/tuning-overfitting-train-1.svg)\n\n## Dangers of overfitting ⚠️\n\n![](https://raw.githubusercontent.com/topepo/2022-nyr-workshop/main/images/tuning-overfitting-test-1.svg)\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train)\n#> # A tibble: 7,045 Γ— 10\n#> tip distance company local dow month hour .pred_class .pred_yes .pred_no\n#> \n#> 1 no 5.39 Flash … no Sat Mar 12 no 0.0625 0.937 \n#> 2 no 18.4 Sun Ta… no Sat Apr 6 yes 0.924 0.0758\n#> 3 no 5.8 other no Tue Jan 10 no 0.391 0.609 \n#> 4 no 6.85 Flash … no Fri Apr 8 no 0.112 0.888 \n#> 5 no 9.5 City S… no Wed Jan 7 no 0.129 0.871 \n#> 6 no 12 other no Fri Apr 11 no 0.326 0.674 \n#> 7 no 8.9 Taxi A… no Mon Feb 14 no 0.0917 0.908 \n#> 8 no 1.38 other no Fri Apr 16 yes 0.902 0.0980\n#> 9 no 9.12 Flash … no Wed Apr 9 no 0.0917 0.908 \n#> 10 no 2.28 City S… no Thu Apr 16 yes 0.933 0.0668\n#> # β„Ή 7,035 more rows\n```\n:::\n\n\nWe call this \"resubstitution\" or \"repredicting the training set\"\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n accuracy(tip, .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n```\n:::\n\n\nWe call this a \"resubstitution estimate\"\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n accuracy(tip, .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n```\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n:::\n:::\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n accuracy(tip, .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.858\n```\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_test) %>%\n accuracy(tip, .pred_class)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 accuracy binary 0.795\n```\n:::\n\n:::\n:::\n\n. . .\n\n⚠️ Remember that we're demonstrating overfitting \n\n. . .\n\n⚠️ Don't use the test set until the *end* of your modeling analysis\n\n\n## {background-image=\"https://media.giphy.com/media/55itGuoAJiZEEen9gg/giphy.gif\" background-size=\"70%\"}\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute bottom=\"0\" left=\"0\" width=\"150\" height=\"150\"}\n\n*Use `augment()` and and a metric function to compute a classification metric like `brier_class()`.*\n\n*Compute the metrics for both training and testing data to demonstrate overfitting!*\n\n*Notice the evidence of overfitting!* ⚠️\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n## Dangers of overfitting ⚠️ ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n brier_class(tip, .pred_yes)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 brier_class binary 0.113\n```\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_test) %>%\n brier_class(tip, .pred_yes)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 brier_class binary 0.152\n```\n:::\n\n:::\n:::\n\n. . .\n\nWhat if we want to compare more models?\n\n. . .\n\nAnd/or more model configurations?\n\n. . .\n\nAnd we want to understand if these are important differences?\n\n# The testing data are precious πŸ’Ž\n\n# How can we use the *training* data to compare and evaluate different models? πŸ€”\n\n## {background-color=\"white\" background-image=\"https://www.tmwr.org/premade/resampling.svg\" background-size=\"80%\"}\n\n## Cross-validation\n\n![](https://www.tmwr.org/premade/three-CV.svg)\n\n## Cross-validation\n\n![](https://www.tmwr.org/premade/three-CV-iter.svg)\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*If we use 10 folds, what percent of the training data*\n\n- *ends up in analysis*\n- *ends up in assessment*\n\n*for* **each** *fold?*\n\n![](images/taxi_spinning.svg){width=\"300\"}\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvfold_cv(taxi_train) # v = 10 is default\n#> # 10-fold cross-validation \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Fold01\n#> 2 Fold02\n#> 3 Fold03\n#> 4 Fold04\n#> 5 Fold05\n#> 6 Fold06\n#> 7 Fold07\n#> 8 Fold08\n#> 9 Fold09\n#> 10 Fold10\n```\n:::\n\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWhat is in this?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_folds <- vfold_cv(taxi_train)\ntaxi_folds$splits[1:3]\n#> [[1]]\n#> \n#> <6340/705/7045>\n#> \n#> [[2]]\n#> \n#> <6340/705/7045>\n#> \n#> [[3]]\n#> \n#> <6340/705/7045>\n```\n:::\n\n\n::: notes\nTalk about a list column, storing non-atomic types in dataframe\n:::\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvfold_cv(taxi_train, v = 5)\n#> # 5-fold cross-validation \n#> # A tibble: 5 Γ— 2\n#> splits id \n#> \n#> 1 Fold1\n#> 2 Fold2\n#> 3 Fold3\n#> 4 Fold4\n#> 5 Fold5\n```\n:::\n\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvfold_cv(taxi_train, strata = tip)\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Fold01\n#> 2 Fold02\n#> 3 Fold03\n#> 4 Fold04\n#> 5 Fold05\n#> 6 Fold06\n#> 7 Fold07\n#> 8 Fold08\n#> 9 Fold09\n#> 10 Fold10\n```\n:::\n\n\n. . .\n\nStratification often helps, with very little downside\n\n## Cross-validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe'll use this setup:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_folds <- vfold_cv(taxi_train, v = 10, strata = tip)\ntaxi_folds\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Fold01\n#> 2 Fold02\n#> 3 Fold03\n#> 4 Fold04\n#> 5 Fold05\n#> 6 Fold06\n#> 7 Fold07\n#> 8 Fold08\n#> 9 Fold09\n#> 10 Fold10\n```\n:::\n\n\n. . .\n\nSet the seed when creating resamples\n\n# We are equipped with metrics and resamples!\n\n## Fit our model to the resamples\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res <- fit_resamples(taxi_wflow, taxi_folds)\ntaxi_res\n#> # Resampling results\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 4\n#> splits id .metrics .notes \n#> \n#> 1 Fold01 \n#> 2 Fold02 \n#> 3 Fold03 \n#> 4 Fold04 \n#> 5 Fold05 \n#> 6 Fold06 \n#> 7 Fold07 \n#> 8 Fold08 \n#> 9 Fold09 \n#> 10 Fold10 \n```\n:::\n\n\n## Evaluating model performance ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res %>%\n collect_metrics()\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 accuracy binary 0.793 10 0.00293 Preprocessor1_Model1\n#> 2 roc_auc binary 0.809 10 0.00461 Preprocessor1_Model1\n```\n:::\n\n\n. . 
.\n\nWe can reliably measure performance using only the **training** data πŸŽ‰\n\n## Comparing metrics ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nHow do the metrics from resampling compare to the metrics from training and testing?\n\n\n::: {.cell}\n\n:::\n\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res %>%\n collect_metrics() %>% \n select(.metric, mean, n)\n#> # A tibble: 2 Γ— 3\n#> .metric mean n\n#> \n#> 1 accuracy 0.793 10\n#> 2 roc_auc 0.809 10\n```\n:::\n\n:::\n\n::: {.column width=\"50%\"}\nThe ROC AUC previously was\n\n- 0.87 for the training set\n- 0.81 for test set\n:::\n:::\n\n. . .\n\nRemember that:\n\n⚠️ the training set gives you overly optimistic metrics\n\n⚠️ the test set is precious\n\n## Evaluating model performance ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Save the assessment set results\nctrl_taxi <- control_resamples(save_pred = TRUE)\ntaxi_res <- fit_resamples(taxi_wflow, taxi_folds, control = ctrl_taxi)\n\ntaxi_preds <- collect_predictions(taxi_res)\ntaxi_preds\n#> # A tibble: 7,045 Γ— 7\n#> id .pred_yes .pred_no .row .pred_class tip .config \n#> \n#> 1 Fold01 0.936 0.0638 10 yes no Preprocessor1_Model1\n#> 2 Fold01 0.898 0.102 20 yes no Preprocessor1_Model1\n#> 3 Fold01 0.898 0.102 47 yes no Preprocessor1_Model1\n#> 4 Fold01 0.101 0.899 51 no no Preprocessor1_Model1\n#> 5 Fold01 0.871 0.129 59 yes no Preprocessor1_Model1\n#> 6 Fold01 0.0815 0.918 60 no no Preprocessor1_Model1\n#> 7 Fold01 0.162 0.838 92 no no Preprocessor1_Model1\n#> 8 Fold01 0.26 0.74 97 no no Preprocessor1_Model1\n#> 9 Fold01 0.274 0.726 98 no no Preprocessor1_Model1\n#> 10 Fold01 0.804 0.196 104 yes no Preprocessor1_Model1\n#> # β„Ή 7,035 more rows\n```\n:::\n\n\n## Evaluating model performance ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_preds %>% \n group_by(id) %>%\n taxi_metrics(truth = tip, estimate = .pred_class)\n#> # A tibble: 30 Γ— 4\n#> id .metric .estimator .estimate\n#> \n#> 1 Fold01 accuracy binary 0.793\n#> 2 Fold02 accuracy binary 0.8 \n#> 3 Fold03 accuracy binary 0.786\n#> 4 Fold04 accuracy binary 0.804\n#> 5 Fold05 accuracy binary 0.796\n#> 6 Fold06 accuracy binary 0.789\n#> 7 Fold07 accuracy binary 0.793\n#> 8 Fold08 accuracy binary 0.808\n#> 9 Fold09 accuracy binary 0.783\n#> 10 Fold10 accuracy binary 0.780\n#> # β„Ή 20 more rows\n```\n:::\n\n\n## Where are the fitted models? ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res\n#> # Resampling results\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 5\n#> splits id .metrics .notes .predictions\n#> \n#> 1 Fold01 \n#> 2 Fold02 \n#> 3 Fold03 \n#> 4 Fold04 \n#> 5 Fold05 \n#> 6 Fold06 \n#> 7 Fold07 \n#> 8 Fold08 \n#> 9 Fold09 \n#> 10 Fold10 \n```\n:::\n\n\n. . 
.\n\nπŸ—‘οΈ\n\n# Alternate resampling schemes\n\n## Bootstrapping\n\n![](https://www.tmwr.org/premade/bootstraps.svg)\n\n## Bootstrapping ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(3214)\nbootstraps(taxi_train)\n#> # Bootstrap sampling \n#> # A tibble: 25 Γ— 2\n#> splits id \n#> \n#> 1 Bootstrap01\n#> 2 Bootstrap02\n#> 3 Bootstrap03\n#> 4 Bootstrap04\n#> 5 Bootstrap05\n#> 6 Bootstrap06\n#> 7 Bootstrap07\n#> 8 Bootstrap08\n#> 9 Bootstrap09\n#> 10 Bootstrap10\n#> # β„Ή 15 more rows\n```\n:::\n\n\n## {background-iframe=\"https://rsample.tidymodels.org/reference/index.html\"}\n\n::: footer\n:::\n\n## The whole game - status update\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-resamples.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Create:*\n\n- *Monte Carlo Cross-Validation sets*\n- *validation set*\n\n(use the reference guide to find the function)\n\n*Don't forget to set a seed when you resample!*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n## Monte Carlo Cross-Validation ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(322)\nmc_cv(taxi_train, times = 10)\n#> # Monte Carlo cross-validation (0.75/0.25) with 10 resamples \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Resample01\n#> 2 Resample02\n#> 3 Resample03\n#> 4 Resample04\n#> 5 Resample05\n#> 6 Resample06\n#> 7 Resample07\n#> 8 Resample08\n#> 9 Resample09\n#> 10 Resample10\n```\n:::\n\n\n## Validation set ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(853)\nvalidation_split(taxi_train, strata = tip)\n#> # Validation Set Split (0.75/0.25) using stratification \n#> # A tibble: 1 Γ— 2\n#> splits id \n#> \n#> 1 validation\n```\n:::\n\n\n. . .\n\nA validation set is just another type of resample\n\n# Decision tree 🌳\n\n# Random forest 🌳🌲🌴🌡🌴🌳🌳🌴🌲🌡🌴🌲🌳🌴🌳🌡🌡🌴🌲🌲🌳🌴🌳🌴🌲🌴🌡🌴🌲🌴🌡🌲🌡🌴🌲🌳🌴🌡🌳🌴🌳\n\n## Random forest 🌳🌲🌴🌡🌳🌳🌴🌲🌡🌴🌳🌡\n\n- Ensemble many decision tree models\n\n- All the trees vote! πŸ—³οΈ\n\n- Bootstrap aggregating + random predictor sampling\n\n. . .\n\n- Often works well without tuning hyperparameters (more on this tomorrow!), as long as there are enough trees\n\n## Create a random forest model ![](hexes/parsnip.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrf_spec <- rand_forest(trees = 1000, mode = \"classification\")\nrf_spec\n#> Random Forest Model Specification (classification)\n#> \n#> Main Arguments:\n#> trees = 1000\n#> \n#> Computational engine: ranger\n```\n:::\n\n\n## Create a random forest model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrf_wflow <- workflow(tip ~ ., rf_spec)\nrf_wflow\n#> ══ Workflow ══════════════════════════════════════════════════════════\n#> Preprocessor: Formula\n#> Model: rand_forest()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> tip ~ .\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> Random Forest Model Specification (classification)\n#> \n#> Main Arguments:\n#> trees = 1000\n#> \n#> Computational engine: ranger\n```\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Use `fit_resamples()` and `rf_wflow` to:*\n\n- *keep predictions*\n- *compute metrics*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n08:00\n
\n```\n:::\n:::\n\n\n## Evaluating model performance ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nctrl_taxi <- control_resamples(save_pred = TRUE)\n\n# Random forest uses random numbers so set the seed first\n\nset.seed(2)\nrf_res <- fit_resamples(rf_wflow, taxi_folds, control = ctrl_taxi)\ncollect_metrics(rf_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 accuracy binary 0.813 10 0.00305 Preprocessor1_Model1\n#> 2 roc_auc binary 0.832 10 0.00513 Preprocessor1_Model1\n```\n:::\n\n\n## How can we compare multiple model workflows at once?\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/taxi_spinning.svg){fig-align='center'}\n:::\n:::\n\n\n\n## Evaluate a workflow set\n\n\n::: {.cell}\n\n```{.r .cell-code}\nworkflow_set(list(tip ~ .), list(tree_spec, rf_spec))\n#> # A workflow set/tibble: 2 Γ— 4\n#> wflow_id info option result \n#> \n#> 1 formula_decision_tree \n#> 2 formula_rand_forest \n```\n:::\n\n\n## Evaluate a workflow set\n\n\n::: {.cell}\n\n```{.r .cell-code}\nworkflow_set(list(tip ~ .), list(tree_spec, rf_spec)) %>%\n workflow_map(\"fit_resamples\", resamples = taxi_folds)\n#> # A workflow set/tibble: 2 Γ— 4\n#> wflow_id info option result \n#> \n#> 1 formula_decision_tree \n#> 2 formula_rand_forest \n```\n:::\n\n\n## Evaluate a workflow set\n\n\n::: {.cell}\n\n```{.r .cell-code}\nworkflow_set(list(tip ~ .), list(tree_spec, rf_spec)) %>%\n workflow_map(\"fit_resamples\", resamples = taxi_folds) %>%\n rank_results()\n#> # A tibble: 4 Γ— 9\n#> wflow_id .config .metric mean std_err n preprocessor model rank\n#> \n#> 1 formula_rand_for… Prepro… accura… 0.813 0.00339 10 formula rand… 1\n#> 2 formula_rand_for… Prepro… roc_auc 0.833 0.00528 10 formula rand… 1\n#> 3 formula_decision… Prepro… accura… 0.793 0.00293 10 formula deci… 2\n#> 4 formula_decision… Prepro… roc_auc 0.809 0.00461 10 formula deci… 2\n```\n:::\n\n\nThe first metric of the metric set is used for ranking. Use `rank_metric` to change that.\n\n. . .\n\nLots more available with workflow sets, like `collect_metrics()`, `autoplot()` methods, and more!\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*When do you think a workflow set would be useful?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n## The whole game - status update\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-select.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n## The final fit ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} \n\nSuppose that we are happy with our random forest model.\n\nLet's fit the model on the training set and verify our performance using the test set.\n\n. . .\n\nWe've shown you `fit()` and `predict()` (+ `augment()`) but there is a shortcut:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# taxi_split has train + test info\nfinal_fit <- last_fit(rf_wflow, taxi_split) \n\nfinal_fit\n#> # Resampling results\n#> # Manual resampling \n#> # A tibble: 1 Γ— 6\n#> splits id .metrics .notes .predictions .workflow \n#> \n#> 1 train/test split \n```\n:::\n\n\n## What is in `final_fit`? ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(final_fit)\n#> # A tibble: 2 Γ— 4\n#> .metric .estimator .estimate .config \n#> \n#> 1 accuracy binary 0.810 Preprocessor1_Model1\n#> 2 roc_auc binary 0.817 Preprocessor1_Model1\n```\n:::\n\n\n. . .\n\nThese are metrics computed with the **test** set\n\n## What is in `final_fit`? ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_predictions(final_fit)\n#> # A tibble: 1,762 Γ— 7\n#> id .pred_yes .pred_no .row .pred_class tip .config \n#> \n#> 1 train/test split 0.732 0.268 10 yes no Preprocessor1_Mo…\n#> 2 train/test split 0.827 0.173 29 yes yes Preprocessor1_Mo…\n#> 3 train/test split 0.899 0.101 35 yes yes Preprocessor1_Mo…\n#> 4 train/test split 0.914 0.0856 42 yes yes Preprocessor1_Mo…\n#> 5 train/test split 0.911 0.0889 47 yes no Preprocessor1_Mo…\n#> 6 train/test split 0.848 0.152 54 yes yes Preprocessor1_Mo…\n#> 7 train/test split 0.580 0.420 59 yes yes Preprocessor1_Mo…\n#> 8 train/test split 0.912 0.0876 62 yes yes Preprocessor1_Mo…\n#> 9 train/test split 0.810 0.190 63 yes yes Preprocessor1_Mo…\n#> 10 train/test split 0.960 0.0402 69 yes yes Preprocessor1_Mo…\n#> # β„Ή 1,752 more rows\n```\n:::\n\n\n## What is in `final_fit`? ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nextract_workflow(final_fit)\n#> ══ Workflow [trained] ════════════════════════════════════════════════\n#> Preprocessor: Formula\n#> Model: rand_forest()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> tip ~ .\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> Ranger result\n#> \n#> Call:\n#> ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) \n#> \n#> Type: Probability estimation \n#> Number of trees: 1000 \n#> Sample size: 7045 \n#> Number of independent variables: 6 \n#> Mtry: 2 \n#> Target node size: 10 \n#> Variable importance mode: none \n#> Splitrule: gini \n#> OOB prediction error (Brier s.): 0.1373147\n```\n:::\n\n\n. . 
.\n\nUse this for **prediction** on new data, like for deploying\n\n## The whole game\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/whole-game-final-performance.jpg){fig-align='center' width=1772}\n:::\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*End of the day discussion!*\n\n*Which model do you think you would decide to use?*\n\n*What surprised you the most?*\n\n*What is one thing you are looking forward to for tomorrow?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + 
-webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? 
currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/04-evaluating-models/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/08-wrapping-up/execute-results/html.json b/archive/2023-07-nyr/_freeze/08-wrapping-up/execute-results/html.json new file mode 100644 index 00000000..537df67e --- /dev/null +++ b/archive/2023-07-nyr/_freeze/08-wrapping-up/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "f2fd12a89707ee8d5ce342b615bf79ff", + "result": { + "markdown": "---\ntitle: \"8 - Wrapping up\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n---\n\n\n\n\n::: r-fit-text\nWe made it!\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n\n*What is one thing you learned that surprised you?*\n\n*What is one thing you learned that you plan to use?*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n\n## Resources to keep learning\n\n. . .\n\n- \n\n. . .\n\n- \n\n. . .\n\n- \n\n. . .\n\n- \n\n. . .\n\nFollow us on Twitter and at the tidyverse blog for updates!\n\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > 
button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? 
currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/08-wrapping-up/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/execute-results/html.json b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/execute-results/html.json new file mode 100644 index 00000000..2594192f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/execute-results/html.json @@ -0,0 +1,23 @@ +{ + "hash": "d8628a825c608e29655e7837aacd529a", + "result": { + "markdown": "---\ntitle: \"1 - Feature Engineering\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n\n## Working with our predictors\n\nWe might want to modify our predictors columns for a few reasons: \n\n::: {.incremental}\n- The model requires them in a different format (e.g. dummy variables for linear regression).\n- The model needs certain data qualities (e.g. same units for K-NN).\n- The outcome is better predicted when one or more columns are transformed in some way (a.k.a \"feature engineering\"). \n:::\n\n. . .\n\nThe first two reasons are fairly predictable ([next page](https://www.tmwr.org/pre-proc-table.html#tab:preprocessing)).\n\nThe last one depends on your modeling problem. 
\n\n\n## {background-iframe=\"https://www.tmwr.org/pre-proc-table.html#tab:preprocessing\"}\n\n::: footer\n:::\n\n\n## What is feature engineering?\n\nThink of a feature as some *representation* of a predictor that will be used in a model.\n\n. . .\n\nExample representations:\n\n- Interactions\n- Polynomial expansions/splines\n- Principal component analysis (PCA) feature extraction\n\nThere are a lot of examples in [_Feature Engineering and Selection_](https://bookdown.org/max/FES/) (FES).\n\n\n\n## Example: Dates\n\nHow can we represent date columns for our model?\n\n. . .\n\nWhen we use a date column in its native format, most models in R convert it to an integer.\n\n. . .\n\nWe can re-engineer it as:\n\n- Days since a reference date\n- Day of the week\n- Month\n- Year\n- Indicators for holidays\n\n::: notes\nThe main point is that we try to maximize performance with different versions of the predictors. \n\nMention that, for the Chicago data, the day or the week features are usually the most important ones in the model.\n:::\n\n## General definitions ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n- *Data preprocessing* steps allow your model to fit.\n\n- *Feature engineering* steps help the model do the least work to predict the outcome as well as possible.\n\nThe recipes package can handle both!\n\n::: notes\nThese terms are often used interchangeably in the ML community but we want to distinguish them.\n:::\n\n\n## Hotel Data ![](hexes/tidymodels.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dplyr.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nWe'll use [data on hotels](https://www.sciencedirect.com/science/article/pii/S2352340918315191) to predict the cost of a room. \n\nThe [data](https://modeldatatoo.tidymodels.org/dev/reference/data_hotel_rates.html) are in the modeldatatoo package. We'll sample down the data and refactor some columns: \n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Data splitting strategy\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/initial-split.svg){fig-align='center' width=20%}\n:::\n:::\n\n\n\n## Data Spending ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nLet's split the data into a training set (75%) and testing set (25%):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n```\n:::\n\n\n\n\n## Your turn {transition=\"slide-in\"}\n\nLet's take some time and investigate the _training data_. The outcome is `avg_price_per_room`. \n\nAre there any interesting characteristics of the data?\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n10:00\n
\n```\n:::\n:::\n\n\n## Resampling Strategy\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/10-Fold-CV.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n\n## Resampling Strategy ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe'll use simple 10-fold cross-validation (stratified sampling):\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\nhotel_rs\n#> # 10-fold cross-validation using stratification \n#> # A tibble: 10 Γ— 2\n#> splits id \n#> \n#> 1 Fold01\n#> 2 Fold02\n#> 3 Fold03\n#> 4 Fold04\n#> 5 Fold05\n#> 6 Fold06\n#> 7 Fold07\n#> 8 Fold08\n#> 9 Fold09\n#> 10 Fold10\n```\n:::\n\n\n\n## Prepare your data for modeling ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n- The recipes package is an extensible framework for pipeable sequences of preprocessing and feature engineering steps.\n\n. . .\n\n- Statistical parameters for the steps can be _estimated_ from an initial data set and then _applied_ to other data sets.\n\n. . .\n\n- The resulting processed output can be used as inputs for statistical or machine learning models.\n\n## A first recipe ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr)\n```\n:::\n\n\n. . .\n\n- The `recipe()` function assigns columns to roles of \"outcome\" or \"predictor\" using the formula\n\n## A first recipe ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(hotel_rec)\n#> # A tibble: 28 Γ— 4\n#> variable type role source \n#> \n#> 1 lead_time predictor original\n#> 2 arrival_date_day_of_month predictor original\n#> 3 stays_in_weekend_nights predictor original\n#> 4 stays_in_week_nights predictor original\n#> 5 adults predictor original\n#> 6 children predictor original\n#> 7 babies predictor original\n#> 8 meal predictor original\n#> 9 country predictor original\n#> 10 market_segment predictor original\n#> # β„Ή 18 more rows\n```\n:::\n\n\nThe `type` column contains information on the variables\n\n\n## Your turn {transition=\"slide-in\"}\n\nWhat do you think are in the `type` vectors for the `lead_time` and `country` columns?\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n02:00\n
\n```\n:::\n:::\n\n\n\n\n## Create indicator variables ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors())\n```\n:::\n\n\n. . .\n\n- For any factor or character predictors, make binary indicators.\n\n- There are *many* recipe steps that can convert categorical predictors to numeric columns.\n\n- `step_dummy()` records the levels of the categorical predictors in the training set. \n\n## Filter out constant columns ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors())\n```\n:::\n\n\n. . .\n\nIn case there is a factor level that was never observed in the training data (resulting in a column of all `0`s), we can delete any *zero-variance* predictors that have a single unique value.\n\n:::notes\nNote that the selector chooses all columns with a role of \"predictor\"\n:::\n\n\n## Normalization ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors())\n```\n:::\n\n\n. . .\n\n- This centers and scales the numeric predictors.\n\n\n- The recipe will use the _training_ set to estimate the means and standard deviations of the data.\n\n. . .\n\n- All data the recipe is applied to will be normalized using those statistics (there is no re-estimation).\n\n## Reduce correlation ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_corr(all_numeric_predictors(), threshold = 0.9)\n```\n:::\n\n\n. . .\n\nTo deal with highly correlated predictors, find the minimum set of predictor columns that make the pairwise correlations less than the threshold.\n\n## Other possible steps ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_pca(all_numeric_predictors())\n```\n:::\n\n\n. . . \n\nPCA feature extraction...\n\n## Other possible steps ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/embed.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n embed::step_umap(all_numeric_predictors(), outcome = avg_price_per_room)\n```\n:::\n\n\n. . . 
\n\nA fancy machine learning supervised dimension reduction technique...\n\n:::notes\nNote that this uses the outcome, and it is from an extension package\n:::\n\n\n## Other possible steps ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6\"}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_spline_natural(year_day, deg_free = 10)\n```\n:::\n\n\n. . . \n\nNonlinear transforms like natural splines, and so on!\n\n## {background-iframe=\"https://recipes.tidymodels.org/reference/index.html\"}\n\n::: footer\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Create a `recipe()` for the hotel data to:*\n\n- *use a Yeo-Johnson (YJ) transformation on `lead_time`*\n- *convert factors to indicator variables*\n- *remove zero-variance variables*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n\n## Minimal recipe ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_indicators <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_YeoJohnson(lead_time) %>%\n step_dummy(all_nominal_predictors()) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n\n## Measuring Performance ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe'll compute two measures: mean absolute error and the coefficient of determination (a.k.a $R^2$). \n\n\\begin{align}\nMAE &= \\frac{1}{n}\\sum_{i=1}^n |y_i - \\hat{y}_i| \\notag \\\\\nR^2 &= cor(y_i, \\hat{y}_i)^2\n\\end{align}\n\nThe focus will be on MAE for parameter optimization. We'll use a metric set to compute these: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_metrics <- metric_set(mae, rsq)\n```\n:::\n\n\n\n## Using a workflow ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(9)\n\nhotel_lm_wflow <-\n workflow() %>%\n add_recipe(hotel_indicators) %>%\n add_model(linear_reg())\n \nctrl <- control_resamples(save_pred = TRUE)\nhotel_lm_res <-\n hotel_lm_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n\ncollect_metrics(hotel_lm_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 mae standard 17.3 10 0.199 Preprocessor1_Model1\n#> 2 rsq standard 0.874 10 0.00400 Preprocessor1_Model1\n```\n:::\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Use `fit_resamples()` to fit your workflow with a recipe.*\n\n*Collect the predictions from the results.*\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n\n## Holdout predictions ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Since we used `save_pred = TRUE`\nlm_val_pred <- collect_predictions(hotel_lm_res)\nlm_val_pred %>% slice(1:7)\n#> # A tibble: 7 Γ— 5\n#> id .pred .row avg_price_per_room .config \n#> \n#> 1 Fold01 62.1 20 40 Preprocessor1_Model1\n#> 2 Fold01 48.0 28 54 Preprocessor1_Model1\n#> 3 Fold01 64.6 45 50 Preprocessor1_Model1\n#> 4 Fold01 45.8 49 42 Preprocessor1_Model1\n#> 5 Fold01 45.8 61 49 Preprocessor1_Model1\n#> 6 Fold01 30.0 66 40 Preprocessor1_Model1\n#> 7 Fold01 38.8 88 49 Preprocessor1_Model1\n```\n:::\n\n\n\n## Calibration Plot ![](hexes/probably.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(probably)\n\ncal_plot_regression(hotel_lm_res, alpha = 1 / 5)\n```\n\n::: {.cell-output-display}\n![](figures/lm-cal-plot-1.svg){fig-align='center' width=40%}\n:::\n:::\n\n\n\n\n## What do we do with the agent and company data? \n\nThere are 98 unique agent values and 100 unique companies in our training set. How can we include this information in our model?\n\n. . .\n\nWe could:\n\n- make the full set of indicator variables 😳\n\n- lump agents and companies that rarely occur into an \"other\" group\n\n- use [feature hashing](https://www.tmwr.org/categorical.html#feature-hashing) to create a smaller set of indicator variables\n\n- use effect encoding to replace the `agent` and `company` columns with the estimated effect of that predictor (in the extra materials)\n\n\n\n\n\n\n\n## Per-agent statistics \n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-freq-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-adr-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n:::\n\n## Collapsing factor levels ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nThere is a recipe step that will redefine factor levels based on their frequency in the training set: \n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4|\"}\nhotel_other_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_YeoJohnson(lead_time) %>%\n step_other(agent, threshold = 0.001) %>%\n step_dummy(all_nominal_predictors()) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n\n\nUsing this code, 34 agents (out of 98) were collapsed into \"other\" based on the training set.\n\nWe _could_ try to optimize the threshold for collapsing (see the next set of slides on model tuning).\n\n## Does othering help? 
![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3|\"}\nhotel_other_wflow <-\n hotel_lm_wflow %>%\n update_recipe(hotel_other_rec)\n\nhotel_other_res <-\n hotel_other_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n\ncollect_metrics(hotel_other_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 mae standard 17.4 10 0.205 Preprocessor1_Model1\n#> 2 rsq standard 0.874 10 0.00417 Preprocessor1_Model1\n```\n:::\n\n\nAabout the same MAE and much faster to complete. \n\nNow let's look at a more sophisticated tool called effect feature hashing. \n\n## Feature Hashing\n\nBetween `agent` and `company`, simple dummy variables would create 198 new columns (that are mostly zeros).\n\nAnother option is to have a binary indicator that combines some levels of these variables.\n\nFeature hashing (for more see [_FES_](https://bookdown.org/max/FES/encoding-predictors-with-many-categories.html), [_SMLTAR_](https://smltar.com/mlregression.html#case-study-feature-hashing), and [_TMwR_](https://www.tmwr.org/categorical.html#feature-hashing)): \n\n- uses the character values of the levels \n- converts them to integer hash values\n- uses the integers to assign them to a specific indicator column. \n\n## Feature Hashing\n\nSuppose we want to use 32 indicator variables for `agent`. \n\nFor a agent with value \"`Max_Kuhn`\", a hashing function converts it to an integer (say 210397726). \n\nTo assign it to one of the 32 columns, we would use modular arithmetic to assign it to a column: \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# For \"Max_Kuhn\" put a '1' in column: \n210397726 %% 32\n#> [1] 30\n```\n:::\n\n\n[Hash functions](https://www.metamorphosite.com/one-way-hash-encryption-sha1-data-software) are meant to _emulate_ randomness. \n\n\n## Feature Hashing Pros\n\n\n- The procedure will automatically work on new values of the predictors.\n- It is fast. \n- \"Signed\" hashes add a sign to help avoid aliasing. \n\n## Feature Hashing Cons\n\n- There is no real logic behind which factor levels are combined. \n- We don't know how many columns to add (more in the next section).\n- Some columns may have all zeros. \n- If a indicator column is important to the model, we can't easily determine why. 
\n\n:::notes\nThe signed hash make it slightly more possible to differentiate between confounded levels\n:::\n\n\n## Feature Hashing in recipes ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"}\n\nThe textrecipes package has a step that can be added to the recipe: \n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6-8|\"}\nlibrary(textrecipes)\n\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n # Defaults to 32 signed indicator columns\n step_dummy_hash(agent) %>%\n step_dummy_hash(company) %>%\n # Regular indicators for the others\n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors())\n\nhotel_hash_wflow <-\n hotel_lm_wflow %>%\n update_recipe(hash_rec)\n```\n:::\n\n\n\n## Feature Hashing in recipes ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_hash_res <-\n hotel_hash_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n\ncollect_metrics(hotel_hash_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 mae standard 17.5 10 0.256 Preprocessor1_Model1\n#> 2 rsq standard 0.872 10 0.00395 Preprocessor1_Model1\n```\n:::\n\n\nAbout the same performance but now we can handle new values. \n\n\n## Debugging a recipe\n\n- Typically, you will want to use a workflow to estimate and apply a recipe.\n\n. . .\n\n- If you have an error and need to debug your recipe, the original recipe object (e.g. `hash_rec`) can be estimated manually with a function called `prep()`. It is analogous to `fit()`. See [TMwR section 16.4](https://www.tmwr.org/dimensionality.html#recipe-functions)\n\n. . .\n\n- Another function (`bake()`) is analogous to `predict()`, and gives you the processed data back.\n\n. . .\n\n- The `tidy()` function can be used to get specific results from the recipe.\n\n## Example ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/broom.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhash_rec_fit <- prep(hash_rec)\n\n# Get the transformation coefficient\ntidy(hash_rec_fit, number = 1)\n\n# Get the processed data\nbake(hash_rec_fit, hotel_tr %>% slice(1:3), contains(\"_agent_\"))\n```\n:::\n\n\n## More on recipes\n\n- Once `fit()` is called on a workflow, changing the model does not re-fit the recipe.\n\n. . .\n\n- A list of all known steps is at .\n\n. . .\n\n- Some steps can be [skipped](https://recipes.tidymodels.org/articles/Skipping.html) when using `predict()`.\n\n. . 
.\n\n- The [order](https://recipes.tidymodels.org/articles/Ordering.html) of the steps matters.\n\n\n\n", + "supporting": [ + "figures" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/angle-1.svg b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/angle-1.svg new file mode 100644 index 00000000..f6fac7cd --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/angle-1.svg @@ -0,0 +1,675 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + diff --git a/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/effect-compare-1.svg b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/effect-compare-1.svg new file mode 100644 index 00000000..4ed0f0a1 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/effect-compare-1.svg @@ -0,0 +1,192 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +50 +100 +150 +200 + + + + + + + + +50 +100 +150 +200 +ADR Sample Mean +Estimated via Effects Encoding + +num_reservations + + + + + + +300 +600 +900 + + diff --git a/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/effects-adr-1.svg b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/effects-adr-1.svg new file mode 100644 index 00000000..a963baa3 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/effects-adr-1.svg @@ -0,0 +1,294 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + 
[SVG figure sources elided: this span of the diff adds pre-rendered plot files under archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/figure-revealjs/, namely effects-adr-1.svg, effects-again-1.svg, effects-again-2.svg, effects-freq-1.svg, effects-rate-1.svg, goal-line-1.svg, rink-code-1.svg (legend: shooter_type = center, defenseman, left_wing, right_wing), roc-curve-1.svg, thresholds-1.svg (sensitivity and specificity versus event threshold), unnamed-chunk-27-1.svg (ROC AUC (validation set) for 1_no_coord, 5_zone, 6_bgl, 2_other, 4_angle, 3_effects), and zone-1.svg; only axis and legend labels survived extraction of these files.]
diff --git a/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/countdown.css
b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + 
+.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? 
currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/advanced-01-feature-engineering/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/execute-results/html.json b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/execute-results/html.json new file mode 100644 index 00000000..291efd73 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/execute-results/html.json @@ -0,0 +1,23 @@ +{ + "hash": "0e183a03b8c38cce1cf1df5e00d08e01", + "result": { + "markdown": "---\ntitle: \"2 - Tuning Hyperparameters\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n\n## Previously - Setup ![](hexes/tidymodels.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n\nreg_metrics <- metric_set(mae, rsq)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n 
sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Previously - Data Usage ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n```\n:::\n\n\n\n## Previously - Feature engineering ![](hexes/textrecipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(textrecipes)\n\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n # Defaults to 32 signed indicator columns\n step_dummy_hash(agent) %>%\n step_dummy_hash(company) %>%\n # Regular indicators for the others\n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors())\n```\n:::\n\n\n# Optimizing Models via Tuning Parameters\n\n## Tuning parameters\n\nSome model or preprocessing parameters cannot be estimated directly from the data.\n\n. . .\n\nSome examples:\n\n- Tree depth in decision trees\n- Number of neighbors in a K-nearest neighbor model\n\n# Activation function in neural networks?\n\nSigmoidal functions, ReLu, etc.\n\n::: fragment\nYes, it is a tuning parameter.\nβœ…\n:::\n\n# Number of feature hashing columns to generate?\n\n::: fragment\nYes, it is a tuning parameter.\nβœ…\n:::\n\n# Bayesian priors for model parameters?\n\n::: fragment\nHmmmm, probably not.\nThese are based on prior belief.\n❌\n:::\n\n# Covariance/correlation matrix structure in mixed models?\n\n::: fragment\nYes, but it is unlikely to affect performance.\n:::\n\n::: fragment\nIt will impact inference though.\nπŸ€”\n:::\n\n\n\n# Is the random seed a tuning parameter?\n\n::: fragment\nNope. It is not. \n❌\n:::\n\n## Optimize tuning parameters\n\n- Try different values and measure their performance.\n\n. . .\n\n- Find good values for these parameters.\n\n. . .\n\n- Once the value(s) of the parameter(s) are determined, a model can be finalized by fitting the model to the entire training set.\n\n\n## Tagging parameters for tuning ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWith tidymodels, you can mark the parameters that you want to optimize with a value of `tune()`. \n\n
\n\nThe function itself just returns... itself: \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntune()\n#> tune()\nstr(tune())\n#> language tune()\n\n# optionally add a label\ntune(\"I hope that the workshop is going well\")\n#> tune(\"I hope that the workshop is going well\")\n```\n:::\n\n\n. . . \n\nFor example...\n\n## Optimizing the hash features ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"}\n\nOur new recipe is: \n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4-5|\"}\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n
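\n\nThe labels passed to `tune()` matter here: this recipe has two `num_terms` arguments, and the labels keep them distinguishable later on (they reappear as `agent hash` and `company hash` in grids and results). As a quick check (a sketch; output not shown), the tagged parameters can be listed with:\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: list the parameters tagged for tuning in the recipe\nhash_rec %>% extract_parameter_set_dials()\n```\n:::\n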
\n\nWe will be using a tree-based model in a minute. \n\n - The other categorical predictors are left as-is.\n - That's why there is no `step_dummy()`. \n\n\n## Boosted Trees\n\nThese are popular ensemble methods that build a _sequence_ of tree models. \n\n
\n\nEach tree uses the results of the previous tree to better predict samples, especially those that have been poorly predicted. \n\n
 \n\nEach tree in the ensemble is saved, and new samples are predicted using a weighted combination of the predictions from all of the trees. \n\n
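\n\nFor intuition only, here is a minimal hand-rolled sketch of that idea for a regression outcome: each small tree is fit to the current residuals, and its predictions are added to the ensemble after being scaled by a learning rate. This is an illustration of the concept, not how lightgbm is actually implemented:\n\n::: {.cell}\n\n```{.r .cell-code}\n# Toy boosting sketch (illustration only, not lightgbm)\nlibrary(rpart)\n\nlearn_rate <- 0.1\nboost_pred <- rep(mean(mtcars$mpg), nrow(mtcars))  # start from the mean\n\nfor (i in 1:50) {\n  dat <- mtcars\n  dat$resid <- dat$mpg - boost_pred  # what the ensemble still misses\n  small_tree <- rpart(\n    resid ~ disp + hp + wt, data = dat,\n    control = rpart.control(maxdepth = 2)  # each learner is a shallow tree\n  )\n  boost_pred <- boost_pred + learn_rate * predict(small_tree, dat)\n}\n```\n:::\n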
 \n\nWe'll focus on the popular lightgbm implementation. \n\n## Boosted Tree Tuning Parameters\n\nSome _possible_ parameters (each maps to a `boost_tree()` argument; see the sketch below): \n\n* `mtry`: The number of predictors randomly sampled at each split (in $[1, ncol(x)]$ or $(0, 1]$).\n* `trees`: The number of trees ($[1, \\infty]$, but usually up to thousands).\n* `min_n`: The minimum number of samples needed to split a node further ($[1, n]$).\n* `learn_rate`: The rate at which each tree adapts from previous iterations ($(0, \\infty]$, usual maximum is 0.1).\n* `stop_iter`: The number of boosting iterations with _no improvement_ before stopping ($[1, trees]$).\n\n## Boosted Tree Tuning Parameters\n\nTo be honest, it is usually not difficult to optimize these models. \n\n
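\n\nAs a sketch, each parameter in that list could be tagged for tuning in the model specification (in practice we will only tune a subset of them below):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: tagging every parameter listed above for tuning\nboost_tree(\n  mtry = tune(), trees = tune(), min_n = tune(),\n  learn_rate = tune(), stop_iter = tune()\n) %>%\n  set_mode(\"regression\") %>%\n  set_engine(\"lightgbm\")\n```\n:::\n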
\n\nOften, there are multiple _candidate_ tuning parameter combinations that have very good results. \n\n
\n\nTo demonstrate simple concepts, we'll look at optimizing the number of trees in the ensemble (between 1 and 100) and the learning rate ($10^{-5}$ to $10^{-1}$).\n\n## Boosted Tree Tuning Parameters ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/bonsai.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"}\n\nWe'll need to load the bonsai package. This has the information needed to use lightgbm\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(bonsai)\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hash_rec, lgbm_spec)\n```\n:::\n\n\n\n\n\n## Optimize tuning parameters\n\nThe main two strategies for optimization are:\n\n. . .\n\n- **Grid search** πŸ’  which tests a pre-defined set of candidate values\n\n- **Iterative search** πŸŒ€ which suggests/estimates new values of candidate parameters to evaluate\n\n\n## Grid search\n\nA small grid of points trying to minimize the error via learning rate: \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/small_init.svg){fig-align='center' width=60%}\n:::\n:::\n\n\n\n## Grid search\n\nIn reality we would probably sample the space more densely: \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/grid_points.svg){fig-align='center' width=60%}\n:::\n:::\n\n\n\n## Iterative Search\n\nWe could start with a few points and search the space:\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](animations/anime_seq.gif){fig-align='center' width=60%}\n:::\n:::\n\n\n# Grid Search\n\n## Parameters\n\n- The tidymodels framework provides pre-defined information on tuning parameters (such as their type, range, transformations, etc).\n\n- The `extract_parameter_set_dials()` function extracts these tuning parameters and the info.\n\n::: fragment\n#### Grids\n\n- Create your grid manually or automatically.\n\n- The `grid_*()` functions can make a grid.\n:::\n\n::: notes\nMost basic (but very effective) way to tune models\n:::\n\n## Create a grid ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlgbm_wflow %>% \n extract_parameter_set_dials()\n#> Collection of 4 parameters for tuning\n#> \n#> identifier type object\n#> trees trees nparam[+]\n#> learn_rate learn_rate nparam[+]\n#> agent hash num_terms nparam[+]\n#> company hash num_terms nparam[+]\n\n# Individual functions: \ntrees()\n#> # Trees (quantitative)\n#> Range: [1, 2000]\nlearn_rate()\n#> Learning Rate (quantitative)\n#> Transformer: log-10 [1e-100, Inf]\n#> Range (transformed scale): [-10, -1]\n```\n:::\n\n\n::: fragment\nA parameter set can be updated (e.g. 
to change the ranges).\n:::\n\n## Create a grid ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"65%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(12)\ngrid <- \n lgbm_wflow %>% \n extract_parameter_set_dials() %>% \n grid_latin_hypercube(size = 25)\n\ngrid\n#> # A tibble: 25 Γ— 4\n#> trees learn_rate `agent hash` `company hash`\n#> \n#> 1 1629 0.00000440 524 1454\n#> 2 1746 0.0000000751 1009 2865\n#> 3 53 0.0000180 2313 367\n#> 4 442 0.000000445 347 460\n#> 5 1413 0.0000000208 3232 553\n#> 6 1488 0.0000578 3692 639\n#> 7 906 0.000385 602 332\n#> 8 1884 0.00000000101 1127 567\n#> 9 1812 0.0239 961 1183\n#> 10 393 0.000000117 487 1783\n#> # β„Ή 15 more rows\n```\n:::\n\n:::\n\n::: {.column width=\"35%\"}\n::: fragment\n- A *space-filling design* tends to perform better than random grids.\n- Space-filling designs are also usually more efficient than regular grids.\n:::\n:::\n:::\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n*Create a grid for our tunable workflow.*\n\n*Try creating a regular grid.*\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n## Create a regular grid ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5\"}\nset.seed(12)\ngrid <- \n lgbm_wflow %>% \n extract_parameter_set_dials() %>% \n grid_regular(levels = 4)\n\ngrid\n#> # A tibble: 256 Γ— 4\n#> trees learn_rate `agent hash` `company hash`\n#> \n#> 1 1 0.0000000001 256 256\n#> 2 667 0.0000000001 256 256\n#> 3 1333 0.0000000001 256 256\n#> 4 2000 0.0000000001 256 256\n#> 5 1 0.0000001 256 256\n#> 6 667 0.0000001 256 256\n#> 7 1333 0.0000001 256 256\n#> 8 2000 0.0000001 256 256\n#> 9 1 0.0001 256 256\n#> 10 667 0.0001 256 256\n#> # β„Ή 246 more rows\n```\n:::\n\n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n
\n\n*What advantage would a regular grid have?* \n\n\n\n## Update parameter ranges ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4-5|\"}\nlgbm_param <- \n lgbm_wflow %>% \n extract_parameter_set_dials() %>% \n update(trees = trees(c(1L, 100L)),\n learn_rate = learn_rate(c(-5, -1)))\n\nset.seed(712)\ngrid <- \n lgbm_param %>% \n grid_latin_hypercube(size = 25)\n\ngrid\n#> # A tibble: 25 Γ— 4\n#> trees learn_rate `agent hash` `company hash`\n#> \n#> 1 75 0.000312 2991 1250\n#> 2 4 0.0000337 899 3088\n#> 3 15 0.0295 520 1578\n#> 4 8 0.0997 1256 3592\n#> 5 80 0.000622 419 258\n#> 6 70 0.000474 2499 1089\n#> 7 35 0.000165 287 2376\n#> 8 64 0.00137 389 359\n#> 9 58 0.0000250 616 881\n#> 10 84 0.0639 2311 2635\n#> # β„Ή 15 more rows\n```\n:::\n\n\n\n## The results ![](hexes/ggplot2.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\" output-location='column'}\n\n```{.r .cell-code}\ngrid %>% \n ggplot(aes(trees, learn_rate)) +\n geom_point(size = 4) +\n scale_y_log10()\n```\n\n::: {.cell-output-display}\n![](figures/sfd-1.svg){fig-align='center' width=480}\n:::\n:::\n\n\nNote that the learning rates are uniform on the log-10 scale. \n\n\n# Use the `tune_*()` functions to tune models\n\n\n## Choosing tuning parameters ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"} ![](hexes/bonsai.png){.absolute top=-20 right=256 width=\"64\" height=\"74.24\"}\n\nLet's take our previous model and tune more parameters:\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2,12-13|\"}\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hash_rec, lgbm_spec)\n\n# Update the feature hash ranges (log-2 units)\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\n\n## Grid Search ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} \n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"2|\"}\nset.seed(9)\nctrl <- control_grid(save_pred = TRUE)\n\nlgbm_res <-\n lgbm_wflow %>%\n tune_grid(\n resamples = hotel_rs,\n grid = 25,\n # The options below are not required by default\n param_info = lgbm_param, \n control = ctrl,\n metrics = reg_metrics\n )\n```\n:::\n\n\n::: notes\n- `tune_grid()` is representative of tuning function syntax\n- similar to `fit_resamples()`\n:::\n\n\n\n## Grid Search ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/dials.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlgbm_res \n#> # Tuning results\n#> # 10-fold cross-validation 
using stratification \n#> # A tibble: 10 Γ— 5\n#> splits id .metrics .notes .predictions \n#> \n#> 1 Fold01 \n#> 2 Fold02 \n#> 3 Fold03 \n#> 4 Fold04 \n#> 5 Fold05 \n#> 6 Fold06 \n#> 7 Fold07 \n#> 8 Fold08 \n#> 9 Fold09 \n#> 10 Fold10 \n```\n:::\n\n\n\n## Grid results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_res)\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-1.svg){fig-align='center' width=80%}\n:::\n:::\n\n\n## Tuning results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(lgbm_res)\n#> # A tibble: 50 Γ— 11\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config \n#> \n#> 1 298 19 4.15e- 9 222 36 mae standard 53.2 10 0.427 Preprocessor01_Model1\n#> 2 298 19 4.15e- 9 222 36 rsq standard 0.811 10 0.00785 Preprocessor01_Model1\n#> 3 1394 5 5.82e- 6 28 21 mae standard 52.9 10 0.424 Preprocessor02_Model1\n#> 4 1394 5 5.82e- 6 28 21 rsq standard 0.810 10 0.00857 Preprocessor02_Model1\n#> 5 774 12 4.41e- 2 27 95 mae standard 10.5 10 0.175 Preprocessor03_Model1\n#> 6 774 12 4.41e- 2 27 95 rsq standard 0.939 10 0.00381 Preprocessor03_Model1\n#> 7 1342 7 6.84e-10 71 17 mae standard 53.2 10 0.427 Preprocessor04_Model1\n#> 8 1342 7 6.84e-10 71 17 rsq standard 0.810 10 0.00903 Preprocessor04_Model1\n#> 9 669 39 8.62e- 7 141 145 mae standard 53.2 10 0.426 Preprocessor05_Model1\n#> 10 669 39 8.62e- 7 141 145 rsq standard 0.808 10 0.00661 Preprocessor05_Model1\n#> # β„Ή 40 more rows\n```\n:::\n\n\n## Tuning results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(lgbm_res, summarize = FALSE)\n#> # A tibble: 500 Γ— 10\n#> id trees min_n learn_rate `agent hash` `company hash` .metric .estimator .estimate .config \n#> \n#> 1 Fold01 298 19 0.00000000415 222 36 mae standard 51.8 Preprocessor01_Model1\n#> 2 Fold01 298 19 0.00000000415 222 36 rsq standard 0.834 Preprocessor01_Model1\n#> 3 Fold02 298 19 0.00000000415 222 36 mae standard 52.1 Preprocessor01_Model1\n#> 4 Fold02 298 19 0.00000000415 222 36 rsq standard 0.801 Preprocessor01_Model1\n#> 5 Fold03 298 19 0.00000000415 222 36 mae standard 52.2 Preprocessor01_Model1\n#> 6 Fold03 298 19 0.00000000415 222 36 rsq standard 0.784 Preprocessor01_Model1\n#> 7 Fold04 298 19 0.00000000415 222 36 mae standard 51.7 Preprocessor01_Model1\n#> 8 Fold04 298 19 0.00000000415 222 36 rsq standard 0.828 Preprocessor01_Model1\n#> 9 Fold05 298 19 0.00000000415 222 36 mae standard 55.2 Preprocessor01_Model1\n#> 10 Fold05 298 19 0.00000000415 222 36 rsq standard 0.850 Preprocessor01_Model1\n#> # β„Ή 490 more rows\n```\n:::\n\n\n## Choose a parameter combination ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshow_best(lgbm_res, metric = \"rsq\")\n#> # A tibble: 5 Γ— 11\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config \n#> \n#> 1 1890 10 0.0159 115 174 rsq standard 0.940 10 0.00369 Preprocessor12_Model1\n#> 2 774 12 0.0441 27 95 rsq standard 0.939 10 0.00381 Preprocessor03_Model1\n#> 3 1638 36 0.0409 15 120 rsq standard 0.938 10 0.00346 Preprocessor16_Model1\n#> 4 963 23 0.00556 157 13 rsq standard 0.930 10 0.00358 Preprocessor06_Model1\n#> 5 590 5 0.00320 85 73 rsq standard 0.905 10 0.00505 Preprocessor24_Model1\n```\n:::\n\n\n## 
Choose a parameter combination ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nCreate your own tibble for final parameters or use one of the `tune::select_*()` functions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlgbm_best <- select_best(lgbm_res, metric = \"mae\")\nlgbm_best\n#> # A tibble: 1 Γ— 6\n#> trees min_n learn_rate `agent hash` `company hash` .config \n#> \n#> 1 774 12 0.0441 27 95 Preprocessor03_Model1\n```\n:::\n\n\n## Checking Calibration ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/probably.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell output-location='column'}\n\n```{.r .cell-code}\nlibrary(probably)\nlgbm_res %>%\n collect_predictions(\n parameters = lgbm_best\n ) %>%\n cal_plot_regression(\n truth = avg_price_per_room,\n estimate = .pred,\n alpha = 1 / 3\n )\n```\n\n::: {.cell-output-display}\n![](figures/lgb-cal-plot-1.svg){width=90%}\n:::\n:::\n\n\n\n## Running in parallel\n\n::: columns\n::: {.column width=\"60%\"}\n- Grid search, combined with resampling, requires fitting a lot of models!\n\n- These models don't depend on one another and can be run in parallel.\n\nWe can use a *parallel backend* to do this:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncores <- parallelly::availableCores(logical = FALSE)\ncl <- parallel::makePSOCKcluster(cores)\ndoParallel::registerDoParallel(cl)\n\n# Now call `tune_grid()`!\n\n# Shut it down with:\nforeach::registerDoSEQ()\nparallel::stopCluster(cl)\n```\n:::\n\n:::\n\n::: {.column width=\"40%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/resample-times-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n:::\n:::\n\n## Running in parallel\n\nSpeed-ups are fairly linear up to the number of physical cores (10 here).\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/parallel-speedup-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n\n:::notes\nFaceted on the expensiveness of preprocessing used.\n:::\n\n\n## Early stopping for boosted trees {.annotation}\n\nWe have directly optimized the number of trees as a tuning parameter. \n\nInstead we could \n \n - Set the number of trees to a single large number.\n - Stop adding trees when performance gets worse. \n \nThis is known as \"early stopping\" and there is a parameter for that: `stop_iter`.\n\nEarly stopping has a potential to decrease the tuning time. \n\n\n## Your turn {transition=\"slide-in\"}\n\n![](images/parsnip-flagger.jpg){.absolute top=\"0\" right=\"0\" width=\"150\" height=\"150\"}\n\n
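\n\nOne way the setup for the exercise below might look (a sketch only; depending on the engine, a validation set may also be needed so the stopping criterion can be evaluated):\n\n::: {.cell}\n\n```{.r .cell-code}\n# Sketch: fix the number of trees and tune early stopping instead\nlgbm_spec <- \n  boost_tree(trees = 2000, learn_rate = tune(), min_n = tune(),\n             stop_iter = tune()) %>% \n  set_mode(\"regression\") %>% \n  set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hash_rec, lgbm_spec)\n```\n:::\n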
\n\n\n*Set `trees = 2000` and tune the `stop_iter` parameter.* \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n10:00\n
 \n```\n:::\n:::\n\n::: {.cell}\n\n:::\n", + "supporting": [ + "figures" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file
diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/autoplot-1.svg b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/autoplot-1.svg
new file mode 100644
index 00000000..024da85d
--- /dev/null
+++ b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/autoplot-1.svg
@@ -0,0 +1,574 @@
+ [SVG figure: tuning results plotted against # Randomly Selected Predictors, # Trees, Learning Rate (log-10), and Minimal Node Size, with panels for mae and rsq; path data omitted]
diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/lgb-cal-plot-1.svg b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/lgb-cal-plot-1.svg
new file mode 100644
index 00000000..bb9462f1
--- /dev/null
+++ b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/lgb-cal-plot-1.svg
@@ -0,0 +1,3837 @@
+ [SVG figure: calibration plot of Predicted versus Observed values (0 to 400); path data omitted]
diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/parallel-speedup-1.svg b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/parallel-speedup-1.svg
new file mode 100644
index 00000000..dfcf5021
--- /dev/null
+++ b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/parallel-speedup-1.svg
@@ -0,0 +1,241 @@
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +none + + + + + + + + + + +light + + + + + + + + + + +expensive + + + + + + +5 +10 +15 +20 + + + + +5 +10 +15 +20 + + + + +5 +10 +15 +20 +5 +10 +15 +20 + + + + +Number of Workers +Speed-up + +parallel_over + + + + + + +everything +resamples + + diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/resample-times-1.svg b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/resample-times-1.svg new file mode 100644 index 00000000..e022781f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/resample-times-1.svg @@ -0,0 +1,121 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +Fold1 / worker 1 +Fold2 / worker 4 +Fold3 / worker 3 +Fold4 / worker 5 +Fold5 / worker 2 + + + + + + + + + + +0 +1 +2 +3 +4 +Elapsed Time + +operation + + + + +model +preprocess + + diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/sfd-1.svg b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/sfd-1.svg new file mode 100644 index 00000000..68ce6eb9 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/figure-revealjs/sfd-1.svg @@ -0,0 +1,102 @@ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +1e-05 +1e-04 +1e-03 +1e-02 +1e-01 + + + + + + + + + + +0 +25 +50 +75 +100 +trees +learn_rate + + diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: 
-0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = 
this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/advanced-02-tuning-hyperparameters/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/advanced-03-racing/execute-results/html.json b/archive/2023-07-nyr/_freeze/advanced-03-racing/execute-results/html.json new file mode 100644 index 00000000..dbd5fff3 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-03-racing/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "93a0d1db000d6a20b3292980b0c3be5d", + "result": { + "markdown": "---\ntitle: \"3 - Grid Search via Racing\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n## Previously - Setup ![](hexes/tidymodels.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n\nreg_metrics <- metric_set(mae, rsq)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n 
select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Previously - Data Usage ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n```\n:::\n\n\n## Previously - Boosting Model ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/bonsai.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hotel_rec, lgbm_spec)\n\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\n\n## Making Grid Search More Efficient\n\nIn the last section, we evaluated 250 models (25 candidates times 10 resamples).\n\nWe can make this go faster using parallel processing. \n\nAlso, for some models, we can _fit_ far fewer models than the number that are being evaluated. \n \n * For boosting, a model with `X` trees can often predict on candidates with less than `X` trees. \n \nBoth of these methods can lead to enormous speed-ups. \n\n\n## Model Racing \n\n[_Racing_](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C7&q=+Hoeffding+racing) is an old tool that we can use to go even faster. \n\n1. Evaluate all of the candidate models but only for a few resamples. \n1. Determine which candidates have a low probability of being selected.\n1. Eliminate poor candidates.\n1. Repeat with next resample (until no more resamples remain) \n\nThis can result in fitting a small number of models. \n\n\n## Discarding Candidates\n\nHow do we eliminate tuning parameter combinations? \n\nThere are a few methods to do so. We'll use one based on analysis of variance (ANOVA). \n\n_However_... there is typically a large difference between resamples in the results. \n\n## Resampling Results (Non-Racing)\n\n:::: {.columns}\n\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n:::\n\n\nHere are some realistic (but simulated) examples of two candidate models. \n\nAn error estimate is measured for each of 10 resamples. \n\n - The lines connect resamples. \n\nThere is usually a significant resample-to-resample effect (rank corr: 0.83). 
\n\n:::\n\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/race-data-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n:::\n\n::::\n\n\n## Are Candidates Different?\n\nOne way to evaluate these models is to do a paired t-test\n \n - or a t-test on their differences matched by resamples\n\nWith $n = 10$ resamples, the confidence interval is (0.99, 2.8), indicating that candidate number 2 has smaller error. \n\nWhat if we were to compare each model candidate to the current best at each resample? \n\nOne shows superiority when 4 resamples have been evaluated.\n\n\n## Evaluating Differences in Candidates\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/race-ci-1.svg){fig-align='center' width=70%}\n:::\n:::\n\n\n## Interim Analysis of Results\n\nOne version of racing uses a _mixed model ANOVA_ to construct one-sided confidence intervals for each candidate versus the current best. \n\nAny candidates whose bound does not include zero are discarded. [Here](https://www.tmwr.org/race_results.mp4) is an animation.\n\nThe resamples are analyzed in a random order.\n\n
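To make the matched-resample idea concrete, here is a toy version with simulated MAE values. The real interim analysis uses the mixed-model ANOVA described above with one-sided intervals, so the paired t-test below is only a simplified stand-in.

```r
# Toy illustration only: two hypothetical candidates measured on the same 10 resamples.
set.seed(101)
resample_effect <- rnorm(10, mean = 10, sd = 1)                     # large resample-to-resample shift
candidate_1     <- resample_effect + rnorm(10, mean = 2, sd = 0.5)  # poorer candidate
candidate_2     <- resample_effect + rnorm(10, mean = 0, sd = 0.5)  # current best

# Matching by resample removes the shared resample effect. If the one-sided
# lower bound on the difference is above zero, candidate 1 is credibly worse
# and would be dropped from the race.
t.test(candidate_1 - candidate_2, alternative = "greater")
```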
\n\n[Kuhn (2014)](https://arxiv.org/abs/1405.6974) has examples and simulations to show that the method works. \n\nThe [finetune](https://finetune.tidymodels.org/) package has functions `tune_race_anova()` and `tune_race_win_loss()`. \n\n\n## Racing ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/textrecipes.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/finetune.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"} ![](hexes/bonsai.png){.absolute top=-20 right=256 width=\"64\" height=\"74.24\"}\n\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"1,8|\"}\n# Let's use a larger grid\nset.seed(8945)\nlgbm_grid <- \n lgbm_param %>% \n grid_latin_hypercube(size = 50)\n\nlibrary(finetune)\n\nset.seed(9)\nlgbm_race_res <-\n lgbm_wflow %>%\n tune_race_anova(\n resamples = hotel_rs,\n grid = lgbm_grid, \n metrics = reg_metrics\n )\n```\n:::\n\n\nThe syntax and helper functions are extremely similar to those shown for `tune_grid()`. \n\n\n## Racing Results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshow_best(lgbm_race_res, metric = \"mae\")\n#> # A tibble: 2 Γ— 11\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config \n#> \n#> 1 1014 5 0.0791 35 181 mae standard 10.3 10 0.202 Preprocessor06_Model1\n#> 2 1516 7 0.0421 176 12 mae standard 10.4 10 0.200 Preprocessor42_Model1\n```\n:::\n\n\n## Racing Results ![](hexes/finetune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n:::: {.columns}\n\n::: {.column width=\"50%\"}\nOnly 170 models were fit (out of 500). \n\n`select_best()` never considers candidate models that did not get to the end of the race. \n\nThere is a helper function to see how candidate models were removed from consideration. \n\n:::\n\n::: {.column width=\"50%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nplot_race(lgbm_race_res) + \n scale_x_continuous(breaks = pretty_breaks())\n```\n\n::: {.cell-output-display}\n![](figures/plot-race-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n:::\n\n::::\n\n\n## Your turn {transition=\"slide-in\"}\n\n- *Run `tune_race_anova()` with a different seed.*\n- *Did you get the same or similar results?*\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n10:00\n
\n```\n:::\n:::\n\n::: {.cell}\n\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); 
+ -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? 
currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/advanced-03-racing/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/advanced-04-iterative/execute-results/html.json b/archive/2023-07-nyr/_freeze/advanced-04-iterative/execute-results/html.json new file mode 100644 index 00000000..80389e9d --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-04-iterative/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "8835ce5fc026395ffb3d1a28b06c4c9f", + "result": { + "markdown": "---\ntitle: \"4 - Iterative Search\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n## Previously - Setup\n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = 
factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Previously - Data Usage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <- initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n```\n:::\n\n\n## Previously - Boosting Model\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hotel_rec, lgbm_spec)\n\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\n## Iterative Search\n\nInstead of pre-defining a grid of candidate points, we can model our current results to predict what the next candidate point should be. \n\n
\n\nSuppose that we are only tuning the learning rate in our boosted tree. \n\n
\n\nWe could do something like: \n\n```r\nmae_pred <- lm(mae ~ learn_rate, data = resample_results)\n```\n\nand use this to predict and rank new learning rate candidates. \n\n\n## Iterative Search\n\nA linear model probably isn't the best choice though (more in a minute). \n\nTo illustrate the process, we resampled a large grid of learning rate values for our data to show what the relationship is between MAE and learning rate. \n\nNow suppose that we used a grid of three points in the parameter range for learning rate...\n\n\n## A Large Grid\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/grid-large-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## A Three Point Grid\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/grid-large-sampled-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n## Gaussian Processes and Optimization\n\nWe can make a \"meta-model\" with a small set of historical performance results. \n\n[Gaussian Processes](https://gaussianprocess.org/gpml/) (GP) models are a good choice to model performance. \n\n- It is a Bayesian model so we are using **Bayesian Optimization (BO)**.\n- For regression, we can assume that our data are multivariate normal. \n- We also define a _covariance_ function for the variance relationship between data points. A common one is:\n\n$$\\operatorname{cov}(\\boldsymbol{x}_i, \\boldsymbol{x}_j) = \\exp\\left(-\\frac{1}{2}|\\boldsymbol{x}_i - \\boldsymbol{x}_j|^2\\right) + \\sigma^2_{ij}$$\n\n\n:::notes\nGPs are good because \n\n- they are flexible regression models (in the sense that splines are flexible). \n- we need to get mean and variance predictions (and they are Bayesian)\n- their variability is based on spatial distances.\n\nSome people use random forests (with conformal variance estimates) or other methods but GPs are most popular.\n:::\n\n\n## Predicting Candidates\n\nThe GP model can take candidate tuning parameter combinations as inputs and make predictions for performance (e.g. MAE)\n\n - The _mean_ performance\n - The _variance_ of performance \n \nThe variance is mostly driven by spatial variability (the previous equation). \n\nThe predicted variance is zero at locations of actual data points and becomes very high when far away from any observed data. \n\n\n## Your turn {transition=\"slide-in\"}\n\n:::: {.columns}\n\n::: {.column width=\"50%\"}\n\n*Your GP makes predictions on two new candidate tuning parameters.* \n\n*We want to minimize MAE.* \n\n*Which should we choose?*\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/two-candidates-1.svg){width=100%}\n:::\n:::\n\n:::\n\n::::\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n03:00\n
\n```\n:::\n:::\n\n\n\n\n## GP Fit (ribbon is mean +/- 1SD)\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/gp-iter-0-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Choosing New Candidates\n\nThis isn't a very good fit but we can still use it.\n\nHow can we use the outputs to choose the next point to measure?\n\n
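One simple recipe, previewed here with made-up numbers, is to collapse the two predictions into a single score. This is roughly the idea behind the `conf_bound()` option used in an exercise later; it is not the main method used in this deck.

```r
# Made-up GP predictions for three candidate learning rates (illustration only).
candidates <- data.frame(
  learn_rate = c(0.001, 0.01, 0.1),
  mean_mae   = c(20.0, 11.0, 12.0),
  sd_mae     = c(0.2, 0.5, 3.0)
)

# A lower confidence bound trades the mean (exploitation) against the
# standard deviation (exploration); larger kappa favors exploration.
kappa <- 0.5
candidates$score <- candidates$mean_mae - kappa * candidates$sd_mae

# With this kappa, the very uncertain learn_rate = 0.1 candidate edges out the
# candidate with the better mean.
candidates[order(candidates$score), ]
```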
\n\n[_Acquisition functions_](https://ekamperi.github.io/machine%20learning/2021/06/11/acquisition-functions.html) take the predicted mean and variance and use them to balance: \n\n - _exploration_: new candidates should explore new areas.\n - _exploitation_: new candidates must stay near existing values. \n\nExploration focuses on the variance, exploitation is about the mean. \n\n## Acquisition Functions\n\nWe'll use an acquisition function to select a new candidate.\n\nThe most popular method appears to be _expected improvement_ ([EI](https://arxiv.org/pdf/1911.12809.pdf)) above the current best results. \n \n - Zero at existing data points. \n - The _expected_ improvement is integrated over all possible improvement (\"expected\" in the probability sense). \n\nWe would probably pick the point with the largest EI as the next point. \n\n(There are other functions beyond EI.)\n\n## Expected Improvement\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/gp-iter-0-ei-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n## Iteration\n\nOnce we pick the candidate point, we measure performance for it (e.g. resampling). \n\n
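Measuring one proposed candidate is just a resampling run with those values plugged into the workflow. A sketch using the objects defined earlier; the parameter values are invented, and `tune_bayes()` does this step for you:

```r
# Hypothetical candidate proposed by the GP (values made up for illustration).
one_candidate <- tibble(
  trees = 1500, min_n = 5, learn_rate = 0.05,
  `agent hash` = 64, `company hash` = 64
)

one_candidate_res <-
  lgbm_wflow %>%
  finalize_workflow(one_candidate) %>%   # plug the values into the tune() placeholders
  fit_resamples(resamples = hotel_rs, metrics = metric_set(mae, rsq))

collect_metrics(one_candidate_res)
```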
\n\nAnother GP is fit, EI is recomputed, and so on. \n\n
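For intuition, expected improvement for a minimization problem has a closed form in the GP's predicted mean and standard deviation. A small sketch (tune computes this internally; the function and numbers below are only for illustration):

```r
# Closed-form EI for minimization: E[max(best - Y, 0)] with Y ~ N(mean, sd^2).
expected_improvement <- function(mean_pred, sd_pred, best_so_far) {
  delta <- best_so_far - mean_pred
  z <- delta / sd_pred
  ifelse(sd_pred <= 0, pmax(delta, 0), delta * pnorm(z) + sd_pred * dnorm(z))
}

# A candidate with a worse mean but more uncertainty can have the larger EI:
expected_improvement(mean_pred = c(10.5, 11.0), sd_pred = c(0.1, 1.5), best_so_far = 10.4)
```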
\n\nWe stop when we have completed the allowed number of iterations _or_ if we don't see any improvement after a pre-set number of attempts. \n\n\n## GP Fit with four points\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/gp-iter-1-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Expected Improvement\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/gp-iter-1-ei-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## GP Evolution\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](animations/anime_gp.gif){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Expected Improvement Evolution\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](animations/anime_improvement.gif){fig-align='center' width=50%}\n:::\n:::\n\n\n## BO in tidymodels\n\nWe'll use a function called `tune_bayes()` that has very similar syntax to `tune_grid()`. \n\n
\n\nIt has an additional `initial` argument for the initial set of performance estimates and parameter combinations for the GP model. \n\n## Initial grid points\n\n`initial` can be the results of another `tune_*()` function or an integer (in which case `tune_grid()` is used under to hood to make such an initial set of results).\n \n - We'll run the optimization more than once, so let's make an initial grid of results to serve as the substrate for the BO. \n\n - I suggest at least the number of tuning parameters plus two as the initial grid for BO. \n\n## An Initial Grid\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_metrics <- metric_set(mae, rsq)\n\nset.seed(12)\ninit_res <-\n lgbm_wflow %>%\n tune_grid(\n resamples = hotel_rs,\n grid = nrow(lgbm_param) + 2,\n param_info = lgbm_param,\n metrics = reg_metrics\n )\n\nshow_best(init_res, metric = \"mae\")\n#> # A tibble: 5 Γ— 11\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config \n#> \n#> 1 390 10 0.0139 13 62 mae standard 11.9 10 0.208 Preprocessor1_Model1\n#> 2 718 31 0.00112 72 25 mae standard 29.1 10 0.325 Preprocessor4_Model1\n#> 3 1236 22 0.0000261 11 17 mae standard 51.8 10 0.416 Preprocessor7_Model1\n#> 4 1044 25 0.00000832 34 12 mae standard 52.8 10 0.424 Preprocessor5_Model1\n#> 5 1599 7 0.0000000402 254 179 mae standard 53.2 10 0.427 Preprocessor6_Model1\n```\n:::\n\n\n## BO using tidymodels\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4,6-8|\"}\nset.seed(15)\nlgbm_bayes_res <-\n lgbm_wflow %>%\n tune_bayes(\n resamples = hotel_rs,\n initial = init_res, # <- initial results\n iter = 20,\n param_info = lgbm_param,\n metrics = reg_metrics\n )\n\nshow_best(lgbm_bayes_res, metric = \"mae\")\n#> # A tibble: 5 Γ— 12\n#> trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean n std_err .config .iter\n#> \n#> 1 1665 2 0.0593 12 59 mae standard 10.1 10 0.173 Iter13 13\n#> 2 1179 2 0.0552 161 121 mae standard 10.2 10 0.147 Iter7 7\n#> 3 1609 6 0.0592 186 192 mae standard 10.2 10 0.195 Iter17 17\n#> 4 1352 6 0.0799 217 46 mae standard 10.3 10 0.211 Iter4 4\n#> 5 1647 4 0.0819 12 240 mae standard 10.3 10 0.198 Iter20 20\n```\n:::\n\n\n\n## Plotting BO Results\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\")\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-marginals-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Plotting BO Results\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\", type = \"parameters\")\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-param-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Plotting BO Results\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\", type = \"performance\")\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-perf-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## ENHANCE\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\", type = \"performance\") +\n ylim(c(9.5, 14))\n```\n\n::: {.cell-output-display}\n![](figures/autoplot-perf-zoomed-1.svg){fig-align='center' width=50%}\n:::\n:::\n\n\n\n## Your turn {transition=\"slide-in\"}\n\n*Let's try a different acquisition function: `conf_bound(kappa)`.*\n\n*We'll use the `objective` argument to set it.*\n\n*Choose your own `kappa` value:*\n\n - *Larger values will explore the space more.* \n - *\"Large\" values are usually 
less than one.*\n\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n10:00\n
\n```\n:::\n:::\n\n\n## Notes\n\n- Stopping `tune_bayes()` will return the current results. \n\n- Parallel processing can still be used to more efficiently measure each candidate point. \n\n- There are [a lot of other iterative methods](https://github.com/topepo/Optimization-Methods-for-Tuning-Predictive-Models) that you can use. \n\n- The finetune package also has functions for [simulated annealing](https://www.tmwr.org/iterative-search.html#simulated-annealing) search. \n\n## Finalizing the Model\n\nLet's say that we've tried a lot of different models and we like our lightgbm model the most. \n\nWhat do we do now? \n\n * Finalize the workflow by choosing the values for the tuning parameters. \n * Fit the model on the entire training set. \n * Verify performance using the test set. \n * Document and publish the model(?)\n \n## Locking Down the Tuning Parameters\n\nWe can take the results of the Bayesian optimization and accept the best results: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbest_param <- select_best(lgbm_bayes_res, metric = \"mae\")\nfinal_wflow <- \n lgbm_wflow %>% \n finalize_workflow(best_param)\nfinal_wflow\n#> ══ Workflow ══════════════════════════════════════════════════════════\n#> Preprocessor: Recipe\n#> Model: boost_tree()\n#> \n#> ── Preprocessor ──────────────────────────────────────────────────────\n#> 4 Recipe Steps\n#> \n#> β€’ step_YeoJohnson()\n#> β€’ step_dummy_hash()\n#> β€’ step_dummy_hash()\n#> β€’ step_zv()\n#> \n#> ── Model ─────────────────────────────────────────────────────────────\n#> Boosted Tree Model Specification (regression)\n#> \n#> Main Arguments:\n#> trees = 1665\n#> min_n = 2\n#> learn_rate = 0.0592557571004946\n#> \n#> Computational engine: lightgbm\n```\n:::\n\n\n## The Final Fit\n\nWe can use individual functions: \n\n```r\nfinal_fit <- final_wflow %>% fit(data = hotel_tr)\n\n# then predict() or augment() \n# then compute metrics\n```\n\n
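Filling in those two comments, the manual route might look like this (a sketch; `final_fit`, `hotel_te`, and `reg_metrics` are the objects created above):

```r
# Sketch only: predict on the test set, then compute the same metric set.
final_fit %>% 
  augment(new_data = hotel_te) %>%                             # adds a .pred column
  reg_metrics(truth = avg_price_per_room, estimate = .pred)
```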
\n\nRemember that there is also a convenience function to do all of this: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(3893)\nfinal_res <- final_wflow %>% last_fit(hotel_split, metrics = reg_metrics)\nfinal_res\n#> # Resampling results\n#> # Manual resampling \n#> # A tibble: 1 Γ— 6\n#> splits id .metrics .notes .predictions .workflow \n#> \n#> 1 train/test split \n```\n:::\n\n\n## Test Set Results\n\n:::: {.columns}\n\n::: {.column width=\"65%\"}\n\n::: {.cell}\n\n```{.r .cell-code}\nfinal_res %>% \n collect_predictions() %>% \n cal_plot_regression(\n truth = avg_price_per_room, \n estimate = .pred, \n alpha = 1 / 4)\n```\n:::\n\n\nTest set performance: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinal_res %>% collect_metrics()\n#> # A tibble: 2 Γ— 4\n#> .metric .estimator .estimate .config \n#> \n#> 1 mae standard 10.5 Preprocessor1_Model1\n#> 2 rsq standard 0.937 Preprocessor1_Model1\n```\n:::\n\n:::\n\n::: {.column width=\"35%\"}\n\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/test-cal-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n:::\n\n::::\n\n\n::: {.cell}\n\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { + background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + 
background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if 
(isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? 
(this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = 
opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? 
new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/advanced-04-iterative/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/annotations/execute-results/html.json b/archive/2023-07-nyr/_freeze/annotations/execute-results/html.json new file mode 100644 index 00000000..92a22442 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/annotations/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "34abff9197cbe142abc26d11f45a3f71", + "result": { + "markdown": "---\ntitle: \"Annotations\"\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n
\n\n# 01 - Introduction\n\n## πŸ‘€\n\nThis page contains _annotations_ for selected slides. \n\nThere's a lot that we want to tell you. We don't want people to have to frantically scribble down things that we say that are not on the slides. \n\nWe've added sections to this document with longer explanations and links to other resources. \n\n
\n\n# 02 - Data Budget\n\n## The initial split\n\nWhat does `set.seed()` do? \n\nWe’ll use pseudo-random numbers (PRN) to partition the data into training and testing. PRN are numbers that emulate truly random numbers (but really are not truly random). \n\nThink of PRN as a box that takes a starting value (the \"seed\") that produces random numbers using that starting value as an input into its process. \n\nIf we know a seed value, we can reproduce our \"random\" numbers. To use a different set of random numbers, choose a different seed value. \n\nFor example: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(1)\nrunif(3)\n#> [1] 0.2655087 0.3721239 0.5728534\n\n# Get a new set of random numbers:\nset.seed(2)\nrunif(3)\n#> [1] 0.1848823 0.7023740 0.5733263\n\n# We can reproduce the old ones with the same seed\nset.seed(1)\nrunif(3)\n#> [1] 0.2655087 0.3721239 0.5728534\n```\n:::\n\n\nIf we _don’t_ set the seed, R uses the clock time and the process ID to create a seed. This isn’t reproducible. \n\nSince we want our code to be reproducible, we set the seeds before random numbers are used. \n\nIn theory, you can set the seed once at the start of a script. However, if we do interactive data analysis, we might unwittingly use random numbers while coding. In that case, the stream is not the same and we don’t get reproducible results. \n\nThe value of the seed is an integer and really has no meaning. Max has a script to generate random integers to use as seeds to \"spread the randomness around\". It is basically:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncat(paste0(\"set.seed(\", sample.int(10000, 5), \")\", collapse = \"\\n\"))\n#> set.seed(9725)\n#> set.seed(8462)\n#> set.seed(4050)\n#> set.seed(8789)\n#> set.seed(1301)\n```\n:::\n\n\n
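\n\nAs a quick illustration of why we set the seed right before random numbers are used (a minimal sketch with the built-in `mtcars` data; the object names are just for illustration): the same seed immediately before `initial_split()` reproduces the exact same partition.\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\nset.seed(123)\nsplit_1 <- initial_split(mtcars)\n\nset.seed(123)\nsplit_2 <- initial_split(mtcars)\n\n# Same seed right before the split, so the two training sets are identical\nidentical(training(split_1), training(split_2))\n```\n:::\n\n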
\n\n# 03 - What Makes A Model?\n\n## What is wrong with this? \n\nIf we treat the preprocessing as a separate task, it raises the risk that we might accidentally overfit to the data at hand. \n\nFor example, someone might estimate something from the entire data set (such as the principal components) and treat that quantity as if it were known (and not estimated). Depending on what was done with the data, the consequences could be:\n\n* Your performance metrics are slightly-to-moderately optimistic (e.g. you might think your accuracy is 85% when it is actually 75%)\n* A consequential component of the analysis is not right and the model just doesn’t work. \n\nThe big issue here is that you won’t be able to figure this out until you get a new piece of data, such as the test set. \n\nA really good example of this is in [β€˜Selection bias in gene extraction on the basis of microarray gene-expression data’](https://pubmed.ncbi.nlm.nih.gov/11983868/). The authors re-analyze a previous publication and show that the original researchers did not include feature selection in the workflow. Because of that, their performance statistics were extremely optimistic. In one case, they could do the original analysis on complete noise and still achieve zero errors. \n\nGenerally speaking, this problem is referred to as [data leakage](https://en.wikipedia.org/wiki/Leakage_(machine_learning)). Some other references: \n\n * [Overfitting to Predictors and External Validation](https://bookdown.org/max/FES/selection-overfitting.html)\n * [Are We Learning Yet? A Meta Review of Evaluation Failures Across Machine Learning](https://datasets-benchmarks-proceedings.neurips.cc/paper/2021/hash/757b505cfd34c64c85ca5b5690ee5293-Abstract-round2.html)\n * [Navigating the pitfalls of applying machine learning in genomics](https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=Navigating+the+pitfalls+of+applying+machine+learning+in+genomics&btnG=)\n * [A review of feature selection techniques in bioinformatics](https://academic.oup.com/bioinformatics/article/23/19/2507/185254)\n * [On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation](https://www.jmlr.org/papers/volume11/cawley10a/cawley10a.pdf)\n\n
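\n\nIn tidymodels, the usual way to avoid this kind of leakage is to keep estimation steps inside a recipe and bundle the recipe with the model in a workflow, so that the preprocessing is re-estimated within each resample rather than once on all of the data. A minimal sketch of that pattern (using `mtcars` instead of the workshop data; the object names are just for illustration):\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\n# Estimation steps (normalization, PCA) live inside the recipe ...\nrec <- \n recipe(mpg ~ ., data = mtcars) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_pca(all_numeric_predictors(), num_comp = 3)\n\nwflow <- workflow(rec, linear_reg())\n\n# ... so they are re-estimated on each analysis set during resampling\nset.seed(1)\nfit_resamples(wflow, vfold_cv(mtcars, v = 5)) %>% \n collect_metrics()\n```\n:::\n\n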
\n\n# 04 - Evaluating Models\n\n## Where are the fitted models?\n\nThe primary purpose of resampling is to estimate model performance. The models are almost never needed again. \n\nAlso, if the data set is large, the model object may require a lot of memory to save, so we don't keep them by default. \n\nFor more advanced use cases, you can extract and save them. See:\n\n * \n * (an example)\n\n\n## Validation set\n\nThe upcoming version of the rsample package (1.2.0) will have a new set of functions specific to validation sets. They will allow you to make an initial _three-way split_ and still use a validation set with the tune package. \n\n
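\n\nGoing back to the extraction point above, here is a rough sketch of that pattern (a minimal `mtcars` example, not the linked one): pass an `extract` function through the control object and the fitted model from each resample is kept in an `.extracts` list column.\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n\n# Keep the parsnip fit from each resample\nctrl <- control_resamples(extract = function(x) extract_fit_parsnip(x))\n\nset.seed(1)\nlm_res <- fit_resamples(\n workflow(mpg ~ ., linear_reg()),\n resamples = vfold_cv(mtcars, v = 5),\n control = ctrl\n)\n\n# One element per resample; each holds the extracted fit\nlm_res$.extracts[[1]]\n```\n:::\n\n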
\n\n# 06 - Tuning Hyperparameters\n\n## Update parameter ranges\n\nIn about 90% of the cases, the dials function that you use to update the parameter range has the same name as the argument. For example, if you were to update the `mtry` parameter in a random forest model, the code would look like\n\n\n::: {.cell}\n\n```{.r .cell-code}\nparameter_object %>% \n update(mtry = mtry(c(1, 100)))\n```\n:::\n\n\nThere are some cases where the parameter function, or its associated values, are different from the argument name. \n\nFor example, with `step_spline_natural()`, we might want to tune the `deg_free` argument (the degrees of freedom of the spline function). In this case, the argument name is `deg_free` but we update it with `spline_degree()`. \n\n`deg_free` represents the general concept of degrees of freedom and could be associated with many different things. For example, if we ever had an argument that was the number of degrees of freedom for a $t$ distribution, we would call that argument `deg_free`. \n\nFor splines, we probably want a wider range for the degrees of freedom. We made a specialized function called `spline_degree()` to be used in these cases. \n\nHow can you tell when this happens? There is a helper function called `tunable()` that gives information on how we make the default ranges for parameters. There is a column in these objects named `call_info`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nns_tunable <- \n recipe(mpg ~ ., data = mtcars) %>% \n step_spline_natural(disp, deg_free = tune()) %>% \n tunable()\n\nns_tunable\n#> # A tibble: 1 Γ— 5\n#> name call_info source component component_id \n#> <chr> <list> <chr> <chr> <chr> \n#> 1 deg_free <named list [3]> recipe step_spline_natural spline_natural_P1Tjg\nns_tunable$call_info\n#> [[1]]\n#> [[1]]$pkg\n#> [1] \"dials\"\n#> \n#> [[1]]$fun\n#> [1] \"spline_degree\"\n#> \n#> [[1]]$range\n#> [1] 2 15\n```\n:::\n\n\n\n## Early stopping for boosted trees\n\nWhen deciding on the number of boosting iterations, there are two main strategies:\n\n * Directly tune it (`trees = tune()`)\n \n * Set it to one value and tune the number of early stopping iterations (`trees = 500`, `stop_iter = tune()`).\n\nEarly stopping is when we monitor the performance of the model. If the model doesn't make any improvements for `stop_iter` iterations, training stops. \n\nHere's an example where, after eleven iterations, performance starts to get worse. \n\n\n\n::: {.cell}\n::: {.cell-output-display}\n![](figures/early-stopping-1.svg){width=672}\n:::\n:::\n\n\nThis is likely due to over-fitting, so we stop the model at eleven boosting iterations. \n\nEarly stopping usually has good results and takes far less time. \n\nWe _could_ use an engine argument called `validation` here. That's not an argument to any function in the lightgbm package. \n\nbonsai has its own wrapper around `lightgbm::lgb.train()`, called `bonsai::train_lightgbm()`. We use that here and it has a `validation` argument.\n\nHow would you know that? There are a few different ways:\n\n * Look at the documentation in `?boost_tree` and click on the `lightgbm` entry in the engine list. \n * Check out the pkgdown reference website \n * Run the `translate()` function on the parsnip specification object. \n\nThe first two options are best since they tell you a lot more about the particularities of each model engine (there are a lot for lightgbm). \n\n
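\n\nTo make the early stopping discussion concrete, here is a minimal specification sketch (not code from the slides; the specific values are arbitrary): fix `trees`, tune `stop_iter`, and pass `validation` as an engine argument so that `bonsai::train_lightgbm()` holds out a fraction of the training data to monitor.\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(bonsai)\n\nlgbm_spec <- \n boost_tree(trees = 500, stop_iter = tune(), learn_rate = 0.01) %>% \n set_engine(\"lightgbm\", validation = 0.1) %>% \n set_mode(\"regression\")\n\n# translate() shows how these arguments map to bonsai::train_lightgbm()\ntranslate(lgbm_spec)\n```\n:::\n\n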
\n\n# Extras - Effect Encodings\n\n## Per-agent statistics\n\nThe effect encoding method essentially takes the effect of a variable, like agent, and makes a data column for that effect. In our example, the effect of the agent on the ADR is quantified by a model and then added as a data column to be used in the model. \n\nSuppose agent Max has a single reservation in the data and it had an ADR of €200. If we used a naive estimate for Max’s effect, the model is being told that Max should always produce an effect of €200. That’s a very poor estimate since it is from a single data point. \n\nContrast this with seasoned agent Davis, who has taken 250 reservations with an average ADR of €100. Davis’s mean is more predictive because it is estimated with better data (i.e., more total reservations). \n\nPartial pooling leverages the entire data set and can borrow strength from all of the agents. It is a common tool in Bayesian estimation and non-Bayesian mixed models. If an agent’s data is of good quality, the partial pooling effect estimate is closer to the raw mean. Max’s data is not great, so his estimate is \"shrunk\" towards the overall average. Since there is so little known about Max’s reservation history, this is a better effect estimate (until more data is available for him). \n\nThe Stan documentation has a pretty good vignette on this: \n\nAlso, _Bayes Rules!_ has a nice section on this: \n\nSince this example has a numeric outcome, partial pooling is very similar to the James–Stein estimator: \n\n## Agent effects\n\nEffect encoding might result in a somewhat circular argument: the column is more likely to be important to the model since it is the output of a separate model. The risk here is that we might over-fit the effect to the data. For this reason, it is important to verify that we aren’t overfitting by checking with resampling (or a validation set). \n\nPartial pooling somewhat lowers the risk of overfitting since it tends to correct for agents with small sample sizes. It can’t correct for improper data usage or data leakage though. \n\n\", + \"supporting\": [ + \"figures\" + ], + \"filters\": [ + \"rmarkdown/pagebreak.lua\" + ], + \"includes\": {}, + \"engineDependencies\": {}, + \"preserve\": {}, + \"postProcess\": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/annotations/figure-html/early-stopping-1.svg b/archive/2023-07-nyr/_freeze/annotations/figure-html/early-stopping-1.svg new file mode 100644 index 00000000..30bc5e3c --- /dev/null +++ b/archive/2023-07-nyr/_freeze/annotations/figure-html/early-stopping-1.svg @@ -0,0 +1,82 @@ + [SVG line plot: ROC AUC (0.5 to 0.8) on the y axis versus boosting iterations (4 to 12) on the x axis] diff --git a/archive/2023-07-nyr/_freeze/classwork/01-classwork/execute-results/html.json b/archive/2023-07-nyr/_freeze/classwork/01-classwork/execute-results/html.json new file mode 100644 index 00000000..f50e9bd5 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/classwork/01-classwork/execute-results/html.json @@ -0,0 +1,14 @@ +{ + \"hash\": \"60a84d8a786ce8a81c3001fe50e93128\", + \"result\": { + \"markdown\": \"---\ntitle: \"1 - Introduction - Classwork\"\nsubtitle: \"Machine learning with tidymodels\"\n---\n\n\nWe recommend restarting R between each slide deck!\n\n\n::: {.cell}\n\n:::\n\n\n## Your turn\n\nHow are statistics and machine learning related?\n\nHow are they similar? 
Different?\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/classwork/02-classwork/execute-results/html.json b/archive/2023-07-nyr/_freeze/classwork/02-classwork/execute-results/html.json new file mode 100644 index 00000000..92e31940 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/classwork/02-classwork/execute-results/html.json @@ -0,0 +1,14 @@ +{ + "hash": "b68ec3789c6815fd7121322033f4366c", + "result": { + "markdown": "---\ntitle: \"2 - Your data budget - Classwork\"\nsubtitle: \"Machine learning with tidymodels\"\neditor_options: \n chunk_output_type: console\n---\n\n\nWe recommend restarting R between each slide deck!\n\n## Data on taxi trips in Chicago in 2022\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nβœ” broom 1.0.5 βœ” recipes 1.0.6 \nβœ” dials 1.2.0 βœ” rsample 1.1.1.9000\nβœ” dplyr 1.1.2 βœ” tibble 3.2.1 \nβœ” ggplot2 3.4.2 βœ” tidyr 1.3.0 \nβœ” infer 1.0.4 βœ” tune 1.1.1.9001\nβœ” modeldata 1.1.0 βœ” workflows 1.1.3 \nβœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \nβœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──\nβœ– purrr::discard() masks scales::discard()\nβœ– dplyr::filter() masks stats::filter()\nβœ– dplyr::lag() masks stats::lag()\nβœ– recipes::step() masks stats::step()\nβ€’ Learn how to get started at https://www.tidymodels.org/start/\n```\n:::\n\n```{.r .cell-code}\nlibrary(modeldatatoo)\n\ntaxi <- data_taxi()\n\n# Slightly modify the original data for the purposes of this workshop\ntaxi <- taxi %>%\n mutate(month = factor(month, levels = c(\"Jan\", \"Feb\", \"Mar\", \"Apr\"))) %>% \n select(-c(id, duration, fare, tolls, extras, total_cost, payment_type)) %>% \n drop_na()\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 8,807 Γ— 7\n tip distance company local dow month hour\n \n 1 yes 1.24 Sun Taxi no Thu Feb 13\n 2 no 5.39 Flash Cab no Sat Mar 12\n 3 yes 3.01 City Service no Wed Feb 17\n 4 no 18.4 Sun Taxi no Sat Apr 6\n 5 yes 1.76 Sun Taxi no Sun Jan 15\n 6 yes 13.6 Sun Taxi no Mon Feb 17\n 7 yes 3.71 City Service no Mon Mar 21\n 8 yes 4.8 other no Tue Mar 9\n 9 yes 18.0 City Service no Fri Jan 19\n10 no 17.5 other yes Thu Apr 12\n# β„Ή 8,797 more rows\n```\n:::\n:::\n\n\n## Your turn\n\nWhen is a good time to split your data?\n\n## Data splitting and spending\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\n\ntaxi_split <- initial_split(taxi)\ntaxi_split\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n\n<6605/2202/8807>\n```\n:::\n:::\n\n\nExtract the training and testing sets\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_train <- training(taxi_split)\ntaxi_test <- testing(taxi_split)\n```\n:::\n\n\n## Your turn\n\nSplit your data so 20% is held out for the test set.\n\nTry out different values in `set.seed()` to see how the results change.\n\nHint: Which argument in `initial_split()` handles the proportion split into training vs testing?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Your turn\n\nExplore the `taxi_train` data on your 
own!\n\n- What's the distribution of the outcome, tip?\n- What's the distribution of numeric variables like distance?\n- How does tip differ across the categorical variables?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Stratification\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\n\ntaxi_split <- initial_split(taxi, prop = 0.8, strata = tip)\ntaxi_train <- training(taxi_split)\ntaxi_test <- testing(taxi_split)\n```\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/classwork/03-classwork/execute-results/html.json b/archive/2023-07-nyr/_freeze/classwork/03-classwork/execute-results/html.json new file mode 100644 index 00000000..34494659 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/classwork/03-classwork/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "8072aef5e9db75391ba542a1336e312d", + "result": { + "markdown": "---\ntitle: \"3 - What makes a model? - Classwork\"\nsubtitle: \"Machine learning with tidymodels\"\neditor_options: \n chunk_output_type: console\n---\n\n\nWe recommend restarting R between each slide deck!\n\n## Setup\n\nSetup from deck 2\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nβœ” broom 1.0.5 βœ” recipes 1.0.6 \nβœ” dials 1.2.0 βœ” rsample 1.1.1.9000\nβœ” dplyr 1.1.2 βœ” tibble 3.2.1 \nβœ” ggplot2 3.4.2 βœ” tidyr 1.3.0 \nβœ” infer 1.0.4 βœ” tune 1.1.1.9001\nβœ” modeldata 1.1.0 βœ” workflows 1.1.3 \nβœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \nβœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──\nβœ– purrr::discard() masks scales::discard()\nβœ– dplyr::filter() masks stats::filter()\nβœ– dplyr::lag() masks stats::lag()\nβœ– recipes::step() masks stats::step()\nβ€’ Learn how to get started at https://www.tidymodels.org/start/\n```\n:::\n\n```{.r .cell-code}\nlibrary(modeldatatoo)\n\ntaxi <- data_taxi()\n\ntaxi <- taxi %>%\n mutate(month = factor(month, levels = c(\"Jan\", \"Feb\", \"Mar\", \"Apr\"))) %>% \n select(-c(id, duration, fare, tolls, extras, total_cost, payment_type)) %>% \n drop_na()\n\nset.seed(123)\n\ntaxi_split <- initial_split(taxi, prop = 0.8, strata = tip)\ntaxi_train <- training(taxi_split)\ntaxi_test <- testing(taxi_split)\n```\n:::\n\n\n## Your turn\n\nHow do you fit a linear model in R?\n\nHow many different ways can you think of?\n\nDiscuss with your neighbor!\n\n## To specify a model\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Model\nlinear_reg()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nLinear Regression Model Specification (regression)\n\nComputational engine: lm \n```\n:::\n\n```{.r .cell-code}\n# Engine\nlinear_reg() %>%\n set_engine(\"glmnet\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nLinear Regression Model Specification (regression)\n\nComputational engine: glmnet \n```\n:::\n\n```{.r .cell-code}\n# Mode - Some models have a default mode, others don't\ndecision_tree() %>% \n set_mode(\"regression\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nDecision Tree Model Specification (regression)\n\nComputational engine: rpart \n```\n:::\n:::\n\n\n## Your 
turn\n\nEdit the chunk below to use a different model!\n\nAll available models are listed at \n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <- decision_tree() %>% \n set_mode(\"classification\")\n\ntree_spec\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nDecision Tree Model Specification (classification)\n\nComputational engine: rpart \n```\n:::\n:::\n\n\n## A model workflow\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n```\n:::\n\n\nFit with parsnip:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec %>% \n fit(tip ~ ., data = taxi_train) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\nparsnip model object\n\nn= 7045 \n\nnode), split, n, loss, yval, (yprob)\n * denotes terminal node\n\n 1) root 7045 2069 yes (0.70631654 0.29368346) \n 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n:::\n\n\nFit with a workflow:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nworkflow() %>%\n add_formula(tip ~ .) 
%>%\n add_model(tree_spec) %>%\n fit(data = taxi_train) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n══ Workflow [trained] ══════════════════════════════════════════════════════════\nPreprocessor: Formula\nModel: decision_tree()\n\n── Preprocessor ────────────────────────────────────────────────────────────────\ntip ~ .\n\n── Model ───────────────────────────────────────────────────────────────────────\nn= 7045 \n\nnode), split, n, loss, yval, (yprob)\n * denotes terminal node\n\n 1) root 7045 2069 yes (0.70631654 0.29368346) \n 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n:::\n\n\n\"Shortcut\" by specifying the preprocessor and model spec directly in the `workflow()` call:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nworkflow(tip ~ ., tree_spec) %>% \n fit(data = taxi_train) \n```\n\n::: {.cell-output .cell-output-stdout}\n```\n══ Workflow [trained] ══════════════════════════════════════════════════════════\nPreprocessor: Formula\nModel: decision_tree()\n\n── Preprocessor ────────────────────────────────────────────────────────────────\ntip ~ .\n\n── Model ───────────────────────────────────────────────────────────────────────\nn= 7045 \n\nnode), split, n, loss, yval, (yprob)\n * denotes terminal node\n\n 1) root 7045 2069 yes (0.70631654 0.29368346) \n 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) *\n 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) *\n 23) company=City Service,other 616 278 no (0.45129870 0.54870130) \n 46) distance< 7.205 178 59 yes (0.66853933 0.33146067) *\n 47) distance>=7.205 438 159 no (0.36301370 0.63698630) *\n 3) company=Flash Cab,Taxi Affiliation Services 2717 1325 yes (0.51232978 0.48767022) \n 6) distance< 3.235 1331 391 yes (0.70623591 0.29376409) *\n 7) distance>=3.235 1386 452 no (0.32611833 0.67388167) \n 14) distance>=12.39 344 90 yes (0.73837209 0.26162791) *\n 15) distance< 12.39 1042 198 no (0.19001919 0.80998081) *\n```\n:::\n:::\n\n\n## Your turn\n\nEdit the chunk below to make a workflow with your own model of choice!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_spec <-\n decision_tree() %>% \n set_mode(\"classification\")\n\ntree_wflow <- workflow() %>%\n add_formula(tip ~ .) 
%>%\n add_model(tree_spec)\n\ntree_wflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n══ Workflow ════════════════════════════════════════════════════════════════════\nPreprocessor: Formula\nModel: decision_tree()\n\n── Preprocessor ────────────────────────────────────────────────────────────────\ntip ~ .\n\n── Model ───────────────────────────────────────────────────────────────────────\nDecision Tree Model Specification (classification)\n\nComputational engine: rpart \n```\n:::\n:::\n\n\n## Predict with your model\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntree_fit <-\n workflow(tip ~ ., tree_spec) %>% \n fit(data = taxi_train) \n```\n:::\n\n\n## Your turn\n\nWhat do you get from running the following code? What do you notice about the structure of the result?\n\n\n::: {.cell}\n\n```{.r .cell-code}\npredict(tree_fit, new_data = taxi_test)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,762 Γ— 1\n .pred_class\n \n 1 yes \n 2 yes \n 3 yes \n 4 yes \n 5 yes \n 6 yes \n 7 yes \n 8 yes \n 9 yes \n10 yes \n# β„Ή 1,752 more rows\n```\n:::\n:::\n\n\n## Your turn\n\nWhat do you get from running the following code? How is `augment()` different from `predict()`?\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(tree_fit, new_data = taxi_test)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,762 Γ— 10\n tip distance company local dow month hour .pred_class .pred_yes .pred_no\n \n 1 no 17.5 other yes Thu Apr 12 yes 0.924 0.0758\n 2 yes 2.26 other no Fri Apr 16 yes 0.893 0.107 \n 3 yes 2.71 City S… no Thu Apr 8 yes 0.893 0.107 \n 4 yes 18.6 other no Mon Feb 17 yes 0.924 0.0758\n 5 no 1.02 City S… no Wed Mar 11 yes 0.893 0.107 \n 6 yes 2 other no Sat Feb 20 yes 0.893 0.107 \n 7 yes 6.81 City S… no Fri Feb 23 yes 0.669 0.331 \n 8 yes 2.28 Sun Ta… no Thu Apr 22 yes 0.893 0.107 \n 9 yes 0.93 Sun Ta… no Fri Mar 18 yes 0.893 0.107 \n10 yes 18.8 City S… no Tue Feb 7 yes 0.924 0.0758\n# β„Ή 1,752 more rows\n```\n:::\n:::\n\n\n## Understand your model\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(rpart.plot)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nLoading required package: rpart\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n\nAttaching package: 'rpart'\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThe following object is masked from 'package:dials':\n\n prune\n```\n:::\n\n```{.r .cell-code}\ntree_fit %>%\n extract_fit_engine() %>%\n rpart.plot(roundint = FALSE)\n```\n\n::: {.cell-output-display}\n![](03-classwork_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n:::\n\n\n## Your turn\n\nTry extracting the model engine object from your fitted workflow!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n### Your code here\n```\n:::\n\n\nWhat kind of object is it? 
What can you do with it?\n\n⚠️ Never `predict()` with any extracted components!\n\nYou can also read the documentation for object extraction:\nhttps://workflows.tidymodels.org/reference/extract-workflow.html\n\n## Your turn\n\nExplore how you might deploy your `tree_fit` model using vetiver.\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(vetiver)\nlibrary(plumber)\n\n# Create a vetiver model object\nv <- vetiver_model(tree_fit, \"taxi_tips\")\nv\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Create a predictable Plumber API\npr <- pr() %>%\n vetiver_api(v)\n\npr\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\n# Run the API server in a new window\npr_run(pr)\n```\n:::\n", + "supporting": [ + "03-classwork_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/classwork/03-classwork/figure-html/unnamed-chunk-10-1.png b/archive/2023-07-nyr/_freeze/classwork/03-classwork/figure-html/unnamed-chunk-10-1.png new file mode 100644 index 00000000..6a6433db Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/03-classwork/figure-html/unnamed-chunk-10-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/04-classwork/execute-results/html.json b/archive/2023-07-nyr/_freeze/classwork/04-classwork/execute-results/html.json new file mode 100644 index 00000000..bb01766c --- /dev/null +++ b/archive/2023-07-nyr/_freeze/classwork/04-classwork/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "78f18430960f94e2b17b90d193e77cf2", + "result": { + "markdown": "---\ntitle: \"4 - Evaluating models - Classwork\"\nsubtitle: \"Machine learning with tidymodels\"\neditor_options: \n chunk_output_type: console\n---\n\n\nWe recommend restarting R between each slide deck!\n\n## Setup\n\nSetup from deck 3\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nβœ” broom 1.0.5 βœ” recipes 1.0.6 \nβœ” dials 1.2.0 βœ” rsample 1.1.1.9000\nβœ” dplyr 1.1.2 βœ” tibble 3.2.1 \nβœ” ggplot2 3.4.2 βœ” tidyr 1.3.0 \nβœ” infer 1.0.4 βœ” tune 1.1.1.9001\nβœ” modeldata 1.1.0 βœ” workflows 1.1.3 \nβœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \nβœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──\nβœ– purrr::discard() masks scales::discard()\nβœ– dplyr::filter() masks stats::filter()\nβœ– dplyr::lag() masks stats::lag()\nβœ– recipes::step() masks stats::step()\nβ€’ Use tidymodels_prefer() to resolve common conflicts.\n```\n:::\n\n```{.r .cell-code}\nlibrary(modeldatatoo)\n\ntaxi <- data_taxi()\n\ntaxi <- taxi %>%\n mutate(month = factor(month, levels = c(\"Jan\", \"Feb\", \"Mar\", \"Apr\"))) %>% \n select(-c(id, duration, fare, tolls, extras, total_cost, payment_type)) %>% \n drop_na()\n\nset.seed(123)\ntaxi_split <- initial_split(taxi, prop = 0.8, strata = tip)\ntaxi_train <- training(taxi_split)\ntaxi_test <- testing(taxi_split)\n\ntree_spec <- decision_tree(cost_complexity = 0.0001, mode = \"classification\")\ntaxi_wflow <- workflow(tip ~ ., tree_spec)\ntaxi_fit <- fit(taxi_wflow, taxi_train)\n```\n:::\n\n\n## Metrics for model performance\n\n`conf_mat()` can be used to see how well the model is doing at prediction\n\n\n::: 
{.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n conf_mat(truth = tip, estimate = .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n Truth\nPrediction yes no\n yes 4639 660\n no 337 1409\n```\n:::\n:::\n\n\nand it has nice plotting features\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n conf_mat(truth = tip, estimate = .pred_class) %>%\n autoplot(type = \"heatmap\")\n```\n\n::: {.cell-output-display}\n![](04-classwork_files/figure-html/unnamed-chunk-3-1.png){width=672}\n:::\n:::\n\n\nusing the same interface we can calculate metrics\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n accuracy(truth = tip, estimate = .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 Γ— 3\n .metric .estimator .estimate\n \n1 accuracy binary 0.858\n```\n:::\n:::\n\n\nAll yardstick metric functions work with grouped data frames!\n\n\n::: {.cell}\n\n```{.r .cell-code}\naugment(taxi_fit, new_data = taxi_train) %>%\n group_by(local) %>%\n accuracy(truth = tip, estimate = .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 Γ— 4\n local .metric .estimator .estimate\n \n1 yes accuracy binary 0.840\n2 no accuracy binary 0.862\n```\n:::\n:::\n\n\nMetric sets are a way to combine multiple similar metric functions together into a new function.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_metrics <- metric_set(accuracy, specificity, sensitivity)\n\naugment(taxi_fit, new_data = taxi_train) %>%\n taxi_metrics(truth = tip, estimate = .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 Γ— 3\n .metric .estimator .estimate\n \n1 accuracy binary 0.858\n2 specificity binary 0.681\n3 sensitivity binary 0.932\n```\n:::\n:::\n\n\n## Your turn\n\nCompute and plot an ROC curve for your current model.\n\nWhat data is being used for this ROC curve plot?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Dangers of overfitting\n\nRepredicting the training set, bad!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7,045 Γ— 10\n tip distance company local dow month hour .pred_class .pred_yes .pred_no\n \n 1 no 5.39 Flash … no Sat Mar 12 no 0.0625 0.937 \n 2 no 18.4 Sun Ta… no Sat Apr 6 yes 0.924 0.0758\n 3 no 5.8 other no Tue Jan 10 no 0.391 0.609 \n 4 no 6.85 Flash … no Fri Apr 8 no 0.112 0.888 \n 5 no 9.5 City S… no Wed Jan 7 no 0.129 0.871 \n 6 no 12 other no Fri Apr 11 no 0.326 0.674 \n 7 no 8.9 Taxi A… no Mon Feb 14 no 0.0917 0.908 \n 8 no 1.38 other no Fri Apr 16 yes 0.902 0.0980\n 9 no 9.12 Flash … no Wed Apr 9 no 0.0917 0.908 \n10 no 2.28 City S… no Thu Apr 16 yes 0.933 0.0668\n# β„Ή 7,035 more rows\n```\n:::\n:::\n\n\n\"Resubstitution estimate\" - This should be the best possible performance that you could ever achieve, but it can be very misleading!\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_train) %>%\n accuracy(tip, .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 Γ— 3\n .metric .estimator .estimate\n \n1 accuracy binary 0.858\n```\n:::\n:::\n\n\nNow on the test set, see that it performs worse? 
This is closer to \"real\" performance.\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_fit %>%\n augment(taxi_test) %>%\n accuracy(tip, .pred_class)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 Γ— 3\n .metric .estimator .estimate\n \n1 accuracy binary 0.795\n```\n:::\n:::\n\n\n## Your turn\n\nUse `augment()` and and a metric function to compute a classification metric like `brier_class()`.\n\nCompute the metrics for both training and testing data to demonstrate overfitting!\n\nNotice the evidence of overfitting!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n\n# Use `augment()` and `brier_class()` with `taxi_fit`\ntaxi_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n══ Workflow [trained] ══════════════════════════════════════════════════════════\nPreprocessor: Formula\nModel: decision_tree()\n\n── Preprocessor ────────────────────────────────────────────────────────────────\ntip ~ .\n\n── Model ───────────────────────────────────────────────────────────────────────\nn= 7045 \n\nnode), split, n, loss, yval, (yprob)\n * denotes terminal node\n\n 1) root 7045 2069 yes (0.70631654 0.29368346) \n 2) company=Chicago Independents,City Service,Sun Taxi,Taxicab Insurance Agency Llc,other 4328 744 yes (0.82809612 0.17190388) \n 4) distance< 4.615 2365 254 yes (0.89260042 0.10739958) \n 8) distance< 3.375 2101 211 yes (0.89957163 0.10042837) \n 16) local=no 1469 130 yes (0.91150442 0.08849558) \n 32) company=Chicago Independents,City Service,Taxicab Insurance Agency Llc 674 45 yes (0.93323442 0.06676558) *\n 33) company=Sun Taxi,other 795 85 yes (0.89308176 0.10691824) \n 66) hour< 22.5 769 79 yes (0.89726918 0.10273082) \n 132) hour>=20.5 42 1 yes (0.97619048 0.02380952) *\n 133) hour< 20.5 727 78 yes (0.89270977 0.10729023) \n 266) distance< 2.285 561 55 yes (0.90196078 0.09803922) *\n 267) distance>=2.285 166 23 yes (0.86144578 0.13855422) \n 534) hour>=13.5 100 6 yes (0.94000000 0.06000000) *\n 535) hour< 13.5 66 17 yes (0.74242424 0.25757576) \n 1070) distance>=2.685 41 7 yes (0.82926829 0.17073171) *\n 1071) distance< 2.685 25 10 yes (0.60000000 0.40000000) \n 2142) hour< 10.5 13 3 yes (0.76923077 0.23076923) *\n 2143) hour>=10.5 12 5 no (0.41666667 0.58333333) *\n 67) hour>=22.5 26 6 yes (0.76923077 0.23076923) *\n 17) local=yes 632 81 yes (0.87183544 0.12816456) *\n 9) distance>=3.375 264 43 yes (0.83712121 0.16287879) \n 18) dow=Sun,Mon,Wed,Thu,Fri,Sat 230 30 yes (0.86956522 0.13043478) *\n 19) dow=Tue 34 13 yes (0.61764706 0.38235294) \n 38) month=Jan,Mar,Apr 26 6 yes (0.76923077 0.23076923) *\n 39) month=Feb 8 1 no (0.12500000 0.87500000) *\n 5) distance>=4.615 1963 490 yes (0.75038207 0.24961793) \n 10) distance>=12.565 1069 81 yes (0.92422825 0.07577175) *\n 11) distance< 12.565 894 409 yes (0.54250559 0.45749441) \n 22) company=Chicago Independents,Sun Taxi,Taxicab Insurance Agency Llc 278 71 yes (0.74460432 0.25539568) \n 44) distance< 8.105 136 22 yes (0.83823529 0.16176471) \n 88) company=Chicago Independents 25 0 yes (1.00000000 0.00000000) *\n 89) company=Sun Taxi,Taxicab Insurance Agency Llc 111 22 yes (0.80180180 0.19819820) \n 178) dow=Sun,Mon,Thu,Fri,Sat 74 10 yes (0.86486486 0.13513514) *\n 179) dow=Tue,Wed 37 12 yes (0.67567568 0.32432432) \n 358) hour>=16.5 16 2 yes (0.87500000 0.12500000) *\n 359) hour< 16.5 21 10 yes (0.52380952 0.47619048) \n 718) hour< 9.5 9 2 yes (0.77777778 0.22222222) *\n 719) hour>=9.5 12 4 no (0.33333333 0.66666667) *\n 45) distance>=8.105 142 49 yes (0.65492958 0.34507042) \n 90) company=Chicago 
Independents,Taxicab Insurance Agency Llc 72 15 yes (0.79166667 0.20833333) \n 180) distance>=10.855 29 0 yes (1.00000000 0.00000000) *\n 181) distance< 10.855 43 15 yes (0.65116279 0.34883721) \n 362) dow=Mon,Tue,Wed,Thu 27 4 yes (0.85185185 0.14814815) *\n 363) dow=Sun,Fri,Sat 16 5 no (0.31250000 0.68750000) *\n 91) company=Sun Taxi 70 34 yes (0.51428571 0.48571429) \n\n...\nand 232 more lines.\n```\n:::\n:::\n\n\n## Your turn\n\nIf we use 10 folds, what percent of the training data:\n\n- ends up in analysis?\n- ends up in assessment?\n\nfor each fold\n\n## Resampling\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# v = 10 is the default\nvfold_cv(taxi_train)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# 10-fold cross-validation \n# A tibble: 10 Γ— 2\n splits id \n \n 1 Fold01\n 2 Fold02\n 3 Fold03\n 4 Fold04\n 5 Fold05\n 6 Fold06\n 7 Fold07\n 8 Fold08\n 9 Fold09\n10 Fold10\n```\n:::\n:::\n\n\nWhat is in a resampling result?\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_folds <- vfold_cv(taxi_train, v = 10)\n\n# Individual splits of analysis/assessment data\ntaxi_folds$splits[1:3]\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n[[1]]\n\n<6340/705/7045>\n\n[[2]]\n\n<6340/705/7045>\n\n[[3]]\n\n<6340/705/7045>\n```\n:::\n:::\n\n\nStratification often helps, with very little downside\n\n\n::: {.cell}\n\n```{.r .cell-code}\nvfold_cv(taxi_train, strata = tip)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# 10-fold cross-validation using stratification \n# A tibble: 10 Γ— 2\n splits id \n \n 1 Fold01\n 2 Fold02\n 3 Fold03\n 4 Fold04\n 5 Fold05\n 6 Fold06\n 7 Fold07\n 8 Fold08\n 9 Fold09\n10 Fold10\n```\n:::\n:::\n\n\nWe'll use this setup:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(123)\ntaxi_folds <- vfold_cv(taxi_train, v = 10, strata = tip)\ntaxi_folds\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# 10-fold cross-validation using stratification \n# A tibble: 10 Γ— 2\n splits id \n \n 1 Fold01\n 2 Fold02\n 3 Fold03\n 4 Fold04\n 5 Fold05\n 6 Fold06\n 7 Fold07\n 8 Fold08\n 9 Fold09\n10 Fold10\n```\n:::\n:::\n\n\n## Evaluating model performance\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Fit the workflow on each analysis set,\n# then compute performance on each assessment set\ntaxi_res <- fit_resamples(taxi_wflow, taxi_folds)\ntaxi_res\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# Resampling results\n# 10-fold cross-validation using stratification \n# A tibble: 10 Γ— 4\n splits id .metrics .notes \n \n 1 Fold01 \n 2 Fold02 \n 3 Fold03 \n 4 Fold04 \n 5 Fold05 \n 6 Fold06 \n 7 Fold07 \n 8 Fold08 \n 9 Fold09 \n10 Fold10 \n```\n:::\n:::\n\n\nAggregate metrics\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntaxi_res %>%\n collect_metrics()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 Γ— 6\n .metric .estimator mean n std_err .config \n \n1 accuracy binary 0.793 10 0.00293 Preprocessor1_Model1\n2 roc_auc binary 0.809 10 0.00461 Preprocessor1_Model1\n```\n:::\n:::\n\n\nIf you want to analyze the assessment set (i.e. 
holdout) predictions, then you need to adjust the control object and tell it to save them:\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Save the assessment set results\nctrl_taxi <- control_resamples(save_pred = TRUE)\n\ntaxi_res <- fit_resamples(taxi_wflow, taxi_folds, control = ctrl_taxi)\n\ntaxi_preds <- collect_predictions(taxi_res)\ntaxi_preds\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7,045 Γ— 7\n id .pred_yes .pred_no .row .pred_class tip .config \n \n 1 Fold01 0.936 0.0638 10 yes no Preprocessor1_Model1\n 2 Fold01 0.898 0.102 20 yes no Preprocessor1_Model1\n 3 Fold01 0.898 0.102 47 yes no Preprocessor1_Model1\n 4 Fold01 0.101 0.899 51 no no Preprocessor1_Model1\n 5 Fold01 0.871 0.129 59 yes no Preprocessor1_Model1\n 6 Fold01 0.0815 0.918 60 no no Preprocessor1_Model1\n 7 Fold01 0.162 0.838 92 no no Preprocessor1_Model1\n 8 Fold01 0.26 0.74 97 no no Preprocessor1_Model1\n 9 Fold01 0.274 0.726 98 no no Preprocessor1_Model1\n10 Fold01 0.804 0.196 104 yes no Preprocessor1_Model1\n# β„Ή 7,035 more rows\n```\n:::\n:::\n\n\n## Bootstrapping\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(3214)\nbootstraps(taxi_train)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# Bootstrap sampling \n# A tibble: 25 Γ— 2\n splits id \n \n 1 Bootstrap01\n 2 Bootstrap02\n 3 Bootstrap03\n 4 Bootstrap04\n 5 Bootstrap05\n 6 Bootstrap06\n 7 Bootstrap07\n 8 Bootstrap08\n 9 Bootstrap09\n10 Bootstrap10\n# β„Ή 15 more rows\n```\n:::\n:::\n\n\n## Your turn\n\nCreate:\n\n- Monte Carlo Cross-Validation sets\n- validation set\n\n(use the reference guide to find the function)\n\nhttps://rsample.tidymodels.org/reference/index.html\n\nDon't forget to set a seed when you resample!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Create a random forest model\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrf_spec <- rand_forest(trees = 1000, mode = \"classification\")\nrf_spec\n```\n\n::: {.cell-output .cell-output-stdout}\n```\nRandom Forest Model Specification (classification)\n\nMain Arguments:\n trees = 1000\n\nComputational engine: ranger \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nrf_wflow <- workflow(tip ~ ., rf_spec)\nrf_wflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n══ Workflow ════════════════════════════════════════════════════════════════════\nPreprocessor: Formula\nModel: rand_forest()\n\n── Preprocessor ────────────────────────────────────────────────────────────────\ntip ~ .\n\n── Model ───────────────────────────────────────────────────────────────────────\nRandom Forest Model Specification (classification)\n\nMain Arguments:\n trees = 1000\n\nComputational engine: ranger \n```\n:::\n:::\n\n\n## Your turn\n\nUse `fit_resamples()` and `rf_wflow` to:\n\n- Keep predictions\n- Compute metrics\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Evaluate a workflow set\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwf_set <- workflow_set(list(tip ~ .), list(tree_spec, rf_spec))\nwf_set\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A workflow set/tibble: 2 Γ— 4\n wflow_id info option result \n \n1 formula_decision_tree \n2 formula_rand_forest \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nwf_set_fit <- wf_set %>%\n workflow_map(\"fit_resamples\", resamples = taxi_folds)\n\nwf_set_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A workflow set/tibble: 2 Γ— 4\n wflow_id info option result \n \n1 formula_decision_tree \n2 formula_rand_forest \n```\n:::\n:::\n\n\nRank the sets of models by their aggregate 
metric performance\n\n\n::: {.cell}\n\n```{.r .cell-code}\nwf_set_fit %>%\n rank_results()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 4 Γ— 9\n wflow_id .config .metric mean std_err n preprocessor model rank\n \n1 formula_rand_for… Prepro… accura… 0.812 0.00296 10 formula rand… 1\n2 formula_rand_for… Prepro… roc_auc 0.832 0.00516 10 formula rand… 1\n3 formula_decision… Prepro… accura… 0.793 0.00293 10 formula deci… 2\n4 formula_decision… Prepro… roc_auc 0.809 0.00461 10 formula deci… 2\n```\n:::\n:::\n\n\n## Your turn\n\nWhen do you think a workflow set would be useful?\n\nDiscuss with your neighbors!\n\n## The final fit\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# `taxi_split` has train + test info\nfinal_fit <- last_fit(rf_wflow, taxi_split) \n\nfinal_fit\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# Resampling results\n# Manual resampling \n# A tibble: 1 Γ— 6\n splits id .metrics .notes .predictions .workflow \n \n1 train/test split \n```\n:::\n:::\n\n\nTest set metrics:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(final_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 Γ— 4\n .metric .estimator .estimate .config \n \n1 accuracy binary 0.808 Preprocessor1_Model1\n2 roc_auc binary 0.816 Preprocessor1_Model1\n```\n:::\n:::\n\n\nTest set predictions:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_predictions(final_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1,762 Γ— 7\n id .pred_yes .pred_no .row .pred_class tip .config \n \n 1 train/test split 0.734 0.266 10 yes no Preprocessor1_Mo…\n 2 train/test split 0.827 0.173 29 yes yes Preprocessor1_Mo…\n 3 train/test split 0.906 0.0939 35 yes yes Preprocessor1_Mo…\n 4 train/test split 0.907 0.0927 42 yes yes Preprocessor1_Mo…\n 5 train/test split 0.919 0.0808 47 yes no Preprocessor1_Mo…\n 6 train/test split 0.850 0.150 54 yes yes Preprocessor1_Mo…\n 7 train/test split 0.586 0.414 59 yes yes Preprocessor1_Mo…\n 8 train/test split 0.917 0.0829 62 yes yes Preprocessor1_Mo…\n 9 train/test split 0.807 0.193 63 yes yes Preprocessor1_Mo…\n10 train/test split 0.963 0.0368 69 yes yes Preprocessor1_Mo…\n# β„Ή 1,752 more rows\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_predictions(final_fit) %>%\n ggplot(aes(.pred_class, fill = tip)) + \n geom_bar() \n```\n\n::: {.cell-output-display}\n![](04-classwork_files/figure-html/unnamed-chunk-30-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nextract_workflow(final_fit)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n══ Workflow [trained] ══════════════════════════════════════════════════════════\nPreprocessor: Formula\nModel: rand_forest()\n\n── Preprocessor ────────────────────────────────────────────────────────────────\ntip ~ .\n\n── Model ───────────────────────────────────────────────────────────────────────\nRanger result\n\nCall:\n ranger::ranger(x = maybe_data_frame(x), y = y, num.trees = ~1000, num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1), probability = TRUE) \n\nType: Probability estimation \nNumber of trees: 1000 \nSample size: 7045 \nNumber of independent variables: 6 \nMtry: 2 \nTarget node size: 10 \nVariable importance mode: none \nSplitrule: gini \nOOB prediction error (Brier s.): 0.1371421 \n```\n:::\n:::\n\n\n## Your turn\n\nWhich model do you think you would decide to use?\n\nWhat surprised you the most?\n\nWhat is one thing you are looking forward to for tomorrow?\n", + "supporting": [ + "04-classwork_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" 
+ ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/classwork/04-classwork/figure-html/unnamed-chunk-3-1.png b/archive/2023-07-nyr/_freeze/classwork/04-classwork/figure-html/unnamed-chunk-3-1.png new file mode 100644 index 00000000..b0a38830 Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/04-classwork/figure-html/unnamed-chunk-3-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/04-classwork/figure-html/unnamed-chunk-30-1.png b/archive/2023-07-nyr/_freeze/classwork/04-classwork/figure-html/unnamed-chunk-30-1.png new file mode 100644 index 00000000..aae952ab Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/04-classwork/figure-html/unnamed-chunk-30-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-01-classwork/execute-results/html.json b/archive/2023-07-nyr/_freeze/classwork/advanced-01-classwork/execute-results/html.json new file mode 100644 index 00000000..babf14ab --- /dev/null +++ b/archive/2023-07-nyr/_freeze/classwork/advanced-01-classwork/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "e37ca08121037f4f4e24606a28f5ce44", + "result": { + "markdown": "---\ntitle: \"1 - Feature Engineering - Classwork\"\nsubtitle: \"Advanced tidymodels\"\neditor_options: \n chunk_output_type: console\n---\n\n\nWe recommend restarting R between each slide deck!\n\n## Hotel data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nβœ” broom 1.0.5 βœ” recipes 1.0.6 \nβœ” dials 1.2.0 βœ” rsample 1.1.1.9000\nβœ” dplyr 1.1.2 βœ” tibble 3.2.1 \nβœ” ggplot2 3.4.2 βœ” tidyr 1.3.0 \nβœ” infer 1.0.4 βœ” tune 1.1.1.9001\nβœ” modeldata 1.1.0 βœ” workflows 1.1.3 \nβœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \nβœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──\nβœ– purrr::discard() masks scales::discard()\nβœ– dplyr::filter() masks stats::filter()\nβœ– dplyr::lag() masks stats::lag()\nβœ– recipes::step() masks stats::step()\nβ€’ Use suppressPackageStartupMessages() to eliminate package startup messages\n```\n:::\n\n```{.r .cell-code}\nlibrary(modeldatatoo)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n## Data spending\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n```\n:::\n\n\n## Your turn\n\nLet's take some time and investigate the _training data_. The outcome is `avg_price_per_room`. 
\n\nAre there any interesting characteristics of the data?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Resampling Strategy\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\nhotel_rs\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# 10-fold cross-validation using stratification \n# A tibble: 10 Γ— 2\n splits id \n \n 1 Fold01\n 2 Fold02\n 3 Fold03\n 4 Fold04\n 5 Fold05\n 6 Fold06\n 7 Fold07\n 8 Fold08\n 9 Fold09\n10 Fold10\n```\n:::\n:::\n\n\n## A first recipe\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr)\n\nsummary(hotel_rec)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 28 Γ— 4\n variable type role source \n \n 1 lead_time predictor original\n 2 arrival_date_day_of_month predictor original\n 3 stays_in_weekend_nights predictor original\n 4 stays_in_week_nights predictor original\n 5 adults predictor original\n 6 children predictor original\n 7 babies predictor original\n 8 meal predictor original\n 9 country predictor original\n10 market_segment predictor original\n# β„Ή 18 more rows\n```\n:::\n:::\n\n\n## Your turn\n\nWhat do you think are in the `type` vectors for the `lead_time` and `country` columns?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## A base recipe\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n # create indicator variables\n step_dummy(all_nominal_predictors()) %>% \n # filter out constant columns\n step_zv(all_predictors()) %>% \n # normalize\n step_normalize(all_numeric_predictors())\n```\n:::\n\n\n## Different options to reduce correlation\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_corr(all_numeric_predictors(), threshold = 0.9)\n\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_pca(all_numeric_predictors())\n\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n embed::step_umap(all_numeric_predictors(), outcome = vars(avg_price_per_room))\n```\n:::\n\n\n## Other possible steps\n\nFor example, natural splines:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_rec <- \n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors()) %>% \n step_normalize(all_numeric_predictors()) %>% \n step_spline_natural(year_day, deg_free = 10)\n```\n:::\n\n\n## Your turn \n\nCreate a `recipe()` for the hotel data to:\n\n- use a Yeo-Johnson (YJ) transformation on `lead_time`\n- convert factors to indicator variables\n- remove zero-variance variables\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Minimal recipe for the hotel data\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_indicators <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_YeoJohnson(lead_time) %>%\n step_dummy(all_nominal_predictors()) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n\n## Measuring Performance\n\nWe'll compute two measures, mean absolute error (MAE) and 
the coefficient of determination (a.k.a $R^2$), and focus on the MAE for parameter optimization. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_metrics <- metric_set(mae, rsq)\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(9)\n\nhotel_lm_wflow <-\n workflow() %>%\n add_recipe(hotel_indicators) %>%\n add_model(linear_reg())\n \nctrl <- control_resamples(save_pred = TRUE)\nhotel_lm_res <-\n hotel_lm_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nβ†’ A | warning: prediction from a rank-deficient fit may be misleading\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x1\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x2\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x8\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x9\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n\n```\n:::\n\n```{.r .cell-code}\ncollect_metrics(hotel_lm_res)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 Γ— 6\n .metric .estimator mean n std_err .config \n \n1 mae standard 17.3 10 0.199 Preprocessor1_Model1\n2 rsq standard 0.874 10 0.00400 Preprocessor1_Model1\n```\n:::\n:::\n\n\n## Your turn\n\nUse `fit_resamples()` to fit your workflow with a recipe.\n\nCollect the predictions from the results.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Holdout predictions\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Since we used `save_pred = TRUE`\nlm_val_pred <- collect_predictions(hotel_lm_res)\nlm_val_pred %>% slice(1:7)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 7 Γ— 5\n id .pred .row avg_price_per_room .config \n \n1 Fold01 62.1 20 40 Preprocessor1_Model1\n2 Fold01 48.0 28 54 Preprocessor1_Model1\n3 Fold01 64.6 45 50 Preprocessor1_Model1\n4 Fold01 45.8 49 42 Preprocessor1_Model1\n5 Fold01 45.8 61 49 Preprocessor1_Model1\n6 Fold01 30.0 66 40 Preprocessor1_Model1\n7 Fold01 38.8 88 49 Preprocessor1_Model1\n```\n:::\n:::\n\n\n## Calibration Plot \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(probably)\n\ncal_plot_regression(hotel_lm_res, alpha = 1 / 5)\n```\n\n::: {.cell-output-display}\n![](advanced-01-classwork_files/figure-html/unnamed-chunk-16-1.png){width=672}\n:::\n:::\n\n\n## What do we do with the agent and company data? 
\n\nCollapsing factor levels: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nhotel_other_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_YeoJohnson(lead_time) %>%\n step_other(agent, threshold = 0.001) %>%\n step_dummy(all_nominal_predictors()) %>%\n step_zv(all_predictors())\n\nhotel_other_wflow <-\n hotel_lm_wflow %>%\n update_recipe(hotel_other_rec)\n\nhotel_other_res <-\n hotel_other_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\nβ†’ A | warning: prediction from a rank-deficient fit may be misleading\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x1\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x4\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x9\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n\n```\n:::\n\n```{.r .cell-code}\ncollect_metrics(hotel_other_res)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 Γ— 6\n .metric .estimator mean n std_err .config \n \n1 mae standard 17.4 10 0.205 Preprocessor1_Model1\n2 rsq standard 0.874 10 0.00417 Preprocessor1_Model1\n```\n:::\n:::\n\n\nFeature Hashing:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(textrecipes)\n\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n # Defaults to 32 signed indicator columns\n step_dummy_hash(agent) %>%\n step_dummy_hash(company) %>%\n # Regular indicators for the others\n step_dummy(all_nominal_predictors()) %>% \n step_zv(all_predictors())\n\nhotel_hash_wflow <-\n hotel_lm_wflow %>%\n update_recipe(hash_rec)\n\nhotel_hash_res <-\n hotel_hash_wflow %>%\n fit_resamples(hotel_rs, control = ctrl, metrics = reg_metrics)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n'as(, \"dgCMatrix\")' is deprecated.\nUse 'as(., \"CsparseMatrix\")' instead.\nSee help(\"Deprecated\") and help(\"Matrix-deprecated\").\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nβ†’ A | warning: prediction from a rank-deficient fit may be misleading\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x3\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThere were issues with some computations A: x9\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n\n```\n:::\n\n```{.r .cell-code}\ncollect_metrics(hotel_hash_res)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 Γ— 6\n .metric .estimator mean n std_err .config \n \n1 mae standard 17.5 10 0.256 Preprocessor1_Model1\n2 rsq standard 0.872 10 0.00395 Preprocessor1_Model1\n```\n:::\n:::\n\n\n## Debugging a recipe\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Estimate the transformation coefficients\nhash_rec_fit <- prep(hash_rec)\n\n# Get the transformation coefficient\ntidy(hash_rec_fit, number = 1)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 Γ— 3\n terms value id \n \n1 lead_time 0.173 YeoJohnson_zATeA\n```\n:::\n\n```{.r .cell-code}\n# Get the processed data\nbake(hash_rec_fit, hotel_tr %>% slice(1:3), contains(\"_agent_\"))\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 3 Γ— 30\n dummyhash_agent_01 dummyhash_agent_02 dummyhash_agent_03 dummyhash_agent_04\n \n1 0 0 0 0\n2 0 -1 0 0\n3 0 0 0 0\n# β„Ή 26 more variables: dummyhash_agent_05 , dummyhash_agent_06 ,\n# dummyhash_agent_07 , dummyhash_agent_08 ,\n# dummyhash_agent_10 , 
dummyhash_agent_11 ,\n# dummyhash_agent_12 , dummyhash_agent_13 ,\n# dummyhash_agent_14 , dummyhash_agent_15 ,\n# dummyhash_agent_16 , dummyhash_agent_18 ,\n# dummyhash_agent_19 , dummyhash_agent_20 , …\n```\n:::\n:::\n", + "supporting": [ + "advanced-01-classwork_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-01-classwork/figure-html/unnamed-chunk-16-1.png b/archive/2023-07-nyr/_freeze/classwork/advanced-01-classwork/figure-html/unnamed-chunk-16-1.png new file mode 100644 index 00000000..7d3a9947 Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-01-classwork/figure-html/unnamed-chunk-16-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/execute-results/html.json b/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/execute-results/html.json new file mode 100644 index 00000000..1e63b868 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "3249c8486bcd82d9350779ea680f2fd1", + "result": { + "markdown": "---\ntitle: \"2 - Tuning Hyperparameters - Classwork\"\nsubtitle: \"Advanced tidymodels\"\neditor_options: \n chunk_output_type: console\n---\n\n\nWe recommend restarting R between each slide deck!\n\n## Setup\n\nSetup from deck 1\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nβœ” broom 1.0.5 βœ” recipes 1.0.6 \nβœ” dials 1.2.0 βœ” rsample 1.1.1.9000\nβœ” dplyr 1.1.2 βœ” tibble 3.2.1 \nβœ” ggplot2 3.4.2 βœ” tidyr 1.3.0 \nβœ” infer 1.0.4 βœ” tune 1.1.1.9001\nβœ” modeldata 1.1.0 βœ” workflows 1.1.3 \nβœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \nβœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──\nβœ– purrr::discard() masks scales::discard()\nβœ– dplyr::filter() masks stats::filter()\nβœ– dplyr::lag() masks stats::lag()\nβœ– recipes::step() masks stats::step()\nβ€’ Dig deeper into tidy modeling with R at https://www.tmwr.org\n```\n:::\n\n```{.r .cell-code}\nlibrary(modeldatatoo)\nlibrary(textrecipes)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n\nreg_metrics <- metric_set(mae, rsq)\n\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n # Defaults to 32 signed indicator columns\n step_dummy_hash(agent) %>%\n step_dummy_hash(company) %>%\n # Regular indicators for the others\n step_dummy(all_nominal_predictors()) %>% \n 
step_zv(all_predictors())\n```\n:::\n\n\n## Tagging parameters for tuning\n\n\n::: {.cell}\n\n```{.r .cell-code}\nhash_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n```\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(bonsai)\n\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hash_rec, lgbm_spec)\n```\n:::\n\n\n## Create a grid \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(12)\ngrid <- \n lgbm_wflow %>% \n extract_parameter_set_dials() %>% \n grid_latin_hypercube(size = 25)\n\ngrid\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 25 Γ— 4\n trees learn_rate `agent hash` `company hash`\n \n 1 1629 0.00000440 524 1454\n 2 1746 0.0000000751 1009 2865\n 3 53 0.0000180 2313 367\n 4 442 0.000000445 347 460\n 5 1413 0.0000000208 3232 553\n 6 1488 0.0000578 3692 639\n 7 906 0.000385 602 332\n 8 1884 0.00000000101 1127 567\n 9 1812 0.0239 961 1183\n10 393 0.000000117 487 1783\n# β„Ή 15 more rows\n```\n:::\n:::\n\n\n## Your turn \n\nCreate a grid for our tunable workflow.\n\nTry creating a regular grid.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Your turn \n\nWhat advantage would a regular grid have? \n\nDiscuss with your neighbor!\n\n## Update parameter ranges \n\n\n::: {.cell}\n\n```{.r .cell-code}\nlgbm_param <- \n lgbm_wflow %>% \n extract_parameter_set_dials() %>% \n update(trees = trees(c(1L, 100L)),\n learn_rate = learn_rate(c(-5, -1)))\n\nset.seed(712)\ngrid <- \n lgbm_param %>% \n grid_latin_hypercube(size = 25)\n\ngrid\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 25 Γ— 4\n trees learn_rate `agent hash` `company hash`\n \n 1 75 0.000312 2991 1250\n 2 4 0.0000337 899 3088\n 3 15 0.0295 520 1578\n 4 8 0.0997 1256 3592\n 5 80 0.000622 419 258\n 6 70 0.000474 2499 1089\n 7 35 0.000165 287 2376\n 8 64 0.00137 389 359\n 9 58 0.0000250 616 881\n10 84 0.0639 2311 2635\n# β„Ή 15 more rows\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\ngrid %>% \n ggplot(aes(trees, learn_rate)) +\n geom_point(size = 4) +\n scale_y_log10()\n```\n\n::: {.cell-output-display}\n![](advanced-02-classwork_files/figure-html/unnamed-chunk-7-1.png){width=672}\n:::\n:::\n\n\n## Grid Search\n\nLet's take our previous model and tune more parameters:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hash_rec, lgbm_spec)\n\n# Update the feature hash ranges (log-2 units)\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\nRun the grid search:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(9)\nctrl <- control_grid(save_pred = TRUE)\n\nlgbm_res <-\n lgbm_wflow %>%\n tune_grid(\n resamples = hotel_rs,\n grid = 25,\n # The options below are not required by default\n param_info = lgbm_param, \n control = ctrl,\n metrics = reg_metrics\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n'as(, \"dgCMatrix\")' is deprecated.\nUse 'as(., \"CsparseMatrix\")' instead.\nSee help(\"Deprecated\") and help(\"Matrix-deprecated\").\n```\n:::\n\n```{.r .cell-code}\nlgbm_res \n```\n\n::: 
{.cell-output .cell-output-stdout}\n```\n# Tuning results\n# 10-fold cross-validation using stratification \n# A tibble: 10 Γ— 5\n splits id .metrics .notes .predictions\n \n 1 Fold01 \n 2 Fold02 \n 3 Fold03 \n 4 Fold04 \n 5 Fold05 \n 6 Fold06 \n 7 Fold07 \n 8 Fold08 \n 9 Fold09 \n10 Fold10 \n```\n:::\n:::\n\n\nInspect results:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(lgbm_res)\n```\n\n::: {.cell-output-display}\n![](advanced-02-classwork_files/figure-html/unnamed-chunk-10-1.png){width=672}\n:::\n\n```{.r .cell-code}\ncollect_metrics(lgbm_res)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 50 Γ— 11\n trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean\n \n 1 298 19 4.15e- 9 222 36 mae standard 53.2 \n 2 298 19 4.15e- 9 222 36 rsq standard 0.811\n 3 1394 5 5.82e- 6 28 21 mae standard 52.9 \n 4 1394 5 5.82e- 6 28 21 rsq standard 0.810\n 5 774 12 4.41e- 2 27 95 mae standard 10.5 \n 6 774 12 4.41e- 2 27 95 rsq standard 0.939\n 7 1342 7 6.84e-10 71 17 mae standard 53.2 \n 8 1342 7 6.84e-10 71 17 rsq standard 0.810\n 9 669 39 8.62e- 7 141 145 mae standard 53.2 \n10 669 39 8.62e- 7 141 145 rsq standard 0.808\n# β„Ή 40 more rows\n# β„Ή 3 more variables: n , std_err , .config \n```\n:::\n\n```{.r .cell-code}\ncollect_metrics(lgbm_res, summarize = FALSE)\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 500 Γ— 10\n id trees min_n learn_rate `agent hash` `company hash` .metric .estimator\n \n 1 Fold01 298 19 4.15e-9 222 36 mae standard \n 2 Fold01 298 19 4.15e-9 222 36 rsq standard \n 3 Fold02 298 19 4.15e-9 222 36 mae standard \n 4 Fold02 298 19 4.15e-9 222 36 rsq standard \n 5 Fold03 298 19 4.15e-9 222 36 mae standard \n 6 Fold03 298 19 4.15e-9 222 36 rsq standard \n 7 Fold04 298 19 4.15e-9 222 36 mae standard \n 8 Fold04 298 19 4.15e-9 222 36 rsq standard \n 9 Fold05 298 19 4.15e-9 222 36 mae standard \n10 Fold05 298 19 4.15e-9 222 36 rsq standard \n# β„Ή 490 more rows\n# β„Ή 2 more variables: .estimate , .config \n```\n:::\n:::\n\n\n## Choose a parameter combination\n\n\n::: {.cell}\n\n```{.r .cell-code}\nshow_best(lgbm_res, metric = \"rsq\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 Γ— 11\n trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean\n \n1 1890 10 0.0159 115 174 rsq standard 0.940\n2 774 12 0.0441 27 95 rsq standard 0.939\n3 1638 36 0.0409 15 120 rsq standard 0.938\n4 963 23 0.00556 157 13 rsq standard 0.930\n5 590 5 0.00320 85 73 rsq standard 0.905\n# β„Ή 3 more variables: n , std_err , .config \n```\n:::\n\n```{.r .cell-code}\nlgbm_best <- select_best(lgbm_res, metric = \"mae\")\nlgbm_best\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 1 Γ— 6\n trees min_n learn_rate `agent hash` `company hash` .config \n \n1 774 12 0.0441 27 95 Preprocessor03_Model1\n```\n:::\n:::\n\n\n## Checking Calibration\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(probably)\n\nlgbm_res %>%\n collect_predictions(\n parameters = lgbm_best\n ) %>%\n cal_plot_regression(\n truth = avg_price_per_room,\n estimate = .pred,\n alpha = 1 / 3\n )\n```\n\n::: {.cell-output-display}\n![](advanced-02-classwork_files/figure-html/unnamed-chunk-12-1.png){width=672}\n:::\n:::\n\n\n## Running in parallel\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncores <- parallelly::availableCores(logical = FALSE)\ncl <- parallel::makePSOCKcluster(cores)\ndoParallel::registerDoParallel(cl)\n\n# Now call `tune_grid()`!\n\n# Shut it down with:\nforeach::registerDoSEQ()\nparallel::stopCluster(cl)\n```\n:::\n\n\n## Your 
turn\n\nTry early stopping: Set `trees = 2000` and tune the `stop_iter` parameter!\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n", + "supporting": [ + "advanced-02-classwork_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-10-1.png b/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-10-1.png new file mode 100644 index 00000000..07224536 Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-10-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-12-1.png b/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-12-1.png new file mode 100644 index 00000000..6b654a8a Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-12-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-7-1.png b/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-7-1.png new file mode 100644 index 00000000..36bf1356 Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-02-classwork/figure-html/unnamed-chunk-7-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-03-classwork/execute-results/html.json b/archive/2023-07-nyr/_freeze/classwork/advanced-03-classwork/execute-results/html.json new file mode 100644 index 00000000..c536f227 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/classwork/advanced-03-classwork/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "528e1ad06a52f923a430c646236a5726", + "result": { + "markdown": "---\ntitle: \"3 - Grid Search via Racing - Classwork\"\nsubtitle: \"Advanced tidymodels\"\neditor_options: \n chunk_output_type: console\n---\n\n\nWe recommend restarting R between each slide deck!\n\n## Setup\n\nSetup from deck 2\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nβœ” broom 1.0.5 βœ” recipes 1.0.6 \nβœ” dials 1.2.0 βœ” rsample 1.1.1.9000\nβœ” dplyr 1.1.2 βœ” tibble 3.2.1 \nβœ” ggplot2 3.4.2 βœ” tidyr 1.3.0 \nβœ” infer 1.0.4 βœ” tune 1.1.1.9001\nβœ” modeldata 1.1.0 βœ” workflows 1.1.3 \nβœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \nβœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──\nβœ– purrr::discard() masks scales::discard()\nβœ– dplyr::filter() masks stats::filter()\nβœ– dplyr::lag() masks stats::lag()\nβœ– recipes::step() masks stats::step()\nβ€’ Use tidymodels_prefer() to resolve common conflicts.\n```\n:::\n\n```{.r .cell-code}\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n\nreg_metrics <- metric_set(mae, rsq)\n\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n 
mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n\nhotel_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hotel_rec, lgbm_spec)\n\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\n## Racing \n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Let's use a larger grid\nset.seed(8945)\nlgbm_grid <- \n lgbm_param %>% \n grid_latin_hypercube(size = 50)\n\nlibrary(finetune)\n\nset.seed(9)\nlgbm_race_res <-\n lgbm_wflow %>%\n tune_race_anova(\n resamples = hotel_rs,\n grid = lgbm_grid, \n metrics = reg_metrics\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n'as(, \"dgCMatrix\")' is deprecated.\nUse 'as(., \"CsparseMatrix\")' instead.\nSee help(\"Deprecated\") and help(\"Matrix-deprecated\").\n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nshow_best(lgbm_race_res, metric = \"mae\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 Γ— 11\n trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean\n \n1 1014 5 0.0791 35 181 mae standard 10.3\n2 1516 7 0.0421 176 12 mae standard 10.4\n# β„Ή 3 more variables: n , std_err , .config \n```\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nplot_race(lgbm_race_res) + \n scale_x_continuous(breaks = pretty_breaks())\n```\n\n::: {.cell-output-display}\n![](advanced-03-classwork_files/figure-html/unnamed-chunk-4-1.png){width=672}\n:::\n:::\n\n\n\n## Your turn\n\nRun `tune_race_anova()` with a different seed.\n\nDid you get the same or similar results?\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n", + "supporting": [ + "advanced-03-classwork_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-03-classwork/figure-html/unnamed-chunk-4-1.png b/archive/2023-07-nyr/_freeze/classwork/advanced-03-classwork/figure-html/unnamed-chunk-4-1.png new file mode 100644 index 00000000..e71e454f Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-03-classwork/figure-html/unnamed-chunk-4-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/execute-results/html.json b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/execute-results/html.json new file mode 100644 index 00000000..7b671255 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/execute-results/html.json @@ -0,0 +1,16 @@ +{ + "hash": "c1fa12ef361dcce19074fb247be50d03", + "result": { + "markdown": "---\ntitle: \"4 - Iterative Search - Classwork\"\nsubtitle: \"Advanced tidymodels\"\neditor_options: \n chunk_output_type: console\n---\n\n\nWe recommend restarting R between each slide 
deck!\n\n## Setup\n\nSetup from deck 3\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n── Attaching packages ────────────────────────────────────── tidymodels 1.1.0 ──\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nβœ” broom 1.0.5 βœ” recipes 1.0.6 \nβœ” dials 1.2.0 βœ” rsample 1.1.1.9000\nβœ” dplyr 1.1.2 βœ” tibble 3.2.1 \nβœ” ggplot2 3.4.2 βœ” tidyr 1.3.0 \nβœ” infer 1.0.4 βœ” tune 1.1.1.9001\nβœ” modeldata 1.1.0 βœ” workflows 1.1.3 \nβœ” parsnip 1.1.0.9003 βœ” workflowsets 1.0.1 \nβœ” purrr 1.0.1 βœ” yardstick 1.2.0.9001\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\n── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──\nβœ– purrr::discard() masks scales::discard()\nβœ– dplyr::filter() masks stats::filter()\nβœ– dplyr::lag() masks stats::lag()\nβœ– recipes::step() masks stats::step()\nβ€’ Search for functions across packages at https://www.tidymodels.org/find/\n```\n:::\n\n```{.r .cell-code}\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\nlibrary(probably)\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n\nAttaching package: 'probably'\n```\n:::\n\n::: {.cell-output .cell-output-stderr}\n```\nThe following objects are masked from 'package:base':\n\n as.factor, as.ordered\n```\n:::\n\n```{.r .cell-code}\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n\nhotel_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>%\n step_YeoJohnson(lead_time) %>%\n step_dummy_hash(agent, num_terms = tune(\"agent hash\")) %>%\n step_dummy_hash(company, num_terms = tune(\"company hash\")) %>%\n step_zv(all_predictors())\n\nlgbm_spec <- \n boost_tree(trees = tune(), learn_rate = tune(), min_n = tune()) %>% \n set_mode(\"regression\") %>% \n set_engine(\"lightgbm\")\n\nlgbm_wflow <- workflow(hotel_rec, lgbm_spec)\n\nlgbm_param <-\n lgbm_wflow %>%\n extract_parameter_set_dials() %>%\n update(`agent hash` = num_hash(c(3, 8)),\n `company hash` = num_hash(c(3, 8)))\n```\n:::\n\n\n## Your turn\n\nYour GP makes predictions on two new candidate tuning parameters. 
We want to minimize MAE.\n\nWhich should we choose?\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(28383)\nnum_points <- 5000\nexerc_data <- \n tibble(MAE = c(rnorm(num_points, 10, 2), rnorm(num_points, 13, 1 / 2)),\n `Choose:` = rep(paste(\"candidate\", 1:2), each = num_points))\n\nexerc_data %>% \n ggplot(aes(MAE, col = `Choose:`)) + \n geom_line(stat = \"density\", adjust = 1.25, trim = TRUE, linewidth = 1) +\n theme(legend.position = \"top\")\n```\n\n::: {.cell-output-display}\n![](advanced-04-classwork_files/figure-html/unnamed-chunk-2-1.png){width=672}\n:::\n:::\n\n\n## An Initial Grid\n\n\n::: {.cell}\n\n```{.r .cell-code}\nreg_metrics <- metric_set(mae, rsq)\n\nset.seed(12)\ninit_res <-\n lgbm_wflow %>%\n tune_grid(\n resamples = hotel_rs,\n grid = nrow(lgbm_param) + 2,\n param_info = lgbm_param,\n metrics = reg_metrics\n )\n```\n\n::: {.cell-output .cell-output-stderr}\n```\n'as(, \"dgCMatrix\")' is deprecated.\nUse 'as(., \"CsparseMatrix\")' instead.\nSee help(\"Deprecated\") and help(\"Matrix-deprecated\").\n```\n:::\n\n```{.r .cell-code}\nshow_best(init_res, metric = \"mae\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 Γ— 11\n trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean\n \n1 390 10 0.0139 13 62 mae standard 11.9\n2 718 31 0.00112 72 25 mae standard 29.1\n3 1236 22 0.0000261 11 17 mae standard 51.8\n4 1044 25 0.00000832 34 12 mae standard 52.8\n5 1599 7 0.0000000402 254 179 mae standard 53.2\n# β„Ή 3 more variables: n , std_err , .config \n```\n:::\n:::\n\n\n## Bayesian Optimization \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(15)\nlgbm_bayes_res <-\n lgbm_wflow %>%\n tune_bayes(\n resamples = hotel_rs,\n initial = init_res, # <- initial results\n iter = 20,\n param_info = lgbm_param,\n metrics = reg_metrics\n )\n\nshow_best(lgbm_bayes_res, metric = \"mae\")\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 5 Γ— 12\n trees min_n learn_rate `agent hash` `company hash` .metric .estimator mean\n \n1 1665 2 0.0593 12 59 mae standard 10.1\n2 1179 2 0.0552 161 121 mae standard 10.2\n3 1609 6 0.0592 186 192 mae standard 10.2\n4 1352 6 0.0799 217 46 mae standard 10.3\n5 1647 4 0.0819 12 240 mae standard 10.3\n# β„Ή 4 more variables: n , std_err , .config , .iter \n```\n:::\n:::\n\n\nPlotting results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\")\n```\n\n::: {.cell-output-display}\n![](advanced-04-classwork_files/figure-html/unnamed-chunk-5-1.png){width=672}\n:::\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\", type = \"parameters\")\n```\n\n::: {.cell-output-display}\n![](advanced-04-classwork_files/figure-html/unnamed-chunk-5-2.png){width=672}\n:::\n\n```{.r .cell-code}\nautoplot(lgbm_bayes_res, metric = \"mae\", type = \"performance\")\n```\n\n::: {.cell-output-display}\n![](advanced-04-classwork_files/figure-html/unnamed-chunk-5-3.png){width=672}\n:::\n:::\n\n\n## Your turn\n\nLet's try a different acquisition function: `conf_bound(kappa)`.\n\nWe'll use the `objective` argument to set it.\n\nChoose your own `kappa` value:\n\n- Larger values will explore the space more.\n- \"Large\" values are usually less than one.\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Your code here!\n```\n:::\n\n\n## Finalize the workflow\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbest_param <- select_best(lgbm_bayes_res, metric = \"mae\")\n\nfinal_wflow <- \n lgbm_wflow %>% \n finalize_workflow(best_param)\n\nfinal_wflow\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n══ Workflow 
════════════════════════════════════════════════════════════════════\nPreprocessor: Recipe\nModel: boost_tree()\n\n── Preprocessor ────────────────────────────────────────────────────────────────\n4 Recipe Steps\n\nβ€’ step_YeoJohnson()\nβ€’ step_dummy_hash()\nβ€’ step_dummy_hash()\nβ€’ step_zv()\n\n── Model ───────────────────────────────────────────────────────────────────────\nBoosted Tree Model Specification (regression)\n\nMain Arguments:\n trees = 1665\n min_n = 2\n learn_rate = 0.0592557571004946\n\nComputational engine: lightgbm \n```\n:::\n:::\n\n\n## The Final Fit\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(3893)\nfinal_res <- final_wflow %>% last_fit(hotel_split, metrics = reg_metrics)\n\nfinal_res\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# Resampling results\n# Manual resampling \n# A tibble: 1 Γ— 6\n splits id .metrics .notes .predictions .workflow \n \n1 train/test split \n```\n:::\n:::\n\n\n## Test Set Results\n\n\n::: {.cell}\n\n```{.r .cell-code}\nfinal_res %>% \n collect_predictions() %>% \n cal_plot_regression(\n truth = avg_price_per_room, \n estimate = .pred, \n alpha = 1 / 4)\n```\n\n::: {.cell-output-display}\n![](advanced-04-classwork_files/figure-html/unnamed-chunk-9-1.png){width=672}\n:::\n:::\n\n::: {.cell}\n\n```{.r .cell-code}\nfinal_res %>% collect_metrics()\n```\n\n::: {.cell-output .cell-output-stdout}\n```\n# A tibble: 2 Γ— 4\n .metric .estimator .estimate .config \n \n1 mae standard 10.5 Preprocessor1_Model1\n2 rsq standard 0.937 Preprocessor1_Model1\n```\n:::\n:::\n", + "supporting": [ + "advanced-04-classwork_files" + ], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": {}, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-2-1.png b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-2-1.png new file mode 100644 index 00000000..e42e29c7 Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-2-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-1.png b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-1.png new file mode 100644 index 00000000..4f0c0612 Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-1.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-2.png b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-2.png new file mode 100644 index 00000000..81166f69 Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-2.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-3.png b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-3.png new file mode 100644 index 00000000..7cc6551a Binary files /dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-5-3.png differ diff --git a/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-9-1.png b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-9-1.png new file mode 100644 index 00000000..345618f1 Binary files 
/dev/null and b/archive/2023-07-nyr/_freeze/classwork/advanced-04-classwork/figure-html/unnamed-chunk-9-1.png differ diff --git a/archive/2023-07-nyr/_freeze/extras-effect-encodings/execute-results/html.json b/archive/2023-07-nyr/_freeze/extras-effect-encodings/execute-results/html.json new file mode 100644 index 00000000..bde6f835 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/extras-effect-encodings/execute-results/html.json @@ -0,0 +1,18 @@ +{ + "hash": "8ce23d67b651258eabce1cb1cb8b0cfc", + "result": { + "markdown": "---\ntitle: \"Extras - Effect Encodings\"\nsubtitle: \"Advanced tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n## Previously - Setup\n\n:::: {.columns}\n\n::: {.column width=\"40%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(modeldatatoo)\nlibrary(textrecipes)\nlibrary(bonsai)\n\n# Max's usual settings: \ntidymodels_prefer()\ntheme_set(theme_bw())\noptions(\n pillar.advice = FALSE, \n pillar.min_title_chars = Inf\n)\n```\n:::\n\n\n:::\n\n::: {.column width=\"60%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(295)\nhotel_rates <- \n data_hotel_rates() %>% \n sample_n(5000) %>% \n arrange(arrival_date) %>% \n select(-arrival_date_num, -arrival_date) %>% \n mutate(\n company = factor(as.character(company)),\n country = factor(as.character(country)),\n agent = factor(as.character(agent))\n )\n```\n:::\n\n\n\n:::\n\n::::\n\n\n## Previously - Data Usage\n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(4028)\nhotel_split <-\n initial_split(hotel_rates, strata = avg_price_per_room)\n\nhotel_tr <- training(hotel_split)\nhotel_te <- testing(hotel_split)\n\nset.seed(472)\nhotel_rs <- vfold_cv(hotel_tr, strata = avg_price_per_room)\n```\n:::\n\n\n\n## What do we do with the agent and company data? \n\nThere are 98 unique agent values and 100 companies in our training set. How can we include this information in our model?\n\n. . .\n\nWe could:\n\n- make the full set of indicator variables 😳\n\n- lump agents and companies that rarely occur into an \"other\" group\n\n- use [feature hashing](https://www.tmwr.org/categorical.html#feature-hashing) to create a smaller set of indicator variables\n\n- use effect encoding to replace the `agent` and `company` columns with the estimated effect of that predictor\n\n\n\n\n\n\n\n\n## Per-agent statistics {.annotation}\n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-freq-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-adr-1.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n:::\n\n\n## What is an effect encoding?\n\nWe replace the qualitative’s predictor data with their _effect on the outcome_. 
\n\n::: columns\n::: {.column width=\"50%\"}\nData before:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nbefore\n#> # A tibble: 7 Γ— 3\n#> avg_price_per_room agent .row\n#> \n#> 1 52.7 cynthia_worsley 1\n#> 2 51.8 carlos_bryant 2\n#> 3 53.8 lance_hitchcock 3\n#> 4 51.8 lance_hitchcock 4\n#> 5 46.8 cynthia_worsley 5\n#> 6 54.7 charles_najera 6\n#> 7 46.8 cynthia_worsley 7\n```\n:::\n\n\n:::\n\n::: {.column width=\"50%\"}\n\nData after:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nafter\n#> # A tibble: 7 Γ— 3\n#> avg_price_per_room agent .row\n#> \n#> 1 52.7 88.5 1\n#> 2 51.8 89.5 2\n#> 3 53.8 79.8 3\n#> 4 51.8 79.8 4\n#> 5 46.8 88.5 5\n#> 6 54.7 109. 6\n#> 7 46.8 88.5 7\n```\n:::\n\n\n:::\n:::\n\nThe `agent` column is replaced with an estimate of the ADR. \n\n\n## Per-agent statistics again \n\n::: columns\n::: {.column width=\"50%\"}\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effects-again-1.svg){fig-align='center' width=90%}\n:::\n\n::: {.cell-output-display}\n![](figures/effects-again-2.svg){fig-align='center' width=90%}\n:::\n:::\n\n:::\n\n::: {.column width=\"50%\"}\n\n- Good statistical methods for estimating these means use *partial pooling*.\n\n\n- Pooling borrows strength across agents and shrinks extreme values towards the mean for agents with very few transations\n\n\n- The embed package has recipe steps for effect encodings.\n\n:::\n:::\n\n\n:::notes\nPartial pooling gives better estimates for agents with fewer reservations by shrinking the estimate to the overall ADR mean\n\n\n:::\n\n## Partial pooling\n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](figures/effect-compare-1.svg){fig-align='center' width=576}\n:::\n:::\n\n\n## Agent effects ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/embed.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} {.annotation}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"1,6|\"}\nlibrary(embed)\n\nhotel_effect_rec <-\n recipe(avg_price_per_room ~ ., data = hotel_tr) %>% \n step_YeoJohnson(lead_time) %>%\n step_lencode_mixed(agent, company, outcome = vars(avg_price_per_room)) %>%\n step_dummy(all_nominal_predictors()) %>%\n step_zv(all_predictors())\n```\n:::\n\n\n. . 
.\n\nIt is very important to appropriately validate the effect encoding step to make sure that we are not overfitting.\n\n## Effect encoding results ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/embed.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4|\"}\nhotel_effect_wflow <-\n workflow() %>%\n add_model(linear_reg()) %>% \n update_recipe(hotel_effect_rec)\n\nreg_metrics <- metric_set(mae, rsq)\n\nhotel_effect_res <-\n hotel_effect_wflow %>%\n fit_resamples(hotel_rs, metrics = reg_metrics)\n\ncollect_metrics(hotel_effect_res)\n#> # A tibble: 2 Γ— 6\n#> .metric .estimator mean n std_err .config \n#> \n#> 1 mae standard 17.8 10 0.236 Preprocessor1_Model1\n#> 2 rsq standard 0.867 10 0.00377 Preprocessor1_Model1\n```\n:::\n\n\nSlightly worse but it can handle new agents (if they occur).\n\n\n\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/extras-transit-case-study/execute-results/html.json b/archive/2023-07-nyr/_freeze/extras-transit-case-study/execute-results/html.json new file mode 100644 index 00000000..fd571bf5 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/extras-transit-case-study/execute-results/html.json @@ -0,0 +1,21 @@ +{ + "hash": "ab0cf007fdfc7dcac22e6319151e49ee", + "result": { + "markdown": "---\ntitle: \"Case Study on Transportation\"\nsubtitle: \"Machine learning with tidymodels\"\nformat:\n revealjs: \n slide-number: true\n footer: \n include-before-body: header.html\n include-after-body: footer-annotations.html\n theme: [default, tidymodels.scss]\n width: 1280\n height: 720\nknitr:\n opts_chunk: \n echo: true\n collapse: true\n comment: \"#>\"\n fig.path: \"figures/\"\n---\n\n\n\n\n\n\n## Chicago L-Train data\n\nSeveral years worth of pre-pandemic data were assembled to try to predict the daily number of people entering the Clark and Lake elevated (\"L\") train station in Chicago. \n\n\nMore information: \n\n- Several Chapters in _Feature Engineering and Selection_. \n\n - Start with [Section 4.1](https://bookdown.org/max/FES/chicago-intro.html) \n - See [Section 1.3](https://bookdown.org/max/FES/a-more-complex-example.html)\n\n- Video: [_The Global Pandemic Ruined My Favorite Data Set_](https://www.youtube.com/watch?v=KkpKSqbGnBA)\n\n\n## Predictors\n\n- the 14-day lagged ridership at this and other stations (units: thousands of rides/day)\n- weather data\n- home/away game schedules for Chicago teams\n- the date\n\nThe data are in `modeldata`. See `?Chicago`. \n\n\n## L Train Locations\n\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n\n```\n:::\n:::\n\n\n## Your turn: Explore the Data\n\n*Take a look at these data for a few minutes and see if you can find any interesting characteristics in the predictors or the outcome.* \n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(tidymodels)\nlibrary(rules)\ndata(\"Chicago\")\ndim(Chicago)\n#> [1] 5698 50\nstations\n#> [1] \"Austin\" \"Quincy_Wells\" \"Belmont\" \"Archer_35th\" \n#> [5] \"Oak_Park\" \"Western\" \"Clark_Lake\" \"Clinton\" \n#> [9] \"Merchandise_Mart\" \"Irving_Park\" \"Washington_Wells\" \"Harlem\" \n#> [13] \"Monroe\" \"Polk\" \"Ashland\" \"Kedzie\" \n#> [17] \"Addison\" \"Jefferson_Park\" \"Montrose\" \"California\"\n```\n:::\n\n::: {.cell}\n::: {.cell-output-display}\n```{=html}\n
\n
\n05:00\n
\n```\n:::\n:::\n\n\n\n## Splitting with Chicago data ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nLet's put the last two weeks of data into the test set. `initial_time_split()` can be used for this purpose:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ndata(Chicago)\n\nchi_split <- initial_time_split(Chicago, prop = 1 - (14/nrow(Chicago)))\nchi_split\n#> \n#> <5684/14/5698>\n\nchi_train <- training(chi_split)\nchi_test <- testing(chi_split)\n\n## training\nnrow(chi_train)\n#> [1] 5684\n \n## testing\nnrow(chi_test)\n#> [1] 14\n```\n:::\n\n\n## Time series resampling \n\nOur Chicago data is over time. Regular cross-validation, which uses random sampling, may not be the best idea. \n\nWe can emulate our training/test split by making similar resamples. \n\n* Fold 1: Take the first X years of data as the analysis set, the next 2 weeks as the assessment set.\n\n* Fold 2: Take the first X years + 2 weeks of data as the analysis set, the next 2 weeks as the assessment set.\n\n* and so on\n\n## Rolling forecast origin resampling \n\n\n::: {.cell layout-align=\"center\"}\n::: {.cell-output-display}\n![](images/rolling.svg){fig-align='center' width=70%}\n:::\n:::\n\n\n:::notes\nThis image shows overlapping assessment sets. We will use non-overlapping data but it could be done wither way.\n:::\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n\n\n\n\n )\n```\n:::\n\n\nUse the `date` column to find the date data. \n\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n period = \"week\",\n\n\n\n )\n```\n:::\n\n\nOur units will be weeks. \n\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"6|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n period = \"week\",\n lookback = 52 * 15 \n \n \n )\n```\n:::\n\n\nEvery analysis set has 15 years of data\n\n\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"7|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n period = \"week\",\n lookback = 52 * 15,\n assess_stop = 2,\n\n )\n```\n:::\n\n\nEvery assessment set has 2 weeks of data\n\n\n## Times series resampling ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"8|\"}\nchi_rs <-\n chi_train %>%\n sliding_period(\n index = \"date\", \n period = \"week\",\n lookback = 52 * 15,\n assess_stop = 2,\n step = 2 \n )\n```\n:::\n\n\nIncrement by 2 weeks so that there are no overlapping assessment sets. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_rs$splits[[1]] %>% assessment() %>% pluck(\"date\") %>% range()\n#> [1] \"2016-01-07\" \"2016-01-20\"\nchi_rs$splits[[2]] %>% assessment() %>% pluck(\"date\") %>% range()\n#> [1] \"2016-01-21\" \"2016-02-03\"\n```\n:::\n\n\n\n## Our resampling object ![](hexes/rsample.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n::: columns\n::: {.column width=\"45%\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_rs\n#> # Sliding period resampling \n#> # A tibble: 16 Γ— 2\n#> splits id \n#> \n#> 1 Slice01\n#> 2 Slice02\n#> 3 Slice03\n#> 4 Slice04\n#> 5 Slice05\n#> 6 Slice06\n#> 7 Slice07\n#> 8 Slice08\n#> 9 Slice09\n#> 10 Slice10\n#> 11 Slice11\n#> 12 Slice12\n#> 13 Slice13\n#> 14 Slice14\n#> 15 Slice15\n#> 16 Slice16\n```\n:::\n\n\n:::\n\n::: {.column width=\"5%\"}\n\n:::\n\n::: {.column width=\"50%\"}\n\nWe will fit 16 models on 16 slightly different analysis sets. \n\nEach will produce a separate performance metrics. \n\nWe will average the 16 metrics to get the resampling estimate of that statistic. \n\n:::\n:::\n\n\n## Feature engineering with recipes ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_rec <- \n recipe(ridership ~ ., data = chi_train)\n```\n:::\n\n\nBased on the formula, the function assigns columns to roles of \"outcome\" or \"predictor\"\n\n## A recipe\n\n\n::: {.cell}\n\n```{.r .cell-code}\nsummary(chi_rec)\n#> # A tibble: 50 Γ— 4\n#> variable type role source \n#> \n#> 1 Austin predictor original\n#> 2 Quincy_Wells predictor original\n#> 3 Belmont predictor original\n#> 4 Archer_35th predictor original\n#> 5 Oak_Park predictor original\n#> 6 Western predictor original\n#> 7 Clark_Lake predictor original\n#> 8 Clinton predictor original\n#> 9 Merchandise_Mart predictor original\n#> 10 Irving_Park predictor original\n#> # β„Ή 40 more rows\n```\n:::\n\n\n\n\n## A recipe - work with dates ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"3|\"}\nchi_rec <- \n recipe(ridership ~ ., data = chi_train) %>% \n step_date(date, features = c(\"dow\", \"month\", \"year\")) \n```\n:::\n\n\nThis creates three new columns in the data based on the date. Note that the day-of-the-week column is a factor.\n\n\n## A recipe - work with dates ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"4|\"}\nchi_rec <- \n recipe(ridership ~ ., data = chi_train) %>% \n step_date(date, features = c(\"dow\", \"month\", \"year\")) %>% \n step_holiday(date) \n```\n:::\n\n\nAdd indicators for major holidays. Specific holidays, especially those non-USA, can also be generated. \n\nAt this point, we don't need `date` anymore. Instead of deleting it (there is a step for that) we will change its _role_ to be an identification variable. 
\n\n:::notes\nWe might want to change the role (instead of removing the column) because it will stay in the data set (even when resampled) and might be useful for diagnosing issues.\n:::\n\n\n## A recipe - work with dates ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"5,6|\"}\nchi_rec <- \n  recipe(ridership ~ ., data = chi_train) %>% \n  step_date(date, features = c(\"dow\", \"month\", \"year\")) %>% \n  step_holiday(date) %>% \n  update_role(date, new_role = \"id\") %>%\n  update_role_requirements(role = \"id\", bake = TRUE)\n```\n:::\n\n\n`date` is still in the data set but tidymodels knows not to treat it as an analysis column. \n\n`update_role_requirements()` is needed to make sure that this column is required when making new data points. \n\n## A recipe - remove constant columns ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"7|\"}\nchi_rec <- \n  recipe(ridership ~ ., data = chi_train) %>% \n  step_date(date, features = c(\"dow\", \"month\", \"year\")) %>% \n  step_holiday(date) %>% \n  update_role(date, new_role = \"id\") %>%\n  update_role_requirements(role = \"id\", bake = TRUE) %>% \n  step_zv(all_nominal_predictors()) \n```\n:::\n\n\n\n## A recipe - handle correlations ![](hexes/recipes.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nThe station columns have a very high degree of correlation. \n\nWe might want to decorrelate them with principal component analysis to help the model fits go more easily. \n\nThe vector `stations` contains all station names and can be used to identify all the relevant columns.\n\n\n::: {.cell}\n\n```{.r .cell-code code-line-numbers=\"7|\"}\nchi_pca_rec <- \n  chi_rec %>% \n  step_normalize(all_of(!!stations)) %>% \n  step_pca(all_of(!!stations), num_comp = tune())\n```\n:::\n\n\nWe'll tune the number of PCA components for (default) values of one to four.\n\n## Make some models ![](hexes/yardstick.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"} ![](hexes/rules.png){.absolute top=-20 right=128 width=\"64\" height=\"74.24\"} ![](hexes/recipes.png){.absolute top=-20 right=192 width=\"64\" height=\"74.24\"} ![](hexes/parsnip.png){.absolute top=-20 right=256 width=\"64\" height=\"74.24\"}\n\nLet's try three models. The first one requires the `rules` package (loaded earlier).\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncb_spec <- cubist_rules(committees = 25, neighbors = tune())\nmars_spec <- mars(prod_degree = tune()) %>% set_mode(\"regression\")\nlm_spec <- linear_reg()\n\nchi_set <- \n  workflow_set(\n    list(pca = chi_pca_rec, basic = chi_rec), \n    list(cubist = cb_spec, mars = mars_spec, lm = lm_spec)\n  ) %>% \n  # Evaluate models using mean absolute errors\n  option_add(metrics = metric_set(mae))\n```\n:::\n\n\n\n:::notes\nBriefly talk about Cubist being a (sort of) boosted rule-based model and MARS being a nonlinear regression model. Both incorporate feature selection nicely. \n:::\n\n## Process them on the resamples\n\n\n::: {.cell}\n\n```{.r .cell-code}\n# Set up some objects for stacking ensembles (in a few slides)\ngrid_ctrl <- control_grid(save_pred = TRUE, save_workflow = TRUE)\n\nchi_res <- \n  chi_set %>% \n  workflow_map(\n    resamples = chi_rs,\n    grid = 10,\n    control = grid_ctrl,\n    verbose = TRUE,\n    seed = 12\n  )\n```\n:::\n\n\n## How do the results look? 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nrank_results(chi_res)\n#> # A tibble: 31 Γ— 9\n#> wflow_id .config .metric mean std_err n preprocessor model rank\n#> \n#> 1 pca_cubist Preprocessor1_Model1 mae 0.798 0.104 16 recipe cubis… 1\n#> 2 pca_cubist Preprocessor3_Model3 mae 0.978 0.110 16 recipe cubis… 2\n#> 3 pca_cubist Preprocessor4_Model2 mae 0.983 0.122 16 recipe cubis… 3\n#> 4 pca_cubist Preprocessor4_Model1 mae 0.991 0.127 16 recipe cubis… 4\n#> 5 pca_cubist Preprocessor3_Model2 mae 0.991 0.113 16 recipe cubis… 5\n#> 6 pca_cubist Preprocessor2_Model2 mae 1.02 0.118 16 recipe cubis… 6\n#> 7 pca_cubist Preprocessor1_Model3 mae 1.05 0.134 16 recipe cubis… 7\n#> 8 basic_cubist Preprocessor1_Model8 mae 1.07 0.115 16 recipe cubis… 8\n#> 9 basic_cubist Preprocessor1_Model7 mae 1.07 0.112 16 recipe cubis… 9\n#> 10 basic_cubist Preprocessor1_Model6 mae 1.07 0.114 16 recipe cubis… 10\n#> # β„Ή 21 more rows\n```\n:::\n\n\n## Plot the results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(chi_res)\n```\n\n::: {.cell-output-display}\n![](figures/set-results-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n## Pull out specific results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nWe can also pull out the specific tuning results and look at them: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nchi_res %>% \n extract_workflow_set_result(\"pca_cubist\") %>% \n autoplot()\n```\n\n::: {.cell-output-display}\n![](figures/cubist-autoplot-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n_Model stacks_ generate predictions that are informed by several models.\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_01.png)\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_02.png)\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_03.png)\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_04.png)\n\n## Why choose just one `final_fit`? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n![](images/stack_05.png)\n\n## Building a model stack ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(stacks)\n```\n:::\n\n\n1) Define candidate members\n2) Initialize a data stack object\n3) Add candidate ensemble members to the data stack\n4) Evaluate how to combine their predictions\n5) Fit candidate ensemble members with non-zero stacking coefficients\n6) Predict on new data!\n\n\n## Start the stack and add members ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nCollect all of the resampling results for all model configurations. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_stack <- \n stacks() %>% \n add_candidates(chi_res)\n```\n:::\n\n\n\n## Estimate weights for each candidate ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWhich configurations should be retained? Uses a penalized linear model: \n\n\n::: {.cell}\n\n```{.r .cell-code}\nset.seed(122)\nchi_stack_res <- blend_predictions(chi_stack)\n\nchi_stack_res\n#> # A tibble: 5 Γ— 3\n#> member type weight\n#> \n#> 1 pca_cubist_1_1 cubist_rules 0.343\n#> 2 pca_cubist_3_2 cubist_rules 0.236\n#> 3 basic_cubist_1_4 cubist_rules 0.189\n#> 4 pca_lm_4_1 linear_reg 0.163\n#> 5 pca_cubist_3_3 cubist_rules 0.109\n```\n:::\n\n\n## How did it do? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nThe overall results of the penalized model: \n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(chi_stack_res)\n```\n\n::: {.cell-output-display}\n![](figures/stack-autoplot-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n\n\n## What does it use? ![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nautoplot(chi_stack_res, type = \"weights\")\n```\n\n::: {.cell-output-display}\n![](figures/stack-members-1.svg){fig-align='center' width=960}\n:::\n:::\n\n\n\n## Fit the required candidate models![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nFor each model we retain in the stack, we need their model fit on the entire training set. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nchi_stack_res <- fit_members(chi_stack_res)\n```\n:::\n\n\n\n## The test set: best Cubist model ![](hexes/workflows.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/tune.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nWe can pull out the results and the workflow to fit the single best cubist model. \n\n\n::: {.cell}\n\n```{.r .cell-code}\nbest_cubist <- \n chi_res %>% \n extract_workflow_set_result(\"pca_cubist\") %>% \n select_best()\n\ncubist_res <- \n chi_res %>% \n extract_workflow(\"pca_cubist\") %>% \n finalize_workflow(best_cubist) %>% \n last_fit(split = chi_split, metrics = metric_set(mae))\n```\n:::\n\n\n## The test set: stack ensemble![](hexes/stacks.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"}\n\nWe don't have `last_fit()` for stacks (yet) so we manually make predictions. 
\n\n\n::: {.cell}\n\n```{.r .cell-code}\nstack_pred <- \n predict(chi_stack_res, chi_test) %>% \n bind_cols(chi_test)\n```\n:::\n\n\n## Compare the results ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/stacks.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\nSingle best versus the stack:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncollect_metrics(cubist_res)\n#> # A tibble: 1 Γ— 4\n#> .metric .estimator .estimate .config \n#> \n#> 1 mae standard 0.670 Preprocessor1_Model1\n\nstack_pred %>% mae(ridership, .pred)\n#> # A tibble: 1 Γ— 3\n#> .metric .estimator .estimate\n#> \n#> 1 mae standard 0.689\n```\n:::\n\n\n\n## Plot the test set ![](hexes/tune.png){.absolute top=-20 right=0 width=\"64\" height=\"74.24\"} ![](hexes/ggplot2.png){.absolute top=-20 right=64 width=\"64\" height=\"74.24\"}\n\n\n::: {.cell layout-align=\"center\" output-location='column-fragment'}\n\n```{.r .cell-code}\nlibrary(probably)\ncubist_res %>% \n collect_predictions() %>% \n ggplot(aes(ridership, .pred)) + \n geom_point(alpha = 1 / 2) + \n geom_abline(lty = 2, col = \"green\") + \n coord_obs_pred()\n```\n\n::: {.cell-output-display}\n![](figures/obs-pred-1.svg){fig-align='center' width=480}\n:::\n:::\n", + "supporting": [], + "filters": [ + "rmarkdown/pagebreak.lua" + ], + "includes": { + "include-in-header": [ + "\n\n\n\n\n\n\n\n\n\n\n\n\n" + ], + "include-after-body": [ + "\n\n\n" + ] + }, + "engineDependencies": {}, + "preserve": {}, + "postProcess": true + } +} \ No newline at end of file diff --git a/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/Proj4Leaflet-1.0.1/proj4leaflet.js b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/Proj4Leaflet-1.0.1/proj4leaflet.js new file mode 100644 index 00000000..eaa650c1 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/Proj4Leaflet-1.0.1/proj4leaflet.js @@ -0,0 +1,272 @@ +(function (factory) { + var L, proj4; + if (typeof define === 'function' && define.amd) { + // AMD + define(['leaflet', 'proj4'], factory); + } else if (typeof module === 'object' && typeof module.exports === "object") { + // Node/CommonJS + L = require('leaflet'); + proj4 = require('proj4'); + module.exports = factory(L, proj4); + } else { + // Browser globals + if (typeof window.L === 'undefined' || typeof window.proj4 === 'undefined') + throw 'Leaflet and proj4 must be loaded first'; + factory(window.L, window.proj4); + } +}(function (L, proj4) { + if (proj4.__esModule && proj4.default) { + // If proj4 was bundled as an ES6 module, unwrap it to get + // to the actual main proj4 object. + // See discussion in https://github.com/kartena/Proj4Leaflet/pull/147 + proj4 = proj4.default; + } + + L.Proj = {}; + + L.Proj._isProj4Obj = function(a) { + return (typeof a.inverse !== 'undefined' && + typeof a.forward !== 'undefined'); + }; + + L.Proj.Projection = L.Class.extend({ + initialize: function(code, def, bounds) { + var isP4 = L.Proj._isProj4Obj(code); + this._proj = isP4 ? code : this._projFromCodeDef(code, def); + this.bounds = isP4 ? 
def : bounds; + }, + + project: function (latlng) { + var point = this._proj.forward([latlng.lng, latlng.lat]); + return new L.Point(point[0], point[1]); + }, + + unproject: function (point, unbounded) { + var point2 = this._proj.inverse([point.x, point.y]); + return new L.LatLng(point2[1], point2[0], unbounded); + }, + + _projFromCodeDef: function(code, def) { + if (def) { + proj4.defs(code, def); + } else if (proj4.defs[code] === undefined) { + var urn = code.split(':'); + if (urn.length > 3) { + code = urn[urn.length - 3] + ':' + urn[urn.length - 1]; + } + if (proj4.defs[code] === undefined) { + throw 'No projection definition for code ' + code; + } + } + + return proj4(code); + } + }); + + L.Proj.CRS = L.Class.extend({ + includes: L.CRS, + + options: { + transformation: new L.Transformation(1, 0, -1, 0) + }, + + initialize: function(a, b, c) { + var code, + proj, + def, + options; + + if (L.Proj._isProj4Obj(a)) { + proj = a; + code = proj.srsCode; + options = b || {}; + + this.projection = new L.Proj.Projection(proj, options.bounds); + } else { + code = a; + def = b; + options = c || {}; + this.projection = new L.Proj.Projection(code, def, options.bounds); + } + + L.Util.setOptions(this, options); + this.code = code; + this.transformation = this.options.transformation; + + if (this.options.origin) { + this.transformation = + new L.Transformation(1, -this.options.origin[0], + -1, this.options.origin[1]); + } + + if (this.options.scales) { + this._scales = this.options.scales; + } else if (this.options.resolutions) { + this._scales = []; + for (var i = this.options.resolutions.length - 1; i >= 0; i--) { + if (this.options.resolutions[i]) { + this._scales[i] = 1 / this.options.resolutions[i]; + } + } + } + + this.infinite = !this.options.bounds; + + }, + + scale: function(zoom) { + var iZoom = Math.floor(zoom), + baseScale, + nextScale, + scaleDiff, + zDiff; + if (zoom === iZoom) { + return this._scales[zoom]; + } else { + // Non-integer zoom, interpolate + baseScale = this._scales[iZoom]; + nextScale = this._scales[iZoom + 1]; + scaleDiff = nextScale - baseScale; + zDiff = (zoom - iZoom); + return baseScale + scaleDiff * zDiff; + } + }, + + zoom: function(scale) { + // Find closest number in this._scales, down + var downScale = this._closestElement(this._scales, scale), + downZoom = this._scales.indexOf(downScale), + nextScale, + nextZoom, + scaleDiff; + // Check if scale is downScale => return array index + if (scale === downScale) { + return downZoom; + } + if (downScale === undefined) { + return -Infinity; + } + // Interpolate + nextZoom = downZoom + 1; + nextScale = this._scales[nextZoom]; + if (nextScale === undefined) { + return Infinity; + } + scaleDiff = nextScale - downScale; + return (scale - downScale) / scaleDiff + downZoom; + }, + + distance: L.CRS.Earth.distance, + + R: L.CRS.Earth.R, + + /* Get the closest lowest element in an array */ + _closestElement: function(array, element) { + var low; + for (var i = array.length; i--;) { + if (array[i] <= element && (low === undefined || low < array[i])) { + low = array[i]; + } + } + return low; + } + }); + + L.Proj.GeoJSON = L.GeoJSON.extend({ + initialize: function(geojson, options) { + this._callLevel = 0; + L.GeoJSON.prototype.initialize.call(this, geojson, options); + }, + + addData: function(geojson) { + var crs; + + if (geojson) { + if (geojson.crs && geojson.crs.type === 'name') { + crs = new L.Proj.CRS(geojson.crs.properties.name); + } else if (geojson.crs && geojson.crs.type) { + crs = new L.Proj.CRS(geojson.crs.type + ':' 
+ geojson.crs.properties.code); + } + + if (crs !== undefined) { + this.options.coordsToLatLng = function(coords) { + var point = L.point(coords[0], coords[1]); + return crs.projection.unproject(point); + }; + } + } + + // Base class' addData might call us recursively, but + // CRS shouldn't be cleared in that case, since CRS applies + // to the whole GeoJSON, inluding sub-features. + this._callLevel++; + try { + L.GeoJSON.prototype.addData.call(this, geojson); + } finally { + this._callLevel--; + if (this._callLevel === 0) { + delete this.options.coordsToLatLng; + } + } + } + }); + + L.Proj.geoJson = function(geojson, options) { + return new L.Proj.GeoJSON(geojson, options); + }; + + L.Proj.ImageOverlay = L.ImageOverlay.extend({ + initialize: function (url, bounds, options) { + L.ImageOverlay.prototype.initialize.call(this, url, null, options); + this._projectedBounds = bounds; + }, + + // Danger ahead: Overriding internal methods in Leaflet. + // Decided to do this rather than making a copy of L.ImageOverlay + // and doing very tiny modifications to it. + // Future will tell if this was wise or not. + _animateZoom: function (event) { + var scale = this._map.getZoomScale(event.zoom); + var northWest = L.point(this._projectedBounds.min.x, this._projectedBounds.max.y); + var offset = this._projectedToNewLayerPoint(northWest, event.zoom, event.center); + + L.DomUtil.setTransform(this._image, offset, scale); + }, + + _reset: function () { + var zoom = this._map.getZoom(); + var pixelOrigin = this._map.getPixelOrigin(); + var bounds = L.bounds( + this._transform(this._projectedBounds.min, zoom)._subtract(pixelOrigin), + this._transform(this._projectedBounds.max, zoom)._subtract(pixelOrigin) + ); + var size = bounds.getSize(); + + L.DomUtil.setPosition(this._image, bounds.min); + this._image.style.width = size.x + 'px'; + this._image.style.height = size.y + 'px'; + }, + + _projectedToNewLayerPoint: function (point, zoom, center) { + var viewHalf = this._map.getSize()._divideBy(2); + var newTopLeft = this._map.project(center, zoom)._subtract(viewHalf)._round(); + var topLeft = newTopLeft.add(this._map._getMapPanePos()); + + return this._transform(point, zoom)._subtract(topLeft); + }, + + _transform: function (point, zoom) { + var crs = this._map.options.crs; + var transformation = crs.transformation; + var scale = crs.scale(zoom); + + return transformation.transform(point, scale); + } + }); + + L.Proj.imageOverlay = function (url, bounds, options) { + return new L.Proj.ImageOverlay(url, bounds, options); + }; + + return L.Proj; +})); diff --git a/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/countdown.css b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/countdown.css new file mode 100644 index 00000000..bf387012 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/countdown.css @@ -0,0 +1,144 @@ +.countdown { + background: inherit; + position: absolute; + cursor: pointer; + font-size: 3rem; + line-height: 1; + border-color: #ddd; + border-width: 3px; + border-style: solid; + border-radius: 15px; + box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 4px 10px 0px rgba(50, 50, 50, 0.4); + margin: 0.6em; + padding: 10px 15px; + text-align: center; + z-index: 10; + -webkit-user-select: none; + -moz-user-select: none; + -ms-user-select: none; + user-select: none; +} +.countdown { + display: flex; + align-items: center; + justify-content: center; +} +.countdown .countdown-time { 
+ background: none; + font-size: 100%; + padding: 0; +} +.countdown-digits { + color: inherit; +} +.countdown.running { + border-color: #2A9B59FF; + background-color: #43AC6A; +} +.countdown.running .countdown-digits { + color: #002F14FF; +} +.countdown.finished { + border-color: #DE3000FF; + background-color: #F04124; +} +.countdown.finished .countdown-digits { + color: #4A0900FF; +} +.countdown.running.warning { + border-color: #CEAC04FF; + background-color: #E6C229; +} +.countdown.running.warning .countdown-digits { + color: #3A2F02FF; +} + +.countdown.running.blink-colon .countdown-digits.colon { + opacity: 0.1; +} + +/* ------ Controls ------ */ +.countdown:not(.running) .countdown-controls { + display: none; +} + +.countdown-controls { + position: absolute; + top: -0.5rem; + right: -0.5rem; + left: -0.5rem; + display: flex; + justify-content: space-between; + margin: 0; + padding: 0; +} + +.countdown-controls > button { + font-size: 1.5rem; + width: 1rem; + height: 1rem; + display: inline-block; + display: flex; + flex-direction: column; + align-items: center; + justify-content: center; + font-family: monospace; + padding: 10px; + margin: 0; + background: inherit; + border: 2px solid; + border-radius: 100%; + transition: 50ms transform ease-in-out, 150ms opacity ease-in; + --countdown-transition-distance: 10px; +} + +.countdown .countdown-controls > button:last-child { + transform: translate(calc(-1 * var(--countdown-transition-distance)), var(--countdown-transition-distance)); + opacity: 0; + color: #002F14FF; + background-color: #43AC6A; + border-color: #2A9B59FF; +} + +.countdown .countdown-controls > button:first-child { + transform: translate(var(--countdown-transition-distance), var(--countdown-transition-distance)); + opacity: 0; + color: #4A0900FF; + background-color: #F04124; + border-color: #DE3000FF; +} + +.countdown.running:hover .countdown-controls > button, +.countdown.running:focus-within .countdown-controls > button{ + transform: translate(0, 0); + opacity: 1; +} + +.countdown.running:hover .countdown-controls > button:hover, +.countdown.running:focus-within .countdown-controls > button:hover{ + transform: translate(0, calc(var(--countdown-transition-distance) / -2)); + box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); + -webkit-box-shadow: 0px 2px 5px 0px rgba(50, 50, 50, 0.4); +} + +.countdown.running:hover .countdown-controls > button:active, +.countdown.running:focus-within .countdown-controls > button:active{ + transform: translate(0, calc(var(--coutndown-transition-distance) / -5)); +} + +/* ----- Fullscreen ----- */ +.countdown.countdown-fullscreen { + z-index: 0; +} + +.countdown-fullscreen.running .countdown-controls { + top: 1rem; + left: 0; + right: 0; + justify-content: center; +} + +.countdown-fullscreen.running .countdown-controls > button + button { + margin-left: 1rem; +} diff --git a/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/countdown.js b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/countdown.js new file mode 100644 index 00000000..a058ad8f --- /dev/null +++ b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/countdown.js @@ -0,0 +1,478 @@ +/* globals Shiny,Audio */ +class CountdownTimer { + constructor (el, opts) { + if (typeof el === 'string' || el instanceof String) { + el = document.querySelector(el) + } + + if (el.counter) { + return el.counter + } + + const minutes = parseInt(el.querySelector('.minutes').innerText || '0') + const seconds = 
parseInt(el.querySelector('.seconds').innerText || '0') + const duration = minutes * 60 + seconds + + function attrIsTrue (x) { + if (x === true) return true + return !!(x === 'true' || x === '' || x === '1') + } + + this.element = el + this.duration = duration + this.end = null + this.is_running = false + this.warn_when = parseInt(el.dataset.warnWhen) || -1 + this.update_every = parseInt(el.dataset.updateEvery) || 1 + this.play_sound = attrIsTrue(el.dataset.playSound) + this.blink_colon = attrIsTrue(el.dataset.blinkColon) + this.startImmediately = attrIsTrue(el.dataset.startImmediately) + this.timeout = null + this.display = { minutes, seconds } + + if (opts.src_location) { + this.src_location = opts.src_location + } + + this.addEventListeners() + } + + addEventListeners () { + const self = this + + if (this.startImmediately) { + if (window.remark && window.slideshow) { + // Remark (xaringan) support + const isOnVisibleSlide = () => { + return document.querySelector('.remark-visible').contains(self.element) + } + if (isOnVisibleSlide()) { + self.start() + } else { + let started_once = 0 + window.slideshow.on('afterShowSlide', function () { + if (started_once > 0) return + if (isOnVisibleSlide()) { + self.start() + started_once = 1 + } + }) + } + } else if (window.Reveal) { + // Revealjs (quarto) support + const isOnVisibleSlide = () => { + const currentSlide = document.querySelector('.reveal .slide.present') + return currentSlide ? currentSlide.contains(self.element) : false + } + if (isOnVisibleSlide()) { + self.start() + } else { + const revealStartTimer = () => { + if (isOnVisibleSlide()) { + self.start() + window.Reveal.off('slidechanged', revealStartTimer) + } + } + window.Reveal.on('slidechanged', revealStartTimer) + } + } else if (window.IntersectionObserver) { + // All other situtations use IntersectionObserver + const onVisible = (element, callback) => { + new window.IntersectionObserver((entries, observer) => { + entries.forEach(entry => { + if (entry.intersectionRatio > 0) { + callback(element) + observer.disconnect() + } + }) + }).observe(element) + } + onVisible(this.element, el => el.countdown.start()) + } else { + // or just start the timer as soon as it's initialized + this.start() + } + } + + function haltEvent (ev) { + ev.preventDefault() + ev.stopPropagation() + } + function isSpaceOrEnter (ev) { + return ev.code === 'Space' || ev.code === 'Enter' + } + function isArrowUpOrDown (ev) { + return ev.code === 'ArrowUp' || ev.code === 'ArrowDown' + } + + ;['click', 'touchend'].forEach(function (eventType) { + self.element.addEventListener(eventType, function (ev) { + haltEvent(ev) + self.is_running ? self.stop() : self.start() + }) + }) + this.element.addEventListener('keydown', function (ev) { + if (ev.code === "Escape") { + self.reset() + haltEvent(ev) + } + if (!isSpaceOrEnter(ev) && !isArrowUpOrDown(ev)) return + haltEvent(ev) + if (isSpaceOrEnter(ev)) { + self.is_running ? 
self.stop() : self.start() + return + } + + if (!self.is_running) return + + if (ev.code === 'ArrowUp') { + self.bumpUp() + } else if (ev.code === 'ArrowDown') { + self.bumpDown() + } + }) + this.element.addEventListener('dblclick', function (ev) { + haltEvent(ev) + if (self.is_running) self.reset() + }) + this.element.addEventListener('touchmove', haltEvent) + + const btnBumpDown = this.element.querySelector('.countdown-bump-down') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpDown.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpDown() + }) + }) + btnBumpDown.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpDown() + }) + + const btnBumpUp = this.element.querySelector('.countdown-bump-up') + ;['click', 'touchend'].forEach(function (eventType) { + btnBumpUp.addEventListener(eventType, function (ev) { + haltEvent(ev) + if (self.is_running) self.bumpUp() + }) + }) + btnBumpUp.addEventListener('keydown', function (ev) { + if (!isSpaceOrEnter(ev) || !self.is_running) return + haltEvent(ev) + self.bumpUp() + }) + this.element.querySelector('.countdown-controls').addEventListener('dblclick', function (ev) { + haltEvent(ev) + }) + } + + remainingTime () { + const remaining = this.is_running + ? (this.end - Date.now()) / 1000 + : this.remaining || this.duration + + let minutes = Math.floor(remaining / 60) + let seconds = Math.ceil(remaining - minutes * 60) + + if (seconds > 59) { + minutes = minutes + 1 + seconds = seconds - 60 + } + + return { remaining, minutes, seconds } + } + + start () { + if (this.is_running) return + + this.is_running = true + + if (this.remaining) { + // Having a static remaining time indicates timer was paused + this.end = Date.now() + this.remaining * 1000 + this.remaining = null + } else { + this.end = Date.now() + this.duration * 1000 + } + + this.reportStateToShiny('start') + + this.element.classList.remove('finished') + this.element.classList.add('running') + this.update(true) + this.tick() + } + + tick (run_again) { + if (typeof run_again === 'undefined') { + run_again = true + } + + if (!this.is_running) return + + const { seconds: secondsWas } = this.display + this.update() + + if (run_again) { + const delay = (this.end - Date.now() > 10000) ? 
1000 : 250 + this.blinkColon(secondsWas) + this.timeout = setTimeout(this.tick.bind(this), delay) + } + } + + blinkColon (secondsWas) { + // don't blink unless option is set + if (!this.blink_colon) return + // warn_when always updates the seconds + if (this.warn_when > 0 && Date.now() + this.warn_when > this.end) { + this.element.classList.remove('blink-colon') + return + } + const { seconds: secondsIs } = this.display + if (secondsIs > 10 || secondsWas !== secondsIs) { + this.element.classList.toggle('blink-colon') + } + } + + update (force) { + if (typeof force === 'undefined') { + force = false + } + + const { remaining, minutes, seconds } = this.remainingTime() + + const setRemainingTime = (selector, time) => { + const timeContainer = this.element.querySelector(selector) + if (!timeContainer) return + time = Math.max(time, 0) + timeContainer.innerText = String(time).padStart(2, 0) + } + + if (this.is_running && remaining < 0.25) { + this.stop() + setRemainingTime('.minutes', 0) + setRemainingTime('.seconds', 0) + this.playSound() + return + } + + const should_update = force || + Math.round(remaining) < this.warn_when || + Math.round(remaining) % this.update_every === 0 + + if (should_update) { + this.element.classList.toggle('warning', remaining <= this.warn_when) + this.display = { minutes, seconds } + setRemainingTime('.minutes', minutes) + setRemainingTime('.seconds', seconds) + } + } + + stop () { + const { remaining } = this.remainingTime() + if (remaining > 1) { + this.remaining = remaining + } + this.element.classList.remove('running') + this.element.classList.remove('warning') + this.element.classList.remove('blink-colon') + this.element.classList.add('finished') + this.is_running = false + this.end = null + this.reportStateToShiny('stop') + this.timeout = clearTimeout(this.timeout) + } + + reset () { + this.stop() + this.remaining = null + this.update(true) + this.reportStateToShiny('reset') + this.element.classList.remove('finished') + this.element.classList.remove('warning') + } + + setValues (opts) { + if (typeof opts.warn_when !== 'undefined') { + this.warn_when = opts.warn_when + } + if (typeof opts.update_every !== 'undefined') { + this.update_every = opts.update_every + } + if (typeof opts.blink_colon !== 'undefined') { + this.blink_colon = opts.blink_colon + if (!opts.blink_colon) { + this.element.classList.remove('blink-colon') + } + } + if (typeof opts.play_sound !== 'undefined') { + this.play_sound = opts.play_sound + } + if (typeof opts.duration !== 'undefined') { + this.duration = opts.duration + if (this.is_running) { + this.reset() + this.start() + } + } + this.reportStateToShiny('update') + this.update(true) + } + + bumpTimer (val, round) { + round = typeof round === 'boolean' ? round : true + const { remaining } = this.remainingTime() + let newRemaining = remaining + val + if (newRemaining <= 0) { + this.setRemaining(0) + this.stop() + return + } + if (round && newRemaining > 10) { + newRemaining = Math.round(newRemaining / 5) * 5 + } + this.setRemaining(newRemaining) + this.reportStateToShiny(val > 0 ? 
'bumpUp' : 'bumpDown') + this.update(true) + } + + bumpUp (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + bumpDown (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.bumpTimer( + val || -1 * this.bumpIncrementValue(), + typeof val === 'undefined' + ) + } + + setRemaining (val) { + if (!this.is_running) { + console.error('timer is not running') + return + } + this.end = Date.now() + val * 1000 + this.update(true) + } + + playSound () { + let url = this.play_sound + if (!url) return + if (typeof url === 'boolean') { + const src = this.src_location + ? this.src_location.replace('/countdown.js', '') + : 'libs/countdown' + url = src + '/smb_stage_clear.mp3' + } + const sound = new Audio(url) + sound.play() + } + + bumpIncrementValue (val) { + val = val || this.remainingTime().remaining + if (val <= 30) { + return 5 + } else if (val <= 300) { + return 15 + } else if (val <= 3000) { + return 30 + } else { + return 60 + } + } + + reportStateToShiny (action) { + if (!window.Shiny) return + + const inputId = this.element.id + const data = { + event: { + action, + time: new Date().toISOString() + }, + timer: { + is_running: this.is_running, + end: this.end ? new Date(this.end).toISOString() : null, + remaining: this.remainingTime() + } + } + + function shinySetInputValue () { + if (!window.Shiny.setInputValue) { + setTimeout(shinySetInputValue, 100) + return + } + window.Shiny.setInputValue(inputId, data) + } + + shinySetInputValue() + } +} + +(function () { + const CURRENT_SCRIPT = document.currentScript.getAttribute('src') + + document.addEventListener('DOMContentLoaded', function () { + const els = document.querySelectorAll('.countdown') + if (!els || !els.length) { + return + } + els.forEach(function (el) { + el.countdown = new CountdownTimer(el, { src_location: CURRENT_SCRIPT }) + }) + + if (window.Shiny) { + Shiny.addCustomMessageHandler('countdown:update', function (x) { + if (!x.id) { + console.error('No `id` provided, cannot update countdown') + return + } + const el = document.getElementById(x.id) + el.countdown.setValues(x) + }) + + Shiny.addCustomMessageHandler('countdown:start', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.start() + }) + + Shiny.addCustomMessageHandler('countdown:stop', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.stop() + }) + + Shiny.addCustomMessageHandler('countdown:reset', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.reset() + }) + + Shiny.addCustomMessageHandler('countdown:bumpUp', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpUp() + }) + + Shiny.addCustomMessageHandler('countdown:bumpDown', function (id) { + const el = document.getElementById(id) + if (!el) return + el.countdown.bumpDown() + }) + } + }) +})() diff --git a/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/smb_stage_clear.mp3 b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/smb_stage_clear.mp3 new file mode 100644 index 00000000..da2ddc2c Binary files /dev/null and b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/countdown-0.4.0/smb_stage_clear.mp3 differ diff --git a/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/htmlwidgets-1.6.2/htmlwidgets.js 
b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/htmlwidgets-1.6.2/htmlwidgets.js new file mode 100644 index 00000000..1067d029 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/htmlwidgets-1.6.2/htmlwidgets.js @@ -0,0 +1,901 @@ +(function() { + // If window.HTMLWidgets is already defined, then use it; otherwise create a + // new object. This allows preceding code to set options that affect the + // initialization process (though none currently exist). + window.HTMLWidgets = window.HTMLWidgets || {}; + + // See if we're running in a viewer pane. If not, we're in a web browser. + var viewerMode = window.HTMLWidgets.viewerMode = + /\bviewer_pane=1\b/.test(window.location); + + // See if we're running in Shiny mode. If not, it's a static document. + // Note that static widgets can appear in both Shiny and static modes, but + // obviously, Shiny widgets can only appear in Shiny apps/documents. + var shinyMode = window.HTMLWidgets.shinyMode = + typeof(window.Shiny) !== "undefined" && !!window.Shiny.outputBindings; + + // We can't count on jQuery being available, so we implement our own + // version if necessary. + function querySelectorAll(scope, selector) { + if (typeof(jQuery) !== "undefined" && scope instanceof jQuery) { + return scope.find(selector); + } + if (scope.querySelectorAll) { + return scope.querySelectorAll(selector); + } + } + + function asArray(value) { + if (value === null) + return []; + if ($.isArray(value)) + return value; + return [value]; + } + + // Implement jQuery's extend + function extend(target /*, ... */) { + if (arguments.length == 1) { + return target; + } + for (var i = 1; i < arguments.length; i++) { + var source = arguments[i]; + for (var prop in source) { + if (source.hasOwnProperty(prop)) { + target[prop] = source[prop]; + } + } + } + return target; + } + + // IE8 doesn't support Array.forEach. + function forEach(values, callback, thisArg) { + if (values.forEach) { + values.forEach(callback, thisArg); + } else { + for (var i = 0; i < values.length; i++) { + callback.call(thisArg, values[i], i, values); + } + } + } + + // Replaces the specified method with the return value of funcSource. + // + // Note that funcSource should not BE the new method, it should be a function + // that RETURNS the new method. funcSource receives a single argument that is + // the overridden method, it can be called from the new method. The overridden + // method can be called like a regular function, it has the target permanently + // bound to it so "this" will work correctly. + function overrideMethod(target, methodName, funcSource) { + var superFunc = target[methodName] || function() {}; + var superFuncBound = function() { + return superFunc.apply(target, arguments); + }; + target[methodName] = funcSource(superFuncBound); + } + + // Add a method to delegator that, when invoked, calls + // delegatee.methodName. If there is no such method on + // the delegatee, but there was one on delegator before + // delegateMethod was called, then the original version + // is invoked instead. + // For example: + // + // var a = { + // method1: function() { console.log('a1'); } + // method2: function() { console.log('a2'); } + // }; + // var b = { + // method1: function() { console.log('b1'); } + // }; + // delegateMethod(a, b, "method1"); + // delegateMethod(a, b, "method2"); + // a.method1(); + // a.method2(); + // + // The output would be "b1", "a2". 
+ function delegateMethod(delegator, delegatee, methodName) { + var inherited = delegator[methodName]; + delegator[methodName] = function() { + var target = delegatee; + var method = delegatee[methodName]; + + // The method doesn't exist on the delegatee. Instead, + // call the method on the delegator, if it exists. + if (!method) { + target = delegator; + method = inherited; + } + + if (method) { + return method.apply(target, arguments); + } + }; + } + + // Implement a vague facsimilie of jQuery's data method + function elementData(el, name, value) { + if (arguments.length == 2) { + return el["htmlwidget_data_" + name]; + } else if (arguments.length == 3) { + el["htmlwidget_data_" + name] = value; + return el; + } else { + throw new Error("Wrong number of arguments for elementData: " + + arguments.length); + } + } + + // http://stackoverflow.com/questions/3446170/escape-string-for-use-in-javascript-regex + function escapeRegExp(str) { + return str.replace(/[\-\[\]\/\{\}\(\)\*\+\?\.\\\^\$\|]/g, "\\$&"); + } + + function hasClass(el, className) { + var re = new RegExp("\\b" + escapeRegExp(className) + "\\b"); + return re.test(el.className); + } + + // elements - array (or array-like object) of HTML elements + // className - class name to test for + // include - if true, only return elements with given className; + // if false, only return elements *without* given className + function filterByClass(elements, className, include) { + var results = []; + for (var i = 0; i < elements.length; i++) { + if (hasClass(elements[i], className) == include) + results.push(elements[i]); + } + return results; + } + + function on(obj, eventName, func) { + if (obj.addEventListener) { + obj.addEventListener(eventName, func, false); + } else if (obj.attachEvent) { + obj.attachEvent(eventName, func); + } + } + + function off(obj, eventName, func) { + if (obj.removeEventListener) + obj.removeEventListener(eventName, func, false); + else if (obj.detachEvent) { + obj.detachEvent(eventName, func); + } + } + + // Translate array of values to top/right/bottom/left, as usual with + // the "padding" CSS property + // https://developer.mozilla.org/en-US/docs/Web/CSS/padding + function unpackPadding(value) { + if (typeof(value) === "number") + value = [value]; + if (value.length === 1) { + return {top: value[0], right: value[0], bottom: value[0], left: value[0]}; + } + if (value.length === 2) { + return {top: value[0], right: value[1], bottom: value[0], left: value[1]}; + } + if (value.length === 3) { + return {top: value[0], right: value[1], bottom: value[2], left: value[1]}; + } + if (value.length === 4) { + return {top: value[0], right: value[1], bottom: value[2], left: value[3]}; + } + } + + // Convert an unpacked padding object to a CSS value + function paddingToCss(paddingObj) { + return paddingObj.top + "px " + paddingObj.right + "px " + paddingObj.bottom + "px " + paddingObj.left + "px"; + } + + // Makes a number suitable for CSS + function px(x) { + if (typeof(x) === "number") + return x + "px"; + else + return x; + } + + // Retrieves runtime widget sizing information for an element. + // The return value is either null, or an object with fill, padding, + // defaultWidth, defaultHeight fields. 
+ function sizingPolicy(el) { + var sizingEl = document.querySelector("script[data-for='" + el.id + "'][type='application/htmlwidget-sizing']"); + if (!sizingEl) + return null; + var sp = JSON.parse(sizingEl.textContent || sizingEl.text || "{}"); + if (viewerMode) { + return sp.viewer; + } else { + return sp.browser; + } + } + + // @param tasks Array of strings (or falsy value, in which case no-op). + // Each element must be a valid JavaScript expression that yields a + // function. Or, can be an array of objects with "code" and "data" + // properties; in this case, the "code" property should be a string + // of JS that's an expr that yields a function, and "data" should be + // an object that will be added as an additional argument when that + // function is called. + // @param target The object that will be "this" for each function + // execution. + // @param args Array of arguments to be passed to the functions. (The + // same arguments will be passed to all functions.) + function evalAndRun(tasks, target, args) { + if (tasks) { + forEach(tasks, function(task) { + var theseArgs = args; + if (typeof(task) === "object") { + theseArgs = theseArgs.concat([task.data]); + task = task.code; + } + var taskFunc = tryEval(task); + if (typeof(taskFunc) !== "function") { + throw new Error("Task must be a function! Source:\n" + task); + } + taskFunc.apply(target, theseArgs); + }); + } + } + + // Attempt eval() both with and without enclosing in parentheses. + // Note that enclosing coerces a function declaration into + // an expression that eval() can parse + // (otherwise, a SyntaxError is thrown) + function tryEval(code) { + var result = null; + try { + result = eval("(" + code + ")"); + } catch(error) { + if (!(error instanceof SyntaxError)) { + throw error; + } + try { + result = eval(code); + } catch(e) { + if (e instanceof SyntaxError) { + throw error; + } else { + throw e; + } + } + } + return result; + } + + function initSizing(el) { + var sizing = sizingPolicy(el); + if (!sizing) + return; + + var cel = document.getElementById("htmlwidget_container"); + if (!cel) + return; + + if (typeof(sizing.padding) !== "undefined") { + document.body.style.margin = "0"; + document.body.style.padding = paddingToCss(unpackPadding(sizing.padding)); + } + + if (sizing.fill) { + document.body.style.overflow = "hidden"; + document.body.style.width = "100%"; + document.body.style.height = "100%"; + document.documentElement.style.width = "100%"; + document.documentElement.style.height = "100%"; + cel.style.position = "absolute"; + var pad = unpackPadding(sizing.padding); + cel.style.top = pad.top + "px"; + cel.style.right = pad.right + "px"; + cel.style.bottom = pad.bottom + "px"; + cel.style.left = pad.left + "px"; + el.style.width = "100%"; + el.style.height = "100%"; + + return { + getWidth: function() { return cel.getBoundingClientRect().width; }, + getHeight: function() { return cel.getBoundingClientRect().height; } + }; + + } else { + el.style.width = px(sizing.width); + el.style.height = px(sizing.height); + + return { + getWidth: function() { return cel.getBoundingClientRect().width; }, + getHeight: function() { return cel.getBoundingClientRect().height; } + }; + } + } + + // Default implementations for methods + var defaults = { + find: function(scope) { + return querySelectorAll(scope, "." 
+ this.name); + }, + renderError: function(el, err) { + var $el = $(el); + + this.clearError(el); + + // Add all these error classes, as Shiny does + var errClass = "shiny-output-error"; + if (err.type !== null) { + // use the classes of the error condition as CSS class names + errClass = errClass + " " + $.map(asArray(err.type), function(type) { + return errClass + "-" + type; + }).join(" "); + } + errClass = errClass + " htmlwidgets-error"; + + // Is el inline or block? If inline or inline-block, just display:none it + // and add an inline error. + var display = $el.css("display"); + $el.data("restore-display-mode", display); + + if (display === "inline" || display === "inline-block") { + $el.hide(); + if (err.message !== "") { + var errorSpan = $("").addClass(errClass); + errorSpan.text(err.message); + $el.after(errorSpan); + } + } else if (display === "block") { + // If block, add an error just after the el, set visibility:none on the + // el, and position the error to be on top of the el. + // Mark it with a unique ID and CSS class so we can remove it later. + $el.css("visibility", "hidden"); + if (err.message !== "") { + var errorDiv = $("
").addClass(errClass).css("position", "absolute") + .css("top", el.offsetTop) + .css("left", el.offsetLeft) + // setting width can push out the page size, forcing otherwise + // unnecessary scrollbars to appear and making it impossible for + // the element to shrink; so use max-width instead + .css("maxWidth", el.offsetWidth) + .css("height", el.offsetHeight); + errorDiv.text(err.message); + $el.after(errorDiv); + + // Really dumb way to keep the size/position of the error in sync with + // the parent element as the window is resized or whatever. + var intId = setInterval(function() { + if (!errorDiv[0].parentElement) { + clearInterval(intId); + return; + } + errorDiv + .css("top", el.offsetTop) + .css("left", el.offsetLeft) + .css("maxWidth", el.offsetWidth) + .css("height", el.offsetHeight); + }, 500); + } + } + }, + clearError: function(el) { + var $el = $(el); + var display = $el.data("restore-display-mode"); + $el.data("restore-display-mode", null); + + if (display === "inline" || display === "inline-block") { + if (display) + $el.css("display", display); + $(el.nextSibling).filter(".htmlwidgets-error").remove(); + } else if (display === "block"){ + $el.css("visibility", "inherit"); + $(el.nextSibling).filter(".htmlwidgets-error").remove(); + } + }, + sizing: {} + }; + + // Called by widget bindings to register a new type of widget. The definition + // object can contain the following properties: + // - name (required) - A string indicating the binding name, which will be + // used by default as the CSS classname to look for. + // - initialize (optional) - A function(el) that will be called once per + // widget element; if a value is returned, it will be passed as the third + // value to renderValue. + // - renderValue (required) - A function(el, data, initValue) that will be + // called with data. Static contexts will cause this to be called once per + // element; Shiny apps will cause this to be called multiple times per + // element, as the data changes. + window.HTMLWidgets.widget = function(definition) { + if (!definition.name) { + throw new Error("Widget must have a name"); + } + if (!definition.type) { + throw new Error("Widget must have a type"); + } + // Currently we only support output widgets + if (definition.type !== "output") { + throw new Error("Unrecognized widget type '" + definition.type + "'"); + } + // TODO: Verify that .name is a valid CSS classname + + // Support new-style instance-bound definitions. Old-style class-bound + // definitions have one widget "object" per widget per type/class of + // widget; the renderValue and resize methods on such widget objects + // take el and instance arguments, because the widget object can't + // store them. New-style instance-bound definitions have one widget + // object per widget instance; the definition that's passed in doesn't + // provide renderValue or resize methods at all, just the single method + // factory(el, width, height) + // which returns an object that has renderValue(x) and resize(w, h). + // This enables a far more natural programming style for the widget + // author, who can store per-instance state using either OO-style + // instance fields or functional-style closure variables (I guess this + // is in contrast to what can only be called C-style pseudo-OO which is + // what we required before). 
+ if (definition.factory) { + definition = createLegacyDefinitionAdapter(definition); + } + + if (!definition.renderValue) { + throw new Error("Widget must have a renderValue function"); + } + + // For static rendering (non-Shiny), use a simple widget registration + // scheme. We also use this scheme for Shiny apps/documents that also + // contain static widgets. + window.HTMLWidgets.widgets = window.HTMLWidgets.widgets || []; + // Merge defaults into the definition; don't mutate the original definition. + var staticBinding = extend({}, defaults, definition); + overrideMethod(staticBinding, "find", function(superfunc) { + return function(scope) { + var results = superfunc(scope); + // Filter out Shiny outputs, we only want the static kind + return filterByClass(results, "html-widget-output", false); + }; + }); + window.HTMLWidgets.widgets.push(staticBinding); + + if (shinyMode) { + // Shiny is running. Register the definition with an output binding. + // The definition itself will not be the output binding, instead + // we will make an output binding object that delegates to the + // definition. This is because we foolishly used the same method + // name (renderValue) for htmlwidgets definition and Shiny bindings + // but they actually have quite different semantics (the Shiny + // bindings receive data that includes lots of metadata that it + // strips off before calling htmlwidgets renderValue). We can't + // just ignore the difference because in some widgets it's helpful + // to call this.renderValue() from inside of resize(), and if + // we're not delegating, then that call will go to the Shiny + // version instead of the htmlwidgets version. + + // Merge defaults with definition, without mutating either. + var bindingDef = extend({}, defaults, definition); + + // This object will be our actual Shiny binding. + var shinyBinding = new Shiny.OutputBinding(); + + // With a few exceptions, we'll want to simply use the bindingDef's + // version of methods if they are available, otherwise fall back to + // Shiny's defaults. NOTE: If Shiny's output bindings gain additional + // methods in the future, and we want them to be overrideable by + // HTMLWidget binding definitions, then we'll need to add them to this + // list. + delegateMethod(shinyBinding, bindingDef, "getId"); + delegateMethod(shinyBinding, bindingDef, "onValueChange"); + delegateMethod(shinyBinding, bindingDef, "onValueError"); + delegateMethod(shinyBinding, bindingDef, "renderError"); + delegateMethod(shinyBinding, bindingDef, "clearError"); + delegateMethod(shinyBinding, bindingDef, "showProgress"); + + // The find, renderValue, and resize are handled differently, because we + // want to actually decorate the behavior of the bindingDef methods. + + shinyBinding.find = function(scope) { + var results = bindingDef.find(scope); + + // Only return elements that are Shiny outputs, not static ones + var dynamicResults = results.filter(".html-widget-output"); + + // It's possible that whatever caused Shiny to think there might be + // new dynamic outputs, also caused there to be new static outputs. + // Since there might be lots of different htmlwidgets bindings, we + // schedule execution for later--no need to staticRender multiple + // times. + if (results.length !== dynamicResults.length) + scheduleStaticRender(); + + return dynamicResults; + }; + + // Wrap renderValue to handle initialization, which unfortunately isn't + // supported natively by Shiny at the time of this writing. 
+ + shinyBinding.renderValue = function(el, data) { + Shiny.renderDependencies(data.deps); + // Resolve strings marked as javascript literals to objects + if (!(data.evals instanceof Array)) data.evals = [data.evals]; + for (var i = 0; data.evals && i < data.evals.length; i++) { + window.HTMLWidgets.evaluateStringMember(data.x, data.evals[i]); + } + if (!bindingDef.renderOnNullValue) { + if (data.x === null) { + el.style.visibility = "hidden"; + return; + } else { + el.style.visibility = "inherit"; + } + } + if (!elementData(el, "initialized")) { + initSizing(el); + + elementData(el, "initialized", true); + if (bindingDef.initialize) { + var rect = el.getBoundingClientRect(); + var result = bindingDef.initialize(el, rect.width, rect.height); + elementData(el, "init_result", result); + } + } + bindingDef.renderValue(el, data.x, elementData(el, "init_result")); + evalAndRun(data.jsHooks.render, elementData(el, "init_result"), [el, data.x]); + }; + + // Only override resize if bindingDef implements it + if (bindingDef.resize) { + shinyBinding.resize = function(el, width, height) { + // Shiny can call resize before initialize/renderValue have been + // called, which doesn't make sense for widgets. + if (elementData(el, "initialized")) { + bindingDef.resize(el, width, height, elementData(el, "init_result")); + } + }; + } + + Shiny.outputBindings.register(shinyBinding, bindingDef.name); + } + }; + + var scheduleStaticRenderTimerId = null; + function scheduleStaticRender() { + if (!scheduleStaticRenderTimerId) { + scheduleStaticRenderTimerId = setTimeout(function() { + scheduleStaticRenderTimerId = null; + window.HTMLWidgets.staticRender(); + }, 1); + } + } + + // Render static widgets after the document finishes loading + // Statically render all elements that are of this widget's class + window.HTMLWidgets.staticRender = function() { + var bindings = window.HTMLWidgets.widgets || []; + forEach(bindings, function(binding) { + var matches = binding.find(document.documentElement); + forEach(matches, function(el) { + var sizeObj = initSizing(el, binding); + + var getSize = function(el) { + if (sizeObj) { + return {w: sizeObj.getWidth(), h: sizeObj.getHeight()} + } else { + var rect = el.getBoundingClientRect(); + return {w: rect.width, h: rect.height} + } + }; + + if (hasClass(el, "html-widget-static-bound")) + return; + el.className = el.className + " html-widget-static-bound"; + + var initResult; + if (binding.initialize) { + var size = getSize(el); + initResult = binding.initialize(el, size.w, size.h); + elementData(el, "init_result", initResult); + } + + if (binding.resize) { + var lastSize = getSize(el); + var resizeHandler = function(e) { + var size = getSize(el); + if (size.w === 0 && size.h === 0) + return; + if (size.w === lastSize.w && size.h === lastSize.h) + return; + lastSize = size; + binding.resize(el, size.w, size.h, initResult); + }; + + on(window, "resize", resizeHandler); + + // This is needed for cases where we're running in a Shiny + // app, but the widget itself is not a Shiny output, but + // rather a simple static widget. One example of this is + // an rmarkdown document that has runtime:shiny and widget + // that isn't in a render function. Shiny only knows to + // call resize handlers for Shiny outputs, not for static + // widgets, so we do it ourselves. 
+ if (window.jQuery) { + window.jQuery(document).on( + "shown.htmlwidgets shown.bs.tab.htmlwidgets shown.bs.collapse.htmlwidgets", + resizeHandler + ); + window.jQuery(document).on( + "hidden.htmlwidgets hidden.bs.tab.htmlwidgets hidden.bs.collapse.htmlwidgets", + resizeHandler + ); + } + + // This is needed for the specific case of ioslides, which + // flips slides between display:none and display:block. + // Ideally we would not have to have ioslide-specific code + // here, but rather have ioslides raise a generic event, + // but the rmarkdown package just went to CRAN so the + // window to getting that fixed may be long. + if (window.addEventListener) { + // It's OK to limit this to window.addEventListener + // browsers because ioslides itself only supports + // such browsers. + on(document, "slideenter", resizeHandler); + on(document, "slideleave", resizeHandler); + } + } + + var scriptData = document.querySelector("script[data-for='" + el.id + "'][type='application/json']"); + if (scriptData) { + var data = JSON.parse(scriptData.textContent || scriptData.text); + // Resolve strings marked as javascript literals to objects + if (!(data.evals instanceof Array)) data.evals = [data.evals]; + for (var k = 0; data.evals && k < data.evals.length; k++) { + window.HTMLWidgets.evaluateStringMember(data.x, data.evals[k]); + } + binding.renderValue(el, data.x, initResult); + evalAndRun(data.jsHooks.render, initResult, [el, data.x]); + } + }); + }); + + invokePostRenderHandlers(); + } + + + function has_jQuery3() { + if (!window.jQuery) { + return false; + } + var $version = window.jQuery.fn.jquery; + var $major_version = parseInt($version.split(".")[0]); + return $major_version >= 3; + } + + /* + / Shiny 1.4 bumped jQuery from 1.x to 3.x which means jQuery's + / on-ready handler (i.e., $(fn)) is now asyncronous (i.e., it now + / really means $(setTimeout(fn)). + / https://jquery.com/upgrade-guide/3.0/#breaking-change-document-ready-handlers-are-now-asynchronous + / + / Since Shiny uses $() to schedule initShiny, shiny>=1.4 calls initShiny + / one tick later than it did before, which means staticRender() is + / called renderValue() earlier than (advanced) widget authors might be expecting. + / https://github.com/rstudio/shiny/issues/2630 + / + / For a concrete example, leaflet has some methods (e.g., updateBounds) + / which reference Shiny methods registered in initShiny (e.g., setInputValue). + / Since leaflet is privy to this life-cycle, it knows to use setTimeout() to + / delay execution of those methods (until Shiny methods are ready) + / https://github.com/rstudio/leaflet/blob/18ec981/javascript/src/index.js#L266-L268 + / + / Ideally widget authors wouldn't need to use this setTimeout() hack that + / leaflet uses to call Shiny methods on a staticRender(). In the long run, + / the logic initShiny should be broken up so that method registration happens + / right away, but binding happens later. 
+ */ + function maybeStaticRenderLater() { + if (shinyMode && has_jQuery3()) { + window.jQuery(window.HTMLWidgets.staticRender); + } else { + window.HTMLWidgets.staticRender(); + } + } + + if (document.addEventListener) { + document.addEventListener("DOMContentLoaded", function() { + document.removeEventListener("DOMContentLoaded", arguments.callee, false); + maybeStaticRenderLater(); + }, false); + } else if (document.attachEvent) { + document.attachEvent("onreadystatechange", function() { + if (document.readyState === "complete") { + document.detachEvent("onreadystatechange", arguments.callee); + maybeStaticRenderLater(); + } + }); + } + + + window.HTMLWidgets.getAttachmentUrl = function(depname, key) { + // If no key, default to the first item + if (typeof(key) === "undefined") + key = 1; + + var link = document.getElementById(depname + "-" + key + "-attachment"); + if (!link) { + throw new Error("Attachment " + depname + "/" + key + " not found in document"); + } + return link.getAttribute("href"); + }; + + window.HTMLWidgets.dataframeToD3 = function(df) { + var names = []; + var length; + for (var name in df) { + if (df.hasOwnProperty(name)) + names.push(name); + if (typeof(df[name]) !== "object" || typeof(df[name].length) === "undefined") { + throw new Error("All fields must be arrays"); + } else if (typeof(length) !== "undefined" && length !== df[name].length) { + throw new Error("All fields must be arrays of the same length"); + } + length = df[name].length; + } + var results = []; + var item; + for (var row = 0; row < length; row++) { + item = {}; + for (var col = 0; col < names.length; col++) { + item[names[col]] = df[names[col]][row]; + } + results.push(item); + } + return results; + }; + + window.HTMLWidgets.transposeArray2D = function(array) { + if (array.length === 0) return array; + var newArray = array[0].map(function(col, i) { + return array.map(function(row) { + return row[i] + }) + }); + return newArray; + }; + // Split value at splitChar, but allow splitChar to be escaped + // using escapeChar. Any other characters escaped by escapeChar + // will be included as usual (including escapeChar itself). + function splitWithEscape(value, splitChar, escapeChar) { + var results = []; + var escapeMode = false; + var currentResult = ""; + for (var pos = 0; pos < value.length; pos++) { + if (!escapeMode) { + if (value[pos] === splitChar) { + results.push(currentResult); + currentResult = ""; + } else if (value[pos] === escapeChar) { + escapeMode = true; + } else { + currentResult += value[pos]; + } + } else { + currentResult += value[pos]; + escapeMode = false; + } + } + if (currentResult !== "") { + results.push(currentResult); + } + return results; + } + // Function authored by Yihui/JJ Allaire + window.HTMLWidgets.evaluateStringMember = function(o, member) { + var parts = splitWithEscape(member, '.', '\\'); + for (var i = 0, l = parts.length; i < l; i++) { + var part = parts[i]; + // part may be a character or 'numeric' member name + if (o !== null && typeof o === "object" && part in o) { + if (i == (l - 1)) { // if we are at the end of the line then evalulate + if (typeof o[part] === "string") + o[part] = tryEval(o[part]); + } else { // otherwise continue to next embedded object + o = o[part]; + } + } + } + }; + + // Retrieve the HTMLWidget instance (i.e. the return value of an + // HTMLWidget binding's initialize() or factory() function) + // associated with an element, or null if none. 
+ window.HTMLWidgets.getInstance = function(el) { + return elementData(el, "init_result"); + }; + + // Finds the first element in the scope that matches the selector, + // and returns the HTMLWidget instance (i.e. the return value of + // an HTMLWidget binding's initialize() or factory() function) + // associated with that element, if any. If no element matches the + // selector, or the first matching element has no HTMLWidget + // instance associated with it, then null is returned. + // + // The scope argument is optional, and defaults to window.document. + window.HTMLWidgets.find = function(scope, selector) { + if (arguments.length == 1) { + selector = scope; + scope = document; + } + + var el = scope.querySelector(selector); + if (el === null) { + return null; + } else { + return window.HTMLWidgets.getInstance(el); + } + }; + + // Finds all elements in the scope that match the selector, and + // returns the HTMLWidget instances (i.e. the return values of + // an HTMLWidget binding's initialize() or factory() function) + // associated with the elements, in an array. If elements that + // match the selector don't have an associated HTMLWidget + // instance, the returned array will contain nulls. + // + // The scope argument is optional, and defaults to window.document. + window.HTMLWidgets.findAll = function(scope, selector) { + if (arguments.length == 1) { + selector = scope; + scope = document; + } + + var nodes = scope.querySelectorAll(selector); + var results = []; + for (var i = 0; i < nodes.length; i++) { + results.push(window.HTMLWidgets.getInstance(nodes[i])); + } + return results; + }; + + var postRenderHandlers = []; + function invokePostRenderHandlers() { + while (postRenderHandlers.length) { + var handler = postRenderHandlers.shift(); + if (handler) { + handler(); + } + } + } + + // Register the given callback function to be invoked after the + // next time static widgets are rendered. + window.HTMLWidgets.addPostRenderHandler = function(callback) { + postRenderHandlers.push(callback); + }; + + // Takes a new-style instance-bound definition, and returns an + // old-style class-bound definition. This saves us from having + // to rewrite all the logic in this file to accomodate both + // types of definitions. + function createLegacyDefinitionAdapter(defn) { + var result = { + name: defn.name, + type: defn.type, + initialize: function(el, width, height) { + return defn.factory(el, width, height); + }, + renderValue: function(el, x, instance) { + return instance.renderValue(x); + }, + resize: function(el, width, height, instance) { + return instance.resize(width, height); + } + }; + + if (defn.find) + result.find = defn.find; + if (defn.renderError) + result.renderError = defn.renderError; + if (defn.clearError) + result.clearError = defn.clearError; + + return result; + } +})(); diff --git a/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/jquery-1.12.4/jquery.min.js b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/jquery-1.12.4/jquery.min.js new file mode 100644 index 00000000..e8364758 --- /dev/null +++ b/archive/2023-07-nyr/_freeze/extras-transit-case-study/libs/jquery-1.12.4/jquery.min.js @@ -0,0 +1,5 @@ +/*! 
jQuery v1.12.4 | (c) jQuery Foundation | jquery.org/license */ +!function(a,b){"object"==typeof module&&"object"==typeof module.exports?module.exports=a.document?b(a,!0):function(a){if(!a.document)throw new Error("jQuery requires a window with a document");return b(a)}:b(a)}("undefined"!=typeof window?window:this,function(a,b){var c=[],d=a.document,e=c.slice,f=c.concat,g=c.push,h=c.indexOf,i={},j=i.toString,k=i.hasOwnProperty,l={},m="1.12.4",n=function(a,b){return new n.fn.init(a,b)},o=/^[\s\uFEFF\xA0]+|[\s\uFEFF\xA0]+$/g,p=/^-ms-/,q=/-([\da-z])/gi,r=function(a,b){return b.toUpperCase()};n.fn=n.prototype={jquery:m,constructor:n,selector:"",length:0,toArray:function(){return e.call(this)},get:function(a){return null!=a?0>a?this[a+this.length]:this[a]:e.call(this)},pushStack:function(a){var b=n.merge(this.constructor(),a);return b.prevObject=this,b.context=this.context,b},each:function(a){return n.each(this,a)},map:function(a){return this.pushStack(n.map(this,function(b,c){return a.call(b,c,b)}))},slice:function(){return this.pushStack(e.apply(this,arguments))},first:function(){return this.eq(0)},last:function(){return this.eq(-1)},eq:function(a){var b=this.length,c=+a+(0>a?b:0);return this.pushStack(c>=0&&b>c?[this[c]]:[])},end:function(){return this.prevObject||this.constructor()},push:g,sort:c.sort,splice:c.splice},n.extend=n.fn.extend=function(){var a,b,c,d,e,f,g=arguments[0]||{},h=1,i=arguments.length,j=!1;for("boolean"==typeof g&&(j=g,g=arguments[h]||{},h++),"object"==typeof g||n.isFunction(g)||(g={}),h===i&&(g=this,h--);i>h;h++)if(null!=(e=arguments[h]))for(d in e)a=g[d],c=e[d],g!==c&&(j&&c&&(n.isPlainObject(c)||(b=n.isArray(c)))?(b?(b=!1,f=a&&n.isArray(a)?a:[]):f=a&&n.isPlainObject(a)?a:{},g[d]=n.extend(j,f,c)):void 0!==c&&(g[d]=c));return g},n.extend({expando:"jQuery"+(m+Math.random()).replace(/\D/g,""),isReady:!0,error:function(a){throw new Error(a)},noop:function(){},isFunction:function(a){return"function"===n.type(a)},isArray:Array.isArray||function(a){return"array"===n.type(a)},isWindow:function(a){return null!=a&&a==a.window},isNumeric:function(a){var b=a&&a.toString();return!n.isArray(a)&&b-parseFloat(b)+1>=0},isEmptyObject:function(a){var b;for(b in a)return!1;return!0},isPlainObject:function(a){var b;if(!a||"object"!==n.type(a)||a.nodeType||n.isWindow(a))return!1;try{if(a.constructor&&!k.call(a,"constructor")&&!k.call(a.constructor.prototype,"isPrototypeOf"))return!1}catch(c){return!1}if(!l.ownFirst)for(b in a)return k.call(a,b);for(b in a);return void 0===b||k.call(a,b)},type:function(a){return null==a?a+"":"object"==typeof a||"function"==typeof a?i[j.call(a)]||"object":typeof a},globalEval:function(b){b&&n.trim(b)&&(a.execScript||function(b){a.eval.call(a,b)})(b)},camelCase:function(a){return a.replace(p,"ms-").replace(q,r)},nodeName:function(a,b){return a.nodeName&&a.nodeName.toLowerCase()===b.toLowerCase()},each:function(a,b){var c,d=0;if(s(a)){for(c=a.length;c>d;d++)if(b.call(a[d],d,a[d])===!1)break}else for(d in a)if(b.call(a[d],d,a[d])===!1)break;return a},trim:function(a){return null==a?"":(a+"").replace(o,"")},makeArray:function(a,b){var c=b||[];return null!=a&&(s(Object(a))?n.merge(c,"string"==typeof a?[a]:a):g.call(c,a)),c},inArray:function(a,b,c){var d;if(b){if(h)return h.call(b,a,c);for(d=b.length,c=c?0>c?Math.max(0,d+c):c:0;d>c;c++)if(c in b&&b[c]===a)return c}return-1},merge:function(a,b){var c=+b.length,d=0,e=a.length;while(c>d)a[e++]=b[d++];if(c!==c)while(void 0!==b[d])a[e++]=b[d++];return a.length=e,a},grep:function(a,b,c){for(var 
d,e=[],f=0,g=a.length,h=!c;g>f;f++)d=!b(a[f],f),d!==h&&e.push(a[f]);return e},map:function(a,b,c){var d,e,g=0,h=[];if(s(a))for(d=a.length;d>g;g++)e=b(a[g],g,c),null!=e&&h.push(e);else for(g in a)e=b(a[g],g,c),null!=e&&h.push(e);return f.apply([],h)},guid:1,proxy:function(a,b){var c,d,f;return"string"==typeof b&&(f=a[b],b=a,a=f),n.isFunction(a)?(c=e.call(arguments,2),d=function(){return a.apply(b||this,c.concat(e.call(arguments)))},d.guid=a.guid=a.guid||n.guid++,d):void 0},now:function(){return+new Date},support:l}),"function"==typeof Symbol&&(n.fn[Symbol.iterator]=c[Symbol.iterator]),n.each("Boolean Number String Function Array Date RegExp Object Error Symbol".split(" "),function(a,b){i["[object "+b+"]"]=b.toLowerCase()});function s(a){var b=!!a&&"length"in a&&a.length,c=n.type(a);return"function"===c||n.isWindow(a)?!1:"array"===c||0===b||"number"==typeof b&&b>0&&b-1 in a}var t=function(a){var b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u="sizzle"+1*new Date,v=a.document,w=0,x=0,y=ga(),z=ga(),A=ga(),B=function(a,b){return a===b&&(l=!0),0},C=1<<31,D={}.hasOwnProperty,E=[],F=E.pop,G=E.push,H=E.push,I=E.slice,J=function(a,b){for(var c=0,d=a.length;d>c;c++)if(a[c]===b)return c;return-1},K="checked|selected|async|autofocus|autoplay|controls|defer|disabled|hidden|ismap|loop|multiple|open|readonly|required|scoped",L="[\\x20\\t\\r\\n\\f]",M="(?:\\\\.|[\\w-]|[^\\x00-\\xa0])+",N="\\["+L+"*("+M+")(?:"+L+"*([*^$|!~]?=)"+L+"*(?:'((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\"|("+M+"))|)"+L+"*\\]",O=":("+M+")(?:\\((('((?:\\\\.|[^\\\\'])*)'|\"((?:\\\\.|[^\\\\\"])*)\")|((?:\\\\.|[^\\\\()[\\]]|"+N+")*)|.*)\\)|)",P=new RegExp(L+"+","g"),Q=new RegExp("^"+L+"+|((?:^|[^\\\\])(?:\\\\.)*)"+L+"+$","g"),R=new RegExp("^"+L+"*,"+L+"*"),S=new RegExp("^"+L+"*([>+~]|"+L+")"+L+"*"),T=new RegExp("="+L+"*([^\\]'\"]*?)"+L+"*\\]","g"),U=new RegExp(O),V=new RegExp("^"+M+"$"),W={ID:new RegExp("^#("+M+")"),CLASS:new RegExp("^\\.("+M+")"),TAG:new RegExp("^("+M+"|[*])"),ATTR:new RegExp("^"+N),PSEUDO:new RegExp("^"+O),CHILD:new RegExp("^:(only|first|last|nth|nth-last)-(child|of-type)(?:\\("+L+"*(even|odd|(([+-]|)(\\d*)n|)"+L+"*(?:([+-]|)"+L+"*(\\d+)|))"+L+"*\\)|)","i"),bool:new RegExp("^(?:"+K+")$","i"),needsContext:new RegExp("^"+L+"*[>+~]|:(even|odd|eq|gt|lt|nth|first|last)(?:\\("+L+"*((?:-\\d)?\\d*)"+L+"*\\)|)(?=[^-]|$)","i")},X=/^(?:input|select|textarea|button)$/i,Y=/^h\d$/i,Z=/^[^{]+\{\s*\[native \w/,$=/^(?:#([\w-]+)|(\w+)|\.([\w-]+))$/,_=/[+~]/,aa=/'|\\/g,ba=new RegExp("\\\\([\\da-f]{1,6}"+L+"?|("+L+")|.)","ig"),ca=function(a,b,c){var d="0x"+b-65536;return d!==d||c?b:0>d?String.fromCharCode(d+65536):String.fromCharCode(d>>10|55296,1023&d|56320)},da=function(){m()};try{H.apply(E=I.call(v.childNodes),v.childNodes),E[v.childNodes.length].nodeType}catch(ea){H={apply:E.length?function(a,b){G.apply(a,I.call(b))}:function(a,b){var c=a.length,d=0;while(a[c++]=b[d++]);a.length=c-1}}}function fa(a,b,d,e){var f,h,j,k,l,o,r,s,w=b&&b.ownerDocument,x=b?b.nodeType:9;if(d=d||[],"string"!=typeof a||!a||1!==x&&9!==x&&11!==x)return d;if(!e&&((b?b.ownerDocument||b:v)!==n&&m(b),b=b||n,p)){if(11!==x&&(o=$.exec(a)))if(f=o[1]){if(9===x){if(!(j=b.getElementById(f)))return d;if(j.id===f)return d.push(j),d}else if(w&&(j=w.getElementById(f))&&t(b,j)&&j.id===f)return d.push(j),d}else{if(o[2])return H.apply(d,b.getElementsByTagName(a)),d;if((f=o[3])&&c.getElementsByClassName&&b.getElementsByClassName)return H.apply(d,b.getElementsByClassName(f)),d}if(c.qsa&&!A[a+" "]&&(!q||!q.test(a))){if(1!==x)w=b,s=a;else 
if("object"!==b.nodeName.toLowerCase()){(k=b.getAttribute("id"))?k=k.replace(aa,"\\$&"):b.setAttribute("id",k=u),r=g(a),h=r.length,l=V.test(k)?"#"+k:"[id='"+k+"']";while(h--)r[h]=l+" "+qa(r[h]);s=r.join(","),w=_.test(a)&&oa(b.parentNode)||b}if(s)try{return H.apply(d,w.querySelectorAll(s)),d}catch(y){}finally{k===u&&b.removeAttribute("id")}}}return i(a.replace(Q,"$1"),b,d,e)}function ga(){var a=[];function b(c,e){return a.push(c+" ")>d.cacheLength&&delete b[a.shift()],b[c+" "]=e}return b}function ha(a){return a[u]=!0,a}function ia(a){var b=n.createElement("div");try{return!!a(b)}catch(c){return!1}finally{b.parentNode&&b.parentNode.removeChild(b),b=null}}function ja(a,b){var c=a.split("|"),e=c.length;while(e--)d.attrHandle[c[e]]=b}function ka(a,b){var c=b&&a,d=c&&1===a.nodeType&&1===b.nodeType&&(~b.sourceIndex||C)-(~a.sourceIndex||C);if(d)return d;if(c)while(c=c.nextSibling)if(c===b)return-1;return a?1:-1}function la(a){return function(b){var c=b.nodeName.toLowerCase();return"input"===c&&b.type===a}}function ma(a){return function(b){var c=b.nodeName.toLowerCase();return("input"===c||"button"===c)&&b.type===a}}function na(a){return ha(function(b){return b=+b,ha(function(c,d){var e,f=a([],c.length,b),g=f.length;while(g--)c[e=f[g]]&&(c[e]=!(d[e]=c[e]))})})}function oa(a){return a&&"undefined"!=typeof a.getElementsByTagName&&a}c=fa.support={},f=fa.isXML=function(a){var b=a&&(a.ownerDocument||a).documentElement;return b?"HTML"!==b.nodeName:!1},m=fa.setDocument=function(a){var b,e,g=a?a.ownerDocument||a:v;return g!==n&&9===g.nodeType&&g.documentElement?(n=g,o=n.documentElement,p=!f(n),(e=n.defaultView)&&e.top!==e&&(e.addEventListener?e.addEventListener("unload",da,!1):e.attachEvent&&e.attachEvent("onunload",da)),c.attributes=ia(function(a){return a.className="i",!a.getAttribute("className")}),c.getElementsByTagName=ia(function(a){return a.appendChild(n.createComment("")),!a.getElementsByTagName("*").length}),c.getElementsByClassName=Z.test(n.getElementsByClassName),c.getById=ia(function(a){return o.appendChild(a).id=u,!n.getElementsByName||!n.getElementsByName(u).length}),c.getById?(d.find.ID=function(a,b){if("undefined"!=typeof b.getElementById&&p){var c=b.getElementById(a);return c?[c]:[]}},d.filter.ID=function(a){var b=a.replace(ba,ca);return function(a){return a.getAttribute("id")===b}}):(delete d.find.ID,d.filter.ID=function(a){var b=a.replace(ba,ca);return function(a){var c="undefined"!=typeof a.getAttributeNode&&a.getAttributeNode("id");return c&&c.value===b}}),d.find.TAG=c.getElementsByTagName?function(a,b){return"undefined"!=typeof b.getElementsByTagName?b.getElementsByTagName(a):c.qsa?b.querySelectorAll(a):void 0}:function(a,b){var c,d=[],e=0,f=b.getElementsByTagName(a);if("*"===a){while(c=f[e++])1===c.nodeType&&d.push(c);return d}return f},d.find.CLASS=c.getElementsByClassName&&function(a,b){return"undefined"!=typeof b.getElementsByClassName&&p?b.getElementsByClassName(a):void 0},r=[],q=[],(c.qsa=Z.test(n.querySelectorAll))&&(ia(function(a){o.appendChild(a).innerHTML="",a.querySelectorAll("[msallowcapture^='']").length&&q.push("[*^$]="+L+"*(?:''|\"\")"),a.querySelectorAll("[selected]").length||q.push("\\["+L+"*(?:value|"+K+")"),a.querySelectorAll("[id~="+u+"-]").length||q.push("~="),a.querySelectorAll(":checked").length||q.push(":checked"),a.querySelectorAll("a#"+u+"+*").length||q.push(".#.+[+~]")}),ia(function(a){var 
b=n.createElement("input");b.setAttribute("type","hidden"),a.appendChild(b).setAttribute("name","D"),a.querySelectorAll("[name=d]").length&&q.push("name"+L+"*[*^$|!~]?="),a.querySelectorAll(":enabled").length||q.push(":enabled",":disabled"),a.querySelectorAll("*,:x"),q.push(",.*:")})),(c.matchesSelector=Z.test(s=o.matches||o.webkitMatchesSelector||o.mozMatchesSelector||o.oMatchesSelector||o.msMatchesSelector))&&ia(function(a){c.disconnectedMatch=s.call(a,"div"),s.call(a,"[s!='']:x"),r.push("!=",O)}),q=q.length&&new RegExp(q.join("|")),r=r.length&&new RegExp(r.join("|")),b=Z.test(o.compareDocumentPosition),t=b||Z.test(o.contains)?function(a,b){var c=9===a.nodeType?a.documentElement:a,d=b&&b.parentNode;return a===d||!(!d||1!==d.nodeType||!(c.contains?c.contains(d):a.compareDocumentPosition&&16&a.compareDocumentPosition(d)))}:function(a,b){if(b)while(b=b.parentNode)if(b===a)return!0;return!1},B=b?function(a,b){if(a===b)return l=!0,0;var d=!a.compareDocumentPosition-!b.compareDocumentPosition;return d?d:(d=(a.ownerDocument||a)===(b.ownerDocument||b)?a.compareDocumentPosition(b):1,1&d||!c.sortDetached&&b.compareDocumentPosition(a)===d?a===n||a.ownerDocument===v&&t(v,a)?-1:b===n||b.ownerDocument===v&&t(v,b)?1:k?J(k,a)-J(k,b):0:4&d?-1:1)}:function(a,b){if(a===b)return l=!0,0;var c,d=0,e=a.parentNode,f=b.parentNode,g=[a],h=[b];if(!e||!f)return a===n?-1:b===n?1:e?-1:f?1:k?J(k,a)-J(k,b):0;if(e===f)return ka(a,b);c=a;while(c=c.parentNode)g.unshift(c);c=b;while(c=c.parentNode)h.unshift(c);while(g[d]===h[d])d++;return d?ka(g[d],h[d]):g[d]===v?-1:h[d]===v?1:0},n):n},fa.matches=function(a,b){return fa(a,null,null,b)},fa.matchesSelector=function(a,b){if((a.ownerDocument||a)!==n&&m(a),b=b.replace(T,"='$1']"),c.matchesSelector&&p&&!A[b+" "]&&(!r||!r.test(b))&&(!q||!q.test(b)))try{var d=s.call(a,b);if(d||c.disconnectedMatch||a.document&&11!==a.document.nodeType)return d}catch(e){}return fa(b,n,null,[a]).length>0},fa.contains=function(a,b){return(a.ownerDocument||a)!==n&&m(a),t(a,b)},fa.attr=function(a,b){(a.ownerDocument||a)!==n&&m(a);var e=d.attrHandle[b.toLowerCase()],f=e&&D.call(d.attrHandle,b.toLowerCase())?e(a,b,!p):void 0;return void 0!==f?f:c.attributes||!p?a.getAttribute(b):(f=a.getAttributeNode(b))&&f.specified?f.value:null},fa.error=function(a){throw new Error("Syntax error, unrecognized expression: "+a)},fa.uniqueSort=function(a){var b,d=[],e=0,f=0;if(l=!c.detectDuplicates,k=!c.sortStable&&a.slice(0),a.sort(B),l){while(b=a[f++])b===a[f]&&(e=d.push(f));while(e--)a.splice(d[e],1)}return k=null,a},e=fa.getText=function(a){var b,c="",d=0,f=a.nodeType;if(f){if(1===f||9===f||11===f){if("string"==typeof a.textContent)return a.textContent;for(a=a.firstChild;a;a=a.nextSibling)c+=e(a)}else if(3===f||4===f)return a.nodeValue}else while(b=a[d++])c+=e(b);return c},d=fa.selectors={cacheLength:50,createPseudo:ha,match:W,attrHandle:{},find:{},relative:{">":{dir:"parentNode",first:!0}," ":{dir:"parentNode"},"+":{dir:"previousSibling",first:!0},"~":{dir:"previousSibling"}},preFilter:{ATTR:function(a){return a[1]=a[1].replace(ba,ca),a[3]=(a[3]||a[4]||a[5]||"").replace(ba,ca),"~="===a[2]&&(a[3]=" "+a[3]+" "),a.slice(0,4)},CHILD:function(a){return a[1]=a[1].toLowerCase(),"nth"===a[1].slice(0,3)?(a[3]||fa.error(a[0]),a[4]=+(a[4]?a[5]+(a[6]||1):2*("even"===a[3]||"odd"===a[3])),a[5]=+(a[7]+a[8]||"odd"===a[3])):a[3]&&fa.error(a[0]),a},PSEUDO:function(a){var b,c=!a[6]&&a[2];return 
W.CHILD.test(a[0])?null:(a[3]?a[2]=a[4]||a[5]||"":c&&U.test(c)&&(b=g(c,!0))&&(b=c.indexOf(")",c.length-b)-c.length)&&(a[0]=a[0].slice(0,b),a[2]=c.slice(0,b)),a.slice(0,3))}},filter:{TAG:function(a){var b=a.replace(ba,ca).toLowerCase();return"*"===a?function(){return!0}:function(a){return a.nodeName&&a.nodeName.toLowerCase()===b}},CLASS:function(a){var b=y[a+" "];return b||(b=new RegExp("(^|"+L+")"+a+"("+L+"|$)"))&&y(a,function(a){return b.test("string"==typeof a.className&&a.className||"undefined"!=typeof a.getAttribute&&a.getAttribute("class")||"")})},ATTR:function(a,b,c){return function(d){var e=fa.attr(d,a);return null==e?"!="===b:b?(e+="","="===b?e===c:"!="===b?e!==c:"^="===b?c&&0===e.indexOf(c):"*="===b?c&&e.indexOf(c)>-1:"$="===b?c&&e.slice(-c.length)===c:"~="===b?(" "+e.replace(P," ")+" ").indexOf(c)>-1:"|="===b?e===c||e.slice(0,c.length+1)===c+"-":!1):!0}},CHILD:function(a,b,c,d,e){var f="nth"!==a.slice(0,3),g="last"!==a.slice(-4),h="of-type"===b;return 1===d&&0===e?function(a){return!!a.parentNode}:function(b,c,i){var j,k,l,m,n,o,p=f!==g?"nextSibling":"previousSibling",q=b.parentNode,r=h&&b.nodeName.toLowerCase(),s=!i&&!h,t=!1;if(q){if(f){while(p){m=b;while(m=m[p])if(h?m.nodeName.toLowerCase()===r:1===m.nodeType)return!1;o=p="only"===a&&!o&&"nextSibling"}return!0}if(o=[g?q.firstChild:q.lastChild],g&&s){m=q,l=m[u]||(m[u]={}),k=l[m.uniqueID]||(l[m.uniqueID]={}),j=k[a]||[],n=j[0]===w&&j[1],t=n&&j[2],m=n&&q.childNodes[n];while(m=++n&&m&&m[p]||(t=n=0)||o.pop())if(1===m.nodeType&&++t&&m===b){k[a]=[w,n,t];break}}else if(s&&(m=b,l=m[u]||(m[u]={}),k=l[m.uniqueID]||(l[m.uniqueID]={}),j=k[a]||[],n=j[0]===w&&j[1],t=n),t===!1)while(m=++n&&m&&m[p]||(t=n=0)||o.pop())if((h?m.nodeName.toLowerCase()===r:1===m.nodeType)&&++t&&(s&&(l=m[u]||(m[u]={}),k=l[m.uniqueID]||(l[m.uniqueID]={}),k[a]=[w,t]),m===b))break;return t-=e,t===d||t%d===0&&t/d>=0}}},PSEUDO:function(a,b){var c,e=d.pseudos[a]||d.setFilters[a.toLowerCase()]||fa.error("unsupported pseudo: "+a);return e[u]?e(b):e.length>1?(c=[a,a,"",b],d.setFilters.hasOwnProperty(a.toLowerCase())?ha(function(a,c){var d,f=e(a,b),g=f.length;while(g--)d=J(a,f[g]),a[d]=!(c[d]=f[g])}):function(a){return e(a,0,c)}):e}},pseudos:{not:ha(function(a){var b=[],c=[],d=h(a.replace(Q,"$1"));return d[u]?ha(function(a,b,c,e){var f,g=d(a,null,e,[]),h=a.length;while(h--)(f=g[h])&&(a[h]=!(b[h]=f))}):function(a,e,f){return b[0]=a,d(b,null,f,c),b[0]=null,!c.pop()}}),has:ha(function(a){return function(b){return fa(a,b).length>0}}),contains:ha(function(a){return a=a.replace(ba,ca),function(b){return(b.textContent||b.innerText||e(b)).indexOf(a)>-1}}),lang:ha(function(a){return V.test(a||"")||fa.error("unsupported lang: "+a),a=a.replace(ba,ca).toLowerCase(),function(b){var c;do if(c=p?b.lang:b.getAttribute("xml:lang")||b.getAttribute("lang"))return c=c.toLowerCase(),c===a||0===c.indexOf(a+"-");while((b=b.parentNode)&&1===b.nodeType);return!1}}),target:function(b){var c=a.location&&a.location.hash;return c&&c.slice(1)===b.id},root:function(a){return a===o},focus:function(a){return a===n.activeElement&&(!n.hasFocus||n.hasFocus())&&!!(a.type||a.href||~a.tabIndex)},enabled:function(a){return a.disabled===!1},disabled:function(a){return a.disabled===!0},checked:function(a){var b=a.nodeName.toLowerCase();return"input"===b&&!!a.checked||"option"===b&&!!a.selected},selected:function(a){return 
a.parentNode&&a.parentNode.selectedIndex,a.selected===!0},empty:function(a){for(a=a.firstChild;a;a=a.nextSibling)if(a.nodeType<6)return!1;return!0},parent:function(a){return!d.pseudos.empty(a)},header:function(a){return Y.test(a.nodeName)},input:function(a){return X.test(a.nodeName)},button:function(a){var b=a.nodeName.toLowerCase();return"input"===b&&"button"===a.type||"button"===b},text:function(a){var b;return"input"===a.nodeName.toLowerCase()&&"text"===a.type&&(null==(b=a.getAttribute("type"))||"text"===b.toLowerCase())},first:na(function(){return[0]}),last:na(function(a,b){return[b-1]}),eq:na(function(a,b,c){return[0>c?c+b:c]}),even:na(function(a,b){for(var c=0;b>c;c+=2)a.push(c);return a}),odd:na(function(a,b){for(var c=1;b>c;c+=2)a.push(c);return a}),lt:na(function(a,b,c){for(var d=0>c?c+b:c;--d>=0;)a.push(d);return a}),gt:na(function(a,b,c){for(var d=0>c?c+b:c;++db;b++)d+=a[b].value;return d}function ra(a,b,c){var d=b.dir,e=c&&"parentNode"===d,f=x++;return b.first?function(b,c,f){while(b=b[d])if(1===b.nodeType||e)return a(b,c,f)}:function(b,c,g){var h,i,j,k=[w,f];if(g){while(b=b[d])if((1===b.nodeType||e)&&a(b,c,g))return!0}else while(b=b[d])if(1===b.nodeType||e){if(j=b[u]||(b[u]={}),i=j[b.uniqueID]||(j[b.uniqueID]={}),(h=i[d])&&h[0]===w&&h[1]===f)return k[2]=h[2];if(i[d]=k,k[2]=a(b,c,g))return!0}}}function sa(a){return a.length>1?function(b,c,d){var e=a.length;while(e--)if(!a[e](b,c,d))return!1;return!0}:a[0]}function ta(a,b,c){for(var d=0,e=b.length;e>d;d++)fa(a,b[d],c);return c}function ua(a,b,c,d,e){for(var f,g=[],h=0,i=a.length,j=null!=b;i>h;h++)(f=a[h])&&(c&&!c(f,d,e)||(g.push(f),j&&b.push(h)));return g}function va(a,b,c,d,e,f){return d&&!d[u]&&(d=va(d)),e&&!e[u]&&(e=va(e,f)),ha(function(f,g,h,i){var j,k,l,m=[],n=[],o=g.length,p=f||ta(b||"*",h.nodeType?[h]:h,[]),q=!a||!f&&b?p:ua(p,m,a,h,i),r=c?e||(f?a:o||d)?[]:g:q;if(c&&c(q,r,h,i),d){j=ua(r,n),d(j,[],h,i),k=j.length;while(k--)(l=j[k])&&(r[n[k]]=!(q[n[k]]=l))}if(f){if(e||a){if(e){j=[],k=r.length;while(k--)(l=r[k])&&j.push(q[k]=l);e(null,r=[],j,i)}k=r.length;while(k--)(l=r[k])&&(j=e?J(f,l):m[k])>-1&&(f[j]=!(g[j]=l))}}else r=ua(r===g?r.splice(o,r.length):r),e?e(null,g,r,i):H.apply(g,r)})}function wa(a){for(var b,c,e,f=a.length,g=d.relative[a[0].type],h=g||d.relative[" "],i=g?1:0,k=ra(function(a){return a===b},h,!0),l=ra(function(a){return J(b,a)>-1},h,!0),m=[function(a,c,d){var e=!g&&(d||c!==j)||((b=c).nodeType?k(a,c,d):l(a,c,d));return b=null,e}];f>i;i++)if(c=d.relative[a[i].type])m=[ra(sa(m),c)];else{if(c=d.filter[a[i].type].apply(null,a[i].matches),c[u]){for(e=++i;f>e;e++)if(d.relative[a[e].type])break;return va(i>1&&sa(m),i>1&&qa(a.slice(0,i-1).concat({value:" "===a[i-2].type?"*":""})).replace(Q,"$1"),c,e>i&&wa(a.slice(i,e)),f>e&&wa(a=a.slice(e)),f>e&&qa(a))}m.push(c)}return sa(m)}function xa(a,b){var c=b.length>0,e=a.length>0,f=function(f,g,h,i,k){var l,o,q,r=0,s="0",t=f&&[],u=[],v=j,x=f||e&&d.find.TAG("*",k),y=w+=null==v?1:Math.random()||.1,z=x.length;for(k&&(j=g===n||g||k);s!==z&&null!=(l=x[s]);s++){if(e&&l){o=0,g||l.ownerDocument===n||(m(l),h=!p);while(q=a[o++])if(q(l,g||n,h)){i.push(l);break}k&&(w=y)}c&&((l=!q&&l)&&r--,f&&t.push(l))}if(r+=s,c&&s!==r){o=0;while(q=b[o++])q(t,u,g,h);if(f){if(r>0)while(s--)t[s]||u[s]||(u[s]=F.call(i));u=ua(u)}H.apply(i,u),k&&!f&&u.length>0&&r+b.length>1&&fa.uniqueSort(i)}return k&&(w=y,j=v),t};return c?ha(f):f}return h=fa.compile=function(a,b){var c,d=[],e=[],f=A[a+" "];if(!f){b||(b=g(a)),c=b.length;while(c--)f=wa(b[c]),f[u]?d.push(f):e.push(f);f=A(a,xa(e,d)),f.selector=a}return 
f},i=fa.select=function(a,b,e,f){var i,j,k,l,m,n="function"==typeof a&&a,o=!f&&g(a=n.selector||a);if(e=e||[],1===o.length){if(j=o[0]=o[0].slice(0),j.length>2&&"ID"===(k=j[0]).type&&c.getById&&9===b.nodeType&&p&&d.relative[j[1].type]){if(b=(d.find.ID(k.matches[0].replace(ba,ca),b)||[])[0],!b)return e;n&&(b=b.parentNode),a=a.slice(j.shift().value.length)}i=W.needsContext.test(a)?0:j.length;while(i--){if(k=j[i],d.relative[l=k.type])break;if((m=d.find[l])&&(f=m(k.matches[0].replace(ba,ca),_.test(j[0].type)&&oa(b.parentNode)||b))){if(j.splice(i,1),a=f.length&&qa(j),!a)return H.apply(e,f),e;break}}}return(n||h(a,o))(f,b,!p,e,!b||_.test(a)&&oa(b.parentNode)||b),e},c.sortStable=u.split("").sort(B).join("")===u,c.detectDuplicates=!!l,m(),c.sortDetached=ia(function(a){return 1&a.compareDocumentPosition(n.createElement("div"))}),ia(function(a){return a.innerHTML="","#"===a.firstChild.getAttribute("href")})||ja("type|href|height|width",function(a,b,c){return c?void 0:a.getAttribute(b,"type"===b.toLowerCase()?1:2)}),c.attributes&&ia(function(a){return a.innerHTML="",a.firstChild.setAttribute("value",""),""===a.firstChild.getAttribute("value")})||ja("value",function(a,b,c){return c||"input"!==a.nodeName.toLowerCase()?void 0:a.defaultValue}),ia(function(a){return null==a.getAttribute("disabled")})||ja(K,function(a,b,c){var d;return c?void 0:a[b]===!0?b.toLowerCase():(d=a.getAttributeNode(b))&&d.specified?d.value:null}),fa}(a);n.find=t,n.expr=t.selectors,n.expr[":"]=n.expr.pseudos,n.uniqueSort=n.unique=t.uniqueSort,n.text=t.getText,n.isXMLDoc=t.isXML,n.contains=t.contains;var u=function(a,b,c){var d=[],e=void 0!==c;while((a=a[b])&&9!==a.nodeType)if(1===a.nodeType){if(e&&n(a).is(c))break;d.push(a)}return d},v=function(a,b){for(var c=[];a;a=a.nextSibling)1===a.nodeType&&a!==b&&c.push(a);return c},w=n.expr.match.needsContext,x=/^<([\w-]+)\s*\/?>(?:<\/\1>|)$/,y=/^.[^:#\[\.,]*$/;function z(a,b,c){if(n.isFunction(b))return n.grep(a,function(a,d){return!!b.call(a,d,a)!==c});if(b.nodeType)return n.grep(a,function(a){return a===b!==c});if("string"==typeof b){if(y.test(b))return n.filter(b,a,c);b=n.filter(b,a)}return n.grep(a,function(a){return n.inArray(a,b)>-1!==c})}n.filter=function(a,b,c){var d=b[0];return c&&(a=":not("+a+")"),1===b.length&&1===d.nodeType?n.find.matchesSelector(d,a)?[d]:[]:n.find.matches(a,n.grep(b,function(a){return 1===a.nodeType}))},n.fn.extend({find:function(a){var b,c=[],d=this,e=d.length;if("string"!=typeof a)return this.pushStack(n(a).filter(function(){for(b=0;e>b;b++)if(n.contains(d[b],this))return!0}));for(b=0;e>b;b++)n.find(a,d[b],c);return c=this.pushStack(e>1?n.unique(c):c),c.selector=this.selector?this.selector+" "+a:a,c},filter:function(a){return this.pushStack(z(this,a||[],!1))},not:function(a){return this.pushStack(z(this,a||[],!0))},is:function(a){return!!z(this,"string"==typeof a&&w.test(a)?n(a):a||[],!1).length}});var A,B=/^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]*))$/,C=n.fn.init=function(a,b,c){var e,f;if(!a)return this;if(c=c||A,"string"==typeof a){if(e="<"===a.charAt(0)&&">"===a.charAt(a.length-1)&&a.length>=3?[null,a,null]:B.exec(a),!e||!e[1]&&b)return!b||b.jquery?(b||c).find(a):this.constructor(b).find(a);if(e[1]){if(b=b instanceof n?b[0]:b,n.merge(this,n.parseHTML(e[1],b&&b.nodeType?b.ownerDocument||b:d,!0)),x.test(e[1])&&n.isPlainObject(b))for(e in b)n.isFunction(this[e])?this[e](b[e]):this.attr(e,b[e]);return this}if(f=d.getElementById(e[2]),f&&f.parentNode){if(f.id!==e[2])return A.find(a);this.length=1,this[0]=f}return this.context=d,this.selector=a,this}return 
a.nodeType?(this.context=this[0]=a,this.length=1,this):n.isFunction(a)?"undefined"!=typeof c.ready?c.ready(a):a(n):(void 0!==a.selector&&(this.selector=a.selector,this.context=a.context),n.makeArray(a,this))};C.prototype=n.fn,A=n(d);var D=/^(?:parents|prev(?:Until|All))/,E={children:!0,contents:!0,next:!0,prev:!0};n.fn.extend({has:function(a){var b,c=n(a,this),d=c.length;return this.filter(function(){for(b=0;d>b;b++)if(n.contains(this,c[b]))return!0})},closest:function(a,b){for(var c,d=0,e=this.length,f=[],g=w.test(a)||"string"!=typeof a?n(a,b||this.context):0;e>d;d++)for(c=this[d];c&&c!==b;c=c.parentNode)if(c.nodeType<11&&(g?g.index(c)>-1:1===c.nodeType&&n.find.matchesSelector(c,a))){f.push(c);break}return this.pushStack(f.length>1?n.uniqueSort(f):f)},index:function(a){return a?"string"==typeof a?n.inArray(this[0],n(a)):n.inArray(a.jquery?a[0]:a,this):this[0]&&this[0].parentNode?this.first().prevAll().length:-1},add:function(a,b){return this.pushStack(n.uniqueSort(n.merge(this.get(),n(a,b))))},addBack:function(a){return this.add(null==a?this.prevObject:this.prevObject.filter(a))}});function F(a,b){do a=a[b];while(a&&1!==a.nodeType);return a}n.each({parent:function(a){var b=a.parentNode;return b&&11!==b.nodeType?b:null},parents:function(a){return u(a,"parentNode")},parentsUntil:function(a,b,c){return u(a,"parentNode",c)},next:function(a){return F(a,"nextSibling")},prev:function(a){return F(a,"previousSibling")},nextAll:function(a){return u(a,"nextSibling")},prevAll:function(a){return u(a,"previousSibling")},nextUntil:function(a,b,c){return u(a,"nextSibling",c)},prevUntil:function(a,b,c){return u(a,"previousSibling",c)},siblings:function(a){return v((a.parentNode||{}).firstChild,a)},children:function(a){return v(a.firstChild)},contents:function(a){return n.nodeName(a,"iframe")?a.contentDocument||a.contentWindow.document:n.merge([],a.childNodes)}},function(a,b){n.fn[a]=function(c,d){var e=n.map(this,b,c);return"Until"!==a.slice(-5)&&(d=c),d&&"string"==typeof d&&(e=n.filter(d,e)),this.length>1&&(E[a]||(e=n.uniqueSort(e)),D.test(a)&&(e=e.reverse())),this.pushStack(e)}});var G=/\S+/g;function H(a){var b={};return n.each(a.match(G)||[],function(a,c){b[c]=!0}),b}n.Callbacks=function(a){a="string"==typeof a?H(a):n.extend({},a);var b,c,d,e,f=[],g=[],h=-1,i=function(){for(e=a.once,d=b=!0;g.length;h=-1){c=g.shift();while(++h-1)f.splice(c,1),h>=c&&h--}),this},has:function(a){return a?n.inArray(a,f)>-1:f.length>0},empty:function(){return f&&(f=[]),this},disable:function(){return e=g=[],f=c="",this},disabled:function(){return!f},lock:function(){return e=!0,c||j.disable(),this},locked:function(){return!!e},fireWith:function(a,c){return e||(c=c||[],c=[a,c.slice?c.slice():c],g.push(c),b||i()),this},fire:function(){return j.fireWith(this,arguments),this},fired:function(){return!!d}};return j},n.extend({Deferred:function(a){var b=[["resolve","done",n.Callbacks("once memory"),"resolved"],["reject","fail",n.Callbacks("once memory"),"rejected"],["notify","progress",n.Callbacks("memory")]],c="pending",d={state:function(){return c},always:function(){return e.done(arguments).fail(arguments),this},then:function(){var a=arguments;return n.Deferred(function(c){n.each(b,function(b,f){var g=n.isFunction(a[b])&&a[b];e[f[1]](function(){var a=g&&g.apply(this,arguments);a&&n.isFunction(a.promise)?a.promise().progress(c.notify).done(c.resolve).fail(c.reject):c[f[0]+"With"](this===d?c.promise():this,g?[a]:arguments)})}),a=null}).promise()},promise:function(a){return null!=a?n.extend(a,d):d}},e={};return 
d.pipe=d.then,n.each(b,function(a,f){var g=f[2],h=f[3];d[f[1]]=g.add,h&&g.add(function(){c=h},b[1^a][2].disable,b[2][2].lock),e[f[0]]=function(){return e[f[0]+"With"](this===e?d:this,arguments),this},e[f[0]+"With"]=g.fireWith}),d.promise(e),a&&a.call(e,e),e},when:function(a){var b=0,c=e.call(arguments),d=c.length,f=1!==d||a&&n.isFunction(a.promise)?d:0,g=1===f?a:n.Deferred(),h=function(a,b,c){return function(d){b[a]=this,c[a]=arguments.length>1?e.call(arguments):d,c===i?g.notifyWith(b,c):--f||g.resolveWith(b,c)}},i,j,k;if(d>1)for(i=new Array(d),j=new Array(d),k=new Array(d);d>b;b++)c[b]&&n.isFunction(c[b].promise)?c[b].promise().progress(h(b,j,i)).done(h(b,k,c)).fail(g.reject):--f;return f||g.resolveWith(k,c),g.promise()}});var I;n.fn.ready=function(a){return n.ready.promise().done(a),this},n.extend({isReady:!1,readyWait:1,holdReady:function(a){a?n.readyWait++:n.ready(!0)},ready:function(a){(a===!0?--n.readyWait:n.isReady)||(n.isReady=!0,a!==!0&&--n.readyWait>0||(I.resolveWith(d,[n]),n.fn.triggerHandler&&(n(d).triggerHandler("ready"),n(d).off("ready"))))}});function J(){d.addEventListener?(d.removeEventListener("DOMContentLoaded",K),a.removeEventListener("load",K)):(d.detachEvent("onreadystatechange",K),a.detachEvent("onload",K))}function K(){(d.addEventListener||"load"===a.event.type||"complete"===d.readyState)&&(J(),n.ready())}n.ready.promise=function(b){if(!I)if(I=n.Deferred(),"complete"===d.readyState||"loading"!==d.readyState&&!d.documentElement.doScroll)a.setTimeout(n.ready);else if(d.addEventListener)d.addEventListener("DOMContentLoaded",K),a.addEventListener("load",K);else{d.attachEvent("onreadystatechange",K),a.attachEvent("onload",K);var c=!1;try{c=null==a.frameElement&&d.documentElement}catch(e){}c&&c.doScroll&&!function f(){if(!n.isReady){try{c.doScroll("left")}catch(b){return a.setTimeout(f,50)}J(),n.ready()}}()}return I.promise(b)},n.ready.promise();var L;for(L in n(l))break;l.ownFirst="0"===L,l.inlineBlockNeedsLayout=!1,n(function(){var a,b,c,e;c=d.getElementsByTagName("body")[0],c&&c.style&&(b=d.createElement("div"),e=d.createElement("div"),e.style.cssText="position:absolute;border:0;width:0;height:0;top:0;left:-9999px",c.appendChild(e).appendChild(b),"undefined"!=typeof b.style.zoom&&(b.style.cssText="display:inline;margin:0;border:0;padding:1px;width:1px;zoom:1",l.inlineBlockNeedsLayout=a=3===b.offsetWidth,a&&(c.style.zoom=1)),c.removeChild(e))}),function(){var a=d.createElement("div");l.deleteExpando=!0;try{delete a.test}catch(b){l.deleteExpando=!1}a=null}();var M=function(a){var b=n.noData[(a.nodeName+" ").toLowerCase()],c=+a.nodeType||1;return 1!==c&&9!==c?!1:!b||b!==!0&&a.getAttribute("classid")===b},N=/^(?:\{[\w\W]*\}|\[[\w\W]*\])$/,O=/([A-Z])/g;function P(a,b,c){if(void 0===c&&1===a.nodeType){var d="data-"+b.replace(O,"-$1").toLowerCase();if(c=a.getAttribute(d),"string"==typeof c){try{c="true"===c?!0:"false"===c?!1:"null"===c?null:+c+""===c?+c:N.test(c)?n.parseJSON(c):c}catch(e){}n.data(a,b,c)}else c=void 0; +}return c}function Q(a){var b;for(b in a)if(("data"!==b||!n.isEmptyObject(a[b]))&&"toJSON"!==b)return!1;return!0}function R(a,b,d,e){if(M(a)){var f,g,h=n.expando,i=a.nodeType,j=i?n.cache:a,k=i?a[h]:a[h]&&h;if(k&&j[k]&&(e||j[k].data)||void 0!==d||"string"!=typeof b)return k||(k=i?a[h]=c.pop()||n.guid++:h),j[k]||(j[k]=i?{}:{toJSON:n.noop}),"object"!=typeof b&&"function"!=typeof b||(e?j[k]=n.extend(j[k],b):j[k].data=n.extend(j[k].data,b)),g=j[k],e||(g.data||(g.data={}),g=g.data),void 0!==d&&(g[n.camelCase(b)]=d),"string"==typeof 
b?(f=g[b],null==f&&(f=g[n.camelCase(b)])):f=g,f}}function S(a,b,c){if(M(a)){var d,e,f=a.nodeType,g=f?n.cache:a,h=f?a[n.expando]:n.expando;if(g[h]){if(b&&(d=c?g[h]:g[h].data)){n.isArray(b)?b=b.concat(n.map(b,n.camelCase)):b in d?b=[b]:(b=n.camelCase(b),b=b in d?[b]:b.split(" ")),e=b.length;while(e--)delete d[b[e]];if(c?!Q(d):!n.isEmptyObject(d))return}(c||(delete g[h].data,Q(g[h])))&&(f?n.cleanData([a],!0):l.deleteExpando||g!=g.window?delete g[h]:g[h]=void 0)}}}n.extend({cache:{},noData:{"applet ":!0,"embed ":!0,"object ":"clsid:D27CDB6E-AE6D-11cf-96B8-444553540000"},hasData:function(a){return a=a.nodeType?n.cache[a[n.expando]]:a[n.expando],!!a&&!Q(a)},data:function(a,b,c){return R(a,b,c)},removeData:function(a,b){return S(a,b)},_data:function(a,b,c){return R(a,b,c,!0)},_removeData:function(a,b){return S(a,b,!0)}}),n.fn.extend({data:function(a,b){var c,d,e,f=this[0],g=f&&f.attributes;if(void 0===a){if(this.length&&(e=n.data(f),1===f.nodeType&&!n._data(f,"parsedAttrs"))){c=g.length;while(c--)g[c]&&(d=g[c].name,0===d.indexOf("data-")&&(d=n.camelCase(d.slice(5)),P(f,d,e[d])));n._data(f,"parsedAttrs",!0)}return e}return"object"==typeof a?this.each(function(){n.data(this,a)}):arguments.length>1?this.each(function(){n.data(this,a,b)}):f?P(f,a,n.data(f,a)):void 0},removeData:function(a){return this.each(function(){n.removeData(this,a)})}}),n.extend({queue:function(a,b,c){var d;return a?(b=(b||"fx")+"queue",d=n._data(a,b),c&&(!d||n.isArray(c)?d=n._data(a,b,n.makeArray(c)):d.push(c)),d||[]):void 0},dequeue:function(a,b){b=b||"fx";var c=n.queue(a,b),d=c.length,e=c.shift(),f=n._queueHooks(a,b),g=function(){n.dequeue(a,b)};"inprogress"===e&&(e=c.shift(),d--),e&&("fx"===b&&c.unshift("inprogress"),delete f.stop,e.call(a,g,f)),!d&&f&&f.empty.fire()},_queueHooks:function(a,b){var c=b+"queueHooks";return n._data(a,c)||n._data(a,c,{empty:n.Callbacks("once memory").add(function(){n._removeData(a,b+"queue"),n._removeData(a,c)})})}}),n.fn.extend({queue:function(a,b){var c=2;return"string"!=typeof a&&(b=a,a="fx",c--),arguments.lengthh;h++)b(a[h],c,g?d:d.call(a[h],h,b(a[h],c)));return e?a:j?b.call(a):i?b(a[0],c):f},Z=/^(?:checkbox|radio)$/i,$=/<([\w:-]+)/,_=/^$|\/(?:java|ecma)script/i,aa=/^\s+/,ba="abbr|article|aside|audio|bdi|canvas|data|datalist|details|dialog|figcaption|figure|footer|header|hgroup|main|mark|meter|nav|output|picture|progress|section|summary|template|time|video";function ca(a){var b=ba.split("|"),c=a.createDocumentFragment();if(c.createElement)while(b.length)c.createElement(b.pop());return c}!function(){var a=d.createElement("div"),b=d.createDocumentFragment(),c=d.createElement("input");a.innerHTML="
a",l.leadingWhitespace=3===a.firstChild.nodeType,l.tbody=!a.getElementsByTagName("tbody").length,l.htmlSerialize=!!a.getElementsByTagName("link").length,l.html5Clone="<:nav>"!==d.createElement("nav").cloneNode(!0).outerHTML,c.type="checkbox",c.checked=!0,b.appendChild(c),l.appendChecked=c.checked,a.innerHTML="",l.noCloneChecked=!!a.cloneNode(!0).lastChild.defaultValue,b.appendChild(a),c=d.createElement("input"),c.setAttribute("type","radio"),c.setAttribute("checked","checked"),c.setAttribute("name","t"),a.appendChild(c),l.checkClone=a.cloneNode(!0).cloneNode(!0).lastChild.checked,l.noCloneEvent=!!a.addEventListener,a[n.expando]=1,l.attributes=!a.getAttribute(n.expando)}();var da={option:[1,""],legend:[1,"
","
"],area:[1,"",""],param:[1,"",""],thead:[1,"","
"],tr:[2,"","
"],col:[2,"","
"],td:[3,"","
"],_default:l.htmlSerialize?[0,"",""]:[1,"X
","
"]};da.optgroup=da.option,da.tbody=da.tfoot=da.colgroup=da.caption=da.thead,da.th=da.td;function ea(a,b){var c,d,e=0,f="undefined"!=typeof a.getElementsByTagName?a.getElementsByTagName(b||"*"):"undefined"!=typeof a.querySelectorAll?a.querySelectorAll(b||"*"):void 0;if(!f)for(f=[],c=a.childNodes||a;null!=(d=c[e]);e++)!b||n.nodeName(d,b)?f.push(d):n.merge(f,ea(d,b));return void 0===b||b&&n.nodeName(a,b)?n.merge([a],f):f}function fa(a,b){for(var c,d=0;null!=(c=a[d]);d++)n._data(c,"globalEval",!b||n._data(b[d],"globalEval"))}var ga=/<|&#?\w+;/,ha=/r;r++)if(g=a[r],g||0===g)if("object"===n.type(g))n.merge(q,g.nodeType?[g]:g);else if(ga.test(g)){i=i||p.appendChild(b.createElement("div")),j=($.exec(g)||["",""])[1].toLowerCase(),m=da[j]||da._default,i.innerHTML=m[1]+n.htmlPrefilter(g)+m[2],f=m[0];while(f--)i=i.lastChild;if(!l.leadingWhitespace&&aa.test(g)&&q.push(b.createTextNode(aa.exec(g)[0])),!l.tbody){g="table"!==j||ha.test(g)?""!==m[1]||ha.test(g)?0:i:i.firstChild,f=g&&g.childNodes.length;while(f--)n.nodeName(k=g.childNodes[f],"tbody")&&!k.childNodes.length&&g.removeChild(k)}n.merge(q,i.childNodes),i.textContent="";while(i.firstChild)i.removeChild(i.firstChild);i=p.lastChild}else q.push(b.createTextNode(g));i&&p.removeChild(i),l.appendChecked||n.grep(ea(q,"input"),ia),r=0;while(g=q[r++])if(d&&n.inArray(g,d)>-1)e&&e.push(g);else if(h=n.contains(g.ownerDocument,g),i=ea(p.appendChild(g),"script"),h&&fa(i),c){f=0;while(g=i[f++])_.test(g.type||"")&&c.push(g)}return i=null,p}!function(){var b,c,e=d.createElement("div");for(b in{submit:!0,change:!0,focusin:!0})c="on"+b,(l[b]=c in a)||(e.setAttribute(c,"t"),l[b]=e.attributes[c].expando===!1);e=null}();var ka=/^(?:input|select|textarea)$/i,la=/^key/,ma=/^(?:mouse|pointer|contextmenu|drag|drop)|click/,na=/^(?:focusinfocus|focusoutblur)$/,oa=/^([^.]*)(?:\.(.+)|)/;function pa(){return!0}function qa(){return!1}function ra(){try{return d.activeElement}catch(a){}}function sa(a,b,c,d,e,f){var g,h;if("object"==typeof b){"string"!=typeof c&&(d=d||c,c=void 0);for(h in b)sa(a,h,c,d,b[h],f);return a}if(null==d&&null==e?(e=c,d=c=void 0):null==e&&("string"==typeof c?(e=d,d=void 0):(e=d,d=c,c=void 0)),e===!1)e=qa;else if(!e)return a;return 1===f&&(g=e,e=function(a){return n().off(a),g.apply(this,arguments)},e.guid=g.guid||(g.guid=n.guid++)),a.each(function(){n.event.add(this,b,e,d,c)})}n.event={global:{},add:function(a,b,c,d,e){var f,g,h,i,j,k,l,m,o,p,q,r=n._data(a);if(r){c.handler&&(i=c,c=i.handler,e=i.selector),c.guid||(c.guid=n.guid++),(g=r.events)||(g=r.events={}),(k=r.handle)||(k=r.handle=function(a){return"undefined"==typeof n||a&&n.event.triggered===a.type?void 0:n.event.dispatch.apply(k.elem,arguments)},k.elem=a),b=(b||"").match(G)||[""],h=b.length;while(h--)f=oa.exec(b[h])||[],o=q=f[1],p=(f[2]||"").split(".").sort(),o&&(j=n.event.special[o]||{},o=(e?j.delegateType:j.bindType)||o,j=n.event.special[o]||{},l=n.extend({type:o,origType:q,data:d,handler:c,guid:c.guid,selector:e,needsContext:e&&n.expr.match.needsContext.test(e),namespace:p.join(".")},i),(m=g[o])||(m=g[o]=[],m.delegateCount=0,j.setup&&j.setup.call(a,d,p,k)!==!1||(a.addEventListener?a.addEventListener(o,k,!1):a.attachEvent&&a.attachEvent("on"+o,k))),j.add&&(j.add.call(a,l),l.handler.guid||(l.handler.guid=c.guid)),e?m.splice(m.delegateCount++,0,l):m.push(l),n.event.global[o]=!0);a=null}},remove:function(a,b,c,d,e){var 
f,g,h,i,j,k,l,m,o,p,q,r=n.hasData(a)&&n._data(a);if(r&&(k=r.events)){b=(b||"").match(G)||[""],j=b.length;while(j--)if(h=oa.exec(b[j])||[],o=q=h[1],p=(h[2]||"").split(".").sort(),o){l=n.event.special[o]||{},o=(d?l.delegateType:l.bindType)||o,m=k[o]||[],h=h[2]&&new RegExp("(^|\\.)"+p.join("\\.(?:.*\\.|)")+"(\\.|$)"),i=f=m.length;while(f--)g=m[f],!e&&q!==g.origType||c&&c.guid!==g.guid||h&&!h.test(g.namespace)||d&&d!==g.selector&&("**"!==d||!g.selector)||(m.splice(f,1),g.selector&&m.delegateCount--,l.remove&&l.remove.call(a,g));i&&!m.length&&(l.teardown&&l.teardown.call(a,p,r.handle)!==!1||n.removeEvent(a,o,r.handle),delete k[o])}else for(o in k)n.event.remove(a,o+b[j],c,d,!0);n.isEmptyObject(k)&&(delete r.handle,n._removeData(a,"events"))}},trigger:function(b,c,e,f){var g,h,i,j,l,m,o,p=[e||d],q=k.call(b,"type")?b.type:b,r=k.call(b,"namespace")?b.namespace.split("."):[];if(i=m=e=e||d,3!==e.nodeType&&8!==e.nodeType&&!na.test(q+n.event.triggered)&&(q.indexOf(".")>-1&&(r=q.split("."),q=r.shift(),r.sort()),h=q.indexOf(":")<0&&"on"+q,b=b[n.expando]?b:new n.Event(q,"object"==typeof b&&b),b.isTrigger=f?2:3,b.namespace=r.join("."),b.rnamespace=b.namespace?new RegExp("(^|\\.)"+r.join("\\.(?:.*\\.|)")+"(\\.|$)"):null,b.result=void 0,b.target||(b.target=e),c=null==c?[b]:n.makeArray(c,[b]),l=n.event.special[q]||{},f||!l.trigger||l.trigger.apply(e,c)!==!1)){if(!f&&!l.noBubble&&!n.isWindow(e)){for(j=l.delegateType||q,na.test(j+q)||(i=i.parentNode);i;i=i.parentNode)p.push(i),m=i;m===(e.ownerDocument||d)&&p.push(m.defaultView||m.parentWindow||a)}o=0;while((i=p[o++])&&!b.isPropagationStopped())b.type=o>1?j:l.bindType||q,g=(n._data(i,"events")||{})[b.type]&&n._data(i,"handle"),g&&g.apply(i,c),g=h&&i[h],g&&g.apply&&M(i)&&(b.result=g.apply(i,c),b.result===!1&&b.preventDefault());if(b.type=q,!f&&!b.isDefaultPrevented()&&(!l._default||l._default.apply(p.pop(),c)===!1)&&M(e)&&h&&e[q]&&!n.isWindow(e)){m=e[h],m&&(e[h]=null),n.event.triggered=q;try{e[q]()}catch(s){}n.event.triggered=void 0,m&&(e[h]=m)}return b.result}},dispatch:function(a){a=n.event.fix(a);var b,c,d,f,g,h=[],i=e.call(arguments),j=(n._data(this,"events")||{})[a.type]||[],k=n.event.special[a.type]||{};if(i[0]=a,a.delegateTarget=this,!k.preDispatch||k.preDispatch.call(this,a)!==!1){h=n.event.handlers.call(this,a,j),b=0;while((f=h[b++])&&!a.isPropagationStopped()){a.currentTarget=f.elem,c=0;while((g=f.handlers[c++])&&!a.isImmediatePropagationStopped())a.rnamespace&&!a.rnamespace.test(g.namespace)||(a.handleObj=g,a.data=g.data,d=((n.event.special[g.origType]||{}).handle||g.handler).apply(f.elem,i),void 0!==d&&(a.result=d)===!1&&(a.preventDefault(),a.stopPropagation()))}return k.postDispatch&&k.postDispatch.call(this,a),a.result}},handlers:function(a,b){var c,d,e,f,g=[],h=b.delegateCount,i=a.target;if(h&&i.nodeType&&("click"!==a.type||isNaN(a.button)||a.button<1))for(;i!=this;i=i.parentNode||this)if(1===i.nodeType&&(i.disabled!==!0||"click"!==a.type)){for(d=[],c=0;h>c;c++)f=b[c],e=f.selector+" ",void 0===d[e]&&(d[e]=f.needsContext?n(e,this).index(i)>-1:n.find(e,this,null,[i]).length),d[e]&&d.push(f);d.length&&g.push({elem:i,handlers:d})}return h]","i"),va=/<(?!area|br|col|embed|hr|img|input|link|meta|param)(([\w:-]+)[^>]*)\/>/gi,wa=/\s*$/g,Aa=ca(d),Ba=Aa.appendChild(d.createElement("div"));function Ca(a,b){return n.nodeName(a,"table")&&n.nodeName(11!==b.nodeType?b:b.firstChild,"tr")?a.getElementsByTagName("tbody")[0]||a.appendChild(a.ownerDocument.createElement("tbody")):a}function Da(a){return a.type=(null!==n.find.attr(a,"type"))+"/"+a.type,a}function 
Ea(a){var b=ya.exec(a.type);return b?a.type=b[1]:a.removeAttribute("type"),a}function Fa(a,b){if(1===b.nodeType&&n.hasData(a)){var c,d,e,f=n._data(a),g=n._data(b,f),h=f.events;if(h){delete g.handle,g.events={};for(c in h)for(d=0,e=h[c].length;e>d;d++)n.event.add(b,c,h[c][d])}g.data&&(g.data=n.extend({},g.data))}}function Ga(a,b){var c,d,e;if(1===b.nodeType){if(c=b.nodeName.toLowerCase(),!l.noCloneEvent&&b[n.expando]){e=n._data(b);for(d in e.events)n.removeEvent(b,d,e.handle);b.removeAttribute(n.expando)}"script"===c&&b.text!==a.text?(Da(b).text=a.text,Ea(b)):"object"===c?(b.parentNode&&(b.outerHTML=a.outerHTML),l.html5Clone&&a.innerHTML&&!n.trim(b.innerHTML)&&(b.innerHTML=a.innerHTML)):"input"===c&&Z.test(a.type)?(b.defaultChecked=b.checked=a.checked,b.value!==a.value&&(b.value=a.value)):"option"===c?b.defaultSelected=b.selected=a.defaultSelected:"input"!==c&&"textarea"!==c||(b.defaultValue=a.defaultValue)}}function Ha(a,b,c,d){b=f.apply([],b);var e,g,h,i,j,k,m=0,o=a.length,p=o-1,q=b[0],r=n.isFunction(q);if(r||o>1&&"string"==typeof q&&!l.checkClone&&xa.test(q))return a.each(function(e){var f=a.eq(e);r&&(b[0]=q.call(this,e,f.html())),Ha(f,b,c,d)});if(o&&(k=ja(b,a[0].ownerDocument,!1,a,d),e=k.firstChild,1===k.childNodes.length&&(k=e),e||d)){for(i=n.map(ea(k,"script"),Da),h=i.length;o>m;m++)g=k,m!==p&&(g=n.clone(g,!0,!0),h&&n.merge(i,ea(g,"script"))),c.call(a[m],g,m);if(h)for(j=i[i.length-1].ownerDocument,n.map(i,Ea),m=0;h>m;m++)g=i[m],_.test(g.type||"")&&!n._data(g,"globalEval")&&n.contains(j,g)&&(g.src?n._evalUrl&&n._evalUrl(g.src):n.globalEval((g.text||g.textContent||g.innerHTML||"").replace(za,"")));k=e=null}return a}function Ia(a,b,c){for(var d,e=b?n.filter(b,a):a,f=0;null!=(d=e[f]);f++)c||1!==d.nodeType||n.cleanData(ea(d)),d.parentNode&&(c&&n.contains(d.ownerDocument,d)&&fa(ea(d,"script")),d.parentNode.removeChild(d));return a}n.extend({htmlPrefilter:function(a){return a.replace(va,"<$1>")},clone:function(a,b,c){var d,e,f,g,h,i=n.contains(a.ownerDocument,a);if(l.html5Clone||n.isXMLDoc(a)||!ua.test("<"+a.nodeName+">")?f=a.cloneNode(!0):(Ba.innerHTML=a.outerHTML,Ba.removeChild(f=Ba.firstChild)),!(l.noCloneEvent&&l.noCloneChecked||1!==a.nodeType&&11!==a.nodeType||n.isXMLDoc(a)))for(d=ea(f),h=ea(a),g=0;null!=(e=h[g]);++g)d[g]&&Ga(e,d[g]);if(b)if(c)for(h=h||ea(a),d=d||ea(f),g=0;null!=(e=h[g]);g++)Fa(e,d[g]);else Fa(a,f);return d=ea(f,"script"),d.length>0&&fa(d,!i&&ea(a,"script")),d=h=e=null,f},cleanData:function(a,b){for(var d,e,f,g,h=0,i=n.expando,j=n.cache,k=l.attributes,m=n.event.special;null!=(d=a[h]);h++)if((b||M(d))&&(f=d[i],g=f&&j[f])){if(g.events)for(e in g.events)m[e]?n.event.remove(d,e):n.removeEvent(d,e,g.handle);j[f]&&(delete j[f],k||"undefined"==typeof d.removeAttribute?d[i]=void 0:d.removeAttribute(i),c.push(f))}}}),n.fn.extend({domManip:Ha,detach:function(a){return Ia(this,a,!0)},remove:function(a){return Ia(this,a)},text:function(a){return Y(this,function(a){return void 0===a?n.text(this):this.empty().append((this[0]&&this[0].ownerDocument||d).createTextNode(a))},null,a,arguments.length)},append:function(){return Ha(this,arguments,function(a){if(1===this.nodeType||11===this.nodeType||9===this.nodeType){var b=Ca(this,a);b.appendChild(a)}})},prepend:function(){return Ha(this,arguments,function(a){if(1===this.nodeType||11===this.nodeType||9===this.nodeType){var b=Ca(this,a);b.insertBefore(a,b.firstChild)}})},before:function(){return Ha(this,arguments,function(a){this.parentNode&&this.parentNode.insertBefore(a,this)})},after:function(){return 
Ha(this,arguments,function(a){this.parentNode&&this.parentNode.insertBefore(a,this.nextSibling)})},empty:function(){for(var a,b=0;null!=(a=this[b]);b++){1===a.nodeType&&n.cleanData(ea(a,!1));while(a.firstChild)a.removeChild(a.firstChild);a.options&&n.nodeName(a,"select")&&(a.options.length=0)}return this},clone:function(a,b){return a=null==a?!1:a,b=null==b?a:b,this.map(function(){return n.clone(this,a,b)})},html:function(a){return Y(this,function(a){var b=this[0]||{},c=0,d=this.length;if(void 0===a)return 1===b.nodeType?b.innerHTML.replace(ta,""):void 0;if("string"==typeof a&&!wa.test(a)&&(l.htmlSerialize||!ua.test(a))&&(l.leadingWhitespace||!aa.test(a))&&!da[($.exec(a)||["",""])[1].toLowerCase()]){a=n.htmlPrefilter(a);try{for(;d>c;c++)b=this[c]||{},1===b.nodeType&&(n.cleanData(ea(b,!1)),b.innerHTML=a);b=0}catch(e){}}b&&this.empty().append(a)},null,a,arguments.length)},replaceWith:function(){var a=[];return Ha(this,arguments,function(b){var c=this.parentNode;n.inArray(this,a)<0&&(n.cleanData(ea(this)),c&&c.replaceChild(b,this))},a)}}),n.each({appendTo:"append",prependTo:"prepend",insertBefore:"before",insertAfter:"after",replaceAll:"replaceWith"},function(a,b){n.fn[a]=function(a){for(var c,d=0,e=[],f=n(a),h=f.length-1;h>=d;d++)c=d===h?this:this.clone(!0),n(f[d])[b](c),g.apply(e,c.get());return this.pushStack(e)}});var Ja,Ka={HTML:"block",BODY:"block"};function La(a,b){var c=n(b.createElement(a)).appendTo(b.body),d=n.css(c[0],"display");return c.detach(),d}function Ma(a){var b=d,c=Ka[a];return c||(c=La(a,b),"none"!==c&&c||(Ja=(Ja||n("