diff --git a/_freeze/learn/develop/recipes/index/execute-results/html.json b/_freeze/learn/develop/recipes/index/execute-results/html.json index c8ee57cf..db3bed62 100644 --- a/_freeze/learn/develop/recipes/index/execute-results/html.json +++ b/_freeze/learn/develop/recipes/index/execute-results/html.json @@ -1,8 +1,8 @@ { - "hash": "c4103975eda3080a9e2be07fbe6e60c8", + "hash": "078d8e10dad7a4e3671411a34b088cef", "result": { "engine": "knitr", - "markdown": "---\ntitle: \"Create your own recipe step function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 1\ndescription: | \n Write a new recipe step for data preprocessing.\ntoc: true\ntoc-depth: 2\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: modeldata and tidymodels.\n\nThere are many existing recipe steps in packages like recipes, themis, textrecipes, and others. A full list of steps in CRAN packages [can be found here](/find/recipes/). However, you might need to define your own preprocessing operations; this article describes how to do that. If you are looking for good examples of existing steps, we suggest looking at the code for [centering](https://github.com/tidymodels/recipes/blob/master/R/center.R) or [PCA](https://github.com/tidymodels/recipes/blob/master/R/pca.R) to start. \n\nFor check operations (e.g. `check_class()`), the process is very similar. Notes on this are available at the end of this article. \n\nThe general process to follow is to:\n\n1. Define a step constructor function.\n\n2. Create the minimal S3 methods for `prep()`, `bake()`, and `print()`. \n\n3. Optionally add some extra methods to work with other tidymodels packages, such as `tunable()` and `tidy()`. \n\nAs an example, we will create a step for converting data into percentiles. \n\n## A new step definition\n\nLet's create a step that replaces the value of a variable with its percentile from the training set. The example data we'll use is from the modeldata package:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(modeldata)\ndata(biomass)\nstr(biomass)\n#> 'data.frame':\t536 obs. of 8 variables:\n#> $ sample : chr \"Akhrot Shell\" \"Alabama Oak Wood Waste\" \"Alder\" \"Alfalfa\" ...\n#> $ dataset : chr \"Training\" \"Training\" \"Training\" \"Training\" ...\n#> $ carbon : num 49.8 49.5 47.8 45.1 46.8 ...\n#> $ hydrogen: num 5.64 5.7 5.8 4.97 5.4 5.75 5.99 5.7 5.5 5.9 ...\n#> $ oxygen : num 42.9 41.3 46.2 35.6 40.7 ...\n#> $ nitrogen: num 0.41 0.2 0.11 3.3 1 2.04 2.68 1.7 0.8 1.2 ...\n#> $ sulfur : num 0 0 0.02 0.16 0.02 0.1 0.2 0.2 0 0.1 ...\n#> $ HHV : num 20 19.2 18.3 18.2 18.4 ...\n\nbiomass_tr <- biomass[biomass$dataset == \"Training\",]\nbiomass_te <- biomass[biomass$dataset == \"Testing\",]\n```\n:::\n\n\n\n\nTo illustrate the transformation with the `carbon` variable, note the training set distribution of this variable with a vertical line below for the first value of the test set. \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(ggplot2)\ntheme_set(theme_bw())\nggplot(biomass_tr, aes(x = carbon)) + \n geom_histogram(binwidth = 5, col = \"blue\", fill = \"blue\", alpha = .5) + \n geom_vline(xintercept = biomass_te$carbon[1], lty = 2)\n```\n\n::: {.cell-output-display}\n![](figs/carbon_dist-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n\n\nBased on the training set, 42.1% of the data are less than a value of 46.35. 
There are some applications where it might be advantageous to represent the predictor values as percentiles rather than their original values. \n\nOur new step will do this computation for any numeric variables of interest. We will call this new recipe step `step_percentiles()`. The code below is designed for illustration and not speed or best practices. We've left out a lot of error trapping that we would want in a real implementation. \n\n::: {.callout-note}\nThe step `step_percentiles()` that will be created on this page, has been implemented in recipes as [step_percentile()](https://recipes.tidymodels.org/reference/step_percentile.html).\n:::\n\n## Create the function\n\nTo start, there is a _user-facing_ function. Let's call that `step_percentiles()`. This is just a simple wrapper around a _constructor function_, which defines the rules for any step object that defines a percentile transformation. We'll call this constructor `step_percentiles_new()`. \n\nThe function `step_percentiles()` takes the same arguments as your function and simply adds it to a new recipe. The `...` signifies the variable selectors that can be used.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstep_percentiles <- function(\n recipe, \n ..., \n role = NA, \n trained = FALSE, \n ref_dist = NULL,\n options = list(probs = (0:100)/100, names = TRUE),\n skip = FALSE,\n id = rand_id(\"percentiles\")\n ) {\n\n add_step(\n recipe, \n step_percentiles_new(\n terms = enquos(...),\n trained = trained,\n role = role, \n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n )\n}\n```\n:::\n\n\n\n\nYou should always keep the first four arguments (`recipe` though `trained`) the same as listed above. Some notes:\n\n * the `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. \n * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. \n * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of \"outcomes\", these data would not be available for new samples. \n * `id` is a character string that can be used to identify steps in package code. `rand_id()` will create an ID that has the prefix and a random character sequence. \n\nWe can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentiles` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`.\n\nWe will use `stats::quantile()` to compute the grid. However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculation is done. We could use the ellipses (aka `...`) so that any options passed to `step_percentiles()` that are not one of its arguments will then be passed to `stats::quantile()`. However, we recommend making a separate list object with the options and use these inside the function because `...` is already used to define the variable selection. \n\nIt is also important to consider if there are any _main arguments_ to the step. 
For example, for spline-related steps such as `step_ns()`, users typically want to adjust the argument for the degrees of freedom in the spline (e.g. `splines::ns(x, df)`). Rather than letting users add `df` to the `options` argument: \n\n* Allow the important arguments to be main arguments to the step function. \n\n* Follow the tidymodels [conventions for naming arguments](https://tidymodels.github.io/model-implementation-principles/standardized-argument-names.html). Whenever possible, avoid jargon and keep common argument names. \n\nThere are benefits to following these principles (as shown below). \n\n## Initialize a new object\n\nNow, the constructor function can be created.\n\nThe function cascade is: \n\n```\nstep_percentiles() calls recipes::add_step()\n└──> recipes::add_step() calls step_percentiles_new()\n └──> step_percentiles_new() calls recipes::step()\n```\n\n`step()` is a general constructor for recipes that mainly makes sure that the resulting step object is a list with an appropriate S3 class structure. Using `subclass = \"percentiles\"` will set the class of new objects to `\"step_percentiles\"`. \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstep_percentiles_new <- \n function(terms, role, trained, ref_dist, options, skip, id) {\n step(\n subclass = \"percentiles\", \n terms = terms,\n role = role,\n trained = trained,\n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n }\n```\n:::\n\n\n\n\nThis constructor function should have no default argument values. Defaults should be set in the user-facing step object. \n\n## Create the `prep` method\n\nYou will need to create a new `prep()` method for your step's class. To do this, three arguments that the method should have are:\n\n```r\nfunction(x, training, info = NULL)\n```\n\nwhere\n\n * `x` will be the `step_percentiles` object,\n * `training` will be a _tibble_ that has the training set data, and\n * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either \"numeric\" or \"nominal\"), `role` (defining the variable's role), and `source` (either \"original\" or \"derived\" depending on where it originated).\n\nYou can define other arguments as well. \n\nThe first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. There is a function called `recipes_eval_select()` that can be used to obtain this. \n\n::: {.callout-warning}\n The `recipes_eval_select()` function is not one you interact with as a typical recipes user, but it is helpful if you develop your own custom recipe steps. \n:::\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprep.step_percentiles <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n # TODO finish the rest of the function\n}\n```\n:::\n\n\n\n\nAfter this function call, it is a good idea to check that the selected columns have the appropriate type (e.g. numeric for this example). See `recipes::check_type()` to do this for basic types. \n\nOnce we have this, we can save the approximation grid. 
For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nget_train_pctl <- function(x, args = NULL) {\n res <- rlang::exec(\"quantile\", x = x, !!!args)\n # Remove duplicate percentile values\n res[!duplicated(res)]\n}\n\n# For example:\nget_train_pctl(biomass_tr$carbon, list(probs = 0:1))\n#> 0% 100% \n#> 14.61 97.18\nget_train_pctl(biomass_tr$carbon)\n#> 0% 25% 50% 75% 100% \n#> 14.610 44.715 47.100 49.725 97.180\n```\n:::\n\n\n\n\nNow, the `prep()` method can be created: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprep.step_percentiles <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n check_type(training[, col_names], types = c(\"double\", \"integer\"))\n\n ## We'll use the names later so make sure they are available\n if (x$options$names == FALSE) {\n rlang::abort(\"`names` should be set to TRUE\")\n }\n \n if (!any(names(x$options) == \"probs\")) {\n x$options$probs <- (0:100)/100\n } else {\n x$options$probs <- sort(unique(x$options$probs))\n }\n \n # Compute percentile grid\n ref_dist <- purrr::map(training[, col_names], get_train_pctl, args = x$options)\n\n ## Use the constructor function to return the updated object. \n ## Note that `trained` is now set to TRUE\n \n step_percentiles_new(\n terms = x$terms, \n trained = TRUE,\n role = x$role, \n ref_dist = ref_dist,\n options = x$options,\n skip = x$skip,\n id = x$id\n )\n}\n```\n:::\n\n\n\n\nWe suggest favoring `rlang::abort()` and `rlang::warn()` over `stop()` and `warning()`. The former can be used for better traceback results.\n\n## Create the `bake` method\n\nRemember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentiles()` class. The minimum arguments for this are\n\n```r\nfunction(object, new_data, ...)\n```\n\nwhere `object` is the updated step function that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. \n\nHere is the code to convert the new data to percentiles. The input data (`x` below) comes in as a numeric vector and the output is a vector of approximate percentiles: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npctl_by_approx <- function(x, ref) {\n # In case duplicates were removed, get the percentiles from\n # the names of the reference object\n grid <- as.numeric(gsub(\"%$\", \"\", names(ref))) \n approx(x = ref, y = grid, xout = x)$y/100\n}\n```\n:::\n\n\n\n\nWe will loop over the variables one by and and apply the transformation. `check_new_data()` is used to make sure that the variables that are affected in this step are present.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbake.step_percentiles <- function(object, new_data, ...) {\n col_names <- names(object$ref_dist)\n check_new_data(col_names, object, new_data)\n\n for (col_name in col_names) {\n new_data[[col_name]] <- pctl_by_approx(\n x = new_data[[col_name]],\n ref = object$ref_dist[[col_name]]\n )\n }\n\n # new_data will be a tibble when passed to this function. 
It should also\n # be a tibble on the way out.\n new_data\n}\n```\n:::\n\n\n\n\n::: {.callout-note}\nYou need to import `recipes::prep()` and `recipes::bake()` to create your own step function in a package. \n:::\n\n## Run the example\n\nLet's use the example data to make sure that it works: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrec_obj <- \n recipe(HHV ~ ., data = biomass_tr) %>%\n step_percentiles(ends_with(\"gen\")) %>%\n prep(training = biomass_tr)\n\nbiomass_te %>% select(ends_with(\"gen\")) %>% slice(1:2)\n#> hydrogen oxygen nitrogen\n#> 1 5.67 47.20 0.30\n#> 2 5.50 48.06 2.85\nbake(rec_obj, biomass_te %>% slice(1:2), ends_with(\"gen\"))\n#> # A tibble: 2 × 3\n#> hydrogen oxygen nitrogen\n#> <dbl> <dbl> <dbl>\n#> 1 0.45 0.903 0.21 \n#> 2 0.38 0.922 0.928\n\n# Checking to get approximate result: \nmean(biomass_tr$hydrogen <= biomass_te$hydrogen[1])\n#> [1] 0.4517544\nmean(biomass_tr$oxygen <= biomass_te$oxygen[1])\n#> [1] 0.9013158\n```\n:::\n\n\n\n\nThe plot below shows how the original hydrogen percentiles line up with the estimated values:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nhydrogen_values <- \n bake(rec_obj, biomass_te, hydrogen) %>% \n bind_cols(biomass_te %>% select(original = hydrogen))\n\nggplot(biomass_tr, aes(x = hydrogen)) + \n # Plot the empirical distribution function of the \n # hydrogen training set values as a black line\n stat_ecdf() + \n # Overlay the estimated percentiles for the new data: \n geom_point(data = hydrogen_values, \n aes(x = original, y = hydrogen), \n col = \"red\", alpha = .5, cex = 2) + \n labs(x = \"New Hydrogen Values\", y = \"Percentile Based on Training Set\")\n```\n\n::: {.cell-output-display}\n![](figs/cdf_plot-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\n\n\nThese line up very nicely! \n\n## Custom check operations \n\nThe process here is exactly the same as steps; the internal functions have a similar naming convention: \n\n * `add_check()` instead of `add_step()`\n * `check()` instead of `step()`, and so on. \n \nIt is strongly recommended that:\n \n 1. The operations start with `check_` (i.e. `check_range()` and `check_range_new()`)\n 1. The check uses `rlang::abort(paste0(...))` when the conditions are not met\n 1. The original data are returned (unaltered) by the check when the conditions are satisfied. \n\n## Other step methods\n\nThere are a few other S3 methods that can be created for your step function. They are not required unless you plan on using your step in the broader tidymodels package set. \n\n### A print method\n\nIf you don't add a print method for `step_percentiles`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `print_step()` that should be useful in most cases. We are using it here for the custom print method for `step_percentiles`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint.step_percentiles <-\n function(x, width = max(20, options()$width - 35), ...) {\n title <- \"Percentile transformation on \"\n\n print_step(\n # Names after prep:\n tr_obj = names(x$ref_dist),\n # Names before prep (could be selectors)\n untr_obj = x$terms,\n # Has it been prepped? 
\n trained = x$trained,\n # What does this step do?\n title = title,\n # An estimate of how many characters to print on a line: \n width = width\n )\n invisible(x)\n }\n\n# Results before `prep()`:\nrecipe(HHV ~ ., data = biomass_tr) %>%\n step_percentiles(ends_with(\"gen\"))\n#> \n#> ── Recipe ────────────────────────────────────────────────────────────\n#> \n#> ── Inputs\n#> Number of variables by role\n#> outcome: 1\n#> predictor: 7\n#> \n#> ── Operations\n#> • Percentile transformation on: ends_with(\"gen\")\n\n# Results after `prep()`: \nrec_obj\n#> \n#> ── Recipe ────────────────────────────────────────────────────────────\n#> \n#> ── Inputs\n#> Number of variables by role\n#> outcome: 1\n#> predictor: 7\n#> \n#> ── Training information\n#> Training data contained 456 data points and no incomplete rows.\n#> \n#> ── Operations\n#> • Percentile transformation on: hydrogen oxygen, ... | Trained\n```\n:::\n\n\n\n \n### Methods for declaring required packages\n\nSome recipe steps use functions from other packages. When this is the case, the `step_*()` function should check to see if the package is installed. The function `recipes::recipes_pkg_check()` will do this. For example: \n\n```\n> recipes::recipes_pkg_check(\"some_package\")\n1 package is needed for this step and is not installed. (some_package). Start \na clean R session then run: install.packages(\"some_package\")\n```\n\nThere is an S3 method that can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrequired_pkgs.step_hypothetical <- function(x, ...) {\n c(\"hypothetical\", \"myrecipespkg\")\n}\n```\n:::\n\n\n\n\nIn this example, `myrecipespkg` is the package where the step resides (if it is in a package).\n\nThe reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters have no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. \n\nIf this S3 method is used for your step, you can rely on this for checking the installation: \n \n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrecipes::recipes_pkg_check(required_pkgs.step_hypothetical())\n#> 2 packages (hypothetical and myrecipespkg) are needed for this step\n#> but are not installed.\n#> To install run: `install.packages(c(\"hypothetical\", \"myrecipespkg\"))`\n```\n:::\n\n\n\n\nIf you'd like an example of this in a package, please take a look at the [embed](https://github.com/tidymodels/embed/) or [themis](https://github.com/tidymodels/themis/) package.\n\n### A tidy method\n\nThe `broom::tidy()` method is a means to return information about the step in a usable format. For our step, it would be helpful to know the reference values. \n\nWhen the recipe has been prepped, those data are in the list `ref_dist`. A small function can be used to reformat that data into a tibble. 
It is customary to return the main values as `value`:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nformat_pctl <- function(x) {\n tibble::tibble(\n value = unname(x),\n percentile = as.numeric(gsub(\"%$\", \"\", names(x))) \n )\n}\n\n# For example: \npctl_step_object <- rec_obj$steps[[1]]\npctl_step_object\n#> • Percentile transformation on: hydrogen and oxygen, ... | Trained\nformat_pctl(pctl_step_object$ref_dist[[\"hydrogen\"]])\n#> # A tibble: 87 × 2\n#> value percentile\n#> <dbl> <dbl>\n#> 1 0.03 0\n#> 2 0.934 1\n#> 3 1.60 2\n#> 4 2.07 3\n#> 5 2.45 4\n#> 6 2.74 5\n#> 7 3.15 6\n#> 8 3.49 7\n#> 9 3.71 8\n#> 10 3.99 9\n#> # ℹ 77 more rows\n```\n:::\n\n\n\n\nThe tidy method could return these values for each selected column. Before `prep()`, missing values can be used as placeholders. \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntidy.step_percentiles <- function(x, ...) {\n if (is_trained(x)) {\n if (length(x$ref_dist) == 0) {\n # We need to create consistant output when no variables were selected\n res <- tibble(\n terms = character(),\n value = numeric(),\n percentile = numeric()\n )\n } else {\n res <- map_dfr(x$ref_dist, format_pctl, .id = \"term\")\n }\n } else {\n term_names <- sel2char(x$terms)\n res <-\n tibble(\n terms = term_names,\n value = rlang::na_dbl,\n percentile = rlang::na_dbl\n )\n }\n # Always return the step id: \n res$id <- x$id\n res\n}\n\ntidy(rec_obj, number = 1)\n#> # A tibble: 274 × 4\n#> term value percentile id \n#> <chr> <dbl> <dbl> <chr> \n#> 1 hydrogen 0.03 0 percentiles_Bp5vK\n#> 2 hydrogen 0.934 1 percentiles_Bp5vK\n#> 3 hydrogen 1.60 2 percentiles_Bp5vK\n#> 4 hydrogen 2.07 3 percentiles_Bp5vK\n#> 5 hydrogen 2.45 4 percentiles_Bp5vK\n#> 6 hydrogen 2.74 5 percentiles_Bp5vK\n#> 7 hydrogen 3.15 6 percentiles_Bp5vK\n#> 8 hydrogen 3.49 7 percentiles_Bp5vK\n#> 9 hydrogen 3.71 8 percentiles_Bp5vK\n#> 10 hydrogen 3.99 9 percentiles_Bp5vK\n#> # ℹ 264 more rows\n```\n:::\n\n\n\n\n### Methods for tuning parameters\n\nThe tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the arguments: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nargs(step_poly)\n#> function (recipe, ..., role = \"predictor\", trained = FALSE, objects = NULL, \n#> degree = 2L, options = list(), keep_original_cols = FALSE, \n#> skip = FALSE, id = rand_id(\"poly\")) \n#> NULL\n```\n:::\n\n\n\n\nThe argument `degree` is tunable.\n\nTo work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. \n\n`tunable()` takes the step object as its argument and returns a tibble with columns: \n\n* `name`: The name of the argument. \n\n* `call_info`: A list that describes how to call a function that returns a dials parameter object. \n\n* `source`: A character string that indicates where the tuning value comes from (i.e., a model, a recipe etc.). Here, it is just `\"recipe\"`. \n\n* `component`: A character string with more information about the source. For recipes, this is just the name of the step (e.g. `\"step_poly\"`). \n\n* `component_id`: A character string to indicate where a unique identifier is for the object. 
For recipes, this is just the `id` value of the step object. \n\nThe main piece of information that requires some detail is `call_info`. This is a list column in the tibble. Each element of the list is a list that describes the package and function that can be used to create a dials parameter object. \n\nFor example, for a nearest-neighbors `neighbors` parameter, this value is just: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ninfo <- list(pkg = \"dials\", fun = \"neighbors\")\n\n# FYI: how it is used under-the-hood: \nnew_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg)\nrlang::eval_tidy(new_param_call)\n#> # Nearest Neighbors (quantitative)\n#> Range: [1, 10]\n```\n:::\n\n\n\n\nFor `step_poly()`, a dials object is needed that returns an integer that is the number of new columns to create. It turns out that there are a few different types of tuning parameters related to degree: \n\n```r\n> lsf.str(\"package:dials\", pattern = \"degree\")\ndegree : function (range = c(1, 3), trans = NULL) \ndegree_int : function (range = c(1L, 3L), trans = NULL) \nprod_degree : function (range = c(1L, 2L), trans = NULL) \nspline_degree : function (range = c(3L, 10L), trans = NULL) \n```\n\nLooking at the `range` values, some return doubles and others return integers. For our problem, `degree_int()` would be a good choice. \n\nFor `step_poly()` the `tunable()` S3 method could be: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntunable.step_poly <- function (x, ...) {\n tibble::tibble(\n name = c(\"degree\"),\n call_info = list(list(pkg = \"dials\", fun = \"degree_int\")),\n source = \"recipe\",\n component = \"step_poly\",\n component_id = x$id\n )\n}\n```\n:::\n\n\n\n\n## Session information {#session-info}\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> version R version 4.4.2 (2024-10-31)\n#> language (EN)\n#> date 2025-03-24\n#> pandoc 3.6.1\n#> quarto 1.6.42\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package version date (UTC) source\n#> broom 1.0.7 2024-09-26 CRAN (R 4.4.1)\n#> dials 1.4.0 2025-02-13 CRAN (R 4.4.2)\n#> dplyr 1.1.4 2023-11-17 CRAN (R 4.4.0)\n#> ggplot2 3.5.1 2024-04-23 CRAN (R 4.4.0)\n#> infer 1.0.7 2024-03-25 CRAN (R 4.4.0)\n#> modeldata 1.4.0 2024-06-19 CRAN (R 4.4.0)\n#> parsnip 1.3.1 2025-03-12 CRAN (R 4.4.1)\n#> purrr 1.0.4 2025-02-05 CRAN (R 4.4.1)\n#> recipes 1.2.0 2025-03-17 CRAN (R 4.4.1)\n#> rlang 1.1.5 2025-01-17 CRAN (R 4.4.2)\n#> rsample 1.2.1 2024-03-25 CRAN (R 4.4.0)\n#> tibble 3.2.1 2023-03-20 CRAN (R 4.4.0)\n#> tidymodels 1.3.0 2025-02-21 CRAN (R 4.4.1)\n#> tune 1.3.0 2025-02-21 CRAN (R 4.4.1)\n#> workflows 1.2.0 2025-02-19 CRAN (R 4.4.1)\n#> yardstick 1.3.2 2025-01-22 CRAN (R 4.4.1)\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", + "markdown": "---\ntitle: \"Create your own recipe step function\"\ncategories:\n - developer tools\ntype: learn-subsection\nweight: 1\ndescription: | \n Write a new recipe step for data preprocessing.\ntoc: true\ntoc-depth: 3\ninclude-after-body: ../../../resources.html\n---\n\n\n\n\n\n\n\n\n## Introduction\n\nTo use code in this article, you will need to install the following packages: modeldata and tidymodels.\n\nThere are many existing recipe steps in packages like recipes, themis, textrecipes, and others. A full list of steps in CRAN packages [can be found here](/find/recipes/). 
However, you might need to define your own preprocessing operations; this article describes how to do that. If you are looking for good examples of existing steps, we suggest looking at the code for [centering](https://github.com/tidymodels/recipes/blob/master/R/center.R) or [PCA](https://github.com/tidymodels/recipes/blob/master/R/pca.R) to start. \n\nFor check operations (e.g. `check_class()`), the process is very similar. Notes on this are available at the end of this article. \n\nThe general process to follow is to:\n\n1. Define a step constructor function.\n\n2. Create the minimal S3 methods for `prep()`, `bake()`, and `print()`. \n\n3. Optionally add some extra methods to work with other tidymodels packages, such as `tunable()` and `tidy()`. \n\nAs an example, we will create a step for converting data into percentiles. \n\n## Creating a new step\n\nLet's create a step that replaces the value of a variable with its percentile from the training set. The example data we'll use is from the modeldata package:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(modeldata)\ndata(biomass)\nstr(biomass)\n#> 'data.frame':\t536 obs. of 8 variables:\n#> $ sample : chr \"Akhrot Shell\" \"Alabama Oak Wood Waste\" \"Alder\" \"Alfalfa\" ...\n#> $ dataset : chr \"Training\" \"Training\" \"Training\" \"Training\" ...\n#> $ carbon : num 49.8 49.5 47.8 45.1 46.8 ...\n#> $ hydrogen: num 5.64 5.7 5.8 4.97 5.4 5.75 5.99 5.7 5.5 5.9 ...\n#> $ oxygen : num 42.9 41.3 46.2 35.6 40.7 ...\n#> $ nitrogen: num 0.41 0.2 0.11 3.3 1 2.04 2.68 1.7 0.8 1.2 ...\n#> $ sulfur : num 0 0 0.02 0.16 0.02 0.1 0.2 0.2 0 0.1 ...\n#> $ HHV : num 20 19.2 18.3 18.2 18.4 ...\n\nbiomass_tr <- biomass[biomass$dataset == \"Training\", ]\nbiomass_te <- biomass[biomass$dataset == \"Testing\", ]\n```\n:::\n\n\n\n\nTo illustrate the transformation with the `carbon` variable, note the training set distribution of this variable with a vertical line below for the first value of the test set. \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nlibrary(ggplot2)\ntheme_set(theme_bw())\nggplot(biomass_tr, aes(x = carbon)) +\n geom_histogram(binwidth = 5, col = \"blue\", fill = \"blue\", alpha = .5) +\n geom_vline(xintercept = biomass_te$carbon[1], lty = 2)\n```\n\n::: {.cell-output-display}\n![](figs/carbon_dist-1.svg){fig-align='center' width=100%}\n:::\n:::\n\n\n\n\nBased on the training set, 42.1% of the data are less than a value of 46.35. There are some applications where it might be advantageous to represent the predictor values as percentiles rather than their original values. \n\nOur new step will do this computation for any numeric variables of interest. We will call this new recipe step `step_percentiles()`. The code below is designed for illustration and not speed or best practices. We've left out a lot of error trapping that we would want in a real implementation. \n\n::: {.callout-note}\nThe step `step_percentiles()` that will be created on this page has been implemented in recipes as [step_percentile()](https://recipes.tidymodels.org/reference/step_percentile.html).\n:::\n\n### Create the user function\n\nTo start, there is a _user-facing_ function. Let's call that `step_percentiles()`. This is just a simple wrapper around a _constructor function_, which defines the rules for any step object that defines a percentile transformation. We'll call this constructor `step_percentiles_new()`. 
\n\nThe function `step_percentiles()` takes the same arguments as your function and simply adds it to a new recipe. The `...` signifies the variable selectors that can be used.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstep_percentiles <- function(\n recipe,\n ...,\n role = NA,\n trained = FALSE,\n ref_dist = NULL,\n options = list(probs = (0:100) / 100, names = TRUE),\n skip = FALSE,\n id = rand_id(\"percentiles\")\n) {\n add_step(\n recipe,\n step_percentiles_new(\n terms = enquos(...),\n trained = trained,\n role = role,\n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n )\n}\n```\n:::\n\n\n\n\nYou should always keep the first four arguments (`recipe` through `trained`) the same as listed above. Additionally, you should have a `skip` and `id` argument as well, conventionally put as the last arguments. Some notes:\n\n * The `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. \n * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. \n * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of \"outcomes\", these data would not be available for new samples.\n * `id` is a character string that can be used to identify steps in package code. `rand_id()` will create an ID that has the prefix and a random character sequence. \n\nWe can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentiles()` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`.\n\nWe will use `stats::quantile()` to compute the grid. However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculation is done. We could use the ellipses (aka `...`) so that any options passed to `step_percentiles()` that are not one of its arguments will then be passed to `stats::quantile()`. However, we recommend making a separate list object with the options and using these inside the function because `...` is already used to define the variable selection. \n\nIt is also important to consider if there are any _main arguments_ to the step. For example, for spline-related steps such as `step_ns()`, users typically want to adjust the argument for the degrees of freedom in the spline (e.g. `splines::ns(x, df)`). Rather than letting users add `df` to the `options` argument: \n\n* Allow the important arguments to be the main arguments to the step function. \n\n* Follow the tidymodels [conventions for naming arguments](https://tidymodels.github.io/model-implementation-principles/standardized-argument-names.html). Whenever possible, avoid jargon and keep common argument names. \n\nThere are benefits to following these principles (as shown below). 
\n\n### Initialize a new object\n\nNow, the constructor function can be created.\n\nThe function cascade is: \n\n```\nstep_percentiles() calls recipes::add_step()\n└─> recipes::add_step() calls step_percentiles_new()\n └─> step_percentiles_new() calls recipes::step()\n```\n\n`step()` is a general constructor for recipes that mainly makes sure that the resulting step object is a list with an appropriate S3 class structure. Using `subclass = \"percentiles\"` will set the class of new objects to `\"step_percentiles\"`. \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nstep_percentiles_new <-\n function(terms, role, trained, ref_dist, options, skip, id) {\n step(\n subclass = \"percentiles\",\n terms = terms,\n role = role,\n trained = trained,\n ref_dist = ref_dist,\n options = options,\n skip = skip,\n id = id\n )\n }\n```\n:::\n\n\n\n\nThis constructor function should have no default argument values. Defaults should be set in the user-facing step object. \n\n### Create the `prep()` method\n\nYou will need to create a new `prep()` method for your step's class. To do this, the arguments that the method should have are:\n\n```r\nfunction(x, training, info = NULL, ...)\n```\n\nwhere\n\n * `x` will be the `step_percentiles` object,\n * `training` will be a _tibble_ that has the training set data,\n * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either `\"numeric\"` or `\"nominal\"`), `role` (defining the variable's role), and `source` (either `\"original\"` or `\"derived\"` depending on where it originated), and\n * `...` is not used.\n\nThe first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. This is done using the `recipes_eval_select()` function. \n\n::: {.callout-warning}\n The `recipes_eval_select()` function is not one you interact with as a typical recipes user, but it is helpful if you develop your own custom recipe steps. \n:::\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprep.step_percentiles <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n # TODO finish the rest of the function\n}\n```\n:::\n\n\n\n\nAfter this function call, it is a good idea to check that the selected columns have the appropriate type (e.g. numeric for this example). See `recipes::check_type()` to do this for basic types. \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprep.step_percentiles <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n check_type(training[, col_names], types = c(\"double\", \"integer\"))\n # TODO finish the rest of the function\n}\n```\n:::\n\n\n\n\nOnce we have this, we are ready for the meat of the function. The purpose of the `prep()` method is to calculate and store the information needed to perform the transformations. `step_center()` stores the means of the selected variables, `step_pca()` stores the loadings, and `step_lincomb()` calculates which variables are linear combinations of each other and marks them for removal. 
Some steps don't need to calculate anything in the `prep()` method; examples include `step_arrange()` and `step_date()`.\n\nFor this step, we want to save the approximation grid. For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nget_train_pctl <- function(x, args = NULL) {\n res <- rlang::exec(\"quantile\", x = x, !!!args)\n # Remove duplicate percentile values\n res[!duplicated(res)]\n}\n\n# For example:\nget_train_pctl(biomass_tr$carbon, list(probs = 0:1))\n#> 0% 100% \n#> 14.61 97.18\nget_train_pctl(biomass_tr$carbon)\n#> 0% 25% 50% 75% 100% \n#> 14.610 44.715 47.100 49.725 97.180\n```\n:::\n\n\n\n\nNow, the `prep()` method can be created: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprep.step_percentiles <- function(x, training, info = NULL, ...) {\n col_names <- recipes_eval_select(x$terms, training, info)\n check_type(training[, col_names], types = c(\"double\", \"integer\"))\n\n # We'll use the names later so make sure they are available\n if (x$options$names == FALSE) {\n rlang::abort(\"`names` should be set to TRUE\")\n }\n\n if (!any(names(x$options) == \"probs\")) {\n x$options$probs <- (0:100) / 100\n } else {\n x$options$probs <- sort(unique(x$options$probs))\n }\n\n # Compute percentile grid\n ref_dist <- list()\n for (col_name in col_names) {\n ref_dist[[col_name]] <- get_train_pctl(training[[col_name]], args = x$options)\n }\n\n # Use the constructor function to return the updated object.\n # Note that `trained` is now set to TRUE\n\n step_percentiles_new(\n terms = x$terms,\n trained = TRUE,\n role = x$role,\n ref_dist = ref_dist,\n options = x$options,\n skip = x$skip,\n id = x$id\n )\n}\n```\n:::\n\n\n\n\n::: {.callout-tip}\nDue to the way errors are captured and rethrown in recipes, we recommend using a for-loop over `map()` or `lapply()` to go over the selected variables.\n:::\n\nWe suggest favoring `cli::cli_abort()` and `cli::cli_warn()` over `stop()` and `warning()`. The former can be used for better traceback results.\n\n### Create the `bake()` method\n\nRemember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentiles()` class. The signature is as follows.\n\n```r\nfunction(object, new_data, ...)\n```\n\nHere, `object` is the updated step object that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. \n\nHere is the code to convert the new data to percentiles. The input data (`x` below) comes in as a numeric vector and the output is a vector of approximate percentiles: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\npctl_by_approx <- function(x, ref) {\n # In case duplicates were removed, get the percentiles from\n # the names of the reference object\n grid <- as.numeric(gsub(\"%$\", \"\", names(ref)))\n approx(x = ref, y = grid, xout = x)$y / 100\n}\n```\n:::\n\n\n\n\nWe will loop over the variables one by one and apply the transformation. `check_new_data()` is used to make sure that the variables that are affected in this step are present.\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nbake.step_percentiles <- function(object, new_data, ...) 
{\n col_names <- names(object$ref_dist)\n check_new_data(col_names, object, new_data)\n\n for (col_name in col_names) {\n new_data[[col_name]] <- pctl_by_approx(\n x = new_data[[col_name]],\n ref = object$ref_dist[[col_name]]\n )\n }\n\n # new_data will be a tibble when passed to this function. It should also\n # be a tibble on the way out.\n new_data\n}\n```\n:::\n\n\n\n\n::: {.callout-note}\nYou need to import `recipes::prep()` and `recipes::bake()` to create your own step function in a package. \n:::\n\n### Verify it works\n\nLet's use the example data to make sure that it works: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrec_obj <- recipe(HHV ~ ., data = biomass_tr) %>%\n step_percentiles(ends_with(\"gen\")) %>%\n prep(training = biomass_tr)\n\nbiomass_te %>% select(ends_with(\"gen\")) %>% slice(1:2)\n#> hydrogen oxygen nitrogen\n#> 1 5.67 47.20 0.30\n#> 2 5.50 48.06 2.85\nbake(rec_obj, biomass_te %>% slice(1:2), ends_with(\"gen\"))\n#> # A tibble: 2 × 3\n#> hydrogen oxygen nitrogen\n#> <dbl> <dbl> <dbl>\n#> 1 0.45 0.903 0.21 \n#> 2 0.38 0.922 0.928\n\n# Checking to get approximate results:\nmean(biomass_tr$hydrogen <= biomass_te$hydrogen[1])\n#> [1] 0.4517544\nmean(biomass_tr$oxygen <= biomass_te$oxygen[1])\n#> [1] 0.9013158\n```\n:::\n\n\n\n\nThe plot below shows how the original hydrogen percentiles line up with the estimated values:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nhydrogen_values <- bake(rec_obj, biomass_te, hydrogen) %>%\n bind_cols(biomass_te %>% select(original = hydrogen))\n\nggplot(biomass_tr, aes(x = hydrogen)) +\n # Plot the empirical distribution function of the\n # hydrogen training set values as a black line\n stat_ecdf() +\n # Overlay the estimated percentiles for the new data:\n geom_point(\n data = hydrogen_values,\n aes(x = original, y = hydrogen),\n col = \"red\",\n alpha = .5,\n cex = 2\n ) +\n labs(x = \"New Hydrogen Values\", y = \"Percentile Based on Training Set\")\n```\n\n::: {.cell-output-display}\n![](figs/cdf_plot-1.svg){fig-align='center' width=672}\n:::\n:::\n\n\n\n\nThese line up very nicely! \n\n## Creating a new check\n\nThe process here is exactly the same as for steps; the internal functions have a similar naming convention: \n\n * `add_check()` instead of `add_step()`.\n * `check()` instead of `step()`, and so on. \n \nIt is strongly recommended that:\n \n 1. The operations start with `check_` (e.g. `check_range()` and `check_range_new()`)\n 1. The check uses `cli::cli_abort(...)` when the conditions are not met\n 1. The original data are returned (unaltered) by the check when the conditions are satisfied. \n\n## Other step methods\n\nThere are a few other S3 methods that can be created for your step function. They are not required unless you plan on using your step in the broader tidymodels package set. \n\n### A `print()` method\n\nIf you don't add a print method for `step_percentiles()`, it will still print, but as a list of (potentially large) objects that looks a bit ugly. The recipes package contains a helper function called `print_step()` that should be useful in most cases. We are using it here for the custom print method for `step_percentiles()`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. 
For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nprint.step_percentiles <-\n function(x, width = max(20, options()$width - 35), ...) {\n title <- \"Percentile transformation on \"\n\n print_step(\n # Names after prep:\n tr_obj = names(x$ref_dist),\n # Names before prep (could be selectors)\n untr_obj = x$terms,\n # Has it been prepped?\n trained = x$trained,\n # What does this step do?\n title = title,\n # An estimate of how many characters to print on a line:\n width = width\n )\n invisible(x)\n }\n\n# Results before `prep()`:\nrecipe(HHV ~ ., data = biomass_tr) %>%\n step_percentiles(ends_with(\"gen\"))\n#> \n#> ── Recipe ────────────────────────────────────────────────────────────\n#> \n#> ── Inputs\n#> Number of variables by role\n#> outcome: 1\n#> predictor: 7\n#> \n#> ── Operations\n#> • Percentile transformation on: ends_with(\"gen\")\n\n# Results after `prep()`:\nrec_obj\n#> \n#> ── Recipe ────────────────────────────────────────────────────────────\n#> \n#> ── Inputs\n#> Number of variables by role\n#> outcome: 1\n#> predictor: 7\n#> \n#> ── Training information\n#> Training data contained 456 data points and no incomplete rows.\n#> \n#> ── Operations\n#> • Percentile transformation on: hydrogen oxygen, ... | Trained\n```\n:::\n\n\n\n\n### Methods for declaring required packages\n\nSome recipe steps use functions from other packages. When this is the case, the `step_*()` function should check to see if the package is installed. The function `recipes::recipes_pkg_check()` will do this. For example: \n\n```\n> recipes::recipes_pkg_check(\"some_package\")\n1 package is needed for this step and is not installed. (some_package). Start \na clean R session then run: install.packages(\"some_package\")\n```\n\nAn S3 method can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrequired_pkgs.step_hypothetical <- function(x, ...) {\n c(\"hypothetical\", \"myrecipespkg\")\n}\n```\n:::\n\n\n\n\nIn this example, `myrecipespkg` is the package where the step resides (if it is in a package).\n\nThe reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters has no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. 
\n\nIf this S3 method is used for your step, you can rely on this for checking the installation: \n \n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nrecipes::recipes_pkg_check(required_pkgs.step_hypothetical())\n#> 2 packages (hypothetical and myrecipespkg) are needed for this step\n#> but are not installed.\n#> To install run: `install.packages(c(\"hypothetical\", \"myrecipespkg\"))`\n```\n:::\n\n\n\n\nIf you'd like an example of this in a package, please take a look at the [embed](https://github.com/tidymodels/embed/) or [themis](https://github.com/tidymodels/themis/) package.\n\n### A `tidy()` method\n\nThe `broom::tidy()` method is a means to return information about the step in a usable format. For our step, it would be helpful to know the reference values. \n\nWhen the recipe has been prepped, those data are in the list `ref_dist`. A small function can be used to reformat that data into a tibble. It is customary to return the main values as `value`:\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nformat_pctl <- function(x) {\n tibble::tibble(\n value = unname(x),\n percentile = as.numeric(gsub(\"%$\", \"\", names(x)))\n )\n}\n\n# For example:\npctl_step_object <- rec_obj$steps[[1]]\npctl_step_object\n#> • Percentile transformation on: hydrogen and oxygen, ... | Trained\nformat_pctl(pctl_step_object$ref_dist[[\"hydrogen\"]])\n#> # A tibble: 87 × 2\n#> value percentile\n#> <dbl> <dbl>\n#> 1 0.03 0\n#> 2 0.934 1\n#> 3 1.60 2\n#> 4 2.07 3\n#> 5 2.45 4\n#> 6 2.74 5\n#> 7 3.15 6\n#> 8 3.49 7\n#> 9 3.71 8\n#> 10 3.99 9\n#> # ℹ 77 more rows\n```\n:::\n\n\n\n\nThe tidy method could return these values for each selected column. Before `prep()`, missing values can be used as placeholders. \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntidy.step_percentiles <- function(x, ...) {\n if (is_trained(x)) {\n if (length(x$ref_dist) == 0) {\n # We need to create consistent output when no variables are selected\n res <- tibble(\n terms = character(),\n value = numeric(),\n percentile = numeric()\n )\n } else {\n res <- map_dfr(x$ref_dist, format_pctl, .id = \"term\")\n }\n } else {\n term_names <- sel2char(x$terms)\n res <-\n tibble(\n terms = term_names,\n value = rlang::na_dbl,\n percentile = rlang::na_dbl\n )\n }\n # Always return the step id:\n res$id <- x$id\n res\n}\n\ntidy(rec_obj, number = 1)\n#> # A tibble: 274 × 4\n#> term value percentile id \n#> <chr> <dbl> <dbl> <chr> \n#> 1 hydrogen 0.03 0 percentiles_Bp5vK\n#> 2 hydrogen 0.934 1 percentiles_Bp5vK\n#> 3 hydrogen 1.60 2 percentiles_Bp5vK\n#> 4 hydrogen 2.07 3 percentiles_Bp5vK\n#> 5 hydrogen 2.45 4 percentiles_Bp5vK\n#> 6 hydrogen 2.74 5 percentiles_Bp5vK\n#> 7 hydrogen 3.15 6 percentiles_Bp5vK\n#> 8 hydrogen 3.49 7 percentiles_Bp5vK\n#> 9 hydrogen 3.71 8 percentiles_Bp5vK\n#> 10 hydrogen 3.99 9 percentiles_Bp5vK\n#> # ℹ 264 more rows\n```\n:::\n\n\n\n\n### Methods for tuning parameters\n\nThe tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. 
Its function definition has the following arguments: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\nargs(step_poly)\n#> function (recipe, ..., role = \"predictor\", trained = FALSE, objects = NULL, \n#> degree = 2L, options = list(), keep_original_cols = FALSE, \n#> skip = FALSE, id = rand_id(\"poly\")) \n#> NULL\n```\n:::\n\n\n\n\nThe argument `degree` is tunable.\n\nTo work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how the values of those arguments should be generated. \n\n`tunable()` takes the step object as its argument and returns a tibble with columns: \n\n* `name`: The name of the argument. \n\n* `call_info`: A list that describes how to call a function that returns a dials parameter object. \n\n* `source`: A character string that indicates where the tuning value comes from (i.e., a model, a recipe etc.). Here, it is just `\"recipe\"`. \n\n* `component`: A character string with more information about the source. For recipes, this is just the name of the step (e.g. `\"step_poly\"`). \n\n* `component_id`: A character string to indicate where a unique identifier is for the object. For recipes, this is just the `id` value of the step object. \n\nThe main piece of information that requires some detail is `call_info`. This is a list column in the tibble. Each element of the list is a list that describes the package and function that can be used to create a dials parameter object. \n\nFor example, for a nearest-neighbors `neighbors` parameter, this value is just: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ninfo <- list(pkg = \"dials\", fun = \"neighbors\")\n\n# FYI: how it is used under-the-hood:\nnew_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg)\nrlang::eval_tidy(new_param_call)\n#> # Nearest Neighbors (quantitative)\n#> Range: [1, 10]\n```\n:::\n\n\n\n\nFor `step_poly()`, a dials object is needed that returns an integer that is the number of new columns to create. It turns out that there are a few different types of tuning parameters related to degree: \n\n```r\n> lsf.str(\"package:dials\", pattern = \"degree\")\ndegree : function (range = c(1, 3), trans = NULL) \ndegree_int : function (range = c(1L, 3L), trans = NULL) \nprod_degree : function (range = c(1L, 2L), trans = NULL) \nspline_degree : function (range = c(3L, 10L), trans = NULL) \n```\n\nLooking at the `range` values, some return doubles, and others return integers. For our problem, `degree_int()` would be a good choice. \n\nFor `step_poly()` the `tunable()` S3 method could be: \n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```{.r .cell-code}\ntunable.step_poly <- function(x, ...) 
{\n tibble::tibble(\n name = c(\"degree\"),\n call_info = list(list(pkg = \"dials\", fun = \"degree_int\")),\n source = \"recipe\",\n component = \"step_poly\",\n component_id = x$id\n )\n}\n```\n:::\n\n\n\n\n## Session information {#session-info}\n\n\n\n\n::: {.cell layout-align=\"center\"}\n\n```\n#> ─ Session info ─────────────────────────────────────────────────────\n#> version R version 4.4.2 (2024-10-31)\n#> language (EN)\n#> date 2025-03-27\n#> pandoc 3.6.1\n#> quarto 1.6.42\n#> \n#> ─ Packages ─────────────────────────────────────────────────────────\n#> package version date (UTC) source\n#> broom 1.0.7 2024-09-26 CRAN (R 4.4.1)\n#> dials 1.4.0 2025-02-13 CRAN (R 4.4.2)\n#> dplyr 1.1.4 2023-11-17 CRAN (R 4.4.0)\n#> ggplot2 3.5.1 2024-04-23 CRAN (R 4.4.0)\n#> infer 1.0.7 2024-03-25 CRAN (R 4.4.0)\n#> modeldata 1.4.0 2024-06-19 CRAN (R 4.4.0)\n#> parsnip 1.3.1 2025-03-12 CRAN (R 4.4.1)\n#> purrr 1.0.4 2025-02-05 CRAN (R 4.4.1)\n#> recipes 1.2.1 2025-03-25 CRAN (R 4.4.1)\n#> rlang 1.1.5 2025-01-17 CRAN (R 4.4.2)\n#> rsample 1.2.1 2024-03-25 CRAN (R 4.4.0)\n#> tibble 3.2.1 2023-03-20 CRAN (R 4.4.0)\n#> tidymodels 1.3.0 2025-02-21 CRAN (R 4.4.1)\n#> tune 1.3.0 2025-02-21 CRAN (R 4.4.1)\n#> workflows 1.2.0 2025-02-19 CRAN (R 4.4.1)\n#> yardstick 1.3.2 2025-01-22 CRAN (R 4.4.1)\n#> \n#> ────────────────────────────────────────────────────────────────────\n```\n:::\n", "supporting": [], "filters": [ "rmarkdown/pagebreak.lua" diff --git a/learn/develop/recipes/index.html.md b/learn/develop/recipes/index.html.md index 42b9cf41..de39add6 100644 --- a/learn/develop/recipes/index.html.md +++ b/learn/develop/recipes/index.html.md @@ -5,9 +5,9 @@ categories: type: learn-subsection weight: 1 description: | - Write a new recipe step for data preprocessing. + Write a new recipe step for data preprocessing. toc: true -toc-depth: 2 +toc-depth: 3 include-after-body: ../../../resources.html --- @@ -29,7 +29,7 @@ The general process to follow is to: As an example, we will create a step for converting data into percentiles. -## A new step definition +## Creating a new step Let's create a step that replaces the value of a variable with its percentile from the training set. The example data we'll use is from the modeldata package: @@ -49,8 +49,8 @@ str(biomass) #> $ sulfur : num 0 0 0.02 0.16 0.02 0.1 0.2 0.2 0 0.1 ... #> $ HHV : num 20 19.2 18.3 18.2 18.4 ... -biomass_tr <- biomass[biomass$dataset == "Training",] -biomass_te <- biomass[biomass$dataset == "Testing",] +biomass_tr <- biomass[biomass$dataset == "Training", ] +biomass_te <- biomass[biomass$dataset == "Testing", ] ``` ::: @@ -61,8 +61,8 @@ To illustrate the transformation with the `carbon` variable, note the training s ```{.r .cell-code} library(ggplot2) theme_set(theme_bw()) -ggplot(biomass_tr, aes(x = carbon)) + - geom_histogram(binwidth = 5, col = "blue", fill = "blue", alpha = .5) + +ggplot(biomass_tr, aes(x = carbon)) + + geom_histogram(binwidth = 5, col = "blue", fill = "blue", alpha = .5) + geom_vline(xintercept = biomass_te$carbon[1], lty = 2) ``` @@ -79,7 +79,7 @@ Our new step will do this computation for any numeric variables of interest. We The step `step_percentiles()` that will be created on this page, has been implemented in recipes as [step_percentile()](https://recipes.tidymodels.org/reference/step_percentile.html). ::: -## Create the function +### Create the user function To start, there is a _user-facing_ function. Let's call that `step_percentiles()`. 
This is just a simple wrapper around a _constructor function_, which defines the rules for any step object that defines a percentile transformation. We'll call this constructor `step_percentiles_new()`. @@ -89,52 +89,51 @@ The function `step_percentiles()` takes the same arguments as your function and ```{.r .cell-code} step_percentiles <- function( - recipe, - ..., - role = NA, - trained = FALSE, - ref_dist = NULL, - options = list(probs = (0:100)/100, names = TRUE), - skip = FALSE, - id = rand_id("percentiles") - ) { - + recipe, + ..., + role = NA, + trained = FALSE, + ref_dist = NULL, + options = list(probs = (0:100) / 100, names = TRUE), + skip = FALSE, + id = rand_id("percentiles") +) { add_step( - recipe, + recipe, step_percentiles_new( - terms = enquos(...), - trained = trained, - role = role, - ref_dist = ref_dist, - options = options, - skip = skip, - id = id - ) - ) + terms = enquos(...), + trained = trained, + role = role, + ref_dist = ref_dist, + options = options, + skip = skip, + id = id + ) + ) } ``` ::: -You should always keep the first four arguments (`recipe` though `trained`) the same as listed above. Some notes: +You should always keep the first four arguments (`recipe` through `trained`) the same as listed above. You should also include `skip` and `id` arguments, conventionally placed last. Some notes: - * the `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. + * The `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing, and using `role = NA` will leave the existing role intact. * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. - * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of "outcomes", these data would not be available for new samples. + * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of "outcomes", these data would not be available for new samples. * `id` is a character string that can be used to identify steps in package code. `rand_id()` will create an ID that has the prefix and a random character sequence. -We can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentiles` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`. +We can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentiles()` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`. We will use `stats::quantile()` to compute the grid. 
However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculation is done. We could use the ellipses (aka `...`) so that any options passed to `step_percentiles()` that are not one of its arguments will then be passed to `stats::quantile()`. However, we recommend making a separate list object with the options and using them inside the function because `...` is already used to define the variable selection. It is also important to consider if there are any _main arguments_ to the step. For example, for spline-related steps such as `step_ns()`, users typically want to adjust the argument for the degrees of freedom in the spline (e.g. `splines::ns(x, df)`). Rather than letting users add `df` to the `options` argument: -* Allow the important arguments to be main arguments to the step function. +* Allow the important arguments to be the main arguments to the step function. * Follow the tidymodels [conventions for naming arguments](https://tidymodels.github.io/model-implementation-principles/standardized-argument-names.html). Whenever possible, avoid jargon and keep common argument names. There are benefits to following these principles (as shown below). -## Initialize a new object +### Initialize a new object Now, the constructor function can be created. @@ -142,8 +141,8 @@ The function cascade is: ``` step_percentiles() calls recipes::add_step() -└──> recipes::add_step() calls step_percentiles_new() - └──> step_percentiles_new() calls recipes::step() +└─> recipes::add_step() calls step_percentiles_new() + └─> step_percentiles_new() calls recipes::step() ``` `step()` is a general constructor for recipes that mainly makes sure that the resulting step object is a list with an appropriate S3 class structure. Using `subclass = "percentiles"` will set the class of new objects to `"step_percentiles"`. @@ -151,41 +150,40 @@ step_percentiles() calls recipes::add_step() ::: {.cell layout-align="center"} ```{.r .cell-code} -step_percentiles_new <- +step_percentiles_new <- function(terms, role, trained, ref_dist, options, skip, id) { step( - subclass = "percentiles", - terms = terms, - role = role, - trained = trained, - ref_dist = ref_dist, - options = options, - skip = skip, - id = id - ) - } + subclass = "percentiles", + terms = terms, + role = role, + trained = trained, + ref_dist = ref_dist, + options = options, + skip = skip, + id = id + ) + } ``` ::: This constructor function should have no default argument values. Defaults should be set in the user-facing step object. -## Create the `prep` method +### Create the `prep()` method You will need to create a new `prep()` method for your step's class. To do this, the arguments that the method should have are: ```r -function(x, training, info = NULL) +function(x, training, info = NULL, ...) ``` where * `x` will be the `step_percentiles` object, * `training` will be a _tibble_ that has the training set data, and - * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either "numeric" or "nominal"), `role` (defining the variable's role), and `source` (either "original" or "derived" depending on where it originated). - -You can define other arguments as well. 
+ * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either `"numeric"` or `"nominal"`), `role` (defining the variable's role), and `source` (either `"original"` or `"derived"` depending on where it originated). + * `...` is not used by this method but is required for compatibility with the `prep()` generic. -The first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. There is a function called `recipes_eval_select()` that can be used to obtain this. +The first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. This is done using the `recipes_eval_select()` function. ::: {.callout-warning} The `recipes_eval_select()` function is not one you interact with as a typical recipes user, but it is helpful if you develop your own custom recipe steps. @@ -195,7 +193,7 @@ The first thing that you might want to do in the `prep()` function is to transla ```{.r .cell-code} prep.step_percentiles <- function(x, training, info = NULL, ...) { - col_names <- recipes_eval_select(x$terms, training, info) + col_names <- recipes_eval_select(x$terms, training, info) # TODO finish the rest of the function } ``` ::: @@ -203,15 +201,28 @@ After this function call, it is a good idea to check that the selected columns have the appropriate type (e.g. numeric for this example). See `recipes::check_type()` to do this for basic types. -Once we have this, we can save the approximation grid. For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: +::: {.cell layout-align="center"} + +```{.r .cell-code} +prep.step_percentiles <- function(x, training, info = NULL, ...) { + col_names <- recipes_eval_select(x$terms, training, info) + check_type(training[, col_names], types = c("double", "integer")) + # TODO finish the rest of the function +} +``` +::: + +Once we have this, we are ready for the meat of the function. The purpose of the `prep()` method is to calculate and store the information needed to perform the transformations. `step_center()` stores the means of the selected variables, `step_pca()` stores the loadings, and `step_lincomb()` calculates which variables are linear combinations of each other and marks them for removal. Some steps don't need to calculate anything in the `prep()` method; examples include `step_arrange()` and `step_date()`. + +For this step, we want to save the approximation grid. For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: ::: {.cell layout-align="center"} ```{.r .cell-code} get_train_pctl <- function(x, args = NULL) { - res <- rlang::exec("quantile", x = x, !!!args) + res <- rlang::exec("quantile", x = x, !!!args) # Remove duplicate percentile values - res[!duplicated(res)] + res[!duplicated(res)] } # For example: @@ -230,50 +241,57 @@ Now, the `prep()` method can be created: ::: {.cell layout-align="center"} ```{.r .cell-code} prep.step_percentiles <- function(x, training, info = NULL, ...) 
{ - col_names <- recipes_eval_select(x$terms, training, info) + col_names <- recipes_eval_select(x$terms, training, info) check_type(training[, col_names], types = c("double", "integer")) - ## We'll use the names later so make sure they are available - if (x$options$names == FALSE) { - rlang::abort("`names` should be set to TRUE") - } - - if (!any(names(x$options) == "probs")) { - x$options$probs <- (0:100)/100 - } else { - x$options$probs <- sort(unique(x$options$probs)) - } - + # We'll use the names later so make sure they are available + if (x$options$names == FALSE) { + rlang::abort("`names` should be set to TRUE") + } + + if (!any(names(x$options) == "probs")) { + x$options$probs <- (0:100) / 100 + } else { + x$options$probs <- sort(unique(x$options$probs)) + } + # Compute percentile grid - ref_dist <- purrr::map(training[, col_names], get_train_pctl, args = x$options) + ref_dist <- list() + for (col_name in col_names) { + ref_dist[[col_name]] <- get_train_pctl(training[[col_name]], args = x$options) + } + + # Use the constructor function to return the updated object. + # Note that `trained` is now set to TRUE - ## Use the constructor function to return the updated object. - ## Note that `trained` is now set to TRUE - step_percentiles_new( - terms = x$terms, - trained = TRUE, - role = x$role, - ref_dist = ref_dist, - options = x$options, - skip = x$skip, - id = x$id - ) + terms = x$terms, + trained = TRUE, + role = x$role, + ref_dist = ref_dist, + options = x$options, + skip = x$skip, + id = x$id + ) } ``` ::: -We suggest favoring `rlang::abort()` and `rlang::warn()` over `stop()` and `warning()`. The former can be used for better traceback results. +::: {.callout-tip} +Due to the way errors are captured and rethrown in recipes, we recommend using a for-loop over `map()` or `lapply()` to go over the selected variables. +::: + +We suggest favoring `cli::cli_abort()` and `cli::cli_warn()` over `stop()` and `warning()`. The former give better tracebacks. -## Create the `bake` method +### Create the `bake()` method -Remember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentiles()` class. The minimum arguments for this are +Remember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentiles()` class. The signature is as follows. ```r function(object, new_data, ...) ``` -where `object` is the updated step function that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. +Where `object` is the updated step object that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. Here is the code to convert the new data to percentiles. The input data (`x` below) comes in as a numeric vector and the output is a vector of approximate percentiles: @@ -283,8 +301,8 @@ Here is the code to convert the new data to percentiles. 
The input data (`x` bel pctl_by_approx <- function(x, ref) { # In case duplicates were removed, get the percentiles from # the names of the reference object - grid <- as.numeric(gsub("%$", "", names(ref))) - approx(x = ref, y = grid, xout = x)$y/100 + grid <- as.numeric(gsub("%$", "", names(ref))) + approx(x = ref, y = grid, xout = x)$y / 100 } ``` ::: @@ -295,19 +313,19 @@ We will loop over the variables one by one and apply the transformation. `check_ ::: {.cell layout-align="center"} ```{.r .cell-code} bake.step_percentiles <- function(object, new_data, ...) { - col_names <- names(object$ref_dist) + col_names <- names(object$ref_dist) check_new_data(col_names, object, new_data) - for (col_name in col_names) { - new_data[[col_name]] <- pctl_by_approx( - x = new_data[[col_name]], - ref = object$ref_dist[[col_name]] - ) - } + for (col_name in col_names) { + new_data[[col_name]] <- pctl_by_approx( + x = new_data[[col_name]], + ref = object$ref_dist[[col_name]] + ) + } # new_data will be a tibble when passed to this function. It should also # be a tibble on the way out. - new_data + new_data } ``` ::: @@ -316,15 +334,14 @@ bake.step_percentiles <- function(object, new_data, ...) { You need to import `recipes::prep()` and `recipes::bake()` to create your own step function in a package. ::: -## Run the example +### Verify it works Let's use the example data to make sure that it works: ::: {.cell layout-align="center"} ```{.r .cell-code} -rec_obj <- - recipe(HHV ~ ., data = biomass_tr) %>% +rec_obj <- recipe(HHV ~ ., data = biomass_tr) %>% step_percentiles(ends_with("gen")) %>% prep(training = biomass_tr) @@ -339,10 +356,10 @@ bake(rec_obj, biomass_te %>% slice(1:2), ends_with("gen")) #> 1 0.45 0.903 0.21 #> 2 0.38 0.922 0.928 -# Checking to get approximate result: +# Checking to get approximate results: mean(biomass_tr$hydrogen <= biomass_te$hydrogen[1]) #> [1] 0.4517544 -mean(biomass_tr$oxygen <= biomass_te$oxygen[1]) +mean(biomass_tr$oxygen <= biomass_te$oxygen[1]) #> [1] 0.9013158 ``` ::: @@ -352,18 +369,21 @@ The plot below shows how the original hydrogen percentiles line up with the esti ::: {.cell layout-align="center"} ```{.r .cell-code} -hydrogen_values <- - bake(rec_obj, biomass_te, hydrogen) %>% +hydrogen_values <- bake(rec_obj, biomass_te, hydrogen) %>% bind_cols(biomass_te %>% select(original = hydrogen)) -ggplot(biomass_tr, aes(x = hydrogen)) + - # Plot the empirical distribution function of the +ggplot(biomass_tr, aes(x = hydrogen)) + + # Plot the empirical distribution function of the # hydrogen training set values as a black line - stat_ecdf() + - # Overlay the estimated percentiles for the new data: - geom_point(data = hydrogen_values, - aes(x = original, y = hydrogen), - col = "red", alpha = .5, cex = 2) + + stat_ecdf() + + # Overlay the estimated percentiles for the new data: + geom_point( + data = hydrogen_values, + aes(x = original, y = hydrogen), + col = "red", + alpha = .5, + cex = 2 + ) + labs(x = "New Hydrogen Values", y = "Percentile Based on Training Set") ``` @@ -374,48 +394,48 @@ ggplot(biomass_tr, aes(x = hydrogen)) + These line up very nicely! -## Custom check operations +## Creating a new check -The process here is exactly the same as steps; the internal functions have a similar naming convention: +The process here is exactly the same as for steps; the internal functions have a similar naming convention: - * `add_check()` instead of `add_step()` + * `add_check()` instead of `add_step()`. * `check()` instead of `step()`, and so on. It is strongly recommended that: 1. 
The operations start with `check_` (i.e. `check_range()` and `check_range_new()`) - 1. The check uses `rlang::abort(paste0(...))` when the conditions are not met + 1. The check uses `cli::cli_abort(...)` when the conditions are not met 1. The original data are returned (unaltered) by the check when the conditions are satisfied. ## Other step methods There are a few other S3 methods that can be created for your step function. They are not required unless you plan on using your step in the broader tidymodels package set. -### A print method +### A `print()` method -If you don't add a print method for `step_percentiles`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `print_step()` that should be useful in most cases. We are using it here for the custom print method for `step_percentiles`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: +If you don't add a print method for `step_percentiles()`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `print_step()` that should be useful in most cases. We are using it here for the custom print method for `step_percentiles()`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: ::: {.cell layout-align="center"} ```{.r .cell-code} print.step_percentiles <- function(x, width = max(20, options()$width - 35), ...) { - title <- "Percentile transformation on " + title <- "Percentile transformation on " print_step( # Names after prep: - tr_obj = names(x$ref_dist), + tr_obj = names(x$ref_dist), # Names before prep (could be selectors) - untr_obj = x$terms, - # Has it been prepped? - trained = x$trained, + untr_obj = x$terms, + # Has it been prepped? + trained = x$trained, # What does this step do? - title = title, - # An estimate of how many characters to print on a line: - width = width - ) + title = title, + # An estimate of how many characters to print on a line: + width = width + ) invisible(x) - } + } # Results before `prep()`: recipe(HHV ~ ., data = biomass_tr) %>% @@ -431,7 +451,7 @@ recipe(HHV ~ ., data = biomass_tr) %>% #> ── Operations #> • Percentile transformation on: ends_with("gen") -# Results after `prep()`: +# Results after `prep()`: rec_obj #> #> ── Recipe ──────────────────────────────────────────────────────────── @@ -449,7 +469,6 @@ rec_obj ``` ::: - ### Methods for declaring required packages Some recipe steps use functions from other packages. When this is the case, the `step_*()` function should check to see if the package is installed. The function `recipes::recipes_pkg_check()` will do this. For example: @@ -460,7 +479,7 @@ Some recipe steps use functions from other packages. When this is the case, the a clean R session then run: install.packages("some_package") ``` -There is an S3 method that can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: +An S3 method can be used to declare what packages should be loaded when using the step. 
For a hypothetical step that relies on the `hypothetical` package, this might look like: ::: {.cell layout-align="center"} @@ -473,7 +492,7 @@ required_pkgs.step_hypothetical <- function(x, ...) { In this example, `myrecipespkg` is the package where the step resides (if it is in a package). -The reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters have no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. +The reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters has no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. If this S3 method is used for your step, you can rely on this for checking the installation: @@ -490,7 +509,7 @@ recipes::recipes_pkg_check(required_pkgs.step_hypothetical()) If you'd like an example of this in a package, please take a look at the [embed](https://github.com/tidymodels/embed/) or [themis](https://github.com/tidymodels/themis/) package. -### A tidy method +### A `tidy()` method The `broom::tidy()` method is a means to return information about the step in a usable format. For our step, it would be helpful to know the reference values. @@ -500,13 +519,13 @@ When the recipe has been prepped, those data are in the list `ref_dist`. A small ```{.r .cell-code} format_pctl <- function(x) { - tibble::tibble( - value = unname(x), - percentile = as.numeric(gsub("%$", "", names(x))) - ) + tibble::tibble( + value = unname(x), + percentile = as.numeric(gsub("%$", "", names(x))) + ) } -# For example: +# For example: pctl_step_object <- rec_obj$steps[[1]] pctl_step_object #> • Percentile transformation on: hydrogen and oxygen, ... | Trained @@ -534,29 +553,29 @@ The tidy method could return these values for each selected column. Before `prep ```{.r .cell-code} tidy.step_percentiles <- function(x, ...) 
{ - if (is_trained(x)) { - if (length(x$ref_dist) == 0) { - # We need to create consistant output when no variables were selected - res <- tibble( - terms = character(), - value = numeric(), - percentile = numeric() - ) - } else { - res <- map_dfr(x$ref_dist, format_pctl, .id = "term") - } - } else { - term_names <- sel2char(x$terms) - res <- + if (is_trained(x)) { + if (length(x$ref_dist) == 0) { + # We need to create consistent output when no variables are selected + res <- tibble( + terms = character(), + value = numeric(), + percentile = numeric() + ) + } else { + res <- map_dfr(x$ref_dist, format_pctl, .id = "term") + } + } else { + term_names <- sel2char(x$terms) + res <- tibble( - terms = term_names, - value = rlang::na_dbl, - percentile = rlang::na_dbl - ) - } - # Always return the step id: - res$id <- x$id - res + terms = term_names, + value = rlang::na_dbl, + percentile = rlang::na_dbl + ) + } + # Always return the step id: + res$id <- x$id + res } tidy(rec_obj, number = 1) @@ -579,7 +598,7 @@ tidy(rec_obj, number = 1) ### Methods for tuning parameters -The tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the arguments: +The tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the following arguments: ::: {.cell layout-align="center"} @@ -594,7 +613,7 @@ args(step_poly) The argument `degree` is tunable. -To work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. +To work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how the values of those arguments should be generated. `tunable()` takes the step object as its argument and returns a tibble with columns: @@ -617,7 +636,7 @@ For example, for a nearest-neighbors `neighbors` parameter, this value is just: ```{.r .cell-code} info <- list(pkg = "dials", fun = "neighbors") -# FYI: how it is used under-the-hood: +# FYI: how it is used under-the-hood: new_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg) rlang::eval_tidy(new_param_call) #> # Nearest Neighbors (quantitative) @@ -635,21 +654,21 @@ prod_degree : function (range = c(1L, 2L), trans = NULL) spline_degree : function (range = c(3L, 10L), trans = NULL) ``` -Looking at the `range` values, some return doubles and others return integers. For our problem, `degree_int()` would be a good choice. +Looking at the `range` values, some return doubles, and others return integers. For our problem, `degree_int()` would be a good choice. For `step_poly()` the `tunable()` S3 method could be: ::: {.cell layout-align="center"} ```{.r .cell-code} -tunable.step_poly <- function (x, ...) 
{ - tibble::tibble( - name = c("degree"), - call_info = list(list(pkg = "dials", fun = "degree_int")), - source = "recipe", - component = "step_poly", - component_id = x$id - ) +tunable.step_poly <- function(x, ...) { + tibble::tibble( + name = c("degree"), + call_info = list(list(pkg = "dials", fun = "degree_int")), + source = "recipe", + component = "step_poly", + component_id = x$id + ) } ``` ::: @@ -662,7 +681,7 @@ tunable.step_poly <- function (x, ...) { #> ─ Session info ───────────────────────────────────────────────────── #> version R version 4.4.2 (2024-10-31) #> language (EN) -#> date 2025-03-24 +#> date 2025-03-27 #> pandoc 3.6.1 #> quarto 1.6.42 #> @@ -676,7 +695,7 @@ tunable.step_poly <- function (x, ...) { #> modeldata 1.4.0 2024-06-19 CRAN (R 4.4.0) #> parsnip 1.3.1 2025-03-12 CRAN (R 4.4.1) #> purrr 1.0.4 2025-02-05 CRAN (R 4.4.1) -#> recipes 1.2.0 2025-03-17 CRAN (R 4.4.1) +#> recipes 1.2.1 2025-03-25 CRAN (R 4.4.1) #> rlang 1.1.5 2025-01-17 CRAN (R 4.4.2) #> rsample 1.2.1 2024-03-25 CRAN (R 4.4.0) #> tibble 3.2.1 2023-03-20 CRAN (R 4.4.0) diff --git a/learn/develop/recipes/index.qmd b/learn/develop/recipes/index.qmd index e15e875c..583f32f2 100644 --- a/learn/develop/recipes/index.qmd +++ b/learn/develop/recipes/index.qmd @@ -5,9 +5,9 @@ categories: type: learn-subsection weight: 1 description: | - Write a new recipe step for data preprocessing. + Write a new recipe step for data preprocessing. toc: true -toc-depth: 2 +toc-depth: 3 include-after-body: ../../../resources.html --- @@ -47,7 +47,7 @@ The general process to follow is to: As an example, we will create a step for converting data into percentiles. -## A new step definition +## Creating a new step Let's create a step that replaces the value of a variable with its percentile from the training set. The example data we'll use is from the modeldata package: @@ -57,8 +57,8 @@ library(modeldata) data(biomass) str(biomass) -biomass_tr <- biomass[biomass$dataset == "Training",] -biomass_te <- biomass[biomass$dataset == "Testing",] +biomass_tr <- biomass[biomass$dataset == "Training", ] +biomass_te <- biomass[biomass$dataset == "Testing", ] ``` To illustrate the transformation with the `carbon` variable, note the training set distribution of this variable with a vertical line below for the first value of the test set. @@ -70,8 +70,8 @@ To illustrate the transformation with the `carbon` variable, note the training s #| out-width: "100%" library(ggplot2) theme_set(theme_bw()) -ggplot(biomass_tr, aes(x = carbon)) + - geom_histogram(binwidth = 5, col = "blue", fill = "blue", alpha = .5) + +ggplot(biomass_tr, aes(x = carbon)) + + geom_histogram(binwidth = 5, col = "blue", fill = "blue", alpha = .5) + geom_vline(xintercept = biomass_te$carbon[1], lty = 2) ``` @@ -83,7 +83,7 @@ Our new step will do this computation for any numeric variables of interest. We The step `step_percentiles()` that will be created on this page, has been implemented in recipes as [step_percentile()](https://recipes.tidymodels.org/reference/step_percentile.html). ::: -## Create the function +### Create the user function To start, there is a _user-facing_ function. Let's call that `step_percentiles()`. This is just a simple wrapper around a _constructor function_, which defines the rules for any step object that defines a percentile transformation. We'll call this constructor `step_percentiles_new()`. 
@@ -92,51 +92,50 @@ The function `step_percentiles()` takes the same arguments as your function and ```{r} #| label: "initial_def" step_percentiles <- function( - recipe, - ..., - role = NA, - trained = FALSE, - ref_dist = NULL, - options = list(probs = (0:100)/100, names = TRUE), - skip = FALSE, - id = rand_id("percentiles") - ) { - + recipe, + ..., + role = NA, + trained = FALSE, + ref_dist = NULL, + options = list(probs = (0:100) / 100, names = TRUE), + skip = FALSE, + id = rand_id("percentiles") +) { add_step( - recipe, + recipe, step_percentiles_new( - terms = enquos(...), - trained = trained, - role = role, - ref_dist = ref_dist, - options = options, - skip = skip, - id = id - ) - ) + terms = enquos(...), + trained = trained, + role = role, + ref_dist = ref_dist, + options = options, + skip = skip, + id = id + ) + ) } ``` -You should always keep the first four arguments (`recipe` though `trained`) the same as listed above. Some notes: +You should always keep the first four arguments (`recipe` through `trained`) the same as listed above. You should also include `skip` and `id` arguments, conventionally placed last. Some notes: - * the `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing and using `role = NA` will leave the existing role intact. + * The `role` argument is used when you either 1) create new variables and want their role to be pre-set or 2) replace the existing variables with new values. The latter is what we will be doing, and using `role = NA` will leave the existing role intact. * `trained` is set by the package when the estimation step has been run. You should default your function definition's argument to `FALSE`. - * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of "outcomes", these data would not be available for new samples. + * `skip` is a logical. Whenever a recipe is prepped, each step is trained and then baked. However, there are some steps that should not be applied when a call to `bake()` is used. For example, if a step is applied to the variables with roles of "outcomes", these data would not be available for new samples. * `id` is a character string that can be used to identify steps in package code. `rand_id()` will create an ID that has the prefix and a random character sequence. -We can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentiles` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`. +We can estimate the percentiles of new data points based on the percentiles from the training set with `approx()`. Our `step_percentiles()` contains a `ref_dist` object to store these percentiles (pre-computed from the training set in `prep()`) for later use in `bake()`. We will use `stats::quantile()` to compute the grid. However, we might also want to have control over the granularity of this grid, so the `options` argument will be used to define how that calculation is done. We could use the ellipses (aka `...`) so that any options passed to `step_percentiles()` that are not one of its arguments will then be passed to `stats::quantile()`. 
However, we recommend making a separate list object with the options and using them inside the function because `...` is already used to define the variable selection. It is also important to consider if there are any _main arguments_ to the step. For example, for spline-related steps such as `step_ns()`, users typically want to adjust the argument for the degrees of freedom in the spline (e.g. `splines::ns(x, df)`). Rather than letting users add `df` to the `options` argument: -* Allow the important arguments to be main arguments to the step function. +* Allow the important arguments to be the main arguments to the step function. * Follow the tidymodels [conventions for naming arguments](https://tidymodels.github.io/model-implementation-principles/standardized-argument-names.html). Whenever possible, avoid jargon and keep common argument names. There are benefits to following these principles (as shown below). -## Initialize a new object +### Initialize a new object Now, the constructor function can be created. @@ -144,48 +143,47 @@ The function cascade is: ``` step_percentiles() calls recipes::add_step() -└──> recipes::add_step() calls step_percentiles_new() - └──> step_percentiles_new() calls recipes::step() +└─> recipes::add_step() calls step_percentiles_new() + └─> step_percentiles_new() calls recipes::step() ``` `step()` is a general constructor for recipes that mainly makes sure that the resulting step object is a list with an appropriate S3 class structure. Using `subclass = "percentiles"` will set the class of new objects to `"step_percentiles"`. ```{r} #| label: "initialize" -step_percentiles_new <- +step_percentiles_new <- function(terms, role, trained, ref_dist, options, skip, id) { step( - subclass = "percentiles", - terms = terms, - role = role, - trained = trained, - ref_dist = ref_dist, - options = options, - skip = skip, - id = id - ) - } + subclass = "percentiles", + terms = terms, + role = role, + trained = trained, + ref_dist = ref_dist, + options = options, + skip = skip, + id = id + ) + } ``` This constructor function should have no default argument values. Defaults should be set in the user-facing step object. -## Create the `prep` method +### Create the `prep()` method You will need to create a new `prep()` method for your step's class. To do this, the arguments that the method should have are: ```r -function(x, training, info = NULL) +function(x, training, info = NULL, ...) ``` where * `x` will be the `step_percentiles` object, * `training` will be a _tibble_ that has the training set data, and - * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either "numeric" or "nominal"), `role` (defining the variable's role), and `source` (either "original" or "derived" depending on where it originated). - -You can define other arguments as well. + * `info` will also be a tibble that has information on the current set of data available. This information is updated as each step is evaluated by its specific `prep()` method so it may not have the variables from the original data. The columns in this tibble are `variable` (the variable name), `type` (currently either `"numeric"` or `"nominal"`), `role` (defining the variable's role), and `source` (either `"original"` or `"derived"` depending on where it originated). 
+ * `...` is not used by this method but is required for compatibility with the `prep()` generic. -The first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. There is a function called `recipes_eval_select()` that can be used to obtain this. +The first thing that you might want to do in the `prep()` function is to translate the specification listed in the `terms` argument to column names in the current data. This is done using the `recipes_eval_select()` function. ::: {.callout-warning} The `recipes_eval_select()` function is not one you interact with as a typical recipes user, but it is helpful if you develop your own custom recipe steps. @@ -195,21 +193,33 @@ The first thing that you might want to do in the `prep()` function is to transla #| label: "prep_1" #| eval: false prep.step_percentiles <- function(x, training, info = NULL, ...) { - col_names <- recipes_eval_select(x$terms, training, info) + col_names <- recipes_eval_select(x$terms, training, info) # TODO finish the rest of the function } ``` After this function call, it is a good idea to check that the selected columns have the appropriate type (e.g. numeric for this example). See `recipes::check_type()` to do this for basic types. -Once we have this, we can save the approximation grid. For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: +```{r} +#| label: "prep_2" +#| eval: false +prep.step_percentiles <- function(x, training, info = NULL, ...) { + col_names <- recipes_eval_select(x$terms, training, info) + check_type(training[, col_names], types = c("double", "integer")) + # TODO finish the rest of the function +} +``` + +Once we have this, we are ready for the meat of the function. The purpose of the `prep()` method is to calculate and store the information needed to perform the transformations. `step_center()` stores the means of the selected variables, `step_pca()` stores the loadings, and `step_lincomb()` calculates which variables are linear combinations of each other and marks them for removal. Some steps don't need to calculate anything in the `prep()` method; examples include `step_arrange()` and `step_date()`. + +For this step, we want to save the approximation grid. For the grid, we will use a helper function that enables us to run `rlang::exec()` to splice in any extra arguments contained in the `options` list to the call to `quantile()`: ```{r} #| label: "splice" get_train_pctl <- function(x, args = NULL) { - res <- rlang::exec("quantile", x = x, !!!args) + res <- rlang::exec("quantile", x = x, !!!args) # Remove duplicate percentile values - res[!duplicated(res)] + res[!duplicated(res)] } # For example: @@ -220,51 +230,58 @@ get_train_pctl(biomass_tr$carbon) Now, the `prep()` method can be created: ```{r} -#| label: "prep-2" +#| label: "prep-3" prep.step_percentiles <- function(x, training, info = NULL, ...) 
{ - col_names <- recipes_eval_select(x$terms, training, info) + col_names <- recipes_eval_select(x$terms, training, info) check_type(training[, col_names], types = c("double", "integer")) - ## We'll use the names later so make sure they are available - if (x$options$names == FALSE) { - rlang::abort("`names` should be set to TRUE") - } - - if (!any(names(x$options) == "probs")) { - x$options$probs <- (0:100)/100 - } else { - x$options$probs <- sort(unique(x$options$probs)) - } - + # We'll use the names later so make sure they are available + if (x$options$names == FALSE) { + rlang::abort("`names` should be set to TRUE") + } + + if (!any(names(x$options) == "probs")) { + x$options$probs <- (0:100) / 100 + } else { + x$options$probs <- sort(unique(x$options$probs)) + } + # Compute percentile grid - ref_dist <- purrr::map(training[, col_names], get_train_pctl, args = x$options) + ref_dist <- list() + for (col_name in col_names) { + ref_dist[[col_name]] <- get_train_pctl(training[[col_name]], args = x$options) + } + + # Use the constructor function to return the updated object. + # Note that `trained` is now set to TRUE - ## Use the constructor function to return the updated object. - ## Note that `trained` is now set to TRUE - step_percentiles_new( - terms = x$terms, - trained = TRUE, - role = x$role, - ref_dist = ref_dist, - options = x$options, - skip = x$skip, - id = x$id - ) + terms = x$terms, + trained = TRUE, + role = x$role, + ref_dist = ref_dist, + options = x$options, + skip = x$skip, + id = x$id + ) } ``` -We suggest favoring `rlang::abort()` and `rlang::warn()` over `stop()` and `warning()`. The former can be used for better traceback results. +::: {.callout-tip} +Due to the way errors are captured and rethrown in recipes, we recommend using a for-loop over `map()` or `lapply()` to go over the selected variables. +::: -## Create the `bake` method +We suggest favoring `cli::cli_abort()` and `cli::cli_warn()` over `stop()` and `warning()`. The former give better tracebacks. -Remember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentiles()` class. The minimum arguments for this are +### Create the `bake()` method + +Remember that the `prep()` function does not _apply_ the step to the data; it only estimates any required values such as `ref_dist`. We will need to create a new method for our `step_percentiles()` class. The signature is as follows. ```r function(object, new_data, ...) ``` -where `object` is the updated step function that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. +Where `object` is the updated step object that has been through the corresponding `prep()` code and `new_data` is a tibble of data to be processed. Here is the code to convert the new data to percentiles. The input data (`x` below) comes in as a numeric vector and the output is a vector of approximate percentiles: @@ -273,8 +290,8 @@ Here is the code to convert the new data to percentiles. 
The input data (`x` bel pctl_by_approx <- function(x, ref) { # In case duplicates were removed, get the percentiles from # the names of the reference object - grid <- as.numeric(gsub("%$", "", names(ref))) - approx(x = ref, y = grid, xout = x)$y/100 + grid <- as.numeric(gsub("%$", "", names(ref))) + approx(x = ref, y = grid, xout = x)$y / 100 } ``` @@ -283,19 +300,19 @@ We will loop over the variables one by one and apply the transformation. `check_ ```{r} #| label: "bake-method" bake.step_percentiles <- function(object, new_data, ...) { - col_names <- names(object$ref_dist) + col_names <- names(object$ref_dist) check_new_data(col_names, object, new_data) - for (col_name in col_names) { - new_data[[col_name]] <- pctl_by_approx( - x = new_data[[col_name]], - ref = object$ref_dist[[col_name]] - ) - } + for (col_name in col_names) { + new_data[[col_name]] <- pctl_by_approx( + x = new_data[[col_name]], + ref = object$ref_dist[[col_name]] + ) + } # new_data will be a tibble when passed to this function. It should also # be a tibble on the way out. - new_data + new_data } ``` @@ -303,96 +320,98 @@ bake.step_percentiles <- function(object, new_data, ...) { You need to import `recipes::prep()` and `recipes::bake()` to create your own step function in a package. ::: -## Run the example +### Verify it works Let's use the example data to make sure that it works: ```{r} #| label: "example" -rec_obj <- - recipe(HHV ~ ., data = biomass_tr) %>% +rec_obj <- recipe(HHV ~ ., data = biomass_tr) %>% step_percentiles(ends_with("gen")) %>% prep(training = biomass_tr) biomass_te %>% select(ends_with("gen")) %>% slice(1:2) bake(rec_obj, biomass_te %>% slice(1:2), ends_with("gen")) -# Checking to get approximate result: +# Checking to get approximate results: mean(biomass_tr$hydrogen <= biomass_te$hydrogen[1]) -mean(biomass_tr$oxygen <= biomass_te$oxygen[1]) +mean(biomass_tr$oxygen <= biomass_te$oxygen[1]) ``` The plot below shows how the original hydrogen percentiles line up with the estimated values: ```{r} #| label: "cdf_plot" -hydrogen_values <- - bake(rec_obj, biomass_te, hydrogen) %>% +hydrogen_values <- bake(rec_obj, biomass_te, hydrogen) %>% bind_cols(biomass_te %>% select(original = hydrogen)) -ggplot(biomass_tr, aes(x = hydrogen)) + - # Plot the empirical distribution function of the +ggplot(biomass_tr, aes(x = hydrogen)) + + # Plot the empirical distribution function of the # hydrogen training set values as a black line - stat_ecdf() + - # Overlay the estimated percentiles for the new data: - geom_point(data = hydrogen_values, - aes(x = original, y = hydrogen), - col = "red", alpha = .5, cex = 2) + + stat_ecdf() + + # Overlay the estimated percentiles for the new data: + geom_point( + data = hydrogen_values, + aes(x = original, y = hydrogen), + col = "red", + alpha = .5, + cex = 2 + ) + labs(x = "New Hydrogen Values", y = "Percentile Based on Training Set") ``` These line up very nicely! -## Custom check operations +## Creating a new check -The process here is exactly the same as steps; the internal functions have a similar naming convention: +The process here is exactly the same as for steps; the internal functions have a similar naming convention: - * `add_check()` instead of `add_step()` + * `add_check()` instead of `add_step()`. * `check()` instead of `step()`, and so on. It is strongly recommended that: 1. The operations start with `check_` (i.e. `check_range()` and `check_range_new()`) - 1. The check uses `rlang::abort(paste0(...))` when the conditions are not met + 1. 
The check uses `cli::cli_abort(...)` when the conditions are not met 1. The original data are returned (unaltered) by the check when the conditions are satisfied. ## Other step methods There are a few other S3 methods that can be created for your step function. They are not required unless you plan on using your step in the broader tidymodels package set. -### A print method +### A `print()` method -If you don't add a print method for `step_percentiles`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `print_step()` that should be useful in most cases. We are using it here for the custom print method for `step_percentiles`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: +If you don't add a print method for `step_percentiles()`, it will still print but it will be printed as a list of (potentially large) objects and look a bit ugly. The recipes package contains a helper function called `print_step()` that should be useful in most cases. We are using it here for the custom print method for `step_percentiles()`. It requires the original terms specification and the column names this specification is evaluated to by `prep()`. For the former, our step object is structured so that the list object `ref_dist` has the names of the selected variables: ```{r} #| label: "print-method" print.step_percentiles <- function(x, width = max(20, options()$width - 35), ...) { - title <- "Percentile transformation on " + title <- "Percentile transformation on " print_step( # Names after prep: - tr_obj = names(x$ref_dist), + tr_obj = names(x$ref_dist), # Names before prep (could be selectors) - untr_obj = x$terms, - # Has it been prepped? - trained = x$trained, + untr_obj = x$terms, + # Has it been prepped? + trained = x$trained, # What does this step do? - title = title, - # An estimate of how many characters to print on a line: - width = width - ) + title = title, + # An estimate of how many characters to print on a line: + width = width + ) invisible(x) - } + } # Results before `prep()`: recipe(HHV ~ ., data = biomass_tr) %>% step_percentiles(ends_with("gen")) -# Results after `prep()`: +# Results after `prep()`: rec_obj ``` - + ### Methods for declaring required packages Some recipe steps use functions from other packages. When this is the case, the `step_*()` function should check to see if the package is installed. The function `recipes::recipes_pkg_check()` will do this. For example: @@ -403,7 +422,7 @@ Some recipe steps use functions from other packages. When this is the case, the a clean R session then run: install.packages("some_package") ``` -There is an S3 method that can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: +An S3 method can be used to declare what packages should be loaded when using the step. For a hypothetical step that relies on the `hypothetical` package, this might look like: ```{r} required_pkgs.step_hypothetical <- function(x, ...) { @@ -413,7 +432,7 @@ required_pkgs.step_hypothetical <- function(x, ...) { In this example, `myrecipespkg` is the package where the step resides (if it is in a package). -The reason to declare what packages should be loaded is parallel processing. 
When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters have no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. +The reason to declare what packages should be loaded is parallel processing. When parallel worker processes are created, there is heterogeneity across technologies regarding which packages are loaded. Multicore methods on macOS and Linux load all of the packages that were loaded in the main R process. However, parallel processing using psock clusters has no additional packages loaded. If the home package for a recipe step is not loaded in the worker processes, the `prep()` methods cannot be found and an error occurs. If this S3 method is used for your step, you can rely on this for checking the installation: @@ -423,7 +442,7 @@ recipes::recipes_pkg_check(required_pkgs.step_hypothetical()) If you'd like an example of this in a package, please take a look at the [embed](https://github.com/tidymodels/embed/) or [themis](https://github.com/tidymodels/themis/) package. -### A tidy method +### A `tidy()` method The `broom::tidy()` method is a means to return information about the step in a usable format. For our step, it would be helpful to know the reference values. @@ -432,13 +451,13 @@ When the recipe has been prepped, those data are in the list `ref_dist`. A small ```{r} #| label: "tidy-calcs" format_pctl <- function(x) { - tibble::tibble( - value = unname(x), - percentile = as.numeric(gsub("%$", "", names(x))) - ) + tibble::tibble( + value = unname(x), + percentile = as.numeric(gsub("%$", "", names(x))) + ) } -# For example: +# For example: pctl_step_object <- rec_obj$steps[[1]] pctl_step_object format_pctl(pctl_step_object$ref_dist[["hydrogen"]]) @@ -449,29 +468,29 @@ The tidy method could return these values for each selected column. Before `prep ```{r} #| label: "tidy" tidy.step_percentiles <- function(x, ...) { - if (is_trained(x)) { - if (length(x$ref_dist) == 0) { - # We need to create consistant output when no variables were selected - res <- tibble( - terms = character(), - value = numeric(), - percentile = numeric() - ) - } else { - res <- map_dfr(x$ref_dist, format_pctl, .id = "term") - } - } else { - term_names <- sel2char(x$terms) - res <- + if (is_trained(x)) { + if (length(x$ref_dist) == 0) { + # We need to create consistent output when no variables are selected + res <- tibble( + terms = character(), + value = numeric(), + percentile = numeric() + ) + } else { + res <- map_dfr(x$ref_dist, format_pctl, .id = "term") + } + } else { + term_names <- sel2char(x$terms) + res <- tibble( - terms = term_names, - value = rlang::na_dbl, - percentile = rlang::na_dbl - ) - } - # Always return the step id: - res$id <- x$id - res + terms = term_names, + value = rlang::na_dbl, + percentile = rlang::na_dbl + ) + } + # Always return the step id: + res$id <- x$id + res } tidy(rec_obj, number = 1) @@ -479,7 +498,7 @@ tidy(rec_obj, number = 1) ### Methods for tuning parameters -The tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. 
The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the arguments: +The tune package can be used to find reasonable values of step arguments by model tuning. There are some S3 methods that are useful to define for your step. The percentile example doesn't really have any tunable parameters, so we will demonstrate using `step_poly()`, which returns a polynomial expansion of selected columns. Its function definition has the following arguments: ```{r} #| label: "poly-args" @@ -488,7 +507,7 @@ args(step_poly) The argument `degree` is tunable. -To work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how values of those arguments should be generated. +To work with tune it is _helpful_ (but not required) to use an S3 method called `tunable()` to define which arguments should be tuned and how the values of those arguments should be generated. `tunable()` takes the step object as its argument and returns a tibble with columns: @@ -510,7 +529,7 @@ For example, for a nearest-neighbors `neighbors` parameter, this value is just: #| label: "mtry" info <- list(pkg = "dials", fun = "neighbors") -# FYI: how it is used under-the-hood: +# FYI: how it is used under-the-hood: new_param_call <- rlang::call2(.fn = info$fun, .ns = info$pkg) rlang::eval_tidy(new_param_call) ``` @@ -525,20 +544,20 @@ prod_degree : function (range = c(1L, 2L), trans = NULL) spline_degree : function (range = c(3L, 10L), trans = NULL) ``` -Looking at the `range` values, some return doubles and others return integers. For our problem, `degree_int()` would be a good choice. +Looking at the `range` values, some return doubles, and others return integers. For our problem, `degree_int()` would be a good choice. For `step_poly()` the `tunable()` S3 method could be: ```{r} #| label: "tunable" -tunable.step_poly <- function (x, ...) { - tibble::tibble( - name = c("degree"), - call_info = list(list(pkg = "dials", fun = "degree_int")), - source = "recipe", - component = "step_poly", - component_id = x$id - ) +tunable.step_poly <- function(x, ...) { + tibble::tibble( + name = c("degree"), + call_info = list(list(pkg = "dials", fun = "degree_int")), + source = "recipe", + component = "step_poly", + component_id = x$id + ) } ```
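To sanity-check the `tunable()` method end to end, a quick interactive test helps. The following is a minimal sketch and not part of the diff above: it assumes the `biomass_tr` data from earlier, picks an arbitrary formula using three of its predictors, and uses `tune()` placeholders plus `extract_parameter_set_dials()` (both part of the core tidymodels packages) to confirm that `degree` is discovered with the `dials::degree_int()` range:

```r
library(tidymodels)

# Mark `degree` for tuning; the tune() placeholder is resolved later
# through the tunable() method registered for the step's class.
rec <- recipe(HHV ~ carbon + hydrogen + oxygen, data = biomass_tr) %>%
  step_poly(all_numeric_predictors(), degree = tune())

# tunable.step_poly() tells tune to build the parameter with
# dials::degree_int(), so `degree` should show an integer range of [1, 3]:
extract_parameter_set_dials(rec)
```

If the parameter set prints with `degree` as a complete parameter (rather than one needing finalization), the `tunable()` method is wired up correctly and the recipe can be passed to `tune_grid()` as usual.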