# syberia/mungebits2

Switch branches/tags
Nothing to show
Fetching contributors…
Cannot retrieve contributors at this time
435 lines (410 sloc) 19.8 KB
 ## Usually, we wish to perform more than one cleaning operation on a dataset ## before it is ready to be fed to a machine learning classifier. ## ## The munge function defined in this file allows for a quick way to ## apply a list of mungepieces to a dataframe. ## ## r ## munged_data <- munge(raw_data, list( ## "Drop useless vars" = list(list(drop_vars, vector_of_variables), ## list(drop_vars, c(vector_variables, "dep_var"))), ## "Impute variables" = list(imputer, imputed_vars), ## "Discretize vars" = list(list(discretize, restore_levels), discretized_vars) ## )) ##  ## ## Translated in English, we are saying: ## ## 1. Drop a static list of useless variables from the data set. ## When the model is trained, drop the dependent variable as well ## since we will no longer need it. ## ## 2. Impute the variables in the static list of imputed_vars. ## When the model is trained, the imputer will have some logic ## to restore the means obtained during training of the mungepiece ## (assuming we are using mean imputation). ## ## 3. Discretize the static list of variables in the discretized_vars ## character vector. After model training, when new data points come in, ## the original training set is no longer available. The discretize ## method stored the necessary cuts for each variable in the mungebit's ## input, which the restore_levels function uses to bin the ## numerical features in the list of discretized_vars into factor ## (categorical) variables. ## ## Instead of building the mungepieces and bits by hand by calling the ## mungepiece$new and mungebit$new constructors (which is another ## alternative), we use this convenient format to construct and apply ## the mungepieces on-the-fly. ## ## The end result is the munged_data, ## which is the final cleaned and ready-to-model data, together with ## an [attribute](https://stat.ethz.ch/R-manual/R-devel/library/base/html/attributes.html) ## "mungepieces" which stores the list of trained mungepieces. ## In other words, the munged data *remembers* how it was obtained ## from the raw data. The list of trained mungepieces, informally called a ## **munge procedure**, can be used to replicate the munging in a real-time ## streaming production system without having to remember the full training set: ## ## r ## munged_single_row <- munge(single_row, attr(munged_data, "mungepieces")) ## ## # A syntactic shortcut enabled by the munge helper. It knows to look for ## # the mungepieces attribute. ## munged_single_row <- munge(single_row, munged_data) ##  ## ## We can feed single rows of data (i.e., streaming records coming through ## in a production system) to the trained munge procedure and it will ## correctly replicate the same munging it performed during model training. #' Apply a list of mungepieces to data. #' #' The \code{munge} function allows a convenient format for applying a #' sequence of \code{\link{mungepiece}} objects to a dataset. #' #' The \code{munge} helper accepts a raw, pre-munged (pre-cleaned) #' dataset and a list of lists. Each sublist represents the code #' and hyperparameters necessary to clean the dataset. For example, #' the first row could consist of an imputation function and a list #' of variables to apply the imputation to. It is important to #' understand what a \code{\link{mungebit}} and \code{\link{mungepiece}} #' does before using the \code{munge} helper, as it constructs these #' objects on-the-fly for its operation. #' #' The end result of calling \code{munge} is a fully cleaned data set #' (i.e., one to whom all the mungepieces have been applied and trained) #' adjoined with a \code{"mungepieces"} attribute: the list of trained #' mungepieces. #' #' For each sublist in the list of pre-mungepieces passed to \code{munge}, #' the following format is available. See the examples for a more hands-on #' example. #' #' \enumerate{ #' \item{\code{list(train_fn, ...)}}{ -- If the first element of \code{args} is #' a function followed by other arguments, the constructed mungepiece #' will use the \code{train_fn} as both the \emph{train and predict} #' function for the mungebit, and \code{list(...)} (that is, the remaining #' elements in the list) will be used as both the train and predict #' arguments in the mungepiece. In other words, using this format #' specifies you would like \emph{exactly the same behavior in #' training as in prediction}. This is appropriate for mungebits #' that operate in place and do not need information obtained #' from the training set, such as simple value replacement or #' column removal. #' } #' \item{\code{list(list(train_fn, predict_fn), ...)}}{ #' -- If \code{args} consists of a two-element pair in its first #' element, it must be a pair of either \code{NULL}s or functions, #' with not both elements \code{NULL}. If the \code{train_fn} #' or \code{predict_fn}, resp., is \code{NULL}, this will signify to have #' \emph{no effect} during training or prediction, resp. #' #' The remaining arguments, that is \code{list(...)}, will be used #' as both the training and prediction arguments. #' #' This structure is ideal if the behavior during training and prediction #' has an identical parametrization but very different implementation, #' such as imputation, so you can pass two different functions. #' #' It is also useful if you wish to have no effect during prediction, #' such as removing faulty rows during training, or no effect during #' training, such as making a few transformations that are only #' necessary on raw production data rather than the training data. #' } #' \item{\code{list(train = list(train_fn, ...), predict = list(predict_fn, ...))}}{ #' If \code{args} consists of a list consisting of exactly two named #' elements with names "train" and "predict", then the first format will be #' used for the respective fields. In other words, a mungepiece will #' be constructed consisting of a mungebit with \code{train_fn} as the #' training function, \code{predict_fn} as the predict fuction, and #' the mungepiece train arguments will be the train list of additional #' arguments \code{list(...)}, and similarly the predict arguments will be #' the predict list of additional arguments \code{list(...)}. #' #' Note \code{train_fn} and \code{predict_fn} must \emph{both} be functions #' and not \code{NULL}, since then we could simply use the second format #' described above. #' #' This format is ideal when the parametrization differs during training and #' prediction. In this case, \code{train_fn} usually should be the same #' as \code{predict_fn}, but the additional arguments in each list can #' be used to identify the parametrized discrepancies. For example, to #' sanitize a dataset one may wish to drop unnecessary variables. During #' training, this excludes the dependent variable, but during prediction #' we may wish to drop the dependent as well. #' #' This format can also be used to perform totally different behavior on #' the dataset during training and prediction (different functions and #' parameters), but mungebits should by definition achieve the same #' operation during training and prediction, so this use case is rare #' and should be handled carefully. #' } #' } #' #' @export #' @param data data.frame. Raw, uncleaned data. #' @param mungelist list. A list of lists which will be translated to a #' list of mungepieces. It is also possible to pass a list of mungepieces, #' but often the special syntax is more convenient. See the examples section. #' @param stagerunner logical or list. Either \code{TRUE} or \code{FALSE}, by default #' the latter. If \code{TRUE}, a \code{\link[stagerunner]{stagerunner}} #' object will be returned whose context will contain a key \code{data} #' after being ran, namely the munged data set (with a "mungepieces" #' attribute). #' #' One can also provide a list with a \code{remember} parameter, #' which will be used to construct a stagerunner with the same value #' for its \code{remember} parameter. #' @param list logical. Whether or not to return the list of mungepieces #' instead of executing them on the \code{data}. By default \code{FALSE}. #' @param parse logical. Whether or not to pre-parse the \code{mungelist} #' using \code{\link{parse_mungepiece}}. Note that if this is \code{TRUE}, #' any trained mungepieces will be duplicated and marked as untrained. #' By default, \code{TRUE}. #' @return A cleaned \code{data.frame}, the result of applying each #' \code{\link{mungepiece}} constructed from the \code{mungelist}. #' @seealso \code{\link{mungebit}}, \code{\link{mungepiece}}, #' \code{\link{parse_mungepiece}} #' @examples #' # First, we show off the various formats that the parse_mungepiece #' # helper accepts. For this exercise, we can use dummy train and #' # predict functions and arguments. #' train_fn <- predict_fn <- function(x, ...) { x } #' train_arg1 <- predict_arg1 <- dual_arg1 <- TRUE # Can be any parameter value. #' #' # The typical way to construct mungepieces would be using the constructor. #' piece <- mungepiece$new( #' mungebit$new(train_fn, predict_fn), #' list(train_arg1), list(predict_arg1) #' ) #' #' # This is tedious and can be simplified with the munge syntax, which #' # allows one to specify a nested list that defines all the mungebits #' # and enclosing mungepieces at once. #' #' raw_data <- iris #' munged_data <- munge(raw_data, list( #' # If the train function with train args is the same as the predict function #' # with predict args, we use this syntax. The first element should be #' # the funtion we use for both training and prediction. The remaining #' # arguments will be used as both the train_args and predict_args #' # for the resulting mungepiece. #' "Same train and predict" = list(train_fn, train_arg1, train_arg2 = "blah"), #' #' # If the train and predict arguments to the mungepiece match, but we #' # wish to use a different train versus predict function for the mungebit. #' "Different functions, same args" = #' list(list(train_fn, predict_fn), dual_arg1, dual_arg2 = "blah"), #' #' # If we wish to only run this mungepiece during training. #' "Only run in train" = list(list(train_fn, NULL), train_arg1, train_arg2 = "blah"), #' #' # If we wish to only run this mungepiece during prediction. #' "Only run in predict" = list(list(NULL, predict_fn), predict_arg1, predict_arg2 = "blah"), #' #' # If we wish to run different arguments but the same function during #' # training versus prediction. #' "Totally different train and predict args, but same functions" = #' list(train = list(train_fn, train_arg1), #' predict = list(train_fn, predict_arg1)), #' #' # If we wish to run different arguments with different functions during #' # training versus prediction. #' "Totally different train and predict function and args" = #' list(train = list(train_fn, train_arg1), #' predict = list(predict_fn, predict_arg1)) #' )) # End the call to munge() #' #' # This is an abstract example that was purely meant to illustrate syntax #' # The munged_data variable will have the transformed data set along #' # with a "mungepieces" attribute recording a list of trained mungepieces #' # derived from the above syntax. #' #' # A slightly more real-life example. #' \dontrun{ #' munged_data <- munge(raw_data, list( #' "Drop useless vars" = list(list(drop_vars, vector_of_variables), #' list(drop_vars, c(vector_variables, "dep_var"))), #' "Impute variables" = list(imputer, imputed_vars), #' "Discretize vars" = list(list(discretize, restore_levels), discretized_vars) #' )) #' #' # Here, we have requested to munge the raw_data by dropping useless variables, #' # including the dependent variable dep_var after model training, #' # imputing a static list of imputed_vars, discretizing a static list #' # of discretized_vars being careful to use separate logic when merely #' # using the computed discretization cuts to bin the numeric features into #' # categorical features. The end result is a munged_data set with an #' # attribute "mungepieces" that holds the list of mungepieces used for #' # munging the data, and can be used to perform the exact same set of #' # operations on a single row dataset coming through in a real-time production #' # system. #' munged_single_row_of_data <- munge(single_row_raw_data, munged_data) #' } #' # The munge function uses the attached "mungepieces" attribute, a list of #' # trained mungepieces. munge <- function(data, mungelist, stagerunner = FALSE, list = FALSE, parse = TRUE) { stopifnot(is.data.frame(data) || (is.environment(data) && (!identical(stagerunner, FALSE) || any(ls(data) == "data")))) if (length(mungelist) == 0L) { return(data) } if (!is.list(mungelist)) { ## This error is grabbed from the messages.R file. stop(m("munge_type_error", class = class(mungelist)[1L])) } ## We allow munging in prediction mode using an existing *trained* data.frame. ## For example, if we had ran iris2 <- munge(iris, some_list_of_mungepieces), ## an attributes would be created on iris2 with the name "mungepieces". ## The munge function is capable of re-using this attribute, a list of ## trained mungepieces, and apply it to a new dataset. In English, it is ## asking to "perform the exact same munging that was done on iris2". if (is.data.frame(mungelist)) { if (!is.element("mungepieces", names(attributes(mungelist)))) { stop(m("munge_lack_of_mungepieces_attribute")) } Recall(data, attr(mungelist, "mungepieces"), stagerunner = stagerunner, list = list, parse = FALSE) } else if (methods::is(mungelist, "tundraContainer")) { # An optional interaction effect with the tundra package. Recall(data, mungelist$munge_procedure, stagerunner = stagerunner, list = list, parse = FALSE) } else { munge_(data, mungelist, stagerunner, list, parse) } } # Assume proper arguments. munge_ <- function(data, mungelist, stagerunner, list_output, parse) { mungelist <- Filter(Negate(is.null), mungelist) if (isTRUE(parse)) { ## It is possible to have nested stagerunners of mungeprocedures, ## but this is a more advanced feature. We skip stagerunners ## when parsing the munge procedure using parse_mungepiece. runners <- vapply(mungelist, is, logical(1), "stageRunner") # TODO: (RK) Intercept errors and inject with name for helpfulness! mungelist[!runners] <- lapply(mungelist[!runners], parse_mungepiece) } if (isTRUE(list_output)) { return(mungelist) } ## We will construct a [stagerunner](https://github.com/robertzk/stagerunner) ## and execute it on a context (environment) containing just a data key. ## ## The stages of the stagerunner will be one for each mungepiece, defined ## by the mungepiece_stages helper. stages <- mungepiece_stages(mungelist) if (is.environment(data)) { context <- data } else { context <- list2env(list(data = data), parent = emptyenv()) } remember <- is.list(stagerunner) && isTRUE(stagerunner$remember) runner <- stagerunner::stageRunner$new(context, stages, remember = remember) if (identical(stagerunner, FALSE)) { runner$run() context$data } else { runner } } mungepiece_stages <- function(mungelist, contiguous = FALSE) { ## As before, remember that a munge procedure can consist of mungepieces ## but also stagerunners earlier produced by munge. If we have a ## mungelist that looks like list(mungepiece, runner, mungepiece, mungepiece, runner, ...) ## then each *contiguous* block of mungepieces needs to be transformed ## into a sequence of stages. We will see later why this is necessary: we ## have to append the mungepieces to the data.frame after the last ## mungepiece has been executed. if (!isTRUE(contiguous)) { singles <- which(vapply(mungelist, Negate(is), logical(1), "stageRunner")) groups <- cumsum(diff(c(singles[1L] - 1, singles)) != 1) split(mungelist[singles], groups) <- lapply( split(mungelist[singles], groups), mungepiece_stages, contiguous = TRUE ) mungelist } else { mungepiece_stages_contiguous(mungelist) } } mungepiece_stages_contiguous <- function(mungelist) { shared_context <- list2env(parent = globalenv(), list(size = length(mungelist), mungepieces = mungelist, newpieces = list()) ) mungepiece_names <- names(mungelist) %||% character(length(mungelist)) lapply(Map(list, seq_along(mungelist), mungepiece_names), mungepiece_stage, shared_context) } mungepiece_stage <- function(mungepiece_index_name_pair, context) { stage <- function(env) { } ## For mungebits2 objects (as opposed to [mungebits](https://github.com/robertzk/syberia)) ## we have to use different logic to allow backwards-compatible ## mixing with legacy mungebits. We set the body of the stage ## accordingly by checking whether the mungebit2 is an ## [R6 object](https://github.com/wch/R6). if (methods::is(context$mungepieces[[mungepiece_index_name_pair[[1]]]], "R6")) { body(stage) <- mungepiece_stage_body() } else { body(stage) <- legacy_mungepiece_stage_body() } environment(stage) <- list2env(parent = context, list(mungepiece_index = mungepiece_index_name_pair[[1]], mungepiece_name = mungepiece_index_name_pair[[2]]) ) stage } mungepiece_stage_body <- function() { quote({ ## Each mungepiece will correspond to one stage in a stagerunner. ## We will construct a *new* mungepiece on-the-fly to avoid ## sharing state with other mungepiece objects, run that ## mungepiece, and then modify the newpieces to store ## the trained mungepiece. # Make a fresh copy to avoid shared stage problems. piece <- mungepieces[[mungepiece_index]]$duplicate() piece$run(env) newpieces[[mungepiece_index]] <<- piece if (isTRUE(nzchar(mungepiece_name) & !is.na(mungepiece_name))) { names(newpieces)[mungepiece_index] <<- mungepiece_name } ## When we are out of mungepieces, that is, when the current index equals ## the number of mungepieces in the actively processed contiguous chain ## of mungepieces, we append the mungepieces to the dataframe's ## "mungepieces" attribute. This is what allows us to later replay ## the munging actions on new data by passing the dataframe as the second ## argument to the munge function. if (mungepiece_index == size) { attr(env$data, "mungepieces") <- append(attr(env$data, "mungepieces"), newpieces) } }) } ## To achieve backwards-compatibility with [mungebits](https://github.com/robertzk/syberia)), ## we use different parsing logic for legacy mungepieces. legacy_mungepiece_stage_body <- function() { quote({ ## This code is taken directly from [legacy mungebits](https://github.com/robertzk/mungebits/blob/99e2b30b01bfb6af39dc1bfd8d37334ea9c458b6/R/munge.r#L78-L93). if (!requireNamespace("mungebits", quietly = TRUE)) { stop("To use legacy mungebits with mungebits2, make sure you have ", "the mungebits package installed.") } reference_piece <- mungepieces[[mungepiece_index]] bit <- mungebits:::mungebit$new( reference_piece$bit$train_function, reference_piece$bit$predict_function, enforce_train = reference_piece$bit$enforce_train ) bit$trained <- reference_piece$bit$trained bit$inputs <- reference_piece$bit$inputs piece <- mungebits:::mungepiece$new( bit, reference_piece$train_args, reference_piece$predict_args ) newpieces[[mungepiece_index]] <<- piece piece$run(env) if (mungepiece_index == size) { names(newpieces) <<- names(mungepieces) attr(env$data, "mungepieces") <- append(attr(env\$data, "mungepieces"), newpieces) } }) }