Replace missing values with a constant #473
Comments
I think that using an arbitrary constant (like zero or -9999) is a really bad idea that can be implemented easily in

Although I agree with topepo, both in the sense that it is often a bad idea and that it is likely simple enough to do through If for some odd reason one wanted to have a specific step that says more explicitly that it imputes a constant value (making the readability argument), a step could be made such as the one below:

#' Impute a constant value to correct for missingness
#' @param recipe
#' @param ...
#' @param role
#' @param trained
#' @param constant A single numeric value, or a possibly named vector with the same length as the result of the selector functions, specifying which constant should be imputed for each selected column
#' @param id
#'
#' @description This method can be used for constant imputation.
#'
#' @examples
#' data(airquality)
#' library(recipes)
#' library(dplyr)
#' # Only Solar.R and Ozone have missing values
#' as_tibble(airquality)
#' # We can specify by roles, predictors etc.
#' # Use a named vector (or list) to specify what value to impute.
#' airquality_rec <- recipe(airquality, Solar.R ~ .) %>%
#'   step_constantimpute(all_predictors(), all_outcomes(),
#'                       constant = c(Solar.R = 0,
#'                                    Ozone = 30,
#'                                    Wind = 0,
#'                                    Temp = 50,
#'                                    Month = 1,
#'                                    Day = 7)) %>%
#'   prep() %>%
#'   juice()
#' airquality_rec
#'
#' # If every column should be imputed with the same value,
#' # we can simply use a single value for constant (default = 0)
#' recipe(airquality, Solar.R ~ .) %>%
#'   step_constantimpute(all_predictors(), all_outcomes(),
#'                       constant = 0) %>%
#'   prep() %>% juice() %>%
#'   print()
#'
#' @note Be careful when using constants for imputing values in any dataset.
#' While this seems like a simple fix, it has many drawbacks.
#' In most parametric models (including (generalized) linear regression) and
#' non-parametric models (including SVM, neural networks etc.) this tends to
#' reduce the variance and explanatory power of the imputed variables.
#' Certain models (including tree-based models) are less affected by this, but
#' one should nonetheless be aware of potential drawbacks and possibly explore
#' other options before using constant imputation.
#'
#' @export
step_constantimpute <- function(
  recipe,
  ...,
  role = NA,
  trained = FALSE,
  constant = 0,
  skip = FALSE,
  id = rand_id("constantimpute")){
  # Import terms
  terms <- ellipse_check(...)
  # Check that `constant` has the correct format (flatten a list first, then require numeric)
  if(is.list(constant)){
    if(any(lengths(constant) != 1))
      rlang::abort('One or more elements of `constant` has a length different from 1.\n`constant` should be a single numeric value or a possibly named vector or list with the same length as the selector function.')
    constant <- unlist(constant, recursive = TRUE, use.names = TRUE)
  }
  # After conversion from list, the constant should still be numeric
  if(!is.numeric(constant))
    rlang::abort('`constant` should be a single numeric value or a possibly named vector or list with the same length as the selector function.')
  add_step(
    recipe,
    step(
      subclass = "constantimpute",
      terms = terms,
      trained = trained,
      role = role,
      constant = constant,
      skip = skip,
      id = id
    )
  )
}
prep.step_constantimpute <- function(x, training, info = NULL, ...){
  # Import the names that should be transformed.
  col_names <- terms_select(terms = x$terms, info = info)
  # Make sure all types are numeric
  recipes::check_type(training[, col_names], quant = TRUE)
  if(!is.null(nm <- names(x$constant))){
    # Named constants must match the selected columns exactly
    if(any(!nm %in% col_names))
      rlang::abort('`constant` has more elements than specified by the selector functions.')
    if(any(!col_names %in% nm))
      rlang::abort('One or more columns are missing from named `constant` vector.')
  }else{
    # Unnamed constants: recycle a single value, then name by selected column
    if((n <- length(x$constant)) != 1 && n != length(col_names))
      rlang::abort('`constant` should be a single numeric value or a possibly named vector or list with the same length as the selector function.')
    else if(n == 1)
      x$constant <- rep(x$constant, length(col_names))
    names(x$constant) <- col_names
  }
  step(
    subclass = "constantimpute",
    terms = x$terms,
    role = x$role,
    trained = TRUE,
    constant = x$constant,
    skip = x$skip,
    id = x$id
  )
}
bake.step_constantimpute <- function(object, new_data, ...){
  # Import the variables that should be baked
  vars <- names(object$constant)
  # Iterate over each variable and impute the constant.
  for(i in vars){
    isn <- is.na(new_data[[i]])
    new_data[[i]][isn] <- object$constant[[i]]
  }
  # Return the result as a tibble.
  tibble::as_tibble(new_data)
}
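
The core of the bake method above can be tried without the recipes package. The following is a minimal base R sketch of that loop only, not the step itself; `impute_constant()` is a hypothetical helper name that does not exist in recipes, and the constants are arbitrary.

```r
# Hypothetical helper (not part of recipes): replace NA values in the
# columns named by `constant` with the corresponding constant.
impute_constant <- function(data, constant) {
  stopifnot(is.numeric(constant), !is.null(names(constant)))
  for (col in names(constant)) {
    missing_rows <- is.na(data[[col]])
    data[[col]][missing_rows] <- constant[[col]]
  }
  data
}

# airquality ships with base R; only Ozone and Solar.R contain NA values.
data(airquality)
filled <- impute_constant(airquality, c(Ozone = 30, Solar.R = 0))

# Count of remaining missing values per column
colSums(is.na(filled))
```

Unlike the step, this sketch does no type checking of the columns and does not recycle a single unnamed value.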

I will note that I do not necessarily think that it should be included in the package, but one could easily create a

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.
There is a need for imputing NA values with a constant. For some models, constant imputation usually works better than median or other types of imputation. For example, in neural network models, replacing NA values with zeros works well, as nodes with input values equal to 0 do not send any further signals. And in tree-based models, imputing an extreme value (such as "-99999") is suitable, as the tree models will make separate splits for these extreme values.
Could a constant imputation step please be implemented in the R recipes package?
Thanks.
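
To make the two cases in the request concrete, here is a small base R illustration on a made-up numeric vector. It is only a sketch of the requested behaviour, not a recipes step, and the values are invented for the example.

```r
# Toy predictor with two missing values (made-up numbers).
x <- c(1.5, NA, 3.2, NA, 0.7)

# Zero fill, as suggested above for neural network inputs.
x_zero <- replace(x, is.na(x), 0)

# Extreme sentinel fill, as suggested above for tree-based models.
x_sentinel <- replace(x, is.na(x), -99999)

print(x_zero)
print(x_sentinel)
```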