The problem
The step_holiday function seems to return errors during baking/juicing in the presence of NA values. For example
Reproducible example
library(recipes)
dd <- data.frame(
date = as.Date(c("2021-12-04", "2021-12-25", NA, "2021-12-27"))
)
recipe(~date, data=dd) %>%
step_holiday(date) %>%
prep() %>%
juice()
# Error in if (rng.nch[1] != rng.nch[2]) stop("'charvec' has non-NA entries of different number of characters") :
# missing value where TRUE/FALSE needed
Tested with recipes_0.1.16
Possible solution
The problem seems to be with the is_holiday function. It's passing a unique vector of years to the timeDate::holiday() function but that vector can include NA values which end up throwing the error the timeDate::holiday() code. It would be better to omit NA years (possibly checking to make sure there is at least one NA-year). It probably also makes sense to propagate the NA values in the returned indicator columns. One possible solution is
is_holiday <- function(hol, dt) {
# ~~ add na.omit() here to drop NA values ~~
hdate <- holiday(year = unique(year(na.omit(dt))), Holiday = hol)
hdate <- as.Date(hdate)
out <- rep(0, length(dt))
out[dt %in% hdate] <- 1
# ~~ carry forward missing values ~~
out[is.na(dt)] <- NA
out
}
This updated version of the function would return
# A tibble: 4 x 4
date date_LaborDay date_NewYearsDay date_ChristmasDay
<date> <dbl> <dbl> <dbl>
1 2021-12-04 0 0 0
2 2021-12-25 0 0 1
3 NA NA NA NA
4 2021-12-27 0 0 0
If that seems like a reasonable solution, I could prepare a pull request.
The problem
The
step_holidayfunction seems to return errors during baking/juicing in the presence of NA values. For exampleReproducible example
Tested with
recipes_0.1.16Possible solution
The problem seems to be with the
is_holidayfunction. It's passing a unique vector of years to thetimeDate::holiday()function but that vector can include NA values which end up throwing the error thetimeDate::holiday()code. It would be better to omit NA years (possibly checking to make sure there is at least one NA-year). It probably also makes sense to propagate the NA values in the returned indicator columns. One possible solution isThis updated version of the function would return
If that seems like a reasonable solution, I could prepare a pull request.