Hardcoded rset_att limits extendability #43
Comments
The idea is to use a constructor for new If you are implementing a different resampling method, what attributes do you need to add? It contains the attributes required by the methods that I've implemented, but I don't consider those to be everything that might ever be needed.
I have a function that uses internal logic to first select one or more columns from the input (by type, in this case), and then computes on those to generate train/test indices for each split. Given what I'd seen the included I don't think I need to add this attribute, as I don't currently plan on having downstream code consume that bit of metadata, but in the abstract I can imagine wanting to do so in the future. Would you mind explaining the logic behind the explicit whitelist for attributes? Is there harm in retaining "unexpected" attributes?
So you mean the strata? That's a good idea. You could probably overload the class:
> tmp <- vfold_cv(mtcars, strata = "cyl")
> class(tmp)
[1] "vfold_cv" "rset" "tbl_df" "tbl" "data.frame"
> class(tmp) <- c("new_class", class(tmp))
and make a specific
It wasn't really done to be restrictive in the way that you have encountered. It's just a good method for 1) forcing you to think out the object structure and plan for what you will eventually need and 2) ensuring that any objects of that class are consistent with the structure.
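As a minimal sketch of the class-overloading idea above, here is the same pattern in base R only. A plain data frame stands in for the object returned by vfold_cv(); the class name new_class comes from the snippet above, while the extra_info attribute and the summary method are hypothetical names chosen for illustration.

```r
# A plain data frame stands in for an rset object (assumption for illustration).
tmp <- data.frame(id = c("Fold1", "Fold2"))
class(tmp) <- c("new_class", class(tmp))

# Hypothetical attribute carried by the subclass.
attr(tmp, "extra_info") <- "cyl"

# A method specific to the new class. S3 dispatch reaches it before the
# data.frame method because "new_class" comes first in the class vector.
summary.new_class <- function(object, ...) {
  paste(
    "new_class with", nrow(object), "rows; extra_info =",
    attr(object, "extra_info")
  )
}

out <- summary(tmp)
```

Because the original classes stay in the class vector, methods written for the parent classes keep working on the subclassed object.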
I hadn't been thinking of them as strata, but I suppose they are, so I'll stick my split cols in the Not to worry, I did make my own function and do overload the class via the
my function (terrible name, I know)
#' Resample a dataset into time-aware, gap-respecting train/test pairs.
#'
#' A custom implementation similar to [rsample::rolling_origin()], but more
#' respectful of time, and including as many observations in both sets as
#' possible without violating the boundary.
#'
#' For each split, rows _ending strictly before_ the respective element of
#' `split_dates` will be in the `analysis` set, and rows _beginning on or after_ the
#' date will be in the `assessment` set. Rows in-between will be dropped ("the
#' gap"). Rows for which this relation can't be determined (`NA`'s) will be
#' dropped, and rows with any empty values in date columns will not be in
#' analysis sets, but may be in assessment sets. Comparisons will be made using
#' the per-row minima and maxima across
#' all `Date`-type columns after applying any selectors.
#'
#' @param tbl (`tbl_df`) a dataset to resample, with at least one `Date`-valued
#' column
#' @param split_dates (`Date`) a vector of prediction dates to use for resampling.
#' Will be coerced to `Date` with [lubridate::as_date()] if necessary.
#' @param ... selectors to apply with [dplyr::select()] to filter before
#' identifying date-valued columns, e.g. to ignore certain columns. All
#' columns will still appear in results.
#'
#' @return (`vc_resample`, `rset`, `tbl_df`) with `length(split_dates)` rows and the
#' following columns:
#' - `splits` (`rsplit`): individual splits
#' - `id` (`character`): labels
#' - `split_date` (`Date`): the prediction date used for each split
#'
#' @export
#'
#' @examples
#' library(dplyr)
#'
#' tbl <- tibble(
#'   date = rep(seq(as.Date("2018-01-01"), by = 1, length.out = 6), each = 2),
#'   ticker = rep(c("AAPL", "IWV"), length.out = length(date)),
#'   lead_date = rep(lead(unique(date), 2), each = n_distinct(ticker)),
#'   value = 1:12
#' )
#'
#' tbl
#'
#' split_dates <- c("2018-01-04", "2018-01-05", "2018-01-06")
#'
#' vc_resample(tbl, split_dates) # use date and lead_date
#' vc_resample(tbl, split_dates, date) # don't use lead_date
#' vc_resample(tbl, split_dates, -date) # only use lead_date
vc_resample <- function(tbl, split_dates, ...) {
  checkmate::assert_tibble(tbl, min.rows = 1L, min.cols = 1L)
  if (!lubridate::is.Date(split_dates)) split_dates <- lubridate::as_date(split_dates)
  checkmate::assert_date(split_dates, any.missing = FALSE, min.len = 1L, unique = TRUE)
  # apply selections
  selectors <- rlang::enquos(...)
  if (rlang::is_empty(selectors)) {
    date_cols <- tbl
  } else {
    date_cols <- dplyr::select(tbl, !!!selectors)
  }
  # ensure at least one is a Date
  checkmate::assert_true(any(vapply(date_cols, lubridate::is.Date, logical(1))))
  # isolate columns to split by
  date_cols <- purrr::keep(date_cols, lubridate::is.Date)
  # find per-row min and max dates
  max_dates <- purrr::lift_dl(pmax)(date_cols, na.rm = TRUE)
  min_dates <- purrr::lift_dl(pmin)(date_cols, na.rm = TRUE)
  # ensure no splits will be empty
  checkmate::assert_true(all(split_dates > min(max_dates, na.rm = TRUE)))
  checkmate::assert_true(all(split_dates <= max(min_dates, na.rm = TRUE)))
  # create splits
  splits <- purrr::map(
    split_dates,
    ~ rsample:::rsplit(
      tbl,
      which((max_dates < .x) & complete.cases(date_cols)), # no NA's in analysis
      which(min_dates >= .x)
    )
  )
  # pack splits into rset
  rsample:::new_rset(
    splits = splits,
    ids = paste0("Slice", gsub(" ", "0", format(seq_along(split_dates)))),
    attrib = list(strata = names(date_cols)), # just changed
    subclass = c("vc_resample", "rset")
  ) %>%
    dplyr::mutate(split_date = split_dates)
}
I get your point (1) about enforcing a structure, and ensuring that certain attributes exist is expected (as you do already), but having a whitelist enforced at the parent-class level seems highly unusual and of unclear benefit. If a downstream That being said, if you'd still like to keep the attribute whitelist, I have two related suggestions:
Per Hadley's S3 chapter linked above, the validation could happen in a new A nice side effect of not enforcing the whitelist in the I'd be glad to work up a PR if any of the above sounds good to you. In any case, thanks for engaging with me here, and thanks for these sweet tidymodels packages!
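One way the constructor/validator split suggested above might look, as a rough base-R sketch. This is not rsample's code: new_resample(), validate_resample(), the "resample" class, and the horizon attribute are all hypothetical names used only for illustration. The point is that the validator checks that required attributes are present without discarding extras.

```r
# Hypothetical low-level constructor: attaches whatever attributes it is given.
new_resample <- function(x, attrib = list(), subclass = character()) {
  stopifnot(is.data.frame(x), is.list(attrib))
  for (nm in names(attrib)) attr(x, nm) <- attrib[[nm]]
  class(x) <- c(subclass, "resample", class(x))
  x
}

# Hypothetical validator: errors if a required attribute is missing,
# but leaves any additional attributes untouched.
validate_resample <- function(x, required = c("strata")) {
  missing_att <- setdiff(required, names(attributes(x)))
  if (length(missing_att) > 0) {
    stop("missing required attribute(s): ", paste(missing_att, collapse = ", "))
  }
  x
}

obj <- validate_resample(
  new_resample(
    data.frame(id = "Slice1"),
    attrib = list(strata = "date", horizon = 2L), # `horizon` is an extra attribute
    subclass = "vc_resample"
  )
)
```

With this arrangement a subclass can supply its own validator (or extend `required`) while the parent class stays agnostic about attributes it does not know.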
Funny timing, Davis submitted an issue that appears to do exactly that type of resampling already: #42
Haha yup, saw that. Didn't even occur to me to just I suppose I could compute outcomes post-resampling (so horizon doesn't matter yet), like in a
Maybe contact Davis about that. If there's some aspect of the current approach that is lacking, he would have a better sense of that than I would (since I don't work in time series all that much).
Thanks so much for your discussion! 🙌 I am cleaning up older issues. This looks like it has been wrapped up and/or superseded by new features. Feel free to add on or open a new issue if not!
Filtering attributes against a hardcoded list in maybe_rset means this package can't be extended in ways that add new attributes. This is also unexpected, as new_rset doesn't warn or error if passed unordained attributes. Here's an example:
Created on 2018-07-10 by the reprex package (v0.2.0).
Is there a reason to only allow certain attributes? I'd be happy to remove that restriction in a PR if not.
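To illustrate the general behavior being described (this is not the reprex from the issue, and not rsample's actual implementation), here is a base-R sketch of how filtering against a fixed list silently drops anything unexpected. The allowed_att vector, filter_att(), and the horizon attribute are hypothetical.

```r
# Hypothetical whitelist, standing in for a hardcoded attribute list.
allowed_att <- c("names", "row.names", "class", "strata")

# Keep only whitelisted attributes; everything else is silently discarded.
filter_att <- function(x, allowed = allowed_att) {
  att <- attributes(x)
  attributes(x) <- att[names(att) %in% allowed]
  x
}

obj <- data.frame(id = c("Slice1", "Slice2"))
attr(obj, "strata") <- "date"
attr(obj, "horizon") <- 2L # an attribute the whitelist does not know about

filtered <- filter_att(obj)
names(attributes(filtered)) # `horizon` is no longer present
```

No warning or error is raised when the extra attribute disappears, which is what makes this kind of filtering surprising for code that extends the class.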