Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hardcoded rset_att limits extendability #43

Closed
ClaytonJY opened this issue Jul 10, 2018 · 9 comments
Closed

Hardcoded rset_att limits extendability #43

ClaytonJY opened this issue Jul 10, 2018 · 9 comments

Comments

@ClaytonJY
Copy link
Contributor

Filtering attributes against a hardcoded list in maybe_rset means this package can't be extended in ways that add new attributes. This is also unexpected as new_rset doesn't warn or error if passed unordained attributes.

Here's an example:

rset <- rsample::vfold_cv(mtcars)

names(attributes(rset))
#> [1] "row.names" "class"     "names"     "v"         "repeats"   "strata"

names(attributes(dplyr::mutate(rset, x = 1:n())))
#> [1] "names"     "class"     "row.names" "v"         "repeats"   "strata"


custom_rset <- rsample:::new_rset(
  splits = list(
    rsample:::rsplit(mtcars, 1:10, 11:32),
    rsample:::rsplit(mtcars, 1:22, 23:32)
  ),
  ids = c("Slice1", "Slice2"),
  attrib = list(repeats = 1, foo = "bar"),
  subclass = c("custom", "rset")
)

names(attributes(custom_rset))
#> [1] "row.names" "class"     "names"     "repeats"   "foo"

# foo is dropped
names(attributes(dplyr::mutate(custom_rset, x = 1:n())))
#> [1] "names"     "class"     "row.names" "repeats"

Created on 2018-07-10 by the reprex package (v0.2.0).

Session info
devtools::session_info()
#> Session info -------------------------------------------------------------
#>  setting  value                       
#>  version  R version 3.4.4 (2018-03-15)
#>  system   x86_64, linux-gnu           
#>  ui       X11                         
#>  language (EN)                        
#>  collate  en_US.UTF-8                 
#>  tz       America/Detroit             
#>  date     2018-07-10
#> Packages -----------------------------------------------------------------
#>  package    * version    date       source                          
#>  abind        1.4-5      2016-07-21 CRAN (R 3.4.4)                  
#>  assertthat   0.2.0      2017-04-11 CRAN (R 3.4.4)                  
#>  backports    1.1.2      2017-12-13 CRAN (R 3.4.4)                  
#>  base       * 3.4.4      2018-03-16 local                           
#>  bindr        0.1.1      2018-03-13 CRAN (R 3.4.4)                  
#>  bindrcpp   * 0.2.2      2018-03-29 CRAN (R 3.4.4)                  
#>  broom        0.4.5      2018-07-03 cran (@0.4.5)                   
#>  class        7.3-14     2015-08-30 CRAN (R 3.4.0)                  
#>  compiler     3.4.4      2018-03-16 local                           
#>  CVST         0.2-2      2018-05-26 CRAN (R 3.4.4)                  
#>  datasets   * 3.4.4      2018-03-16 local                           
#>  ddalpha      1.3.4      2018-06-23 CRAN (R 3.4.4)                  
#>  DEoptimR     1.0-8      2016-11-19 CRAN (R 3.4.4)                  
#>  devtools     1.13.5     2018-02-18 CRAN (R 3.4.4)                  
#>  digest       0.6.15     2018-01-28 CRAN (R 3.4.4)                  
#>  dimRed       0.1.0      2017-05-04 CRAN (R 3.4.4)                  
#>  dplyr        0.7.6      2018-06-29 cran (@0.7.6)                   
#>  DRR          0.0.3      2018-01-06 CRAN (R 3.4.4)                  
#>  evaluate     0.10.1     2017-06-24 CRAN (R 3.4.4)                  
#>  foreign      0.8-70     2018-04-23 CRAN (R 3.4.4)                  
#>  geometry     0.3-6      2015-09-09 CRAN (R 3.4.4)                  
#>  glue         1.2.0      2017-10-29 CRAN (R 3.4.4)                  
#>  gower        0.1.2      2017-02-23 CRAN (R 3.4.4)                  
#>  graphics   * 3.4.4      2018-03-16 local                           
#>  grDevices  * 3.4.4      2018-03-16 local                           
#>  grid         3.4.4      2018-03-16 local                           
#>  htmltools    0.3.6      2017-04-28 CRAN (R 3.4.4)                  
#>  ipred        0.9-6      2017-03-01 CRAN (R 3.4.4)                  
#>  kernlab      0.9-26     2018-04-30 CRAN (R 3.4.4)                  
#>  knitr        1.20       2018-02-20 CRAN (R 3.4.4)                  
#>  lattice      0.20-35    2017-03-25 CRAN (R 3.3.3)                  
#>  lava         1.6.1      2018-03-28 CRAN (R 3.4.4)                  
#>  lubridate    1.7.4      2018-04-11 CRAN (R 3.4.4)                  
#>  magic        1.5-8      2018-01-26 CRAN (R 3.4.4)                  
#>  magrittr     1.5        2014-11-22 CRAN (R 3.4.4)                  
#>  MASS         7.3-50     2018-04-30 CRAN (R 3.4.4)                  
#>  Matrix       1.2-14     2018-04-09 CRAN (R 3.4.4)                  
#>  memoise      1.1.0      2017-04-21 CRAN (R 3.4.4)                  
#>  methods    * 3.4.4      2018-03-16 local                           
#>  mnormt       1.5-5      2016-10-15 CRAN (R 3.4.4)                  
#>  nlme         3.1-137    2018-04-07 CRAN (R 3.4.4)                  
#>  nnet         7.3-12     2016-02-02 CRAN (R 3.4.0)                  
#>  parallel     3.4.4      2018-03-16 local                           
#>  pillar       1.2.3      2018-05-25 CRAN (R 3.4.4)                  
#>  pkgconfig    2.0.1      2017-03-21 CRAN (R 3.4.4)                  
#>  pls          2.6-0      2016-12-18 CRAN (R 3.4.4)                  
#>  plyr         1.8.4      2016-06-08 CRAN (R 3.4.4)                  
#>  prodlim      2018.04.18 2018-04-18 CRAN (R 3.4.4)                  
#>  psych        1.8.4      2018-05-06 CRAN (R 3.4.4)                  
#>  purrr        0.2.5      2018-05-29 CRAN (R 3.4.4)                  
#>  R6           2.2.2      2017-06-17 CRAN (R 3.4.4)                  
#>  Rcpp         0.12.17    2018-05-18 CRAN (R 3.4.4)                  
#>  RcppRoll     0.3.0      2018-06-05 CRAN (R 3.4.4)                  
#>  recipes      0.1.3      2018-07-09 Github (topepo/recipes@b1b5da9) 
#>  reshape2     1.4.3      2017-12-11 CRAN (R 3.4.4)                  
#>  rlang        0.2.0.9001 2018-07-06 Github (tidyverse/rlang@b4f810f)
#>  rmarkdown    1.10       2018-06-11 CRAN (R 3.4.4)                  
#>  robustbase   0.93-1     2018-06-23 CRAN (R 3.4.4)                  
#>  rpart        4.1-13     2018-02-23 CRAN (R 3.4.3)                  
#>  rprojroot    1.3-2      2018-01-03 CRAN (R 3.4.4)                  
#>  rsample      0.0.2      2017-11-12 CRAN (R 3.4.4)                  
#>  sfsmisc      1.1-2      2018-03-05 CRAN (R 3.4.4)                  
#>  splines      3.4.4      2018-03-16 local                           
#>  stats      * 3.4.4      2018-03-16 local                           
#>  stringi      1.2.3      2018-06-12 cran (@1.2.3)                   
#>  stringr      1.3.1      2018-05-10 CRAN (R 3.4.4)                  
#>  survival     2.42-3     2018-04-16 CRAN (R 3.4.4)                  
#>  tibble       1.4.2      2018-01-22 CRAN (R 3.4.4)                  
#>  tidyr        0.8.1      2018-05-18 CRAN (R 3.4.4)                  
#>  tidyselect   0.2.4      2018-02-26 CRAN (R 3.4.4)                  
#>  timeDate     3043.102   2018-02-21 CRAN (R 3.4.4)                  
#>  tools        3.4.4      2018-03-16 local                           
#>  utils      * 3.4.4      2018-03-16 local                           
#>  withr        2.1.2      2018-03-15 CRAN (R 3.4.4)                  
#>  yaml         2.1.19     2018-05-01 CRAN (R 3.4.4)

Is there a reason to only allow certain attributes? I'd be happy to remove that restriction in a PR if not.

@topepo
Copy link
Member

topepo commented Jul 10, 2018

The idea is to use a constructor for new rset objects to ensure uniformity so it's basically doing what it was designed to (see section 14.3.1 here for the logic).

If you are implementing a different resampling method, what attributes do you need it add? It contains the attributes that are required for the methods that I've implemented but I don't consider those to be everything that might ever be needed.

@ClaytonJY
Copy link
Contributor Author

I have a function that uses internal logic to first select one or more columns from the input (by type, in this case), and then computes on those to generate train/test indices for each split. Given what I'd seen the included rsample function do, it seemed like I should add the split_cols to the attributes, to allow for introspection of how the splits were made. It seems to fit a kind of implicit ethos of the included functions I'm starting to pick up on.

I don't think I need to add this attribute, as I don't currently plan on having downstream code consume that bit of metadata, but in the abstract I can imagine wanting to do so in the future.

Would you mind explaining the logic behind the explicit whitelist for attributes? Is there harm in retaining "unexpected" attributes?

@topepo
Copy link
Member

topepo commented Jul 10, 2018

Given what I'd seen the included rsample function do, it seemed like I should add the split_cols to the attributes, to allow for introspection of how the splits were made.

So you mean the strata? That's a good idea.

You could probably overload the class:

> tmp <- vfold_cv(mtcars, strata = "cyl")
> class(tmp)
[1] "vfold_cv"   "rset"       "tbl_df"     "tbl"        "data.frame"
> class(tmp) <- c("new_class", class(tmp))

and make a specific rset method for your new class (e.g. like the vfold_cv class does above).

Would you mind explaining the logic behind the explicit whitelist for attributes? Is there harm in retaining "unexpected" attributes?

It wasn't really done to be restrictive in the way that you have encountered. It's just a good method for 1) forcing you to think out the object structure and plan for what you will eventually need and 2) ensure that any objects of that class are consistent with the structure.

@ClaytonJY
Copy link
Contributor Author

ClaytonJY commented Jul 11, 2018

I hadn't been thinking of them as strata, but I suppose they are, so I'll stick my split cols in the strata attribute, thanks!

Not to worry, I did make my own function and do overload the class via the subclass arg from rsample:::new_rset (p.s. seems odd to specify "rset" subclass rather than that being automatic). It's a long function with a lot of arg validation, but I'll include it behind the fold below if you're feeling curious; it's for selective backtesting of financial panel data with arbitrary-but-enforced forecast horizons.

my function (terrible name, I know)
#' Resample a dataset into time-aware, gap-respecting train/test pairs.
#'
#' A custom implementation similar to [rsample::rolling_origin()], but more
#' respectful of time, and including as many observation in both sets as
#' possible without violating the boundary.
#'
#' For each split, rows _ending strictly before_ the respective element of
#' `dates` will be in the `analysis` set, and rows _beginning on or after_ the
#' date will be in the `assessment` set. Rows in-between will be dropped ("the
#' gap"). Rows for which this relation can't be determined (`NA`'s) will be
#' dropped, and rows with any empty values in date columns will not be in
#' analysis sets, but may be in assessment sets. Comparisons will be made using
#' the per-row minima and maxima across
#' all `Date`-type columns after applying any selectors.
#'
#' @param tbl (`tbl_df`) a dataset to resample, with at least one `Date`-valued
#'   column
#' @param split_dates (`Date`) a vector of prediction dates to use for resampling.
#'   Will be coerced to `Date` with [lubridate::as_date()] if necessary.
#' @param ... selectors to apply with [dplyr::select()] to filter before
#'   identifying date-values columns, e.g. to ignore certain columns. All
#'   columns will still appear in results.
#'
#' @return (`vc_resample`, `rset`, `tbl_df`) with `length(dates)` rows and the
#'   following columns:
#'   - `splits` (`rsplit`): individual splits
#'   - `id` (`character`): labels
#' The `dates` are stored in an attribute of the same name.
#'
#' @export
#'
#' @examples
#' library(dplyr)
#'
#' tbl <- tibble(
#'   date      = rep(seq(as.Date("2018-01-01"), by = 1, length.out = 6), each = 2),
#'   ticker    = rep(c("AAPL", "IWV"), length.out = length(date)),
#'   lead_date = rep(lead(unique(date), 2), each = n_distinct(ticker)),
#'   value     = 1:12
#' )
#'
#' tbl
#'
#' split_dates <- c("2018-01-04", "2018-01-05", "2018-01-06")
#'
#' vc_resample(tbl, split_dates)         # use date and lead_date
#' vc_resample(tbl, split_dates, date)   # don't use lead_date
#' vc_resample(tbl, split_dates, -date)  # only use lead_date
vc_resample <- function(tbl, split_dates, ...) {

  checkmate::assert_tibble(tbl, min.rows = 1L, min.cols = 1L)

  if (!lubridate::is.Date(split_dates)) split_dates <- lubridate::as_date(split_dates)
  checkmate::assert_date(split_dates, any.missing = FALSE, min.len = 1L, unique = TRUE)

  # apply selections
  selectors <- rlang::enquos(...)
  if (rlang::is_empty(selectors)) {
    date_cols <- tbl
  } else {
    date_cols <- dplyr::select(tbl, !!!selectors)
  }

  # ensure at least one is a Date
  checkmate::assert_true(any(vapply(date_cols, lubridate::is.Date, logical(1))))

  # isolate columns to split by
  date_cols <- purrr::keep(date_cols, lubridate::is.Date)

  # find min and max dates
  max_dates <- purrr::lift_dl(pmax)(date_cols, na.rm = TRUE)
  min_dates <- purrr::lift_dl(pmin)(date_cols, na.rm = TRUE)

  # ensure no splits will be empty
  checkmate::assert_true(all(split_dates >  min(max_dates, na.rm = TRUE)))
  checkmate::assert_true(all(split_dates <= max(min_dates, na.rm = TRUE)))

  # create splits
  splits <- map(
    split_dates,
      ~rsample:::rsplit(
        tbl,
        which((max_dates < .x) & complete.cases(date_cols)),  # no NA's in analysis
        which(min_dates >=.x))
    )

  # pack splits into rset
  rsample:::new_rset(
    splits = splits,
    ids    = paste0("Slice", gsub(" ", "0", format(seq_along(split_dates)))),
    attrib = list(strata = names(date_cols)), # just changed
    subclass = c("vc_resample", "rset")
  ) %>%
    dplyr::mutate(split_date = split_dates)
}

I get your point (1) about enforcing a structure, and ensuring that certain attributes exist is expected (as you do already), but having a whitelist enforced at the parent-class level seems highly unusual and of unclear benefit. If a downstream rset consumer can't handle an extra attribute, I'd think that's an issue with the consumer, not the producer.

That being said, if you'd still like to keep the attribute whitelist, I have two related suggestions:

  • enforce at creation (new_rset) rather than manipulation (mutate.rset)
  • warn or error for unsupported names rather than silently dropping attributes

Per Hadley's S3 chapter linked above, the validation could happen in a new validate_rset(), called on the result of new_rset() inside a new rset() function. This gets at your point (2) above, and would also partially address #40.

A nice side effect of not enforcing the whitelist in the rset methods for dplyr verbs (or dropping the whitelist entirely) is that you'd soon be able to drop most of dplyr-compat.R, since if I'm following the web of issues correctly, all dplyr verbs will soon retain attributes (tidyverse/dplyr#3429).

I'd be glad to work up a PR if any of the above sounds good to you.

In any case, thanks for engaging with me here, and thanks for these sweet tidymodels packages!

@topepo
Copy link
Member

topepo commented Jul 11, 2018

Funny timing, Davis submitted an issue that appears to do exactly that type of resampling already: #42

@ClaytonJY
Copy link
Contributor Author

Haha yup, saw that. Didn't even occur to me to just nest() first, but while that would fix the panel/group aspect, I still need the multi-column date_cols logic to properly respect the "gap" caused by the forecasting horizon (date vs lead_date), and more control over which splits to make (split_dates).

I suppose I could compute outcomes post-resampling (so horizon doesn't matter yet), like in a recipe chain, and have that implicitly drop observations where the future value isn't available? Hmm...

@topepo
Copy link
Member

topepo commented Jul 12, 2018

Haha yup, saw that. Didn't even occur to me to just nest() first, but while that would fix the panel/group aspect, I still need the multi-column date_cols logic to properly respect the "gap" caused by the forecasting horizon (date vs lead_date), and more control over which splits to make (split_dates).

Maybe contact Davis about that. If there's some aspect of the current approach that is lacking, he would have a better sense of that than I would (since I don't work in time series all that much).

@juliasilge
Copy link
Member

Thanks so much for your discussion! 🙌 I am cleaning up older issues.

This looks like it has been wrapped up and/or superseded by new features. Feel free to add on or open a new issue if not!

@github-actions
Copy link

This issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with a reprex https://reprex.tidyverse.org) and link to this issue.

@github-actions github-actions bot locked and limited conversation to collaborators Feb 21, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants