Unexpected lubridate interval behavior when filtering dataframe via dplyr #3206

Closed
mienkoja opened this issue Nov 14, 2017 · 9 comments

@mienkoja

Background
I posted this on SO as well, since I'm not entirely certain this is a bug - perhaps just my ignorance.

I am creating a lubridate interval vector using a dataset similar to that which is available in the following chunk:

    dat <- readr::read_rds(url("https://github.com/mienkoja/stack_stash/blob/master/dat.rds?raw=true"))
    
    dat_mod <- dplyr::mutate(dat, interval = lubridate::interval(lubridate::ymd(start_combo)
                                                                 ,lubridate::ymd(stop_combo)))

Problem

If I try to filter the object via dplyr, I get an unexpected interval returned:

    dplyr::filter(dat_mod, id == 7000)
    
    ## A tibble: 1 x 4
    #     id start_combo stop_combo                       interval
    #  <int>       <dbl>      <dbl>                 <S4: Interval>
    #1  7000    20170802   20170816 2016-10-11 UTC--2016-10-25 UTC

My own investigation

Every interval (viewed via dplyr::filter) appears to be the same length as the expected interval, but all anchored (incorrectly) to 2016-10-11 UTC.

I suspect this is a dplyr::filter bug as I can search by bracket notation and get the expected result.

    dat_mod[dat_mod$id == 7000, ]
    
    ## A tibble: 1 x 4
    #     id start_combo stop_combo                       interval
    #  <int>       <dbl>      <dbl>                 <S4: Interval>
    #1  7000    20170802   20170816 2017-08-02 UTC--2017-08-16 UTC

Also, this does not appear to be an artifact of what is getting displayed in the output. If I assign the filtered object to a new object, the incorrect interval is preserved:

    dat_mod2 <- dplyr::filter(dat_mod, id == 7000)
    dat_mod2
    ## A tibble: 1 x 4
    #     id start_combo stop_combo                       interval
    #  <int>       <dbl>      <dbl>                 <S4: Interval>
    #1  7000    20170802   20170816 2016-10-11 UTC--2016-10-25 UTC
@mienkoja mienkoja changed the title from "lubri" to "Unexpected lubridate interval behavior when filtering dataframe via dplyr" Nov 14, 2017
@yutannihilation
Member

Hi, I guess this is the same kind of bug as #2568, as suggested on SO, and #1581 is an even closer match. If you call str() on the interval column, you can see that it failed to subset @start. This is because the S4 method of [ for the Interval class is not properly called inside filter().

Let's set our hope on vctrs...

    library(dplyr)  # provides %>% and pull()

    # Subset via dplyr::filter(): @start keeps all 7632 elements
    dplyr::filter(dat_mod, .data$id == 7000) %>%
      pull(interval) %>%
      str()
    #> Formal class 'Interval' [package "lubridate"] with 3 slots
    #>   ..@ .Data: num 1209600
    #>   ..@ start: POSIXct[1:7632], format: "2016-10-11" NA NA NA ...
    #>   ..@ tzone: chr "UTC"

    # Subset via `[`: @start is subset along with .Data
    dat_mod[dat_mod$id == 7000, ] %>%
      pull(interval) %>%
      str()
    #> Formal class 'Interval' [package "lubridate"] with 3 slots
    #>   ..@ .Data: num 1209600
    #>   ..@ start: POSIXct[1:1], format: "2017-08-02"
    #>   ..@ tzone: chr "UTC"
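
The two str() outputs are consistent with only the .Data slot (the interval lengths in seconds) being subset while @start is left at full length. A minimal sketch that simulates this by hand follows. The three example dates and the manual slot assignment are illustrative assumptions, not what dplyr does internally.

```r
library(lubridate)

# Three illustrative 14-day intervals (made-up rows, not the issue's dataset)
starts <- ymd(c("2016-10-11", "2017-01-05", "2017-08-02"))
stops  <- ymd(c("2016-10-25", "2017-01-19", "2017-08-16"))
iv <- interval(starts, stops)

# Proper subsetting: the S4 `[` method subsets .Data and @start together
good <- iv[3]
int_start(good)

# Simulated faulty subsetting: only .Data is subset, @start is left untouched
broken <- iv
broken@.Data <- iv@.Data[3]

# The length is still 14 days, but the first start is now the first row's start
broken@.Data
int_start(broken)[1]
```

In the simulated case the interval keeps its correct length but is anchored to the first row's start date, which matches the symptom described in the original report.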

@mienkoja
Author

Thanks @yutannihilation!

Since this seems to be a much more pervasive issue (and one that appears not to admit of a simple solution, given how long it's been since some of these issues were posted), perhaps we could just (for now) get a warning based on the class of the columns?

Maybe something like this would work for my current problem?

    filter <- function(.data, ...) {
      num_interval_cols <- length(which(unlist(lapply(.data, class)) == "Interval"))
      if (num_interval_cols > 0) {
        warning("S4 Interval class not currently supported inside filter(). Results may not be accurate.")
      }
      UseMethod("filter")
    }
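
The same class check can also be done without redefining filter(). The sketch below is only an illustration: has_lubridate_s4_cols() is a hypothetical helper name, and checking for both Interval and Period is an assumption about which classes are affected.

```r
library(lubridate)

# Hypothetical helper: TRUE if any column is a lubridate Interval or Period.
# inherits() is used instead of comparing class names with `==`.
has_lubridate_s4_cols <- function(df) {
  any(vapply(df, inherits, logical(1), what = c("Interval", "Period")))
}

# Made-up example data with the same column layout as in the report
dat_example <- tibble::tibble(
  id = c(1L, 7000L),
  start_combo = c(20161011, 20170802),
  stop_combo = c(20161025, 20170816)
)
dat_example$interval <- interval(ymd(dat_example$start_combo),
                                 ymd(dat_example$stop_combo))

if (has_lubridate_s4_cols(dat_example)) {
  warning("Interval/Period columns present; check results after filtering.")
}
```

Calling the helper before filtering leaves dplyr's own generic untouched, so nothing else that relies on filter() is affected.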

@yutannihilation
Member

yutannihilation commented Nov 14, 2017

Thanks!

Maybe something like this would work for my current problem?

I think so. One idea is to add a row ID column and then subset the original data.frame by the row IDs that remain in the result data.frame. I don't think my code is great, but SO is a great place to ask for a better version of it :)

    library(magrittr)  # provides %>%

    myfilter <- function(d, ...) {
      # capture the filter predicates unevaluated
      preds <- rlang::quos(...)

      result <- d %>%
        # add row IDs to distinguish rows in the result
        tibble::rowid_to_column(var = "rowid") %>%
        dplyr::filter(!!! preds)

      # overwrite S4 cols by data properly subsetted by `[`
      cols_S4 <- colnames(result)[purrr::map_lgl(result, ~ isS4(x = .))]
      result[, cols_S4] <- d[result$rowid, cols_S4]

      # remove row ID column
      dplyr::select(result, -rowid)
    }
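
A simpler alternative, not proposed in this thread, is to filter first and build the interval column afterwards, so that filter() never has to subset an S4 column. The sketch below assumes the start_combo and stop_combo columns are still available after filtering and uses made-up rows in place of the linked dataset.

```r
library(dplyr)
library(lubridate)

# Made-up rows with the same column layout as the dataset in the report
dat_example <- tibble(
  id = c(1L, 7000L),
  start_combo = c(20161011, 20170802),
  stop_combo = c(20161025, 20170816)
)

# Filter on the plain columns first, then create the interval column
res <- dat_example %>%
  filter(id == 7000) %>%
  mutate(interval = interval(ymd(start_combo), ymd(stop_combo)))

int_start(res$interval)
int_end(res$interval)
```

This only helps when the interval can be recomputed from other columns. Otherwise the row ID approach above is the more general workaround.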

@mienkoja
Author

Would the tidyverse team entertain a pure R solution to this? I believe the filter method is written in C - correct?

@yutannihilation
Member

Ah, if you took the code above as a "solution", no. I intended it as a temporary "workaround" until the vctrs package does the right thing.

@krlmlr
Member

krlmlr commented Dec 12, 2017

Thanks. Do you think #2432 would fix this issue as well?

@yutannihilation
Member

I'm not quite sure how #2432 will be implemented, but I think so. The root cause is the absence of nice ways to dispatch the correct method for non-base types, which is the same as other issues listed on there.

@romainfrancois
Member

I'll close this now as a duplicate of #2432.

At the moment, we have a workaround in place to essentially refuse to deal with interval objects.

    > dplyr::filter(dat_mod, id == 7000)
    Error in filter_impl(.data, quo) :
      Column `interval` classes Period and Interval from lubridate are currently not supported.

This is of course temporary until we deal with #2432, but at least this is less surprising.

urskalbitzer added a commit to pace-primates/paceR that referenced this issue May 24, 2018
@lock

lock bot commented Oct 20, 2018

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Oct 20, 2018