Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

between() function is not generic #3744

Closed
leungi opened this issue Aug 8, 2018 · 11 comments
Closed

between() function is not generic #3744

leungi opened this issue Aug 8, 2018 · 11 comments

Comments

@leungi
Copy link

leungi commented Aug 8, 2018

The tsibble package imports dplyr::filter().

library(dplyr)
library(tsibble)
library(lubridate)

# this fails
tsibbledata::elecdemand %>%
  filter(between(index, ymd("2014-03-01"), ymd("2014-03-04")))
#> Warning: package 'bindrcpp' was built under R version 3.4.4
#> Error: A tsibble must not be empty.

# this works, by explicitly specifying date and time
tsibbledata::elecdemand %>%
  filter(between(index, ymd_h("2014-03-01 0"), ymd_hm("2014-03-04 23:30")))

# this works as well, but without specifying date and time; is it because the filter() 
# criteria defaults to 0th hour if time component is not specified?
tsibbledata::elecdemand %>%
    filter(index > ymd("2014-03-01"))
@romainfrancois
Copy link
Member

romainfrancois commented Aug 8, 2018

@leungi where can I find tsibbledata ? Alternatively could you make a reproducible example with dummy data ?

This seems to work fine for me:

> tibble( index = ymd("2014-03-01") ) %>% 
+    filter(between(index, ymd("2014-03-01"), ymd("2014-03-04")))
# A tibble: 1 x 1
  index     
  <date>    
1 2014-03-01

@leungi
Copy link
Author

leungi commented Aug 8, 2018

Thanks for prompt reply @romainfrancois.

# install.packages("devtools")
devtools::install_github("tidyverts/tsibbledata")

# alternatively
tibble::tribble(
            ~index,     ~Demand, ~WorkDay, ~Temperature,
  "3/1/2014 15:00", 4.313623932,       0L,         22.6,
  "3/1/2014 15:30", 4.335849964,       0L,         22.9,
  "3/1/2014 16:00", 4.359745492,       0L,         22.6,
  "3/1/2014 16:30",  4.41801486,       0L,         23.1,
  "3/1/2014 17:00", 4.422075156,       0L,         22.4,
  "3/1/2014 17:30", 4.388184742,       0L,           23,
  "3/1/2014 18:00",  4.30974097,       0L,         22.1,
  "3/1/2014 18:30",  4.27559588,       0L,           21,
  "3/1/2014 19:00", 4.312765668,       0L,         20.3,
  "3/1/2014 19:30", 4.349436696,       0L,         19.7,
  "3/1/2014 20:00", 4.240609104,       0L,         19.4,
  "3/1/2014 20:30", 4.135012122,       0L,         19.1,
  "3/1/2014 21:00", 4.009609112,       0L,           19,
  "3/1/2014 21:30", 3.874741748,       0L,         19.1,
  "3/1/2014 22:00", 3.835917938,       0L,         18.8,
  "3/1/2014 22:30", 3.854626102,       0L,         18.6,
  "3/1/2014 23:00", 4.196060894,       0L,         18.8,
  "3/1/2014 23:30", 4.265129816,       0L,         18.7,
   "3/2/2014 0:00", 3.922372796,       0L,         18.5,
   "3/2/2014 0:30", 3.622464232,       0L,         18.4,
   "3/2/2014 1:00", 3.437786128,       0L,         18.3,
   "3/2/2014 1:30",  3.31374111,       0L,         18.3
  )

# this works
tsibbledata::elecdemand %>%
  filter(between(index, ymd_h("2014-03-01 15"), ymd_hm("2014-03-02 1:30")))

# this doesn't work
tsibbledata::elecdemand %>%
  filter(between(index, ymd_h("2014-03-01 15"), ymd("2014-03-02")))

# again, this works
tsibbledata::elecdemand %>%
  filter(index >= ymd("2014-03-01"))

@hadley
Copy link
Member

hadley commented Aug 10, 2018

Isn't this a tsibble issue?

@earowang
Copy link
Contributor

earowang commented Aug 11, 2018

I think the date-time issue is related to between not filter.

Here's the reprex.

library(lubridate)
#> 
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#> 
#>     date
x <- make_datetime(2018) + hours(0:48)
dplyr::between(x, ymd("2018-01-01"), ymd("2018-01-03")) 
# expected as x >= ymd("2018-01-01") & x <= ymd("2018-01-03")
#>  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
#> [45] FALSE FALSE FALSE FALSE FALSE
x >= ymd("2018-01-01") & x <= ymd("2018-01-03")
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE
dplyr::between(x, ymd("2018-01-01"), ymd_h("2018-01-03 0")) # repaired version
#>  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [15] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [29] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
#> [43] TRUE TRUE TRUE TRUE TRUE TRUE TRUE

Created on 2018-08-11 by the reprex
package
(v0.2.0).

@leungi
Copy link
Author

leungi commented Aug 11, 2018

@earowang, thanks for investigation!

I was going to follow up this weekend, but you got ahead 👍

@cderv
Copy link
Contributor

cderv commented Aug 12, 2018

I think the cause for all this is because between currently just works with numeric vector and is strict with type with x and both boundaries. See @romainfrancois comments and the cpp implementation of between

LogicalVector between(NumericVector x, double left, double right) {

Now, as date/datetime object can be converted to numeric, it should work. However, Date object and datetime object (POSIXct) are not the same and cannot be compared as numeric. When providing date or datetime to between every arguments must be the same type, to be able to be compared as numeric correctly. It is why the behavior in the first example of @leungi is correct, due to the use of lubridate::ymd that returns Date object by default. (New default in lubridate 1.5.6).

I detailed the behavior and the different option in the reprex below

library(lubridate)
# this is a datetime object (POSIXct)
x <- make_datetime(2018, 01, 02, 10) # 2018-01-02 10:00 UTC
class(x)
#> [1] "POSIXct" "POSIXt"
as.numeric(x)
#> [1] 1514887200
# ymd with tz = NULL (the default) returns a Date object
y <- ymd("2018-01-01", tz = NULL)
class(y)
#> [1] "Date"
as.numeric(y)
#> [1] 17532
# between converts to numeric and is right to return false
dplyr::between(x, ymd("2018-01-01", tz = NULL), ymd("2018-01-03", tz = NULL))
#> [1] FALSE
as.numeric(x) >= as.numeric(ymd("2018-01-01", tz = NULL)) & as.numeric(x) <= as.numeric(ymd("2018-01-03", tz = NULL))
#> [1] FALSE

# lubridate::ymd with tz not NULL returns a datetime object (POSIXct)
class(ymd("2018-01-01", tz = "UTC"))
#> [1] "POSIXct" "POSIXt"
# it works fine if all are the same type
dplyr::between(x, ymd("2018-01-01", tz = "UTC"), ymd("2018-01-03", tz = "UTC")) 
#> [1] TRUE
as.numeric(x) >= as.numeric(ymd("2018-01-01", tz = "UTC")) & as.numeric(x) <= as.numeric(ymd("2018-01-03", tz = "UTC"))
#> [1] TRUE

# you could also use lubridate::as_datetime 
class(as_datetime("2018-01-01"))
#> [1] "POSIXct" "POSIXt"
# it works fine if all are the same type
dplyr::between(x, as_datetime("2018-01-01"), as_datetime("2018-01-03")) 
#> [1] TRUE

Created on 2018-08-12 by the reprex package (v0.2.0).

So, I think there is no bug here - it is just how between currently works. There are some opened issues related to this function to add some support for other types. I do not know what is planned for date/datetime object.
However, there could be some additions to the documentation, and some examples, to explain the limitations and how it works. Also, a check for strict types of arguments in between could prevent some errors. Just ideas that I could help with a PR if needed.

In your case with tsibble, you could:

  • not use between to avoid error
  • always insure that you are working with POSIXct, and be careful with ymd default behaviour.
  • Maybe something could be done in tsibble to help with all this. (converting date to POSIXct ?). Not sure.

Hope it helps.

@hadley
Copy link
Member

hadley commented Aug 12, 2018

Since this is a problem related to types/classes it will eventually be addressed in vctrs, not dplyr. I think there’s already an issue about between in vctrs; it’d be great if someone could find it and cross link with this issue

@romainfrancois
Copy link
Member

Also related to #2995

@leungi
Copy link
Author

leungi commented Aug 13, 2018

@cderv, thanks for the detailed review; points taken.

A few notes on my end.

Without a timestamp in string, lubdridate defaults to Date type

class(ymd("2014-03-01"))
#> [1] "Date"

By specifying any component of timestamp, lubdridate defaults to POSIXct type, regardless if tz is specified or not.

class(ymd_h("2014-03-01 15"))
#> [1] "POSIXct" "POSIXt" 

class(ymd("2014-03-01", tz = 'UTC'))
#> [1] "POSIXct" "POSIXt" 

Converting to POSIXct type seems to be the best workaround for now 👍

@hadley hadley changed the title Unexpected behaviour using dplyr::filter() on DateTime column between() function is not generic Oct 1, 2018
@romainfrancois
Copy link
Member

Closing this in favor of https://github.com/r-lib/vctrs/issues/147

@lock
Copy link

lock bot commented Jun 16, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Jun 16, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants