Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Operations involving group_by() can corrupt difftime values due to mixing up units. #2059

Closed
phirsch opened this issue Aug 9, 2016 · 4 comments
Labels
bug an unexpected problem or unintended behavior

Comments

@phirsch
Copy link

phirsch commented Aug 9, 2016

In this example, a result of "1 days" gets mis-reported as "1 secs", presumably because of a bug in handling units during re-combination of groupwise results:

> library(dplyr)
> library(tidyr)
>
> ff = data.frame(
    g = c(1,2),
    start = c("2000-01-01 00:00:00", "2000-01-01 00:00:00") %>% as.POSIXct,
    end = c("2000-01-01 00:00:02", "2000-01-02 00:00:00") %>% as.POSIXct())
>
> ff
  g      start                 end
1 1 2000-01-01 2000-01-01 00:00:02
2 2 2000-01-01 2000-01-02 00:00:00
> 
> # Without grouping, the result is correct:
>
> ff %>% mutate(delta = end - start)
  g      start                 end      delta
1 1 2000-01-01 2000-01-01 00:00:02     2 secs
2 2 2000-01-01 2000-01-02 00:00:00 86400 secs
>
> # With grouping, the result is wrong - result line 2 is "1 secs" instead of "1 days":
>
> ff %>% group_by(g) %>% mutate(delta = end - start)
Source: local data frame [2 x 4]
Groups: g [2]

      g      start                 end          delta
  <dbl>     <time>              <time> <S3: difftime>
1     1 2000-01-01 2000-01-01 00:00:02         2 secs
2     2 2000-01-01 2000-01-02 00:00:00         1 secs
>
> # summarise() gives the same (wrong) result:
>
> ff %>% group_by(g) %>% summarise(delta = end - start)
# A tibble: 2 x 2
      g          delta
  <dbl> <S3: difftime>
1     1         2 secs
2     2         1 secs
>
> # summarise with intermediate result values gets the numeric values right, but
> # drops type information:
>
> ff %>% group_by(g) %>% summarise(t0 = min(start), t1 = max(end), delta = t1 - t0)
# A tibble: 2 x 4
      g         t0                  t1 delta
  <dbl>     <time>              <time> <dbl>
1     1 2000-01-01 2000-01-01 00:00:02     2
2     2 2000-01-01 2000-01-02 00:00:00 86400
>
> sessionInfo()
R version 3.3.1 (2016-06-21)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 14.04.5 LTS

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C               LC_TIME=en_AU.UTF-8        LC_COLLATE=en_US.UTF-8     LC_MONETARY=en_AU.UTF-8    LC_MESSAGES=en_US.UTF-8   
 [7] LC_PAPER=en_AU.UTF-8       LC_NAME=C                  LC_ADDRESS=C               LC_TELEPHONE=C             LC_MEASUREMENT=en_AU.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tidyr_0.5.1 dplyr_0.5.0

loaded via a namespace (and not attached):
[1] lazyeval_0.2.0  magrittr_1.5    R6_2.1.2        assertthat_0.1  rsconnect_0.4.3 DBI_0.4-1       tools_3.3.1     tibble_1.1      Rcpp_0.12.6
@hadley
Copy link
Member

hadley commented Aug 9, 2016

Yes, this why you shouldn't rely on difftime. See lubridate for alternatives.

@hadley hadley closed this as completed Aug 9, 2016
@phirsch
Copy link
Author

phirsch commented Aug 10, 2016

In my original use case, I was actually using time objects created by lubridate::ymd_hms(). Taking their difference silently yields a difftime object and thus triggers the same bug - without any warning. (Sorry for using as.POSIXct() to create my example data!)

> ff = tibble(g = c(1,2), start = c("2000-01-01 00:00:00", "2000-01-01 00:00:00") %>% ymd_hms(), end = c("2000-01-01 00:00:02", "2000-01-02 00:00:00") %>% ymd_hms())
> ff
  g      start                 end
1 1 2000-01-01 2000-01-01 00:00:02
2 2 2000-01-01 2000-01-02 00:00:00
>
> ff %>% group_by(g) %>% summarise(delta = end - start)
# A tibble: 2 x 2
      g          delta
  <dbl> <S3: difftime>
1     1         2 secs
2     2         1 secs

Is it possible to at least have dplyr emit a warning instead of silently returning erroneous results in this case?

And thank you for all your wonderful packages - they make R so much more regular and enjoyable and help to avoid many gotchas! This issue, however, feels like a gotcha to me - particularly since I have not been able to find any warning about this in the dplyr or lubridate docs, and there is this comment of yours on SO:

difftime(timestamp, lag(timestamp)) would be a little simpler – hadley Sep 13 '14 at 14:10

@hadley hadley reopened this Aug 10, 2016
@krlmlr
Copy link
Member

krlmlr commented Nov 7, 2016

Could hms be a useful option? This is basically difftime with seconds only, and nice formatting.

We should do better in vctrs, this looks too specialized for dplyr.

@hadley hadley added bug an unexpected problem or unintended behavior data frame labels Feb 9, 2017
@hadley
Copy link
Member

hadley commented Feb 16, 2017

Now part of #2432

@hadley hadley closed this as completed Feb 16, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 8, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants