joining with class interval #3217

Closed
greg-botwin opened this issue Nov 21, 2017 · 4 comments

greg-botwin commented Nov 21, 2017

This is my first time posting on GitHub so I apologize in advance if I am missing needed information or if I am doing anything incorrectly. I also want to thank the developers for making an amazing package!

I ran into a possible bug when joining a data frame that contains a column of class Interval (lubridate). The Interval column seems to be matched by row position rather than by the "by" variable, so its values end up attached to the wrong rows. I think this is related to #2432, but I did not see this specific case described there. Using merge() produces the correct data frame.

Appreciate your support!

library(lubridate)
library(dplyr)
library(reprex)

df1 <- data.frame(id = seq(from = 1, to = 10, by = 1),
                  interval_start = seq.Date(from = ymd("2017-03-01"), to = ymd("2017-03-10"), by = 1))

df1 <- df1 %>%
  mutate(interval_end = interval_start + days(10)) %>%
  mutate(ten_d_interval = interval(start = interval_start, end = interval_end))

df2 <- data.frame(id = c(1,2,3,4,5,10,9,8,7,6),
                  letters = letters[1:10])

df2_join_df1 <- inner_join(df2, df1, by = "id")

# incorrectly joined df, see the bottom five rows (the interval for id 8 matches only by coincidence)
df2_join_df1
#>    id letters interval_start interval_end                 ten_d_interval
#> 1   1       a     2017-03-01   2017-03-11 2017-03-01 UTC--2017-03-11 UTC
#> 2   2       b     2017-03-02   2017-03-12 2017-03-02 UTC--2017-03-12 UTC
#> 3   3       c     2017-03-03   2017-03-13 2017-03-03 UTC--2017-03-13 UTC
#> 4   4       d     2017-03-04   2017-03-14 2017-03-04 UTC--2017-03-14 UTC
#> 5   5       e     2017-03-05   2017-03-15 2017-03-05 UTC--2017-03-15 UTC
#> 6  10       f     2017-03-10   2017-03-20 2017-03-06 UTC--2017-03-16 UTC
#> 7   9       g     2017-03-09   2017-03-19 2017-03-07 UTC--2017-03-17 UTC
#> 8   8       h     2017-03-08   2017-03-18 2017-03-08 UTC--2017-03-18 UTC
#> 9   7       i     2017-03-07   2017-03-17 2017-03-09 UTC--2017-03-19 UTC
#> 10  6       j     2017-03-06   2017-03-16 2017-03-10 UTC--2017-03-20 UTC

df2_merge_df1 <- merge(df2, df1, by = "id")

# correctly merged df
df2_merge_df1
#>    id letters interval_start interval_end                 ten_d_interval
#> 1   1       a     2017-03-01   2017-03-11 2017-03-01 UTC--2017-03-11 UTC
#> 2   2       b     2017-03-02   2017-03-12 2017-03-02 UTC--2017-03-12 UTC
#> 3   3       c     2017-03-03   2017-03-13 2017-03-03 UTC--2017-03-13 UTC
#> 4   4       d     2017-03-04   2017-03-14 2017-03-04 UTC--2017-03-14 UTC
#> 5   5       e     2017-03-05   2017-03-15 2017-03-05 UTC--2017-03-15 UTC
#> 6   6       j     2017-03-06   2017-03-16 2017-03-06 UTC--2017-03-16 UTC
#> 7   7       i     2017-03-07   2017-03-17 2017-03-07 UTC--2017-03-17 UTC
#> 8   8       h     2017-03-08   2017-03-18 2017-03-08 UTC--2017-03-18 UTC
#> 9   9       g     2017-03-09   2017-03-19 2017-03-09 UTC--2017-03-19 UTC
#> 10 10       f     2017-03-10   2017-03-20 2017-03-10 UTC--2017-03-20 UTC

krlmlr commented Dec 12, 2017

Thanks! This doesn't seem to be related to #2432. As far as I can tell, the data frames differ only by the order of rows? Looks like inner_join() maintains the order of the lhs data frame, and merge() returns the data in the order of id.
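The row-order difference described here can be seen on its own, without any Interval column. This is a minimal sketch using made-up toy data (the names lhs, rhs, label and value are not from the report): inner_join() keeps the row order of its first argument, while merge() sorts by the key by default.

```r
library(dplyr)

# Hypothetical toy data, not taken from the report above
lhs <- data.frame(id = c(3, 1, 2), label = c("c", "a", "b"))
rhs <- data.frame(id = c(1, 2, 3), value = c(10, 20, 30))

joined <- inner_join(lhs, rhs, by = "id")  # keeps the order of lhs
merged <- merge(lhs, rhs, by = "id")       # sort = TRUE by default, so ordered by id

joined$id  # 3 1 2
merged$id  # 1 2 3

# Once both results are ordered by id, they hold the same rows
joined_sorted <- arrange(joined, id)
identical(joined_sorted$value, merged$value)
```

So a difference in row order alone is expected; what matters is whether the values within each row still belong together.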


greg-botwin commented Dec 12, 2017

Thanks for looking into this!

The data frames do differ by the order of the rows, but I think that is expected output.

What is specifically different, and what I believe to be incorrect, is that in df2_join_df1 the actual value of the interval object changes during the join.

For example, look at the row with id 10. In df1, before the join, it has a start date of 2017-03-10, an end date of 2017-03-20, and the matching interval 2017-03-10 UTC--2017-03-20 UTC. After the join, in df2_join_df1, the same row reports the interval 2017-03-06 UTC--2017-03-16 UTC. This is incorrect: the interval no longer agrees with interval_start and interval_end in its own row. For some reason the value of the Interval column changes during the join.
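One way to make this kind of mismatch explicit is to compare each interval's endpoints with the interval_start and interval_end columns in the same row. The sketch below assumes the column names from the reprex; the helper name mismatched_ids is made up for illustration, and int_start() and int_end() are lubridate's accessors for an Interval.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper: returns the ids of rows whose interval disagrees
# with the start/end date columns stored in the same row.
mismatched_ids <- function(df) {
  start_ok <- as.Date(int_start(df$ten_d_interval)) == df$interval_start
  end_ok <- as.Date(int_end(df$ten_d_interval)) == df$interval_end
  df$id[!(start_ok & end_ok)]
}

# Same construction as df1 in the reprex
df1 <- data.frame(id = seq(from = 1, to = 10, by = 1),
                  interval_start = seq.Date(from = ymd("2017-03-01"),
                                            to = ymd("2017-03-10"), by = 1)) %>%
  mutate(interval_end = interval_start + days(10)) %>%
  mutate(ten_d_interval = interval(start = interval_start, end = interval_end))

# Before any join the intervals were built from these columns,
# so no id should be reported here.
mismatched_ids(df1)
```

Applied to the joined output printed in the original report, a check like this would flag ids 10, 9, 7 and 6; id 8 lines up only because it sits at the same position in both data frames. What it reports on a given installation depends on the dplyr version in use.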

Hope this helps clarify! Greatly appreciative of your support.


krlmlr commented Dec 14, 2017

Thanks for walking me through, I somehow missed the obvious. So: the row order is different, which is by design (and shouldn't be relevant). The ten_d_interval column is wrong; this will be fixed with #2432.
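A possible workaround in the meantime, offered only as a sketch and not confirmed anywhere in this thread: join without the Interval column and build it after the join from interval_start and interval_end, which the report shows are carried over correctly. The variable name joined is made up for this example.

```r
library(dplyr)
library(lubridate)

# df1 as in the reprex, but without the Interval column for now
df1 <- data.frame(id = seq(from = 1, to = 10, by = 1),
                  interval_start = seq.Date(from = ymd("2017-03-01"),
                                            to = ymd("2017-03-10"), by = 1)) %>%
  mutate(interval_end = interval_start + days(10))

df2 <- data.frame(id = c(1, 2, 3, 4, 5, 10, 9, 8, 7, 6),
                  letters = letters[1:10])

# Join on plain columns first, then derive the interval row by row,
# so it is always computed from the dates in its own row.
joined <- inner_join(df2, df1, by = "id") %>%
  mutate(ten_d_interval = interval(start = interval_start, end = interval_end))

joined
```

The trade-off is that the interval has to be recomputable from other columns; if it is not, merge() from the original report remains an option.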

@greg-botwin

great! thanks for looking into this.

krlmlr closed this as completed Dec 21, 2017
lock bot locked as resolved and limited conversation to collaborators Jun 19, 2018