Problem with Date format #194

Closed
manuelreif opened this issue Jan 20, 2014 · 14 comments

@manuelreif manuelreif commented Jan 20, 2014

Hi!
Thank you for developing the dplyr package!
I used the package for the very first time and ran into a problem with date vectors. I browsed the existing issues but haven't found a solution.
Here is what happens when I use summarise() to get the 'earliest' observation per group from my data.frame.

library(dplyr)
library(lubridate)  # provides ymd()
# seed
set.seed(111)
# create variables
ID     <- rep(letters[1:4], each = 5)
n      <- length(ID)
# random dates assembled as "YYYYMMDD" strings, then parsed
date   <- ymd(paste0(
  sample(1960:2014, n, replace = TRUE),
  sample(sprintf("%02d", 1:12), n, replace = TRUE),
  sample(sprintf("%02d", 1:25), n, replace = TRUE)
))
number <- rnorm(n)
# create data.frame
d <- data.frame(ID, date, number)
# use dplyr
d_dplyr <- tbl_df(d)
d_dplyr %.% group_by(ID) %.% summarise(mindate = min(date))


 ID    mindate
1  a  327283200
2  b -296352000
3  c -238723200
4  d  -27648000
  • I expected the function to return the date.
  • Additional question: is it possible to get a result with all columns of the original data.frame? I tried to use select() etc., but I never got back the 'big' and summarised data.frame.

Thank you!
Manuel

@hadley hadley commented Jan 20, 2014

Looks like we need to add a Date method to our internal implementation of min().

If you want all the columns of the original dataset, use mutate() instead of summarise()

@ghost ghost assigned romainfrancois Jan 20, 2014

@manuelreif manuelreif commented Jan 20, 2014

Thank you for the instant reply!

When I use mutate(), the result is a data.frame with all rows. What I had in mind is more of a 'compressed' version of the data.frame: only the earliest observation, but with all variables.

d_dplyr %.% group_by(ID) %.% mutate(mindate=min(date))

  ID       date      number    mindate
1   a 1992-06-17 -3.11321730  327283200
2   a 1999-04-09 -0.94135740  327283200
3   a 1980-05-16  1.40025878  327283200
4   a 1988-05-24 -1.62047003  327283200
5   a 1980-12-15 -2.26599596  327283200
6   b 1983-04-10  1.16299359 -296352000
7   b 1960-08-11 -0.11615504 -296352000
8   b 1989-04-22  0.33425601 -296352000
9   b 1983-10-16 -0.62085811 -296352000
10  b 1965-08-20 -1.30984491 -296352000
11  c 1990-01-19 -1.17572604 -238723200
12  c 1992-07-15 -1.12121553 -238723200
13  c 1963-06-01 -1.36190448 -238723200
14  c 1962-06-09  0.48112458 -238723200
15  c 1968-05-25  0.74197163 -238723200
16  d 1984-09-14  0.02782463  -27648000
17  d 1969-02-15  0.33137971  -27648000
18  d 2013-10-12  0.64411413  -27648000
19  d 1977-08-03  2.48566156  -27648000
20  d 1993-10-21  1.95998171  -27648000

@hadley hadley commented Jan 20, 2014

Then maybe you want filter(date == min(date))?
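For readers following along, here is a minimal sketch of that suggestion on a small made-up data frame. It assumes a current dplyr release, where the pipe is `%>%` rather than the `%.%` operator used earlier in this thread; the data values are invented for illustration.

```r
library(dplyr)

# Small illustrative data set (values are made up, not from the thread).
d <- data.frame(
  ID     = rep(c("a", "b"), each = 3),
  date   = as.Date(c("1992-06-17", "1980-05-16", "1999-04-09",
                     "1983-04-10", "1960-08-11", "1989-04-22")),
  number = c(1.5, -0.3, 0.7, 2.1, -1.2, 0.4)
)

# Keep, for every ID, the row(s) whose date equals the group minimum.
# All original columns are retained, unlike with summarise().
earliest <- d %>%
  group_by(ID) %>%
  filter(date == min(date)) %>%
  ungroup()

print(earliest)
```

Note that if several rows in a group share the minimum date, all of them are kept.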

@manuelreif manuelreif commented Jan 20, 2014

That's exactly what I wanted!
Thank you!
Manuel

@romainfrancois romainfrancois commented Jan 20, 2014

The internal min and max now handle Date and POSIXct objects, so I get:

> d_dplyr %.% group_by(ID) %.% summarise(mindate=min(date), maxdate=max(date))
Source: local data frame [4 x 3]

  ID             mindate             maxdate
1  d 1969-02-15 01:00:00 2013-10-12 02:00:00
2  c 1962-06-09 01:00:00 1992-07-15 02:00:00
3  b 1960-08-11 01:00:00 1989-04-22 02:00:00
4  a 1980-05-16 02:00:00 1999-04-09 02:00:00

However, this is not handled for arbitrary functions:

min_ <- min
max_ <- max
> d_dplyr %.% group_by(ID) %.% summarise(mindate=min_(date), maxdate=max_(date))
Source: local data frame [4 x 3]

  ID    mindate    maxdate
1  d  -27648000 1381536000
2  c -238723200  711158400
3  b -296352000  609206400
4  a  327283200  923616000

@hadley, should we, e.g., promote the object to have the class of, say, the first result?

@hadley hadley commented Jan 20, 2014

@romainfrancois hmmm, can you expand a bit more on why the second case doesn't work already? Why are we losing the class information returned by min()? (Or are we passing min a vector sans attributes?)

@romainfrancois romainfrancois commented Jan 20, 2014

It is just an omission. The subsetting happens in the GroupedSubset class: https://github.com/hadley/dplyr/blob/master/inst/include/dplyr/Result/GroupedSubset.h

which then uses ShrinkableVector:
https://github.com/hadley/dplyr/blob/99cdff96d576596529f2cecef1308a275629c35c/inst/include/tools/ShrinkableVector.h

which does not keep attributes. That is one side of the problem, so the min_ function only sees an integer vector.

The other side of the problem is the fusion of all results for the chunks, but I think this should be taken care of.

If we go there, should only the class be passed to the subset, or should other attributes participate? It would be enough to (shallow) copy all attributes, but I'm not sure this is what we would want.
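The effect described in this comment can be reproduced in plain R, without dplyr. The sketch below is only an illustration of the idea, not dplyr's internal code: it strips the attributes from a POSIXct vector, applies an aliased `min_`, and then copies the attributes back. The sample dates are taken from group "a" in the output above.

```r
# A POSIXct vector is a numeric vector plus "class" and "tzone" attributes.
x <- as.POSIXct(c("1992-06-17", "1980-05-16", "1999-04-09"), tz = "UTC")

# Drop every attribute, mimicking a subset that does not carry them along.
bare <- x
attributes(bare) <- NULL

# An aliased function only ever sees the bare numbers.
min_ <- min
raw_min <- min_(bare)
print(raw_min)  # 327283200, the value reported for group "a"

# Shallow-copying the original attributes restores the date-time.
restored <- raw_min
attributes(restored) <- attributes(x)
print(format(restored, "%Y-%m-%d"))  # "1980-05-16"
```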

@hadley hadley commented Jan 20, 2014

A shallow copy of attributes would work for dates, times and factors, but not necessarily in general. The problem is that S3 classes don't provide quite enough information about what attributes mean. Perhaps we could maintain a registry of classes and of when their attributes should be preserved across vector operations.

Or for now, we could just special case POSIXct, Date and factor, and leave worrying about other classes until the future.
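A small base R sketch of why a blanket attribute copy is fine for some classes but not in general. This is an illustration under simple assumptions (a factor and a named numeric vector chosen for the example), not a description of what dplyr does internally.

```r
# Case 1: factor. "levels" and "class" describe the codes independently of
# the vector's length, so copying them onto a single code works.
f <- factor(c("low", "high", "low"), levels = c("low", "high"))
first_code <- as.integer(f)[1]
attributes(first_code) <- attributes(f)
print(as.character(first_code))  # "low"

# Case 2: names. This attribute is tied to the vector's length, so copying
# it from a length-3 vector onto a length-1 result is rejected by R.
named <- c(a = 1, b = 2, c = 3)
smallest <- min(named)
copy_ok <- tryCatch({
  attributes(smallest) <- attributes(named)
  TRUE
}, error = function(e) FALSE)
print(copy_ok)  # FALSE
```

Special-casing POSIXct, Date and factor sidesteps the second case, at the cost of ignoring other classes.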

romainfrancois added a commit that referenced this issue Jan 20, 2014

@romainfrancois romainfrancois commented Jan 20, 2014

Ok, so for now I'm propagating all attributes and also setting the object bit if it is set in the original variable.

@romainfrancois romainfrancois commented Jan 20, 2014

Reopening until I add some unit tests.

@hadley hadley commented Jan 20, 2014

Can you please also add a note to NEWS.md (just added)?

@jmb01 jmb01 commented Dec 4, 2014

There still seems to be an issue when using min(na.rm = TRUE). Is there a way around this besides eliminating observations with missing date values before trying to find the ID-specific minima?

> test <- data.frame(
+   id = rep(1:5, each = 3),
+   date = as.Date('2014-01-01') + runif(15, 0, 3650)
+   )
> 
> test %>%
+   group_by(id) %>%
+   mutate(mindate = min(date),
+          mindate2 = min(date, na.rm = TRUE))
Source: local data frame [15 x 4]
Groups: id

   id       date    mindate       mindate2
1   1 2015-03-05 2015-03-05  1672297-12-09
2   1 2016-11-14 2015-03-05  1672297-12-09
3   1 2018-06-19 2015-03-05  1672297-12-09
4   2 2016-01-15 2016-01-15 -4283559-04-20
5   2 2018-08-18 2016-01-15 -4283559-04-20
6   2 2022-04-08 2016-01-15 -4283559-04-20
7   3 2015-02-23 2015-01-17   375902-01-15
8   3 2016-06-18 2015-01-17   375902-01-15
9   3 2015-01-17 2015-01-17   375902-01-15
10  4 2021-09-28 2014-02-08  2979338-06-03
11  4 2014-02-08 2014-02-08  2979338-06-03
12  4 2022-09-04 2014-02-08  2979338-06-03
13  5 2014-01-22 2014-01-22 -1717593-01-18
14  5 2023-01-11 2014-01-22 -1717593-01-18
15  5 2014-06-16 2014-01-22 -1717593-01-18
> 
> test2 <- test
>   test2[c(1,14), 'date'] <- NA
> 
> test2 %>%
+ group_by(id) %>%
+   mutate(mindate = min(date),
+          mindate2 = min(date, na.rm = TRUE))
Source: local data frame [15 x 4]
Groups: id

   id       date    mindate       mindate2
1   1       <NA>       <NA>     1975-05-09
2   1 2016-11-14       <NA>     1975-05-09
3   1 2018-06-19       <NA>     1975-05-09
4   2 2016-01-15 2016-01-15 -4283559-04-20
5   2 2018-08-18 2016-01-15 -4283559-04-20
6   2 2022-04-08 2016-01-15 -4283559-04-20
7   3 2015-02-23 2015-01-17   375902-01-15
8   3 2016-06-18 2015-01-17   375902-01-15
9   3 2015-01-17 2015-01-17   375902-01-15
10  4 2021-09-28 2014-02-08  2979338-06-03
11  4 2014-02-08 2014-02-08  2979338-06-03
12  4 2022-09-04 2014-02-08  2979338-06-03
13  5 2014-01-22       <NA> -1717593-01-18
14  5       <NA>       <NA> -1717593-01-18
15  5 2014-06-16       <NA> -1717593-01-18

@hadley hadley commented Dec 4, 2014

@jmb01 that works for me with the dev version

@jmb01 jmb01 commented Dec 4, 2014

Thanks, problem solved after install_github('hadley/dplyr').
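As a quick way to check an installation for the behaviour reported above, the following sketch computes the per-group minimum with and without na.rm = TRUE on a tiny hand-made data frame. It assumes a dplyr version that includes the fix; the data are invented and much smaller than the example in the report.

```r
library(dplyr)

# Two groups, each with one missing date (offsets in days are made up).
test2 <- data.frame(
  id   = rep(1:2, each = 3),
  date = as.Date("2014-01-01") + c(NA, 10, 5, 30, NA, 20)
)

res <- test2 %>%
  group_by(id) %>%
  mutate(mindate  = min(date),                 # NA when the group has an NA
         mindate2 = min(date, na.rm = TRUE)) %>%  # earliest non-missing date
  ungroup()

print(res)
```

With a fixed version, mindate2 should be a sensible Date within the range of the input rather than a value in a far-off year.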

romainfrancois added a commit that referenced this issue Mar 19, 2016
@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018