Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

arrange converts class Period to num #266

Closed
bartev opened this issue Feb 15, 2014 · 19 comments
Closed

arrange converts class Period to num #266

bartev opened this issue Feb 15, 2014 · 19 comments
Assignees
Labels
Milestone

Comments

@bartev
Copy link

@bartev bartev commented Feb 15, 2014

When using arrange, the Period type data gets converted to type num.
Here's an example of some code and output comparing dplyr and plyr.

> times <- c("17:16", "20:24", "17:23", "17:23", "21:38")
> names <- c('Larry', 'Curly', 'Moe', 'Shemp', 'Curly')
> df <- data.frame(names, times)
> str(df)
'data.frame':   5 obs. of  2 variables:
 $ names: chr  "Larry" "Curly" "Moe" "Shemp" ...
 $ times: chr  "17:16" "20:24" "17:23" "17:23" ...
> df$times <- hm(df$times)
> str(df)
'data.frame':   5 obs. of  2 variables:
 $ names: chr  "Larry" "Curly" "Moe" "Shemp" ...
 $ times:Formal class 'Period' [package "lubridate"] with 6 slots
  .. ..@ .Data : num  0 0 0 0 0
  .. ..@ year  : num  0 0 0 0 0
  .. ..@ month : num  0 0 0 0 0
  .. ..@ day   : num  0 0 0 0 0
  .. ..@ hour  : num  17 20 17 17 21
  .. ..@ minute: num  16 24 23 23 38
> df
  names      times
1 Larry 17H 16M 0S
2 Curly 20H 24M 0S
3   Moe 17H 23M 0S
4 Shemp 17H 23M 0S
5 Curly 21H 38M 0S
> res.plyr <- plyr::arrange(df, times)
> res.dplyr <- dplyr::arrange(df, times)
> str(res.plyr)
'data.frame':   5 obs. of  2 variables:
 $ names: chr  "Larry" "Moe" "Shemp" "Curly" ...
 $ times:Formal class 'Period' [package "lubridate"] with 6 slots
  .. ..@ .Data : num  0 0 0 0 0
  .. ..@ year  : num  0 0 0 0 0
  .. ..@ month : num  0 0 0 0 0
  .. ..@ day   : num  0 0 0 0 0
  .. ..@ hour  : num  17 17 17 20 21
  .. ..@ minute: num  16 23 23 24 38
> str(res.dplyr)
'data.frame':   5 obs. of  2 variables:
 $ names: chr  "Larry" "Curly" "Moe" "Shemp" ...
 $ times: num  0 0 0 0 0
> res.plyr
  names      times
1 Larry 17H 16M 0S
2   Moe 17H 23M 0S
3 Shemp 17H 23M 0S
4 Curly 20H 24M 0S
5 Curly 21H 38M 0S
> res.dplyr
  names times
1 Larry     0
2 Curly     0
3   Moe     0
4 Shemp     0
5 Curly     0
@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 15, 2014

I suppose it only uses the .Data slot and just discards the rest. Not sure what we can do here. We could perhaps error on S4 objects.

@hadley hadley added this to the v0.1.2 milestone Feb 17, 2014
@hadley
Copy link
Member

@hadley hadley commented Feb 17, 2014

@romainfrancois for now I think we should have a whitelist of column types (logical, integer, double, character, factor, ordered factor, Date, POSIXct). That would also resolve #266

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 17, 2014

The problem here is that df$times is a numeric vector :

> typeof(df$times)
[1] "double"
> .Internal(inspect(df$times))
@103a01dc0 14 REALSXP g0c4 [OBJ,NAM(2),S4,gp=0x10,ATT] (len=5, tl=0) 0,0,0,0,0
ATTRIB:
  @103304118 02 LISTSXP g0c0 []
    TAG: @103dc5578 01 SYMSXP g1c0 [MARK] "year"
    @103d51520 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 0,0,0,0,0
    TAG: @103e0c800 01 SYMSXP g1c0 [MARK] "month"
    @103d514c8 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 0,0,0,0,0
    TAG: @103e29f20 01 SYMSXP g1c0 [MARK] "day"
    @103d51418 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 0,0,0,0,0
    TAG: @103e21d40 01 SYMSXP g1c0 [MARK] "hour"
    @103d51310 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 17,20,17,17,21
    TAG: @103e14c60 01 SYMSXP g1c0 [MARK] "minute"
    @103d51368 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 16,24,23,23,38
    TAG: @100823348 01 SYMSXP g1c0 [MARK,LCK,gp=0x4000] "class" (has value)
    @103cb2718 16 STRSXP g0c1 [NAM(2),ATT] (len=1, tl=0)
      @103d96cf8 09 CHARSXP g1c1 [MARK,gp=0x61,ATT] [ASCII] [cached] "Period"
    ATTRIB:
      @1033042a0 02 LISTSXP g0c0 []
    TAG: @10082cf40 01 SYMSXP g1c0 [MARK,NAM(2)] "package"
    @103cb2748 16 STRSXP g0c1 [NAM(2)] (len=1, tl=0)
      @1047d3d80 09 CHARSXP g1c2 [MARK,gp=0x61] [ASCII] [cached] "lubridate"

I could black list things that have the S4 bit set so that we'd have the list above expect when the object is an S4 object.

This is an unfortunate consequence of the design of the Period class, i.e. the fact that it contains numeric https://github.com/hadley/lubridate/blob/master/R/periods.r#L192

@hadley
Copy link
Member

@hadley hadley commented Feb 17, 2014

This is an unfortunate design of S3/S4 in general - I think you need to do something like (is.vector(x) && (is.logical(x) || is.numeric(x) || is.character(x))) || inherits(x, "Date") || inherits(x, "factor") || inherits(x, "POSIXct")

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

This might be too restrictive. Here is a snippet from do_isvector: https://github.com/wch/r-source/blob/013f4896e0946d638e96fe0a736b53a5d8e6063f/src/main/coerce.c#L1962

    /* We allow a "names" attribute on any vector. */
    if (LOGICAL(ans)[0] && ATTRIB(CAR(args)) != R_NilValue) {
    a = ATTRIB(CAR(args));
    while(a != R_NilValue) {
        if (TAG(a) != R_NamesSymbol) {
        LOGICAL(ans)[0] = 0;
        break;
        }
        a = CDR(a);
    }
    }

In essence it means that only names attributes are allowed. I'll come up with something we can reuse throughout the rest of the code.

romainfrancois added a commit that referenced this issue Feb 19, 2014
@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

Here is a start. https://github.com/hadley/dplyr/blob/master/inst/include/dplyr/white_list.h
Similar to what you had but with less power to is.vector.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

Now failing like this:

> arrange( df, times )
Erreur : column 'times' has unsupported type

I need to check other places where it makes sense to use white_list.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

Hmm. Are list (VECSXP) white listed ? The problem is that this will let pass POSIXlt which we don't yet want to support. But from #239 it seems we need some support for list.

Perhaps I should black list POSIXlt and check the length of the list otherwise, only allowing lists of the appropriate length ?

@hadley
Copy link
Member

@hadley hadley commented Feb 19, 2014

I don't think you need to blacklist POSIXlt, just check that the list length is ok.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

Hmm. POSIXlt are always of length 9. From ?strptime, here is what a POSIXlt of 5 times looks like. A list of 9 elements, each of which have 5 elements.

> dates <- c("02/27/92", "02/27/92", "01/14/92", "02/28/92", "02/01/92")
> times <- c("23:03:20", "22:29:56", "01:03:30", "18:21:03", "16:56:26")
> x <- paste(dates, times)
> dates <- strptime(x, "%m/%d/%y %H:%M:%S")
> dates
[1] "1992-02-27 23:03:20" "1992-02-27 22:29:56" "1992-01-14 01:03:30"
[4] "1992-02-28 18:21:03" "1992-02-01 16:56:26"
> .Internal( inspect( dates ) )
@101a19aa8 19 VECSXP g0c6 [OBJ,NAM(2),ATT] (len=9, tl=0)
  @1072cfa70 14 REALSXP g0c4 [NAM(2)] (len=5, tl=0) 20,56,30,3,26
  @1095ea8d0 13 INTSXP g0c3 [] (len=5, tl=0) 3,29,3,21,56
  @1095ea9f0 13 INTSXP g0c3 [] (len=5, tl=0) 23,22,1,18,16
  @1095eab10 13 INTSXP g0c3 [] (len=5, tl=0) 27,27,14,28,1
  @1095e3ce0 13 INTSXP g0c3 [] (len=5, tl=0) 1,1,0,1,1
  ...
ATTRIB:
  @108a76ca0 02 LISTSXP g0c0 []
    TAG: @100823078 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "names" (has value)
    @101c8a5f8 16 STRSXP g0c6 [NAM(2)] (len=9, tl=0)
      @108d670c8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "sec"
      @10083f5a8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "min"
      @108384f28 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "hour"
      @108384fe8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "mday"
      @109017bf8 09 CHARSXP g0c1 [gp=0x60] [ASCII] [cached] "mon"
      ...
    TAG: @100823548 01 SYMSXP g1c0 [MARK,NAM(2),LCK,gp=0x4000] "class" (has value)
    @10974e518 16 STRSXP g0c2 [NAM(2)] (len=2, tl=0)
      @1049fc4c8 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "POSIXlt"
      @101838528 09 CHARSXP g1c1 [MARK,gp=0x61] [ASCII] [cached] "POSIXt"

So we'd let a POSIXlt pass if there were 9 rows in the data frame. However, what we get internally if we do x[0] is not the first "time", but a vector of all "seconds". So for example arrange'ing a POSIXlt that by "luck" happened to have 9 times would be a catastrophy. filter would be in danger too. Actually without specific handling of POSIXlt, pretty much all verbs are in danger from nonsense results.

POSIXlt are however not too hard to deal with correctly, i.e. to filter them, we just need to filter each of the 9 components, same goes for arrange, hashing would just be a combination of hashing the 9 components, ...

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

Hmm. see Rcpp11/Rcpp11#112 (and RcppCore/Rcpp#120)

nrow only uses the first column, so we'd have to be unlucky that it is a POSIXlt. And weird things can happen:

> df <- data.frame( times = 1:5 )
> df$times <- as.POSIXlt( seq.Date( Sys.Date(), length.out = 5, by = "day" ) )
> cppFunction( "int nrow_df( DataFrame df){ return df.nrows(); }")
> nrow_df( df )
[1] 9
> df$x <- 1:5
> filter( df, x < 3 )
Erreur : incorrect length (5), expecting: 9

@hadley
Copy link
Member

@hadley hadley commented Feb 19, 2014

Ok, you've convinced me - let's blacklist POSIXlt.

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

Done. Closing the issue now as the problem is fixed.
One question though, I need lubridate to reproduce the test reported here initially, so I've not put in a test for now. Should I do a if( require("lubridate") )

@hadley
Copy link
Member

@hadley hadley commented Feb 19, 2014

You can create with e.g. as.POSIXlt(Sys.time() + 1:10)

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

Sure, I meant creating Period objects as in @bartev code df$times <- hm(df$times)

@hadley
Copy link
Member

@hadley hadley commented Feb 19, 2014

Oh sure - just create a minimal class in the test:

Period <- setClass("Period", contains = "numeric")
p <- Period(c(1, 2, 3))

@romainfrancois
Copy link
Member

@romainfrancois romainfrancois commented Feb 19, 2014

Ah. yes. Thanks. Done now.

@bartev
Copy link
Author

@bartev bartev commented Feb 21, 2014

@romainfrancois , If you blacklist POSIXlt, does that mean that you can't use POSIXlt objects in dplyr?

@hadley
Copy link
Member

@hadley hadley commented Feb 21, 2014

@bartev Correct, because you should never put a POSIXlt object in a data frame. Basically they're uniformally inferior to POXIct.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Projects
None yet
Linked pull requests

Successfully merging a pull request may close this issue.

None yet
3 participants