summarize dropping attributes of columns #1237

Closed
renkun-ken opened this issue Jun 28, 2015 · 13 comments

@renkun-ken renkun-ken commented Jun 28, 2015

I'm working on a new package, formattable, that adds formatting to vectors and data frames for friendlier printing. It uses attributes to store metadata such as formatting rules.

I tested it with data.table, and everything works in the latest dev version now that Rdatatable/data.table#1160 is fixed. Some previous versions of dplyr did not seem to support atomic vectors with custom classes (e.g. a formattable numeric) as table columns. Such columns work now, but summarize does not seem to preserve their attributes.

> library(dplyr)
> library(formattable)
> df <- data.frame(id = 1:10, ret = percent(rnorm(10, 0.1, 0.1)))
> df
   id    ret
1   1 22.82%
2   2  6.74%
3   3 16.15%
4   4 -5.82%
5   5  2.27%
6   6 12.12%
7   7 -5.73%
8   8 16.96%
9   9 -2.90%
10 10  9.13%
> df %>% summarize(ret = mean(ret))
         ret
1 0.07174314

By contrast, filter, arrange, and group_by do not drop the attributes of ret.

> df %>% filter(ret >= mean(ret))
  id    ret
1  1 22.82%
2  3 16.15%
3  6 12.12%
4  8 16.96%
5 10  9.13%
> df %>% arrange(ret)
   id    ret
1   4 -5.82%
2   7 -5.73%
3   9 -2.90%
4   5  2.27%
5   2  6.74%
6  10  9.13%
7   6 12.12%
8   3 16.15%
9   8 16.96%
10  1 22.82%
> df %>% group_by(group = id %% 3)
Source: local data frame [10 x 3]
Groups: group

   id    ret group
1   1 22.82%     1
2   2  6.74%     2
3   3 16.15%     0
4   4 -5.82%     1
5   5  2.27%     2
6   6 12.12%     0
7   7 -5.73%     1
8   8 16.96%     2
9   9 -2.90%     0
10 10  9.13%     1

My session info:

R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_United States.1252  LC_CTYPE=English_United States.1252   
[3] LC_MONETARY=English_United States.1252 LC_NUMERIC=C                          
[5] LC_TIME=English_United States.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.9.5     formattable_0.0.16.1 dplyr_0.4.2         

loaded via a namespace (and not attached):
 [1] Rcpp_0.11.6     digest_0.6.8    assertthat_0.1  mime_0.3        chron_2.3-47   
 [6] R6_2.0.1        xtable_1.7-4    DBI_0.3.1       magrittr_1.5    lazyeval_0.1.10
[11] tools_3.2.1     htmlwidgets_0.5 markdown_0.7.7  shiny_0.12.1    httpuv_1.3.2   
[16] parallel_3.2.1  htmltools_0.2.6 knitr_1.10.5   
Member

@hadley hadley commented Jun 28, 2015

Isn't it more that mean doesn't preserve attributes?

Author

@renkun-ken renkun-ken commented Jun 28, 2015

I tested that mean keeps the attributes, and the same summary works with data.table.

Author

@renkun-ken renkun-ken commented Jun 28, 2015

It seems that summarize drops the class and the attributes of the column vector, even though the class implements mean and other statistical methods that keep the attributes.

Member

@hadley hadley commented Jun 28, 2015

I just tried mean on a numeric vector with an attribute and it dropped it...

Member

@hadley hadley commented Jun 28, 2015

> x <- structure(1:10, blah = "x")
> attributes(x)
$blah
[1] "x"

> attributes(mean(x))
NULL
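
Both observations can be reproduced in base R alone. The following is a minimal sketch, assuming a hypothetical S3 class `pct` with a `digits` attribute (a stand-in for illustration, not formattable's actual implementation): a bare attribute is lost by the default `mean`, while a class that supplies its own `mean` method can restore its metadata.

```r
# A bare attribute with no class: mean() falls through to mean.default,
# which returns a plain numeric and drops the attribute.
x <- structure(c(0.1, 0.3), blah = "x")
plain_mean <- mean(x)
is.null(attributes(plain_mean))

# Hypothetical class `pct` (illustration only): the constructor stores
# formatting metadata in an attribute next to the class.
pct <- function(x, digits = 2L) {
  structure(x, class = "pct", digits = digits)
}

# The class's own mean method computes on the raw values and then
# re-attaches the class and the metadata.
mean.pct <- function(x, ...) {
  pct(mean(unclass(x), ...), digits = attr(x, "digits"))
}

y <- pct(c(0.1, 0.3))
pct_mean <- mean(y)   # dispatches to mean.pct
class(pct_mean)
attr(pct_mean, "digits")
```

Any evaluator that bypasses S3 dispatch and computes the mean on the raw values would behave like the first case even for `y`.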

Author

@renkun-ken renkun-ken commented Jun 28, 2015

Sorry, the real problem is that summarize seems to drop the class and attributes of a formattable numeric column, even though the formattable class implements a mean.formattable method that preserves them. Here is a working example:

> library(formattable)
> x <- percent(rnorm(10, 0.1, 0.1))
> x
 [1] 9.18%  16.74% 19.01% 12.34% 32.76% 20.65% 10.00% 25.72% 4.00%  35.25%
> mean(x)
[1] 18.57%
> library(data.table)
data.table 1.9.5  For help type: ?data.table
*** NB: by=.EACHI is now explicit. See README to restore previous behaviour.
> dt <- data.table(id = 1:10, x = x)
> dt
    id      x
 1:  1  9.18%
 2:  2 16.74%
 3:  3 19.01%
 4:  4 12.34%
 5:  5 32.76%
 6:  6 20.65%
 7:  7 10.00%
 8:  8 25.72%
 9:  9  4.00%
10: 10 35.25%
> dt[, .(n = .N, mx = mean(x)), by = .(g = id %% 2)]
   g n     mx
1: 1 5 14.99%
2: 0 5 22.14%

@romainfrancois romainfrancois self-assigned this Jul 7, 2015
Member

@romainfrancois romainfrancois commented Jul 7, 2015

Hmm. So we'll have to check whether a mean method exists for a foo object and use that instead of our internal method.

Or should we fall back on R evaluation as soon as we are dealing with an object (for some definition of being an object)?

I'll look into R internals, probably in DispatchGroup here: https://github.com/wch/r-source/blob/48d522a12b28b532bf38d236dfdd5672b2da9257/src/main/eval.c#L2814

Member

@romainfrancois romainfrancois commented Jul 7, 2015

Oh well, the dispatch code is a mess.

Member

@hadley hadley commented Jul 7, 2015

@romainfrancois I think there's an easy check - if the vector has a class attribute that we don't recognise (or is an S4 object), we should avoid hybrid evaluation. I think maybe there's a macro (OBJECT()?) that does the check efficiently.
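
At the R level, `is.object()` reports whether a value carries a class attribute, so the suggested check can be sketched in plain R. This is only an illustration under stated assumptions: `internal_mean`, `mean_with_fallback`, and the class `tagged` are hypothetical names, the sketch treats every classed vector as unrecognised, and it is not dplyr's C++ implementation.

```r
# Stand-in for an optimized internal handler that only sees raw values.
internal_mean <- function(x) sum(x) / length(x)

# Use the fast path only for plain vectors. Anything with a class attribute
# is handed to regular R evaluation so that method dispatch can run.
mean_with_fallback <- function(x) {
  if (is.object(x)) mean(x) else internal_mean(x)
}

# Toy class whose mean method keeps the class on the result.
mean.tagged <- function(x, ...) {
  structure(mean(unclass(x), ...), class = "tagged")
}

plain_result  <- mean_with_fallback(1:10)
tagged_result <- mean_with_fallback(structure(1:10, class = "tagged"))

class(plain_result)    # plain numeric from the fast path
class(tagged_result)   # class preserved by mean.tagged
```

A real check would probably also whitelist classes the fast path already understands, which this sketch leaves out.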

Member

@romainfrancois romainfrancois commented Jul 8, 2015

Should be fine now, e.g.:

> mean.foo <- function(x) 42
> df <- data_frame(x = structure(1:10, class = "foo"))
> summarise(df, m = mean(x) )
Source: local data frame [1 x 1]

   m
1 42

@renkun-ken Can you please check with the original example? I have not installed formattable.

So far, this applies only to the hybrid candidates mean, sd, var, sum, min, and max.

Should I also handle the others from https://github.com/hadley/dplyr/blob/master/src/dplyr.cpp#L851, in case e.g. some package makes ntile or lead or ... generic for some reason?

Member

@hadley hadley commented Jul 8, 2015

@romainfrancois no, I don't think so - we should only care if the existing function is generic.

This comes back to Thomas Lumley's question at your useR talk: if someone does mean <- sum, we'll give the incorrect result. (And I think that's fine.)
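
The `mean <- sum` caveat can be made concrete with a toy evaluator. This is a rough sketch, not dplyr's code: `toy_hybrid` and `user_env` are hypothetical names, and the evaluator only handles calls of the form `f(column)`. Because the fast path is chosen by the function's name, it ignores whatever the user has bound to that name.

```r
# Toy evaluator: recognises a call by the *name* of the function and, for
# plain columns, computes the mean internally without looking `mean` up.
toy_hybrid <- function(call, data, env) {
  fn_name <- as.character(call[[1]])
  column  <- data[[as.character(call[[2]])]]
  if (identical(fn_name, "mean") && !is.object(column)) {
    return(sum(column) / length(column))   # internal fast path
  }
  eval(call, data, env)                     # standard R evaluation
}

df <- data.frame(x = 1:4)

# An environment in which the user has rebound `mean` to `sum`.
user_env <- new.env()
assign("mean", sum, envir = user_env)

hybrid_result   <- toy_hybrid(quote(mean(x)), df, user_env)  # arithmetic mean
standard_result <- eval(quote(mean(x)), df, user_env)        # user's `mean`, i.e. sum

hybrid_result
standard_result
```

The two results differ, which is the discrepancy accepted in the comment above.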

Member

@romainfrancois romainfrancois commented Jul 8, 2015

Yes, I was thinking about Thomas's question while doing this actually. I think that's fine too.

I'll close this now. @renkun-ken, please reopen if this somehow does not work for formattable.

Author

@renkun-ken renkun-ken commented Jul 8, 2015

Thanks! Now it works perfectly with formattable.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018