summarise problems when using just-created summary columns #259

Closed
andrewblim opened this issue Feb 12, 2014 · 3 comments

@andrewblim

I'm not sure whether this is a bug per se, or whether we simply shouldn't reference just-created summary columns when defining new ones (though in that case an error would be nicer). I get slightly different values each time I run summarise, as shown below. It works properly when I avoid reusing the just-created columns. So far I've only seen this when more than one summary column reuses a previously created summary column, which is why all the examples below include both diff1 and diff2.

I'm running dplyr 0.1.1 and R 3.0.2 on OS X Mavericks.

> require('dplyr')
> df <- tbl_df(data.frame(id=c(1,1,2,2,3,3), a=1:6))
> df %.% group_by(id) %.% summarise(biggest=max(a), smallest=min(a), diff1=biggest-smallest, diff2=smallest-biggest)
Source: local data frame [3 x 5]

  id biggest smallest diff1 diff2
1  3       6        5     1     0
2  2       4        3     1     0
3  1       2        1     1     0
> # produces some randomly different values when rerun, not always the same ones, and sometimes segfaulted on big tbl_dfs I was working with
> df %.% group_by(id) %.% summarise(biggest=max(a), smallest=min(a), diff1=biggest-smallest, diff2=smallest-biggest) 
Source: local data frame [3 x 5]

  id biggest smallest diff1 diff2
1  3       6        5     1 32643
2  2       4        3     1     0
3  1       2        1     1     0
> df %.% group_by(id) %.% summarise(biggest=max(a), smallest=min(a), diff1=biggest-smallest, diff2=smallest-biggest)
Source: local data frame [3 x 5]

  id biggest smallest diff1  diff2
1  3       6        5     1 -32641
2  2       4        3     1      0
3  1       2        1     1      0
> # but this seems to work consistently
> df %.% group_by(id) %.% summarise(biggest=max(a), smallest=min(a), diff1=max(a)-min(a), diff2=min(a)-max(a))  # seems to work
Source: local data frame [3 x 5]

  id biggest smallest diff1 diff2
1  3       6        5     1    -1
2  2       4        3     1    -1
3  1       2        1     1    -1
@hadley

hadley commented Feb 12, 2014

Reproducible for me with the latest dev version.

@romainfrancois

Thanks. That was a tricky one. The C++ class I used to handle newly created variables (SummarisedSubsetTemplate) used a field to keep track of which group the indices referred to, on the assumption that groups were handled in sequence.

When computing diff1, it assumed it was processing groups 0, 1 and 2; when computing diff2, it kept counting and assumed it was processing groups 3, 4 and 5.

I augmented the SlicingIndex class so that it carries both the actual indices and the index of the group.
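
For readers who want to see the pattern in isolation, here is a minimal sketch of the kind of bug described above. It is not dplyr's actual code: GroupSlice, CountingLookup, IndexedLookup, positions_for_column and example_groups are hypothetical names made up for illustration. It only records which position of a summary column each group would read, so nothing is ever read out of bounds.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a slice of rows that also knows its own group.
struct GroupSlice {
    std::vector<int> rows;  // row indices belonging to this group
    std::size_t group;      // 0-based position of the group
};

// Buggy pattern (assumed for illustration): the position comes from a
// private counter that advances on every call and is never reset.
class CountingLookup {
    std::size_t next_ = 0;
public:
    std::size_t position(const GroupSlice&) { return next_++; }
};

// Fixed pattern (assumed for illustration): the position is taken from
// the slice itself, so it does not depend on call order.
class IndexedLookup {
public:
    std::size_t position(const GroupSlice& slice) const { return slice.group; }
};

// One pass over all groups, as if evaluating one derived column.
// Returns the summary-column position each group would read.
template <class Lookup>
std::vector<std::size_t> positions_for_column(
        Lookup& lookup, const std::vector<GroupSlice>& groups) {
    std::vector<std::size_t> out;
    for (const GroupSlice& g : groups) {
        out.push_back(lookup.position(g));
    }
    return out;
}

// Three groups of two rows each, mirroring the shape of the example data.
std::vector<GroupSlice> example_groups() {
    return {
        GroupSlice{{0, 1}, 0},
        GroupSlice{{2, 3}, 1},
        GroupSlice{{4, 5}, 2},
    };
}
```

Calling positions_for_column twice with the same CountingLookup models evaluating diff1 and then diff2: the first pass yields positions 0, 1, 2 and the second yields 3, 4, 5, which lie past the end of a three-element summary column. With IndexedLookup both passes yield 0, 1, 2.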

@andrewblim

Thank you! Impressed by the turnaround.

lock bot locked as resolved and limited conversation to collaborators on Jun 10, 2018