Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong results in "summarise" when characters are used as variable identifiers #567

Closed
patrickroocks opened this issue Aug 30, 2014 · 7 comments
Assignees
Labels
bug an unexpected problem or unintended behavior
Milestone

Comments

@patrickroocks
Copy link

The following code

summarise(mtcars, n_distinct("mpg"))

doesn't rise any errors or warning but returns (randomly) wrong results (but approximately the same magnitude) between 16 and 24, whereas the correct use of the variable.

summarise(mtcars, n_distinct(mpg))

returns 25.

Indicated by the stackoverflow question "Randomness in dplyr" a similar behavior occurs on larger datasets too.

@hadley hadley added the bug label Sep 1, 2014
@hadley hadley added this to the 0.3 milestone Sep 1, 2014
@romainfrancois
Copy link
Member

The problem is there somewhere:https://github.com/hadley/dplyr/blob/424b304ae235d9ff33b6fbcd7860e8f94ec7e9f2/inst/include/dplyr/Result/Count_Distinct.h

@hadley : Should I make n_distinct("mpg" ):

  • illegal
  • return 1

By the same token, would we allow expressions inside of n_distinct, e.g. n_distinct(mpg*2)

@hadley
Copy link
Member

hadley commented Sep 10, 2014

Ideally n_distinct("mpg") would return 1 and expressions would also work.

@romainfrancois
Copy link
Member

Sure. Just wanted to check first, forbid stuff is easier to do.

@romainfrancois
Copy link
Member

Ah. What about this:

z <- 1:2
df <- data.frame( x = c(1,1,1, 2,2,2) )
summarise( group_by(df,x), n_distinct(z) )

z comes from outside the data frame. Currently I get bogus results like this:

Source: local data frame [2 x 2]

  x n_distinct(z)
1 1             2
2 2             3

@romainfrancois
Copy link
Member

This is because the underlying class Count_Distinct expects it is given something to visit that is as long as there are rows in the data, which is enough of an assumption when we know n_distinct only manages symbols.

I'm not trying to push towards forbidding things, but it would be easier to name whatever expression that would go inside n_distinct in some mutate call before and only accept symbols in n_distinct.

Also n_distinct("mpg") seems like a time bomb, probably not what was expected from the SO question, making it an hard error at least gives a better clue that it's wrong ?

@hadley
Copy link
Member

hadley commented Sep 11, 2014

Ok, that's reasonable - can you just make Count_Distinct throw an error message like "Input to n_distinct() must be a single variable name` ?

@romainfrancois
Copy link
Member

Done. Only thing allowed is a variable name that must be from the data frame.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug an unexpected problem or unintended behavior
Projects
None yet
Development

No branches or pull requests

3 participants