New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong results in "summarise" when characters are used as variable identifiers #567

Closed
patrickroocks opened this Issue Aug 30, 2014 · 7 comments

Comments

Projects
None yet
3 participants
@patrickroocks

patrickroocks commented Aug 30, 2014

The following code

summarise(mtcars, n_distinct("mpg"))

doesn't rise any errors or warning but returns (randomly) wrong results (but approximately the same magnitude) between 16 and 24, whereas the correct use of the variable.

summarise(mtcars, n_distinct(mpg))

returns 25.

Indicated by the stackoverflow question "Randomness in dplyr" a similar behavior occurs on larger datasets too.

@hadley hadley added the bug label Sep 1, 2014

@hadley hadley added this to the 0.3 milestone Sep 1, 2014

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Sep 10, 2014

The problem is there somewhere:https://github.com/hadley/dplyr/blob/424b304ae235d9ff33b6fbcd7860e8f94ec7e9f2/inst/include/dplyr/Result/Count_Distinct.h

@hadley : Should I make n_distinct("mpg" ):

  • illegal
  • return 1

By the same token, would we allow expressions inside of n_distinct, e.g. n_distinct(mpg*2)

@hadley

This comment has been minimized.

Member

hadley commented Sep 10, 2014

Ideally n_distinct("mpg") would return 1 and expressions would also work.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Sep 10, 2014

Sure. Just wanted to check first, forbid stuff is easier to do.

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Sep 10, 2014

Ah. What about this:

z <- 1:2
df <- data.frame( x = c(1,1,1, 2,2,2) )
summarise( group_by(df,x), n_distinct(z) )

z comes from outside the data frame. Currently I get bogus results like this:

Source: local data frame [2 x 2]

  x n_distinct(z)
1 1             2
2 2             3
@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Sep 10, 2014

This is because the underlying class Count_Distinct expects it is given something to visit that is as long as there are rows in the data, which is enough of an assumption when we know n_distinct only manages symbols.

I'm not trying to push towards forbidding things, but it would be easier to name whatever expression that would go inside n_distinct in some mutate call before and only accept symbols in n_distinct.

Also n_distinct("mpg") seems like a time bomb, probably not what was expected from the SO question, making it an hard error at least gives a better clue that it's wrong ?

@hadley

This comment has been minimized.

Member

hadley commented Sep 11, 2014

Ok, that's reasonable - can you just make Count_Distinct throw an error message like "Input to n_distinct() must be a single variable name` ?

romainfrancois added a commit that referenced this issue Sep 11, 2014

@romainfrancois

This comment has been minimized.

Member

romainfrancois commented Sep 11, 2014

Done. Only thing allowed is a variable name that must be from the data frame.

@lock lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.