Wrong results in "summarise" when characters are used as variable identifiers #567

patrickroocks · 2014-08-30T07:50:12Z

The following code

summarise(mtcars, n_distinct("mpg"))

doesn't rise any errors or warning but returns (randomly) wrong results (but approximately the same magnitude) between 16 and 24, whereas the correct use of the variable.

summarise(mtcars, n_distinct(mpg))

returns 25.

Indicated by the stackoverflow question "Randomness in dplyr" a similar behavior occurs on larger datasets too.

romainfrancois · 2014-09-10T12:11:47Z

The problem is there somewhere:https://github.com/hadley/dplyr/blob/424b304ae235d9ff33b6fbcd7860e8f94ec7e9f2/inst/include/dplyr/Result/Count_Distinct.h

@hadley : Should I make n_distinct("mpg" ):

illegal
return 1

By the same token, would we allow expressions inside of n_distinct, e.g. n_distinct(mpg*2)

hadley · 2014-09-10T12:24:17Z

Ideally n_distinct("mpg") would return 1 and expressions would also work.

romainfrancois · 2014-09-10T12:27:24Z

Sure. Just wanted to check first, forbid stuff is easier to do.

romainfrancois · 2014-09-10T12:39:30Z

Ah. What about this:

z <- 1:2
df <- data.frame( x = c(1,1,1, 2,2,2) )
summarise( group_by(df,x), n_distinct(z) )

z comes from outside the data frame. Currently I get bogus results like this:

Source: local data frame [2 x 2]

  x n_distinct(z)
1 1             2
2 2             3

romainfrancois · 2014-09-10T12:45:19Z

This is because the underlying class Count_Distinct expects it is given something to visit that is as long as there are rows in the data, which is enough of an assumption when we know n_distinct only manages symbols.

I'm not trying to push towards forbidding things, but it would be easier to name whatever expression that would go inside n_distinct in some mutate call before and only accept symbols in n_distinct.

Also n_distinct("mpg") seems like a time bomb, probably not what was expected from the SO question, making it an hard error at least gives a better clue that it's wrong ?

hadley · 2014-09-11T00:24:00Z

Ok, that's reasonable - can you just make Count_Distinct throw an error message like "Input to n_distinct() must be a single variable name` ?

romainfrancois · 2014-09-11T08:32:04Z

Done. Only thing allowed is a variable name that must be from the data frame.

hadley added the bug label Sep 1, 2014

hadley added this to the 0.3 milestone Sep 1, 2014

hadley assigned romainfrancois Sep 1, 2014

romainfrancois closed this as completed in 8184c4b Sep 11, 2014

romainfrancois added a commit that referenced this issue Sep 11, 2014

also check that there is only one arg (length(call)==2); #567

072d35f

lock bot locked as resolved and limited conversation to collaborators Jun 10, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Wrong results in "summarise" when characters are used as variable identifiers #567

Wrong results in "summarise" when characters are used as variable identifiers #567

patrickroocks commented Aug 30, 2014

romainfrancois commented Sep 10, 2014

hadley commented Sep 10, 2014

romainfrancois commented Sep 10, 2014

romainfrancois commented Sep 10, 2014

romainfrancois commented Sep 10, 2014

hadley commented Sep 11, 2014

romainfrancois commented Sep 11, 2014

Wrong results in "summarise" when characters are used as variable identifiers #567

Wrong results in "summarise" when characters are used as variable identifiers #567

Comments

patrickroocks commented Aug 30, 2014

romainfrancois commented Sep 10, 2014

hadley commented Sep 10, 2014

romainfrancois commented Sep 10, 2014

romainfrancois commented Sep 10, 2014

romainfrancois commented Sep 10, 2014

hadley commented Sep 11, 2014

romainfrancois commented Sep 11, 2014