Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Default behavior of count() seems to have changed in 1.0.0 #5298

Closed
nhols opened this issue Jun 3, 2020 · 6 comments · Fixed by #5349
Closed

Default behavior of count() seems to have changed in 1.0.0 #5298

nhols opened this issue Jun 3, 2020 · 6 comments · Fixed by #5349
Labels
bug an unexpected problem or unintended behavior verbs 🏃‍♀️
Milestone

Comments

@nhols
Copy link

nhols commented Jun 3, 2020

Noticed that the default behavior of count() has changed in 1.0.0 but wasn't listed as a breaking change in the release notes

0.8.5

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '0.8.5'
tibble(x = 1:3, n = 3:1) %>% count(x)
#> # A tibble: 3 x 2
#>       x     n
#>   <int> <int>
#> 1     1     1
#> 2     2     1
#> 3     3     1

1.0.0

library(dplyr, warn.conflicts = FALSE)
packageVersion("dplyr")
#> [1] '1.0.0'
tibble(x = 1:3, n = 3:1) %>% count(x)
#> Using `n` as weighting variable
#> ℹ Quiet this message with `wt = n` or count rows with `wt = 1`
#> # A tibble: 3 x 2
#>       x     n
#>   <int> <int>
#> 1     1     3
#> 2     2     2
#> 3     3     1

Originally posted by @nhols in #5265 (comment)

@hadley
Copy link
Member

hadley commented Jun 5, 2020

I don't remember count() ever working like this 😞; the new behaviour is definitely what I intended here.

@brianmsm
Copy link

brianmsm commented Jun 5, 2020

I like the wt argument, in fact I have used it many times when there are weights in the data. But generally this is intentional behavior.
I also have code and I remember having a database with a column called n, and on which I wanted to count (without wt). I saw that specifying wt = NULL fixes the problem. Although, I would expect this to be the default behavior of count.

@gorcha
Copy link
Member

gorcha commented Jun 9, 2020

Seconded - I and others I work with regularly use the x %>% count(a, b) %>% count(a) pattern when exploring datasets, and this has broken a bit of existing code.

That aside, it seems like a really counter-intuitive default to implicitly assume that weighting is intended because there's a variable called n in the data frame. It makes way more sense to me to only weight if it's explicitly specified.

The reference to counting rows in the message looks incorrect at the moment also - setting wt = 1 returns n = 1 for all returned rows, where wt = n() replicates the old behaviour:

library(dplyr, warn.conflicts = FALSE)
df <- tibble(x = c(1, 1, 1, 2, 2, 2), n = 1:6)

df %>% count(x, wt = n)
#> # A tibble: 2 x 2
#>       x     n
#>   <dbl> <int>
#> 1     1     6
#> 2     2    15

df %>% count(x, wt = 1)
#> # A tibble: 2 x 2
#>       x     n
#>   <dbl> <dbl>
#> 1     1     1
#> 2     2     1

df %>% count(x, wt = n())
#> # A tibble: 2 x 2
#>       x     n
#>   <dbl> <int>
#> 1     1     3
#> 2     2     3

Created on 2020-06-09 by the reprex package (v0.3.0)

@hadley hadley added this to the 1.0.1 milestone Jun 15, 2020
@lilly-lilly
Copy link

Hello all,

I have a similar issue not in count but in tally.
data2<- data%>%
dplyr::group_by(SITEID) %>%
tally()

I used to use tally to add up how many rows of data I had for each siteID.

Thanks @gorcha that wt = n () fixes the issue:-

data2<- data%>%
dplyr::group_by(SITEID) %>%
tally(wt = n())

Now I just need to remember which dataframes have n in them.

@hadley
Copy link
Member

hadley commented Jun 22, 2020

The root problem appears to be that an inconsistency crept in between count and tally (https://github.com/tidyverse/dplyr/blob/v0.8.5/R/count-tally.R#L25-L33) — in my mind count() and tally() always behaved the same way (automatically weighting by n variable, if present.

However, it looks like count() and tally() have in fact behaved differently for rather a long time (at least in 0.6.0 and above), although it took a long time for the documentation to be correct (it looks like it was fixed in 0.8.5).

Since this was a fairly major and unplanned change, I think the best option is to roll it back in 1.0.1 so that count() once again doesn't try use an n variable by default.

@hadley hadley added bug an unexpected problem or unintended behavior verbs 🏃‍♀️ and removed documentation labels Jun 22, 2020
@hadley
Copy link
Member

hadley commented Jun 22, 2020

Alternatively, maybe it's time to make tally() and count() consistent once more by making neither automatically weight by n if present.

tklebel added a commit to transpose-publishing/landscapeStudy that referenced this issue Jul 4, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior verbs 🏃‍♀️
Projects
None yet
Development

Successfully merging a pull request may close this issue.

5 participants