Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

add_count() and add_tally() should error if output column name is already used #4284

Closed
ghost opened this issue Mar 14, 2019 · 3 comments
Closed
Labels
bug an unexpected problem or unintended behavior helpers 🏋️‍♂️ helper functions like count() and tally()

Comments

@ghost
Copy link

ghost commented Mar 14, 2019

I'm wondering if add_count() and add_tally() should throw an error in case that the output column name (default "n") is already used in a pre-existing column. Otherwise it ends up doing some odd behaviour. For instance:

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble(a = rep(LETTERS, 2), n = 1:52)

df
#> # A tibble: 52 x 2
#>    a         n
#>    <chr> <int>
#>  1 A         1
#>  2 B         2
#>  3 C         3
#>  4 D         4
#>  5 E         5
#>  6 F         6
#>  7 G         7
#>  8 H         8
#>  9 I         9
#> 10 J        10
#> # ... with 42 more rows

add_tally(df)
#> Using `n` as weighting variable
#> # A tibble: 52 x 2
#>    a         n
#>    <chr> <int>
#>  1 A      1378
#>  2 B      1378
#>  3 C      1378
#>  4 D      1378
#>  5 E      1378
#>  6 F      1378
#>  7 G      1378
#>  8 H      1378
#>  9 I      1378
#> 10 J      1378
#> # ... with 42 more rows

Do we want n to be used as weighting variable? This behaviour seems unexpected to me.

add_count(df, a)
#> # A tibble: 52 x 2
#>    a         n
#>    <chr> <int>
#>  1 A         2
#>  2 B         2
#>  3 C         2
#>  4 D         2
#>  5 E         2
#>  6 F         2
#>  7 G         2
#>  8 H         2
#>  9 I         2
#> 10 J         2
#> # ... with 42 more rows

In this case, add_count() silently replaces the pre-existing column "n" with its output, which is likely not the user's intent.

EDIT: I see that the behaviour with add_count is already discussed here. Fair enough on that front.

@ghost ghost changed the title Should add_count and add_tally throw an error if the output column name is already used for a pre-existing column? Should add_count and add_tally throw an error if the output column name is already used by a pre-existing column? Mar 14, 2019
@nschmucker
Copy link

nschmucker commented Mar 28, 2019

It seems the behavior of add_tally() changed in version 0.8.0.1. Previously, in 0.7.6...

> library(dplyr)

> mtcars %>% add_tally(vs) %>% add_tally(am)
# A tibble: 32 x 13
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     n    nn
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4    14    13
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4    14    13
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1    14    13
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1    14    13
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2    14    13
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1    14    13
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4    14    13
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2    14    13
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2    14    13
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4    14    13
# ... with 22 more rows

Now, though, under 0.8.0.1, it overwrites the n column, rather than automatically adding a nn column.

> mtcars %>% add_tally(vs) %>% add_tally(am)
# A tibble: 32 x 12
     mpg   cyl  disp    hp  drat    wt  qsec    vs    am  gear  carb     n
   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
 1  21       6  160    110  3.9   2.62  16.5     0     1     4     4    13
 2  21       6  160    110  3.9   2.88  17.0     0     1     4     4    13
 3  22.8     4  108     93  3.85  2.32  18.6     1     1     4     1    13
 4  21.4     6  258    110  3.08  3.22  19.4     1     0     3     1    13
 5  18.7     8  360    175  3.15  3.44  17.0     0     0     3     2    13
 6  18.1     6  225    105  2.76  3.46  20.2     1     0     3     1    13
 7  14.3     8  360    245  3.21  3.57  15.8     0     0     3     4    13
 8  24.4     4  147.    62  3.69  3.19  20       1     0     4     2    13
 9  22.8     4  141.    95  3.92  3.15  22.9     1     0     4     2    13
10  19.2     6  168.   123  3.92  3.44  18.3     1     0     4     4    13
# ... with 22 more rows

I like the new option to specify a column name (!) but it seems that existing code that relied on the nn column name may run into issues. This seems somewhat related to the documentation change noted in #4124

@nteetor
Copy link

nteetor commented Sep 18, 2019

Ran into the problem @nschmucker brings up. I went looking for a mention in NEWS, but found nothing.

@hadley hadley added bug an unexpected problem or unintended behavior helpers 🏋️‍♂️ helper functions like count() and tally() labels Dec 10, 2019
@hadley hadley changed the title Should add_count and add_tally throw an error if the output column name is already used by a pre-existing column? add_count() and add_tally() should throw an error if output column name is already used Dec 31, 2019
@hadley hadley changed the title add_count() and add_tally() should throw an error if output column name is already used add_count() and add_tally() should error if output column name is already used Dec 31, 2019
@hadley hadley closed this as completed in 00d62a7 Dec 31, 2019
@hadley
Copy link
Member

hadley commented Dec 31, 2019

Now:

library(dplyr, warn.conflicts = FALSE)
df <- tibble(g = c(1, 1, 2), n = c(1, 1, 10))

df %>% count(g)
#> Using `n` as weighting variable
#> Error: Column 'n' is already present in output
#>  * Use `name = "new_name"` to pick a new name
df %>% add_count(g)
#> Using `n` as weighting variable
#> Error: Column 'n' is already present in output
#>  * Use `name = "new_name"` to pick a new name

# Create new variable
df %>% count(g, name = "nn")
#> Using `n` as weighting variable
#> # A tibble: 2 x 2
#>       g    nn
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2    10
df %>% add_count(g, name = "nn")
#> Using `n` as weighting variable
#> # A tibble: 3 x 3
#>       g     n    nn
#>   <dbl> <dbl> <dbl>
#> 1     1     1     2
#> 2     1     1     2
#> 3     2    10    10

# Overide existing variable
df %>% count(g, name = "n")
#> Using `n` as weighting variable
#> # A tibble: 2 x 2
#>       g     n
#>   <dbl> <dbl>
#> 1     1     2
#> 2     2    10
df %>% add_count(g, name = "n")
#> Using `n` as weighting variable
#> # A tibble: 3 x 2
#>       g     n
#>   <dbl> <dbl>
#> 1     1     2
#> 2     1     2
#> 3     2    10

Created on 2019-12-31 by the reprex package (v0.3.0)

(and similarly for tally())

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug an unexpected problem or unintended behavior helpers 🏋️‍♂️ helper functions like count() and tally()
Projects
None yet
Development

No branches or pull requests

3 participants