summarise_each_q naming consistency #442

Closed
tverbeke opened this issue May 26, 2014 · 7 comments

@tverbeke tverbeke commented May 26, 2014

Currently the naming of output variables does not seem consistent:

summarise_each_q(mtcars, funs = "mean", vars = "disp")
#       disp
#1 230.7219

summarise_each_q(mtcars, funs = c("mean", "sd"), vars = "disp")
#       mean       sd
#1 230.7219 123.9387

summarise_each_q(mtcars, funs = c("mean", "sd"), vars = c("disp", "cyl"))
#   disp_mean cyl_mean  disp_sd   cyl_sd
#1  230.7219   6.1875 123.9387 1.785922

Maybe in each of the cases <var>_<fun> could be used, as in

summarise_each_q(mtcars, funs = "mean", vars = "disp")
#   disp_mean
#1  230.7219

summarise_each_q(mtcars, funs = c("mean", "sd"), vars = "disp")
#  disp_mean  disp_sd
#1 230.7219 123.9387

summarise_each_q(mtcars, funs = c("mean", "sd"), vars = c("disp", "cyl"))
#   disp_mean cyl_mean  disp_sd   cyl_sd
#1  230.7219   6.1875 123.9387 1.785922

@hadley hadley commented May 26, 2014

It is consistent - it always uses the minimum needed to disambiguate between columns.
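As a rough illustration of that rule, the three outputs in the first comment can be reproduced by a small base R helper. `minimal_names()` is a hypothetical name used only for this sketch; it is not dplyr's implementation, and it assumes that a single function always yields bare variable names.

```r
# Hypothetical helper, not dplyr code: mimics the naming seen in the
# three calls in the first comment.
minimal_names <- function(vars, funs) {
  if (length(funs) == 1) {
    return(vars)   # one function: variable names already disambiguate
  }
  if (length(vars) == 1) {
    return(funs)   # one variable: function names already disambiguate
  }
  # several of each: <var>_<fun>, with variables varying fastest
  as.vector(outer(vars, funs, paste, sep = "_"))
}

minimal_names("disp", "mean")                       # "disp"
minimal_names("disp", c("mean", "sd"))              # "mean" "sd"
minimal_names(c("disp", "cyl"), c("mean", "sd"))
# "disp_mean" "cyl_mean" "disp_sd" "cyl_sd"
```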


@tverbeke tverbeke commented May 27, 2014

I understand the underlying logic, but if one wants to use the results of summarise_each_q downstream (i.e. programmatically in a larger piece of code), it requires quite a bit of special-casing, e.g. to pick the right columns depending on the number of funs and vars that were specified.
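The special-casing in question might look roughly like the sketch below. `summary_column()` is a hypothetical helper written for illustration; it assumes the naming behaviour shown in the first comment and only works out which column name downstream code would have to look up.

```r
# Hypothetical helper (not part of dplyr): returns the name of the column
# holding `fun` applied to `var`, given the vars and funs of the call.
summary_column <- function(var, fun, vars, funs) {
  stopifnot(var %in% vars, fun %in% funs)
  if (length(funs) == 1) {
    var                          # column is named after the variable
  } else if (length(vars) == 1) {
    fun                          # column is named after the function
  } else {
    paste(var, fun, sep = "_")   # column is named <var>_<fun>
  }
}

summary_column("disp", "mean", vars = "disp", funs = "mean")
# "disp"
summary_column("disp", "mean", vars = "disp", funs = c("mean", "sd"))
# "mean"
summary_column("disp", "mean", vars = c("disp", "cyl"), funs = c("mean", "sd"))
# "disp_mean"
```

With a uniform `<var>_<fun>` scheme, all three branches would collapse into the single `paste()` call.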


@hadley hadley commented May 27, 2014

I prefer to solve this conundrum by somehow describing how the outputs should be named. Any ideas for interfaces?


@tverbeke tverbeke commented May 27, 2014

Thank you, Hadley. Allowing outputs to be named explicitly is a good solution. Below is a first idea for an interface. If outputs is NULL (the default), the current system could still be applied when there is a single 'var' (vars of length one):

summarise_each_q(mtcars, funs = c("mean", "sd"), vars = c("disp", "cyl"), 
      outputs = list(vars = c("displace", "cylinder"), funs = c("avg", "stdev"), sep = "_"))
#   displace_avg cylinder_avg displace_stdev cylinder_stdev
# 1     230.7219       6.1875       123.9387       1.785922

summarise_each_q(mtcars, funs = c("mean", "sd"), vars = c("disp", "cyl"), 
      outputs = list(vars = vars, funs = funs, sep = "_"))
#   disp_mean cyl_mean  disp_sd   cyl_sd
# 1  230.7219   6.1875 123.9387 1.785922
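One possible reading of the proposed `outputs` argument is that it only controls how column names are assembled. The base R sketch below builds names from such a list; `names_from_outputs()` is a hypothetical helper, and defaulting `sep` to `"_"` when it is missing is an assumption of this sketch, not something the proposal specifies.

```r
# Hypothetical helper: turns the proposed `outputs` list into column names.
# Assumes outputs$vars and outputs$funs are character vectors.
names_from_outputs <- function(outputs) {
  sep <- if (is.null(outputs$sep)) "_" else outputs$sep  # assumed default
  # <var><sep><fun>, with variables varying fastest
  as.vector(outer(outputs$vars, outputs$funs, paste, sep = sep))
}

names_from_outputs(list(
  vars = c("displace", "cylinder"),
  funs = c("avg", "stdev"),
  sep  = "_"
))
# "displace_avg" "cylinder_avg" "displace_stdev" "cylinder_stdev"
```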

@hadley hadley added this to the 0.3.1 milestone Aug 1, 2014
@hadley hadley self-assigned this Aug 1, 2014
@hadley hadley added this to the 0.4 milestone Nov 20, 2014
@hadley hadley removed this from the 0.3.1 milestone Nov 20, 2014

@lionel- lionel- commented Jan 17, 2015

If summarise() gets the ability to return multi-row outputs (see #154), then it could make sense for summarise_each() to report one row per function, with a new .funs column.

This way we'd get a tidy data frame.
See also this SO question.
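A minimal base R sketch of that layout, assuming ungrouped data and functions given by name, could look like this. `summarise_long()` is a hypothetical helper used only for illustration; it is not a dplyr function.

```r
# Hypothetical helper: one row per summary function, identified by a
# .funs column, and one column per summarised variable.
summarise_long <- function(data, vars, funs) {
  rows <- lapply(funs, function(f) {
    fn <- match.fun(f)                 # look the function up by name
    vals <- lapply(data[vars], fn)     # one summary value per variable
    cbind(
      data.frame(.funs = f, stringsAsFactors = FALSE),
      as.data.frame(vals)
    )
  })
  do.call(rbind, rows)
}

long <- summarise_long(mtcars, vars = c("disp", "cyl"), funs = c("mean", "sd"))
names(long)   # ".funs" "disp" "cyl"
nrow(long)    # 2
```

Column names then never depend on how many functions or variables were requested, which sidesteps the naming question entirely.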


@anhqle anhqle commented Mar 7, 2016

The issue here is not just naming inconsistency. More problematic is dplyr's behavior of adding variables when there are two summarizing functions but replacing existing variables when there is only one (see #1259).

I would suggest that dplyr add variables and rename them when the user gives a named function, e.g. by_species %>% mutate_each(funs(blah = min)). In this case, the columns will be Sepal.Length, Sepal.Width, ... , Sepal.Length_blah, Sepal.Width_blah, ...

This way, there is no need for an extra argument and the behavior is consistent with when there are multiple summarizing functions specified.
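The suggested rule can be sketched in base R on ungrouped data. `mutate_each_sketch()` and the toy data frame are hypothetical and only illustrate the naming behaviour: a named function adds `<var>_<name>` columns, while an unnamed one overwrites the originals.

```r
# Hypothetical helper, not dplyr code. `fns` is a list of functions;
# named entries add suffixed columns, unnamed entries replace in place.
mutate_each_sketch <- function(data, fns) {
  out <- data
  fn_names <- names(fns)
  if (is.null(fn_names)) fn_names <- rep("", length(fns))
  for (i in seq_along(fns)) {
    for (col in names(data)) {
      target <- if (nzchar(fn_names[i])) {
        paste(col, fn_names[i], sep = "_")   # named: new column <var>_<name>
      } else {
        col                                  # unnamed: overwrite the column
      }
      out[[target]] <- fns[[i]](data[[col]])
    }
  }
  out
}

toy <- data.frame(x = c(3, 1, 2), y = c(6, 5, 4))
names(mutate_each_sketch(toy, list(blah = min)))  # "x" "y" "x_blah" "y_blah"
names(mutate_each_sketch(toy, list(min)))         # "x" "y"
```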


@hadley hadley commented Mar 8, 2016

@LaDilettante I like that idea!

@hadley hadley closed this in a7d92b4 Mar 8, 2016
chrMongeau added a commit to SWS-Methodology/faoswsTrade that referenced this issue Sep 30, 2016
The behaviour of summarise_each changed so that summarised variables get a suffix (see tidyverse/dplyr#442 ).
@lock lock bot locked as resolved and limited conversation to collaborators Jun 9, 2018