Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scoped mutate/summarize with multiple columns AND functions #4125

Merged
merged 7 commits into from Jan 30, 2019

Conversation

romainfrancois
Copy link
Member

library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

storms %>% 
  summarize_at(vars(wind, pressure), list(mean, median)) 
#> # A tibble: 1 x 4
#>   `wind_<fn>_1` `pressure_<fn>_1` `wind_<fn>_2` `pressure_<fn>_2`
#>           <dbl>             <dbl>         <dbl>             <dbl>
#> 1          53.5              992.            45               999

Created on 2019-01-24 by the reprex package (v0.2.1.9000)

@romainfrancois
Copy link
Member Author

closes #4119

@hadley
Copy link
Member

hadley commented Jan 24, 2019

Our policy is to not create non-syntactic names, so <fn> needs to be something else.

@@ -309,6 +309,11 @@ manip_apply_syms <- function(funs, syms, tbl) {
names(out) <- names(funs)
} else {
syms_names <- map_chr(syms, as_string)
names(funs) <- if_else(
names(funs) == "<fn>",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where does <fn> get set as a name?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

dplyr/R/funs.R

Line 65 in cf0e282

temp <- exprs_auto_name(temp)

Copy link
Member

@lionel- lionel- Jan 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Eventually we should add name repairing arguments to exprs_auto_name() (or its successor).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we fix it higher up? i.e. if the list is unnamed, we manually name it fn1, fn2 etc?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe, and I guess as_fun_list() would be the right spot.

But it's all part of this complicated set of cases that we try to give nice defaults for, e.g. if there's only one function, named and unnamed do something different and are both legit:

library(dplyr, warn.conflicts = FALSE)

summarise_at(iris, vars(starts_with("Sepal")), list(mean))
#>   Sepal.Length Sepal.Width
#> 1     5.843333    3.057333
summarise_at(iris, vars(starts_with("Sepal")), list(mean = mean))
#>   Sepal.Length_mean Sepal.Width_mean
#> 1          5.843333         3.057333

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romainfrancois Good point. We should only auto-name when there's more than one function.

I see that tibble creates ..n suffix to disambiguate names. Is that to draw user attention on these names? I think I'd prefer an n suffix in this case.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We actually do need to fix this higher up, otherwise this happens (on the second path):

> summarise_at(iris, vars(Sepal.Width), list(mean, median))
  <fn>
1    3

@lionel-
Copy link
Member

lionel- commented Jan 24, 2019

@hadley This is connected to the labelling of literals, i.e. in

transmute(mtcars, !!seq_len(nrow(mtcars)))

Currently the pillar type label is taken in that case.

@romainfrancois
Copy link
Member Author

There's also the problem from #2912 when we use list of lambdas calling the same function:

library(dplyr, warn.conflicts = FALSE)

iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(~quantile(., .25), ~quantile(., .75)))
#>   Sepal.Length_quantile Sepal.Width_quantile
#> 1                   6.4                  3.3

iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(q25 = ~quantile(., .25), q75 = ~quantile(., .75)))
#>   Sepal.Length_q25 Sepal.Width_q25 Sepal.Length_q75 Sepal.Width_q75
#> 1              5.1             2.8              6.4             3.3

Created on 2019-01-25 by the reprex package (v0.2.1.9000)

@krlmlr
Copy link
Member

krlmlr commented Jan 25, 2019

Can we use var..1, var..2 etc. as names? this is consistent with name repair in tibble which eventually should move to vctrs.?

@lionel-
Copy link
Member

lionel- commented Jan 25, 2019

The name repairing should move to rlang I think, because of all the auto-naming in captured arguments. What is the motivation in using ..n as suffix by the way? Is that to draw user attention on variables that need to be named manually?

Edit: If yes, do we need to make a difference between names repair and names generation?

@romainfrancois
Copy link
Member Author

romainfrancois commented Jan 25, 2019

I went the route of fn1, fn2 etc as hinted by @hadley in the thread above. But we still need some name repair/making strategy for esoteric things like:

library(dplyr, warn.conflicts = FALSE)

# both gets called "quantile"
iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(~quantile(., .25), ~quantile(., .75)))
#>   Sepal.Length_quantile Sepal.Width_quantile
#> 1                   6.4                  3.3

# both unnamed, so called fn1, fn2
iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(mean, median))
#>   Sepal.Length_fn1 Sepal.Width_fn1 Sepal.Length_fn2 Sepal.Width_fn2
#> 1         5.843333        3.057333              5.8               3

# what if we really wanted a fn1 ?
iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(mean, median, fn1 = sum))
#>   Sepal.Length_fn1 Sepal.Width_fn1 Sepal.Length_fn2 Sepal.Width_fn2
#> 1            876.5           458.6              5.8               3

This looks tangent to name repairing, or name generation as @lionel- hints. Do we need a package specific to naming things ? @jennybc

@romainfrancois
Copy link
Member Author

romainfrancois commented Jan 25, 2019

Borrowing tibble:::universal_names() gives:

library(dplyr, warn.conflicts = FALSE)

iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(~quantile(., .25), ~quantile(., .75)))
#>   Sepal.Length_quantile..1 Sepal.Width_quantile..1
#> 1                      5.1                     2.8
#>   Sepal.Length_quantile..2 Sepal.Width_quantile..2
#> 1                      6.4                     3.3

iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(mean, median))
#>   Sepal.Length_fn..1 Sepal.Width_fn..1 Sepal.Length_fn..2
#> 1           5.843333          3.057333                5.8
#>   Sepal.Width_fn..2
#> 1                 3

iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(mean, median, fn..1 = mean))
#>   Sepal.Length_fn..1 Sepal.Width_fn..1 Sepal.Length_fn..2
#> 1           5.843333          3.057333                5.8
#>   Sepal.Width_fn..2 Sepal.Length_fn..3 Sepal.Width_fn..3
#> 1                 3           5.843333          3.057333

@krlmlr
Copy link
Member

krlmlr commented Jan 25, 2019

..n is the disambiguating suffix for duplicate names (and also empty names), ugly by intention. I'm fine with moving name repair to rlang.

CC @jennybc.

@lionel-
Copy link
Member

lionel- commented Jan 25, 2019

That makes sense. In the case of repairing, we need some peppery typography to draw the attention to potential problems. In the case of name generation, using other rules might be ok? Producing readable labels might be a more important goal in that case.

@romainfrancois
Copy link
Member Author

I'd still think in that case summarise_at(vars(starts_with("Sepal")), list(mean, median)) we want to draw attention

@romainfrancois
Copy link
Member Author

Another random though is that perhaps when the function is S3 generic, maybe it could just be named after its name, i.e. what's inside UseMethod ?

@krlmlr
Copy link
Member

krlmlr commented Jan 25, 2019

We could support other name-generation rules, but for now it feels like ..1 is the best we have.

@lionel-
Copy link
Member

lionel- commented Jan 25, 2019

Another random though is that perhaps when the function is S3 generic, maybe it could just be named after its name, i.e. what's inside UseMethod ?

Interesting idea, but it might make it more unpredictable for users to figure out when they need to supply names.

@krlmlr
Copy link
Member

krlmlr commented Jan 25, 2019

Do we want teach users to use lst() (or a better equivalent in vctrs) for auto-naming? I'm not sure we should be trying to infer function names, but we could also look up all functions in all namespaces.

@lionel-
Copy link
Member

lionel- commented Jan 25, 2019

I'm not sure we should be trying to infer function names, but we could also look up all functions in all namespaces.

We don't have to scan all namespaces, only the topenv of the function. But I think we shouldn't go down this road because of performance implications and because this will cause different behaviours for exported and sourced functions. Also the user might have imported the function locally with another name.

@romainfrancois
Copy link
Member Author

Yeah this would only buy some more time but then make a bigger monster.

Using some name repair function on generated names, or some name generation gives us ugly enough names for users to want to change them. We could inform() that we generated names, and hint about how to get in control.

@romainfrancois
Copy link
Member Author

sort of related to #3769

# nms[dup] <- paste(nms[dup], which(dup), sep = ".")
# }

# append ..<position> to duplicates
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would you mind copying the relevant parts of name repair from tibble to here, to make sure we're consistent? The current code will misbehave in some corner cases.

I can move the code to a compat-name-repair.R file in tibble so that you just need to copy that file.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure. Essentially, this needs universal_names(quiet = TRUE)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@krlmlr aren't we switching to use ... because ..n is non-syntactic?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With universal names we add another dot if necessary.

library(tibble)
as_tibble(list(1, a = 2, a = 3), .name_repair = "unique")
#> New names:
#> * `` -> `..1`
#> * a -> a..2
#> * a -> a..3
#> # A tibble: 1 x 3
#>     ..1  a..2  a..3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3
as_tibble(list(1, a = 2, a = 3), .name_repair = "universal")
#> New names:
#> * `` -> ...1
#> * a -> a..2
#> * a -> a..3
#> # A tibble: 1 x 3
#>    ...1  a..2  a..3
#>   <dbl> <dbl> <dbl>
#> 1     1     2     3

Created on 2019-01-29 by the reprex package (v0.2.1.9000)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@romainfrancois: Can you please pick up from here, let me know if anything's missing: https://github.com/tidyverse/tibble/blob/r-2.0/R/compat-name-repair.R

@romainfrancois romainfrancois added this to the 0.8.0 milestone Jan 29, 2019
@romainfrancois
Copy link
Member Author

I think this is ready now, getting this. Prior to calling universal_names() from the compat file, <fn> becomes fn, otherwise we'd get wind_.fn...1

library(dplyr, warn.conflicts = FALSE)

storms %>%
  summarise_at(vars(wind, pressure), list(mean, median))
#> # A tibble: 1 x 4
#>   wind_fn..1 pressure_fn..1 wind_fn..2 pressure_fn..2
#>        <dbl>          <dbl>      <dbl>          <dbl>
#> 1       53.5           992.         45            999

iris %>% 
  summarise_at(vars(starts_with("Sepal")), list(~quantile(., .25), ~quantile(., .75)))
#>   Sepal.Length_quantile..1 Sepal.Width_quantile..1
#> 1                      5.1                     2.8
#>   Sepal.Length_quantile..2 Sepal.Width_quantile..2
#> 1                      6.4                     3.3

@lock
Copy link

lock bot commented Sep 7, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 7, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants