Automatically unpack unnamed df-cols #2326

hadley · 2016-12-15T23:18:13Z

Currently mutate() and summarise() only work with vectorised functions: functions that take a vector as input and return a vector (or "scalar") as output. I don't see any reason why summarise() and mutate() couldn't also accept tibbles. The existing restrictions would continue to apply so that in summarise() the tibble would have to have exactly one row, and in mutate() it would have to have either one row or n rows.

In other words, the following two lines of code should be equivalent:

df %>%
  summarise(mean = mean(x), sd = sd(x))

df %>%
  summarise(tibble(mean = mean(x), sd = sd(x)))

This would allow you to extract that repeated pattern out into a function:

# Returns a one-row tibble holding the mean and sd of column `var`
mean_sd <- function(df, var) {
  tibble(mean = mean(df[[var]]), sd = sd(df[[var]]))
}
df %>% 
  summarise(mean_sd(df, "x"))
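Because df[[var]] always reads the whole data frame, the helper above ignores any grouping. A variant that takes the column vector instead may compose better with group_by(). The following is only a sketch, assuming a dplyr version that splices unnamed tibble results (1.0.0 or later); mean_sd_vec() is a hypothetical name, not part of dplyr.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical helper: takes a vector, returns a one-row tibble
mean_sd_vec <- function(x) {
  tibble(mean = mean(x), sd = sd(x))
}

# The unnamed tibble is unpacked into two columns, computed once per group
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_sd_vec(mpg))

names(res)
#> [1] "cyl"  "mean" "sd"
```

Since the helper only sees a vector, the same function can be reused for any column without passing names as strings.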

We'd need to work on documentation to help people develop effective functions of this nature, and develop tools so that you could easily specify input variables (using whatever the next iteration of lazyeval provides) and name the outputs. But that's largely a second-order concern: we can figure out those details later.

Supporting tibbles in this way would be particularly useful for tidyr, as it would help to clarify the nature of functions like separate() and unite(), which are currently data frame wrappers around simple vector functions.

These ideas are most important for summarise() and mutate() but I think we should apply the same principles to filter() and arrange() as well.

cc @lionel- @jennybc @krlmlr


romainfrancois · 2017-12-20T08:41:08Z

Now that we have := and sort of going back to the initial #154, perhaps the lhs of := can be richer, i.e. something like this parses:

mtcars %>% 
  group_by(cyl) %>% 
  summarise( tie(mpg0,mpg25,mpg50,mpg75,mpg100) := quantile(mpg) )

From this 🐦 thread https://twitter.com/romain_francois/status/943399604065849344

romainfrancois · 2018-02-23T08:27:05Z

I toyed with this syntax on the tie 📦 here: https://github.com/romainfrancois/tie

> iris %>% 
+   dplyr::group_by(Species) %>% 
+   bow( tie(min, max) := range(Sepal.Length) )
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90
> 
> x <- "min"
> iris %>% 
+   dplyr::group_by(Species) %>% 
+   bow( tie(!!x, max) := range(Sepal.Length) )
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90

Now it just does a classic summarise of the rhs of := wrapped in a list call, and then re-extracts into what is specified in the lhs, i.e. it does this:

> iris %>% 
+   group_by(Species) %>% 
+   summarise( ..tmp.. = list(range(Sepal.Length)) ) %>% 
+   mutate( min = map_dbl(..tmp.., 1), max = map_dbl(..tmp.., 2) ) %>% 
+   select( -..tmp..)
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90

hadley · 2019-02-08T21:15:41Z

Update on naming: I think it now seems reasonable that named tibbles would produce a df-col:

# Tibble is spliced into output, producing two new columns
df %>% summarise(tibble(mean = mean(x), sd = sd(x)))

# Produces a single df-col containing two variables.
df %>% summarise(summary = tibble(mean = mean(x), sd = sd(x)))

Update on sizes: I think it now seems reasonably obvious that columns in summarise() must have a size of 1, and columns in mutate() must have a size of n.
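The naming and size rules can be checked on a small example. This is a sketch that assumes a dplyr release where the behaviour described in this thread has shipped (1.0.0 or later); the data and column names are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(x = c(1, 2, 3, 4))

# Unnamed: the one-row tibble is spliced, giving columns `mean` and `sd`
spliced <- df %>% summarise(tibble(mean = mean(x), sd = sd(x)))

# Named: the tibble is kept whole as a single df-col called `summary`
packed <- df %>% summarise(summary = tibble(mean = mean(x), sd = sd(x)))

# In mutate(), an unnamed tibble with n rows is spliced row by row
centred <- df %>% mutate(tibble(centred = x - mean(x)))

names(spliced)                 # "mean" "sd"
names(packed)                  # "summary"
is.data.frame(packed$summary)  # TRUE
names(centred)                 # "x" "centred"
```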

hadley · 2019-02-08T21:19:27Z

To handle the "quantile" problem, we'll need a quantile() wrapper that returns a tibble with one column for each quantile (tidyverse/funs#24). We'll also need a set of "colwise()" wrappers around standard summary functions:

df %>% summarise(agg_quantile(x))
df %>% summarise(col_mean(starts_with("x")), col_min(ends_with("y")))
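As a rough illustration of what such a quantile wrapper could look like, here is a sketch written against plain dplyr. quantile_tbl() and its q25/q50/q75 column naming are hypothetical choices made for this example; they are not the agg_quantile() proposed above, and nothing here is taken from tidyverse/funs.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical wrapper: returns one row with one column per probability
quantile_tbl <- function(x, probs = c(0.25, 0.5, 0.75)) {
  q <- stats::quantile(x, probs = probs, names = FALSE)
  names(q) <- paste0("q", probs * 100)
  as_tibble(as.list(q))
}

# The unnamed one-row tibble is unpacked into three columns per group
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(quantile_tbl(mpg))

names(res)
#> [1] "cyl" "q25" "q50" "q75"
```

Generating syntactic column names inside the wrapper avoids the backticked `25%` style names that quantile() produces by default.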

We'll have to think carefully about how these functions compose: what does colwise quantile look like? What if you want to summarise multiple variables with multiple functions?
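For the "multiple variables with multiple functions" case, later dplyr releases (1.0.0 onward) provide across(), which is one possible answer rather than the col_mean() style wrappers sketched in this comment. A minimal example, assuming such a version is installed:

```r
library(dplyr, warn.conflicts = FALSE)

# Apply two summary functions to two columns, once per group;
# output columns are named <column>_<function> by default
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(across(c(mpg, hp), list(mean = mean, sd = sd)))

names(res)
#> [1] "cyl"     "mpg_mean" "mpg_sd"  "hp_mean" "hp_sd"
```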

krlmlr · 2019-10-23T22:21:59Z

Is this implemented already? In the dev_0_9_0 branch I see:

library(tidyverse)
tibble(a = 1) %>% mutate(tibble(b = 2))
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
tibble(a = 1) %>% mutate(tibble(b = 2), c = b)
#> Error in get(as.character(FUN), mode = "function", envir = envir): object '.f' of mode 'function' was not found

Created on 2019-10-24 by the reprex package (v0.3.0)

Do we unpack (auto-splice) at the end or right after processing an expression? This is relevant for tidyverse/tibble#581.

romainfrancois · 2019-10-25T10:31:30Z

I was mistakenly using the compat map() as if it were purrr::map(). Fixed now:

library(tidyverse)
tibble(a = 1) %>% mutate(tibble(b = 2))
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
tibble(a = 1) %>% mutate(tibble(b = 2), c = b)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     2

romainfrancois · 2019-10-25T10:41:54Z

but yeah, it's on:

library(dplyr, warn.conflicts = FALSE)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(as_tibble(as.list(quantile(mpg))))
#> # A tibble: 3 x 6
#>     cyl  `0%` `25%` `50%` `75%` `100%`
#>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1     4  21.4  22.8  26    30.4   33.9
#> 2     6  17.8  18.6  19.7  21     21.4
#> 3     8  10.4  14.4  15.2  16.2   19.2

krlmlr added the "data frame" and "feature" (a feature request or enhancement) labels on Dec 17, 2016

This was referenced Jan 31, 2017

Multivariate mutate #2286

Closed

Optional parameter to control length of summarise #154

Closed