Automatically unpack unnamed df-cols #2326

Closed
hadley opened this issue Dec 15, 2016 · 18 comments
Labels
feature
Milestone

Comments

@hadley
Member

hadley commented Dec 15, 2016

Currently mutate() and summarise() only work with vectorised functions: functions that take a vector as input and return a vector (or "scalar") as output. I don't see any reason why summarise() and mutate() couldn't also accept tibbles. The existing restrictions would continue to apply so that in summarise() the tibble would have to have exactly one row, and in mutate() it would have to have either one row or n rows.

In other words, the following two lines of code should be equivalent:

df %>%
  summarise(mean = mean(x), sd = sd(x))

df %>%
  summarise(tibble(mean = mean(x), sd = sd(x)))

This would allow you to extract that repeated pattern out into a function:

# A helper that returns a one-row tibble of summaries
mean_sd <- function(df, var) {
  tibble(mean = mean(df[[var]]), sd = sd(df[[var]]))
}
df %>% 
  summarise(mean_sd(df, "x"))

We'd need to work on documentation to help people develop effective functions of this nature, and develop tools so that you could easily specify input variables (using whatever the next iteration of lazyeval provides) and name the outputs. But that's largely a second-order concern: we can figure out those details later.
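For readers arriving later, here is a minimal sketch of the pattern with a helper that takes a vector rather than the whole data frame. It assumes a dplyr version in which unnamed data frame results are unpacked (1.0.0 or later, as far as I know); the `mean_sd()` signature and the toy data are illustrative assumptions, not part of the proposal.

```r
library(dplyr)

# Hypothetical helper: takes a vector, returns a one-row tibble
mean_sd <- function(x) {
  tibble(mean = mean(x), sd = sd(x))
}

# Toy data, made up for illustration
df <- tibble(g = c("a", "a", "b", "b"), x = c(1, 3, 10, 20))

# The unnamed tibble is spliced into the result, one row per group
res <- df %>%
  group_by(g) %>%
  summarise(mean_sd(x))

names(res)
```

Because the helper receives the column itself, it sees only the current group's values when the data is grouped.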

Supporting tibbles in this way would be particularly useful for dplyr as it would help to clarify the nature of functions like separate() and unite() which are currently data frame wrappers around simple vector functions.

These ideas are most important for summarise() and mutate() but I think we should apply the same principles to filter() and arrange() as well.

cc @lionel- @jennybc @krlmlr

@krlmlr added the "data frame" and "feature" labels on Dec 17, 2016

@romainfrancois
Member

romainfrancois commented Dec 20, 2017

Now that we have := and sort of going back to the initial #154, perhaps the lhs of := can be richer, i.e. something like this parses:

mtcars %>% 
  group_by(cyl) %>% 
  summarise( tie(mpg0,mpg25,mpg50,mpg75,mpg100) := quantile(mpg) )

From this 🐦 thread https://twitter.com/romain_francois/status/943399604065849344

@romainfrancois
Member

romainfrancois commented Feb 23, 2018

I toyed with this syntax on the tie 📦 here: https://github.com/romainfrancois/tie

> iris %>% 
+   dplyr::group_by(Species) %>% 
+   bow( tie(min, max) := range(Sepal.Length) )
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90
> 
> x <- "min"
> iris %>% 
+   dplyr::group_by(Species) %>% 
+   bow( tie(!!x, max) := range(Sepal.Length) )
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90

Now it just does a classic summarise of the rhs of := wrapped in a list call, and then re-extracts into what is specified in the lhs, i.e. it does this:

> iris %>% 
+   group_by(Species) %>% 
+   summarise( ..tmp.. = list(range(Sepal.Length)) ) %>% 
+   mutate( min = map_dbl(..tmp.., 1), max = map_dbl(..tmp.., 2) ) %>% 
+   select( -..tmp..)
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90


@hadley
Member Author

hadley commented Feb 8, 2019

Update on naming: I think it now seems reasonable that named tibbles would produce a df-col:

# Tibble is spliced into output, producing two new columns
df %>% summarise(tibble(mean = mean(x), sd = sd(x)))

# Produces a single df-col containing two variables.
df %>% summarise(summary = tibble(mean = mean(x), sd = sd(x)))

Update on sizes: I think it now seems reasonably obvious that columns in summarise must have a size of 1, and columns in mutate must have a size of n.
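A small sketch of the naming rule, assuming a dplyr version that implements it (1.0.0 or later, to my knowledge). The data and the names `spliced` and `packed` are made up for illustration.

```r
library(dplyr)

# Toy data, made up for illustration
df <- tibble(x = c(1, 2, 3, 4))

# Unnamed tibble: its columns are spliced into the output
spliced <- df %>% summarise(tibble(mean = mean(x), sd = sd(x)))

# Named tibble: the output has a single df-col called `summary`
packed <- df %>% summarise(summary = tibble(mean = mean(x), sd = sd(x)))

names(spliced)                 # "mean" "sd"
names(packed)                  # "summary"
is.data.frame(packed$summary)  # TRUE
```

The inner variables of the df-col stay reachable as `packed$summary$mean` and `packed$summary$sd`.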

@hadley
Member Author

hadley commented Feb 8, 2019

To handle the "quantile" problem, we'll need a quantile() wrapper that returns a tibble with one column for each quantile (tidyverse/funs#24). We'll also need a set of "colwise()" wrappers around standard summary functions:

df %>% summarise(agg_quantile(x))
df %>% summarise(col_mean(starts_with("x")), col_min(ends_with("y")))

We'll have to think carefully about how these functions compose: what does colwise quantile look like? What if you want to summarise multiple variables with multiple functions?
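As a rough illustration of the kind of wrapper meant here, the sketch below turns quantile() output into a one-row tibble with one column per quantile. `quantile_tbl()` is a hypothetical name and not the `agg_quantile()` mentioned above or any existing function; the `q25`-style column names are likewise an assumption. It relies on a dplyr version that unpacks unnamed tibbles.

```r
library(dplyr)

# Hypothetical wrapper: one column per requested quantile
quantile_tbl <- function(x, probs = c(0.25, 0.5, 0.75)) {
  q <- quantile(x, probs = probs, names = FALSE)
  names(q) <- paste0("q", probs * 100)
  as_tibble(as.list(q))
}

# One row per group, with columns cyl, q25, q50, q75
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(quantile_tbl(mpg))

names(res)
```

This says nothing about how such wrappers should compose across several variables, which remains the open question.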

@hadley changed the title from "Single table verbs should accept tibbles in conditions" to "Automatically unpack unnamed df-cols" on May 27, 2019
@krlmlr
Member

krlmlr commented Oct 23, 2019

Is this implemented already? In the dev_0_9_0 branch I see:

library(tidyverse)
tibble(a = 1) %>% mutate(tibble(b = 2))
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
tibble(a = 1) %>% mutate(tibble(b = 2), c = b)
#> Error in get(as.character(FUN), mode = "function", envir = envir): object '.f' of mode 'function' was not found

Created on 2019-10-24 by the reprex package (v0.3.0)

Do we unpack (auto-splice) at the end or right after processing an expression? This is relevant for tidyverse/tibble#581.

@romainfrancois
Member

romainfrancois commented Oct 25, 2019

I was mistakenly using the compat map() as if it were purrr::map(). Fixed now:

library(tidyverse)
tibble(a = 1) %>% mutate(tibble(b = 2))
#> # A tibble: 1 x 2
#>       a     b
#>   <dbl> <dbl>
#> 1     1     2
tibble(a = 1) %>% mutate(tibble(b = 2), c = b)
#> # A tibble: 1 x 3
#>       a     b     c
#>   <dbl> <dbl> <dbl>
#> 1     1     2     2

@romainfrancois
Member

romainfrancois commented Oct 25, 2019

but yeah, it's on:

library(dplyr, warn.conflicts = FALSE)
mtcars %>% 
  group_by(cyl) %>% 
  summarise(as_tibble(as.list(quantile(mpg))))
#> # A tibble: 3 x 6
#>     cyl  `0%` `25%` `50%` `75%` `100%`
#>   <dbl> <dbl> <dbl> <dbl> <dbl>  <dbl>
#> 1     4  21.4  22.8  26    30.4   33.9
#> 2     6  17.8  18.6  19.7  21     21.4
#> 3     8  10.4  14.4  15.2  16.2   19.2

6 participants