Single table verbs should accept tibbles in conditions #2326

Open
hadley opened this Issue Dec 15, 2016 · 11 comments

hadley commented Dec 15, 2016

Currently mutate() and summarise() only work with vectorised functions: functions that take a vector as input and return a vector (or "scalar") as output. I don't see any reason why summarise() and mutate() couldn't also accept tibbles. The existing restrictions would continue to apply so that in summarise() the tibble would have to have exactly one row, and in mutate() it would have to have either one row or n rows.

In other words, the following two lines of code should be equivalent:

df %>%
  summarise(mean = mean(x), sd = sd(x))

df %>%
  summarise(tibble(mean = mean(x), sd = sd(x)))

This would allow you to extract that repeated pattern out into a function:

# and hence
mean_sd <- function(df, var) {
  tibble(mean = mean(df[[var]]), sd = sd(df[[var]]))
}
df %>% 
  summarise(mean_sd(df, "x"))

We'd need to work on documentation to help people develop effective functions of this nature, and develop tools so that you could easily specify input variables (using whatever the next iteration of lazyeval provides) and name the outputs. But that's largely a second-order concern: we can figure out those details later.

Supporting tibbles in this way would be particularly useful for tidyr, as it would help to clarify the nature of functions like separate() and unite(), which are currently data frame wrappers around simple vector functions.

These ideas are most important for summarise() and mutate() but I think we should apply the same principles to filter() and arrange() as well.

cc @lionel- @jennybc @krlmlr
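
The extraction idea above can be sketched with a helper that takes a vector rather than the whole data frame, so that it also works per group. This is only an illustration under stated assumptions: it relies on dplyr 1.0.0 or later, where summarise() unpacks an unnamed data frame result into columns (behaviour that did not exist when this issue was opened), and the vector-based signature of mean_sd() is a hypothetical variation on the function shown in the issue.

```r
library(dplyr)

# Hypothetical variation of mean_sd(): takes a vector, returns a one-row tibble.
mean_sd <- function(x) {
  tibble(mean = mean(x), sd = sd(x))
}

# Assumes dplyr >= 1.0.0: the unnamed one-row tibble is spliced into columns.
res <- mtcars %>%
  group_by(cyl) %>%
  summarise(mean_sd(mpg))

names(res)
```

Because the helper only sees a vector, the same function can be reused for any column without passing the data frame or a column name as a string.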

krlmlr commented Feb 21, 2017

What should happen if the structure of the tibbles varies from call to call (grouped)? bind_rows() semantics? Either way, I feel this should happen after #2311.

aornugent commented Mar 28, 2017

This is really helpful, especially for filter().

There are cases where a function returns its results as rows rather than columns, which summarise() does not accept.

df %>%
   summarise(tibble(quantile(x, probs = c(0.025, 0.5, 0.975))))

results in:

Error in `[<-.data.frame`(`*tmp*`, , value = list(model = c("exponential_detection_model",  : 
  replacement element 2 is a matrix/data frame of 3 rows, need 1
In addition: Warning message:
In `[<-.data.frame`(`*tmp*`, , value = list(model = c("exponential_detection_model",  :
  replacement element 1 has 3 rows to replace 1 rows

Is there a way to transpose the tibble?

huftis commented Nov 2, 2017

@aornugent: You can (ab?)use bind_rows() instead of tibble(). Example:

bind_rows(quantile(iris$Sepal.Length))

This returns:

# A tibble: 1 x 5
   `0%` `25%` `50%` `75%` `100%`
  <dbl> <dbl> <dbl> <dbl>  <dbl>
1   4.3   5.1   5.8   6.4    7.9
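
For the transposition question, a base R route is also possible. The sketch below is not taken from the thread: it assumes only base R, where t() turns the named quantile vector into a one-row matrix and as.data.frame() keeps the quantile labels as column names.

```r
# Named numeric vector with one element per requested quantile.
q <- quantile(iris$Sepal.Length, probs = c(0.025, 0.5, 0.975))

# t() gives a 1 x 3 matrix; as.data.frame() keeps its column names.
wide <- as.data.frame(t(q))

names(wide)
```

The result has a single row, which is the shape summarise() expects from each group.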

romainfrancois commented Dec 20, 2017

Now that we have := and sort of going back to the initial #154, perhaps the lhs of := can be richer, i.e. something like this parses:

mtcars %>% 
  group_by(cyl) %>% 
  summarise( tie(mpg0,mpg25,mpg50,mpg75,mpg100) := quantile(mpg) )

From this 🐦 thread https://twitter.com/romain_francois/status/943399604065849344

romainfrancois commented Feb 23, 2018

I toyed with this syntax on the tie 📦 here: https://github.com/romainfrancois/tie

> iris %>% 
+   dplyr::group_by(Species) %>% 
+   bow( tie(min, max) := range(Sepal.Length) )
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90
> 
> x <- "min"
> iris %>% 
+   dplyr::group_by(Species) %>% 
+   bow( tie(!!x, max) := range(Sepal.Length) )
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90

Now it just does a classic summarise of the rhs of := wrapped in a list call, and then re-extracts into what is specified in the lhs, i.e. it does this:

> iris %>% 
+   group_by(Species) %>% 
+   summarise( ..tmp.. = list(range(Sepal.Length)) ) %>% 
+   mutate( min = map_dbl(..tmp.., 1), max = map_dbl(..tmp.., 2) ) %>% 
+   select( -..tmp..)
# A tibble: 3 x 3
  Species      min   max
  <fct>      <dbl> <dbl>
1 setosa      4.30  5.80
2 versicolor  4.90  7.00
3 virginica   4.90  7.90

aornugent commented Mar 5, 2018

This is awesome, thank you.

t-kalinowski commented Apr 11, 2018

👍 💯 for the functionality.

The names bow and tie don't seem quite right (too cutesy). Simply c() would be consistent with zeallot.

romainfrancois commented Apr 11, 2018

Yeah sure. I typically don’t use the same standards when making a poc 📦 as when working on dplyr.

romainfrancois commented Apr 23, 2018

Back to the original suggestion:

df %>%
   summarise(tibble(mean = mean(x), sd = sd(x)))

The usual rule for summarise() is that the result has length 1; we could extend this to the data frame case by only allowing data frames with 1 row.

For mutate(), we could say we only accept data frames with n rows, n being the size of the group.

What would we do if the expression has a name, e.g.

df %>%
   summarise( y = tibble(mean = mean(x), sd = sd(x)))
  • c("y_mean", "y_sd" )
  • c("mean", "sd")
  • not allow it

hadley commented Apr 23, 2018

I think the names case is easiest - we follow whatever strategy rlang::flatten() uses.

romainfrancois commented Apr 30, 2018

Wondering about nesting now, would it make sense instead that:

df %>%
   summarise(data = tibble(mean = mean(x), sd = sd(x)))

does the same thing as

df %>%
   summarise(data = list(tibble(mean = mean(x), sd = sd(x))))

(without the always confusing list)

Relevant discussion in #2132
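
The naming and nesting options raised in the last few comments can be compared side by side. The sketch below is an illustration, not the resolution of this issue: it assumes dplyr 1.0.0 or later (where a named data frame result is kept as a single data frame column) and tidyr 1.0.0 or later (for unpack()), both of which postdate this thread. The mtcars columns stand in for df and x.

```r
library(dplyr)
library(tidyr)

# Named data frame result: assumed to be stored as one data frame column `y`.
packed <- mtcars %>%
  group_by(cyl) %>%
  summarise(y = tibble(mean = mean(mpg), sd = sd(mpg)))

# Option c("y_mean", "y_sd"): prefix the inner names with the outer name.
prefixed <- packed %>% unpack(y, names_sep = "_")

# Option c("mean", "sd"): keep the inner names only.
plain <- packed %>% unpack(y)

# Explicit list(): one tibble per group, stored in a list-column.
nested <- mtcars %>%
  group_by(cyl) %>%
  summarise(data = list(tibble(mean = mean(mpg), sd = sd(mpg))))
```

Under these assumptions the named form and the explicit list() form are not equivalent: the first yields a data frame column, the second a list-column of one-row tibbles.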
