Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simpler group_map() #4251

Merged
merged 5 commits into from
Mar 15, 2019
Merged

Simpler group_map() #4251

merged 5 commits into from
Mar 15, 2019

Conversation

romainfrancois
Copy link
Member

Following discussions on twitter, ... here is a simpler group_map() that just lives up to its name and return a list:

library(dplyr, warn.conflicts = FALSE)

data <- mtcars %>%
  group_by(cyl)

data %>% 
  group_map(~ head(.x, 2L))
#> [[1]]
#> # A tibble: 2 x 10
#>     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8  108     93  3.85  2.32  18.6     1     1     4     1
#> 2  24.4  147.    62  3.69  3.19  20       1     0     4     2
#> 
#> [[2]]
#> # A tibble: 2 x 10
#>     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21   160   110   3.9  2.62  16.5     0     1     4     4
#> 2    21   160   110   3.9  2.88  17.0     0     1     4     4
#> 
#> [[3]]
#> # A tibble: 2 x 10
#>     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  18.7   360   175  3.15  3.44  17.0     0     0     3     2
#> 2  14.3   360   245  3.21  3.57  15.8     0     0     3     4

# to get the same sort of result as before, we can e.g. 
old_group_map <- function(data, ...) {
  groups <- groups(data)

  data %>% 
    group_keys() %>% 
    tibble::add_column(`::data::` = group_map(data, ~head(.x, 2L))) %>% 
    tidyr::unnest(`::data::`) %>% 
    group_by(!!!groups)

}
old_group_map(data, ~head(.x, 2L))
#> # A tibble: 6 x 11
#> # Groups:   cyl [3]
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  22.8  108     93  3.85  2.32  18.6     1     1     4     1
#> 2     4  24.4  147.    62  3.69  3.19  20       1     0     4     2
#> 3     6  21    160    110  3.9   2.62  16.5     0     1     4     4
#> 4     6  21    160    110  3.9   2.88  17.0     0     1     4     4
#> 5     8  18.7  360    175  3.15  3.44  17.0     0     0     3     2
#> 6     8  14.3  360    245  3.21  3.57  15.8     0     0     3     4

This simplifies also the implementation, but makes it harder to get the results it previously made when it was assuming the results from .f were always data frames.

This makes group_map() similar to group_split() + map() with the bonus of having .y available, which is sometimes necessary for empty groups.

I'd need to fix the documentation if this is the direction we want to take.

@coolbutuseless
Copy link
Contributor

This looks interesting!

Any thought to having group_map() return a collapsed data.frame.

i.e. group_map() = group_split() + map_dfr()

Or perhaps maybe a separate group_map_dfr()?

@romainfrancois
Copy link
Member Author

A group_map_dfr() may look like this :

group_map_dfr <- function(.tbl, .f, ..., .id = NULL) bind_rows(group_map(.tbl, .f, ...), .id = .id)

But what that does not do is relate back to the grouping variables of .tbl as we were trying to do in the released version of group_map().

Perhaps the user can make sure they include the grouping variables, and then do a group_by() after the group_map_dfr()

@sharlagelfand
Copy link
Contributor

I really like this!

For reference, this is the kind of thing that I was trying to do:

library(dplyr)
library(ggplot2)
library(patchwork)

data <- iris %>%
  group_by(Species)

plot_length_by_width <- function(data) {
  data %>%
    ggplot(aes(x = Sepal.Length,
               y = Sepal.Width)) +
    geom_point() +
    ggtitle(paste(data[["Species"]]))
}

data %>%
  group_map( ~ plot_length_by_width(.x),
             keep = TRUE) %>%
  wrap_plots()

Created on 2019-03-06 by the reprex package (v0.2.1.9000)

and it is super easy to accomplish with this new approach! I especially like the keep argument since I usually spend some time trying to remember how to "carry along" the grouping variable.

@jennybc
Copy link
Member

jennybc commented Mar 8, 2019

I suspect my head is not in the right place but I don't get the motivation for this yet.

The first example, where a function is mapped over the groups induced by cyl, makes me feel immediately uneasy, because it feels like the cyl information has gone missing. Which tempts many to put cyl info as list names, returning to a "grouping data in names" anti-pattern.

Then, if we use keep = TRUE, as in @sharlagelfand's example, it feels like we might as well use group_split().

I feel like I am missing something.

@romainfrancois
Copy link
Member Author

The advantage of this group_map() over group_split() is that you always have .y to characterize the groups, i.e. it's not just equivalent to group_split() %>% purrr::map() but more along the lines of map2( group_split(), <a list of 1 row nibbles that has all the grouping variable and one row>)

For most cases it does not matter all that much, but when for some reason you have an empty group, the group_split() + map() has no information about which group you are currently manipulating

@romainfrancois
Copy link
Member Author

But indeed @sharlagelfand example might as well be

data %>%
  group_split(keep = TRUE) %>%
  purrr::map(plot_length_by_width) %>%
  wrap_plots()

@jennybc
Copy link
Member

jennybc commented Mar 8, 2019

The advantage of this group_map() over group_split() is that you always have .y to characterize the groups

At the time of applying the function, yes, but then it's generally absent from the result, right?

It fees like this thread does not yet have an example that actually uses the .y feature we're talking about. Does anyone have a nice small example for a head to head comparison of various approaches and show off the proposed change to group_map()? @coolbutuseless?

We seem to be converging on evaluating them in terms of ease of access to the grouping keys vs data at the time of applying the mapped function and in the result.

@romainfrancois
Copy link
Member Author

The result is whatever the user makes of it. This no longer makes assumptions or requirements on that, it just applies the function to each group, a group being:

  • .x a slice of the data for all (or all except the grouping) variables, although perhaps it should be keep=TRUE default or even only that
  • .y a single value for each of the grouping variable (structured as a one row tibble).

If it wasn't for empty groups then we would not need this proposed group_map() at all, but the .y is what makes this special.

The other thing is that with this implementation, group_map() is perhaps simple enough to perhaps just appear in the examples of ?group_split:

group_map <- function(.tbl, .f, ..., keep = FALSE) {
  .f <- as_group_map_function(.f)

  # call the function on each group
  chunks <- group_split(.tbl, keep = keep)
  keys  <- group_keys(.tbl)
  group_keys <- map(seq_len(nrow(keys)), function(i) keys[i, , drop = FALSE])
  map2(chunks, group_keys, .f, ...)
}

@jennybc
Copy link
Member

jennybc commented Mar 8, 2019

More thoughts:

Is the impetus for change that current group_map() doesn't have the right functionality or that it doesn't have the right name? The part of this conversation (the twitter conversation, actually) that I best understand is the surprise that group_map() has such a strong data-frame-y flavour. Is it more deserving of the name group_modify() or group_map_dfr()?

@romainfrancois
Copy link
Member Author

It's like we have what we have now, because we wanted something between

  • mutate() for which each expression gives something of vec_size() == n()
  • summarise() that gives a vec_size() == 1

but we implemented this too early, i.e. before they can return tibbles (#2326).

This still makes sense to have a mutate() like function, say alter() with the usual interface, e.g. :

iris %>%
  group_by(Species) %>%
  alter(broom::tidy(lm(Sepal.Length ~ Sepal.Width)))

to splice the resulting columns of tidy(). As is planned for summarise() and mutate() not giving a name inside this hypothetical alter() unpacks the columns in place. As opposed to the current released group_map() that uses .x and .y we could inside alter() benefit from the data mask and therefore just use the column names, here Sepal.Length and Sepal.Width

🙈 dance actually has jive() 💃 for this:

library(dance)

iris %>% 
  group_by(Species) %>% 
  jive( ~ broom::tidy(lm(Sepal.Length ~ Sepal.Width)))
#> # A tibble: 6 x 6
#> # Groups:   Species [3]
#>   Species    term        estimate std.error statistic  p.value
#>   <fct>      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 setosa     (Intercept)    2.64     0.310       8.51 3.74e-11
#> 2 setosa     Sepal.Width    0.690    0.0899      7.68 6.71e-10
#> 3 versicolor (Intercept)    3.54     0.563       6.29 9.07e- 8
#> 4 versicolor Sepal.Width    0.865    0.202       4.28 8.77e- 5
#> 5 virginica  (Intercept)    3.91     0.757       5.16 4.66e- 6
#> 6 virginica  Sepal.Width    0.902    0.253       3.56 8.43e- 4

Created on 2019-03-08 by the reprex package (v0.2.1.9000)

@romainfrancois
Copy link
Member Author

To come back to your last question @jennybc I'd say "both".

The thing we currently call group_map() should rather be a sibling to summarise() and mutate() taking quosures and auto splicing.

And the name group_map() is legit I'd say for the implementation I'm proposing here, a way to map() over the groups.

@hadley
Copy link
Member

hadley commented Mar 14, 2019

This new implementation feels like a much better fit to the name group_map() than the previous attempt. I like that it's now a clean wrapper around group_split() + group_data() + map2(). I think we should keep the existing behaviour and give it a new name and I agree with @jennybc that group_modify() would be natural.

@jennybc as an example of how you might use .y, I'd write @sharlagelfand's example like this:

plot_length_by_width <- function(data, group) {
  data %>%
    ggplot(aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    ggtitle(group$Species)
}

iris %>%
  group_by(cyl) %>%
  group_map(plot_length_by_width)

(The fact that the original code actually works is a bug in ggplot2: ggtitle(letters) should return an error, rather than silently using the first element)

@romainfrancois
Copy link
Member Author

I can have a go at group_modify() in another branch/pr.

Is that a temporary measure until we have something like summarise() without the vec_size() == 1 constraint and support for auto splice tibble results ? Such a thing would have a more natural quosure-like interface.

@hadley
Copy link
Member

hadley commented Mar 14, 2019

I think group_modify() is exactly what we currently call group_map(), so shouldn't need a separate branch.

group_map() becomes the proposed
@romainfrancois
Copy link
Member Author

✅ so this now has:

  • group_modify() : what we had before, plus the keep = FALSE argument
  • group_map() : what is discussed here, essentially a map2(group_split(), <rows of > group_keys())
  • group_walk() : side effecty version. calls group_map() but returns its input

@romainfrancois romainfrancois merged commit 904b85f into master Mar 15, 2019
@romainfrancois romainfrancois deleted the group_map_simplification branch March 15, 2019 08:56
@lock
Copy link

lock bot commented Sep 11, 2019

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

@lock lock bot locked and limited conversation to collaborators Sep 11, 2019
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants