Simpler group_map() #4251

romainfrancois · 2019-03-05T14:49:40Z

Following discussions on twitter, ... here is a simpler group_map() that just lives up to its name and return a list:

library(dplyr, warn.conflicts = FALSE)

data <- mtcars %>%
  group_by(cyl)

data %>% 
  group_map(~ head(.x, 2L))
#> [[1]]
#> # A tibble: 2 x 10
#>     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  22.8  108     93  3.85  2.32  18.6     1     1     4     1
#> 2  24.4  147.    62  3.69  3.19  20       1     0     4     2
#> 
#> [[2]]
#> # A tibble: 2 x 10
#>     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1    21   160   110   3.9  2.62  16.5     0     1     4     4
#> 2    21   160   110   3.9  2.88  17.0     0     1     4     4
#> 
#> [[3]]
#> # A tibble: 2 x 10
#>     mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  18.7   360   175  3.15  3.44  17.0     0     0     3     2
#> 2  14.3   360   245  3.21  3.57  15.8     0     0     3     4

# to get the same sort of result as before, we can e.g. 
old_group_map <- function(data, ...) {
  groups <- groups(data)

  data %>% 
    group_keys() %>% 
    tibble::add_column(`::data::` = group_map(data, ~head(.x, 2L))) %>% 
    tidyr::unnest(`::data::`) %>% 
    group_by(!!!groups)

}
old_group_map(data, ~head(.x, 2L))
#> # A tibble: 6 x 11
#> # Groups:   cyl [3]
#>     cyl   mpg  disp    hp  drat    wt  qsec    vs    am  gear  carb
#>   <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1     4  22.8  108     93  3.85  2.32  18.6     1     1     4     1
#> 2     4  24.4  147.    62  3.69  3.19  20       1     0     4     2
#> 3     6  21    160    110  3.9   2.62  16.5     0     1     4     4
#> 4     6  21    160    110  3.9   2.88  17.0     0     1     4     4
#> 5     8  18.7  360    175  3.15  3.44  17.0     0     0     3     2
#> 6     8  14.3  360    245  3.21  3.57  15.8     0     0     3     4

This simplifies also the implementation, but makes it harder to get the results it previously made when it was assuming the results from .f were always data frames.

This makes group_map() similar to group_split() + map() with the bonus of having .y available, which is sometimes necessary for empty groups.

I'd need to fix the documentation if this is the direction we want to take.

coolbutuseless · 2019-03-06T07:35:40Z

This looks interesting!

Any thought to having group_map() return a collapsed data.frame.

i.e. group_map() = group_split() + map_dfr()

Or perhaps maybe a separate group_map_dfr()?

romainfrancois · 2019-03-06T08:38:30Z

A group_map_dfr() may look like this :

group_map_dfr <- function(.tbl, .f, ..., .id = NULL) bind_rows(group_map(.tbl, .f, ...), .id = .id)

But what that does not do is relate back to the grouping variables of .tbl as we were trying to do in the released version of group_map().

Perhaps the user can make sure they include the grouping variables, and then do a group_by() after the group_map_dfr()

sharlagelfand · 2019-03-06T18:01:35Z

I really like this!

For reference, this is the kind of thing that I was trying to do:

library(dplyr)
library(ggplot2)
library(patchwork)

data <- iris %>%
  group_by(Species)

plot_length_by_width <- function(data) {
  data %>%
    ggplot(aes(x = Sepal.Length,
               y = Sepal.Width)) +
    geom_point() +
    ggtitle(paste(data[["Species"]]))
}

data %>%
  group_map( ~ plot_length_by_width(.x),
             keep = TRUE) %>%
  wrap_plots()

^{Created on 2019-03-06 by the reprex package (v0.2.1.9000)}

and it is super easy to accomplish with this new approach! I especially like the keep argument since I usually spend some time trying to remember how to "carry along" the grouping variable.

jennybc · 2019-03-08T08:44:17Z

I suspect my head is not in the right place but I don't get the motivation for this yet.

The first example, where a function is mapped over the groups induced by cyl, makes me feel immediately uneasy, because it feels like the cyl information has gone missing. Which tempts many to put cyl info as list names, returning to a "grouping data in names" anti-pattern.

Then, if we use keep = TRUE, as in @sharlagelfand's example, it feels like we might as well use group_split().

I feel like I am missing something.

romainfrancois · 2019-03-08T08:54:31Z

The advantage of this group_map() over group_split() is that you always have .y to characterize the groups, i.e. it's not just equivalent to group_split() %>% purrr::map() but more along the lines of map2( group_split(), <a list of 1 row nibbles that has all the grouping variable and one row>)

For most cases it does not matter all that much, but when for some reason you have an empty group, the group_split() + map() has no information about which group you are currently manipulating

romainfrancois · 2019-03-08T08:56:32Z

But indeed @sharlagelfand example might as well be

data %>%
  group_split(keep = TRUE) %>%
  purrr::map(plot_length_by_width) %>%
  wrap_plots()

jennybc · 2019-03-08T16:29:29Z

The advantage of this group_map() over group_split() is that you always have .y to characterize the groups

At the time of applying the function, yes, but then it's generally absent from the result, right?

It fees like this thread does not yet have an example that actually uses the .y feature we're talking about. Does anyone have a nice small example for a head to head comparison of various approaches and show off the proposed change to group_map()? @coolbutuseless?

We seem to be converging on evaluating them in terms of ease of access to the grouping keys vs data at the time of applying the mapped function and in the result.

romainfrancois · 2019-03-08T16:45:46Z

The result is whatever the user makes of it. This no longer makes assumptions or requirements on that, it just applies the function to each group, a group being:

.x a slice of the data for all (or all except the grouping) variables, although perhaps it should be keep=TRUE default or even only that
.y a single value for each of the grouping variable (structured as a one row tibble).

If it wasn't for empty groups then we would not need this proposed group_map() at all, but the .y is what makes this special.

The other thing is that with this implementation, group_map() is perhaps simple enough to perhaps just appear in the examples of ?group_split:

group_map <- function(.tbl, .f, ..., keep = FALSE) {
  .f <- as_group_map_function(.f)

  # call the function on each group
  chunks <- group_split(.tbl, keep = keep)
  keys  <- group_keys(.tbl)
  group_keys <- map(seq_len(nrow(keys)), function(i) keys[i, , drop = FALSE])
  map2(chunks, group_keys, .f, ...)
}

jennybc · 2019-03-08T16:52:55Z

More thoughts:

Is the impetus for change that current group_map() doesn't have the right functionality or that it doesn't have the right name? The part of this conversation (the twitter conversation, actually) that I best understand is the surprise that group_map() has such a strong data-frame-y flavour. Is it more deserving of the name group_modify() or group_map_dfr()?

romainfrancois · 2019-03-08T20:28:52Z

It's like we have what we have now, because we wanted something between

mutate() for which each expression gives something of vec_size() == n()
summarise() that gives a vec_size() == 1

but we implemented this too early, i.e. before they can return tibbles (#2326).

This still makes sense to have a mutate() like function, say alter() with the usual interface, e.g. :

iris %>%
  group_by(Species) %>%
  alter(broom::tidy(lm(Sepal.Length ~ Sepal.Width)))

to splice the resulting columns of tidy(). As is planned for summarise() and mutate() not giving a name inside this hypothetical alter() unpacks the columns in place. As opposed to the current released group_map() that uses .x and .y we could inside alter() benefit from the data mask and therefore just use the column names, here Sepal.Length and Sepal.Width

🙈 dance actually has jive() 💃 for this:

library(dance)

iris %>% 
  group_by(Species) %>% 
  jive( ~ broom::tidy(lm(Sepal.Length ~ Sepal.Width)))
#> # A tibble: 6 x 6
#> # Groups:   Species [3]
#>   Species    term        estimate std.error statistic  p.value
#>   <fct>      <chr>          <dbl>     <dbl>     <dbl>    <dbl>
#> 1 setosa     (Intercept)    2.64     0.310       8.51 3.74e-11
#> 2 setosa     Sepal.Width    0.690    0.0899      7.68 6.71e-10
#> 3 versicolor (Intercept)    3.54     0.563       6.29 9.07e- 8
#> 4 versicolor Sepal.Width    0.865    0.202       4.28 8.77e- 5
#> 5 virginica  (Intercept)    3.91     0.757       5.16 4.66e- 6
#> 6 virginica  Sepal.Width    0.902    0.253       3.56 8.43e- 4

^{Created on 2019-03-08 by the reprex package (v0.2.1.9000)}

romainfrancois · 2019-03-08T20:32:33Z

To come back to your last question @jennybc I'd say "both".

The thing we currently call group_map() should rather be a sibling to summarise() and mutate() taking quosures and auto splicing.

And the name group_map() is legit I'd say for the implementation I'm proposing here, a way to map() over the groups.

hadley · 2019-03-14T10:07:33Z

This new implementation feels like a much better fit to the name group_map() than the previous attempt. I like that it's now a clean wrapper around group_split() + group_data() + map2(). I think we should keep the existing behaviour and give it a new name and I agree with @jennybc that group_modify() would be natural.

@jennybc as an example of how you might use .y, I'd write @sharlagelfand's example like this:

plot_length_by_width <- function(data, group) {
  data %>%
    ggplot(aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    ggtitle(group$Species)
}

iris %>%
  group_by(cyl) %>%
  group_map(plot_length_by_width)

(The fact that the original code actually works is a bug in ggplot2: ggtitle(letters) should return an error, rather than silently using the first element)

romainfrancois · 2019-03-14T10:34:07Z

I can have a go at group_modify() in another branch/pr.

Is that a temporary measure until we have something like summarise() without the vec_size() == 1 constraint and support for auto splice tibble results ? Such a thing would have a more natural quosure-like interface.

hadley · 2019-03-14T12:49:49Z

I think group_modify() is exactly what we currently call group_map(), so shouldn't need a separate branch.

group_map() becomes the proposed

romainfrancois · 2019-03-14T14:47:49Z

✅ so this now has:

group_modify() : what we had before, plus the keep = FALSE argument
group_map() : what is discussed here, essentially a map2(group_split(), <rows of > group_keys())
group_walk() : side effecty version. calls group_map() but returns its input

lock · 2019-09-11T08:58:01Z

This old issue has been automatically locked. If you believe you have found a related problem, please file a new issue (with reprex) and link to this issue. https://reprex.tidyverse.org/

romainfrancois requested review from jennybc and hadley March 5, 2019 14:49

romainfrancois mentioned this pull request Mar 5, 2019

group_map() - keep grouping columns #4066

Closed

romainfrancois mentioned this pull request Mar 11, 2019

.y is an unclear name (in group_map, group_walk) #4234

Closed

romainfrancois force-pushed the group_map_simplification branch from 0670bd1 to 7ff980c Compare March 11, 2019 19:34

romainfrancois added 3 commits March 12, 2019 09:17

Simplifying group_map() so that it just returns a list.

2e6706a

document()

897515e

pass the ...

a014ad3

romainfrancois force-pushed the group_map_simplification branch from 7ff980c to a014ad3 Compare March 12, 2019 08:17

group_modify() becomes the old group_map()

96a4d61

group_map() becomes the proposed

simpler group_modify() implementation using group_map()

a0eff14

romainfrancois merged commit 904b85f into master Mar 15, 2019

romainfrancois deleted the group_map_simplification branch March 15, 2019 08:56

jennybc mentioned this pull request Mar 17, 2019

Vignette on grouped data frames #4290

Closed

mattbk mentioned this pull request Jun 7, 2019

Update group_map() to work with dplyr 0.8.1 r-spatial/sf#1071

Closed

lock bot locked and limited conversation to collaborators Sep 11, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Simpler group_map() #4251

Simpler group_map() #4251

romainfrancois commented Mar 5, 2019

coolbutuseless commented Mar 6, 2019

romainfrancois commented Mar 6, 2019

sharlagelfand commented Mar 6, 2019

jennybc commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

jennybc commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

jennybc commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

hadley commented Mar 14, 2019

romainfrancois commented Mar 14, 2019

hadley commented Mar 14, 2019

romainfrancois commented Mar 14, 2019

lock bot commented Sep 11, 2019

Simpler group_map() #4251

Simpler group_map() #4251

Conversation

romainfrancois commented Mar 5, 2019

coolbutuseless commented Mar 6, 2019

romainfrancois commented Mar 6, 2019

sharlagelfand commented Mar 6, 2019

jennybc commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

jennybc commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

jennybc commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

romainfrancois commented Mar 8, 2019

hadley commented Mar 14, 2019

romainfrancois commented Mar 14, 2019

hadley commented Mar 14, 2019

romainfrancois commented Mar 14, 2019

lock bot commented Sep 11, 2019