summarize() with multi-row returns #6382

krlmlr · 2022-08-01T06:20:29Z

As of dplyr 1.0.0, summarize() will create multiple rows per group, according to the length of the return value of the summary function. This leads to unintended behavior when a vector return is accidental, and it can also cause data loss: in the example below, the n = 0 group silently disappears from the result.

library(conflicted)
library(dplyr)

my_custom_summary_function <- function(n) {
  # Should return a scalar, but I accidentally return a vector
  rep(n, n)
}

tibble(n = 2:0) %>% 
  group_by(n) %>% 
  summarize(out = my_custom_summary_function(n), .groups = "drop") %>% 
  ungroup()
#> # A tibble: 3 × 2
#>       n   out
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     2     2

Created on 2022-08-01 by the reprex package (v2.0.1)

Should we introduce a .multi = c("allow", "require", "fail") argument that supports the pre-1.0.0 strict mode of operation? Should .multi = "fail" even be the default?

library(conflicted)
library(dplyr)

my_custom_summary_function <- function(n) {
  # Should return a scalar, but I accidentally return a vector
  rep(n, n)
}

tibble(n = 2:0) %>% 
  group_by(n) %>% 
  summarize(out = my_custom_summary_function(n), .groups = "drop", .multi = "fail") %>% 
  ungroup()
## Error: `out` has length != 1 in groups 1, 3, use `.multi = "allow"` if this is intended

Imagined on 2022-08-01 by the reprex package (v2.0.1)

hadley · 2022-08-01T14:12:31Z

I'm not too worried about this; in general any misspecified summary function could corrupt data.

DavisVaughan · 2022-08-26T17:31:07Z

See also https://twitter.com/drob/status/1563198515626770432?s=20&t=iTFWSCPNOGWalIrpXHx2qg

kcarnold · 2022-11-14T17:22:44Z

I have occasionally written multi-row summarize pipelines intentionally, but when I do that I need to very carefully document what that code is doing. When I teach summarize, the working model I use is "one row per group"; otherwise it gets confused with mutate. Yes, any bugs in summary functions could corrupt data, but this behavior should be opt-in because (1) R's recycling rules make this sort of behavior easy to trigger accidentally, (2) it's hard to notice and then diagnose when it does happen, and (3) since one-row-per-group is the expected behavior, code that does something different should look like it's doing something different.
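
Until such an opt-in exists, one way to catch an accidental vector return is to wrap the summary value in a length check. The following is a minimal base R sketch; `as_scalar()` is a hypothetical helper name used only for illustration, not a dplyr function.

```r
# Hypothetical helper (not part of dplyr): fail loudly unless `x` has length 1.
as_scalar <- function(x, what = "summary value") {
  if (length(x) != 1L) {
    stop(
      sprintf("%s must have length 1, not %d.", what, length(x)),
      call. = FALSE
    )
  }
  x
}

my_custom_summary_function <- function(n) {
  # Same accidental vector return as in the original report
  rep(n, n)
}

# Length-1 results pass through unchanged
as_scalar(my_custom_summary_function(1L))

# Length-2 and length-0 results raise an error instead of silently
# adding or dropping rows:
# as_scalar(my_custom_summary_function(2L))
# as_scalar(my_custom_summary_function(0L))
```

Inside a pipeline this would be used as summarize(out = as_scalar(my_custom_summary_function(n))), so the check runs once per group.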

hadley · 2022-11-17T15:41:57Z

Our plan is to deprecate this behaviour in summarise() and instead introduce a new function specifically for this purpose. We just need a name. Ideas so far:

morph()
transmogrify()
multisummarise()
abridge()
remodel()
remould()
renovate()
revamp()
shorten()
contract()
lessen()
condense()
synopsize()

I think ideally the word would be closer to summarise() than mutate(), i.e. starting later in the alphabet or ending in ise (although then we'd need UK/US variants, which isn't ideal). I think it's ok if the verb implies an unconditional shrinking, even though some uses might increase the number of rows; we also say that [ subsets a vector.

DavisVaughan · 2022-11-17T16:12:10Z

I like morph() the most out of all of these. It implies some kind of stretching/shrinking of the data without implying any direction. And it is fairly short.

romainfrancois · 2022-11-18T05:15:17Z

I like morph() too. What will happen with morph(<grouped_df>)?

Are we considering being specific about shrinking/growing? i.e. we could have shrink() and grow() or something.

DavisVaughan · 2022-11-18T14:36:52Z

morph(<grouped_df>) would have to work like summarise(<grouped_df>) currently works, I think. i.e. each group computation can return any number of rows, and we recycle the per group results "rowwise" across the resulting columns. And we'd add .by support for morph()

I'm somewhat confident we don't need to care about the direction, mainly it's:

summarise() has the guarantee of 1 row per group. More predictable for users. Harder to make a mistake. Easier database translations.
morph() just relaxes that guarantee, but otherwise works similarly. But when you see morph() in code it should be a clear signal that something is happening that isn't a pure summary, which is pretty nice
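
The difference between the two guarantees can be sketched in base R. This is only an illustration of "apply a function to each group and keep however many rows come back"; `per_group()` is a hypothetical name, it handles a single output column, and it does not implement the recycling across several columns described above.

```r
# Hypothetical base R sketch, not a dplyr function:
# apply `fn` to each group and keep however many values it returns.
per_group <- function(data, by, fn) {
  pieces <- split(data, data[[by]])
  results <- lapply(names(pieces), function(key) {
    piece <- pieces[[key]]
    value <- fn(piece)
    # Repeat the group key once per returned value;
    # a zero-length result drops the group entirely.
    data.frame(key = rep(piece[[by]][1], length(value)), out = value)
  })
  combined <- do.call(rbind, results)
  names(combined)[1] <- by
  rownames(combined) <- NULL
  combined
}

df <- data.frame(g = c(1, 1, 2, 2, 2), x = c(4, 5, 1, 2, 3))

# One value per group: the summarise()-style special case
per_group(df, "g", function(d) mean(d$x))

# Two values per group: the relaxed case
per_group(df, "g", function(d) range(d$x))
```

The first call returns one row per group and the second returns two rows per group, which is the only difference between the strict and the relaxed verb in this sketch.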

DavisVaughan · 2022-11-19T23:36:29Z

I'm remembering that tidygraph uses morph(), which might be enough to prevent us from using it.

I also thought of restructure(), which is kind of nice because it is closer to summarise() in the alphabet and the core part of each verb starts with s (structure and summarise). And it seems to nicely convey that you are taking an existing data frame and reworking it into some new form (with little restriction on the number of rows or columns). The only potential problem is possible confusion with reshape(), but I think I'm ok with it.

It seems somewhat reasonable to say that summarise() is a restricted version / special case of restructure().

kcarnold · 2022-11-20T01:48:41Z

Since this operation is sometimes called "split-apply-combine", perhaps recombine, or rebuild, reconstruct, remake, or reform? Since we're making an entirely new data frame by combining the results of operations on each group.

Or, more related to existing verbs, something based on bind_rows? bind_rows_groupwise? tibble_groupwise?

DavisVaughan · 2022-11-20T13:27:19Z

Or just build(), i.e. "build a new data frame from an existing one". If we aren't worried about conflicting with devtools::build(), that sounds pretty good.

hadley · 2022-11-22T13:19:44Z

Another build synonym would be assemble().

In the building/construction metaphor: renovate()

Crazy idea: this function is a sort of combination of mutate() and summarise() so we could call it summate(), which means to sum up.

DavisVaughan · 2022-11-22T14:33:35Z

I think I'd be fairly happy with assemble()

It doesn't immediately come up as being used by any big packages
Still no direction implied in the name, which I like
I like this idea of using the name to reflect that this "creates a new data frame", which we have always described summarise() as theoretically doing
I like that it doesn't start with re*()

lionel- · 2022-11-22T14:52:29Z

I like the crazy idea (summate()) because it explains what it does (relax the size constraints of mutate and summarise so it can be anything in between) without really introducing a new verb (it's a portmanteau).

Among the other suggestions I prefer morph() for the same reason, because of this idea of an unconstrained form of the result.

lionel- · 2022-11-22T15:24:40Z

Since you can also expand the rows, I think summate is not such a good name after all.

Maybe a verb like remodel() would be a good way of expressing the change in shape.

lionel- · 2022-11-22T15:36:36Z

In another direction, verbs that imply recreating a data frame:

retibble()
reframe()
redefine()

Relationship between reframe and tibble frame functions:

enframe: vector → df
deframe: df → vector
reframe: df → df
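
A small sketch of how the three would sit together, assuming the proposed reframe() exists with summarise()-like syntax and no constraint on the number of rows (as implemented in dplyr 1.1.0 and later). enframe() and deframe() are the existing tibble functions.

```r
library(tibble)
library(dplyr)  # reframe() is assumed to be available (dplyr >= 1.1.0)

x <- c(a = 1, b = 2, c = 3)

# enframe(): named vector -> two-column data frame (name, value)
df <- enframe(x)

# deframe(): two-column data frame -> named vector
back <- deframe(df)

# reframe(): data frame -> data frame, with no constraint on row count;
# range() returns two values, so the result has two rows
out <- reframe(df, value = range(value))
```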

wurli · 2022-11-24T22:37:28Z

FWIW, as a long time dplyr user I'm not hugely keen on morph() - in my mind it doesn't feel suggestive of summarise()-like behaviour. Of all the suggestions so far I like multisummarise() best, but I feel like there's a better counterpoint out there. Some extra suggestions:

elaborate()
abbreviate()
telescope()
restate()
revise()

DavisVaughan · 2022-11-25T11:47:32Z

it doesn't feel suggestive of summarise()-like behaviour.

I somewhat strongly believe that you should not try to connect this new verb to summarise() too closely in your head:

summarise(): reduce each group down to 1 row
new verb: "do something" to each group

It just happens to be that summarise() is a "special case" of this new verb, but in terms of daily practical usage that is as far as I'd take the comparison.

Real-life usage of this new verb typically looks awkward if summarise() is in the name, because it very often isn't actually performing any kind of summary operation.

kcarnold · 2022-11-26T14:20:45Z

Some real-world examples would help in picking the name. (I thought I'd had some, but in a quick look through my stuff, I only found examples of where the multi-row behavior wasn't what I'd wanted.)

eutwt · 2022-11-26T17:39:01Z

Throwing another name into the hat, because I like short names, I'll suggest draw() as in either (take your pick)

to "draw" out specific data from a tibble, as in "draw water from a well"
to "draw" a new tibble from an existing one, as in "draw a picture"

DavisVaughan · 2022-11-27T14:18:33Z

A few real life examples.

With ivs, which generally takes sets of intervals and returns other sets of arbitrary size (notably, it can return more or fewer rows than you started with!)

library(dplyr)
library(ivs)

df <- tibble(
  start = as.Date(c("2019-01-01", "2019-01-04", "2019-01-07")),
  end = as.Date(c("2019-01-05", "2019-01-06", "2019-01-08"))
) %>%
  mutate(iv = iv(start, end), .keep = "none")

df
#> # A tibble: 3 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-05)
#> 2 [2019-01-04, 2019-01-06)
#> 3 [2019-01-07, 2019-01-08)

# Merge all overlapping ranges
df %>%
  morph(iv = iv_groups(iv))
#> # A tibble: 2 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-06)
#> 2 [2019-01-07, 2019-01-08)

# Split all overlapping ranges into non-overlapping disjoint sets
df %>%
  morph(iv = iv_splits(iv))
#> # A tibble: 4 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-04)
#> 2 [2019-01-04, 2019-01-05)
#> 3 [2019-01-05, 2019-01-06)
#> 4 [2019-01-07, 2019-01-08)

Similar idea with intersect():

library(dplyr, warn.conflicts = FALSE)

table <- c("a", "b", "d", "f")

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

# `morph()` allows you to apply functions that return
# an arbitrary number of rows
df %>%
  morph(x = intersect(x, table))
#> # A tibble: 4 × 1
#>   x    
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 f    
#> 4 d

Doing something silly like reproducing slice_sample()

library(dplyr)
df <- tibble(
  g = c(1, 1, 2, 2, 2),
  x = c(4, 5, 1, 2, 3)
)
df %>%
  morph(x = sample(x, 4, replace = TRUE), .by = g)
#> # A tibble: 8 × 2
#>       g     x
#>   <dbl> <dbl>
#> 1     1     4
#> 2     1     5
#> 3     1     4
#> 4     1     4
#> 5     2     2
#> 6     2     3
#> 7     2     2
#> 8     2     3

An older pattern combined with read_csv() and multiple files, from the original dplyr 1.0.0 blog post about this feature

tibble(path = dir(pattern = "\\.csv$")) %>% 
  rowwise(path) %>% 
  morph(read_csv(path))

DavisVaughan · 2022-11-27T14:28:07Z

How about create()?

Because you "create a new result from each group" (this would be the help page title)
Can also be seen as "create a new data frame from an existing one"
- Which ties to our theoretical beliefs that this and summarise() create a "new" data frame, as opposed to mutate()
Easy to tie to summarise(), because that "creates a 1 row summary from each group". So it is a special case of this.
Does not imply a direction
Does not imply a number of rows returned
Does not seem to be taken by any packages
The name works very well with all of my real life examples above, even the read_csv() one

With the ivs example, I would say that I "create the groups by merging the overlapping ranges", and the code is create(groups = iv_groups(iv)).

So far this is my favorite option

Subjective reasons I like it:

Has an artistic flair to it. "Creation" has fewer rules tied to it, e.g. the rules about the number of rows returned
Fairly short name
It is a name with positive connotations
Feels along the same lines as mutate() and summarise()

lionel- · 2022-11-28T08:00:37Z

I like it. It seems a bit too general to me though, compared to something like reframe() which is a more practical description of what is happening. But I agree that it feels more similar to mutate and summarise.

romainfrancois · 2022-11-28T09:30:18Z

I like create() and believe it would read very well with .by =

wurli · 2022-11-28T09:48:36Z

I also like create() a lot but agree that it possibly feels overly general

lionel- · 2022-11-28T10:04:08Z

When you consider the family it's not immediately obvious why create() is called like that because all the verbs are an act of creation:

mutate() creates new columns or recreates existing columns within an existing data frame.
summarise() creates a new data frame with size-1 summaries from an existing one.
create() creates a new data frame from an existing one.

I think this illustrates why create() is too general.

DavisVaughan · 2022-11-28T13:14:10Z

I actually liked create() because it was fairly general 😆

The fact that you can describe mutate() and summarise() using the word "create" didn't bother me too much, since their names imply they are stricter variants of it. create() is just an act of creation with the fewest restraints possible

eutwt · 2022-11-28T13:26:34Z

I think create() feels a little strange because the object of the verb (as you'd use it in normal speech) is the output instead of the input. Like, you summarise()/mutate() an existing data frame but you create() a new data frame. That being said, I think it does seem to work better than the other suggestions (incl. mine above)

yutannihilation · 2022-11-28T15:56:15Z

I like create(), but I'm afraid it sounds too magical. In my understanding, the function is rather for experts compared to mutate() and summarize() with single-row-returns, so probably it should sound more difficult.

What about explode(), which is used in Hive/Spark SQL? c.f. https://spark.apache.org/docs/latest/api/sql/index.html#explode

mine-cetinkaya-rundel · 2022-11-28T16:14:16Z

I've been thinking about this for a few days but I haven't come up with a new name that I like, and I'm afraid I don't think any of the ones suggested here sit right with me. At the risk of being very not creative, I would suggest something like multi_summarize() or summarize_multi().

Otherwise reframe() makes the most sense but I think it won't be trivial to teach when to use reframe() vs. summarize(), as in, how will someone know they should use summarize() instead of reframe()? (Though this is maybe more of a comment on the function's functionality than its name.)

shahronak47 · 2022-11-29T07:22:26Z

Maybe it's only me, but I am not completely convinced that we need a completely new function here. I actually liked @krlmlr's initial suggestion of a separate .multi argument in summarise() that defines the behaviour. I can't find the discussion of why that idea was rejected.

hadley · 2022-11-29T13:44:05Z

@shahronak47

We no longer believe that summarise() is a good name for this sort of operation.
Some backends (e.g. databases) can't support this behaviour at all.
Adding an additional argument to summarise() ends up making the call much longer.

DavisVaughan · 2022-11-29T14:25:00Z

@shahronak47 the ivs examples linked here are good examples of how this typically isn't a summary operation, so summarise() isn't a good name for this. In particular, note that depending on the function you use, this operation can actually return more rows than you started with.

This verb is much more akin to do() than summarise() in my mind, even if the API itself looks a little closer to summarise() (the fact that the API looks more similar to summarise() is actually a big win in my mind over the previous do() syntax)

hadley closed this as not planned Aug 1, 2022

hadley reopened this Aug 3, 2022

krlmlr mentioned this issue Aug 19, 2022

WIP: New .multi argument to summarize() #6420

Closed

DavisVaughan mentioned this issue Nov 14, 2022

summarise() edge case recycling bug #6509

Closed

hadley added this to the 1.1.0 milestone Nov 17, 2022

hadley added the feature and grouping 👨‍👩‍👧‍👦 labels Nov 17, 2022

DavisVaughan self-assigned this Nov 18, 2022

DavisVaughan mentioned this issue Nov 21, 2022

Implement reframe() #6557

Merged

DavisVaughan mentioned this issue Nov 28, 2022

Feedback for alternate names for reframe() #6565

Closed

DavisVaughan closed this as completed in #6557 Nov 28, 2022
