summarize() with multi-row returns #6382

Closed
krlmlr opened this issue Aug 1, 2022 · 32 comments · Fixed by #6557

Comments

@krlmlr
Member

krlmlr commented Aug 1, 2022

As of dplyr 1.0.0, summarize() will create multiple rows per group, according to the length of the return value of the summary function. This new feature leads to unintended behavior if the vector return is accidental, and also can lead to data loss.

library(conflicted)
library(dplyr)

my_custom_summary_function <- function(n) {
  # Should return a scalar, but I accidentally return a vector
  rep(n, n)
}

tibble(n = 2:0) %>% 
  group_by(n) %>% 
  summarize(out = my_custom_summary_function(n), .groups = "drop") %>% 
  ungroup()
#> # A tibble: 3 × 2
#>       n   out
#>   <int> <int>
#> 1     1     1
#> 2     2     2
#> 3     2     2

Created on 2022-08-01 by the reprex package (v2.0.1)

Should we introduce a .multi = c("allow", "require", "fail") argument that supports the pre-1.0.0 strict mode of operation? Should .multi = "fail" even be the default?

library(conflicted)
library(dplyr)

my_custom_summary_function <- function(n) {
  # Should return a scalar, but I accidentally return a vector
  rep(n, n)
}

tibble(n = 2:0) %>% 
  group_by(n) %>% 
  summarize(out = my_custom_summary_function(n), .groups = "drop", .multi = "fail") %>% 
  ungroup()
## Error: `out` has length != 1 in groups 1, 3, use `.multi = "allow"` if this is intended

Imagined on 2022-08-01 by the reprex package (v2.0.1)
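
Until something like `.multi` exists, a fail-fast check can be approximated by hand. The following is only a sketch: `require_scalar()` is a hypothetical helper, not a dplyr function, and the grouped summary is imitated with base R's `vapply()` so the snippet runs without any packages.

```r
# Hypothetical helper (not part of dplyr): stop when a summary
# expression does not produce exactly one value.
require_scalar <- function(x) {
  if (length(x) != 1L) {
    stop("summary has length ", length(x), ", expected 1", call. = FALSE)
  }
  x
}

my_custom_summary_function <- function(n) {
  # Should return a scalar, but accidentally returns a vector
  rep(n, n)
}

# Base R stand-in for a grouped summary: one call per distinct value of n.
# The first group (n = 2) already trips the check.
result <- tryCatch(
  vapply(2:0, function(n) require_scalar(my_custom_summary_function(n)), integer(1)),
  error = function(e) conditionMessage(e)
)
result
```

The same helper could presumably be wrapped around the expression inside summarize(), at the cost of having to remember it in every call.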

@hadley
Member

hadley commented Aug 1, 2022

I'm not too worried about this; in general any misspecified summary function could corrupt data.

@hadley closed this as not planned Aug 1, 2022
@hadley reopened this Aug 3, 2022
@DavisVaughan
Member

@kcarnold

I have occasionally written multi-row summarize pipelines intentionally, but when I do that I need to very carefully document what that code is doing. When I teach summarize, the working model I use is "one row per group"; otherwise it gets confused with mutate. Yes, any bugs in summary functions could corrupt data, but this behavior should be opt-in because (1) R's recycling rules make this sort of behavior easy to trigger accidentally, (2) it's hard to notice and then diagnose when it does happen, and (3) since one-row-per-group is the expected behavior, code that does something different should look like it's doing something different.

@hadley
Member

hadley commented Nov 17, 2022

Our plan is to deprecate this behaviour in summarise() and instead introduce a new function specifically for this purpose. We just need a name. Ideas so far:

  • morph()
  • transmogrify()
  • multisummarise()
  • abridge()
  • remodel()
  • remould()
  • renovate()
  • revamp()
  • shorten()
  • contract()
  • lessen()
  • condense()
  • synopsize()

I think ideally the word would be closer to summarise() than mutate(), i.e. starting later in the alphabet or ending in ise (although then we'd need UK/US variants, which isn't ideal). I think it's ok if the verb implies an unconditional shrinking, even though some uses might increase the number of rows; we also say that [ subsets a vector.

@hadley added this to the 1.1.0 milestone Nov 17, 2022
@hadley added the "feature" (a feature request or enhancement) and "grouping 👨‍👩‍👧‍👦" labels Nov 17, 2022
@DavisVaughan
Member

I like morph() the most out of all of these. It implies some kind of stretching/shrinking of the data without implying any direction. And it is fairly short.

@romainfrancois
Member

romainfrancois commented Nov 18, 2022

I like morph() too. What will happen with morph(<grouped_df>)?

Are we considering being specific about shrinking/growing? i.e. we could have shrink() and grow() or something.

@DavisVaughan
Member

DavisVaughan commented Nov 18, 2022

morph(<grouped_df>) would have to work like summarise(<grouped_df>) currently works, I think. i.e. each group computation can return any number of rows, and we recycle the per group results "rowwise" across the resulting columns. And we'd add .by support for morph()

I'm somewhat confident we don't need to care about the direction, mainly it's:

  • summarise() has the guarantee of 1 row per group. More predictable for users. Harder to make a mistake. Easier database translations.
  • morph() just relaxes that guarantee, but otherwise works similarly. But when you see morph() in code it should be a clear signal that something is happening that isn't a pure summary, which is pretty nice
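
To make the "any number of rows per group" behaviour concrete, here is a rough base R sketch. `apply_by_group()` is an illustrative name invented for this example, not the proposed verb or any dplyr function, and it handles only a single output column.

```r
# Hypothetical base R sketch: apply a function to each group and keep
# however many values it returns, recycling the group key to match.
apply_by_group <- function(df, by, fn) {
  pieces <- split(df, df[[by]])
  out <- lapply(names(pieces), function(key) {
    values <- fn(pieces[[key]])
    # The key is repeated once per returned value (zero times if none)
    data.frame(key = rep(key, length(values)), value = values)
  })
  do.call(rbind, out)
}

df <- data.frame(g = c(1, 1, 2, 2, 2), x = c(4, 5, 1, 2, 3))

# One value per group: behaves like a summary
one_row <- apply_by_group(df, "g", function(d) mean(d$x))

# Two values per group: no longer a pure summary
two_rows <- apply_by_group(df, "g", function(d) range(d$x))
```

With the first function the result has one row per group, which is the guarantee summarise() would keep; with the second, each group contributes two rows, which is the relaxed case.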

@DavisVaughan self-assigned this Nov 18, 2022
@DavisVaughan
Member

DavisVaughan commented Nov 19, 2022

I'm remembering that tidygraph uses morph(), which might be enough to prevent us from using it.

I also thought of restructure(), which is kind of nice because it is closer to summarise() in the alphabet and the core part of each verb starts with s (structure and summarise). And it seems to nicely convey that you are taking an existing data frame and reworking it into some new form (with little restriction on the number of rows or columns). The only potential problem is possible confusion with reshape(), but I think I'm ok with it.

It seems somewhat reasonable to say that summarise() is a restricted version / special case of restructure().

@kcarnold

Since this operation is sometimes called "split-apply-combine", perhaps recombine, or rebuild, reconstruct, remake, or reform? Since we're making an entirely new data frame by combining the results of operations on each group.

Or, more related to existing verbs, something based on bind_rows? bind_rows_groupwise? tibble_groupwise?

@DavisVaughan
Member

Or just build(), i.e. "build a new data frame from an existing one", if we aren't worried about conflicting with devtools::build() that sounds pretty good

@hadley
Member

hadley commented Nov 22, 2022

Another build synonym would be assemble().

In the building/construction metaphor: renovate()

Crazy idea: this function is a sort of combination of mutate() and summarise() so we could call it summate(), which means to sum up.

@DavisVaughan
Member

DavisVaughan commented Nov 22, 2022

I think I'd be fairly happy with assemble()

  • It doesn't immediately come up as being used by any big packages
  • Still no direction implied in the name, which I like
  • I like this idea of using the name to reflect that this "creates a new data frame", which we have always described summarise() as theoretically doing
  • I like that it doesn't start with re*()

@lionel-
Member

lionel- commented Nov 22, 2022

I like the crazy idea (summate()) because it explains what it does (relax the size constraints of mutate and summarise so it can be anything in between) without really introducing a new verb (it's a portmanteau).

Among the other suggestions I prefer morph() for the same reason, because of this idea of an unconstrained form of the result.

@lionel-
Member

lionel- commented Nov 22, 2022

Since you can also expand the rows, I think summate is not such a good name after all.

Maybe a verb like remodel() would be a good way of expressing the change in shape.

@lionel-
Member

lionel- commented Nov 22, 2022

In another direction, verbs that imply recreating a data frame:

retibble()
reframe()
redefine()

Relationship between reframe and tibble frame functions:

enframe: vector → df
deframe: df → vector
reframe: df → df
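
For readers who have not used the tibble helpers, a small sketch of the first two directions follows. enframe() and deframe() are existing tibble functions; the last step is only a base R stand-in for "df → df" and does not represent the proposed verb.

```r
library(tibble)

# A named vector to start from
v <- c(a = 1, b = 2)

# enframe(): vector -> two-column data frame (name, value)
df <- enframe(v)

# deframe(): two-column data frame -> named vector
back <- deframe(df)

# The remaining case, df -> df, is what the new verb would cover.
# Base R stand-in for illustration only: keep the values above 1.
reshaped <- df[df$value > 1, ]
```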

@wurli

wurli commented Nov 24, 2022

FWIW, as a long time dplyr user I'm not hugely keen on morph() - in my mind it doesn't feel suggestive of summarise()-like behaviour. Of all the suggestions so far I like multisummarise() best, but I feel like there's a better counterpoint out there. Some extra suggestions:

  • elaborate()
  • abbreviate()
  • telescope()
  • restate()
  • revise()

@DavisVaughan
Member

DavisVaughan commented Nov 25, 2022

> it doesn't feel suggestive of summarise()-like behaviour.

I somewhat strongly believe that you should not try to connect this new verb to summarise() too closely in your head:

  • summarise(): reduce each group down to 1 row
  • new verb: "do something" to each group

It just happens to be that summarise() is a "special case" of this new verb, but in terms of daily practical usage that is as far as I'd take the comparison.

Real-life usage of this new verb typically looks awkward if summarise() is in the name, because it very often isn't actually performing any kind of summary operation.

@kcarnold

Some real-world examples would help in picking the name. (I thought I'd had some, but in a quick look through my stuff, I only found examples of where the multi-row behavior wasn't what I'd wanted.)

@eutwt
Contributor

eutwt commented Nov 26, 2022

Throwing another name into the hat, because I like short names, I'll suggest draw() as in either (take your pick)

  • to "draw" out specific data from a tibble, as in "draw water from a well"

  • to "draw" a new tibble from an existing one, as in "draw a picture"

@DavisVaughan
Member

A few real life examples.

With ivs, which generally takes sets of intervals and returns other sets of arbitrary size (notably, it can return more or fewer rows than you started with!)

library(dplyr)
library(ivs)

df <- tibble(
  start = as.Date(c("2019-01-01", "2019-01-04", "2019-01-07")),
  end = as.Date(c("2019-01-05", "2019-01-06", "2019-01-08"))
) %>%
  mutate(iv = iv(start, end), .keep = "none")

df
#> # A tibble: 3 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-05)
#> 2 [2019-01-04, 2019-01-06)
#> 3 [2019-01-07, 2019-01-08)

# Merge all overlapping ranges
df %>%
  morph(iv = iv_groups(iv))
#> # A tibble: 2 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-06)
#> 2 [2019-01-07, 2019-01-08)

# Split all overlapping ranges into non-overlapping disjoint sets
df %>%
  morph(iv = iv_splits(iv))
#> # A tibble: 4 × 1
#>                         iv
#>                 <iv<date>>
#> 1 [2019-01-01, 2019-01-04)
#> 2 [2019-01-04, 2019-01-05)
#> 3 [2019-01-05, 2019-01-06)
#> 4 [2019-01-07, 2019-01-08)

Similar idea with intersect():

library(dplyr, warn.conflicts = FALSE)

table <- c("a", "b", "d", "f")

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

# `morph()` allows you to apply functions that return
# an arbitrary number of rows
df %>%
  morph(x = intersect(x, table))
#> # A tibble: 4 × 1
#>   x    
#>   <chr>
#> 1 a    
#> 2 b    
#> 3 f    
#> 4 d

Doing something silly like reproducing slice_sample()

library(dplyr)
df <- tibble(
  g = c(1, 1, 2, 2, 2),
  x = c(4, 5, 1, 2, 3)
)
df %>%
  morph(x = sample(x, 4, replace = TRUE), .by = g)
#> # A tibble: 8 × 2
#>       g     x
#>   <dbl> <dbl>
#> 1     1     4
#> 2     1     5
#> 3     1     4
#> 4     1     4
#> 5     2     2
#> 6     2     3
#> 7     2     2
#> 8     2     3

An older pattern combined with read_csv() and multiple files, from the original dplyr 1.0.0 blog post about this feature

tibble(path = dir(pattern = "\\.csv$")) %>% 
  rowwise(path) %>% 
  morph(read_csv(path))

@DavisVaughan
Member

DavisVaughan commented Nov 27, 2022

How about create()?

  • Because you "create a new result from each group" (this would be the help page title)
  • Can also be seen as "create a new data frame from an existing one"
    • Which ties to our theoretical beliefs that this and summarise() create a "new" data frame, as opposed to mutate()
  • Easy to tie to summarise(), because that "creates a 1 row summary from each group". So it is a special case of this.
  • Does not imply a direction
  • Does not imply a number of rows returned
  • Does not seem to be taken by any packages
  • The name works very well with all of my real life examples above, even the read_csv() one

With the ivs example, I would say that I "create the groups by merging the overlapping ranges", and the code is create(groups = iv_groups(iv)).

So far this is my favorite option


Subjective reasons I like it:

  • Has an artistic flair to it. "Creation" has fewer rules tied to it, e.g. the rules about the number of rows returned
  • Fairly short name
  • It is a name with positive connotations
  • Feels along the same lines as mutate() and summarise()

@lionel-
Member

lionel- commented Nov 28, 2022

I like it. It seems a bit too general to me though, compared to something like reframe() which is a more practical description of what is happening. But I agree that it feels more similar to mutate and summarise.

@romainfrancois
Member

I like create() and believe it would read very well with .by =

@wurli

wurli commented Nov 28, 2022

I also like create() a lot but agree that it possibly feels overly general

@lionel-
Member

lionel- commented Nov 28, 2022

When you consider the family it's not immediately obvious why create() is called like that because all the verbs are an act of creation:

  • mutate() creates new columns or recreates existing columns within an existing data frame.
  • summarise() creates a new data frame with size-1 summaries from an existing one.
  • create() creates a new data frame from an existing one.

I think this illustrates why create() is too general.

@DavisVaughan
Member

I actually liked create() because it was fairly general 😆

The fact that you can describe mutate() and summarise() using the word "create" didn't bother me too much, since their names imply they are stricter variants of it. create() is just an act of creation with the fewest restraints possible

@eutwt
Contributor

eutwt commented Nov 28, 2022

I think create() feels a little strange because the object of the verb (as you'd use it in normal speech) is the output instead of the input. Like, you summarise()/mutate() an existing data frame but you create() a new data frame. That being said, I think it does seem to work better than the other suggestions (incl. mine above)

@yutannihilation
Member

I like create(), but I'm afraid it sounds too magical. In my understanding, the function is rather for experts compared to mutate() and summarize() with single-row-returns, so probably it should sound more difficult.

What about explode(), which is used in Hive/Spark SQL? c.f. https://spark.apache.org/docs/latest/api/sql/index.html#explode

@mine-cetinkaya-rundel
Member

I've been thinking about this for a few days but I haven't come up with a new name that I like, and I'm afraid I don't think any of the ones suggested here sit right with me. At the risk of being very not creative, I would suggest something like multi_summarize() or summarize_multi().

Otherwise reframe() makes the most sense but I think it won't be trivial to teach when to use reframe() vs. summarize(), as in, how will someone know they should use summarize() instead of reframe()? (Though this is maybe more of a comment on the function's functionality than its name.)

@shahronak47

Maybe it's only me, but I am not completely convinced that we need a completely new function here. I actually liked @krlmlr's initial suggestion of a separate .multi argument in summarise() that defines the behaviour. I can't find the discussion of why that idea was rejected.

@hadley
Member

hadley commented Nov 29, 2022

@shahronak47

  • We no longer believe that summarise() is a good name for this sort of operation.
  • Some backends (e.g. databases) can't support this behaviour at all.
  • Adding an additional argument to summarise() ends up making the call much longer.

@DavisVaughan
Member

DavisVaughan commented Nov 29, 2022

@shahronak47 the ivs examples linked here are good examples of how this typically isn't a summary operation, so summarise() isn't a good name for this. In particular, note that depending on the function you use, this operation can actually return more rows than you started with.

This verb is much more akin to do() than summarise() in my mind, even if the API itself looks a little closer to summarise() (the fact that the API looks more similar to summarise() is actually a big win in my mind over the previous do() syntax)
