summarize() with multi-row returns #6382
Comments
I'm not too worried about this; in general any misspecified summary function could corrupt data. |
I have occasionally written multi-row |
Our plan is to deprecate this behaviour in
I think ideally the word would be closer to |
I like |
I like Are we considering being specific about shrinking/growing? i.e. we could have |
I'm somewhat confident we don't need to care about the direction, mainly it's:
|
I'm remembering that tidygraph uses I also thought of It seems somewhat reasonable to say that |
Since this operation is sometimes called "split-apply-combine", perhaps Or, more related to existing verbs, something based on |
Or just |
Another build synonym would be In the building/construction metaphor: Crazy idea: this function is a sort of combination of |
I think I'd be fairly happy with
|
I like the crazy idea ( Among the other suggestions I prefer |
Since you can also expand the rows, I think summate is not such a good name after all. Maybe a verb like |
In another direction, verbs that imply recreating a data frame: retibble()
reframe()
redefine() Relationship between reframe and tibble frame functions:
|
FWIW, as a long time dplyr user I'm not hugely keen on
|
I somewhat strongly believe that you should not try to connect this new verb to
It just happens to be that Real-life usage of this new verb typically looks awkward if |
Some real-world examples would help in picking the name. (I thought I'd had some, but in a quick look through my stuff, I only found examples of where the multi-row behavior wasn't what I'd wanted.) |
Throwing another name into the hat, because I like short names, I'll suggest
|
A few real-life examples.
With ivs, which generally takes sets of intervals and returns other sets of arbitrary size (notably, it can return more or fewer rows than you started with!):
library(dplyr)
library(ivs)
df <- tibble(
  start = as.Date(c("2019-01-01", "2019-01-04", "2019-01-07")),
  end = as.Date(c("2019-01-05", "2019-01-06", "2019-01-08"))
) %>%
  mutate(iv = iv(start, end), .keep = "none")
df
#> # A tibble: 3 × 1
#> iv
#> <iv<date>>
#> 1 [2019-01-01, 2019-01-05)
#> 2 [2019-01-04, 2019-01-06)
#> 3 [2019-01-07, 2019-01-08)
# Merge all overlapping ranges
df %>%
  morph(iv = iv_groups(iv))
#> # A tibble: 2 × 1
#> iv
#> <iv<date>>
#> 1 [2019-01-01, 2019-01-06)
#> 2 [2019-01-07, 2019-01-08)
# Split all overlapping ranges into non-overlapping disjoint sets
df %>%
  morph(iv = iv_splits(iv))
#> # A tibble: 4 × 1
#> iv
#> <iv<date>>
#> 1 [2019-01-01, 2019-01-04)
#> 2 [2019-01-04, 2019-01-05)
#> 3 [2019-01-05, 2019-01-06)
#> 4 [2019-01-07, 2019-01-08)
Similar idea with
library(dplyr, warn.conflicts = FALSE)
table <- c("a", "b", "d", "f")
df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)
# `morph()` allows you to apply functions that return
# an arbitrary number of rows
df %>%
  morph(x = intersect(x, table))
#> # A tibble: 4 × 1
#> x
#> <chr>
#> 1 a
#> 2 b
#> 3 f
#> 4 d
Doing something silly like reproducing
library(dplyr)
df <- tibble(
  g = c(1, 1, 2, 2, 2),
  x = c(4, 5, 1, 2, 3)
)
df %>%
  morph(x = sample(x, 4, replace = TRUE), .by = g)
#> # A tibble: 8 × 2
#> g x
#> <dbl> <dbl>
#> 1 1 4
#> 2 1 5
#> 3 1 4
#> 4 1 4
#> 5 2 2
#> 6 2 3
#> 7 2 2
#> 8 2 3
An older pattern combined with
tibble(path = dir(pattern = "\\.csv$")) %>%
  rowwise(path) %>%
  morph(read_csv(path))
|
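For readers who want to run the `intersect()` example above: `morph()` is only a placeholder name in this thread. The sketch below assumes dplyr 1.1.0 or later, where the verb shipped as `reframe()`, one of the names floated earlier in the discussion. The grouped call via `.by = g` is an added illustration and is not part of the original comment.

```r
library(dplyr, warn.conflicts = FALSE)

table <- c("a", "b", "d", "f")

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2, 2),
  x = c("e", "a", "b", "c", "f", "d", "a")
)

# Ungrouped: the whole column is passed to intersect(), which may return
# any number of values, so the result can have any number of rows.
out_all <- df %>%
  reframe(x = intersect(x, table))

# Grouped (illustrative addition): intersect() runs once per value of `g`,
# and each group contributes as many rows as its result has elements.
out_by_g <- df %>%
  reframe(x = intersect(x, table), .by = g)

print(out_all)
print(out_by_g)
```

The result is always ungrouped, and the number of rows is determined entirely by what the function returns, which is the behaviour the thread is trying to name.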
How about
With the ivs example, I would say that I "create the groups by merging the overlapping ranges", and the code is So far this is my favorite option Subjective reasons I like it:
|
I like it. It seems a bit too general to me though, compared to something like |
I like |
I also like |
When you consider the family it's not immediately obvious why
I think this illustrates why |
I actually liked The fact that you can describe |
I think |
I like What about |
I've been thinking about this for a few days but I haven't come up with a new name that I like, and I'm afraid I don't think any of the ones suggested here sit right with me. At the risk of being very not creative, I would suggest something like Otherwise |
Maybe it's only me, but I am not completely convinced that we need a completely new function here. I actually liked @krlmlr's initial suggestion of having a separate argument |
|
@shahronak47 the ivs examples linked here are good examples of how this typically isn't a summary operation, so This verb is much more akin to |
As of dplyr 1.0.0, summarize() will create multiple rows per group, according to the length of the return value of the summary function. This new feature leads to unintended behavior if the vector return is accidental, and also can lead to data loss.
Created on 2022-08-01 by the reprex package (v2.0.1)
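The original reprex did not survive extraction, so the following is a stand-in rather than the author's example. It is a minimal sketch of the two failure modes described: an accidental vector return that multiplies rows, and a zero-length return that silently drops a group. It assumes dplyr 1.1.0 or later and uses `reframe()` to show the multi-row semantics without triggering the deprecation warning that `summarize()` gives in those versions. The data and column names are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

df <- tibble(
  g = c(1, 1, 2, 2, 2),
  x = c(4, 5, 1, 2, 3)
)

# Accidental vector return: range() yields two values per group, so a
# call that was meant to give one row per group gives two.
accidental <- df %>%
  reframe(r = range(x), .by = g)

# Zero-length return: no value in group 2 exceeds 3, so that group
# contributes no rows and disappears from the result.
lossy <- df %>%
  reframe(big = x[x > 3], .by = g)

print(accidental)
print(lossy)
```

Neither call raises an error, which is why a mistake of this kind can go unnoticed when the multi-row result was not intended.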
Should we introduce a .multi = c("allow", "require", "fail") argument that supports the pre-1.0.0 strict mode of operation? Should .multi = "fail" even be the default?
Imagined on 2022-08-01 by the reprex package (v2.0.1)
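To make the proposal concrete, here is a rough base-R sketch of how the three `.multi` modes could behave. `summarise_by()` is a hypothetical helper written for this illustration, not a dplyr function, and dplyr has no `.multi` argument. The meaning of `"require"` is not spelled out in the issue, so the sketch assumes it means "error unless at least one group returns something other than a single value".

```r
# Hypothetical helper, NOT part of dplyr: split `x` by `g`, apply `f` to
# each piece, and check the per-group result sizes according to `.multi`.
summarise_by <- function(x, g, f, .multi = c("allow", "require", "fail")) {
  .multi <- match.arg(.multi)

  pieces <- lapply(split(x, g), f)
  sizes <- lengths(pieces)

  if (.multi == "fail" && any(sizes != 1L)) {
    stop("`f` must return exactly one value per group.")
  }
  # Assumed meaning of "require": a multi-row result is expected.
  if (.multi == "require" && all(sizes == 1L)) {
    stop("`f` returned one value per group, but a multi-row result was required.")
  }

  data.frame(
    g = rep(names(pieces), times = sizes),
    value = unlist(pieces, use.names = FALSE),
    stringsAsFactors = FALSE
  )
}

x <- c(4, 5, 1, 2, 3)
g <- c(1, 1, 2, 2, 2)

# Strict mode: fine for a true summary function such as mean().
strict_ok <- summarise_by(x, g, mean, .multi = "fail")

# Permissive mode: range() returns two values per group, giving four rows.
permissive <- summarise_by(x, g, range, .multi = "allow")

# Strict mode turns the accidental vector return into an error.
strict_err <- tryCatch(
  summarise_by(x, g, range, .multi = "fail"),
  error = function(e) conditionMessage(e)
)
```

With `"fail"` as the default, the accidental case would surface immediately as an error instead of as extra rows.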