Implement mutate(.when =)
#6313
Random lurker here, but just my general thoughts. Having the ability to apply As I understand the alternative with the current API proposed is to have multiple calls to
tibble(a = c(1,2,NA,3,5),
b = c("car", "train", "bus", "truck", "motorcycle"),
c = c(100,20,30,40,500)
)|>
mutate(a = if_else(is.na(a), 0, a),
b = if_else(b == "truck", "pick-up", b),
c = if_else(c < 100, c * 10, c)
)
# Example of .when non-global
tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)) |>
  mutate(a = 0: .when = is.na(a),
         b = 'pick-up': .when = b == "truck",
         c = c * 10: .when = c < 100)
# Current API ?
# Can get quite cumbersome with many conditions
tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)) |>
  mutate(a = 0,
         b = 'pick-up',
         c = c * 10,
         .when = is.na(a) | b == "truck" | c < 100)
# Compare performance to single mutate with multiple if_else statements
tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)) |>
  mutate(a = 0, .when = is.na(a)) |>
  mutate(b = "pick-up", .when = b == "truck") |>
  mutate(c = c * 10, .when = c < 100)
Your third example with
Ah I see, I misunderstood your example at first. Though, it would be interesting to see the performance comparison between these.
tibble(a = c(1,2,NA,3,5),
b = c("car", "train", "bus", "truck", "motorcycle"),
c = c(100,20,30,40,500)
)|>
mutate(a = if_else(is.na(a), 0, a),
b = if_else(b == "truck", "pick-up", b),
c = if_else(c < 100, c * 10, c)
)
tibble(a = c(1,2,NA,3,5),
b = c("car", "train", "bus", "truck", "motorcycle"),
c = c(100,20,30,40,500)
)|>
mutate(a = 0, .when = is.na(a)) |>
mutate(b = "pick-up", .when = b == "truck") |>
mutate(c = c * 10, .when = c <100)
If you sample that up to 1 million rows (to actually get useful benchmarks), then it is pretty clear that repeated mutates are much faster than if_else(). I also think the
library(dplyr)
df <- tibble(a = c(1,2,NA,3,5),
b = c("car", "train", "bus", "truck", "motorcycle"),
c = c(100,20,30,40,500)
)
# sample up to 1 mil rows
df <- tibble::new_tibble(lapply(df, sample, size = 1e6, replace = TRUE))
bench::mark(
ifelse = df |>
mutate(
a = if_else(is.na(a), 0, a),
b = if_else(b == "truck", "pick-up", b),
c = if_else(c < 100, c * 10, c)
),
when = df |>
mutate(a = 0, .when = is.na(a)) |>
mutate(b = "pick-up", .when = b == "truck") |>
mutate(c = c * 10, .when = c <100)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 ifelse      167.3ms    235ms      4.37   237.9MB     18.9
#> 2 when         48.7ms   57.6ms     15.6     55.1MB     13.7
It is apparent that this is needed with the bug fixes in rlang 1.0.3, otherwise an internal dplyr function will be reported as the source of the error
I'm sure there was a lot of discussion around this, but I am going to throw in a vote for this being confusing. The idea that it matches SQL seems less relevant than the fact that The performance loss seems worth it to keep things consistent with other grouping behaviors. I really like the rest of the API btw. Seems very useful.
@markfairbanks do you think that this should apply the
df %>% mutate(x = mean(y), .when = is.na(x), .by = g)
For reference, equivalent data.table syntax applies I was optimizing for this potential syntax, assuming that I did just learn that apparently you can't combine
In my head As for Also worth mentioning that
library(data.table)
df <- data.table(group = c("a", "a", "b"), val = as.double(1:3))
copy(df)[, .SD[val <= mean(val), val := mean(val)], by = group][]
#> Error: .SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.
Note (mainly to self): If we apply
The current implementation "drops" groups, but it is clear that that is intended because theoretically it currently filters with I guess this is the
library(dplyr)
df <- tibble(
x = c(7, 7, 9, 4, 5, 6),
g = c(1, 1, 1, 2, 2, 2)
) %>%
group_by(g)
df %>%
filter(x < 6) %>%
group_data()
#> # A tibble: 1 × 2
#>       g       .rows
#>   <dbl> <list<int>>
#> 1     2         [2]
df %>%
filter(x < 6, .preserve = TRUE) %>%
group_data()
#> # A tibble: 2 × 2
#>       g       .rows
#>   <dbl> <list<int>>
#> 1     1         [0]
#> 2     2         [2]
Created on 2022-06-28 by the reprex package (v2.0.1)
An example where grouped mutates with
dat |> mutate(x = mean(x, na.rm = TRUE), .when = is.na(x), .by = g)
dat |> mutate(x = mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE), .when = x > mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE), .by = g)
@bwiernik it doesn't work that way,
Just out of curiosity (and not understanding the underlying mechanics) why is this implemented as an argument imo, the syntactic form of So could it also not be that this:
could have looked like:
and arguably been a more familiar way of working with dplyr?
The colon is not the actual API, that was just a random proposal. I think your syntax is good, and something they were considering as well.
@DavisVaughan I somehow missed this as well. To be honest I thought If that's the intended use of
I think we are going to close this for now. We aren't entirely convinced that this will benefit a large part of the user base, as we have struggled to come up with a large number of examples where this is useful, outside of replacing missing values. The addition of We may return to this idea in the future.
Closes #4050
Closes #6304
This is a fully tested implementation of mutate(.when =).
Implementation details
A few notes on how it works:
- .when must evaluate to a logical vector the same size as .data. It isn't recycled.
- Groups are ignored when computing .when. This might be a little controversial, but I think it makes the most sense:
  - […] mutate(df, .when =, .by =).
  - […] .when to be evaluated per group (it rarely needs a per-group mean or something like that), but sometimes you want your expressions in ... to still be evaluated per group after applying a global .when. You save a lot of performance in this case by evaluating .when on the ungrouped data.
  - […] .when, you should just use if_else() instead, since expressions in the ... are evaluated per group.
- It is hooked into the data mask to be performant. Only columns that are referenced in ... are sliced to the locations referred to by .when.
- .when is mainly useful for updating existing columns. Because of this, you can't modify the type of the columns you are updating. i.e. if x is an integer column then you can't do mutate(df, x = x + 1.5, .when = y > 2).
  - […] if_else() for updates, because that takes the common type, i.e. x = if_else(y > 2, x + 1.5, x) would not be type stable on x.
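The size check, row slicing, and type-stable write-back described in the notes above can be emulated with functions that already exist. This is a minimal sketch, not the PR's implementation: `update_when()` is a hypothetical helper, and treating `NA` in the condition as "do not update" is an assumption borrowed from `filter()`. It relies on `vctrs::vec_assign()`, which casts the new values to the existing column type.

```r
library(dplyr)   # for tibble()
library(vctrs)   # for vec_assign()

# Hypothetical helper, not part of dplyr: update one column at the rows
# where `when` is TRUE and leave every other row untouched.
update_when <- function(data, when, col, fn) {
  # The condition must already match the data size; it is not recycled.
  stopifnot(is.logical(when), length(when) == nrow(data))

  # Assumption: NA in the condition means "do not update", as in filter().
  loc <- which(when)

  # Evaluate the update on the selected rows only.
  new <- fn(data[loc, , drop = FALSE])

  # vec_assign() casts `new` to the existing column type and errors if the
  # cast would lose information, so the column type cannot change.
  data[[col]] <- vec_assign(data[[col]], loc, new)
  data
}

df <- tibble(x = 1:4, y = c(1, 3, 5, 0))

out <- update_when(df, df$y > 2, "x", function(d) d$x * 10L)
out$x  # still an integer column: 1, 20, 30, 4

# A fractional update cannot be stored in an integer column, so this errors.
res <- tryCatch(
  update_when(df, df$y > 2, "x", function(d) d$x + 1.5),
  error = function(e) "lossy cast"
)
res
```

The second call fails for the same reason the notes give for rejecting `x = x + 1.5` on an integer column: the result would have to change the column's type.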
Outstanding questions
- Should .when allow if_any() and if_all()? It seems like they might be useful. Right now it requires .when to evaluate to a single logical vector. I don't think it should allow across() though.
- Are we ok with this single condition interface? I am. We had a lot of discussion about alternative interfaces that might allow case-when style updates like mutate(when(x < 2, x = NA, y = 3), when(x < 5, x = 99)), but I think:
  - […] when() helper that wouldn't be used anywhere else
  - […] when() and normal expressions in the same mutate() call? Do you have to recompute groups before each when()?
  - […] UPDATE statement for dbplyr, I think.
Outstanding actions
- […] vec_locate_runs() from vctrs if we end up using it here in the final version of this PR
Examples
Performance
I think performance is generally pretty good. My bar was mainly to be faster than what you could do with an if_else(). It is a lot more compact than an if_else() too.
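The grouping behaviour described under Implementation details (a condition computed once on the ungrouped data, with the update expressions still evaluated per group on the selected rows) can be approximated with existing verbs. The sketch below only illustrates that order of operations; the data and the three-step split are illustrative assumptions, not code from the PR.

```r
library(dplyr)

df <- tibble(
  g = c(1, 1, 1, 2, 2, 2),
  x = c(7, 7, 9, 4, 5, 6)
)

# Step 1: evaluate the condition once on the ungrouped data.
rows <- which(df$x > 4)

# Step 2: evaluate the expression per group, but only on the selected rows,
# so each group mean is taken over the rows that passed the condition.
updated <- df[rows, ] |>
  group_by(g) |>
  mutate(x = x - mean(x)) |>
  ungroup()

# Step 3: write the results back; unselected rows keep their old values.
out <- df
out$x[rows] <- updated$x
out$x
```

Row 4 (`x = 4`) fails the condition, so it is left unchanged and does not contribute to the mean of group 2.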