
Implement mutate(.when =) #6313

Closed
DavisVaughan wants to merge 4 commits from the feature/mutate-when branch

Conversation

@DavisVaughan (Member) commented on Jun 28, 2022

Closes #4050
Closes #6304

This is a fully tested implementation of mutate(.when =).

Implementation details

A few notes on how it works:

  • .when must evaluate to a logical vector the same size as .data. It isn't recycled.

  • Groups are ignored when computing .when. This might be a little controversial, but I think it makes the most sense:

    • It matches how SQL's WHERE clause works when combined with GROUP BY. The WHERE is applied first and then the groups are computed on the remaining rows. This will be clearer if we have mutate(df, .when =, .by =).
    • You generally don't want .when to be evaluated per group (the condition rarely needs a per-group statistic like a group mean), but you sometimes want your expressions in ... to still be evaluated per group after applying a global .when. In that case you save a lot of performance by evaluating .when on the ungrouped data.
    • If you really need per-group .when, you should just use if_else() instead, since expressions in the ... are evaluated per group.
  • It is hooked into the data mask to be performant. Only columns that are referenced in ... are sliced to the locations referred to by .when.

  • .when is mainly useful for updating existing columns. Because of this, you can't modify the type of the columns you are updating, e.g. if x is an integer column then you can't do mutate(df, x = x + 1.5, .when = y > 2).

    • This type stability is an extremely nice property. It makes .when more valuable than if_else() for updates, because if_else() takes the common type of its branches, i.e. x = if_else(y > 2, x + 1.5, x) would not be type stable on x (see the illustration after this list).
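
A small illustration of the contrast with if_else() (this toy df is not from the PR; under the proposal here, mutate(df, x = x + 1.5, .when = y > 2) would instead error rather than change the type of x):

library(dplyr)

df <- tibble(x = 1:5, y = 1:5)

# if_else() takes the common type of both branches, so the integer column x
# is promoted to double for every row, even those where y <= 2
mutate(df, x = if_else(y > 2, x + 1.5, x))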

Outstanding questions

  • Should .when allow if_any() and if_all()? It seems like they might be useful. Right now it requires .when to evaluate to a single logical vector. I don't think it should allow across() though.

  • Are we ok with this single condition interface? I am. We had a lot of discussion about alternative interfaces that might allow case-when style updates like mutate(when(x < 2, x = NA, y = 3), when(x < 5, x = 99)), but I think:

    • That is overly complicated for the 99% case that this is likely to be used for (1 condition, >=1 columns to update)
    • It introduces a clunky when() helper that wouldn't be used anywhere else
    • It opens up a lot of hard to answer questions like: Can you mix when() and normal expressions in the same mutate() call? Do you have to recompute groups before each when()?
    • Keeping it to 1 condition should allow us to nicely translate this to a SQL UPDATE statement for dbplyr, I think.

Outstanding actions

  • Export vec_locate_runs() from vctrs if we end up using it here in the final version of this PR

Examples

library(dplyr)

# From issue 4050
df <- tibble(a = 1:5)
mutate(df, a = a + 1, .when = a > 3)
#> # A tibble: 5 × 1
#>       a
#>   <int>
#> 1     1
#> 2     2
#> 3     3
#> 4     5
#> 5     6

df <- tibble(x = 1:4, y = 5:8)
mutate(df, x = -x, y = -y, .when = x < 4)
#> # A tibble: 4 × 2
#>       x     y
#>   <int> <int>
#> 1    -1    -5
#> 2    -2    -6
#> 3    -3    -7
#> 4     4     8

# More practical. Replace missing values.
df <- tibble(x = c(1, 2, NA, 4, NA))
mutate(df, x = 0, .when = is.na(x))
#> # A tibble: 5 × 1
#>       x
#>   <dbl>
#> 1     1
#> 2     2
#> 3     0
#> 4     4
#> 5     0

# Standardize missings across variables
df <- tibble(x = c(1, NA, 2, 4), y = c(1, 2, NA, 4), z = c(NA, "b", "c", "d"))
mutate(df, x = NA, y = NA, z = NA, .when = is.na(x) | is.na(y) | is.na(z))
#> # A tibble: 4 × 3
#>       x     y z    
#>   <dbl> <dbl> <chr>
#> 1    NA    NA <NA> 
#> 2    NA    NA <NA> 
#> 3    NA    NA <NA> 
#> 4     4     4 d
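
For comparison, a rough way to express this last standardization without .when, reusing the df from just above; the any_na helper column is only an illustration device, and this is also the kind of condition the if_any()/if_all() question above refers to:

df |>
  mutate(any_na = if_any(everything(), is.na)) |>
  # replace() keeps each column's type while overwriting the flagged rows
  mutate(across(c(x, y, z), ~ replace(.x, any_na, NA))) |>
  select(-any_na)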

Performance

I think performance is generally pretty good. My bar was mainly to be faster than what you could do with an if_else(). It is a lot more compact than an if_else() too.

library(nycflights13)
library(vctrs)
library(dplyr)

big_flights <- vec_rep(flights, 100)
big_grouped_flights <- group_by(big_flights, year, month)

# Small data, one condition, no groups
bench::mark(
  slow = mutate(
    flights,
    dep_delay = if_else(dep_time > 500, -dep_delay, dep_delay, missing = dep_delay)
  ),
  fast = mutate(
    flights,
    dep_delay = -dep_delay,
    .when = dep_time > 500
  ),
  iterations = 5
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 slow         25.3ms   25.5ms      39.0    41.7MB        0
#> 2 fast         12.5ms   12.8ms      78.4    14.1MB        0

# Small data, two conditions, no groups
bench::mark(
  slow = mutate(
    flights,
    dep_delay = if_else(dep_time > 500, -dep_delay, dep_delay, missing = dep_delay),
    sched_dep_time = if_else(dep_time > 500, -sched_dep_time, sched_dep_time, missing = sched_dep_time)
  ),
  fast = mutate(
    flights,
    dep_delay = -dep_delay,
    sched_dep_time = -sched_dep_time,
    .when = dep_time > 500
  ),
  iterations = 5
)
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 slow         25.7ms   36.2ms      27.9    72.1MB        0
#> 2 fast         13.9ms     15ms      67.7    17.8MB        0

# Big data, two conditions, no groups
bench::mark(
  slow = mutate(
    big_flights,
    dep_delay = if_else(dep_time > 500, -dep_delay, dep_delay, missing = dep_delay),
    sched_dep_time = if_else(dep_time > 500, -sched_dep_time, sched_dep_time, missing = sched_dep_time)
  ),
  fast = mutate(
    big_flights,
    dep_delay = -dep_delay,
    sched_dep_time = -sched_dep_time,
    .when = dep_time > 500
  ),
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 slow          2.54s    2.72s     0.363    7.04GB    1.31 
#> 2 fast       806.87ms 972.97ms     1.01     1.73GB    0.805

# Big data, two conditions, with groups
bench::mark(
  slow = mutate(
    big_grouped_flights,
    dep_delay = if_else(dep_time > 500, -dep_delay, dep_delay, missing = dep_delay),
    sched_dep_time = if_else(dep_time > 500, -sched_dep_time, sched_dep_time, missing = sched_dep_time)
  ),
  fast = mutate(
    big_grouped_flights,
    dep_delay = -dep_delay,
    sched_dep_time = -sched_dep_time,
    .when = dep_time > 500
  ),
  iterations = 5
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 slow             5s    5.45s     0.183    7.92GB    0.623
#> 2 fast          2.33s    2.86s     0.339    3.71GB    0.679

@joeycouse commented

Random lurker here, but just my general thoughts: having the ability to apply .when non-globally would be the best drop-in replacement for if_else(), IMO. I understand that probably isn't the overall intent of this PR, but it would be much more useful in cases where there are many conditions and writing a global condition would be cumbersome or unreadable.

As I understand it, the alternative with the currently proposed API is to have multiple calls to mutate(), each with a separate condition. I think that could be a good performance benchmark to do.

tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)
       )|> 
  mutate(a = if_else(is.na(a), 0, a),
         b = if_else(b == "truck", "pick-up", b),
         c = if_else(c < 100, c * 10, c)
         )

# Example of .when non-global
tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)) |> 
  mutate(a = 0: .when = is.na(a),
         b = 'pick-up': .when = b == "truck",
         c = c * 10: .when = c <100)

# Current API ?
# Can get quite cumbersome with many conditions
tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)) |> 
  mutate(a = 0,
         b = 'pick-up',
         c = c * 10,
         .when = is.na(a)|b == "truck"|c<100)


# Compare performance to single mutate with multiple if_else statements
tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)) |> 
  mutate(a = 0, .when = is.na(a)) |> 
  mutate(b = "truck", .when = b == "truck") |> 
  mutate(c = c * 10, .when = c <100)

         

@DavisVaughan (Member, Author) commented

Your third example with .when = is.na(a)|b == "truck"|c<100 does something very different from the other three, right?

@DavisVaughan force-pushed the feature/mutate-when branch from b43962f to 39f246a on June 28, 2022 18:20
@joeycouse commented

Your third example with .when = is.na(a)|b == "truck"|c<100 does something very different from the other three, right?

Ah I see, I misunderstood your example at first. Though, it would be interesting to see the performance comparison between these.

tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)
       )|> 
  mutate(a = if_else(is.na(a), 0, a),
         b = if_else(b == "truck", "pick-up", b),
         c = if_else(c < 100, c * 10, c)
         )

tibble(a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle"),
       c = c(100,20,30,40,500)
       )|> 
  mutate(a = 0, .when = is.na(a)) |> 
  mutate(b = "truck", .when = b == "truck") |> 
  mutate(c = c * 10, .when = c <100)

@DavisVaughan (Member, Author) commented on Jun 28, 2022

If you sample that up to 1 million rows (to actually get useful benchmarks) then it is pretty clear that repeated mutates are much faster than if_else() (which I expected). The cost to set up the mutate() call isn't that high.

I also think the .when = approach looks much cleaner.

library(dplyr)

df <- tibble(a = c(1,2,NA,3,5),
             b = c("car", "train", "bus", "truck", "motorcycle"),
             c = c(100,20,30,40,500)
)

# sample up to 1 mil rows
df <- tibble::new_tibble(lapply(df, sample, size = 1e6, replace = TRUE))

bench::mark(
  ifelse = df |> 
    mutate(
      a = if_else(is.na(a), 0, a),
      b = if_else(b == "truck", "pick-up", b),
      c = if_else(c < 100, c * 10, c)
    ),
  when = df |> 
    mutate(a = 0, .when = is.na(a)) |> 
    mutate(b = "pick-up", .when = b == "truck") |> 
    mutate(c = c * 10, .when = c <100)
)
#> Warning: Some expressions had a GC in every iteration; so filtering is disabled.
#> # A tibble: 2 × 6
#>   expression      min   median `itr/sec` mem_alloc `gc/sec`
#>   <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl>
#> 1 ifelse      167.3ms    235ms      4.37   237.9MB     18.9
#> 2 when         48.7ms   57.6ms     15.6     55.1MB     13.7

It is apparent that this is needed with the bug fixes in rlang 1.0.3, otherwise an internal dplyr function will be reported as the source of the error
@markfairbanks (Contributor) commented

Groups are ignored when computing .when

I'm sure there was a lot of discussion around this, but I am going to throw in a vote for this being confusing. The idea that it matches SQL seems less relevant than the fact that filter and mutate both work by group. And having an argument inside mutate that mimics filter/if_else/case_when conditions but doesn't work by group is a bit odd (IMO).

The performance loss seems worth it to keep things consistent with other grouping behaviors.

I really like the rest of the API btw. Seems very useful.

@DavisVaughan (Member, Author) commented on Jun 28, 2022

@markfairbanks do you think that this should apply the .when expression before or after groups are computed?

df %>% mutate(x = mean(y), .when = is.na(x), .by = g)

For reference, equivalent data.table syntax applies .when (i) first on all of df and then computes the groups.

I was optimizing for this potential syntax, assuming that .when would be computed before .by. But maybe that would be unexpected there too?

I did just learn that apparently you can't combine UPDATE with GROUP BY in SQL, so this probably isn't that common of an operation. I honestly can't think of a super compelling example of a "grouped update", so if anyone else can then I'd love to see it.
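
A minimal data.table sketch of that i-before-by behavior (dt, x, and g are made up for illustration):

library(data.table)

dt <- data.table(x = c(1, 2, 3, 4, 5, 6), g = c(1, 1, 1, 2, 2, 2))

# i is applied to the whole table first (ungrouped), then `by` groups only
# the remaining rows, so min(x) in j sees just the subset selected by i
dt[x > 2, x := x - min(x), by = g][]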

@markfairbanks (Contributor) commented

In my head mutate(.by =) is shorthand for df %>% group_by() %>% mutate() %>% ungroup() so .when would be calculated after .by. Just like the ... in mutate() would be calculated after .by. This is the main reason why I think .when should operate by group. Especially after a group_by() I would be surprised that .when doesn't operate by group but everything else in my pipe chain does.

As for data.table - it's a bit of an interesting example because i never operates by group but j does. You have to do an extra step to filter by group in data.table. However filter() always operates by group in dplyr. So .when seems different to me than i in data.table.

Also worth mentioning that data.table errors if you try to filter/mutate by group, so I guess it behaves like SQL in that way. Definitely lends to your thought that it's an uncommon operation. But is an ungrouped condition with a grouped update that much more common? Maybe it is 🤷‍♂️.

library(data.table)

df <- data.table(group = c("a", "a", "b"), val = as.double(1:3))

copy(df)[, .SD[val <= mean(val), val := mean(val)], by = group][]
#> Error: .SD is locked. Using := in .SD's j is reserved for possible future use; a tortuously flexible way to modify by group. Use := in j directly to modify by group by reference.

@DavisVaughan (Member, Author) commented

Note (mainly to self): If we apply .when by group then we might end up with groups that don't have any rows left in them, i.e.:

x = [7, 7, 9] [4, 5, 6]
g = [1, 1, 1] [2, 2, 2]

.when = x < 6

# would this "filter" groups, retaining the number of groups?
x = [double()]  [4, 5]
g = [integer()] [2, 2]

# or drop empty groups?
x = [4, 5]
g = [2, 2]

The current implementation "drops" groups, but that is clearly intended: it currently filters with .when before computing the groups with .by, so there weren't really any groups to "drop" in the first place, because those rows didn't exist when the groups were computed.

I guess this is the filter(.preserve = TRUE/FALSE) problem. I really don't want to add another argument for this case if we don't have to.

library(dplyr)

df <- tibble(
  x = c(7, 7, 9, 4, 5, 6),
  g = c(1, 1, 1, 2, 2, 2)
) %>%
  group_by(g)

df %>%
  filter(x < 6) %>%
  group_data()
#> # A tibble: 1 × 2
#>       g       .rows
#>   <dbl> <list<int>>
#> 1     2         [2]

df %>%
  filter(x < 6, .preserve = TRUE) %>%
  group_data()
#> # A tibble: 2 × 2
#>       g       .rows
#>   <dbl> <list<int>>
#> 1     1         [0]
#> 2     2         [2]

Created on 2022-06-28 by the reprex package (v2.0.1)

@bwiernik commented

Examples of grouped mutates with .when that come to mind would be group-wise mean imputation of missing values or group-wise outlier trimming.

dat |> mutate(x = mean(x, na.rm = TRUE), .when = is.na(x), .by = g)

dat |> mutate(x = mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE), .when = x > mean(x, na.rm = TRUE) + 2 * sd(x, na.rm = TRUE), .by = g)

@DavisVaughan (Member, Author) commented

@bwiernik It doesn't work that way: x = mean(x, na.rm = TRUE) is only evaluated on the slice of dat where .when = is.na(x) is TRUE, so you'd be computing the mean of a vector of missing values.
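
For contrast, a sketch of the group-wise mean imputation @bwiernik is after, written with released dplyr (dat here is a toy stand-in):

library(dplyr)

dat <- tibble(x = c(1, NA, 3, NA), g = c(1, 1, 2, 2))

dat |>
  group_by(g) |>
  # mean() sees the whole group here, so coalesce() fills each NA with the
  # group mean of the non-missing values
  mutate(x = coalesce(x, mean(x, na.rm = TRUE))) |>
  ungroup()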

@drmowinckels commented

Just out of curiosity (and not understanding the underlying mechanics): why is this implemented as an argument .when rather than a dplyr function when(...), more akin to across()?

IMO, the syntactic form of .when here feels quite unfamiliar compared to how I'd expect mutate to work, especially in the cases where there are non-global whens. The colon is not used often (at all?) in other dplyr functions. At the same time, you'd want people to be able to have several changes in a single mutate, as we are used to.

So couldn't this:

tibble(
       a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle")
   ) |> 
  mutate(
         a = 0: .when = is.na(a),
         b = 'pick-up': .when = b == "truck",
         c = c * 10: .when = c <100
    )

have instead looked like:

tibble(
       a = c(1,2,NA,3,5),
       b = c("car", "train", "bus", "truck", "motorcycle")
   ) |> 
  mutate(
         when(is.na(a), a = 0),
         when(b == "truck", b = 'pick-up'),
         when(c < 100, c = c * 10)
    )

and arguably been a more familiar way of working with dplyr?
Then again, this might have already been a discussion with good reasons to not go this path.

@joeycouse commented

just out of curiosity (and not understanding the underlying mechanics) why is this implemented as an argument .when rather than a dplyr function when(...) more akin to across()? [...]

The colon is not the actual API; that was just a random proposal. I think your syntax is good, and something they were considering as well.

@markfairbanks (Contributor) commented

it doesn't work that way, x = mean(x, na.rm = TRUE) is only evaluated on the slice of dat where .when = is.na(x) is TRUE

@DavisVaughan I somehow missed this as well. To be honest I thought .when was more or less an alternate syntax for if_else(), but I guess this covers a slightly different use case? I had to go check and that's how data.table works as well.

If that's the intended use of .when I might have to retract my earlier vote 😅. It does seem highly unlikely to need this to work by slice and by group.
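
A sketch of that data.table check (toy data; the j expression only sees the rows selected by i, so the "group mean" is the mean of a vector of NAs):

library(data.table)

dat <- data.table(x = c(1, NA, 3, NA), g = c(1, 1, 2, 2))

# mean(x) is evaluated per group over only the rows where is.na(x) is TRUE,
# so the assigned value is NA rather than the group mean of the non-missing x
dat[is.na(x), x := mean(x), by = g][]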

@DavisVaughan (Member, Author) commented

I think we are going to close this for now. We aren't entirely convinced that this will benefit a large part of the user base, as we have struggled to come up with many examples where this is useful outside of replacing missing values.

The addition of case_match() and the ptype argument to case_when() (for type stability) also makes this feature less impactful.

We may return to this idea in the future.
