Referencing other columns inside mutate/summarise across is broken in 1.0.4 #5734

Closed
juancamilog opened this issue Feb 3, 2021 · 4 comments · Fixed by #5761

@juancamilog

In dplyr 1.0.3, a function passed to across() can reference other columns of the same data frame/tibble (or group) by name. This functionality is broken in 1.0.4.

To reproduce, the following example works in 1.0.3

storms %>% mutate(across(c('wind', 'pressure'), function(x) {
  return(x/lat)
}))

In 1.0.4, running the above example results in the following error message

>rlang::last_error()
<error/dplyr:::mutate_error>
Problem with mutate() input ..1.
x object 'lat' not found

i.e. we can't reference other columns by name. A possible workaround is to use cur_data()$name_of_column but this is slower as the following benchmark demonstrates:

library(dplyr, warn.conflicts = FALSE)

df <- tibble(cbind.data.frame(
  grp_1 = sort(rep(1:250, 4)),
  grp_2 = rep(1:4, 250), 
  matrix(rnorm(1000 * 100), nrow = 1000)))

bench::mark(iterations = 100,
            filter_gc = FALSE,
            use_cur_data = df %>% summarise(across(is.numeric, function(x) {
              rows = cur_data()
              mask = (rows$grp_1 %% 2) == 0
              return(mean(x[mask] / rows$grp_2[mask]))
            })),
            direct_reference = df %>% summarise(across(is.numeric, function(x) {
              mask = (grp_1 %% 2) == 0
              return(mean(x[mask] / grp_2[mask]))
            }))) %>%
  select(expression, min, median, `itr/sec`, `gc/sec`)

which results in the following output

# A tibble: 2 x 5
  expression            min   median `itr/sec` `gc/sec`
  <bch:expr>       <bch:tm> <bch:tm>     <dbl>    <dbl>
1 use_cur_data       17.2ms  18.29ms      51.1    10.7 
2 direct_reference   8.12ms   8.59ms     108.      9.76
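
For reference, the cur_data() workaround mentioned above can be sketched on a small table. This is only an illustration: `toy` is a hypothetical stand-in for `storms`, and it assumes a dplyr release in which cur_data() is still available (it was deprecated in later releases).

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data standing in for `storms`
toy <- tibble(
  wind     = c(10, 20, 30),
  pressure = c(1000, 990, 980),
  lat      = c(2, 4, 5)
)

res <- toy %>%
  mutate(across(c(wind, pressure), function(x) {
    # `lat` is reached through cur_data() rather than by bare name
    x / cur_data()$lat
  }))
```

The result has the same columns as `toy`, with `wind` and `pressure` divided row by row by `lat`.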

TL;DR: In dplyr 1.0.3, referencing columns via cur_data()$column_name instead of directly by name can be considerably slower. In 1.0.4, referencing columns directly by name (without cur_data()) is currently broken.

@romainfrancois
Member

I'm not sure why this worked previously. It seems odd that a function would be instrumented like this.

OTOH, functions created from formulas are:

library(dplyr, warn.conflicts = FALSE)

storms %>% 
  mutate(across(c(wind, pressure), ~ . / lat))
#> # A tibble: 10,010 x 13
#>    name   year month   day  hour   lat  long status category  wind pressure
#>    <chr> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>  <ord>    <dbl>    <dbl>
#>  1 Amy    1975     6    27     0  27.5 -79   tropi… -1       0.909     36.8
#>  2 Amy    1975     6    27     6  28.5 -79   tropi… -1       0.877     35.5
#>  3 Amy    1975     6    27    12  29.5 -79   tropi… -1       0.847     34.3
#>  4 Amy    1975     6    27    18  30.5 -79   tropi… -1       0.820     33.2
#>  5 Amy    1975     6    28     0  31.5 -78.8 tropi… -1       0.794     32.1
#>  6 Amy    1975     6    28     6  32.4 -78.7 tropi… -1       0.772     31.2
#>  7 Amy    1975     6    28    12  33.3 -78   tropi… -1       0.751     30.4
#>  8 Amy    1975     6    28    18  34   -77   tropi… -1       0.882     29.6
#>  9 Amy    1975     6    29     0  34.4 -75.8 tropi… 0        1.02      29.2
#> 10 Amy    1975     6    29     6  34   -74.8 tropi… 0        1.18      29.5
#> # … with 10,000 more rows, and 2 more variables: ts_diameter <dbl>,
#> #   hu_diameter <dbl>

Created on 2021-02-04 by the reprex package (v0.3.0)

@juancamilog
Author

juancamilog commented Feb 4, 2021

Sometimes it is useful to change the value of a set of columns (the columns inside the across statement), depending on the value of other columns for the corresponding rows.
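
A minimal sketch of that use case, written with the formula form that is reported earlier in this thread to work in 1.0.4. The `toy` table and the threshold of 30 are made up for illustration.

```r
library(dplyr, warn.conflicts = FALSE)

# Hypothetical toy data; column names mirror `storms`
toy <- tibble(
  wind     = c(10, 20, 30),
  pressure = c(1000, 990, 980),
  lat      = c(20, 35, 40)
)

# Halve wind and pressure only in rows where lat exceeds 30;
# other rows are left unchanged
res <- toy %>%
  mutate(across(c(wind, pressure), ~ if_else(lat > 30, .x / 2, .x)))
```

Here `lat` is a column outside the across() selection, and it decides row by row whether the selected columns are changed.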

@juancamilog
Author

juancamilog commented Feb 4, 2021

Actually, this can be dealt with in 1.0.4 by doing the following

storms %>% mutate(
  across(
    c('wind', 'pressure'),
    function(x, lat_) {
      return(x / lat_)
    },
    lat
  )
)

Now lat_ is available inside the function scope and corresponds to what I want.

This issue can be closed

@juancamilog
Author

Reopening because the example in the comment above does not work in 1.0.4.

simonpcouch added a commit to tidymodels/stacks that referenced this issue Feb 8, 2021
addresses a change in dplyr 1.0.4 where newly defined columns in a mutate call cannot be accessed in anonymous functions inside of across(). see tidyverse/dplyr#5734.
@romainfrancois romainfrancois self-assigned this Feb 11, 2021
@romainfrancois romainfrancois added this to the 1.0.5 milestone Feb 11, 2021