Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Lookup ambiguity in data-expressions #76

Open
ghost opened this issue Sep 24, 2018 · 7 comments
Open

Lookup ambiguity in data-expressions #76

ghost opened this issue Sep 24, 2018 · 7 comments
Labels
dsl feature
Milestone

Comments

@ghost
Copy link

@ghost ghost commented Sep 24, 2018

@JohnMount commented on Sep 23, 2018, 2:35 PM UTC:

dplyr::select() returns a wrong column, should probably throw a "no such column/value" style exception.

library("dplyr")
packageVersion("dplyr")

    ## [1] '0.7.6'

packageVersion("rlang")

    ## [1] '0.2.2'

packageVersion("tidyselect")

    ## [1] '0.2.4'

y <- "x"

dxy <- data.frame(x = 1, y = 2)
dx <- data.frame(x = 1) 

# returns column x (correct)
dx %>%
 select(y)

    ##   x
    ## 1 1

# returns column y (incorrect)
dxy %>%
 select(y)

    ##   y
    ## 1 2

This issue was moved by romainfrancois from tidyverse/dplyr#3851.

@ghost
Copy link
Author

@ghost ghost commented Sep 24, 2018

@romainfrancois commented on Sep 24, 2018, 7:35 AM UTC:

This is the documented behavior. Columns have priority in dxy %>% select(y) so the y column is selected. There is no y in dx so y is evaluated (to "x").

This is in essence, tidyselect semantics, I guess what you describe is [ standard evaluation semantics.

@krlmlr
Copy link
Member

@krlmlr krlmlr commented Sep 24, 2018

This was previously discussed in #49, with the resolution to keep the behavior. But that issue didn't include the example here:

library(tidyselect)

# Dangerous: two way to interpret
x <- "y"
vars_select("x", x)
#>   x 
#> "x"
vars_select("x", x:y)
#> Error in is_character(x, encoding = encoding, n = 1L): object 'y' not found
vars_select("x", -x)
#> named character(0)
vars_select("y", x)
#>   y 
#> "y"
vars_select("y", x:y)
#>   y 
#> "y"
vars_select("y", -x)
#> named character(0)


# Safe
vars_select("x", !!x)
#> Error: Unknown column `y` 
#> Backtrace:
#>  ─tryCatch(...)
#>  ─vars_select("x", !!x)
#>  ─vars_select_eval(.vars, quos) /home/kirill/git/R/tidyselect/R/vars-select.R:118:2
#>  ─map_if(ind_list, is_character, match_strings, names = TRUE) /home/kirill/git/R/tidyselect/R/vars-select.R:236:2
#>  ─map(.x[sel], .f, ...) /tmp/RtmpqZVXSG/R.INSTALL62801d434702/purrr/R/map.R:112:2
#>  ─.f(.x[[i]], ...) /tmp/RtmpqZVXSG/R.INSTALL62801d434702/purrr/R/map.R:104:2
#>  ─bad_unknown_vars(vars, unknown) /home/kirill/git/R/tidyselect/R/vars-select.R:272:4
vars_select("x", (!!x):y)
#> Error: Unknown column `y` 
#> Backtrace:
#>  ─tryCatch(...)
#>  ─"y":y
#>  ─match_strings(x) /home/kirill/git/R/tidyselect/R/vars-select.R:243:4
#>  ─bad_unknown_vars(vars, unknown) /home/kirill/git/R/tidyselect/R/vars-select.R:272:4
vars_select("x", -!!x)
#> Error: Unknown column `y` 
#> Backtrace:
#>  ─tryCatch(...)
#>  ─-"y"
#>  ─match_strings(x) /home/kirill/git/R/tidyselect/R/vars-select.R:257:4
#>  ─bad_unknown_vars(vars, unknown) /home/kirill/git/R/tidyselect/R/vars-select.R:272:4
vars_select("y", !!x)
#>   y 
#> "y"
vars_select("y", (!!x):y)
#>   y 
#> "y"
vars_select("y", -!!x)
#> named character(0)

# Correct (but why is the error message misleading?)
x <- rlang::sym("y")
vars_select("x", !!x)
#> Error in .f(.x[[i]], ...): object 'y' not found
vars_select("x", (!!x):y)
#> Error in is_character(x, encoding = encoding, n = 1L): object 'y' not found
vars_select("x", -!!x)
#> Error in is_character(x): object 'y' not found
vars_select("y", !!x)
#>   y 
#> "y"
vars_select("y", (!!x):y)
#>   y 
#> "y"
vars_select("y", -!!x)
#> named character(0)

Created on 2018-09-24 by the reprex package (v0.2.1.9000)

For now the easiest remedy might be to give an appropriate warning in the documentation, and refer to select(!!x), select(one_of()) or select_at(vars()) for programming.

@lionel- lionel- added the dsl label Sep 9, 2019
@lionel- lionel- changed the title dplyr::select() returns a wrong column, should probably throw a "no such column/value" style exception Data expressions should not look up values outside the data mask Sep 9, 2019
@lionel- lionel- added the feature label Sep 9, 2019
lionel- added a commit to lionel-/tidyselect that referenced this issue Oct 18, 2019
@lionel-
Copy link
Member

@lionel- lionel- commented Oct 18, 2019

This type of variable lookup in selection contexts now triggers a message:

vars <- c("cyl", "disp")
mtcars %>% dplyr::select(vars) %>% head()
#> Note: Selecting non-column variables is brittle.
#> ℹ If the data contains `vars` it will be selected instead.
#> ℹ Use `all_of(vars)` to silence this message.
#>                   cyl disp
#> Mazda RX4           6  160
#> Mazda RX4 Wag       6  160
#> Datsun 710          4  108
#> Hornet 4 Drive      6  258
#> Hornet Sportabout   8  360
#> Valiant             6  225

This is the first step towards deprecation of this behaviour.

@lionel- lionel- added this to the future milestone Nov 7, 2019
@lionel- lionel- changed the title Data expressions should not look up values outside the data mask Lookup ambiguity in data-expressions Nov 14, 2019
@rcragun
Copy link

@rcragun rcragun commented Jun 23, 2021

I think I found an odd case that is part of this issue. Piping a selection helper into across() produces different behavior than typing the helper inside across():

# Very simple function: returns input
self = function(x){x}

# Data to manipulate
dtemp = tibble(var = 1:2)

# No message when selection helper is inside across()
dtemp %>% mutate(across(matches("var"), self))

# Generates message when selection helper is piped to across()
dtemp %>% mutate(matches("var") %>% across(self))

The last line recommends all_of(), but the line before it does not.

I do not know if this is an issue with the code that generates the message or with my understanding of the pipe operator. I think most users understand a %>% b to be equivalent to b(a), but perhaps a %>% b evaluates a and then applies b() to the value returned from evaluating a while b(a) evaluates a in an environment somehow determined by b(). Even if this is how the pipe works, the message could be more helpful to the user because this example does not clearly reference an external vector.

dplyr version: 1.0.6
tidyverse version: 1.3.1

@lionel-
Copy link
Member

@lionel- lionel- commented Jun 28, 2021

I think most users understand a %>% b to be equivalent to b(a), but perhaps a %>% b evaluates a and then applies b() to the value returned from evaluating a

This is more or less how %>% works. On the other hand, the base pipe |> of R 4.1 operates at parsing time rather than evaluation time (in other words it's like a macro instead of like a function), which means a |> b() is perfectly equivalent to b(a). With the base pipe you won't get the message.

@rcragun
Copy link

@rcragun rcragun commented Aug 6, 2021

Is there any way to identify when the user has used %>% and add to the warning the suggestion that that may be the reason for the unexpected (to them) behavior?

@lionel-
Copy link
Member

@lionel- lionel- commented Aug 16, 2021

hmm not really. The check would have to be implemented at the level of mutate() and this sort of linting doesn't seem like a great fit for that function.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dsl feature
Projects
None yet
Development

No branches or pull requests

3 participants