Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

select(df, colname) sometimes impersonates select_(df, .dots=colname) #2904

Closed
huftis opened this issue Jun 24, 2017 · 7 comments
Closed

select(df, colname) sometimes impersonates select_(df, .dots=colname) #2904

huftis opened this issue Jun 24, 2017 · 7 comments
Assignees

Comments

@huftis
Copy link

huftis commented Jun 24, 2017

select(df, colname) should issue an error message when colname is not a column in df. However, if there exists a (global) colname variable which is a character vector, the columns in df corresponding to the elements in colname are instead returned. Basically, select(df, colname) works like select_(df, .dots=colname) iff there is no column named colname in df.

# Load package and example data
library(dplyr)
df = as_data_frame(iris)

# Create two identical variables string variables
Species = c("Sepal.Width", "Petal.Length")
myvar = Species

df %>% select(Species) # OK, returns Species column
df %>% select(myvar)   # Not OK, returns Sepal.Width & Petal.Length columns
df %>% select(myvar2)  # OK, returns error message

The last command gives the error message #> Error in overscope_eval_next(overscope, expr): object 'myvar2' not found. The next to last line should have also given a similar error message, since there is no myvar colmn in df. But instead, it returns a tibble with the columns Sepal.Width and Petal.Length. This is unexpected and dangerous behaviour.

I observe this bug with dplyr 0.7.1 and the latest GitHub version (as of 2017-06-24).

@lionel-
Copy link
Member

lionel- commented Jun 26, 2017

This is just consistent semantics of evaluation, if a symbol is not found in the data frame, it is looked up in the context.

However it might make sense to make an exception for select() and pull() as the semantics are already a bit different than other verbs (e.g. function calls are not evaluated within the data frame). Then we can always unquote with !! if we really want a variable defined in the environment. It's easy to document as well: symbols are evaluated in the data frame alone, expressions are evaluated in the environment alone.

@MZLABS
Copy link

MZLABS commented Jun 28, 2017

(John Mount here) In my opinion the "Species" name should not be given two chances to match the frame. I get that if you match the frame you go to the frame, and then if not go to the context. But what is happening is if the name "Species" or the contents of "Species" match the frame the frame wins, and only then the system looks out to the context. Under this interpretation a user can not read a code fragment and know what it does without also knowing what value (if any) "Species" may be carrying.

@lionel-
Copy link
Member

lionel- commented Jun 28, 2017

That's dplyr semantics since the beginning so we're not going to change them. Scoping has always been data first, context second. With tidyeval we offer several ways of being more explicit.

That said, select() is special enough that we might make an exception.

@lionel-
Copy link
Member

lionel- commented Jun 28, 2017

That's dplyr semantics since the beginning

It's worth noting that model formulas from base R and even S work the same way.

@MZLABS
Copy link

MZLABS commented Jun 28, 2017

It's worth noting that model formulas from base R and even S work the same way.

Formulas in base R do not have the double lookup in the frame property. I agree looking two places (frame then environment) is good. But trying the frame twice is not good (new or not).

In a base R formula model.matrix(~x, data.frame(x='y', y=1)) does not evaluate to the same as model.matrix(~y, data.frame(x='y', y=1)) just because the x column happens to contain a column name. Similarly x='y'; model.matrix(~x, data.frame(y=1)) also (rightly) does not evaluate to the same value as model.matrix(~y, data.frame(y=1)).

For dplyr 0.7.1 we have mpg <- 'cyl' ; select(mtcars, mpg) selecting the mpg column and v <- 'cyl' ; select(mtcars, v) selecting the cyl column. Meaning before even going to the environment both the variable name and the value in the variable are checked against the data.frame. This also seems to be a special case for select(). Observe that mpg <- 'cyl' ; transmute(mtcars, mpg) and v <- 'cyl' ; transmute(mtcars, v) don't snoop inside the variables mpg and v to get column names.

That's dplyr semantics since the beginning

For dplyr 0.5.0 we have mpg <- 'cyl' ; select(mtcars, mpg) selecting the mpg column and v <- 'cyl' ; select(mtcars, v) throwing. Though mpg <- 2; select(mtcars, mpg) does select the mpg column while v <- 2; select(mtcars, v) selects the cyl column. So the behavior was there for integers, but not for strings.

@lionel-
Copy link
Member

lionel- commented Jun 29, 2017

For dplyr we have

No, this is only for verbs with selection semantics. For those verbs, the columns in the data environment represent column positions, not column values. Then we select based on those positions. This helps simplifying the implementation of selecting helpers. This also explains the behaviour we're seeing here.

Again: we will consider making an exception for the scoping of symbols in selection verbs because the semantics are special.

@lionel- lionel- self-assigned this Jul 10, 2017
lionel- added a commit to lionel-/dplyr that referenced this issue Jul 11, 2017
lionel- added a commit to lionel-/dplyr that referenced this issue Jul 11, 2017
@hadley
Copy link
Member

hadley commented Nov 2, 2017

The semantics were reconsidered as part of tidyselect and will be incorporated into dplyr once we use tidyselect.

@hadley hadley closed this as completed Nov 2, 2017
@lock lock bot locked as resolved and limited conversation to collaborators Jun 7, 2018
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants