.data does not work if object has same name as variable #2916

Open
sollano opened this Issue Jun 26, 2017 · 14 comments

sollano commented Jun 26, 2017

I believe this is a bug with the .data pronoun:

# Load package and example data
library(dplyr)

# this works

my_var <- "homeworld"

starwars %>%
  group_by(.data[[my_var]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

# A tibble: 49 x 3
           my_var   height  mass
            <chr>    <dbl> <dbl>
 1       Alderaan 176.3333  64.0
 2    Aleen Minor  79.0000  15.0
 3         Bespin 175.0000  79.0
 4     Bestine IV 180.0000 110.0
 5 Cato Neimoidia 191.0000  90.0
 6          Cerea 198.0000  82.0
 7       Champala 196.0000   NaN
 8      Chandrila 150.0000   NaN
 9   Concord Dawn 183.0000  79.0
10       Corellia 175.0000  78.5
# ... with 39 more rows

# this does not

homeworld <- "homeworld"

starwars %>%
  group_by(.data[[homeworld]]) %>%
  summarise_at(vars(height:mass), mean, na.rm = TRUE)

Error in mutate_impl(.data, dots) : 
  Evaluation error: Must subset with a string.

@sollano sollano changed the title from .data does not work if object has same name as variable to .data does not work if object has same name as variable [bug] Jun 26, 2017

@sollano sollano changed the title from .data does not work if object has same name as variable [bug] to .data does not work if object has same name as variable Jun 26, 2017

@lionel- lionel- self-assigned this Jun 26, 2017

JohnMount Jun 26, 2017

Also notice everything is grouped by a column named "my_var", not one named "homeworld" (and a column is added to the data frame).

I wonder if it is related to dplyr issue 2904 where similar "coincidence in naming" aliasing appears to be happening (but didn't seem as unfavorable at first look). I'd also say the odds of accidentally guessing a column name when coding are not actually that low.

I am beginning to wonder if things should be stricter and the only allowed pronoun notation should be group_by(.data[[!!homeworld]]).
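
A minimal sketch of the stricter notation suggested above, assuming the `starwars` data shipped with dplyr. Unquoting with `!!` inlines the string before dplyr evaluates the expression, so the lookup cannot be captured by a column that happens to share the object's name. The output name `mean_height` is illustrative and not from the thread. As far as I know, later rlang releases evaluate the subscript of `.data[[ ]]` outside the data mask, which would make the explicit `!!` redundant there; treat that as an assumption to check against your installed version.

```r
library(dplyr)

# The object deliberately shares its name with a column of `starwars`.
homeworld <- "homeworld"

# `!!homeworld` is replaced by the string "homeworld" before evaluation,
# so the subscript no longer depends on data-frame-first scoping.
by_world <- starwars %>%
  group_by(.data[[!!homeworld]]) %>%
  summarise(mean_height = mean(height, na.rm = TRUE))
```

The result has one row per distinct homeworld (including the missing-value group), regardless of what the string-holding object is called.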

sollano Jun 28, 2017

@JohnMount You're right, the name of the grouping variable is changed to match the object's name. Really weird... I didn't notice that before.
Also, the guessing problem is very real to me: I usually use variable names that are the initials of some measure (like DBH, Diameter at Breast Height), so they can easily be guessed.

lionel- Jun 28, 2017

Member

> Really weird.

That's the whole point of the dplyr interface though ;) We just need to make an exception for .data subsetting.

> The guessing problem is really real to me

As John mentioned, a general solution to avoid hitting variables from the data frame is to evaluate eagerly by unquoting with !!. E.g.:

mass <- 100

transmute(starwars,
  height,
  eagerly = height * (!! mass),  # takes `mass` from environment
  lazily = height * mass         # takes `mass` from `starwars`
)
#> # A tibble: 87 x 3
#>    height eagerly lazily
#>     <int>   <dbl>  <dbl>
#>  1    172   17200  13244
#>  2    167   16700  12525
#>  3     96    9600   3072
#>  4    202   20200  27472
#>  5    150   15000   7350
#>  6    178   17800  21360
#>  7    165   16500  12375
#>  8     97    9700   3104
#>  9    183   18300  15372
#> 10    182   18200  14014
#> # ... with 77 more rows

sollano Jun 28, 2017

> As John mentioned, a general solution to avoid hitting variables from the data frame is to evaluate eagerly by unquoting with !!.

Yes, that works when the object is a number. But if the object holds a quoted variable name, it only works when the object's name is the same as the variable name! (This is getting confusing to me.) For example:

# this works:
height <- "height"
height <- quo(height)

print(height)

<quosure: global>
~height

transmute(starwars,
          height,
          height_sqrt = height * (!!height)         
)

# A tibble: 87 x 2
   height height_sqrt
    <int>       <int>
 1    172       29584
 2    167       27889
 3     96        9216
 4    202       40804
 5    150       22500
 6    178       31684
 7    165       27225
 8     97        9409
 9    183       33489
10    182       33124
# ... with 77 more rows

# this does not

h <- "height"
h <- quo(h)

print(h)

<quosure: global>
~h

transmute(starwars,
          height,
          height_sqrt = height * (!!h)         
)

Error in mutate_impl(.data, dots) : 
  Evaluation error: non-numeric argument to binary operator.

So this method does not work inside functions either, because the argument names would have to match the column names of the data frame; otherwise it fails. And generally, in a function, you want generic argument names that the user can map to any column.

So one is stuck with the .data pronoun if one wants to pass strings as input to a function. Maybe I'm doing something wrong here, but I could not make !! convert strings to bare variable names.

I guess one way would be to modify the variable names inside the function to match the generic names used inside the function, and skip the whole tidy eval thing. But I guess that kinda misses the point.

lionel- Jun 28, 2017

Member

Let's look at what happens in this snippet:

height <- "height"    # height refers to a column name
height <- quo(height) # height is a quosure referring to `height`, which is itself

height contains a quote of itself, so you can't do anything useful with it. The exception is when the symbol height also names a column of the data frame: then the height column prevails, because scoping in dplyr is always data frame → environment, with or without quosures.

So when you're doing this:

h <- "height" # `h` contains a column name
h <- quo(h)   # `h` contains a quote of itself

Unquoting h refers to h, which isn't a column name but is found in the environment, and that h contains a reference to itself.
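
To make the explanation above concrete, here is a small sketch that inspects what `quo()` actually captures. It uses `rlang::quo_get_expr()` and `rlang::eval_tidy()`; the names `col_name`, `captured`, and `resolved`, and the two-row data frame, are made up for illustration and use a distinct object name to avoid the self-reference.

```r
library(rlang)

col_name <- "height"         # a string naming a column
q <- quo(col_name)           # quotes the symbol `col_name`, not its value

captured <- quo_get_expr(q)  # the symbol `col_name`

# `col_name` is not a column of the data, so it is looked up in the
# quosure's environment and resolves to the string "height".
resolved <- eval_tidy(q, data = data.frame(height = c(172, 167)))
```

Because the quosure resolves to a string rather than to the column, multiplying by it fails, which matches the "non-numeric argument to binary operator" error reported earlier.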

What you want is simply this:

# Import `sym()` to create symbols
# will soon be exported in dplyr as well
sym <- rlang::sym

height_name <- "height"
height_sym <- sym(height_name)

transmute(starwars,
   height,
   height_sqrt = height * (!! height_sym)
)

This is equivalent to:

height <- "height" # storing column name

transmute(starwars,
   height,
   height_sqrt = height * (!! sym(height)) # creating and inlining symbol from stored column name
)

Try wrapping the call in expr() to see the expression that dplyr receives:

expr(transmute(starwars,
   height,
   height_sqrt = height * (!! sym(height))
))
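
The same pattern can be wrapped in a function that takes the column name as a string, which is the use case raised earlier in the thread. This is only a sketch: `square_column()` and its argument names are hypothetical, and `rlang::sym()` is called with its namespace so the example does not depend on what dplyr re-exports.

```r
library(dplyr)

# Hypothetical helper: `col_name` is a string supplied by the caller.
square_column <- function(df, col_name) {
  col <- rlang::sym(col_name)   # "height" becomes the symbol `height`
  transmute(df, !!col, squared = (!!col) * (!!col))
}

out <- square_column(starwars, "height")

# Build the expression without evaluating it, to see what dplyr would get.
built <- rlang::expr(height * !!rlang::sym("height"))
```

Here `built` is the plain call `height * height`, so by the time dplyr evaluates it nothing refers to the function's argument name any more.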

sollano Jun 28, 2017

@lionel- Thanks for that, man, I really appreciate it. I just tested this in my function, and it works. When this sym function gets exported to dplyr, I think it would be a good idea to add an example like this to the programming vignette.

But this makes me wonder... Doesn't this sym function kinda replace the .data pronoun?

lionel- Jun 28, 2017

Member

> Doesn't this sym function kinda replace the .data pronoun?

No, because the .data pronoun ensures the variable comes from the data frame and never from the environment. Since we have the df → env scoping in dplyr, there is always this uncertainty about where the symbols are found.

Most of the time it doesn't matter (otherwise your dplyr pipelines wouldn't work) but when you want to be really explicit you have two solutions: unquoting eagerly makes sure you're taking your variable from the environment, and using the .data pronoun makes sure you're taking your variable from the data frame.
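
The two explicit forms described above can be put side by side. This sketch assumes the `starwars` data from dplyr; the column names `from_env` and `from_data` are illustrative.

```r
library(dplyr)

mass <- 100  # shares its name with the `mass` column of `starwars`

out <- transmute(starwars,
  from_env  = height * !!mass,      # always the environment value, 100
  from_data = height * .data$mass   # always the `mass` column
)
```

Neither column depends on the data frame → environment lookup order, so the result stays the same if a column is later added to or removed from the data.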

sollano Jun 28, 2017

Oh, now I see. It would be wiser to use the .data pronoun, just to be sure.

krlmlr Jan 19, 2018

Member

Can we close this? Do we need to improve documentation?

sollano Feb 16, 2018

I'm still kind of confused about this, because rlang::sym and rlang::syms are doing the job of dplyr::quo and dplyr::quos just fine for me, so I simply don't use quo and quos anymore. I believe this should be detailed in the documentation. Most people don't even know about those rlang functions. That's just my opinion anyway; maybe I got something wrong. But even if I did, that just proves my point.
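
For reference, a sketch of the kind of usage described here: several column names held as strings, converted with `rlang::syms()` and spliced into a verb with `!!!`. The vector `group_cols` is a made-up name, and the columns are taken from the `starwars` data.

```r
library(dplyr)

group_cols <- c("species", "gender")  # column names held as strings

# `syms()` turns each string into a symbol; `!!!` splices them in as
# separate arguments, as if `species, gender` had been typed directly.
grouped <- group_by(starwars, !!!rlang::syms(group_cols))
```

This covers the case where the strings only name existing columns, which is the situation sollano describes.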

lionel- Feb 18, 2018

Member

> rlang::sym and rlang::syms are doing the jobs of dplyr::quo and dplyr::quos, just fine for me, so I simply don't use them anymore

Probably because you use them to refer to objects from the data frame. In this case they are completely fine. However in more complex cases (i.e. involving calls to functions) you'll need quosures to avoid bugs.
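
A sketch of one such case, under the assumption that the caller passes an expression that mentions a variable local to the calling function. `scale_height()`, `caller()`, and `multiplier` are hypothetical names. `enquo()` captures the expression together with the environment it was written in, which is what lets `multiplier` be found.

```r
library(dplyr)

# Hypothetical wrapper: `enquo()` captures the expression and its environment.
scale_height <- function(df, expr) {
  expr <- enquo(expr)
  transmute(df, height, scaled = !!expr)
}

caller <- function() {
  multiplier <- 10  # exists only inside `caller()`
  scale_height(starwars, height * multiplier)
}

out <- caller()
```

A bare symbol or unevaluated call carries no environment, so the same expression passed without a quosure would likely fail to find `multiplier` once it is evaluated inside `scale_height()`.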

> I believe this should be detailed in the documentation.

Better documentation is forthcoming.

sollano Feb 18, 2018

> However in more complex cases (i.e. involving calls to functions) you'll need quosures to avoid bugs.

Oh, thanks. I'll take note of this.

So the issue that I found is not really a bug; it simply happens because of the way quosures work, right? I think this is solved, then.

krlmlr Feb 28, 2018

Member

Keeping open, feel free to close when docs are updated.

@krlmlr krlmlr added the docs label Feb 28, 2018

lionel- May 28, 2018

Member

Upstream issue: r-lib/rlang#477
