Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

filter_at() translation issue #296

Closed
jakeybob opened this issue May 8, 2019 · 16 comments
Closed

filter_at() translation issue #296

jakeybob opened this issue May 8, 2019 · 16 comments
Labels
bug verb trans 🤖
Milestone

Comments

@jakeybob
Copy link

@jakeybob jakeybob commented May 8, 2019

dbplyr 1.4.0 appears to translate filter_at() calls incorrectly, translating the variable name themselves, rather than their values. See following example:

library(dplyr)
library(dbplyr)

con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
copy_to(con, mtcars)

mtcars2 <- tbl(con, "mtcars")
cols <- c("vs", "am", "gear", "carb")
nums <- c(4, 5)
num <- 1

test <- mtcars2 %>%
  filter(gear %in% nums, 
         gear == num) %>%
  filter_at(cols, any_vars(. %in% nums)) %>%
  filter_at(cols, any_vars(. == num))

show_query(test)

In dbplyr 1.3.0 this gives the following SQL query:

SELECT *
FROM (SELECT *
FROM (SELECT *
FROM `mtcars`
WHERE ((`gear` IN (4.0, 5.0)) AND (`gear` = 1.0)))
WHERE (`vs` IN (4.0, 5.0) OR `am` IN (4.0, 5.0) OR `gear` IN (4.0, 5.0) OR `carb` IN (4.0, 5.0)))
WHERE (`vs` = 1.0 OR `am` = 1.0 OR `gear` = 1.0 OR `carb` = 1.0)

but in dbplyr 1.4.0 the variables are translated literally as "num" and "nums" in the SQL translation of the filter_at() function, although they are translated as expected from the filter() function.

SELECT *
FROM (SELECT *
FROM (SELECT *
FROM `mtcars`
WHERE ((`gear` IN (4.0, 5.0)) AND (`gear` = 1.0)))
WHERE (`vs` IN `nums` OR `am` IN `nums` OR `gear` IN `nums` OR `carb` IN `nums`))
WHERE (`vs` = `num` OR `am` = `num` OR `gear` = `num` OR `carb` = `num`)

This then leads to an error when querying the database via collect() as there are no num or nums variables on the database end.

@lianos

This comment has been minimized.

@cderv

This comment has been minimized.

@jakeybob

This comment has been minimized.

@hadley

This comment has been minimized.

@lianos

This comment has been minimized.

@cderv
Copy link
Contributor

@cderv cderv commented May 10, 2019

The issue appeared after efd2c17 when partial_eval was refactored. The refactoring revealed some issues with any_vars / all_vars and dbplyr

Minimal reprex to help debug shows it concerns any_vars and all_vars

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

cols <- c("vs", "am")

mtcars2 <- tbl_lazy(mtcars, con = simulate_sqlite())

num <- 1
mtcars2 %>%
  filter_at(cols, any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`vs` = `num` OR `am` = `num`)

minimum <- 150
mtcars2 %>%
  filter_at(cols, all_vars(. > minimum))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`vs` > `minimum` AND `am` > `minimum`)

mtcars2 %>%
  filter_all(all_vars(. > minimum))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` > `minimum` AND `cyl` > `minimum` AND `disp` > `minimum` AND `hp` > `minimum` AND `drat` > `minimum` AND `wt` > `minimum` AND `qsec` > `minimum` AND `vs` > `minimum` AND `am` > `minimum` AND `gear` > `minimum` AND `carb` > `minimum`)

mtcars2 %>%
  filter_all(any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num` OR `disp` = `num` OR `hp` = `num` OR `drat` = `num` OR `wt` = `num` OR `qsec` = `num` OR `vs` = `num` OR `am` = `num` OR `gear` = `num` OR `carb` = `num`)

Created on 2019-05-10 by the reprex package (v0.2.1.9000)

The issue may be with quo_reduce in dplyr that powers filter_* when using any_vars and all_vars. The quosure is built with an environment (base_env()) that is not parent of caller_env() so the num symbol is not found, hence not replaced in the new quosure created by partial evaluation

dbplyr/R/partial-eval.R

Lines 82 to 91 in 56afe40

partial_eval_sym <- function(sym, vars, env) {
name <- as_string(sym)
if (name %in% vars) {
sym
} else if (env_has(env, name, inherit = TRUE)) {
eval_bare(sym, env)
} else {
sym
}
}

I share here some details about what I found

mtcars2 <- dbplyr::tbl_lazy(mtcars, con = dbplyr::simulate_sqlite())
cols <- c("vs", "am")
num <- 4

if we go into detail, the part of filter_at that deals with any_vars creates a quosure
where then env is base_env()

syms <- dplyr:::tbl_at_syms(mtcars2, cols)
pred <- dplyr:::apply_filter_syms(dplyr::any_vars(. == num), syms, mtcars)
pred
#> <quosure>
#> expr: ^(^vs == num) | (^am == num)
#> env:  package:base

from filter.tbl_lazy, this env is kept after patial_eval_dots is applied

dots <- rlang::quos(!!pred)
dbplyr:::partial_eval_dots(dots, vars = dbplyr:::op_vars(mtcars2))
#> [[1]]
#> <quosure>
#> expr: ^(^vs == num) | (^am == num)
#> env:  package:base

from partial_eval_dots, the env of the quosure (base_env()) is passed in partial_eval

dbplyr:::partial_eval(rlang::get_expr(dots[[1]]), vars = dbplyr:::op_vars(mtcars2), env = rlang::get_env(dots[[1]]))
#> (~vs == num) | ~am == num
dbplyr:::partial_eval(rlang::sym("num"), vars = dbplyr:::op_vars(mtcars2), env = rlang::get_env(dots[[1]]))
#> num
#`  `num` is in `global_env()` and won't be found in parent of `base_env()`
exists("num", envir = rlang::base_env(), inherits = TRUE)
#> [1] FALSE
rlang::env_has(rlang::base_env(), "num", inherit = TRUE)
#>   num 
#> FALSE

There is an environment issue with the quosure built by
dplyr:::apply_filter_syms() and how dbplyr resolves what is from R and
what is in the DB using partial_eval()
I am not sure the issue is in dbplyr. It could also be in dplyr when
building the quosure using a joiner function that sets the env to base_env() through dplyr:::quo_reduce().

I hope this little investigation about the source of the bug helps fix it. I am not currently totally at ease with all the tidyeval interaction to make a PR directly. First try broke some tests 🙄

@keithmcnulty

This comment has been minimized.

@cderv

This comment has been minimized.

@keithmcnulty

This comment has been minimized.

@hadley hadley added bug verb trans 🤖 labels May 30, 2019
@hadley hadley added this to the v1.4.1 milestone May 30, 2019
@hadley
Copy link
Member

@hadley hadley commented May 30, 2019

Thanks for the exploration @cderv! I'll take a look next week.

@hadley
Copy link
Member

@hadley hadley commented May 30, 2019

@lionel- do you remember why dplyr:::apply_filter_syms() creates a quosure with the base environment, rather than the environment of the original quosure?

(@cderv I think your analysis is correct and this is actually a dplyr bug)

@lionel-
Copy link
Member

@lionel- lionel- commented May 31, 2019

The base environment seems correct to me, as it is meant to resolve & or |. On the other hand, the operands should be quosures from the user environment, as captured by all_vars() and any_vars().

@hadley
Copy link
Member

@hadley hadley commented May 31, 2019

Oooh so it's probably partial_eval() not unpeeling the nested quosures correctly.

@hadley
Copy link
Member

@hadley hadley commented May 31, 2019

Hmmm, looks like partial_eval() already handles this correctly:

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

num <- 4
mtcars2 <- tbl_lazy(mtcars)
mtcars2 %>% 
  filter_at(c("mpg", "cyl"), any_vars(. == num))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num`)

# Construct quosure by hand
quo <- dplyr:::apply_filter_syms(any_vars(. == num), syms(c("mpg", "cyl")))
quo
#> <quosure>
#> expr: ^(^mpg == num) | (^cyl == num)
#> env:  package:base
mtcars2 %>% 
  filter(!!quo)
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = `num` OR `cyl` = `num`)

# _but_ explicitly calling partial_eval works 
partial_eval(quo, names(mtcars))
#> <quosure>
#> expr: ^(^mpg == 4) | (^cyl == 4)
#> env:  package:base

mtcars2 %>% 
  filter(!!partial_eval(quo, names(mtcars)))
#> <SQL>
#> SELECT *
#> FROM `df`
#> WHERE (`mpg` = 4.0 OR `cyl` = 4.0)

Created on 2019-05-31 by the reprex package (v0.2.1.9000)

@hadley
Copy link
Member

@hadley hadley commented May 31, 2019

Oh the problem is specifically with partial_eval_dots()

library(dbplyr)
library(dplyr, warn.conflicts = FALSE)

num <- 4
quo <- dplyr:::apply_filter_syms(any_vars(. == num), syms(c("mpg", "cyl")))

partial_eval(quo, names(mtcars))
#> <quosure>
#> expr: ^(^mpg == 4) | (^cyl == 4)
#> env:  package:base

dbplyr:::partial_eval_dots(quos(!!quo), names(mtcars))
#> [[1]]
#> <quosure>
#> expr: ^(^mpg == num) | (^cyl == num)
#> env:  package:base

Created on 2019-05-31 by the reprex package (v0.2.1.9000)

@hadley
Copy link
Member

@hadley hadley commented May 31, 2019

Looks like a 2 line change to partial_eval() fixes this 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug verb trans 🤖
Projects
None yet
Development

No branches or pull requests

6 participants