Query text is not deterministic due to generated subquery aliases #336

Closed
returnString opened this issue Jul 14, 2019 · 5 comments
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL

@returnString

Some database engines, such as Redshift, rely on exact query-text matches for result caching. dbplyr currently generates subquery aliases from a rolling counter, so any query containing a subquery gets different text each time it is rendered and always misses the cache. See below for a small repro case.

Last year I created a minimal fork to fix this for an internal use case, but I'd like to upstream it properly :) The fork simply sets every user-inaccessible subquery alias to "a". That probably isn't ideal; I imagine a dbplyr-specific string similar to the sequential ones would be more sensible. That said, the fork has been running in production for over a year with no issues that I'm aware of.

Personally, I'm only really familiar with the Postgres/Redshift implementation of the spec, so I'm not sure what effect this could have on other DB engines. I'm happy to create a PR myself with a revised change if people think this is reasonable. Is it worth making it specific to each backend, or perhaps gating it behind an option? I'd appreciate any suggestions or thoughts here.

library(dplyr)
library(dbplyr)

pg_table <- lazy_frame(data.frame(x = 1, y = 2), simulate_postgres())

pg_table %>%
  filter(x == 1) %>%
  filter(y == 2)
#> <SQL>
#> SELECT *
#> FROM (SELECT *
#> FROM `df`
#> WHERE (`x` = 1.0)) `dbplyr_001`
#> WHERE (`y` = 2.0)

pg_table %>%
  filter(x == 1) %>%
  filter(y == 2)
#> <SQL>
#> SELECT *
#> FROM (SELECT *
#> FROM `df`
#> WHERE (`x` = 1.0)) `dbplyr_002`
#> WHERE (`y` = 2.0)

Created on 2019-07-14 by the reprex package (v0.3.0)
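
The two renderings above differ only in the alias suffix. As a stopgap outside dbplyr, the generated aliases could be renumbered in the rendered text before it is sent to the database. This is a minimal base-R sketch, not something proposed in this thread: `normalize_aliases()` is a hypothetical helper, and it assumes aliases always look like `dbplyr_` followed by digits (as in the output above) and that this pattern never appears inside a string literal or a user-chosen identifier.

```r
# Hypothetical helper (not part of dbplyr): renumber generated subquery
# aliases in order of first appearance so the query text is stable.
normalize_aliases <- function(sql) {
  stopifnot(is.character(sql), length(sql) == 1)

  # Locate every generated alias in the query text.
  matches <- gregexpr("dbplyr_[0-9]+", sql)
  found <- regmatches(sql, matches)[[1]]

  # Map each distinct alias to 001, 002, ... by first appearance, so
  # repeated references to the same alias stay consistent.
  renumbered <- sprintf("dbplyr_%03d", match(found, unique(found)))

  regmatches(sql, matches) <- list(renumbered)
  sql
}

# The two query texts from the repro, rendered with different counters.
q1 <- "SELECT *\nFROM (SELECT *\nFROM `df`\nWHERE (`x` = 1.0)) `dbplyr_001`\nWHERE (`y` = 2.0)"
q2 <- "SELECT *\nFROM (SELECT *\nFROM `df`\nWHERE (`x` = 1.0)) `dbplyr_002`\nWHERE (`y` = 2.0)"

identical(normalize_aliases(q1), normalize_aliases(q2))
```

Renumbering by first appearance, rather than collapsing every alias to one fixed name, keeps distinct subqueries distinct, which matters when a query refers to more than one alias.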

@alexkyllo
Contributor

alexkyllo commented Jul 24, 2019

This also causes the package's tests to fail when they are run a second time, unless the package is reloaded first to reset the global alias counter.

@hadley hadley added feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL labels Dec 13, 2019
@krlmlr
Member

krlmlr commented May 19, 2020

You can temporarily set the dbplyr_table_num option to work around this.

library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

lazy_frame(a = 1) %>%
  select(b = a) %>%
  mutate(c = b)
#> <SQL>
#> SELECT `b`, `b` AS `c`
#> FROM (SELECT `a` AS `b`
#> FROM `df`) `dbplyr_001`
lazy_frame(a = 1) %>%
  select(b = a) %>%
  mutate(c = b)
#> <SQL>
#> SELECT `b`, `b` AS `c`
#> FROM (SELECT `a` AS `b`
#> FROM `df`) `dbplyr_002`

rlang::with_options(
  dbplyr_table_num = 0,
  lazy_frame(a = 1) %>% select(b = a) %>% mutate(c = b) %>% print()
)
#> <SQL>
#> SELECT `b`, `b` AS `c`
#> FROM (SELECT `a` AS `b`
#> FROM `df`) `dbplyr_001`

Created on 2020-05-19 by the reprex package (v0.3.0)
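
The same workaround can be wrapped in a small helper that uses only base R. This is a sketch under assumptions: `with_fresh_alias_counter()` is a hypothetical name, and it relies on the `dbplyr_table_num` option behaving as shown in the reprex above, which may differ between dbplyr versions.

```r
# Hypothetical helper: evaluate `expr` with the alias counter option
# temporarily set to 0, then restore whatever value was there before.
with_fresh_alias_counter <- function(expr) {
  old <- options(dbplyr_table_num = 0)
  on.exit(options(old), add = TRUE)
  # `expr` is a lazily evaluated promise; forcing it here ensures it
  # runs while the option is set.
  force(expr)
}

# Usage sketch (requires dplyr and dbplyr, so it is left commented out).
# The SQL must be rendered inside the call, because lazy queries only
# generate their text when printed or rendered:
# with_fresh_alias_counter(
#   lazy_frame(a = 1) %>% select(b = a) %>% mutate(c = b) %>% print()
# )
```

As in the reprex, rendering has to happen inside the wrapper; returning the lazy query and printing it afterwards would use the restored counter.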

@hadley
Member

hadley commented Sep 17, 2020

What if we just reset the counter at the start of query generation? I think that would be relatively simple to do, and shouldn't have far-reaching effects.
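
A toy illustration of resetting the counter at the start of query generation, in base R only. It is not dbplyr's implementation: `make_alias_generator()` and `render_nested_filters()` are hypothetical names, and the renderer only handles a chain of filters like the one in the original repro.

```r
# Hypothetical alias generator: a closure holding a private counter.
make_alias_generator <- function(prefix = "dbplyr_") {
  counter <- 0L
  list(
    reset = function() {
      counter <<- 0L
      invisible(NULL)
    },
    next_alias = function() {
      counter <<- counter + 1L
      sprintf("%s%03d", prefix, counter)
    }
  )
}

# Hypothetical renderer: wraps each additional filter in a subquery.
render_nested_filters <- function(aliases, table, conditions) {
  # Resetting here is the key step: numbering restarts for every
  # render, so the same input always yields the same text.
  aliases$reset()
  sql <- sprintf("SELECT *\nFROM %s\nWHERE (%s)", table, conditions[[1]])
  for (cond in conditions[-1]) {
    sql <- sprintf(
      "SELECT *\nFROM (%s) %s\nWHERE (%s)",
      sql, aliases$next_alias(), cond
    )
  }
  sql
}

aliases <- make_alias_generator()
first <- render_nested_filters(aliases, "df", c("x = 1.0", "y = 2.0"))
second <- render_nested_filters(aliases, "df", c("x = 1.0", "y = 2.0"))
identical(first, second)
```

Without the `aliases$reset()` call, the second render would use the next counter value, which is the non-determinism described in the original report.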

@hadley
Member

hadley commented Sep 21, 2020

Also need to combine unique_name() and unique_table_name()

@hadley
Member

hadley commented Sep 21, 2020

OK, the real problem is that unique_name() should really be unique_subquery_name(): it's the subquery names that need to be reset, not the table names (which are primarily used for temporary tables, i.e. testing).
