Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise more nested selects #213

Closed
hadley opened this issue Jan 10, 2019 · 5 comments
Closed

Optimise more nested selects #213

hadley opened this issue Jan 10, 2019 · 5 comments
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL

Comments

@hadley
Copy link
Member

hadley commented Jan 10, 2019

library(dplyr, warn.conflicts = FALSE)
iris_db <- dbplyr::tbl_lazy(iris)
iris_db %>% 
  select(2:1) %>% 
  select(2:1) %>% 
  select(2:1) %>% 
  show_query()
#> <SQL> SELECT "Sepal.Width", "Sepal.Length"
#> FROM (SELECT "Sepal.Length", "Sepal.Width"
#> FROM (SELECT "Sepal.Width", "Sepal.Length"
#> FROM "df") "dbplyr_tulvyudlkv") "dbplyr_szhyvbpgox"

Created on 2019-01-10 by the reprex package (v0.2.1.9000)

@hadley
Copy link
Member Author

hadley commented Jan 10, 2019

This is going to require a richer data structure in select_query(); currently the components of the SELECT are just strings.

@romatik
Copy link

romatik commented Jan 13, 2019

One use-case where this becomes particularly relevant is DB's without very "smart" optimizer. At work we are switching from Postgres to CockroachDB and multiple nested queries might become a problem. Postgres is smart enough to optimize, but CockroachDB is likely going to be extra-slow because of them.

@hadley hadley added feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL labels Feb 4, 2019
@hadley
Copy link
Member Author

hadley commented Feb 5, 2019

Or does this optimisation need to be performed by select()? Otherwise how would we distinguish between a subqueries created by multiple selects, vs a subquery generated by mutate()?

Maybe select() applied to select(), mutate(), transmute(), or summarise() could modify the previous op rather than creating a new one?

@hadley
Copy link
Member Author

hadley commented Feb 5, 2019

I think we should be able to simplify the following SQL:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

memdb_frame(x = 1, y = 2) %>% 
  filter(x > 1) %>% 
  mutate(z = x + 2) %>% 
  select(y) %>% 
  show_query()
#> <SQL>
#> SELECT `y`
#> FROM (SELECT `x`, `y`, `x` + 2.0 AS `z`
#> FROM (SELECT *
#> FROM `dbplyr_yicivbbksg`
#> WHERE (`x` > 1.0)))

But we wouldn't try and simplify code this:

memdb_frame(x = 1, y = 2) %>% 
  mutate(z = x + 2) %>% 
  filter(z > 1) %>% 
  select(y) %>% 
  show_query()
#> <SQL>
#> SELECT `y`
#> FROM (SELECT `x`, `y`, `x` + 2.0 AS `z`
#> FROM `dbplyr_owxqpctbzw`)
#> WHERE (`z` > 1.0)

Since that would require a full dependency analysis of what variables are used by each stage.

@hadley
Copy link
Member Author

hadley commented Feb 5, 2019

I think the best way to do this will be to unify the op underlying mutate(), transmute(), select() and rename(), so that any the combination of any two always produces a single op (exception when there are interrelated variables)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL
Projects
None yet
Development

No branches or pull requests

2 participants