Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimise more nested selects #213

Closed
hadley opened this issue Jan 10, 2019 · 5 comments

Comments

@hadley
Copy link
Member

@hadley hadley commented Jan 10, 2019

library(dplyr, warn.conflicts = FALSE)
iris_db <- dbplyr::tbl_lazy(iris)
iris_db %>% 
  select(2:1) %>% 
  select(2:1) %>% 
  select(2:1) %>% 
  show_query()
#> <SQL> SELECT "Sepal.Width", "Sepal.Length"
#> FROM (SELECT "Sepal.Length", "Sepal.Width"
#> FROM (SELECT "Sepal.Width", "Sepal.Length"
#> FROM "df") "dbplyr_tulvyudlkv") "dbplyr_szhyvbpgox"

Created on 2019-01-10 by the reprex package (v0.2.1.9000)

@hadley

This comment has been minimized.

Copy link
Member Author

@hadley hadley commented Jan 10, 2019

This is going to require a richer data structure in select_query(); currently the components of the SELECT are just strings.

@romatik

This comment has been minimized.

Copy link

@romatik romatik commented Jan 13, 2019

One use-case where this becomes particularly relevant is DB's without very "smart" optimizer. At work we are switching from Postgres to CockroachDB and multiple nested queries might become a problem. Postgres is smart enough to optimize, but CockroachDB is likely going to be extra-slow because of them.

@hadley

This comment has been minimized.

Copy link
Member Author

@hadley hadley commented Feb 5, 2019

Or does this optimisation need to be performed by select()? Otherwise how would we distinguish between a subqueries created by multiple selects, vs a subquery generated by mutate()?

Maybe select() applied to select(), mutate(), transmute(), or summarise() could modify the previous op rather than creating a new one?

@hadley

This comment has been minimized.

Copy link
Member Author

@hadley hadley commented Feb 5, 2019

I think we should be able to simplify the following SQL:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

memdb_frame(x = 1, y = 2) %>% 
  filter(x > 1) %>% 
  mutate(z = x + 2) %>% 
  select(y) %>% 
  show_query()
#> <SQL>
#> SELECT `y`
#> FROM (SELECT `x`, `y`, `x` + 2.0 AS `z`
#> FROM (SELECT *
#> FROM `dbplyr_yicivbbksg`
#> WHERE (`x` > 1.0)))

But we wouldn't try and simplify code this:

memdb_frame(x = 1, y = 2) %>% 
  mutate(z = x + 2) %>% 
  filter(z > 1) %>% 
  select(y) %>% 
  show_query()
#> <SQL>
#> SELECT `y`
#> FROM (SELECT `x`, `y`, `x` + 2.0 AS `z`
#> FROM `dbplyr_owxqpctbzw`)
#> WHERE (`z` > 1.0)

Since that would require a full dependency analysis of what variables are used by each stage.

@hadley

This comment has been minimized.

Copy link
Member Author

@hadley hadley commented Feb 5, 2019

I think the best way to do this will be to unify the op underlying mutate(), transmute(), select() and rename(), so that any the combination of any two always produces a single op (exception when there are interrelated variables)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.