Optimise more nested selects #213

hadley · 2019-01-10T18:59:09Z

library(dplyr, warn.conflicts = FALSE)
iris_db <- dbplyr::tbl_lazy(iris)
iris_db %>% 
  select(2:1) %>% 
  select(2:1) %>% 
  select(2:1) %>% 
  show_query()
#> <SQL> SELECT "Sepal.Width", "Sepal.Length"
#> FROM (SELECT "Sepal.Length", "Sepal.Width"
#> FROM (SELECT "Sepal.Width", "Sepal.Length"
#> FROM "df") "dbplyr_tulvyudlkv") "dbplyr_szhyvbpgox"

Created on 2019-01-10 by the reprex package (v0.2.1.9000)

hadley · 2019-01-10T20:05:12Z

This is going to require a richer data structure in select_query(); currently the components of the SELECT are just strings.
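As a rough illustration of what a richer structure could buy, here is a minimal base R sketch. It assumes each SELECT is stored as a named character vector (output name to input column) rather than as a string. The helpers `compose_selects()` and `render_select()` are hypothetical and are not part of dbplyr.

```r
# Hypothetical representation: a select is a named character vector,
# names = output column names, values = input column names.
compose_selects <- function(inner, outer) {
  # `outer` refers to columns by `inner`'s output names; look up which
  # underlying column each of them maps to.
  unknown <- setdiff(unname(outer), names(inner))
  if (length(unknown) > 0) {
    stop("Unknown column(s): ", paste(unknown, collapse = ", "))
  }
  out <- inner[unname(outer)]
  names(out) <- names(outer)
  out
}

# Render a single SELECT, adding an alias only when a column is renamed.
render_select <- function(cols, from) {
  quoted <- paste0('"', cols, '"')
  aliased <- ifelse(cols == names(cols), quoted,
                    paste0(quoted, ' AS "', names(cols), '"'))
  paste0("SELECT ", paste(aliased, collapse = ", "), "\nFROM \"", from, "\"")
}

# The three select(2:1) calls from the reprex, written out by name.
s1 <- c(Sepal.Width = "Sepal.Width", Sepal.Length = "Sepal.Length")
s2 <- c(Sepal.Length = "Sepal.Length", Sepal.Width = "Sepal.Width")
s3 <- c(Sepal.Width = "Sepal.Width", Sepal.Length = "Sepal.Length")

collapsed <- compose_selects(compose_selects(s1, s2), s3)
sql <- render_select(collapsed, "df")
cat(sql, "\n")
```

With that representation, the three nested selects collapse into one flat query with no subqueries, which is the result the reprex above is asking for.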

romatik · 2019-01-13T17:54:26Z

One use case where this becomes particularly relevant is databases without a very "smart" optimizer. At work we are switching from Postgres to CockroachDB, and multiple nested queries might become a problem. Postgres is smart enough to optimize them, but CockroachDB is likely to be extra slow because of them.

hadley · 2019-02-05T17:03:10Z

Or does this optimisation need to be performed by select()? Otherwise, how would we distinguish between subqueries created by multiple selects and a subquery generated by mutate()?

Maybe select() applied to select(), mutate(), transmute(), or summarise() could modify the previous op rather than creating a new one?

hadley · 2019-02-05T17:06:26Z

I think we should be able to simplify the following SQL:

library(dplyr, warn.conflicts = FALSE)
library(dbplyr, warn.conflicts = FALSE)

memdb_frame(x = 1, y = 2) %>% 
  filter(x > 1) %>% 
  mutate(z = x + 2) %>% 
  select(y) %>% 
  show_query()
#> <SQL>
#> SELECT `y`
#> FROM (SELECT `x`, `y`, `x` + 2.0 AS `z`
#> FROM (SELECT *
#> FROM `dbplyr_yicivbbksg`
#> WHERE (`x` > 1.0)))

But we wouldn't try to simplify code like this:

memdb_frame(x = 1, y = 2) %>% 
  mutate(z = x + 2) %>% 
  filter(z > 1) %>% 
  select(y) %>% 
  show_query()
#> <SQL>
#> SELECT `y`
#> FROM (SELECT `x`, `y`, `x` + 2.0 AS `z`
#> FROM `dbplyr_owxqpctbzw`)
#> WHERE (`z` > 1.0)

Doing so would require a full dependency analysis of which variables are used by each stage.
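To make the difference between the two pipelines concrete, here is a small base R sketch of one piece of such an analysis: checking whether a later stage refers to a column that an earlier mutate() created. The helper `uses_new_vars()` is hypothetical, and a real analysis would have to cover every verb and every stage.

```r
# Hypothetical helper: does any expression in `exprs` refer to one of the
# columns in `new_vars` (the columns created by an earlier mutate())?
uses_new_vars <- function(exprs, new_vars) {
  used <- unique(unlist(lapply(exprs, all.vars)))
  any(used %in% new_vars)
}

# Second pipeline: filter(z > 1) needs the z created by mutate().
second <- uses_new_vars(list(quote(z > 1)), new_vars = "z")

# First pipeline: select(y) after mutate(z = x + 2) never touches z.
first <- uses_new_vars(list(quote(y)), new_vars = "z")
```

Here `second` is TRUE, so the mutate() has to stay in its own subquery, while `first` is FALSE, so the trailing select() could in principle be folded into the previous step.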

hadley · 2019-02-05T22:03:20Z

I think the best way to do this will be to unify the op underlying mutate(), transmute(), select() and rename(), so that the combination of any two always produces a single op (except when there are interrelated variables).
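One way to picture such a unified op, sketched in base R under the assumption that every projection-like verb is stored as a named list of quoted expressions (output name to expression over the input columns). `merge_projections()` is a hypothetical helper, not the change made in dbplyr.

```r
# Hypothetical unified op: a named list of quoted expressions.
# select()/rename() entries are bare symbols; mutate() entries may be calls.
merge_projections <- function(inner, outer) {
  # Columns that `inner` computes rather than passes through or renames.
  computed <- names(inner)[!vapply(inner, is.symbol, logical(1))]
  used <- unique(unlist(lapply(outer, all.vars)))
  if (any(used %in% computed)) {
    return(NULL)  # interrelated variables: keep two ops (a subquery)
  }
  # Otherwise rewrite each outer expression in terms of inner's inputs.
  lapply(outer, function(e) do.call("substitute", list(e, inner)))
}

# mutate(z = x + 2) on a table with columns x and y.
mutated <- list(x = quote(x), y = quote(y), z = quote(x + 2))

merged  <- merge_projections(mutated, list(y = quote(y)))      # select(y)
blocked <- merge_projections(mutated, list(w = quote(z * 2)))  # mutate(w = z * 2)
renamed <- merge_projections(list(a = quote(x)), list(b = quote(a)))
```

In this sketch `merged` is a single op that selects `y`, `renamed` maps `b` straight to the underlying `x`, and `blocked` is NULL because the outer expression depends on a column the inner op computes. Falling back to a subquery in that case is a conservative choice; inlining the computed expression would be another option.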

hadley mentioned this issue Jan 10, 2019

Missing columns cause errors with union_all in postgres #183

Closed

hadley added the feature (a feature request or enhancement) and verb trans 🤖 (Translation of dplyr verbs to SQL) labels Feb 4, 2019

hadley mentioned this issue Feb 6, 2019

Explore poor query performance #95

Closed

hadley closed this as completed in 78fc71a Feb 6, 2019
