Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rethink optimization of ORDER BY #276

Closed
krlmlr opened this issue Apr 9, 2019 · 4 comments · Fixed by #501
Closed

Rethink optimization of ORDER BY #276

krlmlr opened this issue Apr 9, 2019 · 4 comments · Fixed by #501
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL
Milestone

Comments

@krlmlr
Copy link
Member

krlmlr commented Apr 9, 2019

ORDER BY clauses in subqueries are problematic:

As a stop gap, we could remove ORDER BY clauses from inner queries if we don't combine the inner and the outer query. With a warning. This would also solve #275. I wonder if this generates spurious warnings, e.g., for window functions.

Alternatively, we could allow arrange() only at the very last step in the pipe, and transform it to window_order() if additional verbs are added, with a warning. This will be a problem later when we want to support lazy operations for data frames.

For a true solution, we cannot simply delay the computation due to aliasing (#94 (comment)). A few examples, and a reprex illustrating the problem (using #277), are below. This might well be out of scope.

# Input:
lazy_frame(a = 1:3, b = 4:2) %>%
  mutate(a = -a) %>%
  arrange(a) %>%
  mutate(a = -a)

# Equivalent with `arrange()` delayed:
lazy_frame(a = 1:3, b = 4:2) %>%
  mutate(a = -a) %>%
  mutate(zzz_001 = a) %>%
  mutate(a = -a) %>%
  arrange(zzz_001) %>%
  select(-zzz_001)

# Input:
lazy_frame(a = 1:3, b = 4:2) %>%
  arrange(a) %>%
  group_by(b) %>%
  mutate(c = cumsum(a)) %>%
  ungroup()

# Equivalent with `arrange()` delayed:
lazy_frame(a = 1:3, b = 4:2) %>%
  window_order(a) %>%
  group_by(b) %>%
  mutate(c = cumsum(a)) %>%
  ungroup() %>%
  arrange(a)
library(tidyverse)
devtools::load_all("~/git/R/dbplyr")
#> Loading dbplyr
#> 
#> Attaching package: 'testthat'
#> The following object is masked from 'package:dplyr':
#> 
#>     matches
#> The following object is masked from 'package:purrr':
#> 
#>     is_null
#> Registering testing src: df OK
#> Registering testing src: sqlite OK
#> Registering testing src: mysql OK
#> Registering testing src: MariaDB OK
#> Registering testing src: postgres OK
#> Registering testing src: MSSQL OK


test_frame(a = 1:3, b = 4:2) %>%
  map(
    . %>%
      mutate(a = -a) %>%
      arrange(a) %>%
      mutate(a = -a)
  )
#> Created a temporary table named: ##dbplyr_001
#> $df
#> # A tibble: 3 x 2
#>       a     b
#>   <int> <int>
#> 1     3     2
#> 2     2     3
#> 3     1     4
#> 
#> $sqlite
#> # Source:     lazy query [?? x 2]
#> # Database:   sqlite 3.25.3 [:memory:]
#> # Ordered by: a
#>       a     b
#>   <int> <int>
#> 1     3     2
#> 2     2     3
#> 3     1     4
#> 
#> $mysql
#> # Source:     lazy query [?? x 2]
#> # Database:   mysql 5.5.5-10.1.38-MariaDB-0ubuntu0.18.04.1
#> #   [kirill@localhost:/test]
#> # Ordered by: a
#>       a     b
#>   <dbl> <int>
#> 1     1     4
#> 2     2     3
#> 3     3     2
#> 
#> $MariaDB
#> # Source:     lazy query [?? x 2]
#> # Database:   mysql 5.5.5-10.1.38-MariaDB-0ubuntu0.18.04.1
#> #   [kirill@localhost:/test]
#> # Ordered by: a
#>   a                   b
#>   <S3: integer64> <int>
#> 1 1                   4
#> 2 2                   3
#> 3 3                   2
#> 
#> $postgres
#> # Source:     lazy query [?? x 2]
#> # Database:   postgres 9.6.9 [kirill@/var/run/postgresql:5432/kirill]
#> # Ordered by: a
#>       a     b
#>   <int> <int>
#> 1     3     2
#> 2     2     3
#> 3     1     4
#> 
#> $MSSQL
#> # Source:     lazy query [?? x 2]
#> # Database:   Microsoft SQL Server 12.00.1300[@cynkra-mssql/main]
#> # Ordered by: a
#>       a     b
#>   <int> <int>
#> 1     1     4
#> 2     2     3
#> 3     3     2

Created on 2019-04-09 by the reprex package (v0.2.1.9000)

@hadley
Copy link
Member

hadley commented Sep 21, 2020

My copy of "SQL-99 complete, really" says" There can be no ORDER BY clause inside the subquery" in fully-conformant DBMSs.

I'd prefer to make this into a warning.

@krlmlr
Copy link
Member Author

krlmlr commented Sep 22, 2020

#471 should never emit ORDER BY inside subqueries?

@hadley
Copy link
Member

hadley commented Sep 22, 2020

Yeah, I just re-looked at #471 — it seems a bit too risky to me; I'd prefer to make the minimum change to make this work. But I'm still coming up to speed with dbplyr, so I'll have a better idea in a few days.

@krlmlr
Copy link
Member Author

krlmlr commented Sep 22, 2020

Yeah, it's a bit scary. The commits are self-contained, how about splitting them into individual pull requests?

@hadley hadley added this to the 2.0.0 milestone Sep 24, 2020
hadley added a commit that referenced this issue Sep 25, 2020
hadley added a commit that referenced this issue Sep 25, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature a feature request or enhancement verb trans 🤖 Translation of dplyr verbs to SQL
Projects
None yet
2 participants