Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Named window expressions #631

Merged
merged 20 commits into from
Apr 21, 2022
Merged

Named window expressions #631

merged 20 commits into from
Apr 21, 2022

Conversation

mgirlich
Copy link
Collaborator

@mgirlich mgirlich commented Apr 6, 2021

Fixes #624.
After I also wished for named windows a couple of times I had a first go at it. The idea is relatively simple:

In sql_build.op_select() have (possibly) two runs of translating the variables. In the first run win_over() registers every window it uses. After this first translation create a name for every window that is used more than once. Then translate the variables again but this time use the window name if one is available.

Example SQL

library(dplyr)
library(dbplyr)

lf <- lazy_frame(
  col1 = runif(3),
  col2 = runif(3),
  col3 = runif(3),
  col4 = runif(3),
  part = c("a", "a", "b"),
  ord = 3:1
) %>% 
  group_by(part) %>% 
  window_order(ord)

# window only used once -> no need to name the window
lf %>% 
  mutate(
    across(c(col1), sum, na.rm = TRUE),
    across(c(col3), ~ order_by(desc(ord), cumsum(.x)))
  )
#> <SQL>
#> SELECT SUM(`col1`) OVER (PARTITION BY `part`) AS `col1`, `col2`, SUM(`col3`) OVER (PARTITION BY `part` ORDER BY `ord` DESC ROWS UNBOUNDED PRECEDING) AS `col3`, `col4`, `part`, `ord`
#> FROM `df`

lf %>% 
  mutate(
    across(c(col1, col2), sum, na.rm = TRUE),
    across(c(col3, col4), ~ order_by(desc(ord), cumsum(.x)))
  )
#> <SQL>
#> SELECT SUM(`col1`) OVER (`win1`) AS `col1`, SUM(`col2`) OVER (`win1`) AS `col2`, SUM(`col3`) OVER (`win2`) AS `col3`, SUM(`col4`) OVER (`win2`) AS `col4`, `part`, `ord`
#> FROM `df`
#> WINDOW `win1` AS (PARTITION BY `part`), `win2` AS (PARTITION BY `part` ORDER BY `ord` DESC ROWS UNBOUNDED PRECEDING)

Created on 2021-04-06 by the reprex package (v2.0.0)

Do you like the overall approach?

check window alias support for

With a method supports_window_clause.<backend>() it would be quite simple to only use named windows if they are supported.

* make two translation passes in `sql_build.op_select()`
* register windows in `win_over()`
* add `window` to `select_query()`
* suport `window` in `print.select_query()` and `sql_render.select_query()`
* add `sql_clause_window()`
* add argument `window` to `sql_query_select.*()`
@mgirlich mgirlich requested a review from krlmlr July 30, 2021 11:57
@hadley
Copy link
Member

hadley commented Dec 9, 2021

Could you give me a couple of examples of before and after? I want to think about this a bit.

@mgirlich
Copy link
Collaborator Author

mgirlich commented Dec 10, 2021

Examples

I formatted the examples a bit for better readability and comparison.

Window used only once

lf <- lazy_frame(
  col1 = runif(3),
  col2 = runif(3),
  col3 = runif(3),
  col4 = runif(3),
  part = c("a", "a", "b"),
  ord = 3:1
) %>% 
  group_by(part) %>% 
  window_order(ord)

lf %>% 
  transmute(
    across(c(col1), sum, na.rm = TRUE),
    across(c(col3), ~ order_by(desc(ord), cumsum(.x)))
  )

Before = After

SELECT
  SUM(`col1`) OVER (PARTITION BY `part`) AS `col1`,
  SUM(`col3`) OVER (PARTITION BY `part` ORDER BY `ord` DESC ROWS UNBOUNDED PRECEDING) AS `col3`,
FROM `df`

Two windows

lf %>% 
  transmute(
    across(c(col1, col2), sum, na.rm = TRUE),
    across(c(col3, col4), ~ order_by(desc(ord), cumsum(.x)))
  )

Before

SELECT
  SUM(`col1`) OVER (PARTITION BY `part`) AS `col1`,
  SUM(`col2`) OVER (PARTITION BY `part`) AS `col2`,
  SUM(`col3`) OVER (PARTITION BY `part` ORDER BY `ord` DESC ROWS UNBOUNDED PRECEDING) AS `col3`,
  SUM(`col4`) OVER (PARTITION BY `part` ORDER BY `ord` DESC ROWS UNBOUNDED PRECEDING) AS `col4`
FROM `df`

After

SELECT SUM(`col1`) OVER `win1` AS `col1`,
       SUM(`col2`) OVER `win1` AS `col2`,
       SUM(`col3`) OVER `win2` AS `col3`,
       SUM(`col4`) OVER `win2` AS `col4`
FROM `df`
WINDOW `win1` AS (PARTITION BY `part`),
       `win2` AS (PARTITION BY `part` ORDER BY `ord` DESC ROWS UNBOUNDED PRECEDING)

mode.com example

lazy_frame(
  start_terminal = "Munich",
  duration_seconds = 3913L,
  start_time = "2012-01-08 11:36:09"
) %>% 
  filter(start_time < "2012-01-08") %>% 
  group_by(start_terminal) %>% 
  # Bug: `window_order()` does not work for `ntile()`
  # window_order(duration_seconds) %>% 
  transmute(
    start_terminal,
    quartile = ntile(duration_seconds, n = 4),
    quintile = ntile(duration_seconds, n = 5),
    percentile = ntile(duration_seconds, n = 100)
  )

Before

SELECT `start_terminal`,
       NTILE(4) OVER (PARTITION BY `start_terminal` ORDER BY `duration_seconds`) AS `quartile`,
       NTILE(5) OVER (PARTITION BY `start_terminal` ORDER BY `duration_seconds`) AS `quintile`,
       NTILE(100) OVER (PARTITION BY `start_terminal` ORDER BY `duration_seconds`) AS `percentile`
FROM `df`
WHERE (`start_time` < '2012-01-08')

After

SELECT `start_terminal`,
       NTILE(4) OVER `win1` AS `quartile`,
       NTILE(5) OVER `win1` AS `quintile`,
       NTILE(100) OVER `win1` AS `percentile`
FROM `df`
WHERE (`start_time` < '2012-01-08')
WINDOW `win1` AS (PARTITION BY `start_terminal` ORDER BY `duration_seconds`)

@mgirlich
Copy link
Collaborator Author

It would be nice if the user could name the window but that is probably difficult to understand/use (e.g. in example 2 with sum() and cumsum() using different windows).
Seeing example 2 it might be worth to include some information about the windows in the window names, e.g.

WINDOW `win_part_1` AS (PARTITION BY `part`),
       `win_part_order_1` AS (PARTITION BY `part` ORDER BY `ord` DESC ROWS UNBOUNDED PRECEDING)

or

WINDOW `win_agg_1` AS (PARTITION BY `part`),
       `win_over_1` AS (PARTITION BY `part` ORDER BY `ord` DESC ROWS UNBOUNDED PRECEDING)

so that one can more easily distinguish between sum() and cumsum() in the SQL code.

@hadley
Copy link
Member

hadley commented Dec 10, 2021

Were you thinking including the variable names or the component names? (It's hard to tell from your example). I'd have a mild preference for including variable names, but that does raise the question of what do for long variable names or compound expressions. I think I'd be more tempted just to stick with indices than something that requires a bunch of work (since most people won't read the generated SQL, I don't think we should invest too much effort here).

@mgirlich
Copy link
Collaborator Author

mgirlich commented Jan 3, 2022

Now I realize that my original idea doesn't work anyway, so I would also just stick with simple numbering.

@hadley
Copy link
Member

hadley commented Apr 1, 2022

Do you want to try and get this into the release too?

Conflicts:
	R/backend-access.R
	R/backend-mssql.R
	R/backend-oracle.R
	R/backend-teradata.R
	R/db-sql.R
	R/sql-clause.R
	R/verb-select.R
@mgirlich
Copy link
Collaborator Author

mgirlich commented Apr 6, 2022

Would be nice to also have this in the release but I don't mind if it doesn't make it into the release.

R/backend-hana.R Outdated Show resolved Hide resolved
R/backend-mysql.R Outdated Show resolved Hide resolved
Comment on lines 222 to 223
win_reset()
win_register_deactivate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these should be in an on.exit()/defer() for safety

named_windows <- win_register_names()
if (nrow(named_windows) > 0 && supports_window_clause(con)) {
# need to translate again and use registered windows names
win_register_deactivate()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this come immediately after translate_select_sql()?

tests/testthat/test-translate-sql-window.R Outdated Show resolved Hide resolved
@mgirlich mgirlich merged commit d338262 into main Apr 21, 2022
@mgirlich mgirlich deleted the named-window-expressions branch April 25, 2022 11:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[feature request] named window expressions?
2 participants