Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Issue with group_by() %>% summarise(quantile()) on SQL Server #1110

Closed
Tsemharb opened this issue Jan 25, 2023 · 2 comments · Fixed by #1132
Closed

Issue with group_by() %>% summarise(quantile()) on SQL Server #1110

Tsemharb opened this issue Jan 25, 2023 · 2 comments · Fixed by #1132
Milestone

Comments

@Tsemharb
Copy link

SQL code generated for summarise(quantile()) for SQL Sever is invalid.

library(dplyr, warn.conflicts = F)
library(dbplyr, warn.conflicts = F)

data <- tibble::tibble(ind = rep(1:3, 100),
                       value = runif(300))

# desired output of the further generated SQL queries
data %>% 
  group_by(ind) %>% 
  summarise(q05_value = quantile(value, 0.05))
#> # A tibble: 3 x 2
#>     ind q05_value
#>   <int>     <dbl>
#> 1     1    0.0240
#> 2     2    0.0667
#> 3     3    0.0625

# Translation to Postgres works:
df_postgres <- tbl_lazy(data, con = simulate_postgres())
df_postgres %>% 
  group_by(ind) %>%
  summarise(q05_value = quantile(value, 0.05, na.rm = TRUE)) %>% 
  show_query()
#> <SQL>
#> SELECT
#>   `ind`,
#>   PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY `value`) AS `q05_value`
#> FROM `df`
#> GROUP BY `ind`

# Translation to SQL Server doesn't work:
df_mssql <- tbl_lazy(data, con = simulate_mssql())
df_mssql %>% 
  group_by(ind) %>%
  summarise(q05_value = quantile(value, 0.05, na.rm = TRUE)) %>% 
  show_query()
#> <SQL>
#> SELECT
#>   `ind`,
#>   PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY `value`) OVER () AS `q05_value`
#> FROM `df`
#> GROUP BY `ind`

Created on 2023-01-25 with reprex v2.0.2

Attempt to run generated SQL on SQL Server fails with the following error:
Column 'tab.value' is invalid in the select list because it is not contained in either an aggregate function or the GROUP BY clause.

SQL Server uses window function type of syntax for PERCENTILE_CONT() as opposed to Postgres which utilizes aggregate approach.

Working SQL for SQL Server for the demonstrated query would be the following:

SELECT DISTINCT
  ind, PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY value) OVER (PARTITION BY ind) AS q05_value
  FROM tab;

@mgirlich mgirlich added this to the 2.3.1 milestone Feb 2, 2023
@mgirlich
Copy link
Collaborator

mgirlich commented Feb 7, 2023

Hmm, I'm afraid this isn't possible with our translation approach. But I can add a helpful error message.

@Tsemharb
Copy link
Author

Tsemharb commented Feb 7, 2023

Yeah, I did some dbplyr inner working exploration and had the same thought.
But this problem has two faces in some sense. Trying to design dplyr statement which is correctly translated to SQL Server you might use mutate function. But in this case the statement won't work with postgres

library(dplyr, warn.conflicts = F)
library(dbplyr, warn.conflicts = F)

data <- tibble::tibble(ind = rep(1:3, 100),
                       value = runif(300))

# Translation to SQL Server works:
df_mssql <- tbl_lazy(data, con = simulate_mssql())
df_mssql %>% 
  group_by(ind) %>%
  mutate(q05_value = quantile(value, 0.05, na.rm = TRUE)) %>% 
  select (ind, q05_value) %>% 
  distinct()
#> <SQL>
#> SELECT DISTINCT
#>   `ind`,
#>   PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY `value`) OVER (PARTITION BY `ind`) AS `q05_value`
#> FROM `df`

# Translation to Postgres is incorrect:
df_postgres <- tbl_lazy(data, con = simulate_postgres())
df_postgres %>% 
  group_by(ind) %>%
  mutate(q05_value = quantile(value, 0.05, na.rm = TRUE)) %>% 
  select (ind, q05_value) %>% 
  distinct()
#> <SQL>
#> SELECT DISTINCT
#>   `ind`,
#>   PERCENTILE_CONT(0.05) WITHIN GROUP (ORDER BY `value`) OVER (PARTITION BY `ind`) AS `q05_value`
#> FROM `df`

Created on 2023-02-07 with reprex v2.0.2

ERROR: OVER is not supported for ordered-set aggregate percentile_cont

So maybe we should add error message for postgres and other (e.g. redshift) dbms as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants