Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for quantile() #169

Closed
halldc opened this issue Sep 26, 2018 · 15 comments

Comments

@halldc
Copy link

@halldc halldc commented Sep 26, 2018

Would it be possible to add support for the quantile() function?

Quite a few databases support PERCENTILE_CONT and PERCENTILE_DISC, which I think could make this possible (e.g. Google BigQuery, PostgreSQL, and Redshift).

I'd be willing to help of course, but would need some pointers on where to start.

@Prometheus77

This comment was marked as resolved.

Copy link

@Prometheus77 Prometheus77 commented Oct 30, 2018

I feel like there might be a way to do this with win_over(). I'm assuming you want to be able to write an expression such as:

SELECT P90 = PERCENTILE_CONT(0.9) WITHIN GROUP (ORDER BY Score) OVER (PARTITION BY [Month])

Here's the documentation for win_over:
https://dbplyr.tidyverse.org/reference/win_over.html

The closest I can get to this is:

dbplyr::win_over(sql("percentile_cont(0.9)"), "Month", order = "score")

Which yields:

<SQL> percentile_cont(0.9) OVER (PARTITION BY "Month" ORDER BY "score")

I'd have to spend more time with it to see if I can get it to do what I want.

@hadley

This comment was marked as resolved.

Copy link
Member

@hadley hadley commented Jan 2, 2019

@halldc If you wanted to help, the place to start would be by making a list of databases (starting with SQLite, MySQL, and Postgres, and then continuing as long as you like) with a link to the documentation for their quantile function.

@hadley hadley added this to the v1.4.0 milestone Jan 9, 2019
@edavidaja

This comment was marked as resolved.

Copy link

@edavidaja edavidaja commented Jan 19, 2019

I'll work on this.

@batpigandme working on for tidy dev day

@edavidaja

This comment was marked as resolved.

Copy link

@edavidaja edavidaja commented Jan 19, 2019

engine syntax
msft sql server PERCENTILE_CONT ( numeric_literal ) WITHIN GROUP ( ORDER BY order_by_expression [ ASC | DESC ] ) OVER ( [ <partition_by_clause> ] )
teradata approx_percentile(x, percentage) ;
hive percentile(BIGINT col, array(p1 [, p2]...))
impala work in progress
postgres percentile_cont(fraction) WITHIN GROUP (ORDER BY sort_expression)
redshift PERCENTILE_CONT ( percentile ) WITHIN GROUP (ORDER BY expr) OVER ( [ PARTITION BY expr_list ] )
salesforce percentile_cont(p) within group (order by expr [asc | desc])
monetdb QUANTILE '(' scalar_expr ',' scalar_expr ')'
mysql not supported
mariadb SELECT name, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY star_rating) OVER (PARTITION BY name) AS pc FROM book_rating;
sqlite not supported
bigquery PERCENTILE_CONT (value_expression, percentile [{RESPECT | IGNORE} NULLS])
@hadley

This comment was marked as resolved.

Copy link
Member

@hadley hadley commented Jan 19, 2019

@edavidaja can you please reduce the text (by editing your comment) so every thing fits on a single page? It's more important to me to see an overview in one place than to capture all the details (which I can get to by following the links)

@krlmlr

This comment was marked as outdated.

Copy link
Member

@krlmlr krlmlr commented Jan 19, 2019

One way to look at the SQL translation without having access to a real database:

library(tidyverse)
library(dbplyr)
#> 
#> Attaching package: 'dbplyr'
#> The following objects are masked from 'package:dplyr':
#> 
#>     ident, sql

tbl_lazy(iris, src = simulate_dbi("PostgreSQLConnection")) %>%
  transmute(q = quantile(Petal.Length, .25)) %>%
  show_query()
#> <SQL> SELECT QUANTILE(`Petal.Length`, 0.25) AS `q`
#> FROM `df`

Created on 2019-01-19 by the reprex package (v0.2.1)

@hadley

This comment was marked as resolved.

Copy link
Member

@hadley hadley commented Jan 19, 2019

@edavidaja that just hides the details so I have to do even more clicking. I basically want a one screen cheatsheet that lets me see the variation at a glance.

@halldc

This comment was marked as resolved.

Copy link
Author

@halldc halldc commented Jan 22, 2019

Thanks for your work on this @edavidaja ! I'm hoping that in a few weeks' time I should be able to contribute to this task, if that helps.

By the way, I see in your table that you're mixing NTILE with PERCENTILE/QUANTILE functions, but these are used in different contexts.

  • The NTILE function divides the data into a specified number of ranked groupings, and associates each row with one of these groups. See Redshift example. This is similar to dplyr::ntile().
  • The PERCENTILE/QUANTILE functions yield an aggregate statistic (e.g. the median is the 50th percentile). See Redshift example. This is similar to the R function quantile().
@edavidaja

This comment was marked as resolved.

Copy link

@edavidaja edavidaja commented Jan 22, 2019

Thanks @halldc. I've updated the table again--there are a few engines where some percentile_cont like function doesn't appear to be available at all; I've linked to the relevant documentation in case I'm reading it incorrectly (again).

A few engines (postgres, teradata, hive) can accept and return arrays in a way that matches the behavior of quantile in R; the remainder expect a single value.

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Feb 6, 2019

@edavidaja that is perfect, thank you!

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Feb 6, 2019

A few more tweaks to make it a bit easier for me to parse:

Aggregation function:

Window function:

  • sql-server: PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)
  • postgres: PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)
  • redshift: PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)
  • salesforce: PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)
  • mariadb: PERCENTILE_CONT(p) WITHIN GROUP (ORDER BY x)
@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Feb 6, 2019

For the databases that support the PERCENTILE_CONT() window function, I don't see how we can use this in aggregation (i.e. summarise()) context.

The SQL server docs seem to be the only place that addresses this problem and suggests using DISTINCT:

SELECT DISTINCT DepartmentName  
,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY BaseRate)  
    OVER (PARTITION BY DepartmentName) AS MedianCont  
,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY BaseRate)  
    OVER (PARTITION BY DepartmentName) AS MedianDisc  
FROM dbo.DimEmployee;  

So maybe the best we can do is supply translations for those as window functions, and then suggest the user use distinct()?

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Feb 6, 2019

Oops, those aren't window functions, but are "ordered-set aggregate" functions, and they work just fine with GROUP BY:

library(DBI)

con <- dbConnect(RPostgres::Postgres())
dbGetQuery(con, "
  SELECT cyl, PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY mpg) 
  FROM mtcars 
  GROUP BY cyl  
")
#>   cyl percentile_cont
#> 1   4            26.0
#> 2   6            19.7
#> 3   8            15.2

Created on 2019-02-06 by the reprex package (v0.2.1.9000)

@hadley hadley closed this in 19909fa Feb 6, 2019
@halldc

This comment has been minimized.

Copy link
Author

@halldc halldc commented Feb 6, 2019

Thanks @hadley! 🎉

@hadley

This comment has been minimized.

Copy link
Member

@hadley hadley commented Mar 17, 2019

Looks like the link to teradata was actually to teradata's distribution of presto. It seems like teradata actually is ansi compliant: https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/RgAqeSpr93jpuGAvDTud3w

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
6 participants
You can’t perform that action at this time.