Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

calculate median in bigrquery + dplyr #267

Closed
edgararuiz opened this issue Sep 20, 2018 · 1 comment

Comments

@edgararuiz
Copy link
Contributor

commented Sep 20, 2018

Issue moved from: tidyverse/dbplyr#162

@jialu512 commented on Sep 18, 2018, 5:52 AM UTC:

Hello!

I have a large data stored in google cloud platform. I use bigrquery to make the queries in R. I notice some dplyr functions still work but median() doesn't.

library(bigrquery)
library(dplyr)

tbl(con, "episodes") %>%
filter(Admit_year>=2009) %>%
mutate(ratio=Total_charges/total_cost) %>%
group_by(Admit_year)%>%
summarise_(~median(ratio, na.rm=TRUE)) %>%
arrange(Admit_year)
I tried summarise(median=median(ratio,na.rm=TRUE)) but also failed. The only way it works is if I added "as_tibble()) under the mutate line, then R will "download" the data from the cloud. Then the median function works. My guess is, the median function works "locally", but not as part of a query? The tibble solution works, but it's very time consuming as R needs to download everything before computing. Is there any way to resolve this?

Thank you,
Jialu

@edgararuiz

This comment has been minimized.

Copy link
Contributor Author

commented Sep 20, 2018

The fix is to translate median() to APPROX_QUANTILES(my_var_nane, 2L)[SAFE_ORDINAL(2)] in this section of the code:

sql_translate_env.BigQueryConnection <- function(x) {

I've tested that against a Big Query connection and it worked:

> DBI::dbGetQuery(con, "SELECT 
+ APPROX_QUANTILES(`x`, 2)[SAFE_ORDINAL(2)] 
+ AS `med` ,
+ sum(`x`) as `total` 
+ FROM (SELECT *
+ FROM (SELECT `x`
+ FROM (SELECT `SOURCE_RXCUI`, `SOURCE_TTY`, `SOURCE_NAME`, `TARGET_RXCUI`, `TARGET_TTY`, `TARGET_NAME`, 
+ SAFE_CAST(`SOURCE_RXCUI` AS INT64) AS `x`
+ FROM `rxn_all_pathways_current`) `isaifleutz`) `fkymvfuqso`
+ LIMIT 1000) `gtssbvmztu`")
Complete
Billed: 15.73 MB
Downloading 1 rows in 1 pages.
# A tibble: 1 x 2
     med     total
   <int>     <int>
1 798166 860046199

valentinumbach added a commit to valentinumbach/bigrquery that referenced this issue Jan 20, 2019

add translation for `median()` function
`median()` function translates to `APPROX_QUANTILES(x, 2)[SAFE_ORDINAL(2)]`. This fix was suggested by @edgararuiz (r-dbi#267 (comment)), but apparently not implemented

valentinumbach added a commit to valentinumbach/bigrquery that referenced this issue Jan 23, 2019

add translation for `median()` function
`median()` function translates to `APPROX_QUANTILES(x, 2)[SAFE_ORDINAL(2)]`. This fix was suggested by @edgararuiz (r-dbi#267 (comment)), but apparently not implemented

valentinumbach added a commit to valentinumbach/bigrquery that referenced this issue Jan 23, 2019

add translation for `median()` function
`median()` function translates to `APPROX_QUANTILES(x, 2)[SAFE_ORDINAL(2)]`. This fix was suggested by @edgararuiz (r-dbi#267 (comment)), but apparently not implemented.

@hadley hadley closed this Jan 23, 2019

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.