Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added db_save_query.bigquery so compute() now works #52

Closed
wants to merge 2 commits into from

Conversation

@realAkhmed
Copy link
Contributor

commented Jul 1, 2015

Hello everyone! This pull request features an implementation of db_save_query for "bigquery". This is my first attempt to contribute to bigrquery project so I apologize in advance if I am missing some key details (or may be even being overly verbose in describing it)

NOTE: had to resubmit pull request since had to move it from master to a separate branch in my repository. (The original pull request is #51 .)

Motivation

The motivation for this pull request is that the current implementation of bigrquery allows one to use collect function (that is, saving the results locally) but not compute function (that is, saving the results remotely).

Compute function seems to be extremely important when working with big datasets as it allows one to keep big temporary remotely results in Google BigQuery without downloading them locally and use them in subsequent queries.

Implementation Details

  • compute function fails only because of the lack of db_save_query.bigquery.
    Fortunately, query_exec facilities allow saving the query result and thus, db_save_query could potentially be implemented as a hook to query_exec with a particular destination_table.
  • temporary = TRUE is not currently implemented and is currently ignored so all tables are created as persistent.

Tests

The code belows demonstrates a simple test of compute function.

library(dplyr)
library(bigrquery)

BILLING_PROJECT <- "<YOUR_BILLING_PROJECT>"
BILLING_DATASET <- "<DATASET_INSIDE_THAT_PROJECT>"

bq <- src_bigquery(BILLING_PROJECT, BILLING_DATASET)

# Upload the test data frame to your project
df <- data.frame(a = 1:100, b = 101:200)
test_df <- copy_to(bq, df, name = "test_df")

# Perform a simple query and save the result remotely
test_df2 <- test_df %>%
              summarise( n = n() ) %>%
              compute( name = "test_df2", temporary = FALSE)

# Display the result
test_df2 %>% collect

Please note that running the test does require to have a Google Cloud billing project set up already.

Thank you for consideration!

@craigcitro

This comment has been minimized.

Copy link
Collaborator

commented Jul 3, 2015

thanks for the patch! (and sorry i've been slow at getting a chance to look at them.)

as a high-level question, i'm curious why you'd want to default to a non-temporary table -- the default is for query results to live in temporary tables, so it seems like the easiest approach might be to just go with that?

a few related questions coming up inline.

if (is.null(table$project_id)) table$project_id <- con$project
if (is.null(table$dataset_id)) table$dataset_id <- con$dataset

if ( table$table_id %in% list_tables(table$project_id, table$dataset_id) ) {

This comment has been minimized.

Copy link
@craigcitro

craigcitro Jul 3, 2015

Collaborator

i definitely don't think we want the default behavior to be to delete the table -- it seems like we'd want to bail in this case?

This comment has been minimized.

Copy link
@craigcitro

craigcitro Jul 3, 2015

Collaborator

we also probably want to just check if the table exists, instead of listing all tables. (the latter could take multiple API calls, if there are many tables in this dataset.)

This comment has been minimized.

Copy link
@hadley

hadley Apr 18, 2017

Member

This could just call exists_table().

But I'm not sure it's even necessary to check — you could just rely on query_exec() to throw an error.

@@ -130,6 +130,32 @@ query.bigquery <- function(con, sql, .vars) {
BigQuery$new(con, sql(sql), .vars)
}

#' @export
#' @importFrom dplyr db_save_query
db_save_query.bigquery <- function(con, sql, name, temporary=TRUE, ...) {

This comment has been minimized.

Copy link
@craigcitro

craigcitro Jul 3, 2015

Collaborator

given that we don't support temporary = TRUE, it seems like we should raise an error (or at least warn) if that value is provided?

This comment has been minimized.

Copy link
@hadley

hadley Apr 18, 2017

Member

Also needs spaces around =

@realAkhmed

This comment has been minimized.

Copy link
Contributor Author

commented Jul 4, 2015

@craigcitro Thanks for the feedback! Yes, agreed. The code was ignoring temporary = TRUE completely. My issue was that I am not sure if compute with temporary = TRUE is well defined with BQ. (Is it possible to create a temporary table with a given name?) So, I completely ignored it till now.

As of now, I see that one solution here could be to add an error for two cases:

  • any call with temporary = TRUE, or
  • an attempt to overwrite existing table.

I am not sure I understood how to solve the issue with table$table_id %in% list_tables(...) though since this is also how db_has_table also works. Is there a faster approach? (Is it about using get_table?)

@realAkhmed

This comment has been minimized.

Copy link
Contributor Author

commented Jul 5, 2015

Updated the pull request with error checks. Still using table$table_id %in% list_tables(...) approach borrowed from db_has_table.

table <- parse_table(name)

if (is.null(table$project_id)) table$project_id <- con$project
if (is.null(table$dataset_id)) table$dataset_id <- con$dataset

This comment has been minimized.

Copy link
@hadley

hadley Apr 18, 2017

Member

This code needs to get pulled out into a separate function


}

query_exec(query = sql,

This comment has been minimized.

Copy link
@hadley

hadley Apr 18, 2017

Member

Please see http://style.tidyverse.org/syntax.html#long-lines for preferred coding style

query_exec(query = sql,
project = con$project,
default_dataset = con$dataset,
destination_table = paste0(table$project_id,":",

This comment has been minimized.

Copy link
@hadley

hadley Apr 18, 2017

Member

If an existing function doesn't exist, you should create one here.

@hadley hadley closed this in d7bd9b0 Apr 12, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.