
Added the extract job: BigQuery -> Cloud Storage #119

Merged 1 commit into r-dbi:master on Apr 20, 2017

Conversation

realAkhmed
Contributor

Hello. I have been using this function for a while in my personal fork and have always wanted to share it with the community.

This is the code for running a BigQuery extract job: it takes a BigQuery table and extracts it into one or more CSV files inside a Cloud Storage bucket.

The following code shows an example of how to use it:

# specify your project ID here
project <- "<my_project_id>"

# specify your Cloud Storage bucket name (note the wildcard)
bucket <- "gs://<my_bucket>/shakespeare*.csv"

# now run extract_exec - it returns the number of files that were extracted
extract_exec("publicdata:samples.shakespeare", project = project, destinationUris = bucket)

Shakespeare is a small dataset, so you won't get charged much for this example.

The structure of this file is modeled after insert_upload_job() and insert_query_job().

#' either as a string in the format used by BigQuery, or as a list with
#' \code{project_id}, \code{dataset_id}, and \code{table_id} entries
#' @param project project name
#' @param destinationUris Specify the extract destination URI. Note: for large files, you may need to specify a wild-card since
Collaborator

I think you lost the end of the sentence?

Member

Also needs to be wrapped.

@hadley
Member

hadley commented Apr 18, 2017

@realAkhmed are you interested in finishing off this PR?

@realAkhmed
Contributor Author

@hadley Absolutely! I just wasn't sure whether the package was still actively developed.

This particular PR was very useful for me internally, since it opens the way for what I call collect_by_export(): collecting huge results from BigQuery by automatically saving a query to a temporary table, exporting it to CSV, downloading the CSV, and then parsing it locally. This works much faster than a regular collect() on very large query results.
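
For concreteness, here is a minimal sketch of what that collect_by_export() workflow could look like on top of insert_query_job(), wait_for(), and the extract_exec() added in this PR. The function name, the bucket handling, and the gsutil download step are illustrative assumptions, not code from the PR:

library(bigrquery)

# Hypothetical sketch of collect_by_export(); not part of this PR.
# Assumes gsutil is installed and authenticated locally.
collect_by_export <- function(query, project, bucket) {
  # 1. Run the query; BigQuery stores the result in a temporary table
  job <- wait_for(insert_query_job(query, project = project))
  tbl <- job$configuration$query$destinationTable
  src <- paste0(tbl$projectId, ":", tbl$datasetId, ".", tbl$tableId)

  # 2. Extract the temporary table to (possibly sharded) CSVs in Cloud Storage
  uris <- paste0("gs://", bucket, "/export-*.csv")
  extract_exec(src, project = project, destinationUris = uris)

  # 3. Download the CSV shards and parse them locally
  local_dir <- tempfile("bq-export-")
  dir.create(local_dir)
  system2("gsutil", c("-m", "cp", uris, local_dir))
  files <- list.files(local_dir, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}

The wildcard in the destination URI matters here: BigQuery won't export more than 1 GB of data to a single file, so large results have to be sharded across multiple CSVs.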

@craigcitro
Collaborator

@realAkhmed agreed that this is indeed a faster route in general -- but CSV as a transport format isn't great when your table includes nested or repeated fields.

I'd be curious what @hadley knows about potentially converting Avro to a data frame, since that would give us full-fidelity exports.

#' @export
insert_extract_job <- function(source_table, project, destinationUris,
                               compression = "NONE", destinationFormat = "CSV",
                               fieldDelimiter = ",", printHeader = TRUE) {
Member

These should use snake_case to be consistent with the rest of bigrquery.

Member

I'd also recommend putting one parameter on each line.
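
Taken together, those two suggestions would give a signature along these lines (the snake_case argument names are my guess at the eventual renaming, not code from the PR):

# Hypothetical revised signature with both review suggestions applied;
# the snake_case names are assumptions, not code from the PR.
insert_extract_job <- function(source_table,
                               project,
                               destination_uris,
                               compression = "NONE",
                               destination_format = "CSV",
                               field_delimiter = ",",
                               print_header = TRUE) {
  # body as in the PR
}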

job <- wait_for(job)

if (job$status$state == "DONE") {
  (job$statistics$extract$destinationUriFileCounts)
Member

Remove extra parens?
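
That is, presumably returning the count without the wrapping parentheses:

# Suggested cleanup (my reading of the review comment, not code from the PR)
if (job$status$state == "DONE") {
  job$statistics$extract$destinationUriFileCounts
}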

@IronistM

💯 claps for this!

@hadley
Member

hadley commented Apr 20, 2017

I'm just going to merge this so I can work on it locally.

@hadley hadley merged commit acdd1da into r-dbi:master Apr 20, 2017
@realAkhmed
Contributor Author

Thanks @hadley! Somewhat slow to respond at the moment (it's teaching season) -- I'll come back to address the issues you and @craigcitro raised once I'm done with it!

Zsedo pushed a commit to Zsedo/bigrquery that referenced this pull request Jun 26, 2017