
Added the extract job: BigQuery -> Cloud Storage #119

Merged: 1 commit into r-dbi:master on Apr 20, 2017

Conversation

@realAkhmed (Contributor) commented Jun 22, 2016:

Hello. I have been using this function for a while in my personal fork and always wanted to share it with the community.

This is the code for running a BigQuery extract job: it takes a BigQuery table and extracts it into one or more CSV files inside a Cloud Storage bucket.

The following code shows an example of how to use it:

# specify your project ID here
project <- "<my_project_id>"

# specify your Cloud Storage bucket name (note the wildcard)
bucket <- "gs://<my_bucket>/shakespeare*.csv"

# now run extract_exec -- it will return the number of files that were extracted
extract_exec("publicdata:samples.shakespeare", project = project, destinationUris = bucket)

Shakespeare is a small dataset, so you won't get charged much for this example.

The structure of this file is modeled after insert_upload_job and insert_query_job.
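
For context, the request body such an extract job sends to the BigQuery Jobs API looks roughly like the sketch below; the field names follow the configuration.extract section of the REST reference, while the project, dataset, and bucket values are just placeholders, and the helper actually used to POST it is up to the package:

# Illustrative only: the shape of an extract-job request body
# (configuration.extract per the BigQuery Jobs REST reference).
body <- list(
  configuration = list(
    extract = list(
      sourceTable = list(
        projectId = "publicdata",
        datasetId = "samples",
        tableId   = "shakespeare"
      ),
      destinationUris   = list("gs://<my_bucket>/shakespeare*.csv"),
      destinationFormat = "CSV",
      compression       = "NONE",
      fieldDelimiter    = ",",
      printHeader       = TRUE
    )
  )
)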

#' either as a string in the format used by BigQuery, or as a list with
#' \code{project_id}, \code{dataset_id}, and \code{table_id} entries
#' @param project project name
#' @param destinationUris Specify the extract destination URI. Note: for large files, you may need to specify a wild-card since

@craigcitro (Collaborator) commented Jul 16, 2016:

I think you lost the end of the sentence?

@hadley (Member) commented Apr 18, 2017:

Also needs to be wrapped

}


#' Run an asynchronous extract job and wait till it is done

@craigcitro (Collaborator) commented Jul 16, 2016:

nit: . at end of sentence.

@hadley (Member) commented Apr 18, 2017:

FWIW this is a title, and titles don't usually end in periods

#' # Now run the extract_exec - it will return the number of files that were extracted
#' extract_exec("publicdata:samples.shakespeare", project = project, destinationUris = bucket)
#' }
extract_exec <- function(source_table, project, destinationUris,

@craigcitro (Collaborator) commented Jul 16, 2016:

extract_exec is a little weird -- maybe just extract? (I know it mirrors query_exec, but I also think that name is weird. 😉 )

@hadley (Member) commented Apr 18, 2017:

Yeah, I think extract() would be better too

fieldDelimiter, printHeader)
job <- wait_for(job)

if(job$status$state == "DONE") {

@craigcitro (Collaborator) commented Jul 16, 2016:

Actually, you still need to check for errors, right?
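
For reference, one hedged sketch of the kind of check being asked for, based on the documented job resource (a failed job carries status$errorResult, so testing state == "DONE" alone is not enough); check_extract_job() here is a hypothetical helper, not code from this PR:

# Sketch: fail loudly if the finished job reports an error (status$errorResult
# holds the fatal error for a failed job, per the BigQuery job resource docs).
check_extract_job <- function(job) {
  if (!is.null(job$status$errorResult)) {
    stop("Extract job failed: ", job$status$errorResult$message, call. = FALSE)
  }
  job$statistics$extract$destinationUriFileCounts
}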

@hadley (Member) commented Apr 18, 2017:

@realAkhmed are you interested in finishing off this PR?

@realAkhmed (Contributor, Author) commented Apr 18, 2017:

@hadley Absolutely! I just wasn't sure whether the package was still being actively developed.

This particular PR was very useful for me internally since it opens the way for what I call collect_by_export(): collecting huge results from BigQuery by automatically saving a query to a temporary table, exporting it to CSV, downloading the CSV, and then parsing it locally. This works much faster than a regular collect() on very large query results.
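
As a rough illustration only, a sketch of that collect_by_export() workflow built on this PR's insert_extract_job() together with bigrquery's insert_query_job() and wait_for(); the scratch dataset, the gsutil download step, and all names and paths are assumptions rather than code from this PR:

# Rough sketch of the collect_by_export() idea described above. Assumes a
# scratch dataset for the temporary table and a local gsutil install for the
# download step; all identifiers here are placeholders.
collect_by_export <- function(query, project, bucket_uri, local_dir = tempdir()) {
  # 1. Run the query into a temporary destination table
  tmp_table <- list(project_id = project, dataset_id = "scratch", table_id = "tmp_export")
  job <- insert_query_job(query, project, destination_table = tmp_table)
  wait_for(job)

  # 2. Extract the temporary table to CSV shards in Cloud Storage
  job <- insert_extract_job(tmp_table, project, destinationUris = bucket_uri)
  wait_for(job)

  # 3. Download the shards and bind them into one data frame
  system2("gsutil", c("-m", "cp", shQuote(bucket_uri), shQuote(local_dir)))
  files <- list.files(local_dir, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}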

@craigcitro (Collaborator) commented Apr 18, 2017:

@realAkhmed agreed that this is indeed a faster route in general -- but CSV as a transport format isn't great when your table includes nested or repeated fields.

I'd be curious what @hadley knows about potentially converting Avro to a data frame, since that would give us full-fidelity exports.
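
As a side note, the extract-job configuration accepts Avro output, so the export half would only need a different destinationFormat; reading the Avro back into a data frame is the open question above. A hedged sketch of such a configuration (CSV-only options omitted, bucket URI a placeholder):

# Sketch: asking for Avro instead of CSV in the extract configuration
# (destinationFormat = "AVRO" is a documented option; fieldDelimiter and
# printHeader apply only to CSV, so they are omitted here).
extract_avro <- list(
  sourceTable = list(
    projectId = "publicdata",
    datasetId = "samples",
    tableId   = "shakespeare"
  ),
  destinationUris   = list("gs://<my_bucket>/shakespeare-*.avro"),
  destinationFormat = "AVRO"
)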

#' @export
insert_extract_job <- function(source_table, project, destinationUris,
                               compression = "NONE", destinationFormat = "CSV",
                               fieldDelimiter = ",", printHeader = TRUE) {

@hadley (Member) commented Apr 18, 2017:

These should use snake case to be consistent with the rest of bigrquery

@hadley (Member) commented Apr 18, 2017:

I'd also recommend putting one parameter on each line.

job <- wait_for(job)

if(job$status$state == "DONE") {
(job$statistics$extract$destinationUriFileCounts)

@hadley (Member) commented Apr 18, 2017:

Remove extra parens?

@IronistM commented Apr 19, 2017:

💯 claps for this!

@hadley (Member) commented Apr 20, 2017:

I'm just going to merge this so I can work on it locally.

@hadley merged commit acdd1da into r-dbi:master on Apr 20, 2017

1 check passed: continuous-integration/travis-ci/pr (The Travis CI build passed)
@realAkhmed (Contributor, Author) commented Apr 21, 2017:

Thanks @hadley! I'm somewhat slow to respond at the moment (the teaching season) -- I will come back to address the issues raised by you and @craigcitro once I'm done with it!

Zsedo pushed a commit to Zsedo/bigrquery that referenced this pull request Jun 26, 2017
