
Added the extract job: BigQuery -> Cloud Storage #119

Merged 1 commit into r-dbi:master on Apr 20, 2017

Conversation

realAkhmed
Contributor

Hello. I have been using this function for a while in my personal fork and have always wanted to share it with the community.

This is the code for running a BigQuery extract job: it takes a BigQuery table and extracts it into one or more CSV files inside a Cloud Storage bucket.

The following code shows an example of how to use it:

# specify your project ID here
project <- "<my_project_id>"

# specify your Cloud Storage bucket name (note the wildcard)
bucket <- "gs://<my_bucket>/shakespeare*.csv"

# now run extract_exec - it returns the number of files that were extracted
extract_exec("publicdata:samples.shakespeare", project = project, destinationUris = bucket)

Shakespeare is a small dataset, so you won't get charged much for this example.

The structure of this file is modeled after insert_upload_job() and insert_query_job().

#' either as a string in the format used by BigQuery, or as a list with
#' \code{project_id}, \code{dataset_id}, and \code{table_id} entries
#' @param project project name
#' @param destinationUris Specify the extract destination URI. Note: for large files, you may need to specify a wild-card since
Collaborator

I think you lost the end of the sentence?

Member

Also needs to be wrapped.

@hadley
Member

hadley commented Apr 18, 2017

@realAkhmed are you interested in finishing off this PR?

@realAkhmed
Contributor Author

@hadley Absolutely! I just wasn't sure whether the package was still actively developed.

This particular PR was very useful for me internally, since it opens the way for what I call collect_by_export(): collecting huge results from BigQuery by automatically saving a query to a temporary table, exporting it to CSV, downloading the CSV, and then parsing it locally. This works much faster than a regular collect() on very large query results.
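
For concreteness, here is a minimal sketch of what that collect_by_export() workflow could look like on top of insert_query_job(), wait_for(), and the extract_exec() added in this PR. The function name, the bucket handling, and the gsutil download step are illustrative assumptions, not code from the PR:

library(bigrquery)

# Hypothetical sketch of collect_by_export(); not part of this PR.
# Assumes gsutil is installed and authenticated locally.
collect_by_export <- function(query, project, bucket) {
  # 1. Run the query; BigQuery stores the result in a temporary table
  job <- wait_for(insert_query_job(query, project = project))
  tbl <- job$configuration$query$destinationTable
  src <- paste0(tbl$projectId, ":", tbl$datasetId, ".", tbl$tableId)

  # 2. Extract the temporary table to (possibly sharded) CSVs in Cloud Storage
  uris <- paste0("gs://", bucket, "/export-*.csv")
  extract_exec(src, project = project, destinationUris = uris)

  # 3. Download the CSV shards and parse them locally
  local_dir <- tempfile("bq-export-")
  dir.create(local_dir)
  system2("gsutil", c("-m", "cp", uris, local_dir))
  files <- list.files(local_dir, pattern = "\\.csv$", full.names = TRUE)
  do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
}

The wildcard in the destination URI matters here: BigQuery won't export more than 1 GB of data to a single file, so large results have to be sharded across multiple CSVs.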

@craigcitro
Collaborator

@realAkhmed agreed that this is indeed a faster route in general -- but CSV as a transport format isn't great when your table includes nested or repeated fields.

I'd be curious what @hadley knows about potentially converting Avro to a data frame, since that would give us full-fidelity exports.

#' @export
insert_extract_job <- function(source_table, project, destinationUris,
                               compression = "NONE", destinationFormat = "CSV",
                               fieldDelimiter = ",", printHeader = TRUE) {
Member

These should use snake_case to be consistent with the rest of bigrquery.

Member

I'd also recommend putting one parameter on each line.
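
Taken together, those two suggestions would give a signature along these lines (the snake_case argument names are my guess at the eventual renaming, not code from the PR):

# Hypothetical revised signature with both review suggestions applied;
# the snake_case names are assumptions, not code from the PR.
insert_extract_job <- function(source_table,
                               project,
                               destination_uris,
                               compression = "NONE",
                               destination_format = "CSV",
                               field_delimiter = ",",
                               print_header = TRUE) {
  # body as in the PR
}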

job <- wait_for(job)

if (job$status$state == "DONE") {
  (job$statistics$extract$destinationUriFileCounts)
Member

Remove extra parens?
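
That is, presumably returning the count without the wrapping parentheses:

# Suggested cleanup (my reading of the review comment, not code from the PR)
if (job$status$state == "DONE") {
  job$statistics$extract$destinationUriFileCounts
}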

@IronistM

💯 claps for this!

@hadley
Member

hadley commented Apr 20, 2017

I'm just going to merge this so I can work on it locally.

@hadley hadley merged commit acdd1da into r-dbi:master Apr 20, 2017
@realAkhmed
Contributor Author

Thanks @hadley! Somewhat slow to respond at the moment (it's teaching season) -- I'll come back to address the issues you and @craigcitro raised once I'm done with it!

Zsedo pushed a commit to Zsedo/bigrquery that referenced this pull request Jun 26, 2017