Allow PARQUET format for uploading data. #609

Merged on Jun 4, 2024 (24 commits)

@apalacio9502 (Contributor) commented May 18, 2024

Hi @hadley,

This pull request lets the user choose the format (JSON or PARQUET) in which data is transmitted to BigQuery. For large amounts of data, loading in JSON format is very time-consuming because of the size of the payload that has to be transmitted. To address this, BigQuery accepts other file formats, including Parquet. The most significant change that enables PARQUET transmission is that the uploadType is no longer multipart; it is now resumable.

https://cloud.google.com/bigquery/docs/reference/api-uploads
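
To make the multipart-to-resumable change concrete, here is a minimal R sketch of the first step of the flow that the linked API-uploads page describes: a POST that carries only the load-job metadata and announces the payload size, after which the raw bytes are sent with a PUT to the session URI returned in the `Location` header. The helper name `resumable_load_request()`, its arguments, and the `application/octet-stream` MIME type are assumptions made for illustration, not code from this PR. The function only builds a description of the request and sends nothing.

```r
# Hypothetical helper: describes the initial request of a resumable upload.
# It only builds a list; nothing is sent over the network.
resumable_load_request <- function(project, dataset, table, payload,
                                   source_format = c("NEWLINE_DELIMITED_JSON", "PARQUET")) {
  source_format <- match.arg(source_format)
  stopifnot(is.raw(payload))

  list(
    method = "POST",
    url = paste0(
      "https://www.googleapis.com/upload/bigquery/v2/projects/",
      project, "/jobs?uploadType=resumable"
    ),
    headers = c(
      "Content-Type" = "application/json; charset=UTF-8",
      # Assumption: a generic binary MIME type for the payload.
      "X-Upload-Content-Type" = "application/octet-stream",
      "X-Upload-Content-Length" = as.character(length(payload))
    ),
    # Job metadata only; the data itself goes in the follow-up PUT.
    body = list(
      configuration = list(
        load = list(
          sourceFormat = source_format,
          destinationTable = list(
            projectId = project,
            datasetId = dataset,
            tableId = table
          )
        )
      )
    )
  )
}

# Placeholder project, dataset, and table names; four placeholder bytes.
req <- resumable_load_request(
  "my-project", "my_dataset", "my_table",
  payload = as.raw(c(0x50, 0x41, 0x52, 0x31)),
  source_format = "PARQUET"
)
```

Because the metadata and the data travel in separate requests, the payload can be any binary format BigQuery accepts, which a single JSON-oriented multipart body does not handle as naturally.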

Regards,

@hadley (Member) left a comment

Thanks for working on this!

.github/workflows/R-CMD-check.yaml
R/bq-perform.R
R/bq-perform.R Outdated
R/bq-perform.R Outdated
R/bq-perform.R Outdated
R/bq-request.R Outdated
R/bq-request.R Outdated
R/bq-request.R Outdated
R/bq-request.R
R/bq-perform.R Outdated

@apalacio9502 (Contributor Author)

Hi @hadley,

Thank you for your review. I have taken into account all of your comments, and I hope I haven't missed any.

Regards,

@hadley (Member) left a comment

Two more small points. I really appreciate you working on this and your responsiveness to feedback 😄

NEWS.md Outdated
@@ -1,5 +1,14 @@
# bigrquery (development version)

## Significant improvements

Member

Can you leave these headings off? We add them as part of the final release process.

DESCRIPTION Outdated
@@ -29,7 +29,8 @@ Imports:
prettyunits,
rlang (>= 1.1.0),
tibble
Suggests:
Suggests:
arrow,

Member

Would you consider trying nanoparquet instead? It's very new but has no dependencies, so we could use it from imports.

Contributor Author

I agree that nanoparquet is a good alternative because it has no dependencies. The only downside I see is that you would have to write to disk to read the raw data, as it lacks an output stream buffer implementation.

If you think the advantages outweigh the disadvantages, I could start testing and adapting the code.
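
The write-to-disk round trip described above can be sketched in a few lines of R. This is a minimal sketch under stated assumptions, not the code merged in this PR: `parquet_bytes()` is a hypothetical helper, and it assumes `nanoparquet::write_parquet(x, file)` as the default writer. The `writer` argument exists only so that the temp-file pattern can be tried without nanoparquet installed.

```r
# Hypothetical helper: serialize a data frame to a temporary file,
# then read the file back as a raw vector ready for upload.
parquet_bytes <- function(df, writer = nanoparquet::write_parquet) {
  path <- tempfile(fileext = ".parquet")
  # Remove the temporary file when the function exits, even on error.
  on.exit(unlink(path), add = TRUE)

  writer(df, path)
  readBin(path, what = "raw", n = file.size(path))
}

# Stand-in writer using base R only, so the sketch runs without nanoparquet.
# The bytes it produces are RDS, not Parquet; only the pattern is the same.
bytes <- parquet_bytes(
  head(mtcars),
  writer = function(x, file) saveRDS(x, file)
)
```

With nanoparquet installed, `parquet_bytes(df)` uses the default writer. The cost of this approach is one extra write and read of the whole payload on local disk, which is the downside raised in the comment above.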

Member

Ah I didn't think about that, but I suspect it's still worthwhile given the lighter dependencies. Do you mind filing a nanoparquet issue to add a stream buffer output?

@apalacio9502 (Contributor Author), May 31, 2024

I will begin implementing nanoparquet. If BufferedOutputStream is added in the future, an update will be necessary.

r-lib/nanoparquet#31

@apalacio9502 (Contributor Author)

Hi @hadley,

I have finished replacing arrow with nanoparquet. After several data-loading tests, I believe it works very well.

I look forward to your comments.

Regards,

NEWS.md Outdated
@hadley merged commit 3642c14 into r-dbi:main on Jun 4, 2024
12 checks passed

@hadley (Member) commented Jun 4, 2024

Thanks so much for working on this!
