The system for downloading data from BigQuery into R has been rewritten from the ground up to give considerable improvements in performance and flexibility.
- The two steps, downloading and parsing, now happen in sequence, rather than
  interleaved. This means that you'll now see two progress bars: one for
  downloading JSON from BigQuery and one for parsing that JSON into a data
  frame.
- Downloads now occur in parallel, using up to 6 simultaneous connections by
  default.
- The parsing code has been rewritten in C++. As well as considerably improving
  performance, this also adds support for nested (record/struct) and repeated
  (array) columns (#145). These columns will yield list-columns in the output
  (a brief sketch follows this list):

  - Repeated values become list-columns containing vectors.
  - Nested values become list-columns containing named lists.
  - Repeated nested values become list-columns containing data frames.
- Results are now returned as tibbles, not data frames, because the base print
  method does not handle list-columns well.
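For example, here is a minimal sketch of a repeated column arriving as a
list-column; the billing project id is a placeholder:

```r
library(bigrquery)

billing <- "my-project-id"  # placeholder: your billing project

# A standard SQL query that returns a repeated (ARRAY) column
sql <- "SELECT 'a' AS grp, [1, 2, 3] AS vals
        UNION ALL
        SELECT 'b', [4, 5]"

tb <- bq_project_query(billing, sql)
df <- bq_table_download(tb)

df$vals  # a list-column of numeric vectors
```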
I can now download the first million rows of `publicdata.samples.natality` in
about a minute. This data frame is about 170 MB in BigQuery and 140 MB in R; a
minute to download this much data seems reasonable to me. The bottleneck for
loading BigQuery data is now parsing BigQuery's JSON format. I don't see any
obvious way to make this faster as I'm already using the fastest C++ JSON
parser, RapidJSON. If this is still too slow for you (i.e. you're downloading
GBs of data), see `?bq_table_download` for an alternative approach.
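As a sketch, that download looks roughly like this; the argument names are
assumptions on my part, so check `?bq_table_download` for the exact interface:

```r
library(bigrquery)

natality <- bq_table("publicdata", "samples", "natality")

# Download the first million rows over up to 6 parallel connections.
# `max_results` and `max_connections` are illustrative argument names;
# see ?bq_table_download for what your version actually supports.
df <- bq_table_download(natality, max_results = 1e6, max_connections = 6)
```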
- `tbl()` now accepts fully (or partially) qualified table names, like
  "publicdata.samples.shakespeare" or "samples.shakespeare". This makes it
  possible to join tables across datasets (#219).
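  A minimal sketch (the billing project id is a placeholder):

  ```r
  library(dplyr)

  con <- DBI::dbConnect(
    bigrquery::bigquery(),
    project = "publicdata",
    billing = "my-project-id"  # placeholder: your billing project
  )

  # The fully qualified name lets the dataset differ per table
  shakespeare <- tbl(con, "publicdata.samples.shakespeare")
  shakespeare %>%
    count(corpus, wt = word_count) %>%
    collect()
  ```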
- `dbConnect()` now defaults to standard SQL, rather than legacy SQL. Use
  `use_legacy_sql = TRUE` if you need the previous behaviour (#147).
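  For example (the project id is a placeholder):

  ```r
  library(DBI)

  # Opt back in to legacy SQL for an existing code base
  con <- dbConnect(
    bigrquery::bigquery(),
    project = "my-project-id",  # placeholder
    use_legacy_sql = TRUE
  )
  ```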
- `dbConnect()` now allows `dataset` to be omitted; this is natural when you
  want to use tables from multiple datasets.
- `dbWriteTable()` and `dbReadTable()` now accept fully (or partially)
  qualified table names.
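  A sketch combining the two (the project id is a placeholder):

  ```r
  library(DBI)

  # No `dataset` argument: qualify table names per call instead
  con <- dbConnect(bigrquery::bigquery(), project = "my-project-id")

  df <- dbReadTable(con, "publicdata.samples.shakespeare")
  ```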
- `dbi_driver()` is deprecated; please use `bigquery()` instead.
The low-level API has been completely overhauled to make it easier to use. The primary motivation was to make bigrquery development more enjoyable for me, but it should also be helpful to you when you need to go outside the features provided by the higher-level DBI and dplyr interfaces. The old API has been soft-deprecated: it will continue to work, but no further development will occur (including bug fixes). It will be formally deprecated in the next version, and then removed in the version after that.
- Consistent naming scheme: All API functions now have the form
  `bq_object_verb()`, e.g. `bq_table_create()` or `bq_dataset_delete()`.
- S3 classes: `bq_table()`, `bq_dataset()`, `bq_job()`, `bq_field()`, and
  `bq_fields()` constructor functions create S3 objects corresponding to
  important BigQuery objects (#150). These are paired with `as_` coercion
  functions (e.g. `as_bq_table()`) and used throughout the new API.
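  A minimal sketch of the constructors and the `bq_object_verb()` naming
  pattern (the project and dataset ids are placeholders):

  ```r
  library(bigrquery)

  ds <- bq_dataset("my-project-id", "scratch")  # placeholder ids
  bq_dataset_create(ds)

  tb <- bq_table(ds, "mtcars")
  bq_table_upload(tb, mtcars)
  bq_table_exists(tb)
  #> [1] TRUE

  bq_table_delete(tb)
  bq_dataset_delete(ds)
  ```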
- Easier local testing: New `bq_test_project()` and `bq_test_dataset()` make
  it easier to run bigrquery tests locally. To run the tests yourself, you
  need to create a BigQuery project, and then follow the instructions in
  `?bq_test_project`.
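  A sketch, assuming you have configured a test project as described in
  `?bq_test_project`:

  ```r
  library(bigrquery)

  bq_test_project()        # your configured test project

  ds <- bq_test_dataset()  # a disposable dataset for use in tests
  ```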
- More efficient data transfer: The new API makes extensive use of the
  `fields` query parameter, ensuring that functions only download data that
  they actually use (#153).
- Tighter GCS connection: New `bq_table_load()` loads data from a Google
  Cloud Storage URI, pairing with `bq_table_save()`, which saves data to a
  GCS URI (#155).
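  A round-trip sketch (table, dataset, and bucket names are placeholders):

  ```r
  library(bigrquery)

  tb <- bq_table("my-project-id", "my_dataset", "mytable")  # placeholders

  # Extract the table to Cloud Storage, then load it into a new table
  bq_table_save(tb, "gs://my-bucket/mytable-*.csv")
  bq_table_load(
    bq_table("my-project-id", "my_dataset", "mytable_copy"),
    "gs://my-bucket/mytable-*.csv"
  )
  ```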
Bug fixes and minor improvements
- The dplyr interface can work with literal SQL once more (#218).
- If you have the development version of dbplyr installed, `collect()` on
  a BigQuery table will not perform an unneeded query, but will instead
  download directly from the table (#226).
- Request error messages now contain the "reason", which can contain
  useful information for debugging (#209).
- `bq_project_query()` can now supply query parameters (a sketch follows
  this list).
- `bq_table_create()` can now specify fields.
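A sketch of a parameterised query with `bq_project_query()`. I'm assuming
parameters are supplied as a named list matched to `@name` placeholders in
the SQL (see `?bq_project_query` for the exact format); the billing project
id is a placeholder:

```r
library(bigrquery)

billing <- "my-project-id"  # placeholder

sql <- "SELECT word, word_count
        FROM `publicdata.samples.shakespeare`
        WHERE corpus = @corpus"

tb <- bq_project_query(billing, sql, parameters = list(corpus = "macbeth"))
bq_table_download(tb)
```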