Released by @hadley on Apr 25, 2018.

Improved downloads

The system for downloading data from BigQuery into R has been rewritten from the ground up to give considerable improvements in performance and flexibility.

  • The two steps, downloading and parsing, now happen in sequence, rather than
    interleaved. This means that you'll now see two progress bars: one for
    downloading JSON from BigQuery and one for parsing that JSON into a data
    frame.

  • Downloads now occur in parallel, using up to 6 simultaneous connections by
    default.

  • The parsing code has been rewritten in C++. As well as considerably improving
    performance, this also adds support for nested (record/struct) and repeated
    (array) columns (#145). These columns will yield list-columns in the
    following forms:

    • Repeated values become list-columns containing vectors.
    • Nested values become list-columns containing named lists.
    • Repeated nested values become list-columns containing data frames.

  • Results are now returned as tibbles, not data frames, because the base print
    method does not handle list-columns well.
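
The three list-column shapes above can be pictured with a hand-built tibble. This is only a sketch: the column names and values are made up for illustration and nothing is downloaded from BigQuery, so it shows the structure you should expect rather than real output.

```r
library(tibble)

# Hypothetical data, constructed by hand to mirror the three shapes.
shapes <- tibble(
  id = 1:2,
  # Repeated (array) column: each cell is a vector.
  tags = list(c("a", "b"), character()),
  # Nested (record/struct) column: each cell is a named list.
  address = list(
    list(city = "Houston", zip = "77001"),
    list(city = "Austin", zip = "73301")
  ),
  # Repeated nested column: each cell is a data frame.
  orders = list(
    tibble(sku = c("x1", "x2"), qty = c(1L, 3L)),
    tibble(sku = character(), qty = integer())
  )
)

shapes$tags[[1]]          # a character vector
shapes$address[[1]]$city  # one field of the named list
nrow(shapes$orders[[1]])  # number of rows in the per-cell data frame
```

Printing `shapes` shows why tibbles are the better container here: each list-column cell is summarised rather than expanded.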

I can now download the first million rows of publicdata.samples.natality in about a minute. This data frame is about 170 MB in BigQuery and 140 MB in R; a minute to download this much data seems reasonable to me. The bottleneck for loading BigQuery data is now parsing BigQuery's JSON format. I don't see any obvious way to make this faster as I'm already using the fastest C++ JSON parser, RapidJSON. If this is still too slow for you (e.g. you're downloading GBs of data), see ?bq_table_download for an alternative approach.

New features

dplyr

  • dplyr::compute() now works (@realAkhmed, #52).

  • tbl() now accepts fully (or partially) qualified table names, like
    "publicdata.samples.shakespeare" or "samples.shakespeare". This makes it
    possible to join tables across datasets (#219).
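
As a rough sketch of qualified table names in the dplyr interface: the helper name `top_words()` and its `billing` argument are hypothetical, and the `word` and `word_count` columns are assumed from the public Shakespeare sample. The body is wrapped in a function because running it needs BigQuery credentials and a project you can bill.

```r
library(DBI)
library(dplyr)

# Hypothetical helper; `billing` is the ID of a project you are allowed to bill.
top_words <- function(billing) {
  # No `dataset` is supplied, so tables are addressed by qualified name.
  con <- dbConnect(
    bigrquery::bigquery(),
    project = "publicdata",
    billing = billing
  )

  # Fully qualified name; "samples.shakespeare" (partially qualified)
  # should resolve to the same table against the connection's project.
  shakespeare <- tbl(con, "publicdata.samples.shakespeare")

  shakespeare %>%
    group_by(word) %>%
    summarise(n = sum(word_count, na.rm = TRUE)) %>%
    arrange(desc(n)) %>%
    head(10) %>%
    collect()
}
```

Call it as `top_words("my-billing-project")`, replacing the placeholder with your own project ID.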

DBI

  • dbConnect() now defaults to standard SQL, rather than legacy SQL. Use
    use_legacy_sql = TRUE if you need the previous behaviour (#147).

  • dbConnect() now allows dataset to be omitted; this is natural when you
    want to use tables from multiple datasets.

  • dbWriteTable() and dbReadTable() now accept fully (or partially)
    qualified table names.

  • dbi_driver() is deprecated; please use bigquery() instead.
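
A minimal sketch of the new defaults, assuming the public Shakespeare sample table and a billing project of your own. `compare_dialects()` is a hypothetical helper, wrapped in a function because the queries need credentials and network access.

```r
library(DBI)

# Hypothetical helper; `billing` is the ID of a project you are allowed to bill.
compare_dialects <- function(billing) {
  # Standard SQL is now the default, and `dataset` can be left out.
  con <- dbConnect(
    bigrquery::bigquery(),
    project = "publicdata",
    billing = billing
  )
  standard <- dbGetQuery(
    con,
    "SELECT word, word_count FROM `publicdata.samples.shakespeare` LIMIT 5"
  )

  # Opt back in to legacy SQL, which uses [project:dataset.table] names.
  legacy_con <- dbConnect(
    bigrquery::bigquery(),
    project = "publicdata",
    dataset = "samples",
    billing = billing,
    use_legacy_sql = TRUE
  )
  legacy <- dbGetQuery(
    legacy_con,
    "SELECT word, word_count FROM [publicdata:samples.shakespeare] LIMIT 5"
  )

  list(standard = standard, legacy = legacy)
}
```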

Low-level API

The low-level API has been completely overhauled to make it easier to use. The primary motivation was to make bigrquery development more enjoyable for me, but it should also be helpful to you when you need to go outside of the features provided by the higher-level DBI and dplyr interfaces. The old API has been soft-deprecated: it will continue to work, but will receive no further development (including bug fixes). It will be formally deprecated in the next version, and then removed in the version after that.

  • Consistent naming scheme:
    All API functions now have the form bq_object_verb(), e.g.
    bq_table_create(), or bq_dataset_delete().

  • S3 classes:
    bq_table(), bq_dataset(), bq_job(), bq_field() and bq_fields()
    constructor functions create S3 objects corresponding to important BigQuery
    objects (#150). These are paired with as_ coercion functions and used throughout
    the new API.

  • Easier local testing:
    New bq_test_project() and bq_test_dataset() make it easier to run
    bigrquery tests locally. To run the tests yourself, you need to create a
    BigQuery project, and then follow the instructions in ?bq_test_project.

  • More efficient data transfer:
    The new API makes extensive use of the fields query parameter, ensuring
    that functions only download data that they actually use (#153).

  • Tighter GCS connection:
    New bq_table_load() loads data from a Google Cloud Storage URI, pairing
    with bq_table_save() which saves data to a GCS URI (#155).
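
To give a feel for the naming scheme and the S3 classes, here is a small sketch using the public Shakespeare sample as the example table. As far as I know the constructors only build local references, so the top-level lines should run without credentials; `describe_table()` is a hypothetical helper that does need them.

```r
library(bigrquery)

# Constructors should only build local references; no request is sent here.
ds <- bq_dataset("publicdata", "samples")
tb <- bq_table("publicdata", "samples", "shakespeare")

# The as_ coercion functions accept a qualified name as a string.
tb2 <- as_bq_table("publicdata.samples.shakespeare")

# A schema described with bq_field() and bq_fields().
fields <- bq_fields(list(
  bq_field("word", "string"),
  bq_field("word_count", "integer")
))

# Hypothetical helper built from bq_object_verb() calls; needs credentials.
describe_table <- function(x) {
  if (!bq_table_exists(x)) {
    return(NULL)
  }
  bq_table_fields(x)
}
```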

Bug fixes and minor improvements

dplyr

  • The dplyr interface can work with literal SQL once more (#218).

  • Improved SQL translation for pmax(), pmin(), sd(), all(), and any()
    (#176, #179, @jarodmeng), and for paste0(), cor(), and cov()
    (@edgararuiz).

  • If you have the development version of dbplyr installed, print()ing
    a BigQuery table will not perform an unneeded query, but will instead
    download directly from the table (#226).

Low-level

  • Request error messages now include the "reason", which can contain
    useful information for debugging (#209).

  • bq_dataset_query() and bq_project_query() now accept query parameters
    (#191).

  • bq_table_create() can now specify fields (#204).

  • bq_perform_query() no longer fails with empty results (@byapparov, #206).
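
As a sketch of query parameters with bq_project_query(): this assumes parameters are passed as a named list through a `parameters` argument and matched to `@name` placeholders in standard SQL. Check ?bq_project_query for the exact interface. `count_word()` and its arguments are hypothetical, and running it needs credentials and a project you can bill.

```r
library(bigrquery)

# Hypothetical helper; `billing` is a project ID you can bill and
# `word` is the value bound to the @word placeholder.
count_word <- function(billing, word) {
  sql <- "
    SELECT SUM(word_count) AS n
    FROM `publicdata.samples.shakespeare`
    WHERE word = @word
  "
  # Assumption: `parameters` is a named list matched to @-placeholders.
  result_tb <- bq_project_query(billing, sql, parameters = list(word = word))

  # The query returns a table reference; download it into a tibble.
  bq_table_download(result_tb)
}
```

Call it as `count_word("my-billing-project", "king")`, replacing the placeholder with your own project ID.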