### Working with BigQuery

This library provides wrappers around BigQuery connectors, built on the official Google client library and pandas.

In [1]:
import os
import pandas as pd
from prediction_utils.extraction_utils.database import BQDatabase

The primary class that we are going to work with is `BQDatabase`.
The only keyword arguments that this class takes are `gcloud_project` and `google_application_credentials`.
If not provided, they default to `som-nero-phi-nigam-starr` and `os.path.expanduser("~/.config/gcloud/application_default_credentials.json")` respectively.

Let's create a database object called `db` using the defaults:

In [2]:
db = BQDatabase()
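To point the object at a different project or credentials file, the two keyword arguments can be passed explicitly. The sketch below only illustrates how the documented defaults would resolve. `resolve_bq_settings` is a hypothetical helper written for this example, not part of `prediction_utils`, and `my-other-project` is a placeholder project ID.

```python
import os

# Defaults as documented above for BQDatabase.
DEFAULT_PROJECT = "som-nero-phi-nigam-starr"
DEFAULT_CREDENTIALS = "~/.config/gcloud/application_default_credentials.json"


def resolve_bq_settings(gcloud_project=None, google_application_credentials=None):
    """Hypothetical helper: fill in the documented defaults for missing values."""
    return {
        "gcloud_project": gcloud_project or DEFAULT_PROJECT,
        "google_application_credentials": os.path.expanduser(
            google_application_credentials or DEFAULT_CREDENTIALS
        ),
    }


# Override only the project; the credentials path falls back to the default.
settings = resolve_bq_settings(gcloud_project="my-other-project")
print(settings)
# With the library installed, these could be passed as: BQDatabase(**settings)
```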



With this database object, extracting the results of a query to a pandas dataframe is as simple as calling `db.read_sql_query(...)`. 

`db.read_sql_query` takes a boolean argument, `use_bqstorage_api`, that defaults to `True` and toggles whether the query uses the BigQuery Storage API. When it is `True`, results download significantly faster but at a higher cost. For small result sets, it may be better to set this argument to `False`.

Any additional keyword arguments are passed through to `pandas.read_gbq()`.

For example:

In [3]:
query = """
    SELECT * 
    FROM som-rit-phi-starr-prod.starr_omop_cdm5_deid_lite_latest.person
    LIMIT 1000
"""
df = db.read_sql_query(query=query, use_bqstorage_api=False)

Downloading: 100%|██████████| 1000/1000 [00:00<00:00, 2504.28rows/s]


In [4]:
df.head()

Unnamed: 0,person_id,gender_concept_id,year_of_birth,month_of_birth,day_of_birth,birth_DATETIME,race_concept_id,ethnicity_concept_id,location_id,provider_id,care_site_id,person_source_value,gender_source_value,gender_source_concept_id,race_source_value,race_source_concept_id,ethnicity_source_value,ethnicity_source_concept_id
0,30360313,0,1974,9,23,1974-09-23,0,38003563,,,,,3 | 3,0,Unknown | Other,0,Unknown | Hispanic,38003563
1,31166460,0,2010,3,21,2010-03-21,8527,38003563,,,,,3 | 3,0,Unknown | White or Caucasian,0,Unknown | Hispanic,38003563
2,30629316,0,2015,6,18,2015-06-18,8515,38003564,,,,,3 | 3,0,Asian | Asian,0,Non-Hispanic/Non-Latino | Declines to State,38003564
3,30762518,8532,1996,5,6,1996-05-06,0,38003563,,,,,1 | 2,8532,Race and Ethnicity Unknown | Unknown,0,Unknown | Hispanic,38003563
4,32507595,8507,1994,1,25,1994-01-25,0,38003563,,,,,2 | 1,8507,Unknown | Unknown,0,Unknown | Hispanic,38003563
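As a rough illustration of the keyword forwarding described above, the sketch below shows one way a wrapper could hand `use_bqstorage_api` and any extra keyword arguments to `pandas.read_gbq`. This is not the library's actual implementation. The `reader` parameter and `fake_reader` are assumptions added so the example runs without BigQuery access, and `my-project` is a placeholder.

```python
def read_sql_query(query, project_id, use_bqstorage_api=True, reader=None, **kwargs):
    """Sketch of a wrapper that forwards extra keyword arguments to a reader.

    `reader` is an assumption added for illustration; when omitted,
    pandas.read_gbq is used (which needs pandas and pandas-gbq installed).
    """
    if reader is None:
        import pandas as pd  # imported lazily so the sketch loads without pandas

        reader = pd.read_gbq
    return reader(
        query,
        project_id=project_id,
        use_bqstorage_api=use_bqstorage_api,
        **kwargs,
    )


def fake_reader(query, **kwargs):
    """Stand-in reader that simply returns the arguments it received."""
    return {"query": query, **kwargs}


received = read_sql_query(
    "SELECT 1",
    project_id="my-project",  # placeholder project ID
    use_bqstorage_api=False,
    reader=fake_reader,
    max_results=10,  # forwarded untouched, as it would be to pandas.read_gbq
)
print(received)
```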


To execute arbitrary SQL, use `db.execute_sql`, which internally calls `client.query(query).result()`.

In [5]:
df = db.execute_sql(query=query).to_dataframe()
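The `client.query(query).result()` chain can be sketched in isolation. In real use `client` would be a `google.cloud.bigquery.Client`: `query()` starts a job, and `result()` blocks until the job finishes and returns an iterator of rows that supports `to_dataframe()`. `FakeClient` and `_FakeJob` below are hypothetical stand-ins so the example runs without credentials.

```python
def execute_sql(client, query):
    """Run a query and block until it finishes, returning its result object."""
    return client.query(query).result()


class _FakeJob:
    """Hypothetical stand-in for a query job."""

    def __init__(self, rows):
        self._rows = rows

    def result(self):
        return self._rows


class FakeClient:
    """Hypothetical stand-in exposing the same query(...).result() chain."""

    def __init__(self):
        self.queries = []

    def query(self, query):
        self.queries.append(query)  # record what was submitted
        return _FakeJob([{"n": 1}])


client = FakeClient()
rows = execute_sql(client, "SELECT 1 AS n")
print(rows)
```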

To write query results to a BigQuery table without returning them to pandas, use `db.execute_sql_to_destination_table`. This method requires a fully specified destination, in the form `project.dataset.table`:

In [6]:
destination='som-nero-phi-nigam-starr.temp_dataset.vignette_table'
db.execute_sql_to_destination_table(query=query, destination=destination)
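Because the destination must be fully qualified, it can be worth checking the string before submitting the query. The helper below is a hypothetical pre-flight check, not something the library is known to provide. It assumes the `project.dataset.table` form used in the cell above.

```python
def split_table_id(table_id):
    """Split 'project.dataset.table' into its three parts, or raise ValueError."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'project.dataset.table', got {table_id!r}")
    return tuple(parts)


project, dataset, table = split_table_id(
    "som-nero-phi-nigam-starr.temp_dataset.vignette_table"
)
print(project, dataset, table)
```

A destination such as `temp_dataset.vignette_table`, which omits the project, would raise `ValueError` here instead of failing later.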

Let's query the destination to confirm that the results were written:

In [7]:
df2 = db.read_sql_query('SELECT * FROM {destination}'.format(destination=destination))

Downloading: 100%|██████████| 1000/1000 [00:01<00:00, 611.99rows/s]


Check that it is the same data. Row order is not preserved, so both dataframes are sorted by `person_id` before comparing:

In [8]:
assert (
    df
    .sort_values('person_id')
    .reset_index(drop=True)
    .equals(
        df2
        .sort_values('person_id')
        .reset_index(drop=True)
    )
)

We can also write pandas dataframes to tables in BigQuery using the `to_sql` method.
This method takes a `mode` argument with valid values `"gbq"` and `"client"`, which determines whether `pandas.DataFrame.to_gbq` or `client.load_table_from_dataframe` is used. There are tradeoffs between the two. The `to_gbq` interface is more straightforward, but it writes all `DATE` columns as `TIMESTAMP` and serializes data to CSV internally. The `client` approach preserves `DATE` columns and serializes data to Parquet, but its interface is more verbose and complex. The two modes also expect the destination in different forms: in the cells below, `"gbq"` takes `dataset.table` plus a separate `project_id`, while `"client"` takes a fully qualified `project.dataset.table`.

In [9]:
# gbq method
destination='temp_dataset.vignette_table'
project_id='som-nero-phi-nigam-starr'
db.to_sql(df=df, destination_table=destination, project_id=project_id, mode='gbq')

1it [00:03,  4.00s/it]


In [10]:
# client method
destination='som-nero-phi-nigam-starr.temp_dataset.vignette_table'
db.to_sql(df=df, destination_table=destination, mode='client')

Loaded 1000 rows and 18 columns to som-nero-phi-nigam-starr.temp_dataset.vignette_table
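The two cells above format the destination differently for each mode. One way to keep a single fully qualified table ID and derive the right arguments is sketched below. `to_sql_kwargs` is a hypothetical helper, and the argument names are taken only from the calls shown above.

```python
def to_sql_kwargs(table_id, mode):
    """Build destination keyword arguments from a 'project.dataset.table' ID."""
    parts = table_id.split(".")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected 'project.dataset.table', got {table_id!r}")
    project, dataset, table = parts
    if mode == "gbq":
        # The gbq cell passes 'dataset.table' and the project separately.
        return {
            "destination_table": f"{dataset}.{table}",
            "project_id": project,
            "mode": mode,
        }
    if mode == "client":
        # The client cell passes the fully qualified table ID.
        return {"destination_table": table_id, "mode": mode}
    raise ValueError(f"mode must be 'gbq' or 'client', got {mode!r}")


table_id = "som-nero-phi-nigam-starr.temp_dataset.vignette_table"
gbq_kwargs = to_sql_kwargs(table_id, "gbq")
client_kwargs = to_sql_kwargs(table_id, "client")
print(gbq_kwargs)
print(client_kwargs)
# With a database object available: db.to_sql(df=df, **gbq_kwargs)
```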


This class can also stream query results to disk in chunks as Apache Parquet files, using pyarrow, through the `db.stream_query` method. For usage, see the docstring in the source code.