<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Tips/BigQuery%20-%20Python%20Client.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FTips%2FBigQuery%2520-%2520Python%2520Client.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Tips/BigQuery%20-%20Python%20Client.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Tips/BigQuery%20-%20Python%20Client.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# Python BigQuery Client(s)

This notebook covers several ways of interacting with BigQuery tables from Python, including:
- approaches unique to Jupyter notebooks, such as magics
- pandas integrations, such as pandas-gbq
- the BigQuery Python client
- the BigQuery Storage API for fast, parallel reads directly from BigQuery storage


**Notes:**
- The `LIMIT 5` clause limits the number of rows returned by BigQuery to 5, but BigQuery still does a full table scan.  If you have a table larger than 1GB and want to limit the rows scanned for a quick review, replacing `LIMIT 5` with `TABLESAMPLE SYSTEM (1 PERCENT)` would be more efficient.  For tables under 1GB, sampling will still return the full table.  More on [Table Sampling](https://cloud.google.com/bigquery/docs/table-sampling)
- Each of the examples below runs the same query in BigQuery.  The results are cached on the first run for up to 24 hours, so subsequent, identical queries will not scan the data and will instead use the cached results table.  More information on [Using cached query results](https://cloud.google.com/bigquery/docs/cached-results).
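
The two preview strategies in the first note can be sketched as a small query builder. The function name `preview_sql` and its parameters are hypothetical, introduced here only for illustration; the function assembles SQL text and does not contact BigQuery.

```python
def preview_sql(table, limit=5, sample_percent=None):
    """Build a preview query for a fully qualified table name.

    Hypothetical helper for illustration: when sample_percent is set, the
    query uses TABLESAMPLE so BigQuery reads only a subset of the table.
    """
    sql = f"SELECT *\nFROM `{table}`"
    if sample_percent is not None:
        sql += f" TABLESAMPLE SYSTEM ({sample_percent} PERCENT)"
    if limit is not None:
        sql += f"\nLIMIT {limit}"
    return sql

table = 'bigquery-public-data.ml_datasets.ulb_fraud_detection'
full_scan = preview_sql(table)                              # LIMIT only
sampled = preview_sql(table, limit=None, sample_percent=1)  # sampling only
print(full_scan)
print(sampled)
```

Either string can be passed to any of the query methods shown in the rest of this notebook.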



**Resources:**
- [BigQuery Python Client](https://cloud.google.com/python/docs/reference/bigquery/latest)
    - Interact with BigQuery compute to run queries
- [BigQuery Storage API Python Client](https://cloud.google.com/python/docs/reference/bigquerystorage/latest)
    - directly read from BigQuery storage through streams which support multiprocessing
- Using BigQuery From Python, Notebooks in This Repository
    - [01 - Data Sources/01 - BigQuery - Table Data Sources](../01%20-%20Data%20Sources/01%20-%20BigQuery%20-%20Table%20Data%20Source.ipynb)
    - [03 - BigQuery ML (BQML)/03 - Introduction to BigQuery ML (BQML)](../03%20-%20BigQuery%20ML%20(BQML)/03%20-%20Introduction%20to%20BigQuery%20ML%20(BQML).ipynb)
    - [Applied Forecasting/BigQuery Time Series Forecasting Data Review and Preparation](../Applied%20Forecasting/BigQuery%20Time%20Series%20Forecasting%20Data%20Review%20and%20Preparation.ipynb)


---
## Colab Setup

To run this notebook in Colab click [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Tips/BigQuery%20-%20Python%20Client.ipynb) and run the cells in this section.  Otherwise, skip this section.

The cells below set the project ID and authenticate to GCP (follow the prompts in the popup).

In [1]:
PROJECT_ID = 'statmike-mlops-349915' # replace with project ID

In [2]:
try:
    import google.colab
    from google.colab import auth
    auth.authenticate_user()
    !gcloud config set project {PROJECT_ID}
except Exception:
    pass

---
## Installs

The list `packages` contains tuples of package import names and install names, with an optional minimum version.  If the import name is not found, or the installed version is older than the minimum, then the install name is used to install quietly for the current user.

In [3]:
# tuples of (import name, install name[, minimum version])
packages = [
    ('google.cloud.bigquery', 'google-cloud-bigquery'),
    ('google.cloud.bigquery_storage', 'google-cloud-bigquery-storage', '2.24.0'),
    ('bigframes', 'bigframes'),
    ('pandas_gbq', 'pandas-gbq'),
    ('tensorflow', 'tensorflow', '2.10'),
    ('tensorflow_io', '--no-deps tensorflow-io'),
]

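import importlib.util
import importlib.metadata
from packaging.version import Version  # assumes `packaging` is available, as it is in most notebook environments

install = False
for package in packages:
    if not importlib.util.find_spec(package[0]):
        print(f'installing package {package[1]}')
        install = True
        !pip install {package[1]} -U -q --user
    elif len(package) == 3:
        # look up the installed version by distribution (install) name and compare parsed versions, not strings
        if Version(importlib.metadata.version(package[1])) < Version(package[2]):
            print(f'updating package {package[1]}')
            install = True
            !pip install {package[1]} -U -q --user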
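
Version checks like the one in the cell above are sensitive to how versions are compared: comparing version strings directly is lexicographic, so `'2.9' < '2.10'` is false. A minimal sketch of a numeric comparison using only the standard library follows. `version_tuple` and `needs_update` are hypothetical helpers and handle only plain dotted numeric versions (no pre-release tags).

```python
def version_tuple(version):
    """Convert a plain dotted version such as '2.10' into a tuple of ints."""
    return tuple(int(part) for part in version.split('.'))

def needs_update(installed, minimum):
    """Return True when the installed version is older than the minimum.

    Pads the shorter tuple with zeros so '2.10' and '2.10.0' compare as equal.
    """
    a, b = version_tuple(installed), version_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a < b

print('2.9' < '2.10')               # string comparison
print(needs_update('2.9', '2.10'))  # numeric comparison
```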

### Restart Kernel (If Installs Occurred)

After the kernel restarts, continue running from the cell after this one.

In [4]:
if install:
    import IPython
    app = IPython.Application.instance()
    app.kernel.do_shutdown(True)

---
## Setup

inputs:

In [5]:
project = !gcloud config get-value project
PROJECT_ID = project[0]
PROJECT_ID

'statmike-mlops-349915'

In [6]:
# source data
BQ_PROJECT = 'bigquery-public-data'
BQ_DATASET = 'ml_datasets'
BQ_TABLE = 'ulb_fraud_detection'

packages:

In [7]:
import os
os.environ['TF_CPP_MIN_LOG_LEVEL'] = '3'

import timeit
import concurrent.futures

import pandas as pd
from google.cloud import bigquery
from google.cloud import bigquery_storage
import bigframes.pandas as bpd

from tensorflow.python.framework import dtypes
from tensorflow_io.bigquery import BigQueryClient
import tensorflow as tf

clients:

In [8]:
bq = bigquery.Client(project = PROJECT_ID)
bqstorage = bigquery_storage.BigQueryReadClient()
bpd.options.bigquery.project = PROJECT_ID

parameters:

In [9]:
BQ_SOURCE = f'{BQ_PROJECT}.{BQ_DATASET}.{BQ_TABLE}'

---
## BigQuery From Jupyter

### BigQuery Cell Magic

See the documentation for [IPython Magics for BigQuery](https://cloud.google.com/python/docs/reference/bigquery/latest/magics)

In [10]:
%%bigquery
SELECT *
FROM bigquery-public-data.ml_datasets.ulb_fraud_detection # the table name is written out: query parameters cannot stand in for table names
LIMIT 5

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,8748.0,-1.070416,0.304517,2.777064,2.154061,0.25445,-0.448529,-0.398691,0.144672,1.0709,...,-0.122032,-0.182351,0.019576,0.626023,-0.018518,-0.263291,-0.1986,0.098435,0.0,0
1,27074.0,1.165628,0.423671,0.887635,2.740163,-0.338578,-0.142846,-0.055628,-0.015325,-0.213621,...,-0.081184,-0.025694,-0.076609,0.414687,0.631032,0.077322,0.010182,0.019912,0.0,0
2,28292.0,1.050879,0.053408,1.36459,2.666158,-0.378636,1.382032,-0.766202,0.486126,0.152611,...,0.083467,0.624424,-0.157228,-0.240411,0.573061,0.24409,0.063834,0.010981,0.0,0
3,28488.0,1.070316,0.079499,1.471856,2.863786,-0.637887,0.858159,-0.687478,0.344146,0.459561,...,0.048067,0.534713,-0.098645,0.129272,0.543737,0.242724,0.06507,0.0235,0.0,0
4,31392.0,-3.680953,-4.183581,2.642743,4.263802,4.643286,-0.225053,-3.733637,1.273037,0.015661,...,0.649051,1.054124,0.795528,-0.901314,-0.425524,0.511675,0.125419,0.243671,0.0,0


---
## BigQuery From Python With BigFrames API

Use the [BigQuery BigFrames API](https://cloud.google.com/python/docs/reference/bigframes/latest) to query BigQuery and directly interact with BigQuery objects.

In [27]:
bq_data_bigframes = bpd.read_gbq(f'SELECT * FROM `{BQ_SOURCE}` LIMIT 5')
bq_data_bigframes.head()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,8748.0,-1.070416,0.304517,2.777064,2.154061,0.25445,-0.448529,-0.398691,0.144672,1.0709,...,-0.122032,-0.182351,0.019576,0.626023,-0.018518,-0.263291,-0.1986,0.098435,0.0,0
1,31392.0,-3.680953,-4.183581,2.642743,4.263802,4.643286,-0.225053,-3.733637,1.273037,0.015661,...,0.649051,1.054124,0.795528,-0.901314,-0.425524,0.511675,0.125419,0.243671,0.0,0
2,28292.0,1.050879,0.053408,1.36459,2.666158,-0.378636,1.382032,-0.766202,0.486126,0.152611,...,0.083467,0.624424,-0.157228,-0.240411,0.573061,0.24409,0.063834,0.010981,0.0,0
3,28488.0,1.070316,0.079499,1.471856,2.863786,-0.637887,0.858159,-0.687478,0.344146,0.459561,...,0.048067,0.534713,-0.098645,0.129272,0.543737,0.242724,0.06507,0.0235,0.0,0
4,27074.0,1.165628,0.423671,0.887635,2.740163,-0.338578,-0.142846,-0.055628,-0.015325,-0.213621,...,-0.081184,-0.025694,-0.076609,0.414687,0.631032,0.077322,0.010182,0.019912,0.0,0


In [28]:
type(bq_data_bigframes)

bigframes.dataframe.DataFrame

In [29]:
bq_data_bigframes.shape

(5, 31)

In [30]:
bq_data_bigframes = bq_data_bigframes.head().to_pandas()
type(bq_data_bigframes)

pandas.core.frame.DataFrame

In [31]:
bq_data_bigframes.shape

(5, 31)

---
## BigQuery From Python

Use the [BigQuery Python Client](https://cloud.google.com/python/docs/reference/bigquery/latest) to query BigQuery and return results as Pandas dataframes.

### BigQuery API From Python


In [32]:
query = f"""
    SELECT * 
    FROM `{BQ_SOURCE}`
    LIMIT 5
"""
preview = bq.query(query = query).to_dataframe()
preview

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,8748.0,-1.070416,0.304517,2.777064,2.154061,0.25445,-0.448529,-0.398691,0.144672,1.0709,...,-0.122032,-0.182351,0.019576,0.626023,-0.018518,-0.263291,-0.1986,0.098435,0.0,0
1,27074.0,1.165628,0.423671,0.887635,2.740163,-0.338578,-0.142846,-0.055628,-0.015325,-0.213621,...,-0.081184,-0.025694,-0.076609,0.414687,0.631032,0.077322,0.010182,0.019912,0.0,0
2,28292.0,1.050879,0.053408,1.36459,2.666158,-0.378636,1.382032,-0.766202,0.486126,0.152611,...,0.083467,0.624424,-0.157228,-0.240411,0.573061,0.24409,0.063834,0.010981,0.0,0
3,28488.0,1.070316,0.079499,1.471856,2.863786,-0.637887,0.858159,-0.687478,0.344146,0.459561,...,0.048067,0.534713,-0.098645,0.129272,0.543737,0.242724,0.06507,0.0235,0.0,0
4,31392.0,-3.680953,-4.183581,2.642743,4.263802,4.643286,-0.225053,-3.733637,1.273037,0.015661,...,0.649051,1.054124,0.795528,-0.901314,-0.425524,0.511675,0.125419,0.243671,0.0,0


### BigQuery Python Client: Helper Function

In [33]:
def bq_runner(query):
    return bq.query(query = query)

In [34]:
bq_runner(
    query = f"""
        SELECT * 
        FROM `{BQ_SOURCE}`
        LIMIT 5
    """
).to_dataframe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,8748.0,-1.070416,0.304517,2.777064,2.154061,0.25445,-0.448529,-0.398691,0.144672,1.0709,...,-0.122032,-0.182351,0.019576,0.626023,-0.018518,-0.263291,-0.1986,0.098435,0.0,0
1,27074.0,1.165628,0.423671,0.887635,2.740163,-0.338578,-0.142846,-0.055628,-0.015325,-0.213621,...,-0.081184,-0.025694,-0.076609,0.414687,0.631032,0.077322,0.010182,0.019912,0.0,0
2,28292.0,1.050879,0.053408,1.36459,2.666158,-0.378636,1.382032,-0.766202,0.486126,0.152611,...,0.083467,0.624424,-0.157228,-0.240411,0.573061,0.24409,0.063834,0.010981,0.0,0
3,28488.0,1.070316,0.079499,1.471856,2.863786,-0.637887,0.858159,-0.687478,0.344146,0.459561,...,0.048067,0.534713,-0.098645,0.129272,0.543737,0.242724,0.06507,0.0235,0.0,0
4,31392.0,-3.680953,-4.183581,2.642743,4.263802,4.643286,-0.225053,-3.733637,1.273037,0.015661,...,0.649051,1.054124,0.795528,-0.901314,-0.425524,0.511675,0.125419,0.243671,0.0,0


### BigQuery Python Client: Using Query Job Attributes and Methods

Query Jobs have Methods and Attributes that can benefit the Python workflow:
- Query Job [Methods](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob:~:text=for%20accurate%20signature.-,Methods,-__init__(job_id%2C%C2%A0query)
- Query Job [Attributes](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google.cloud.bigquery.job.QueryJob:~:text=from%20a%20QueryJob-,Attributes,-allow_large_results)

BigQuery Query Job (using helper function):

In [35]:
job = bq_runner(
    query = f"""
        SELECT * 
        FROM `{BQ_SOURCE}`
        LIMIT 5
    """
)

Using Query Job Attributes to get timing:

In [36]:
job.result()
(job.ended-job.started).total_seconds()

0.216
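
Beyond timing, a finished query job also reports whether the result came from cache and how many bytes were processed, which ties back to the caching note at the top of this notebook. A minimal sketch: `summarize_job` is a hypothetical helper that reads the `started`, `ended`, `cache_hit`, and `total_bytes_processed` attributes of a `QueryJob`. The `SimpleNamespace` object is a stand-in with placeholder values (not real results) so the example runs without a BigQuery connection.

```python
from datetime import datetime, timedelta, timezone
from types import SimpleNamespace

def summarize_job(job):
    """Collect timing and cache details from a finished query job."""
    return {
        'seconds': (job.ended - job.started).total_seconds(),
        'cache_hit': job.cache_hit,
        'bytes_processed': job.total_bytes_processed,
    }

# stand-in for a finished QueryJob (placeholder values, not real results)
started = datetime(2024, 1, 1, tzinfo=timezone.utc)
fake_job = SimpleNamespace(
    started=started,
    ended=started + timedelta(milliseconds=250),
    cache_hit=True,
    total_bytes_processed=0,
)
print(summarize_job(fake_job))
```

With the real `job` from above, call `summarize_job(job)` after `job.result()` has returned.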

Using Query Job Methods to retrieve result to Pandas dataframe:

In [37]:
job.to_dataframe()

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,8748.0,-1.070416,0.304517,2.777064,2.154061,0.25445,-0.448529,-0.398691,0.144672,1.0709,...,-0.122032,-0.182351,0.019576,0.626023,-0.018518,-0.263291,-0.1986,0.098435,0.0,0
1,27074.0,1.165628,0.423671,0.887635,2.740163,-0.338578,-0.142846,-0.055628,-0.015325,-0.213621,...,-0.081184,-0.025694,-0.076609,0.414687,0.631032,0.077322,0.010182,0.019912,0.0,0
2,28292.0,1.050879,0.053408,1.36459,2.666158,-0.378636,1.382032,-0.766202,0.486126,0.152611,...,0.083467,0.624424,-0.157228,-0.240411,0.573061,0.24409,0.063834,0.010981,0.0,0
3,28488.0,1.070316,0.079499,1.471856,2.863786,-0.637887,0.858159,-0.687478,0.344146,0.459561,...,0.048067,0.534713,-0.098645,0.129272,0.543737,0.242724,0.06507,0.0235,0.0,0
4,31392.0,-3.680953,-4.183581,2.642743,4.263802,4.643286,-0.225053,-3.733637,1.273037,0.015661,...,0.649051,1.054124,0.795528,-0.901314,-0.425524,0.511675,0.125419,0.243671,0.0,0


### Indirect use with pandas-gbq

When working with [Pandas](https://pandas.pydata.org/docs/user_guide/index.html#user-guide), the methods above show the client returning data to pandas dataframes.  This section shows [pandas-gbq](https://pandas-gbq.readthedocs.io/en/latest/), a package that wraps the BigQuery client so that pandas can retrieve BigQuery data to dataframes.

References:
- [Comparison of BigQuery Client with pandas-gbq](https://cloud.google.com/bigquery/docs/pandas-gbq-migration)

#### Using pandas-gbq

In [38]:
query = f"""
SELECT * 
FROM `{BQ_SOURCE}`
LIMIT 5
"""
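# pd.read_gbq delegates to pandas-gbq; it is deprecated as of pandas 2.2 in favor of pandas_gbq.read_gbq
df = pd.read_gbq(query, project_id = PROJECT_ID)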

In [39]:
df

Unnamed: 0,Time,V1,V2,V3,V4,V5,V6,V7,V8,V9,...,V21,V22,V23,V24,V25,V26,V27,V28,Amount,Class
0,8748.0,-1.070416,0.304517,2.777064,2.154061,0.25445,-0.448529,-0.398691,0.144672,1.0709,...,-0.122032,-0.182351,0.019576,0.626023,-0.018518,-0.263291,-0.1986,0.098435,0.0,0
1,27074.0,1.165628,0.423671,0.887635,2.740163,-0.338578,-0.142846,-0.055628,-0.015325,-0.213621,...,-0.081184,-0.025694,-0.076609,0.414687,0.631032,0.077322,0.010182,0.019912,0.0,0
2,28292.0,1.050879,0.053408,1.36459,2.666158,-0.378636,1.382032,-0.766202,0.486126,0.152611,...,0.083467,0.624424,-0.157228,-0.240411,0.573061,0.24409,0.063834,0.010981,0.0,0
3,28488.0,1.070316,0.079499,1.471856,2.863786,-0.637887,0.858159,-0.687478,0.344146,0.459561,...,0.048067,0.534713,-0.098645,0.129272,0.543737,0.242724,0.06507,0.0235,0.0,0
4,31392.0,-3.680953,-4.183581,2.642743,4.263802,4.643286,-0.225053,-3.733637,1.273037,0.015661,...,0.649051,1.054124,0.795528,-0.901314,-0.425524,0.511675,0.125419,0.243671,0.0,0


## TensorFlow I/O BigQuery Reader

Read BigQuery tables as TensorFlow datasets with [TensorFlow I/O](https://www.tensorflow.org/io).

In [40]:
query = f"SELECT * FROM {BQ_PROJECT}.{BQ_DATASET}.INFORMATION_SCHEMA.COLUMNS WHERE TABLE_NAME = '{BQ_TABLE}'"
schema = bq.query(query).to_dataframe()

In [41]:
schema.data_type.unique().tolist()

['FLOAT64', 'INT64']

In [42]:
types = {
    'FLOAT64' : dtypes.float64,
    'INT64' : dtypes.int64,
    'STRING' : dtypes.string
}

In [43]:
bq_data_tfio = BigQueryClient()
type(bq_data_tfio)

tensorflow_io.python.ops.bigquery_dataset_ops.BigQueryClient

In [44]:
bq_data_tfio = bq_data_tfio.read_session(
    parent = f"projects/{PROJECT_ID}",
    project_id = BQ_PROJECT,
    table_id = BQ_TABLE,
    dataset_id = BQ_DATASET,
    selected_fields = schema.column_name.tolist(),
    output_types = [types[x] for x in schema.data_type.tolist()],
    #row_restriction = f"splits='{split}'",
    requested_streams = 3
)
type(bq_data_tfio)

tensorflow_io.python.ops.bigquery_dataset_ops.BigQueryReadSession

In [45]:
bq_data_tfio = bq_data_tfio.parallel_read_rows(sloppy = True, num_parallel_calls = tf.data.experimental.AUTOTUNE)
type(bq_data_tfio)

tensorflow.python.data.ops.interleave_op._ParallelInterleaveDataset

In [46]:
bq_data_tfio = bq_data_tfio.batch(5)
type(bq_data_tfio)

tensorflow.python.data.ops.batch_op._BatchDataset

In [48]:
for rows in bq_data_tfio.take(1):
    print(list(rows.keys()))
    for item in rows.items():
        print(item)

['Amount', 'Class', 'Time', 'V1', 'V10', 'V11', 'V12', 'V13', 'V14', 'V15', 'V16', 'V17', 'V18', 'V19', 'V2', 'V20', 'V21', 'V22', 'V23', 'V24', 'V25', 'V26', 'V27', 'V28', 'V3', 'V4', 'V5', 'V6', 'V7', 'V8', 'V9']
('Amount', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([0., 0., 0., 0., 0.])>)
('Class', <tf.Tensor: shape=(5,), dtype=int64, numpy=array([0, 0, 0, 0, 0])>)
('Time', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 8748., 27074., 28292., 28488., 31392.])>)
('V1', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([-1.07041642,  1.16562783,  1.05087886,  1.07031571, -3.68095312])>)
('V10', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([-0.31050073,  0.51726104,  0.49667421,  0.34086927,  0.63552321])>)
('V11', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([ 0.09397115, -0.99559462,  0.3419823 , -1.14145224, -0.44290293])>)
('V12', <tf.Tensor: shape=(5,), dtype=float64, numpy=array([-2.3937614 ,  0.23533528,  1.22483121,  0.3995012 ,  0.78233308])>)
('V13'

---

## BigQuery Storage API From Python

BigQuery has an API for reading from and writing to storage directly, with no need for a query that uses the compute side of BigQuery.  It interacts directly with a single table, not a view, and reads/writes rows with some basic filtering capabilities to limit the returned records.  Since BigQuery is columnar storage, you can boost efficiency by only reading the columns needed with these methods!


### Create A Read Session

Use the [BigQueryReadClient](https://cloud.google.com/python/docs/reference/bigquerystorage/latest/google.cloud.bigquery_storage_v1.client.BigQueryReadClient) to create a read session with [create_read_session](https://cloud.google.com/python/docs/reference/bigquerystorage/latest/google.cloud.bigquery_storage_v1.client.BigQueryReadClient#google_cloud_bigquery_storage_v1_client_BigQueryReadClient_create_read_session).

>`max_stream_count` sets the upper limit for the number of streams returned.  The actual number is determined by the table's size and storage layout and could be less than the value provided.  Specifying `0` lets the system determine the best value for `max_stream_count`, subject to a default limit of 1000.

The `read_session` is a [bigquery_storage.types.ReadSession()](https://cloud.google.com/python/docs/reference/bigquerystorage/latest/google.cloud.bigquery_storage_v1.types.ReadSession).

Within the `read_session`, the `read_options` is a [bigquery_storage.types.ReadSession.TableReadOptions()](https://cloud.google.com/python/docs/reference/bigquerystorage/latest/google.cloud.bigquery_storage_v1.types.ReadSession.TableReadOptions).

In [40]:
read_session = bqstorage.create_read_session(
    request = dict(
        parent = f'projects/{PROJECT_ID}',
        read_session = dict(
            table = f"projects/{BQ_PROJECT}/datasets/{BQ_DATASET}/tables/{BQ_TABLE}",
            data_format = bigquery_storage.types.DataFormat.ARROW,
            read_options = dict(
                #row_restriction = "Amount > 0",
                selected_fields = [f'V{n+1}' for n in range(28)]
            )
        ),
        max_stream_count = 0
    )
)
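
The `table` value in the request above is a resource path (`projects/{project}/datasets/{dataset}/tables/{table}`) rather than the dotted `project.dataset.table` form used in SQL. A minimal sketch of converting from one to the other; `to_table_path` is a hypothetical helper, not part of the client library.

```python
def to_table_path(table_id):
    """Convert 'project.dataset.table' into the Storage API resource path."""
    parts = table_id.split('.')
    if len(parts) != 3:
        raise ValueError(f'expected project.dataset.table, got: {table_id}')
    project, dataset, table = parts
    return f'projects/{project}/datasets/{dataset}/tables/{table}'

print(to_table_path('bigquery-public-data.ml_datasets.ulb_fraud_detection'))
```

This lets a single dotted name, such as `BQ_SOURCE` from the Setup section, serve both the SQL examples and the read session request.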

Check the number of streams returned.  A value of 1 would mean all records would be read through a single stream.  A value larger than 1 indicates different streams would be used to read different parts of the table.

In [41]:
len(read_session.streams)

2

### Read the table: One Stream At A Time



In [42]:
def read_stream(stream):
    # setup a reader
    reader = bqstorage.read_rows(
        name = stream.name
    )

    # start timer
    start = timeit.default_timer()

    # read rows from reader into a dataframe.  Note this is actually multiple operations - read and convert
    train = reader.to_dataframe()#.drop(['splits', 'transaction_id'], axis = 1)

    # stop timer and calculate elapsed time
    execution_time = timeit.default_timer() - start

    # report data shape and elapsed time
    print(f'The stream ({stream.name}) read {train.shape[0]} rows and {train.shape[1]} columns in {execution_time} seconds')

    return train

In [43]:
train0 = read_stream(read_session.streams[0])

The stream (projects/statmike-mlops-349915/locations/us/sessions/CAISDFlUNnNySnJkeERUehoCamQaAmpxGgJpchoCb2oaAmpzGgJuYRoCb3caAnB6GgJpdxoCaWMaAmpyGgJqYxoCamYaAnBsGgJvdhoCamkaAmlhGgJweRoCb3MaAnB4/streams/GgJqZBoCanEaAmlyGgJvahoCanMaAm5hGgJvdxoCcHoaAml3GgJpYxoCanIaAmpjGgJqZhoCcGwaAm92GgJqaRoCaWEaAnB5GgJvcxoCcHgoAg) read 142736 rows and 28 columns in 0.33875093500046205 seconds


In [44]:
train1 = read_stream(read_session.streams[1])

The stream (projects/statmike-mlops-349915/locations/us/sessions/CAISDFlUNnNySnJkeERUehoCamQaAmpxGgJpchoCb2oaAmpzGgJuYRoCb3caAnB6GgJpdxoCaWMaAmpyGgJqYxoCamYaAnBsGgJvdhoCamkaAmlhGgJweRoCb3MaAnB4/streams/CAEaAmpkGgJqcRoCaXIaAm9qGgJqcxoCbmEaAm93GgJwehoCaXcaAmljGgJqchoCamMaAmpmGgJwbBoCb3YaAmppGgJpYRoCcHkaAm9zGgJweCgC) read 142071 rows and 28 columns in 0.1601793429999816 seconds


In [45]:
train0.shape[0] + train1.shape[0]

284807

In [46]:
train = pd.concat([train0, train1])

In [47]:
train.shape

(284807, 28)

### Read the table: Async Streams

In [48]:
start = timeit.default_timer()

train = []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = {
        executor.submit(read_stream, stream): stream for stream in read_session.streams
    }
    for future in concurrent.futures.as_completed(futures):
        stream = futures[future]
        train.append(future.result())
        
print(f'The total elapsed time was: {timeit.default_timer() - start}')

The stream (projects/statmike-mlops-349915/locations/us/sessions/CAISDFlUNnNySnJkeERUehoCamQaAmpxGgJpchoCb2oaAmpzGgJuYRoCb3caAnB6GgJpdxoCaWMaAmpyGgJqYxoCamYaAnBsGgJvdhoCamkaAmlhGgJweRoCb3MaAnB4/streams/GgJqZBoCanEaAmlyGgJvahoCanMaAm5hGgJvdxoCcHoaAml3GgJpYxoCanIaAmpjGgJqZhoCcGwaAm92GgJqaRoCaWEaAnB5GgJvcxoCcHgoAg) read 142736 rows and 28 columns in 0.34685096599969256 seconds
The stream (projects/statmike-mlops-349915/locations/us/sessions/CAISDFlUNnNySnJkeERUehoCamQaAmpxGgJpchoCb2oaAmpzGgJuYRoCb3caAnB6GgJpdxoCaWMaAmpyGgJqYxoCamYaAnBsGgJvdhoCamkaAmlhGgJweRoCb3MaAnB4/streams/CAEaAmpkGgJqcRoCaXIaAm9qGgJqcxoCbmEaAm93GgJwehoCaXcaAmljGgJqchoCamMaAmpmGgJwbBoCb3YaAmppGgJpYRoCcHkaAm9zGgJweCgC) read 142071 rows and 28 columns in 0.3689056610000989 seconds
The total elapsed time was: 0.5045286029999261


In [49]:
len(train)

2

In [50]:
train[0].shape, train[1].shape

((142736, 28), (142071, 28))

In [51]:
train = pd.concat(train)
train.shape

(284807, 28)
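
One detail of the pattern above: `concurrent.futures.as_completed` yields futures in the order they finish, so the pieces in `train` may not follow the order of `read_session.streams`. If a stable order is preferred, `executor.map` returns results in input order. A minimal sketch with a stand-in reader; `fake_read_stream` and its delays are hypothetical and replace the real `read_stream` so the example runs without BigQuery.

```python
import concurrent.futures
import time

def fake_read_stream(stream):
    """Stand-in for read_stream: sleeps, then returns the stream name."""
    name, delay = stream
    time.sleep(delay)
    return name

# the first stream is deliberately the slowest
streams = [('stream-0', 0.3), ('stream-1', 0.1), ('stream-2', 0.0)]

with concurrent.futures.ThreadPoolExecutor(max_workers=len(streams)) as executor:
    # completion order: whichever stream finishes first comes first
    futures = [executor.submit(fake_read_stream, s) for s in streams]
    completed = [f.result() for f in concurrent.futures.as_completed(futures)]

    # input order, regardless of which stream finishes first
    ordered = list(executor.map(fake_read_stream, streams))

print(completed)
print(ordered)
```

Setting `max_workers` to the number of streams, as in `len(read_session.streams)`, avoids hardcoding the worker count when the number of streams returned differs from the number requested.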

### Full Example With Bigger Table

In [52]:
read_session = bqstorage.create_read_session(
    request = dict(
        parent = f'projects/{PROJECT_ID}',
        read_session = dict(
            table = f"projects/{BQ_PROJECT}/datasets/google_trends/tables/top_terms",
            data_format = bigquery_storage.types.DataFormat.ARROW,
            read_options = dict(
                #row_restriction = "rank > 10",
                selected_fields = ['dma_id', 'term', 'week', 'score']
            )
        ),
        max_stream_count = 2
    )
)

In [53]:
len(read_session.streams)

2

In [54]:
start = timeit.default_timer()

train = []
with concurrent.futures.ThreadPoolExecutor(max_workers=2) as executor:
    futures = {
        executor.submit(read_stream, stream): stream for stream in read_session.streams
    }
    for future in concurrent.futures.as_completed(futures):
        stream = futures[future]
        train.append(future.result())
        
print(f'The total elapsed time was: {timeit.default_timer() - start}')

The stream (projects/statmike-mlops-349915/locations/us/sessions/CAISDGxReG1zUTdnRkpmTxoCamQaAmpxGgJpchoCb2oaAmpzGgJuYRoCb3caAnB6GgJpdxoCaWMaAmpyGgJqYxoCamYaAnBsGgJvdhoCamkaAmlhGgJweRoCb3MaAnB4/streams/GgJqZBoCanEaAmlyGgJvahoCanMaAm5hGgJvdxoCcHoaAml3GgJpYxoCanIaAmpjGgJqZhoCcGwaAm92GgJqaRoCaWEaAnB5GgJvcxoCcHgoAg) read 21186824 rows and 4 columns in 5.984298192000097 seconds
The stream (projects/statmike-mlops-349915/locations/us/sessions/CAISDGxReG1zUTdnRkpmTxoCamQaAmpxGgJpchoCb2oaAmpzGgJuYRoCb3caAnB6GgJpdxoCaWMaAmpyGgJqYxoCamYaAnBsGgJvdhoCamkaAmlhGgJweRoCb3MaAnB4/streams/CAEaAmpkGgJqcRoCaXIaAm9qGgJqcxoCbmEaAm93GgJwehoCaXcaAmljGgJqchoCamMaAmpmGgJwbBoCb3YaAmppGgJpYRoCcHkaAm9zGgJweCgC) read 22535176 rows and 4 columns in 7.12685243299984 seconds
The total elapsed time was: 7.403345630000331


In [55]:
len(train)

2

In [56]:
train[0].shape, train[1].shape

((21186824, 4), (22535176, 4))

In [57]:
train = pd.concat(train)
train.shape

(43722000, 4)