# Load into BigQuery

Prototypes using the `google.cloud.bigquery` Python API to load tables from parquet files in GCS into a BigQuery dataset.

Note on `prefect-gcp`: Prefect's client can act as a wrapper for GCP resources, but in my experience its documentation is a little lacking and I ended up using the official GCP docs anyway. For the sake of documentation completeness, let's stick with the official clients, as I did for GCS.

In [1]:
from google.cloud import bigquery

## Access the parquets

Use the client's `load_table_from_uri` method to load directly from Cloud Storage.

In [2]:
import os
from dotenv import load_dotenv

In [15]:
load_dotenv()

DATA_LAKE = os.getenv("DATA_LAKE")
GCP_PROJECT_ID = os.getenv("GCP_PROJECT_ID")

# creates a client using Application Default Credentials
bq = bigquery.Client(location="us-west1")
for d in bq.list_datasets(project=GCP_PROJECT_ID):
    print(d.dataset_id)

dbt_ny_taxi
prod_ny_taxi
service_call_test
trips_data_all


In [22]:
dataset_id = "service_call_test"
dataset = bigquery.Dataset(dataset_ref=f"{GCP_PROJECT_ID}.{dataset_id}")
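# exists_ok=True makes this a no-op when the dataset already exists,
# instead of catching and printing every possible exception
dataset = bq.create_dataset(dataset=dataset, exists_ok=True)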

In [40]:
year = 2021
table_id = f"facts_{year}_test"
src_uris = f"gs://{DATA_LAKE}/raw/pq/SR{year}.parquet"
dest_table = bigquery.Table(f"{GCP_PROJECT_ID}.{dataset_id}.{table_id}")
job_config = bigquery.LoadJobConfig(
    # schema=[
    #     bigquery.SchemaField(
    #         "Status", "STRING"
    #     )
    # ],
    # source_format="PARQUET",
    source_format=bigquery.SourceFormat.PARQUET,
)
# returns a LoadJob instance
job = bq.load_table_from_uri(
    src_uris,
    dest_table,
    # job_id="test_notebook_client",  # auto-generated if not specified
    job_config=job_config,
    location="us-west1",
    project=GCP_PROJECT_ID,
)
# the job is already submitted; .result() blocks until it finishes and raises on failure
job.result()

LoadJob<project=de-zoom-83, location=us-west1, id=5b116bf7-b1ab-47ad-9025-9af74b10676e>

In [41]:
table = bq.get_table(dest_table)
print(f'{table.num_rows} rows loaded')

323880 rows loaded
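
The cells above can be folded into a small helper so the same load runs for other years. This is only a sketch: `parquet_uri`, `table_path` and `load_year` are hypothetical names that are not part of the BigQuery client, and the path and table naming are assumed to match the 2021 example for every year.

```python
def parquet_uri(bucket: str, year: int) -> str:
    """GCS URI for one year's parquet file, using the notebook's path layout."""
    return f"gs://{bucket}/raw/pq/SR{year}.parquet"


def table_path(project: str, dataset: str, year: int) -> str:
    """Fully qualified destination table ID, using the notebook's naming."""
    return f"{project}.{dataset}.facts_{year}_test"


def load_year(client, bucket, project, dataset, year, location="us-west1"):
    """Load one year's parquet file and return the destination table's row count."""
    # Imported here so the string helpers above work without the library installed.
    from google.cloud import bigquery

    destination = table_path(project, dataset, year)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.PARQUET,
    )
    job = client.load_table_from_uri(
        parquet_uri(bucket, year),
        destination,
        job_config=job_config,
        location=location,
    )
    job.result()  # blocks until the load finishes; raises if the job failed
    return client.get_table(destination).num_rows
```

With the variables defined earlier, the 2021 load would then be `load_year(bq, DATA_LAKE, GCP_PROJECT_ID, dataset_id, 2021)`.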


Gotchas:

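1. `job_config` is an optional argument, but without it the load falls back to the default source format (see point 4), so in practice it is needed for parquet.
1. (Retracted, see point 7.) A schema is required if the table does not already exist, or if we're loading from Datastore.
1. If only certain fields are specified in `schema`, then only those columns are uploaded.
1. The config needs to specify the source format as parquet; otherwise it defaults to CSV.
1. (Retracted, see point 7.) The schema can be autodetected only if the source is JSON or CSV.
1. [These docs on job config](https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload) describe the REST field `sourceFormat`; the Python client spells it `source_format` (snake case), which I initially just guessed.
1. Never mind on points 2 and 5: they were side effects of not specifying the file format correctly. With the source format set [per the sample code here](https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-gcs-parquet#bigquery_load_table_gcs_parquet-python), the load above works without an explicit schema.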

In [30]:
bq.cancel_job(job_id="test_notebook_client")

LoadJob<project=de-zoom-83, location=us-west1, id=test_notebook_client>

In [39]:
bq.delete_table(table=f"{GCP_PROJECT_ID}.{dataset_id}.{table_id}")

In [19]:
bq.delete_dataset(dataset)

## JobConfig

Next step: set partitioning and clustering in `job_config`

- [time partition](https://cloud.google.com/bigquery/docs/samples/bigquery-create-table-partitioned)
- [clustering](https://cloud.google.com/bigquery/docs/samples/bigquery-create-table-clustered)
