# Load into BigQuery

Prototypes using the `google.cloud.bigquery` Python API to load tables from parquet files in GCS into a BigQuery dataset.

Note on `prefect-gcp`: Prefect's client can be used as a wrapper for GCP resources, but in my experience its documentation is a little lacking, and I ended up using the official GCP docs anyway. For the sake of documentation completeness, let's stick with the official clients, as I did for GCS.

In [22]:
from google.cloud import bigquery

## Access the parquets

Use the `load_table_from_uri` method to load directly from Cloud Storage.

In [23]:
import os
from dotenv import load_dotenv

In [24]:
load_dotenv()

DATA_LAKE = os.getenv("DATA_LAKE")
GCP_PROJECT_ID = os.getenv("GCP_PROJECT_ID")

# create a client using Application Default Credentials
bq = bigquery.Client(location="us-west1")
for d in bq.list_datasets(project=GCP_PROJECT_ID):
    print(d.dataset_id)

dbt_ny_taxi
prod_ny_taxi
service_call_test
service_calls_dev
test_service_calls_models
trips_data_all


In [26]:
dataset_id = "service_call_test"
dataset = bigquery.Dataset(dataset_ref=f"{GCP_PROJECT_ID}.{dataset_id}")
try:
    dataset = bq.create_dataset(dataset=dataset)
except Exception as e:
    print(e)

409 POST https://bigquery.googleapis.com/bigquery/v2/projects/de-zoom-83/datasets?prettyPrint=false: Already Exists: Dataset de-zoom-83:service_call_test
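
The 409 above is just the client reporting that the dataset already exists. A minimal sketch of an idempotent alternative, assuming the `exists_ok` flag on `Client.create_dataset`; the helper name `ensure_dataset` is hypothetical and not part of the notebook:

```python
def ensure_dataset(client, project_id: str, dataset_id: str):
    """Create the dataset if it is missing and return it either way.

    `client` is expected to be a `google.cloud.bigquery.Client`.
    `ensure_dataset` is a hypothetical helper name for this sketch.
    """
    # exists_ok=True asks the client to ignore the 409 Conflict and
    # return the existing dataset instead of raising.
    return client.create_dataset(f"{project_id}.{dataset_id}", exists_ok=True)
```

With the names used in this notebook, the call would look like `ensure_dataset(bq, GCP_PROJECT_ID, dataset_id)`, which removes the need for the broad `except Exception`.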


In [40]:
year = 2021
table_id = f"facts_{year}_test"
src_uris = f"gs://{DATA_LAKE}/raw/pq/SR{year}.parquet"
dest_table = bigquery.Table(f"{GCP_PROJECT_ID}.{dataset_id}.{table_id}")
job_config = bigquery.LoadJobConfig(
    # schema=[
    #     bigquery.SchemaField(
    #         "Status", "STRING"
    #     )
    # ],
    # source_format="PARQUET",
    source_format=bigquery.SourceFormat.PARQUET,
)
# returns a LoadJob instance
job = bq.load_table_from_uri(
    src_uris,
    dest_table,
    # job_id="test_notebook_client", # if not spec'd, autogen
    job_config=job_config,
    location="us-west1",
    project=GCP_PROJECT_ID,
)
# the job starts as soon as it is submitted; .result() blocks until it finishes
job.result()

LoadJob<project=de-zoom-83, location=us-west1, id=5b116bf7-b1ab-47ad-9025-9af74b10676e>
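
Since `load_table_from_uri` submits the job and `.result()` only waits for it, the wait can be bounded. A rough sketch, assuming the `timeout` argument of `LoadJob.result()` and `LoadJob.cancel()`; the helper name `wait_for_load` and the 300 second default are placeholders of mine:

```python
import concurrent.futures


def wait_for_load(job, timeout_s: float = 300.0):
    """Block until a load job finishes, cancelling it if it runs too long.

    `job` is expected to be the `LoadJob` returned by `load_table_from_uri`.
    `wait_for_load` and the 300 s default are hypothetical choices.
    """
    try:
        # The job is already running server-side; result() waits for it
        # and re-raises the job's error if the load failed.
        job.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        # Ask BigQuery to stop the job; cancellation is best-effort.
        job.cancel()
        raise
    return job
```

Re-raising after the cancel keeps the failure visible to whatever orchestrates the pipeline step.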

In [41]:
table = bq.get_table(dest_table)
print(f'{table.num_rows} rows loaded')

323880 rows loaded


Gotchas:

1. Is `job_config` required?
1. `schema` is required if the table does not already exist, or if we're loading from Datastore
1. If only certain fields are specified in `schema`, then only those columns are uploaded
1. The config needs to specify the source format as parquet; otherwise it defaults to CSV
1. Schema can be autodetected only if the source is JSON or CSV, *or parquet*
1. [These docs on job config](https://cloud.google.com/bigquery/docs/reference/rest/v2/Job#jobconfigurationload) mention `sourceFormat`; `source_format` was a lucky guess.
1. Never mind points 2 and 5: the file format just needed to be specified correctly, [per the sample code here](https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-gcs-parquet#bigquery_load_table_gcs_parquet-python)

There seems to be a slight disconnect between how the REST API and the Python SDK name things in code, e.g. `sourceFormat` in REST versus `source_format` and `bigquery.SourceFormat` in Python.
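
As a rule of thumb (not a guarantee), the Python property names look like snake_case versions of the REST camelCase fields. A tiny sketch of that guess; `to_rest_name` is a hypothetical helper, and some properties do not map one-to-one (`clustering_fields` corresponds to the nested `clustering.fields` in REST):

```python
def to_rest_name(python_name: str) -> str:
    """Convert a snake_case Python property name to camelCase."""
    head, *rest = python_name.split("_")
    return head + "".join(part.capitalize() for part in rest)


# A few LoadJobConfig properties and the REST field names they suggest.
for name in ("source_format", "write_disposition", "time_partitioning"):
    print(f"{name} -> {to_rest_name(name)}")
```

To see the real mapping rather than guess, `job_config.to_api_repr()` returns the REST-shaped dict the client will send.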

In [30]:
bq.cancel_job(job_id="test_notebook_client")

LoadJob<project=de-zoom-83, location=us-west1, id=test_notebook_client>

In [39]:
bq.delete_table(table=f"{GCP_PROJECT_ID}.{dataset_id}.{table_id}")

In [19]:
bq.delete_dataset(dataset)

## JobConfig

Next step: set partitioning and clustering in `job_config` during load: [code sample](https://cloud.google.com/bigquery/docs/samples/bigquery-load-table-clustered)

After table is created:

- [time partition](https://cloud.google.com/bigquery/docs/samples/bigquery-create-table-partitioned)
- [clustering](https://cloud.google.com/bigquery/docs/samples/bigquery-create-table-clustered)


In [31]:
import time

In [34]:
year = 2020
table_id = f"test_facts_{year}_partitioned"
src_uris = f"gs://{DATA_LAKE}/raw/pq/SR{year}.parquet"
dest_table = bigquery.Table(f"{GCP_PROJECT_ID}.{dataset_id}.{table_id}")
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.PARQUET,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY, field="creation_datetime"
    ),
    clustering_fields=["Service_Request_Type", "ward_id"],
)
job = bq.load_table_from_uri(src_uris, dest_table, job_config=job_config)
start_time = time.time()
print(f"Job {job.job_id} started")
job.add_done_callback(lambda x: print(f"Job duration: {time.time()-start_time}"))
job.result()

Job 27cd73dc-0367-460a-a109-7b070cbc4c94 started


Error while executing Future callback.
Traceback (most recent call last):
  File "/home/klang/.cache/pypoetry/virtualenvs/to-service-qAcdV8BS-py3.10/lib/python3.10/site-packages/google/api_core/future/_helpers.py", line 37, in safe_invoke_callback
    return callback(*args, **kwargs)
  File "/tmp/ipykernel_1795/2009562937.py", line 16, in <lambda>
    lambda x: print(f'Job duration: {time.time()-start_time}')
TypeError: unsupported operand type(s) for -: 'float' and 'time.struct_time'


LoadJob<project=de-zoom-83, location=us-west1, id=27cd73dc-0367-460a-a109-7b070cbc4c94>
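
The `TypeError` in the traceback comes from subtracting a `time.struct_time` from a float, which would happen if `start_time` had been set with something like `time.gmtime()` in an earlier run of the cell (my assumption; the cell as shown uses `time.time()`). A small sketch of the same timing pattern using `time.monotonic()`, with a stdlib `Future` standing in for the `LoadJob`; `time_future` is a hypothetical helper:

```python
import time
from concurrent.futures import Future


def time_future(future, on_done=print):
    """Report elapsed seconds when `future` completes."""
    start = time.monotonic()  # plain float seconds, safe to subtract

    def _report(_finished):
        on_done(f"Job duration: {time.monotonic() - start:.3f}s")

    future.add_done_callback(_report)
    return future


# Stand-in for a LoadJob: a plain Future that we complete by hand.
fake_job = Future()
time_future(fake_job)
fake_job.set_result("done")  # completing the future triggers the callback
```

`LoadJob` exposes the same `add_done_callback` hook, so the helper should apply to the `job` above as well, though `job.ended - job.started` from the server is probably the more trustworthy number.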

In [43]:
job_info = f"""
{job.created}
{job.destination}
{job.ended}
{job.job_type}
{job.input_file_bytes}
{job.output_rows}
{job.output_bytes}
{job.location}
{job.state}
duration: {job.ended - job.started}
"""
print(job_info)


2023-04-05 07:31:09.109000+00:00
de-zoom-83.service_call_test.test_facts_2020_partitioned
2023-04-05 07:31:11.140000+00:00
load
11871
100
13409
us-west1
DONE
duration: 0:00:01.922000



In [10]:
table = bq.get_table(dest_table)  # Make an API request.
print(
    "Loaded {} rows and {} columns to {}".format(
        table.num_rows, len(table.schema), table_id
    )
)
print("schema:\n", table.schema)

Loaded 323880 rows and 10 columns to test_facts_2021_partitioned
schema:
 [SchemaField('Status', 'STRING', 'NULLABLE', None, None, (), None), SchemaField('First_3_Chars_of_Postal_Code', 'STRING', 'NULLABLE', None, None, (), None), SchemaField('Intersection_Street_1', 'STRING', 'NULLABLE', None, None, (), None), SchemaField('Intersection_Street_2', 'STRING', 'NULLABLE', None, None, (), None), SchemaField('Service_Request_Type', 'STRING', 'NULLABLE', None, None, (), None), SchemaField('Division', 'STRING', 'NULLABLE', None, None, (), None), SchemaField('Section', 'STRING', 'NULLABLE', None, None, (), None), SchemaField('ward_name', 'STRING', 'NULLABLE', None, None, (), None), SchemaField('ward_id', 'INTEGER', 'NULLABLE', None, None, (), None), SchemaField('creation_datetime', 'TIMESTAMP', 'NULLABLE', None, None, (), None)]


In [5]:
dataset_id = "service_call_test"
dataset_ref = bigquery.DatasetReference(GCP_PROJECT_ID, dataset_id)
table_ref = dataset_ref.table("facts_2021_test")
table = bigquery.Table(table_ref)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY,
    field="creation_datetime",
)

### Integrating into pipeline

- authentication?
- timeout?
- retries?
- multipart?
- check for existing?
- ensure that the partitioning and clustering fields are present and valid?
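
For the last question, one option is a pre-flight check before submitting the job. A minimal sketch that only checks presence, comparing names exactly; it does not check that the partition column has a type BigQuery accepts for time partitioning. `missing_fields` is a hypothetical helper, and the column list could come from something like `[f.name for f in table.schema]`:

```python
def missing_fields(schema_columns, partition_field=None, clustering_fields=()):
    """Return configured partition/cluster columns absent from the schema."""
    available = set(schema_columns)
    wanted = list(clustering_fields)
    if partition_field is not None:
        wanted.insert(0, partition_field)
    return [name for name in wanted if name not in available]


columns = ["Status", "Service_Request_Type", "ward_id", "creation_datetime"]

# Matches the config used earlier in this notebook: nothing is missing.
print(missing_fields(columns, "creation_datetime", ["Service_Request_Type", "ward_id"]))
# → []

# Misspelled names are reported in the order they were configured.
print(missing_fields(columns, "created_at", ["ward"]))
# → ['created_at', 'ward']
```

Failing fast on a non-empty result avoids paying for a load job that BigQuery would reject anyway.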

### Authenticate with service account key file

Setup:

- create service account with required permissions
- download key file to executor
- use `google.oauth2.service_account`:

```py
from google.cloud import bigquery
from google.oauth2 import service_account

key_path = "path/to/service_account.json"
# load credentials from the downloaded key file
creds = service_account.Credentials.from_service_account_file(
    key_path,
    scopes=["https://www.googleapis.com/auth/cloud-platform"],
)
# use the project the service account belongs to
client = bigquery.Client(
    credentials=creds,
    project=creds.project_id,
)
```

Permissions needed:

- `bigquery.tables.create`, `bigquery.tables.updateData`, `bigquery.tables.update`
- `bigquery.jobs.create`
    - combine dataEditor and jobUser?
- `storage.buckets.get`
- `storage.objects.get`, `storage.objects.list`
    - create from custom Storage Admin

The following predefined IAM roles cover the BigQuery permissions above:

- roles/bigquery.dataEditor, dataOwner (no job permission?), admin
- roles/bigquery.user, jobUser (doesn't include tables permission?)

Integrate into Terraform?

In [1]:
from google.cloud import bigquery

In [8]:
client = bigquery.Client(location="canteen")

In [9]:
print(client.location)

canteen


In [16]:
for p in client.list_projects():
    print(p.__dir__())

['project_id', 'numeric_id', 'friendly_name', '__module__', '__doc__', '__init__', 'from_api_repr', '__dict__', '__weakref__', '__new__', '__repr__', '__hash__', '__str__', '__getattribute__', '__setattr__', '__delattr__', '__lt__', '__le__', '__eq__', '__ne__', '__gt__', '__ge__', '__reduce_ex__', '__reduce__', '__subclasshook__', '__init_subclass__', '__format__', '__sizeof__', '__dir__', '__class__']


In [21]:
for t in client.list_tables("trips_data_all"):
    print(t.full_table_id)

de-zoom-83:trips_data_all.fhv_taxi_trips
de-zoom-83:trips_data_all.fhv_taxi_trips_ext
de-zoom-83:trips_data_all.fhv_taxi_trips_partition
de-zoom-83:trips_data_all.green-tip-model-eval
de-zoom-83:trips_data_all.green-tip-model-feature-info
de-zoom-83:trips_data_all.green-tip-model-predict
de-zoom-83:trips_data_all.green-tip-model-predict-explain
de-zoom-83:trips_data_all.green_taxi_trips
de-zoom-83:trips_data_all.green_taxi_trips_ml
de-zoom-83:trips_data_all.green_taxi_trips_partition
de-zoom-83:trips_data_all.report_2020
de-zoom-83:trips_data_all.yellow_taxi_trips
de-zoom-83:trips_data_all.yellow_taxi_trips_ext
