<!--- header table --->
<table align="left">
  <td style="text-align: center">
    <a href="https://colab.research.google.com/github/statmike/vertex-ai-mlops/blob/main/Dev/new/BigQuery%20-%20Tables.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/colab-logo-32px.png" alt="Google Colaboratory logo">
      <br>Run in<br>Colab
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/colab/import/https%3A%2F%2Fraw.githubusercontent.com%2Fstatmike%2Fvertex-ai-mlops%2Fmain%2FDev%2Fnew%2FBigQuery%2520-%2520Tables.ipynb">
      <img width="32px" src="https://lh3.googleusercontent.com/JmcxdQi-qOpctIvWKgPtrzZdJJK-J3sWE1RsfjZNwshCFgE_9fULcNpuXYTilIR2hjwN" alt="Google Cloud Colab Enterprise logo">
      <br>Run in<br>Colab Enterprise
    </a>
  </td>      
  <td style="text-align: center">
    <a href="https://github.com/statmike/vertex-ai-mlops/blob/main/Dev/new/BigQuery%20-%20Tables.ipynb">
      <img src="https://cloud.google.com/ml-engine/images/github-logo-32px.png" alt="GitHub logo">
      <br>View on<br>GitHub
    </a>
  </td>
  <td style="text-align: center">
    <a href="https://console.cloud.google.com/vertex-ai/workbench/deploy-notebook?download_url=https://raw.githubusercontent.com/statmike/vertex-ai-mlops/main/Dev/new/BigQuery%20-%20Tables.ipynb">
      <img src="https://lh3.googleusercontent.com/UiNooY4LUgW_oTvpsNhPpQzsstV5W8F7rYgxgGBD85cWJoLmrOzhVs_ksK_vgx40SHs7jCqkTkCk=e14-rj-sc0xffffff-h130-w32" alt="Vertex AI logo">
      <br>Open in<br>Vertex AI Workbench
    </a>
  </td>
</table>

# BigQuery | Tables

Explaining BigQuery Tables by example!


# BQ Explained | Tables
## A deep dive into BigQuery Tables

Create a BigQuery dataset with multiple versions of the same table that showcase
- Internal/External tables
  - External file formats: JSON, CSV, Avro, Parquet
- Sharding (wildcard table)
- Partitioning
- Clustering
  - By the Partition column
  - By multiple columns

The same query will be run across all versions of the table, and the resulting metrics (job duration, bytes scanned, slots utilized) will be reviewed.
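The three metrics are related in a simple way, sketched here in plain Python with no BigQuery calls. The helper name `average_slots` is hypothetical; the formula mirrors the `slot_millis/(duration*1000)` expression used later in this notebook, and the sample numbers are copied from the `INxxxx0` row of the results output further down.

```python
from datetime import timedelta

def average_slots(slot_millis: int, duration: timedelta) -> float:
    """Average number of slots in use over the job's wall-clock duration."""
    duration_ms = duration.total_seconds() * 1000
    return slot_millis / duration_ms

# Sample values copied from the INxxxx0 row of the results output in this notebook.
duration = timedelta(seconds=1, microseconds=321000)
print(duration.total_seconds())
print(round(average_slots(217510, duration), 2))
```

Slot milliseconds measure total work done, so a job can finish quickly and still be expensive if it kept many slots busy at once.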

Resources:
- [query job attributes](https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.QueryJob.html#google-cloud-bigquery-job-queryjob)
- [wildcard table (sharding)](https://cloud.google.com/bigquery/docs/querying-wildcard-tables)
- [Partitioning](https://cloud.google.com/bigquery/docs/partitioned-tables)
- [Clustering](https://cloud.google.com/bigquery/docs/clustered-tables)
- [Exporting](https://cloud.google.com/bigquery/docs/exporting-data#python)
- [Query GCS](https://cloud.google.com/bigquery/external-data-cloud-storage)

## Authenticate and Setup

In [None]:
from google.colab import auth
auth.authenticate_user()

In [None]:
project_id = 'statmike-project-1'
dataset_id = 'bq_explained_tables'
table_id = 'wikiviews'
bq_location = 'US'

In [None]:
!gcloud config set project {project_id}

Updated property [core/project].


## BigQuery Python Client

In [None]:
from google.cloud import bigquery
bq = bigquery.Client(project=project_id)

## Create A Dataset

In [None]:
dataset = bigquery.Dataset(f"{bq.project}.{dataset_id}")  # Client.dataset() is deprecated
dataset.location = bq_location
dataset = bq.create_dataset(dataset, exists_ok=True)
print(f"The dataset {dataset.dataset_id} is in project {bq.project}")

The dataset bq_explained_tables is in project statmike-project-1


## Create Tables
Use `bigquery-public-data.wikipedia.pageviews_2021` to create various table configurations for demonstration.

Naming convention: each table name starts with a 7-character code (positions 1234567):
1. Internal (I) / External (E)
2. Native (N) / JSON (J) / CSV (C) / Avro (A) / Parquet (P)
3. Shard (S) / No Sharding (x)
4. Partition (P) / No Partition (x)
5. Cluster by the Partition column (C) / No Cluster by the Partition column (x)
6. Cluster by other columns (C) / No Cluster by other columns (x)
7. How many Cluster columns (0-4)
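As a rough illustration of the convention, the sketch below splits a table name into its code, base name, and optional shard suffix. The function `parse_table_name` and the keys of the returned dictionary are hypothetical names chosen for this example; they are not used elsewhere in the notebook.

```python
def parse_table_name(name: str) -> dict:
    """Split a table name such as 'INSPCC3_wikiviews_jan' into its parts."""
    code, _, rest = name.partition("_")
    if len(code) != 7:
        raise ValueError(f"expected a 7-character code, got {code!r}")
    sharded = code[2] == "S"
    # Sharded tables carry a trailing shard suffix (for example '_jan').
    base, _, suffix = rest.rpartition("_") if sharded else (rest, "", "")
    return {
        "internal": code[0] == "I",
        "format": code[1],
        "sharded": sharded,
        "partitioned": code[3] == "P",
        "cluster_by_partition_column": code[4] == "C",
        "cluster_other_columns": code[5] == "C",
        "n_cluster_columns": int(code[6]),
        "base": base,
        "shard": suffix or None,
    }

print(parse_table_name("INSPCC3_wikiviews_jan"))
print(parse_table_name("EAxxxx0_wikiviews"))
```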



### The Base Table - Full
- no shard, partition, or cluster
- pick two months from the Wikipedia page views
  - `bigquery-public-data.wikipedia.pageviews_2021`
  - pull January and February, keeping only rows where `wiki` starts with `en`

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INxxxx0_{table_id} 
OPTIONS(description = 'Internal / Native / no shard / no partition / no cluster by partition / no cluster')
AS
SELECT * FROM bigquery-public-data.wikipedia.pageviews_2021
WHERE DATE(datehour) < DATE(2021, 3, 1) AND STARTS_WITH(wiki, 'en')
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=387, microseconds=695000)

In [None]:
job.total_bytes_processed/1e9 # GB

137.889979806

### The Base Table - Sharded by Month
- table for January
- table for February
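Sharded tables are later queried together through a wildcard reference such as `INSxxx0_wikiviews_*`, which matches the tables whose names start with the text before the `*`; the matched remainder is what BigQuery exposes as `_TABLE_SUFFIX`. The following is a local illustration of that matching in plain Python. The helper `match_wildcard` is hypothetical and does not call BigQuery.

```python
def match_wildcard(pattern: str, table_names: list[str]) -> dict[str, str]:
    """Map each matching table name to the suffix the wildcard stands for."""
    if not pattern.endswith("*"):
        raise ValueError("the wildcard must be the last character")
    prefix = pattern[:-1]
    return {n: n[len(prefix):] for n in table_names if n.startswith(prefix)}

tables = ["INxxxx0_wikiviews", "INSxxx0_wikiviews_jan", "INSxxx0_wikiviews_feb"]
print(match_wildcard("INSxxx0_wikiviews_*", tables))
```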

#### January From Unpartitioned Base

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSxxx0_{table_id}_jan
OPTIONS(description = 'Internal / Native / Shard by Month / no partition / no cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INxxxx0_{table_id}
WHERE datehour >= '2021-01-01' AND datehour < '2021-02-01' AND STARTS_WITH(wiki, 'en')
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=211, microseconds=152000)

In [None]:
job.total_bytes_processed/1e9 # GB

135.568999275

#### February From Partitioned Base

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSxxx0_{table_id}_feb
OPTIONS(description = 'Internal / Native / Shard by Month / no partition / no cluster by partition / no cluster')
AS
SELECT * FROM bigquery-public-data.wikipedia.pageviews_2021
WHERE datehour >= '2021-02-01' AND datehour < '2021-03-01' AND STARTS_WITH(wiki, 'en')
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=127, microseconds=259000)

In [None]:
job.total_bytes_processed/1e9 # GB

66.582601217

### Partitioned By Date

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INxPxx0_{table_id}
PARTITION BY date(datehour)
OPTIONS(description = 'Internal / Native / no shard / Partition by date / no cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INxxxx0_{table_id};
 
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSPxx0_{table_id}_jan
PARTITION BY date(datehour)
OPTIONS(description = 'Internal / Native / Shard by Month / Partition by date / no cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_jan;

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSPxx0_{table_id}_feb
PARTITION BY date(datehour)
OPTIONS(description = 'Internal / Native / Shard by Month / Partition by date / no cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_feb;
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=561, microseconds=464000)

### Clustered By Date

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INxxCx1_{table_id}
CLUSTER BY datehour
OPTIONS(description = 'Internal / Native / no shard / no partition / Cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INxxxx0_{table_id};
 
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSxCx1_{table_id}_jan
CLUSTER BY datehour
OPTIONS(description = 'Internal / Native / Shard by Month / no partition / Cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_jan;

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSxCx1_{table_id}_feb
CLUSTER BY datehour
OPTIONS(description = 'Internal / Native / Shard by Month / no partition / Cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_feb;
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=461, microseconds=72000)

### Partition and Cluster By Date

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INxPCx1_{table_id}
PARTITION BY date(datehour)
CLUSTER BY datehour
OPTIONS(description = 'Internal / Native / no shard / Partition by date / Cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INxxxx0_{table_id};
 
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSPCx1_{table_id}_jan
PARTITION BY date(datehour)
CLUSTER BY datehour
OPTIONS(description = 'Internal / Native / Shard by Month / Partition by date / Cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_jan;

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSPCx1_{table_id}_feb
PARTITION BY date(datehour)
CLUSTER BY datehour
OPTIONS(description = 'Internal / Native / Shard by Month / Partition by date / Cluster by partition / no cluster')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_feb;
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=392, microseconds=638000)

### Partition and Cluster By Date + Cluster (wiki and title)

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INxPCC3_{table_id}
PARTITION BY date(datehour)
CLUSTER BY datehour, wiki, title
OPTIONS(description = 'Internal / Native / no shard / Partition by date / Cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INxxxx0_{table_id};

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSPCC3_{table_id}_jan
PARTITION BY date(datehour)
CLUSTER BY datehour, wiki, title
OPTIONS(description = 'Internal / Native / Shard by Month / Partition by date / Cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_jan;

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSPCC3_{table_id}_feb
PARTITION BY date(datehour)
CLUSTER BY datehour, wiki, title
OPTIONS(description = 'Internal / Native / Shard by Month / Partition by date / Cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_feb;
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=360, microseconds=676000)

### Partition and Cluster (wiki and title)

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INxPxC2_{table_id}
PARTITION BY date(datehour)
CLUSTER BY wiki, title
OPTIONS(description = 'Internal / Native / no shard / Partition by date / no cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INxxxx0_{table_id};

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSPxC2_{table_id}_jan
PARTITION BY date(datehour)
CLUSTER BY wiki, title
OPTIONS(description = 'Internal / Native / Shard by Month / Partition by date / no cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_jan;

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSPxC2_{table_id}_feb
PARTITION BY date(datehour)
CLUSTER BY wiki, title
OPTIONS(description = 'Internal / Native / Shard by Month / Partition by date / no cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_feb;
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=574, microseconds=860000)

### Clustered By Date, wiki and title

In [None]:
query = f"""
CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INxxCC3_{table_id}
CLUSTER BY datehour, wiki, title
OPTIONS(description = 'Internal / Native / no shard / no partition / Cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INxxxx0_{table_id};

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSxCC3_{table_id}_jan
CLUSTER BY datehour, wiki, title
OPTIONS(description = 'Internal / Native / Shard by Month / no partition / Cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_jan;

CREATE OR REPLACE TABLE {project_id}.{dataset_id}.INSxCC3_{table_id}_feb
CLUSTER BY datehour, wiki, title
OPTIONS(description = 'Internal / Native / Shard by Month / no partition / Cluster by partition / Cluster + 2')
AS
SELECT * FROM {project_id}.{dataset_id}.INSxxx0_{table_id}_feb;
"""
job = bq.query(query)
job.result()
job.ended - job.started

datetime.timedelta(seconds=513, microseconds=102000)

### External Versions
- use the Partitioned and Clustered Full Table (no sharding)
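The cells in this section export the source table once per file format and then point an external table at the exported files. The sketch below only shows how the export locations are laid out. The helper `export_uri` and the placeholder project `my-project` are hypothetical, and, like the cells that follow, it assumes a Cloud Storage bucket named after the project.

```python
FORMATS = ["Avro", "CSV", "NEWLINE_DELIMITED_JSON", "Parquet"]

def export_uri(project_id: str, dataset_id: str, table: str, fmt: str) -> str:
    """Wildcard URI under which one format's export shards are written."""
    # Assumes, as the cells in this section do, a bucket named after the project.
    return f"gs://{project_id}/{dataset_id}/{table}/{fmt}/output_*"

for fmt in FORMATS:
    print(export_uri("my-project", "bq_explained_tables", "INxPCC3_wikiviews", fmt))
```

The trailing `*` lets the export job split a large table across multiple files, and the same pattern is reused as the external table's source URI.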

In [None]:
source = 'INxPCC3_wikiviews'
stable = dataset.table(source)

#### Avro

Extract: BQ > GCS

In [None]:
ejob_config = bigquery.job.ExtractJobConfig()
ejob_config.destination_format = bigquery.DestinationFormat.AVRO
ejob = bq.extract_table(
    source = dataset.table(stable.table_id),
    destination_uris = f"gs://{project_id}/{dataset_id}/{stable.table_id}/Avro/output_*",
    location = dataset.location,
    job_config = ejob_config
)
ejob.result()  # wait for the export to finish before defining the external table

BQ External Table: GCS Avro

In [None]:
table = bigquery.Table(dataset.table('EAxxxx0_wikiviews'))
ejob_config = bigquery.ExternalConfig('AVRO')
ejob_config.source_uris = [f"gs://{project_id}/{dataset_id}/{source}/Avro/output_*"]
table.external_data_configuration = ejob_config
table.description = 'External / Avro / no shard / no partition / no cluster by partition / no cluster'
table = bq.create_table(table, exists_ok = True)

#### CSV

Extract: BQ > GCS

In [None]:
ejob_config = bigquery.job.ExtractJobConfig()
ejob_config.destination_format = bigquery.DestinationFormat.CSV
ejob = bq.extract_table(
    source = dataset.table(stable.table_id),
    destination_uris = f"gs://{project_id}/{dataset_id}/{stable.table_id}/CSV/output_*",
    location = dataset.location,
    job_config = ejob_config
)
ejob.result()  # wait for the export to finish before defining the external table

BQ External Table: GCS CSV

In [None]:
table = bigquery.Table(dataset.table('ECxxxx0_wikiviews'))
ejob_config = bigquery.ExternalConfig('CSV')
ejob_config.autodetect = True
ejob_config.source_uris = [f"gs://{project_id}/{dataset_id}/{source}/CSV/output_*"]
table.external_data_configuration = ejob_config
table.description = 'External / CSV / no shard / no partition / no cluster by partition / no cluster'
table = bq.create_table(table, exists_ok = True)

#### JSON

Extract: BQ > GCS

In [None]:
ejob_config = bigquery.job.ExtractJobConfig()
ejob_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
ejob = bq.extract_table(
    source = dataset.table(stable.table_id),
    destination_uris = f"gs://{project_id}/{dataset_id}/{stable.table_id}/NEWLINE_DELIMITED_JSON/output_*",
    location = dataset.location,
    job_config = ejob_config
)
ejob.result()  # wait for the export to finish before defining the external table

BQ External Table: GCS JSON

In [None]:
table = bigquery.Table(dataset.table('EJxxxx0_wikiviews'))
ejob_config = bigquery.ExternalConfig('NEWLINE_DELIMITED_JSON')
ejob_config.autodetect = True
ejob_config.source_uris = [f"gs://{project_id}/{dataset_id}/{source}/NEWLINE_DELIMITED_JSON/output_*"]
table.external_data_configuration = ejob_config
table.description = 'External / JSON / no shard / no partition / no cluster by partition / no cluster'
table = bq.create_table(table, exists_ok = True)

#### Parquet
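At the time this notebook was written, `PARQUET` did not appear to be available in the Python client's `DestinationFormat` (see the link below), so the `EPxxxx0` experiment is left commented out of the comparison further down.
https://googleapis.dev/python/bigquery/latest/generated/google.cloud.bigquery.job.DestinationFormat.html#google.cloud.bigquery.job.DestinationFormat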

Extract: BQ > GCS

In [None]:
ejob_config = bigquery.job.ExtractJobConfig()
ejob_config.destination_format = bigquery.DestinationFormat.PARQUET
ejob = bq.extract_table(
    source = dataset.table(stable.table_id),
    destination_uris = f"gs://{project_id}/{dataset_id}/{stable.table_id}/Parquet/output_*",
    location = dataset.location,
    job_config = ejob_config
)
ejob.result()  # wait for the export to finish before defining the external table

BQ External Table: GCS Parquet

In [None]:
table = bigquery.Table(dataset.table('EPxxxx0_wikiviews'))
ejob_config = bigquery.ExternalConfig('PARQUET')
ejob_config.source_uris = [f"gs://{project_id}/{dataset_id}/{source}/Parquet/output_*"]
table.external_data_configuration = ejob_config
table.description = 'External / Parquet / no shard / no partition / no cluster by partition / no cluster'
table = bq.create_table(table, exists_ok = True)

## List Tables

In [None]:
for table in bq.list_tables(dataset_id):
  print(f"{table.project}.{table.dataset_id}.{table.table_id}")

statmike-project-1.bq_explained_tables.EAxxxx0_wikiviews
statmike-project-1.bq_explained_tables.ECxxxx0_wikiviews
statmike-project-1.bq_explained_tables.EJxxxx0_wikiviews
statmike-project-1.bq_explained_tables.INSPCC3_wikiviews_feb
statmike-project-1.bq_explained_tables.INSPCC3_wikiviews_jan
statmike-project-1.bq_explained_tables.INSPCx1_wikiviews_feb
statmike-project-1.bq_explained_tables.INSPCx1_wikiviews_jan
statmike-project-1.bq_explained_tables.INSPxC2_wikiviews_feb
statmike-project-1.bq_explained_tables.INSPxC2_wikiviews_jan
statmike-project-1.bq_explained_tables.INSPxx0_wikiviews_feb
statmike-project-1.bq_explained_tables.INSPxx0_wikiviews_jan
statmike-project-1.bq_explained_tables.INSxCC3_wikiviews_feb
statmike-project-1.bq_explained_tables.INSxCC3_wikiviews_jan
statmike-project-1.bq_explained_tables.INSxCx1_wikiviews_feb
statmike-project-1.bq_explained_tables.INSxCx1_wikiviews_jan
statmike-project-1.bq_explained_tables.INSxxx0_wikiviews_feb
statmike-project-1.bq_explained_tabl

## Query Metrics By Table Type
The same simple, filtered aggregate is run against every table. The filter:
- spans two months on `datehour`, so it crosses the January and February shards
- restricts `wiki` to values starting with `en.m`

```sql
SELECT SUM(views)
FROM project_id.dataset_id.table_id
WHERE datehour BETWEEN '2021-01-30' AND '2021-02-02' AND STARTS_WITH(wiki, 'en.m')
```
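To see why this filter is a useful test, it helps to count how much of each table layout it can touch. The sketch below assumes the two string literals are read as midnight timestamps, so the inclusive `BETWEEN` range reaches at most four daily partitions (the last one only for its midnight hour) and both monthly shards. The helper `daily_partitions` is a hypothetical name for this illustration.

```python
from datetime import date, timedelta

def daily_partitions(start: date, end: date) -> list[date]:
    """Every calendar day from start to end, inclusive."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]

days = daily_partitions(date(2021, 1, 30), date(2021, 2, 2))
months = sorted({(d.year, d.month) for d in days})
print(len(days), months)
```

Four days out of roughly two months of data is a small fraction, which is the saving that partition pruning is expected to show in the bytes-scanned metric.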

Naming convention: each table name starts with a 7-character code (positions 1234567):
1. Internal (I) / External (E)
2. Native (N) / JSON (J) / CSV (C) / Avro (A) / Parquet (P)
3. Shard (S) / No Sharding (x)
4. Partition (P) / No Partition (x)
5. Cluster by the Partition column (C) / No Cluster by the Partition column (x)
6. Cluster by other columns (C) / No Cluster by other columns (x)
7. How many Cluster columns (0-4)


In [None]:
definitions = {
    1: {
        'I': 'Internal',
        'E': 'External'
    },
    2: {
        'N': 'Native',
        'J': 'JSON',
        'C': 'CSV',
        'A': 'Avro',
        'P': 'Parquet'
    },
    3: {
        'S': 'Month',
        'x': 'None'
    },
    4: {
        'P': 'Day (datehour)',
        'x': 'None'
    },
    5: {
        'C': 'datehour',
        'x': 'None'
    },
    6: {
        'C': 'wiki, title',
        'x': 'None'
    }
}

In [None]:
import pandas as pd

def makequery(exp, table_id):
  s = "SELECT SUM(views) as total_views\n"
  w = "WHERE datehour BETWEEN '2021-01-30' AND '2021-02-02' AND STARTS_WITH(wiki, 'en.m')"
  if exp == 'Baseline':
    table = f"INxxxx0_{table_id}"
    f = f"FROM `{project_id}.{dataset_id}.{table}`\n"
    w = "WHERE datehour is not null and wiki is not null"
  elif exp[2]=='x':
    table = f"{exp}_{table_id}"
    f = f"FROM `{project_id}.{dataset_id}.{table}`\n"
  elif exp[2]=='S':
    table = f"{exp}_{table_id}_*"
    f = f"FROM `{project_id}.{dataset_id}.{table}`\n"
  return s + f + w

def runner(query):
  # Disable the query cache so every run actually scans the table.
  job = bq.query(query, job_config=bigquery.QueryJobConfig(use_query_cache=False))
  job.result()  # wait for the job to finish
  duration = (job.ended - job.started).total_seconds()
  gb_scanned = job.total_bytes_processed/1e9 # GB
  views = job.to_dataframe().total_views[0]
  slot_millis = job.slot_millis
  return views, duration, gb_scanned, slot_millis, job.job_id

def experiment(table_id):
  results = []
  exp = ['Baseline', 'INxxxx0', 'INSxxx0', 'INxPxx0', 'INSPxx0', 'INxxCx1', 'INSxCx1', 'INxPCx1', 'INSPCx1', 'INxPxC2', 'INSPxC2', 'INxPCC3', 'INSPCC3', 'INxxCC3', 'INSxCC3', 'EAxxxx0', 'ECxxxx0', 'EJxxxx0']#, 'EPxxxx0']
  for e in exp:
    query = makequery(e, table_id)
    views, duration, bytes, slot_millis, job_id = runner(query)
    e2 = e
    if e == 'Baseline': e2 = 'INxxxx0'
    results.append([e, definitions[1][e2[0]], definitions[2][e2[1]], definitions[3][e2[2]], definitions[4][e2[3]], definitions[5][e2[4]], definitions[6][e2[5]], views, duration, bytes, slot_millis, slot_millis/(duration*1000), job_id])
  resultsDF = pd.DataFrame(results, columns=['Experiment', 'Table Type', 'Storage Format', 'Sharded By', 'Partition Column', 'Cluster by Partition', 'Cluster Columns', 'Calculation', 'Duration(s)', 'Bytes Scanned(GB)', 'Slot Milliseconds', 'Average Slots', 'Job ID'])
  return resultsDF

In [None]:
results = experiment(table_id)

In [None]:
results

| | Experiment | Table Type | Storage Format | Sharded By | Partition Column | Cluster by Partition | Cluster Columns | Calculation | Duration(s) | Bytes Scanned(GB) | Slot Milliseconds | Average Slots | Job ID |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Baseline | Internal | Native | | | | | 15686788960 | 1.886 | 67.081487 | 507662 | 269.173913 | 859c4fe1-bf52-4a27-8fb6-484fef5d9d73 |
| 1 | INxxxx0 | Internal | Native | | | | | 534624689 | 1.321 | 67.081487 | 217510 | 164.655564 | 23677722-4f32-4b0c-a215-da339c524399 |
| 2 | INSxxx0 | Internal | Native | Month | | | | 534624689 | 1.478 | 67.081487 | 230063 | 155.658322 | cc630a35-718c-45de-b150-52a4cabadde4 |
| 3 | INxPxx0 | Internal | Native | | Day (datehour) | | | 534624689 | 0.867 | 4.568752 | 13424 | 15.483276 | da17de17-da27-4f82-995a-4e933335b08d |
| 4 | INSPxx0 | Internal | Native | Month | Day (datehour) | | | 534624689 | 0.977 | 4.568752 | 14142 | 14.474923 | 3836d9c1-8837-46b5-a6a9-6ce86f22ab27 |
| 5 | INxxCx1 | Internal | Native | | | datehour | | 534624689 | 1.309 | 3.471956 | 34606 | 26.436975 | f27f3efd-81b7-40cc-86d0-915a4c254cac |
| 6 | INSxCx1 | Internal | Native | Month | | datehour | | 534624689 | 1.734 | 67.081487 | 73159 | 42.190888 | ce633a3b-30eb-4341-b191-5c20325b3360 |
| 7 | INxPCx1 | Internal | Native | | Day (datehour) | datehour | | 534624689 | 0.598 | 3.471956 | 8593 | 14.369565 | 18a49706-ecd0-47b0-af25-7e1958c7c07b |
| 8 | INSPCx1 | Internal | Native | Month | Day (datehour) | datehour | | 534624689 | 0.712 | 3.471956 | 10392 | 14.595506 | 9a67edb1-51b1-4406-a8cb-4ede88425e20 |
| 9 | INxPxC2 | Internal | Native | | Day (datehour) | | wiki, title | 534624689 | 0.67 | 2.468439 | 8387 | 12.51791 | beafad11-b70b-4a1a-8224-55555f2ca1ef |
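One way to read the bytes-scanned column is as a fraction of the unpartitioned, unclustered scan. The sketch below does that arithmetic for three rows of the output above; the numbers are copied from that output, and the helper `fraction_of_baseline` is a hypothetical name for this illustration.

```python
baseline_gb = 67.081487  # INxxxx0: no sharding, partitioning, or clustering

scanned_gb = {
    "INxPxx0": 4.568752,  # partitioned by day
    "INxxCx1": 3.471956,  # clustered by datehour
    "INxPxC2": 2.468439,  # partitioned by day, clustered by wiki, title
}

def fraction_of_baseline(gb: float) -> float:
    """Bytes scanned relative to the unpartitioned, unclustered table."""
    return gb / baseline_gb

for name, gb in scanned_gb.items():
    print(f"{name}: {fraction_of_baseline(gb):.1%} of the baseline scan")
```

These ratios describe a single run of a single query, so they indicate the direction of the effect rather than a general benchmark.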


---
## Comparing Tables

In [None]:
import plotly.graph_objects as go

def add_dummies(idf, col, newcol):
  df = pd.DataFrame({col: idf[col].unique()})
  df[newcol] = df.index
  idf = pd.merge(idf, df, on = col, how='left')
  return idf, df

results, D1 = add_dummies(results, 'Table Type', 'D1')
results, D2 = add_dummies(results, 'Storage Format', 'D2')
results, D3 = add_dummies(results, 'Sharded By', 'D3')
results, D4 = add_dummies(results, 'Partition Column', 'D4')
results, D5 = add_dummies(results, 'Cluster by Partition', 'D5')
results, D6 = add_dummies(results, 'Cluster Columns', 'D6')

dimensions = list([
                   dict(
                       range = [results['D1'].min(), results['D1'].max()],
                       label = 'Table Type',
                       tickvals = D1['D1'],
                       ticktext = D1['Table Type'],
                       values = results['D1']
                   ),
                   dict(
                       range = [results['D2'].min(), results['D2'].max()],
                       label = 'Storage Format',
                       tickvals = D2['D2'],
                       ticktext = D2['Storage Format'],
                       values = results['D2']
                   ),
                   dict(
                       range = [results['D3'].min(), results['D3'].max()],
                       label = 'Sharded By',
                       tickvals = D3['D3'],
                       ticktext = D3['Sharded By'],
                       values = results['D3']
                   ),
                   dict(
                       range = [results['D4'].min(), results['D4'].max()],
                       label = 'Partition Column',
                       tickvals = D4['D4'],
                       ticktext = D4['Partition Column'],
                       values = results['D4']
                   ),
                   dict(
                       range = [results['D5'].min(), results['D5'].max()],
                       label = 'Cluster by Partition',
                       tickvals = D5['D5'],
                       ticktext = D5['Cluster by Partition'],
                       values = results['D5']
                   ),
                   dict(
                       range = [results['D6'].min(), results['D6'].max()],
                       label = 'Cluster Columns',
                       tickvals = D6['D6'],
                       ticktext = D6['Cluster Columns'],
                       values = results['D6']
                   ),
                   dict(
                       range = [results['Bytes Scanned(GB)'].min(), results['Bytes Scanned(GB)'].max()],
                       label = 'Bytes Scanned(GB)',
                       values = results['Bytes Scanned(GB)']
                   ),
                   dict(
                       range = [results['Duration(s)'].min(), results['Duration(s)'].max()],
                       label = 'Duration(s)',
                       values = results['Duration(s)']
                   ),
                   dict(
                       range = [results['Average Slots'].min(), results['Average Slots'].max()],
                       label = 'Average Slots',
                       values = results['Average Slots']
                   )
])

In [None]:
fig = go.Figure(
        data = go.Parcoords(
            line = dict(
                color = results['Average Slots'],
                colorscale = 'Rainbow',
                showscale = True
            ),
            dimensions = dimensions
        )
      )
fig.show()
fig.write_html('bq_tables.html')

---
## Examining the BigQuery Jobs in more detail using INFORMATION_SCHEMA.JOBS_BY_PROJECT

In [None]:
query = f"""
SELECT job_id, job_type, query, start_time,
  TIMESTAMP_DIFF(end_time, start_time, MILLISECOND)/1000 as duration,
  total_slot_ms,
  total_slot_ms / TIMESTAMP_DIFF(end_time, start_time, MILLISECOND) as avg_slots,
  total_bytes_processed/1e9 as gb_processed
FROM {project_id}.region-{dataset.location}.INFORMATION_SCHEMA.JOBS_BY_PROJECT
WHERE creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR) AND CURRENT_TIMESTAMP()
  and job_id in ({', '.join(['"{}"'.format(value) for value in results['Job ID'].tolist()])})
ORDER BY start_time
"""
jobs = bq.query(query).to_dataframe()
jobs

| | job_id | job_type | query | start_time | duration | total_slot_ms | avg_slots | gb_processed |
|---|---|---|---|---|---|---|---|---|
| 0 | 859c4fe1-bf52-4a27-8fb6-484fef5d9d73 | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:43:52.460000+00:00 | 1.886 | 507662 | 269.173913 | 67.081487 |
| 1 | 23677722-4f32-4b0c-a215-da339c524399 | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:43:55.121000+00:00 | 1.321 | 217510 | 164.655564 | 67.081487 |
| 2 | cc630a35-718c-45de-b150-52a4cabadde4 | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:43:57.178000+00:00 | 1.478 | 230063 | 155.658322 | 67.081487 |
| 3 | da17de17-da27-4f82-995a-4e933335b08d | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:43:59.428000+00:00 | 0.867 | 13424 | 15.483276 | 4.568752 |
| 4 | 3836d9c1-8837-46b5-a6a9-6ce86f22ab27 | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:44:00.865000+00:00 | 0.977 | 14142 | 14.474923 | 4.568752 |
| 5 | f27f3efd-81b7-40cc-86d0-915a4c254cac | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:44:02.572000+00:00 | 1.309 | 34606 | 26.436975 | 3.471956 |
| 6 | ce633a3b-30eb-4341-b191-5c20325b3360 | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:44:04.534000+00:00 | 1.734 | 73159 | 42.190888 | 67.081487 |
| 7 | 18a49706-ecd0-47b0-af25-7e1958c7c07b | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:44:07.090000+00:00 | 0.598 | 8593 | 14.369565 | 3.471956 |
| 8 | 9a67edb1-51b1-4406-a8cb-4ede88425e20 | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:44:08.294000+00:00 | 0.712 | 10392 | 14.595506 | 3.471956 |
| 9 | beafad11-b70b-4a1a-8224-55555f2ca1ef | QUERY | ``SELECT SUM(views) as total_views\nFROM `statmi...`` | 2021-11-09 19:44:09.713000+00:00 | 0.67 | 8387 | 12.51791 | 2.468439 |
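The `job_id in (...)` clause in the query above is assembled by joining quoted job IDs into the SQL text. A small helper can make that step explicit and refuse values that would break the quoting. This is a sketch only: `sql_string_list` is a hypothetical name, and the check is deliberately conservative rather than a complete escaping routine.

```python
def sql_string_list(values: list[str]) -> str:
    """Render values as a comma-separated list of double-quoted SQL strings."""
    for v in values:
        # Reject anything that could terminate or escape the quoted literal.
        if '"' in v or "\\" in v:
            raise ValueError(f"refusing to quote unsafe value: {v!r}")
    return ", ".join(f'"{v}"' for v in values)

job_ids = ["859c4fe1-bf52-4a27-8fb6-484fef5d9d73", "23677722-4f32-4b0c-a215-da339c524399"]
print(f"job_id in ({sql_string_list(job_ids)})")
```

Job IDs generated by the client are safe to inline this way; for values that come from users, query parameters (for example `bigquery.ArrayQueryParameter`) are generally the safer route.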


---
## Examining the BigQuery Jobs in more detail using INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT

In [None]:
query = f"""
SELECT job_id, job_start_time,
  TIMESTAMP_DIFF(job_end_time, job_start_time, MILLISECOND)/1000 as job_duration,
  period_start,
  total_bytes_processed/1e9 as job_gb_processed,
  period_slot_ms/1000 as period_avg_slots
FROM {project_id}.region-{dataset.location}.INFORMATION_SCHEMA.JOBS_TIMELINE_BY_PROJECT
WHERE job_creation_time BETWEEN TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 3 HOUR) AND CURRENT_TIMESTAMP()
  and job_id in ({', '.join(['"{}"'.format(value) for value in results['Job ID'].tolist()])})
ORDER BY job_start_time, job_id, period_start
"""
jobs = bq.query(query).to_dataframe()

In [None]:
for value in results['Job ID'].tolist():
  display(jobs[jobs['job_id'] == value])

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 0 | 11192793-50d9-41dc-b6d4-33d547d41df0 | 2021-11-08 21:31:38.986000+00:00 | 1.894 | 2021-11-08 21:31:38+00:00 | 67.081487 | 0.215 |
| 1 | 11192793-50d9-41dc-b6d4-33d547d41df0 | 2021-11-08 21:31:38.986000+00:00 | 1.894 | 2021-11-08 21:31:39+00:00 | 67.081487 | 189.43 |
| 2 | 11192793-50d9-41dc-b6d4-33d547d41df0 | 2021-11-08 21:31:38.986000+00:00 | 1.894 | 2021-11-08 21:31:40+00:00 | 67.081487 | 409.078 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 3 | 4ebb69d4-8868-4795-989e-4ac826d6b659 | 2021-11-08 21:31:41.607000+00:00 | 1.429 | 2021-11-08 21:31:41+00:00 | 67.081487 | 6.058 |
| 4 | 4ebb69d4-8868-4795-989e-4ac826d6b659 | 2021-11-08 21:31:41.607000+00:00 | 1.429 | 2021-11-08 21:31:42+00:00 | 67.081487 | 248.491 |
| 5 | 4ebb69d4-8868-4795-989e-4ac826d6b659 | 2021-11-08 21:31:41.607000+00:00 | 1.429 | 2021-11-08 21:31:43+00:00 | 67.081487 | 0.0 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 6 | 22206a30-b01e-458c-acdb-17528c0865a1 | 2021-11-08 21:31:43.726000+00:00 | 2.041 | 2021-11-08 21:31:43+00:00 | 67.081487 | 2.452 |
| 7 | 22206a30-b01e-458c-acdb-17528c0865a1 | 2021-11-08 21:31:43.726000+00:00 | 2.041 | 2021-11-08 21:31:44+00:00 | 67.081487 | 104.9 |
| 8 | 22206a30-b01e-458c-acdb-17528c0865a1 | 2021-11-08 21:31:43.726000+00:00 | 2.041 | 2021-11-08 21:31:45+00:00 | 67.081487 | 140.719 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 9 | 238f6616-9d0e-4ad9-a472-e63baf3613de | 2021-11-08 21:31:46.604000+00:00 | 1.24 | 2021-11-08 21:31:46+00:00 | 4.568752 | 2.991 |
| 10 | 238f6616-9d0e-4ad9-a472-e63baf3613de | 2021-11-08 21:31:46.604000+00:00 | 1.24 | 2021-11-08 21:31:47+00:00 | 4.568752 | 11.446 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 11 | 6e83ed00-240b-4e1f-8953-cf17457bc82c | 2021-11-08 21:31:48.891000+00:00 | 1.035 | 2021-11-08 21:31:48+00:00 | 4.568752 | 1.14 |
| 12 | 6e83ed00-240b-4e1f-8953-cf17457bc82c | 2021-11-08 21:31:48.891000+00:00 | 1.035 | 2021-11-08 21:31:49+00:00 | 4.568752 | 13.835 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 13 | 52d619cb-834e-4e36-9deb-aa35e2ad613d | 2021-11-08 21:31:50.708000+00:00 | 1.102 | 2021-11-08 21:31:50+00:00 | 3.471956 | 1.613 |
| 14 | 52d619cb-834e-4e36-9deb-aa35e2ad613d | 2021-11-08 21:31:50.708000+00:00 | 1.102 | 2021-11-08 21:31:51+00:00 | 3.471956 | 7.964 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 15 | ab39949e-5639-4cb3-bdf8-6baf94c76d7d | 2021-11-08 21:31:52.395000+00:00 | 1.197 | 2021-11-08 21:31:52+00:00 | 67.081487 | 5.406 |
| 16 | ab39949e-5639-4cb3-bdf8-6baf94c76d7d | 2021-11-08 21:31:52.395000+00:00 | 1.197 | 2021-11-08 21:31:53+00:00 | 67.081487 | 6.467 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 17 | e10e96db-29f9-4d28-b106-de0eec89e65b | 2021-11-08 21:31:54.389000+00:00 | 0.59 | 2021-11-08 21:31:54+00:00 | 3.471956 | 8.838 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 18 | 11db0738-05a8-473d-92bd-59781c197b2e | 2021-11-08 21:31:55.520000+00:00 | 0.645 | 2021-11-08 21:31:55+00:00 | 3.471956 | 7.97 |
| 19 | 11db0738-05a8-473d-92bd-59781c197b2e | 2021-11-08 21:31:55.520000+00:00 | 0.645 | 2021-11-08 21:31:56+00:00 | 3.471956 | 2.225 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 20 | 63ab84e0-a3bb-45e3-9a88-80daceff165a | 2021-11-08 21:31:58.068000+00:00 | 0.681 | 2021-11-08 21:31:58+00:00 | 2.481411 | 8.633 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 21 | da29075d-5ba8-4daa-962a-857ead1a8608 | 2021-11-08 21:31:59.344000+00:00 | 0.735 | 2021-11-08 21:31:59+00:00 | 2.455107 | 8.991 |
| 22 | da29075d-5ba8-4daa-962a-857ead1a8608 | 2021-11-08 21:31:59.344000+00:00 | 0.735 | 2021-11-08 21:32:00+00:00 | 2.455107 | 0.562 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 23 | a03b8412-b237-4dc6-99bb-a8c2ac1723c1 | 2021-11-08 21:32:01.882000+00:00 | 0.818 | 2021-11-08 21:32:01+00:00 | 3.28917 | 1.849 |
| 24 | a03b8412-b237-4dc6-99bb-a8c2ac1723c1 | 2021-11-08 21:32:01.882000+00:00 | 0.818 | 2021-11-08 21:32:02+00:00 | 3.28917 | 8.29 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 25 | 09a75073-8d4a-43c9-b3f7-569746434a28 | 2021-11-08 21:32:03.365000+00:00 | 1.156 | 2021-11-08 21:32:03+00:00 | 2.871268 | 2.384 |
| 26 | 09a75073-8d4a-43c9-b3f7-569746434a28 | 2021-11-08 21:32:03.365000+00:00 | 1.156 | 2021-11-08 21:32:04+00:00 | 2.871268 | 6.823 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 27 | 6f92579f-d82f-4376-affa-5ace6c391c66 | 2021-11-08 21:32:05.525000+00:00 | 1.822 | 2021-11-08 21:32:05+00:00 | 3.298166 | 1.275 |
| 28 | 6f92579f-d82f-4376-affa-5ace6c391c66 | 2021-11-08 21:32:05.525000+00:00 | 1.822 | 2021-11-08 21:32:06+00:00 | 3.298166 | 6.255 |
| 29 | 6f92579f-d82f-4376-affa-5ace6c391c66 | 2021-11-08 21:32:05.525000+00:00 | 1.822 | 2021-11-08 21:32:07+00:00 | 3.298166 | 2.286 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 30 | cf565166-fe33-46b4-aee0-093999e2032e | 2021-11-08 21:32:07.970000+00:00 | 1.466 | 2021-11-08 21:32:07+00:00 | 67.081487 | 0.104 |
| 31 | cf565166-fe33-46b4-aee0-093999e2032e | 2021-11-08 21:32:07.970000+00:00 | 1.466 | 2021-11-08 21:32:08+00:00 | 67.081487 | 5.048 |
| 32 | cf565166-fe33-46b4-aee0-093999e2032e | 2021-11-08 21:32:07.970000+00:00 | 1.466 | 2021-11-08 21:32:09+00:00 | 67.081487 | 3.817 |

| | job_id | job_start_time | job_duration | period_start | job_gb_processed | period_avg_slots |
|---|---|---|---|---|---|---|
| 33 | 51057dcd-43e7-4e5e-8811-d66cbf560e5a | 2021-11-08 21:32:10.507000+00:00 | 48.129 | 2021-11-08 21:32:10+00:00 | 119.844204 | 0.864 |
| 34 | 51057dcd-43e7-4e5e-8811-d66cbf560e5a | 2021-11-08 21:32:10.507000+00:00 | 48.129 | 2021-11-08 21:32:11+00:00 | 119.844204 | 8.185 |
35,51057dcd-43e7-4e5e-8811-d66cbf560e5a,2021-11-08 21:32:10.507000+00:00,48.129,2021-11-08 21:32:12+00:00,119.844204,22.442
36,51057dcd-43e7-4e5e-8811-d66cbf560e5a,2021-11-08 21:32:10.507000+00:00,48.129,2021-11-08 21:32:13+00:00,119.844204,34.038
37,51057dcd-43e7-4e5e-8811-d66cbf560e5a,2021-11-08 21:32:10.507000+00:00,48.129,2021-11-08 21:32:14+00:00,119.844204,37.27
38,51057dcd-43e7-4e5e-8811-d66cbf560e5a,2021-11-08 21:32:10.507000+00:00,48.129,2021-11-08 21:32:15+00:00,119.844204,43.105
39,51057dcd-43e7-4e5e-8811-d66cbf560e5a,2021-11-08 21:32:10.507000+00:00,48.129,2021-11-08 21:32:16+00:00,119.844204,66.977
40,51057dcd-43e7-4e5e-8811-d66cbf560e5a,2021-11-08 21:32:10.507000+00:00,48.129,2021-11-08 21:32:17+00:00,119.844204,128.711
41,51057dcd-43e7-4e5e-8811-d66cbf560e5a,2021-11-08 21:32:10.507000+00:00,48.129,2021-11-08 21:32:18+00:00,119.844204,216.586
42,51057dcd-43e7-4e5e-8811-d66cbf560e5a,2021-11-08 21:32:10.507000+00:00,48.129,2021-11-08 21:32:19+00:00,119.844204,290.998


Unnamed: 0,job_id,job_start_time,job_duration,period_start,job_gb_processed,period_avg_slots
82,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:05+00:00,161.553434,1.005
83,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:06+00:00,161.553434,8.284
84,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:07+00:00,161.553434,19.118
85,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:08+00:00,161.553434,26.318
86,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:09+00:00,161.553434,37.914
87,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:10+00:00,161.553434,116.119
88,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:11+00:00,161.553434,272.717
89,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:12+00:00,161.553434,402.805
90,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:13+00:00,161.553434,587.647
91,6ca8541d-9338-4308-844b-fc1c0a4cd131,2021-11-08 21:33:05.233000+00:00,30.331,2021-11-08 21:33:14+00:00,161.553434,882.667


Unnamed: 0,job_id,job_start_time,job_duration,period_start,job_gb_processed,period_avg_slots
113,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:36+00:00,301.47952,3.183
114,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:37+00:00,301.47952,29.774
115,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:38+00:00,301.47952,78.623
116,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:39+00:00,301.47952,105.302
117,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:40+00:00,301.47952,213.785
118,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:41+00:00,301.47952,464.389
119,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:42+00:00,301.47952,696.051
120,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:43+00:00,301.47952,1000.256
121,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:44+00:00,301.47952,1330.464
122,5bfbfff3-4fb6-4c29-be2b-525f73903efd,2021-11-08 21:33:36.436000+00:00,35.674,2021-11-08 21:33:45+00:00,301.47952,1445.392
