# Creating Date-Partitioned Tables in BigQuery

Query partitioned tables and create new table partitions in a dataset to improve query performance and reduce cost.

### 1. Connecting Jupyter Notebook to BigQuery

Set the `GOOGLE_APPLICATION_CREDENTIALS` environment variable so the notebook can connect to BigQuery.

In [8]:
import os 
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = 'D:/Agra/Data-Engineer/GCP-DataEngineerLearningPath/Quest-DataWarehouses/Quest-2-Creating-Date-Partitioned-Table/qwiklabs-gcp-02-cad4f40440fa-5aa89f4d270b.json'
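A mistyped key path only surfaces later as an authentication error, so it can help to check the file before setting the variable. The following is a minimal sketch using only the standard library; `set_credentials` is a hypothetical helper name, not part of any Google library.

```python
import os
from pathlib import Path


def set_credentials(key_path):
    """Point GOOGLE_APPLICATION_CREDENTIALS at a key file that actually exists."""
    path = Path(key_path).expanduser()
    if not path.is_file():
        # Fail early with a clear message instead of a later auth error.
        raise FileNotFoundError(f"Service account key not found: {path}")
    os.environ["GOOGLE_APPLICATION_CREDENTIALS"] = str(path)
    return path
```

Calling `set_credentials(...)` with the path of the service account key file then replaces the direct assignment to `os.environ`.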

Load the BigQuery extension for the notebook, which provides the `%%bigquery` cell magic, by executing the command below.

In [6]:
%load_ext google.cloud.bigquery

### 2. Create a New Dataset

The dataset stores the tables used for the insights. A new dataset named `ecommerce` can be created with a SQL query.

In [9]:
%%bigquery
CREATE SCHEMA ecommerce

The `ecommerce` dataset has been created.

### 3. Creating Tables with Date Partitions

Before creating a new partitioned table, explore the data in the non-partitioned table first.

Try executing this query.

In [10]:
%%bigquery
SELECT DISTINCT
  fullVisitorId,
  date,
  city,
  pageTitle
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20170708'
LIMIT 5

Unnamed: 0,fullVisitorId,date,city,pageTitle
0,694528843985213255,20170708,Athens,
1,7166902511626681041,20170708,Mountain View,Nest-USA
2,9515822525130944272,20170708,not available in demo dataset,Nest-USA
3,6563097800614871419,20170708,Mountain View,Nest-USA
4,523286475171311815,20170708,Boston,Nest-USA


The query engine processes `1.74 GB` when running this query.
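One way to see how much data a query would process, without actually running it, is a dry run. The sketch below assumes the `google-cloud-bigquery` client library is installed and that credentials are configured as in step 1. `estimate_scan` and `human_size` are hypothetical helper names, and the 1024-based units are an assumption about how the size is displayed.

```python
def human_size(num_bytes):
    """Format a byte count using 1024-based units (hypothetical helper)."""
    size = float(num_bytes)
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if size < 1024 or unit == "TB":
            return f"{int(size)} B" if unit == "B" else f"{size:.2f} {unit}"
        size /= 1024


def estimate_scan(sql, client=None):
    """Return the bytes a query would process, using a dry run (no rows are read)."""
    # Imported here so the formatting helper works without the library installed.
    from google.cloud import bigquery

    client = client or bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed
```

For example, `human_size(estimate_scan(sql))` with the query text in `sql` gives a readable estimate before any bytes are billed.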

Revise the query to filter on a date in 2018, then execute it.

In [11]:
%%bigquery
SELECT DISTINCT
  fullVisitorId,
  date,
  city,
  pageTitle
FROM `data-to-insights.ecommerce.all_sessions_raw`
WHERE date = '20180708'
LIMIT 5

Unnamed: 0,fullVisitorId,date,city,pageTitle


The query engine still processes `1.74 GB` even though the query returns 0 results. Why? The query engine does not know in advance whether any rows satisfy the date condition in the WHERE clause, so it has to scan every record in the non-partitioned table.

To avoid scanning the entire table, create a date-partitioned table.

In [12]:
%%bigquery
CREATE OR REPLACE TABLE ecommerce.partition_by_day
PARTITION BY date_formatted
OPTIONS(
    description="a table partitioned by date"
) AS
SELECT DISTINCT
    PARSE_DATE("%Y%m%d", date) AS date_formatted,
    fullvisitorId
FROM `data-to-insights.ecommerce.all_sessions_raw`

Partitioning lets the query engine skip records in partitions that are irrelevant to a query.

To check that the table is partitioned, run this query.

In [20]:
%%bigquery
SELECT *
FROM `ecommerce.INFORMATION_SCHEMA.COLUMNS`
WHERE table_name = 'partition_by_day'

Unnamed: 0,table_catalog,table_schema,table_name,column_name,ordinal_position,is_nullable,data_type,is_generated,generation_expression,is_stored,is_hidden,is_updatable,is_system_defined,is_partitioning_column,clustering_ordinal_position,collation_name,column_default,rounding_mode
0,qwiklabs-gcp-02-cad4f40440fa,ecommerce,partition_by_day,date_formatted,1,YES,DATE,NEVER,,,NO,,NO,YES,,,,
1,qwiklabs-gcp-02-cad4f40440fa,ecommerce,partition_by_day,fullvisitorId,2,YES,STRING,NEVER,,,NO,,NO,NO,,,,


As we can see, `is_partitioning_column` is `YES` for `date_formatted`, so the `partition_by_day` table is partitioned on that date column.

Now query the data in the partitioned table.

In [22]:
%%bigquery
SELECT *
FROM `ecommerce.partition_by_day`
WHERE date_formatted = '2018-07-08'

Unnamed: 0,date_formatted,fullvisitorId


Running the same search for visitors on 8 July 2018 against the partitioned table, the query engine processes `0 B`. Why? The query engine knows which date partitions exist before the query runs, and there are no 2018 partitions.
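The difference between the two scans in this section can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how BigQuery is implemented: the rows are invented, "rows scanned" stands in for bytes processed, and all function names are hypothetical.

```python
from collections import defaultdict
from datetime import date, datetime

# Invented sample rows: (date string as in all_sessions_raw, visitor id).
raw_rows = [
    ("20170708", "visitor-a"),
    ("20170708", "visitor-b"),
    ("20170709", "visitor-c"),
    ("20170801", "visitor-a"),
]


def scan_unpartitioned(rows, wanted):
    """Every row must be read to find out whether it matches."""
    scanned, matches = 0, []
    for day, visitor in rows:
        scanned += 1
        if day == wanted:
            matches.append(visitor)
    return matches, scanned


def build_partitions(rows):
    """Group rows by parsed date, like PARSE_DATE plus PARTITION BY."""
    partitions = defaultdict(list)
    for day, visitor in rows:
        partitions[datetime.strptime(day, "%Y%m%d").date()].append(visitor)
    return dict(partitions)


def scan_partitioned(partitions, wanted):
    """Only the matching partition is read; a missing partition costs nothing."""
    rows = partitions.get(wanted, [])
    return list(rows), len(rows)


partitions = build_partitions(raw_rows)
print(scan_unpartitioned(raw_rows, "20180708"))        # ([], 4)
print(scan_partitioned(partitions, date(2018, 7, 8)))  # ([], 0)
print(scan_partitioned(partitions, date(2017, 7, 8)))  # (['visitor-a', 'visitor-b'], 2)
```

Both lookups for 8 July 2018 return nothing, but only the unpartitioned scan has to read all four rows to find that out.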

### 4. Creating an Auto-Expiring Partitioned Table

An auto-expiring partitioned table avoids unnecessary storage: each partition is deleted once it is older than a set age.

First, explore the NOAA weather data from 2022 onward, filtered to include only days that have had some precipitation (rain, snow, etc.).

In [25]:
%%bigquery
SELECT
  DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
  (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations WHERE stations.usaf = stn) AS station_name,
  prcp AS precipitation
FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
WHERE prcp < 99.9  
  AND prcp > 0      
  AND _TABLE_SUFFIX >= '2022'
ORDER BY date DESC
LIMIT 10

Unnamed: 0,date,station_name,precipitation
0,2023-08-14,HECTOR INTERNATIONAL AIRPORT,0.63
1,2023-08-14,DRAKE FLD,0.68
2,2023-08-14,NORTH WEST ALABAMA REGIONAL A,0.09
3,2023-08-14,LOWE AHP,0.35
4,2023-08-14,SEDALIA MEMORIAL,1.46
5,2023-08-14,FORT LAUDERDALE EXEC,0.13
6,2023-08-14,HOBART MUNICIPAL AIRPORT,0.09
7,2023-08-14,HARTFORD BRAINARD,0.22
8,2023-08-14,BLYTHEVILLE MUNI,0.05
9,2023-08-14,HUTCHINSON CO,0.23


Here are the 10 most recent precipitation records captured by weather stations.
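The logic of the query above can be mirrored in a few lines of Python to make each filter explicit. This is a sketch on invented rows, not real NOAA data. It assumes the table suffixes are four-digit years, so the string comparison `>= "2022"` orders them correctly, and that readings at or above 99.9 are placeholders the query wants to exclude. `keep` is a hypothetical helper name.

```python
from datetime import date

# Invented rows: (table suffix, year, mo, da, prcp), with date parts as strings.
rows = [
    ("2021", "2021", "12", "31", 0.40),   # dropped: table suffix before 2022
    ("2022", "2022", "01", "05", 0.00),   # dropped: no precipitation
    ("2022", "2022", "03", "17", 99.99),  # dropped: excluded placeholder value
    ("2023", "2023", "08", "14", 0.63),   # kept
]


def keep(suffix, prcp):
    """Mirror the WHERE clause: suffix filter plus the two prcp bounds."""
    return suffix >= "2022" and 0 < prcp < 99.9


# Mirror DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)).
result = [
    (date(int(year), int(mo), int(da)), prcp)
    for suffix, year, mo, da, prcp in rows
    if keep(suffix, prcp)
]
print(result)
```

Only the last invented row passes all three conditions.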

Create an auto-expiring partitioned table from the data explored above.

In [26]:
%%bigquery
CREATE OR REPLACE TABLE ecommerce.days_with_rain
PARTITION BY date
OPTIONS (
  partition_expiration_days=90,
  description="Weather stations with precipitation, partitioned by day"
) AS
  SELECT
    DATE(CAST(year AS INT64), CAST(mo AS INT64), CAST(da AS INT64)) AS date,
    (SELECT ANY_VALUE(name) FROM `bigquery-public-data.noaa_gsod.stations` AS stations WHERE stations.usaf = stn) AS station_name,
    prcp AS precipitation
  FROM `bigquery-public-data.noaa_gsod.gsod*` AS weather
  WHERE prcp < 99.9  
    AND prcp > 0      
    AND _TABLE_SUFFIX >= '2022'

To confirm that the table stores only data from the past 90 days, run the query below to get the age of each partition. Filter the station name to Soekarno Hatta Airport only.

In [30]:
%%bigquery
SELECT
  AVG(precipitation) AS average,
  station_name,
  date,
  CURRENT_DATE() AS today,
  DATE_DIFF(CURRENT_DATE(), date, DAY) AS partition_age,
  EXTRACT(MONTH FROM date) AS month
FROM ecommerce.days_with_rain
WHERE station_name = 'SOEKARNO HATTA INTL'
GROUP BY station_name, date, today, month, partition_age
ORDER BY date DESC

Unnamed: 0,average,station_name,date,today,partition_age,month
0,0.79,SOEKARNO HATTA INTL,2023-07-24,2023-08-16,23,7
1,0.2,SOEKARNO HATTA INTL,2023-07-07,2023-08-16,40,7
2,0.02,SOEKARNO HATTA INTL,2023-07-06,2023-08-16,41,7
3,0.43,SOEKARNO HATTA INTL,2023-06-25,2023-08-16,52,6
4,0.37,SOEKARNO HATTA INTL,2023-06-19,2023-08-16,58,6
5,0.39,SOEKARNO HATTA INTL,2023-06-16,2023-08-16,61,6
6,0.02,SOEKARNO HATTA INTL,2023-06-14,2023-08-16,63,6
7,0.63,SOEKARNO HATTA INTL,2023-06-06,2023-08-16,71,6
8,0.01,SOEKARNO HATTA INTL,2023-05-30,2023-08-16,78,5
9,0.63,SOEKARNO HATTA INTL,2023-05-22,2023-08-16,86,5


Once `partition_age` exceeds 90 days, the partition expires. Sorting by `partition_age` in descending order shows the oldest remaining partitions first.

In [31]:
%%bigquery
SELECT
  AVG(precipitation) AS average,
  station_name,
  date,
  CURRENT_DATE() AS today,
  DATE_DIFF(CURRENT_DATE(), date, DAY) AS partition_age,
  EXTRACT(MONTH FROM date) AS month
FROM ecommerce.days_with_rain
WHERE station_name = 'SOEKARNO HATTA INTL' 
GROUP BY station_name, date, today, month, partition_age
ORDER BY partition_age DESC

Unnamed: 0,average,station_name,date,today,partition_age,month
0,0.63,SOEKARNO HATTA INTL,2023-05-22,2023-08-16,86,5
1,0.01,SOEKARNO HATTA INTL,2023-05-30,2023-08-16,78,5
2,0.63,SOEKARNO HATTA INTL,2023-06-06,2023-08-16,71,6
3,0.02,SOEKARNO HATTA INTL,2023-06-14,2023-08-16,63,6
4,0.39,SOEKARNO HATTA INTL,2023-06-16,2023-08-16,61,6
5,0.37,SOEKARNO HATTA INTL,2023-06-19,2023-08-16,58,6
6,0.43,SOEKARNO HATTA INTL,2023-06-25,2023-08-16,52,6
7,0.02,SOEKARNO HATTA INTL,2023-07-06,2023-08-16,41,7
8,0.2,SOEKARNO HATTA INTL,2023-07-07,2023-08-16,40,7
9,0.79,SOEKARNO HATTA INTL,2023-07-24,2023-08-16,23,7
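The `partition_age` column is ordinary date arithmetic and can be reproduced in Python to check the numbers. The sketch below uses dates from the output above. `partition_age` and `probably_retained` are hypothetical helper names, and the retention check is a simplification: BigQuery decides the exact expiry instant, which may differ from this day-level comparison near the 90-day boundary.

```python
from datetime import date


def partition_age(partition_date, today):
    """Equivalent of DATE_DIFF(today, partition_date, DAY)."""
    return (today - partition_date).days


def probably_retained(partition_date, today, expiration_days=90):
    """Simplified check: a partition younger than the limit should still exist."""
    return partition_age(partition_date, today) < expiration_days


today = date(2023, 8, 16)
print(partition_age(date(2023, 5, 22), today))      # 86, the oldest row in the output
print(probably_retained(date(2023, 5, 22), today))  # True
print(probably_retained(date(2023, 4, 1), today))   # False
```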
