From d8163fe49664c46631c1e317a82b6e2e6fa17d92 Mon Sep 17 00:00:00 2001
From: Phillip LeBlanc
Date: Tue, 18 Feb 2025 14:29:24 +0900
Subject: [PATCH 1/2] Add docs on `time_partition_column`

---
 .../features/data-acceleration/data-refresh.md | 14 ++++++++++++++
 website/docs/reference/spicepod/datasets.md    | 11 +++++++++++
 2 files changed, 25 insertions(+)

diff --git a/website/docs/features/data-acceleration/data-refresh.md b/website/docs/features/data-acceleration/data-refresh.md
index db44c2066..1ad92aefc 100644
--- a/website/docs/features/data-acceleration/data-refresh.md
+++ b/website/docs/features/data-acceleration/data-refresh.md
@@ -58,6 +58,20 @@ datasets:
 
 If late arriving data or clock-skew needs to be accounted for, an optional overlap can also be specified. See [`acceleration.refresh_append_overlap`](/docs/reference/spicepod/datasets#accelerationrefresh_append_overlap).
 
+Datasets that are partitioned by a less-granular time-column (e.g. day, month, year) can also use the `time_partition_column` parameter in addition to the `time_column` parameter to specify the time-column to use for efficient partition pruning.
+
+Example:
+
+```yaml
+datasets:
+  - from: databricks:my_dataset
+    name: accelerated_dataset
+    time_column: created_at
+    time_format: iso8601
+    time_partition_column: created_at_day
+    time_partition_format: date
+```
+
 ### Changes (CDC)
 
 Datasets configured with acceleration `refresh_mode: changes` requires a [Change Data Capture (CDC)](/docs/features/cdc/index.md) supported data connector. Initial CDC support in Spice is supported by the [Debezium data connector](/docs/components/data-connectors/debezium.md).
diff --git a/website/docs/reference/spicepod/datasets.md b/website/docs/reference/spicepod/datasets.md
index f129f4624..900dd2d86 100644
--- a/website/docs/reference/spicepod/datasets.md
+++ b/website/docs/reference/spicepod/datasets.md
@@ -150,6 +150,7 @@ Optional. The format of the `time_column`. The following values are supported:
 - `unix_seconds` - Unix timestamp in seconds. E.g. `1718756687`.
 - `unix_millis` - Unix timestamp in milliseconds. E.g. `1718756687000`.
 - `ISO8601` - [ISO 8601](https://en.wikipedia.org/wiki/ISO_8601) format.
+- `date` - Date in YYYY-MM-DD format. E.g. `2024-01-01`.
 
 Spice emits a warning if the `time_column` from the data source is incompatible with the `time_format` config.
 
@@ -159,6 +160,16 @@ Spice emits a warning if the `time_column` from the data source is incompatible
 
 :::
 
+## `time_partition_column`
+
+Optional. The name of the column that represents the time-based partitioning of the dataset. Requires `time_column` to be set.
+
+This parameter is used when the dataset is partitioned by a less-granular time-column (e.g. day, month, year), but the data source has a more granular time-column available (e.g. timestamp). This can ensure that queries for a specific time range are optimized by the data source to use the appropriate partitions.
+
+## `time_partition_format`
+
+Optional. The format of the `time_partition_column`. The same format options as `time_format` are supported.
+
 ## `unsupported_type_action`
 
 Optional. Specifies the action to take when a data type that is not supported by the data connector is encountered.

From b349cdd86ec8aa4adad652ef542d6ff6e75d2530 Mon Sep 17 00:00:00 2001
From: Luke Kim <80174+lukekim@users.noreply.github.com>
Date: Tue, 18 Feb 2025 14:38:44 +0900
Subject: [PATCH 2/2] Tweak docs

---
 website/docs/reference/spicepod/datasets.md | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/website/docs/reference/spicepod/datasets.md b/website/docs/reference/spicepod/datasets.md
index 900dd2d86..10d3cece5 100644
--- a/website/docs/reference/spicepod/datasets.md
+++ b/website/docs/reference/spicepod/datasets.md
@@ -162,13 +162,11 @@ Spice emits a warning if the `time_column` from the data source is incompatible
 
 ## `time_partition_column`
 
-Optional. The name of the column that represents the time-based partitioning of the dataset. Requires `time_column` to be set.
-
-This parameter is used when the dataset is partitioned by a less-granular time-column (e.g. day, month, year), but the data source has a more granular time-column available (e.g. timestamp). This can ensure that queries for a specific time range are optimized by the data source to use the appropriate partitions.
+(Optional) Specify the column that represents the physical partitioning of the dataset when using append-based acceleration. When the defined `time_column` is a fine-grained timestamp and the dataset is physically partitioned by a coarser granularity (for example, by date), setting `time_partition_column` to the partition column (e.g. date_col) improves partition pruning, excludes irrelevant partitions during refreshes, and optimizes scan efficiency.
 
 ## `time_partition_format`
 
-Optional. The format of the `time_partition_column`. The same format options as `time_format` are supported.
+(Optional) Define the format of the `time_partition_column`. For instance, if the physical partitions follow a date format (YYYY-MM-DD), set this value to `date`. The same format options as `time_format` are supported for `time_partition_column`.
 
 ## `unsupported_type_action`
 
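
Reviewer sketch (not part of the patch): one way to picture the pruning that the new `time_partition_column` docs describe. This is an illustration under stated assumptions, not Spice's actual implementation. The function name `refresh_filters` is hypothetical, and the column names are borrowed from the YAML example in PATCH 1/2. It assumes an append refresh filters on both columns: a coarse predicate on the `date`-formatted partition column, so the source can skip whole partitions, and a fine predicate on the timestamp column, so only new rows are kept.

```python
from datetime import datetime, timezone


def refresh_filters(last_seen: datetime,
                    time_column: str = "created_at",
                    time_partition_column: str = "created_at_day") -> str:
    """Build a hypothetical WHERE clause for an append refresh.

    Illustration only: the real query generated by Spice may differ.
    """
    # Normalize to UTC so both predicates refer to the same instant.
    ts = last_seen.astimezone(timezone.utc)
    # Coarse value in the `date` format (YYYY-MM-DD) for the partition column.
    partition_value = ts.date().isoformat()
    # Fine-grained ISO 8601 value for the time column.
    time_value = ts.isoformat()
    return (f"{time_partition_column} >= '{partition_value}' "
            f"AND {time_column} > '{time_value}'")


print(refresh_filters(datetime(2025, 2, 18, 5, 29, 24, tzinfo=timezone.utc)))
# prints: created_at_day >= '2025-02-18' AND created_at > '2025-02-18T05:29:24+00:00'
```

The partition predicate uses `>=` rather than `>` because rows newer than the last seen timestamp can still land in the same day partition.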