diff --git a/docs/api-reference/elasticsearch/index.md b/docs/api-reference/elasticsearch/index.md index 893074cf2..53a0904be 100644 --- a/docs/api-reference/elasticsearch/index.md +++ b/docs/api-reference/elasticsearch/index.md @@ -31,7 +31,7 @@ The `geo_latitude` and `geo_longitude` fields are combined into a single `g ### Self-describing events -Each [self-describing event](/docs/fundamentals/events/index.md#self-describing-events) gets its own field (same [naming rules](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=snowflake#location) as for Snowflake). For example: +Each [self-describing event](/docs/fundamentals/events/index.md#self-describing-events) gets its own field (same [naming rules](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=snowflake#location) as for Snowflake). For example: ```json { @@ -46,7 +46,7 @@ Each [self-describing event](/docs/fundamentals/events/index.md#self-describing- ### Entities -Each [entity](/docs/fundamentals/entities/index.md) type attached to the event gets its own field (same [naming rules](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=snowflake#location) as for Snowflake). The field contains an array with the data for all entities of the given type. For example: +Each [entity](/docs/fundamentals/entities/index.md) type attached to the event gets its own field (same [naming rules](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=snowflake#location) as for Snowflake). The field contains an array with the data for all entities of the given type. For example: ```json { diff --git a/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md b/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md index 3d1c4b2ee..458368986 100644 --- a/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md +++ b/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md @@ -31,7 +31,7 @@ The BigQuery Streaming Loader is an application that loads Snowplow events to Bi :::tip Schemas in BigQuery -For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=bigquery). +For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=bigquery). ::: diff --git a/docs/api-reference/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md b/docs/api-reference/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md index 65da6715b..fec545afd 100644 --- a/docs/api-reference/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md +++ b/docs/api-reference/loaders-storage-targets/bigquery-loader/previous-versions/bigquery-loader-1.x/index.md @@ -15,7 +15,7 @@ Under the umbrella of Snowplow BigQuery Loader, we have a family of applications :::tip Schemas in BigQuery -For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=bigquery). 
+For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=bigquery). ::: diff --git a/docs/api-reference/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md b/docs/api-reference/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md index f51eae3fa..5e02b5dd5 100644 --- a/docs/api-reference/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md +++ b/docs/api-reference/loaders-storage-targets/bigquery-loader/upgrade-guides/2-0-0-upgrade-guide/index.md @@ -110,4 +110,4 @@ If events with incorrectly evolved schemas never arrive, then the recovery colum ::: -You can read more about schema evolution and how recovery columns work [here](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=bigquery#versioning). +You can read more about schema evolution and how recovery columns work [here](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=bigquery#versioning). diff --git a/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md b/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md index 0619d0328..8f9ee70e2 100644 --- a/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md +++ b/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md @@ -39,7 +39,7 @@ The Databricks Streaming Loader is an application that integrates with a Databri :::tip Schemas in Databricks -For more information on how events are stored in Databricks, check the [mapping between Snowplow schemas and the corresponding Databricks column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=databricks). +For more information on how events are stored in Databricks, check the [mapping between Snowplow schemas and the corresponding Databricks column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=databricks). 
::: diff --git a/docs/destinations/warehouses-lakes/schemas-in-warehouse/_parquet-recovery-columns.md b/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/_parquet-recovery-columns.md similarity index 100% rename from docs/destinations/warehouses-lakes/schemas-in-warehouse/_parquet-recovery-columns.md rename to docs/api-reference/loaders-storage-targets/schemas-in-warehouse/_parquet-recovery-columns.md diff --git a/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md b/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md similarity index 83% rename from docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md rename to docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md index 9e51b2926..fc2926325 100644 --- a/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md +++ b/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md @@ -1,14 +1,14 @@ --- title: "How schema definitions translate to the warehouse" sidebar_label: "Schemas in the warehouse" -sidebar_position: 4 -description: "A detailed explanation of how Snowplow data is represented in Redshift, Postgres, BigQuery, Snowflake, Databricks and Synapse Analytics" +sidebar_position: 100 +description: "A detailed explanation of how Snowplow data is represented in Redshift, BigQuery, Snowflake, Databricks, Iceberg and Delta Lake" --- ```mdx-code-block import Tabs from '@theme/Tabs'; import TabItem from '@theme/TabItem'; -import ParquetRecoveryColumns from '@site/docs/destinations/warehouses-lakes/schemas-in-warehouse/_parquet-recovery-columns.md'; +import ParquetRecoveryColumns from '@site/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/_parquet-recovery-columns.md'; ``` [Self-describing events](/docs/fundamentals/events/index.md#self-describing-events) and [entities](/docs/fundamentals/entities/index.md) use [schemas](/docs/fundamentals/schemas/index.md) to define which fields should be present, and of what type (e.g. string, number). This page explains what happens to this information in the warehouse. @@ -18,7 +18,7 @@ import ParquetRecoveryColumns from '@site/docs/destinations/warehouses-lakes/sch Where can you find the data carried by a self-describing event or an entity? - + Each type of self-describing event and each type of entity get their own dedicated tables. The name of such a table is composed of the schema vendor, schema name and its major version (more on versioning [later](#versioning)). @@ -167,7 +167,7 @@ For example, suppose you have the following field in the schema: It will be translated into an object with a `lastName` key that points to a value of type `VARIANT`. - + Each type of self-describing event and each type of entity get their own dedicated columns in the `events` table. The name of such a column is composed of the schema vendor, schema name and major schema version (more on versioning [later](#versioning)). @@ -213,53 +213,6 @@ For example, suppose you have the following field in the schema: It will be translated into a field called `last_name` (notice the underscore), of type `STRING`. - - - -Each type of self-describing event and each type of entity get their own dedicated columns in the underlying data lake table. The name of such a column is composed of the schema vendor, schema name and major schema version (more on versioning [later](#versioning)). - -The column name is prefixed by `unstruct_event_` for self-describing events, and by `contexts_` for entities. 
_(In case you were wondering, those are the legacy terms for self-describing events and entities, respectively.)_ - -:::note - -All characters are converted to lowercase and all symbols (like `.`) are replaced with an underscore. - -::: - -Examples: - -| Kind | Schema | Resulting column | -| --------------------- | ------------------------------------------- | -------------------------------------------------- | -| Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `events.unstruct_event_com_example_button_press_1` | -| Entity | `com.example/user/jsonschema/1-0-0` | `events.contexts_com_example_user_1` | - -The column will be formatted as JSON — an object for self-describing events and an array of objects for entities (because an event can have more than one entity attached). - -Inside the JSON object, there will be fields corresponding to the fields in the schema. - -:::note - -The name of each JSON field is the name of the schema field converted to snake case. - -::: - -:::caution - -If an event or entity includes fields not defined in the schema, those fields will not be stored in the data lake, and will not be availble in Synapse. - -::: - -For example, suppose you have the following field in the schema: - -```json -"lastName": { - "type": "string", - "maxLength": 100 -} -``` - -It will be translated into a field called `last_name` (notice the underscore) inside the JSON object. - @@ -300,31 +253,6 @@ Note that this behavior was introduced in RDB Loader 6.0.0. In older versions, b Once the loader creates a column for a given schema version as `NULLABLE` or `NOT NULL`, it will never alter the nullability constraint for that column. For example, if a field is nullable in schema version `1-0-0` and not nullable in version `1-0-1`, the column will remain nullable. (In this example, the Enrich application will still validate data according to the schema, accepting `null` values for `1-0-0` and rejecting them for `1-0-1`.) -::: - - - - -Because the table name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new table: - -| Schema | Resulting table | -| ------------------------------------------- | ---------------------------- | -| `com.example/button_press/jsonschema/1-0-0` | `com_example_button_press_1` | -| `com.example/button_press/jsonschema/1-2-0` | `com_example_button_press_1` | -| `com.example/button_press/jsonschema/2-0-0` | `com_example_button_press_2` | - -When you evolve your schema within the same major version, (non-destructive) changes are applied to the existing table automatically. For example, if you change the `maxLength` of a `string` field, the limit of the `VARCHAR` column would be updated accordingly. - -:::danger Breaking changes - -If you make a breaking schema change (e.g. change a type of a field from a `string` to a `number`) without creating a new major schema version, the loader will not be able to adapt the table to receive new data. Your loading process will halt. - -::: - -:::info Nullability - -Once the loader creates a column for a given schema version as `NULLABLE` or `NOT NULL`, it will never alter the nullability constraint for that column. For example, if a field is nullable in schema version `1-0-0` and not nullable in version `1-0-1`, the column will remain nullable. (In this example, the Enrich application will still validate data according to the schema, accepting `null` values for `1-0-0` and rejecting them for `1-0-1`.) 
- ::: @@ -389,7 +317,7 @@ Also, creating a new major version of the schema (and hence a new column) is the ::: - + Because the column name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new column: @@ -405,27 +333,6 @@ When you evolve your schema within the same major version, (non-destructive) cha -Note that this behavior was introduced in RDB Loader 5.3.0. - -::: - - - - -Because the column name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new column: - -| Schema | Resulting column | -| ------------------------------------------- | ------------------------------------------- | -| `com.example/button_press/jsonschema/1-0-0` | `unstruct_event_com_example_button_press_1` | -| `com.example/button_press/jsonschema/1-2-0` | `unstruct_event_com_example_button_press_1` | -| `com.example/button_press/jsonschema/2-0-0` | `unstruct_event_com_example_button_press_2` | - -When you evolve your schema within the same major version, (non-destructive) changes are applied to the existing column automatically in the underlying data lake. That said, for the purposes of querying the data from Synapse Analytics, all fields are in JSON format, so these internal modifications are invisible — the new fields just appear in the JSON data. - -:::info Breaking changes - - - ::: @@ -438,7 +345,7 @@ How do schema types translate to the database types? ### Nullability - + All non-required schema fields translate to nullable columns. @@ -517,22 +424,17 @@ In this case, the `RECORD` field will be nullable. It does not matter if `"null" All fields are nullable (because they are stored inside the `VARIANT` type). - + All schema fields, including the required ones, translate to nullable fields inside the `STRUCT`. - - - -All fields are nullable (because they are stored inside the JSON-formatted column). - ### Types themselves - + :::note @@ -543,7 +445,7 @@ The row order in this table is important. Type lookup stops after the first matc - + @@ -1253,7 +1155,7 @@ _Values will be quoted as in JSON._ All types are `VARIANT`. - + :::note @@ -1795,12 +1697,5 @@ _Values will be quoted as in JSON._
 <th>Json Schema</th>
-<th>Redshift/Postgres Type</th>
+<th>Redshift Type</th>
-
- - -All types are `NVARCHAR(4000)` when extracted with [`JSON_VALUE`](https://learn.microsoft.com/en-us/sql/t-sql/functions/json-value-transact-sql?view=azure-sqldw-latest#return-value). - -With [`OPENJSON`](https://learn.microsoft.com/en-us/sql/t-sql/functions/openjson-transact-sql?view=azure-sqldw-latest), you can explicitly specify more precise types. -
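+
+To make the mapping above more concrete, here is a sketch of how the page's `lastName` example field could surface at query time, combining the page's column and field examples. It assumes the conventional `atomic.events` table and that the `com.example/button_press` schema carries a `lastName` field; adjust the names to your own setup:
+
+```sql
+-- Sketch only: names follow the page's com.example/button_press and lastName examples.
+-- Snowflake: the event column is an OBJECT, so the key keeps its original casing
+SELECT unstruct_event_com_example_button_press_1:lastName::VARCHAR AS last_name
+FROM atomic.events;
+
+-- Databricks: the event column is a STRUCT and the field name is converted to snake case
+SELECT unstruct_event_com_example_button_press_1.last_name
+FROM atomic.events;
+```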
diff --git a/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/migrating.md b/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/migrating.md index 0e632df01..0d242ecad 100644 --- a/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/migrating.md +++ b/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/migrating.md @@ -24,7 +24,7 @@ The Streaming Loader is fully compatible with the table created and managed by t :::tip -[This page](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) explains how Snowplow data maps to the warehouse in more detail. +[This page](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) explains how Snowplow data maps to the warehouse in more detail. ::: diff --git a/docs/api-reference/loaders-storage-targets/snowplow-postgres-loader/index.md b/docs/api-reference/loaders-storage-targets/snowplow-postgres-loader/index.md index 8bca0aa13..853bb3beb 100644 --- a/docs/api-reference/loaders-storage-targets/snowplow-postgres-loader/index.md +++ b/docs/api-reference/loaders-storage-targets/snowplow-postgres-loader/index.md @@ -18,7 +18,7 @@ The Postgres loader is not recommended for production use, especially with large :::tip Schemas in Postgres -For more information on how events are stored in Postgres, check the [mapping between Snowplow schemas and the corresponding Postgres column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=postgres). +For more information on how events are stored in Postgres, check the [mapping between Snowplow schemas and the corresponding Postgres column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=postgres). ::: diff --git a/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md b/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md index 842aa48b0..ac8accd41 100644 --- a/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md +++ b/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md @@ -27,7 +27,7 @@ We use the name RDB Loader (from "relational database") for a set of application :::tip Schemas in Redshift, Snowflake and Databricks -For more information on how events are stored in the warehouse, check the [mapping between Snowplow schemas and the corresponding warehouse column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md). +For more information on how events are stored in the warehouse, check the [mapping between Snowplow schemas and the corresponding warehouse column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md). ::: diff --git a/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/upgrade-guides/6-0-0-upgrade-guide/index.md b/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/upgrade-guides/6-0-0-upgrade-guide/index.md index 6eacb2084..761ea5a14 100644 --- a/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/upgrade-guides/6-0-0-upgrade-guide/index.md +++ b/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/upgrade-guides/6-0-0-upgrade-guide/index.md @@ -144,7 +144,7 @@ In order to solve this problem, we should patch `1-0-0` with `{ "type": "integer After identifying all the offending schemas, you should patch them to reflect the changes in the warehouse. 
-Schema casting rules could be found [here](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=redshift#types). +Schema casting rules could be found [here](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=redshift#types). #### `$.featureFlags.disableRecovery` configuration diff --git a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/index.md b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/index.md index fe7c27bf3..c32c28027 100644 --- a/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/index.md +++ b/docs/data-product-studio/data-quality/failed-events/exploring-failed-events/index.md @@ -30,7 +30,7 @@ There are two differences compared to regular events. * Likewise, any column containing the JSON for a self-describing event (`unstruct_...`) will be set to `null` if that JSON fails validation. * Finally, for entity columns (`contexts_`), if one entity is invalid, it will be removed from the array of entities. If all entities are invalid, the whole column will be set to `null`. -For more information about the different columns in Snowplow data, see [how Snowplow data is stored in the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md). +For more information about the different columns in Snowplow data, see [how Snowplow data is stored in the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md). **There is an extra column with failure details.** The column is named `contexts_com_snowplowanalytics_snowplow_failure_1`. In most cases, it will also contain the invalid data in some form. See the [next section](#example-failed-event) for an example. diff --git a/docs/data-product-studio/data-structures/version-amend/_breaking.md b/docs/data-product-studio/data-structures/version-amend/_breaking.md index da5c824e1..8e3fc4d68 100644 --- a/docs/data-product-studio/data-structures/version-amend/_breaking.md +++ b/docs/data-product-studio/data-structures/version-amend/_breaking.md @@ -8,6 +8,6 @@ Different data warehouses handle schema evolution slightly differently. Use the :::caution -In Redshift and Databricks, changing _size_ may also mean _type_ change; e.g. changing the `maximum` integer from `30000` to `100000`. See our documentation on [how schemas translate to database types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md). +In Redshift and Databricks, changing _size_ may also mean _type_ change; e.g. changing the `maximum` integer from `30000` to `100000`. See our documentation on [how schemas translate to database types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md). ::: diff --git a/docs/data-product-studio/data-structures/version-amend/amending/index.md b/docs/data-product-studio/data-structures/version-amend/amending/index.md index 44cc4b88b..bac21c974 100644 --- a/docs/data-product-studio/data-structures/version-amend/amending/index.md +++ b/docs/data-product-studio/data-structures/version-amend/amending/index.md @@ -15,7 +15,7 @@ Sometimes, small mistakes creep into your schemas. For example, you might mark a It might be tempting to somehow “overwrite” the schema without updating the version. But this can bring several problems: * Events that were previously valid could become invalid against the new changes. 
-* Your warehouse loader, which updates the table [according to the schema](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md#versioning), could get stuck if it’s not possible to cast the data in the existing table column to the new definition (e.g. if you change a field type from a string to a number). +* Your warehouse loader, which updates the table [according to the schema](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md#versioning), could get stuck if it’s not possible to cast the data in the existing table column to the new definition (e.g. if you change a field type from a string to a number). * Similarly, data models or other applications consuming the data downstream might not be able to deal with the changes. The best approach is to just create a new schema version and update your tracking code to use it. However, there are two alternatives for when it’s not ideal. diff --git a/docs/destinations/warehouses-lakes/_how-loading-works.mdx b/docs/destinations/warehouses-lakes/_how-loading-works.mdx new file mode 100644 index 000000000..1ca75b074 --- /dev/null +++ b/docs/destinations/warehouses-lakes/_how-loading-works.mdx @@ -0,0 +1 @@ +The Snowplow data loading process is engineered for large volumes of data. In addition, our loader applications ensure the best representation of Snowplow events. That includes automatically adjusting the tables to account for your custom data, whether it's new event types or new fields. diff --git a/docs/destinations/warehouses-lakes/_setup-instructions.mdx b/docs/destinations/warehouses-lakes/_setup-instructions.mdx new file mode 100644 index 000000000..ecd3113d8 --- /dev/null +++ b/docs/destinations/warehouses-lakes/_setup-instructions.mdx @@ -0,0 +1,21 @@ +### Step 1: Create a connection + +
+<ol>
+  <li>In Console, navigate to Destinations > Connections</li>
+  <li>Select Set up connection</li>
+  <li>Choose Loader connection, then {props.connectionType}</li>
+  <li>Follow the steps to provide all the necessary values</li>
+  <li>Click Complete setup to create the connection</li>
+</ol>
+
+### Step 2: Create a loader
+
+<ol>
+  <li>In Console, navigate to Destinations > Destination list. Switch to the Available tab and select {props.destinationName}</li>
+  <li>Select a pipeline: choose the pipeline where you want to deploy the loader.</li>
+  <li>Select your connection: choose the connection you configured in step 1.</li>
+  {!props.noFailedEvents && <li>Select the type of events: enriched events or failed events</li>}
+  <li>Click Continue to deploy the loader</li>
+</ol>
+ +You can review active destinations and loaders by navigating to **Destinations** > **Destination list**. diff --git a/docs/destinations/warehouses-lakes/_single-table-format.mdx b/docs/destinations/warehouses-lakes/_single-table-format.mdx new file mode 100644 index 000000000..e7ca6077f --- /dev/null +++ b/docs/destinations/warehouses-lakes/_single-table-format.mdx @@ -0,0 +1,36 @@ +All events are loaded into a single table (`events`). + +There are dedicated columns for [atomic fields](/docs/fundamentals/canonical-event/index.md), such as `app_id`, `user_id` and so on: + +| app_id | collector_tstamp | ... | event_id | ... | user_id | ... | +| ------ | ---------------- | --- | -------- | --- | ------- | --- | +| website | 2025-05-06 12:30:05.123 | ... | c6ef3124-b53a-4b13-a233-0088f79dcbcb | ... | c94f860b-1266-4dad-ae57-3a36a414a521 | ... | + +Snowplow data also includes customizable [self-describing events](/docs/fundamentals/events/index.md#self-describing-events) and [entities](/docs/fundamentals/entities/index.md). These use [schemas](/docs/fundamentals/schemas/index.md) to define which fields should be present, and of what type (e.g. string, number). + +For self-describing events and entities, there are additional columns, like so: + + + + + + + + + + + + + + + + + + +
+<tr><th>app_id</th><th>...</th><th>unstruct_event_com_acme_button_press_1</th><th>contexts_com_acme_product_1</th></tr>
+<tr><td>website</td><td>...</td><td>data for your custom button_press event (as {props.eventType})</td><td>data for your custom product entities (as {props.entitiesType})</td></tr>
+ +Note: +* "unstruct\[ured\] event" and "context" are the legacy terms for self-describing events and entities, respectively +* the `_1` suffix represents the major version of the schema (e.g. `1-x-y`) + +You can learn more [in the API reference section](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md). diff --git a/docs/destinations/warehouses-lakes/bigquery/index.md b/docs/destinations/warehouses-lakes/bigquery/index.md new file mode 100644 index 000000000..d10a3313f --- /dev/null +++ b/docs/destinations/warehouses-lakes/bigquery/index.md @@ -0,0 +1,62 @@ +--- +title: "BigQuery" +sidebar_position: 30 +description: "Send Snowplow data to BigQuery for analytics and data warehousing" +--- + +```mdx-code-block +import SetupInstructions from '../_setup-instructions.mdx'; +import HowLoadingWorks from '../_how-loading-works.mdx'; +import SingleTableFormat from '../_single-table-format.mdx'; +``` + +:::info Cloud availability + +The BigQuery integration is available for Snowplow pipelines running on **AWS**, **Azure** and **GCP**. + +::: + +The Snowplow BigQuery integration allows you to load enriched event data (as well as [failed events](/docs/fundamentals/failed-events/index.md)) directly into your BigQuery datasets for analytics, data modeling, and more. + +## What you will need + +Connecting to a destination always involves configuring cloud resources and granting permissions. It's a good idea to make sure you have sufficient priviliges before you begin the setup process. + +:::tip + +The list below is just a heads up. The Snowplow Console will guide you through the exact steps to set up the integration. + +::: + +Keep in mind that you will need to be able to: + +* Provide your Google Cloud Project ID and region +* Allow-list Snowplow IP addresses +* Specify the desired dataset name +* Create a service account with the `roles/bigquery.dataEditor` permission (more permissions will be required for loading failed events and setting up [Data Quality Dashboard](/docs/data-product-studio/data-quality/failed-events/monitoring-failed-events/index.md#data-quality-dashboard)) + +## Getting started + +You can add a BigQuery destination through the Snowplow Console. (For self-hosted customers, please refer to the [Loader API reference](/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md) instead.) + + + +## How loading works + + + +:::tip + +For more details on the loading flow, see the [BigQuery Loader](/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md) reference page, where you will find additional information and diagrams. + +::: + +## Snowplow data format in BigQuery + +RECORD} entitiesType={REPEATED RECORD}/> + +:::tip + +Check this [guide on querying](/docs/destinations/warehouses-lakes/querying-data/index.md?warehouse=bigquery) Snowplow data. 
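+
+For example, a minimal query against this table might look like the following sketch. The `your_project.your_dataset` location is a placeholder, and `button_label` is an assumed field on the illustrative `com.acme/button_press` schema; substitute the names from your own setup:
+
+```sql
+-- Illustrative sketch: count button presses per label over the last day.
+-- The RECORD column name follows the com.acme example above; button_label is an assumed schema field.
+SELECT
+  unstruct_event_com_acme_button_press_1.button_label AS button_label,
+  COUNT(*) AS presses
+FROM `your_project.your_dataset.events`
+WHERE DATE(collector_tstamp) >= DATE_SUB(CURRENT_DATE(), INTERVAL 1 DAY)
+GROUP BY button_label
+ORDER BY presses DESC
+```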
+ +::: diff --git a/docs/destinations/warehouses-lakes/databricks/index.md b/docs/destinations/warehouses-lakes/databricks/index.md new file mode 100644 index 000000000..f080234bd --- /dev/null +++ b/docs/destinations/warehouses-lakes/databricks/index.md @@ -0,0 +1,137 @@ +--- +title: "Databricks" +sidebar_position: 20 +description: "Send Snowplow data to Databricks for analytics and data processing" +--- + +```mdx-code-block +import SetupInstructions from '../_setup-instructions.mdx'; +import HowLoadingWorks from '../_how-loading-works.mdx'; +import SingleTableFormat from '../_single-table-format.mdx'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +``` + +:::info Cloud availability + +The Databricks integration is available for Snowplow pipelines running on **AWS**, **Azure** and **GCP**. + +::: + +The Snowplow Databricks integration allows you to load enriched event data (as well as [failed events](/docs/fundamentals/failed-events/index.md)) into your Databricks environment for analytics, data modeling, and more. + +Depending on the cloud provider for your Snowplow pipeline, there are different options for this integration: + +| Integration | AWS | Azure | GCP | Failed events support | +| ----------- |:---:|:-----:|:---:|:---------------------:| +| Direct, batch-based ([RDB Loader](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md)) | :white_check_mark: | :x: | :x: | :x: | +| Via Delta Lake ([Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md)) | :x:* | :white_check_mark: | :white_check_mark: | :white_check_mark: | +| _Early release:_ Streaming / Lakeflow ([Streaming Loader](/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md)) | :white_check_mark: | :white_check_mark: | :white_check_mark: | :white_check_mark: | + +_*Delta+Databricks combination is currently not supported for AWS pipelines. The loader uses DynamoDB tables for mutually exclusive writes to S3, a feature of Delta. Databricks, however, does not support this (as of September 2025). This means that it’s not possible to alter the data via Databricks (e.g. to run `OPTIMIZE` or to delete PII)._ + +## What you will need + +Connecting to a destination always involves configuring cloud resources and granting permissions. It's a good idea to make sure you have sufficient priviliges before you begin the setup process. + +:::tip + +The list below is just a heads up. The Snowplow Console will guide you through the exact steps to set up the integration. + +::: + +Keep in mind that you will need to be able to do a few things. + + + + +* Provide a Databricks cluster along with its URL +* Specify the Unity catalog name and schema name +* Create an access token with the following permissions: + * `USE CATALOG` on the catalog + * `USE SCHEMA` and `CREATE TABLE` on the schema + * `CAN USE` on the SQL warehouse + + + + +See [Delta Lake](../delta/index.md). 
+ + + + +* Create an S3 or GCS bucket or ADLS storage container, located in the same cloud and region as your Databricks instance +* Create a storage credential to allow Databricks to access the bucket or container +* Create an external location and a volume within Databricks pointing to the above +* Provide a Databricks SQL warehouse URL, Unity catalog name and schema name +* Create a service principal and grant the following permissions: + * `USE CATALOG` on the catalog + * `USE SCHEMA` and `CREATE TABLE` on the schema + * `READ VOLUME` and `WRITE VOLUME` on the volume + * `CAN USE` on the SQL warehouse (for testing the connection and monitoring, e.g. as part of the [Data Quality Dashboard](/docs/data-product-studio/data-quality/failed-events/monitoring-failed-events/index.md#data-quality-dashboard)) + +Note that Lakeflow features require a Premium Databricks account. You might also need Databricks metastore admin privileges for some of the steps. + + + + +## Getting started + +You can add a Databricks destination through the Snowplow Console. + + + + +(For self-hosted customers, please refer to the [Loader API reference](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) instead.) + + + + + + +Follow the instructions for [Delta Lake](../delta/index.md#getting-started). + +Then create an external table in Databricks pointing to the Delta Lake location. + + + + +(For self-hosted customers, please refer to the [Loader API reference](/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md) instead.) + + + + + + +## How loading works + + + + + + +For more details on the loading flow, see the [RDB Loader](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) reference page, where you will find additional information and diagrams. + + + + +For more details on the loading flow, see the [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) reference page, where you will find additional information and diagrams. + + + + +For more details on the loading flow, see the [Databricks Streaming Loader](/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md) reference page, where you will find additional information and diagrams. + + + + + +## Snowplow data format in Databricks + +STRUCT} entitiesType={<>ARRAY of STRUCT}/> + +:::tip + +Check this [guide on querying](/docs/destinations/warehouses-lakes/querying-data/index.md?warehouse=databricks) Snowplow data. + +::: diff --git a/docs/destinations/warehouses-lakes/delta/index.md b/docs/destinations/warehouses-lakes/delta/index.md new file mode 100644 index 000000000..d5e04c847 --- /dev/null +++ b/docs/destinations/warehouses-lakes/delta/index.md @@ -0,0 +1,110 @@ +--- +title: "Delta Lake" +sidebar_position: 70 +description: "Send Snowplow data to Delta Lake for analytics and data processing" +--- + +```mdx-code-block +import SetupInstructions from '../_setup-instructions.mdx'; +import HowLoadingWorks from '../_how-loading-works.mdx'; +import SingleTableFormat from '../_single-table-format.mdx'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +``` + +:::info Cloud availability + +The Delta Lake integration is available for Snowplow pipelines running on **AWS**, **Azure** and **GCP**. + +::: + +Delta Lake is an open table format for data lake architectures. 
The Snowplow Delta integration allows you to load enriched event data (as well as [failed events](/docs/fundamentals/failed-events/index.md)) into Delta tables in your data lake for analytics, data modeling, and more. + +Data in Delta Lake can be consumed using various tools and products, for example: + +* Amazon Athena +* Apache Spark or Amazon EMR +* Databricks* +* Microsoft Synapse Analytics +* Microsoft Fabric + +_*Delta+Databricks combination is currently not supported for AWS pipelines. The loader uses DynamoDB tables for mutually exclusive writes to S3, a feature of Delta. Databricks, however, does not support this (as of September 2025). This means that it’s not possible to alter the data via Databricks (e.g. to run `OPTIMIZE` or to delete PII)._ + +## What you will need + +Connecting to a destination always involves configuring cloud resources and granting permissions. It's a good idea to make sure you have sufficient priviliges before you begin the setup process. + +:::tip + +The list below is just a heads up. The Snowplow Console will guide you through the exact steps to set up the integration. + +::: + +Keep in mind that you will need to be able to: + + + + +* Provide an S3 bucket +* Create a DynamoDB table (required for file locking) +* Create an IAM role with the following permissions: + * For the S3 bucket: + * `s3:ListBucket` + * `s3:GetObject` + * `s3:PutObject` + * `s3:DeleteObject` + * `s3:ListBucketMultipartUploads` + * `s3:AbortMultipartUpload` + * For the DynamoDB table: + * `dynamodb:DescribeTable` + * `dynamodb:Query` + * `dynamodb:Scan` + * `dynamodb:GetItem` + * `dynamodb:PutItem` + * `dynamodb:UpdateItem` + * `dynamodb:DeleteItem` +* Schedule a regular job to optimize the lake + + + + + +* Provide a GCS bucket +* Create a service account with the `roles/storage.objectUser` role on the bucket +* Create and provide a service account key + + + + + +* Provide an ADLS storage container +* Create a new App Registration with the `Storage Blob Data Contributor` permission +* Provide the registration tenant ID, client ID and client secret + + + + + +## Getting started + +You can add a Delta Lake destination through the Snowplow Console. (For self-hosted customers, please refer to the [Loader API reference](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) instead.) + + + +We recommend scheduling regular [lake maintenance jobs](/docs/api-reference/loaders-storage-targets/lake-loader/maintenance/index.md?lake-format=delta) to ensure the best long-term performance. + +## How loading works + + + +For more details on the loading flow, see the [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) reference page, where you will find additional information and diagrams. + +## Snowplow data format in Delta Lake + +STRUCT} entitiesType={<>ARRAY of STRUCT}/> + +:::tip + +Check this [guide on querying](/docs/destinations/warehouses-lakes/querying-data/index.md?warehouse=databricks) Snowplow data. (You will need a query engine such as Spark SQL or Databricks to query Delta tables.) 
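+
+For example, with a Delta-capable engine such as Spark SQL, a quick sanity check might look like the following sketch (the `s3://your-lake-bucket/events` path is a placeholder for your lake location):
+
+```sql
+-- Illustrative sketch: daily event counts read directly from the Delta table path
+SELECT
+  to_date(collector_tstamp) AS event_date,
+  COUNT(*) AS events
+FROM delta.`s3://your-lake-bucket/events`
+GROUP BY to_date(collector_tstamp)
+ORDER BY event_date
+```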
+ +::: diff --git a/docs/destinations/warehouses-lakes/iceberg/index.md b/docs/destinations/warehouses-lakes/iceberg/index.md new file mode 100644 index 000000000..5b9753ec6 --- /dev/null +++ b/docs/destinations/warehouses-lakes/iceberg/index.md @@ -0,0 +1,80 @@ +--- +title: "Iceberg" +sidebar_position: 60 +description: "Send Snowplow data to Iceberg data lakes for analytics and data processing" +--- + +```mdx-code-block +import SetupInstructions from '../_setup-instructions.mdx'; +import HowLoadingWorks from '../_how-loading-works.mdx'; +import SingleTableFormat from '../_single-table-format.mdx'; +import Tabs from '@theme/Tabs'; +import TabItem from '@theme/TabItem'; +``` + +:::info Cloud availability + +The Iceberg integration is available for Snowplow pipelines running on **AWS** only. + +::: + +Apache Iceberg is an open table format for data lake architectures. The Snowplow Iceberg integration allows you to load enriched event data (as well as [failed events](/docs/fundamentals/failed-events/index.md)) into Iceberg tables in your data lake for analytics, data modeling, and more. + +Iceberg data can be consumed using various tools and products, for example: +* Amazon Athena +* Amazon Redshift Spectrum +* Apache Spark or Amazon EMR +* Snowflake +* ClickHouse + +We currently only support the Glue Iceberg catalog. + +## What you will need + +Connecting to a destination always involves configuring cloud resources and granting permissions. It's a good idea to make sure you have sufficient priviliges before you begin the setup process. + +:::tip + +The list below is just a heads up. The Snowplow Console will guide you through the exact steps to set up the integration. + +::: + +Keep in mind that you will need to be able to: + +* Specify your AWS account ID +* Provide an S3 bucket and an AWS Glue database +* Create an IAM role with the following permissions: + * For the S3 bucket: + * `s3:ListBucket` + * `s3:GetObject` + * `s3:PutObject` + * `s3:DeleteObject` + * For the Glue database: + * `glue:CreateTable` + * `glue:GetTable` + * `glue:UpdateTable` +* Schedule a regular job to optimize the lake + +## Getting started + +You can add an Iceberg destination through the Snowplow Console. (For self-hosted customers, please refer to the [Loader API reference](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) instead.) + + + +We recommend scheduling regular [lake maintenance jobs](/docs/api-reference/loaders-storage-targets/lake-loader/maintenance/index.md?lake-format=iceberg) to ensure the best long-term performance. + +## How loading works + + + +For more details on the loading flow, see the [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) reference page, where you will find additional information and diagrams. + +## Snowplow data format in Iceberg + +STRUCT} entitiesType={<>ARRAY of STRUCT}/> + +:::tip + +Check this [guide on querying](/docs/destinations/warehouses-lakes/querying-data/index.md?warehouse=databricks) Snowplow data. (You will need a query engine such as Spark SQL or Snowflake to query Iceberg tables.) 
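+
+For example, in Amazon Athena (one of the engines listed above), a query against the Glue-cataloged table might look like the following sketch. The `your_glue_db` database is a placeholder, and the `contexts_com_acme_product_1` column and its `name` field are assumed, illustrative schema names:
+
+```sql
+-- Illustrative sketch: unnest a product entity column and list recent products
+SELECT
+  e.event_id,
+  p.name AS product_name
+FROM your_glue_db.events AS e
+CROSS JOIN UNNEST(e.contexts_com_acme_product_1) AS t(p)
+WHERE e.collector_tstamp > current_timestamp - interval '1' day
+LIMIT 10
+```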
+ +::: diff --git a/docs/destinations/warehouses-lakes/index.md b/docs/destinations/warehouses-lakes/index.md index b9c61051a..fba95247f 100644 --- a/docs/destinations/warehouses-lakes/index.md +++ b/docs/destinations/warehouses-lakes/index.md @@ -5,99 +5,17 @@ sidebar_label: "Warehouses and lakes" description: "An overview of the available options for storing Snowplow data in data warehouses and lakes" --- -```mdx-code-block -import Tabs from '@theme/Tabs'; -import TabItem from '@theme/TabItem'; -``` - Data warehouses and data lakes are primary destinations for Snowplow data. For other options, see the [destinations overview](/docs/fundamentals/destinations/index.md) page. -## How loading works - -The Snowplow data loading process is engineered for large volumes of data. In addition, for each data warehouse, our loader applications ensure the best representation of Snowplow events. That includes [automatically adjusting the database types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for [self-describing events](/docs/fundamentals/events/index.md#self-describing-events) and [entities](/docs/fundamentals/entities/index.md) according to their [schemas](/docs/fundamentals/schemas/index.md). - -:::tip - -For more details on the loading flow, pick a destination below and follow the link in the _Loader_ column, where you will find additional information and diagrams. - -::: - -## Data warehouse loaders - -:::note Cloud - -The cloud selection is for where your Snowplow pipeline runs. The warehouse itself can be deployed in any cloud. - -::: - - - - -| Destination | Type | Loader application | Status | -| ---------------------------------------------- | -------------------------------------------- | --------------------------------------------------------------------------------------------------------------- | ---------------------------------- | -| Redshift
_(including Redshift serverless)_ | Batching (recommended)
or micro-batching | [RDB Loader](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) | Production-ready | -| BigQuery | Streaming | [BigQuery Loader](/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md) | Production-ready | -| Snowflake | Streaming | [Snowflake Streaming Loader](/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/index.md) | Production-ready | -| Databricks | Batching (recommended)
or micro-batching | [Snowplow RDB Loader](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) | Production-ready | -| Databricks | Streaming | [Databricks Streaming Loader](/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md) | Early release | - -
- - -| Destination | Type | Loader application | Status | -| ----------- | -------------- | ------------------------------------------------------------------------------------------------------------- | ---------------------------------- | -| BigQuery | Streaming | [BigQuery Loader](/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md) | Production-ready | -| Snowflake | Streaming | [Snowflake Streaming Loader](/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/index.md) | Production-ready | -| Databricks | Micro-batching
_(via a [data lake](#data-lake-loaders))_ | [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) | Production-ready | -| Databricks | Streaming | [Databricks Streaming Loader](/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md) | Early release | - -
- - -| Destination | Type | Loader application | Status | -| ----------------- | ------------------------------------------------------------ | --------------------------------------------------------------------------------------------------------------- | ---------------- | -| BigQuery | Streaming | [BigQuery Loader](/docs/api-reference/loaders-storage-targets/bigquery-loader/index.md) | Production-ready | -| Snowflake | Streaming | [Snowflake Streaming Loader](/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/index.md) | Production-ready | -| Databricks | Micro-batching
_(via a [data lake](#data-lake-loaders))_ | [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) | Production-ready | -| Databricks | Streaming | [Databricks Streaming Loader](/docs/api-reference/loaders-storage-targets/databricks-streaming-loader/index.md) | Early release | -| Synapse Analytics | Micro-batching
_(via a [data lake](#data-lake-loaders))_ | [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) | Production-ready | - -
-
- -## Data lake loaders - -All lake loaders are micro-batching. - - - - -| Lake | Format | Compatibility | Loader application | Status | -| ---- | -------- | ---------------- | ------------------------------------------------------------------------------- | ---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| S3 | Delta | Athena, Databricks | [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) | Production-ready | -| S3 | Iceberg | Athena, Redshift | [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) | Production-ready | -| S3 | TSV/JSON | Athena | [S3 Loader](/docs/api-reference/loaders-storage-targets/s3-loader/index.md) | Only recommended for use with [RDB Batch Transformer](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/transforming-enriched-data/spark-transformer/index.md) or for [raw failed events](/docs/fundamentals/failed-events/index.md) | - -:::tip - -Please note that currently the S3 _Delta_ loader is not compatible with Databricks. The loader uses [DynamoDB tables for mutually exclusive writes to S3](https://docs.delta.io/latest/delta-storage.html#multi-cluster-setup), a feature of Delta. Databricks, however, does not support this (as of July 2025). This means that it’s not possible to alter the data via Databricks (e.g. to run `OPTIMIZE` or to delete PII). - -::: - - - +### Data warehouses -| Lake | Format | Compatibility | Loader application | Status | -| ---- | ------ | ------------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------- | -| GCS | Delta | Databricks | [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) | Production-ready | -| GCS | JSON | BigQuery | [GCS Loader](/docs/api-reference/loaders-storage-targets/google-cloud-storage-loader/index.md) | Only recommended for [raw failed events](/docs/fundamentals/failed-events/index.md) | +* [Snowflake](/docs/destinations/warehouses-lakes/snowflake/index.md) +* [Databricks](/docs/destinations/warehouses-lakes/databricks/index.md) +* [BigQuery](/docs/destinations/warehouses-lakes/bigquery/index.md) +* [Redshift](/docs/destinations/warehouses-lakes/redshift/index.md) - - -| Lake | Format | Compatibility | Loader application | Status | -| --------- | ------ | ------------------------------------- | ------------------------------------------------------------------------------- | ------------- | -| ADLS Gen2 | Delta | Synapse Analytics, Fabric, Databricks | [Lake Loader](/docs/api-reference/loaders-storage-targets/lake-loader/index.md) | Production-ready | +### Data lakes - - +* [Iceberg](/docs/destinations/warehouses-lakes/iceberg/index.md) +* [Delta Lake](/docs/destinations/warehouses-lakes/delta/index.md) diff --git a/docs/destinations/warehouses-lakes/querying-data/index.md b/docs/destinations/warehouses-lakes/querying-data/index.md index bf0a3b62e..3e8ecdee8 100644 --- a/docs/destinations/warehouses-lakes/querying-data/index.md +++ b/docs/destinations/warehouses-lakes/querying-data/index.md @@ -1,7 +1,7 @@ --- title: "Querying Snowplow data" sidebar_label: "Querying data" -sidebar_position: 3 +sidebar_position: 1 description: "An introduction to querying Snowplow data, including self-describing events and 
entities, as well tips for dealing with duplicate events" --- @@ -12,7 +12,7 @@ import TabItem from '@theme/TabItem'; ## Basic queries -You will typically find most of your Snowplow data in the `events` table. If you are using Redshift or Postgres, there will be extra tables for [self-describing events](/docs/fundamentals/events/index.md#self-describing-events) and [entities](/docs/fundamentals/entities/index.md) — see [below](#self-describing-events). +You will typically find most of your Snowplow data in the `events` table. If you are using Redshift, there will be extra tables for [self-describing events](/docs/fundamentals/events/index.md#self-describing-events) and [entities](/docs/fundamentals/entities/index.md) — see [below](#self-describing-events). Please refer to [the structure of Snowplow data](/docs/fundamentals/canonical-event/index.md) for the principles behind our approach, as well as the descriptions of the various standard columns. @@ -48,9 +48,9 @@ This ensures that you read from the minimum number of (micro-)partitions necessa [Self-describing events](/docs/fundamentals/events/index.md#self-describing-events) can contain their own set of fields, defined by their [schema](/docs/fundamentals/schemas/index.md). - + -For Redshift and Postgres users, self-describing events are not part of the standard `events` table. Instead, each type of event is in its own table. The table name and the fields in the table will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +For Redshift users, self-describing events are not part of the standard `events` table. Instead, each type of event is in its own table. The table name and the fields in the table will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. You can query just the table for that particular self-describing event, if that's all that's required for your analysis, or join that table back to the `events` table: @@ -73,7 +73,7 @@ You may need to take care of [duplicate events](#dealing-with-duplicates). -Each type of self-describing event is in a dedicated `RECORD`-type column. The column name and the fields in the record will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +Each type of self-describing event is in a dedicated `RECORD`-type column. The column name and the fields in the record will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. You can query fields in the self-describing event like so: @@ -94,7 +94,7 @@ The [BigQuery Loader upgrade guide](/docs/api-reference/loaders-storage-targets/ -Each type of self-describing event is in a dedicated `OBJECT`-type column. The column name will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +Each type of self-describing event is in a dedicated `OBJECT`-type column. The column name will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. 
You can query fields in the self-describing event like so: @@ -110,7 +110,7 @@ FROM -Each type of self-describing event is in a dedicated `STRUCT`-type column. The column name and the fields in the `STRUCT` will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +Each type of self-describing event is in a dedicated `STRUCT`-type column. The column name and the fields in the `STRUCT` will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. You can query fields in the self-describing event by extracting them like so: @@ -126,7 +126,7 @@ FROM -Each type of self-describing event is in a dedicated column in JSON format. The column name will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +Each type of self-describing event is in a dedicated column in JSON format. The column name will be determined by the event’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. You can query fields in the self-describing event like so: @@ -147,9 +147,9 @@ FROM [Entities](/docs/fundamentals/entities/index.md) (also known as contexts) provide extra information about the event, such as data describing a product or a user. - + -For Redshift and Postgres users, entities are not part of the standard `events` table. Instead, each type of entity is in its own table. The table name and the fields in the table will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +For Redshift users, entities are not part of the standard `events` table. Instead, each type of entity is in its own table. The table name and the fields in the table will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. The entities can be joined back to the core `events` table by the following, which is a one-to-one join (for a single record entity) or a one-to-many join (for a multi-record entity), assuming no duplicates. @@ -172,7 +172,7 @@ You may need to take care of [duplicate events](#dealing-with-duplicates). -Each type of entity is in a dedicated `REPEATED RECORD`-type column. The column name and the fields in the record will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +Each type of entity is in a dedicated `REPEATED RECORD`-type column. The column name and the fields in the record will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. You can query a single entity’s fields by extracting them like so: @@ -204,7 +204,7 @@ Column name produced by previous versions of the BigQuery Loader (<2.0.0) would -Each type of entity is in a dedicated `ARRAY`-type column. The column name will be determined by the entity’s schema. 
See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +Each type of entity is in a dedicated `ARRAY`-type column. The column name will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. You can query a single entity’s fields by extracting them like so: @@ -232,7 +232,7 @@ FROM -Each type of entity is in a dedicated `ARRAY`-type column. The column name and the fields in the `STRUCT` will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +Each type of entity is in a dedicated `ARRAY`-type column. The column name and the fields in the `STRUCT` will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. You can query a single entity’s fields by extracting them like so: @@ -260,7 +260,7 @@ FROM -Each type of entity is in a dedicated column in JSON format. The column name will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. +Each type of entity is in a dedicated column in JSON format. The column name will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. You can query a single entity’s fields by extracting them like so: @@ -299,9 +299,9 @@ In some cases, your data might contain duplicate events (full deduplication _bef While our [data models](/docs/modeling-your-data/modeling-your-data-with-dbt/index.md) deal with duplicates for you, there may be cases where you need to de-duplicate the events table yourself. - + -In Redshift/Postgres you must first generate a `ROW_NUMBER()` on your events and use this to de-duplicate. +In Redshift, you must first generate a `ROW_NUMBER()` on your events and use this to de-duplicate. ```sql WITH unique_events AS ( diff --git a/docs/destinations/warehouses-lakes/redshift/index.md b/docs/destinations/warehouses-lakes/redshift/index.md new file mode 100644 index 000000000..083b7dd4d --- /dev/null +++ b/docs/destinations/warehouses-lakes/redshift/index.md @@ -0,0 +1,94 @@ +--- +title: "Redshift" +sidebar_position: 40 +description: "Send Snowplow data to Amazon Redshift for analytics and data warehousing" +--- + +```mdx-code-block +import SetupInstructions from '../_setup-instructions.mdx'; +import HowLoadingWorks from '../_how-loading-works.mdx'; +``` + +:::info Cloud availability + +The Redshift integration is available for Snowplow pipelines running on **AWS** only. + +::: + +The Snowplow Redshift integration allows you to load enriched event data directly into your Redshift cluster (including Redshift Serverless) for analytics, data modeling, and more. + +## What you will need + +Connecting to a destination always involves configuring cloud resources and granting permissions. It's a good idea to make sure you have sufficient privileges before you begin the setup process. + +:::tip + +The list below is just a heads-up. The Snowplow Console will guide you through the exact steps to set up the integration.
+ +::: + +Keep in mind that you will need to be able to: + +* Provide your Redshift cluster endpoint and connection details +* Allow-list Snowplow IP addresses +* Specify the desired database and schema names +* Create a user and a role with the following permissions: + * Schema ownership (`CREATE SCHEMA ... AUTHORIZATION`) + * `SELECT` on system tables (`svv_table_info`, `svv_interleaved_columns`, `stv_interleaved_counts`) — this is required for maintenance jobs + +## Getting started + +You can add a Redshift destination through the Snowplow Console. (For self-hosted customers, please refer to the [Loader API reference](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) instead.) + + + +## How loading works + + + +For more details on the loading flow, see the [RDB Loader](/docs/api-reference/loaders-storage-targets/snowplow-rdb-loader/index.md) reference page, where you will find additional information and diagrams. + +## Snowplow data format in Redshift + +The event data is split across multiple tables. + +The main table (`events`) contains the [atomic fields](/docs/fundamentals/canonical-event/index.md), such as `app_id`, `user_id` and so on: + +| app_id | collector_tstamp | ... | event_id | ... | user_id | ... | +| ------ | ---------------- | --- | -------- | --- | ------- | --- | +| website | 2025-05-06 12:30:05.123 | ... | c6ef3124-b53a-4b13-a233-0088f79dcbcb | ... | c94f860b-1266-4dad-ae57-3a36a414a521 | ... | + +Snowplow data also includes customizable [self-describing events](/docs/fundamentals/events/index.md#self-describing-events) and [entities](/docs/fundamentals/entities/index.md). These use [schemas](/docs/fundamentals/schemas/index.md) to define which fields should be present, and of what type (e.g. string, number). + +For each type of self-describing event and entity, there are additional tables that can be joined with the main table: + +
+unstruct_event_com_acme_button_press_1 + +| root_id | root_tstamp | button_name | button_color | ... | +| ------- | ----------- | ----------- | ------------ | --- | +| c6ef3124-b53a-4b13-a233-0088f79dcbcb | 2025-05-06 12:30:05.123 | Cancel | red | ... | + +
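As a hedged sketch of the join this table is designed for (assuming the standard `atomic` schema and the illustrative `com.acme` button press table above), `root_id` and `root_tstamp` line up with the parent event's `event_id` and `collector_tstamp`:

```sql
-- Illustrative Redshift join: self-describing event table back to the events table.
-- Table and column names are the hypothetical com.acme examples shown above.
SELECT
    ev.app_id,
    ev.user_id,
    bp.button_name,
    bp.button_color
FROM atomic.events ev
JOIN atomic.unstruct_event_com_acme_button_press_1 bp
    ON bp.root_id = ev.event_id
   AND bp.root_tstamp = ev.collector_tstamp;
```

Joining on both `root_id` and `root_tstamp` is the usual pattern, since the timestamp typically matches the table's sort key; duplicate events may still need handling as described in the querying guide.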
+ +
+contexts_com_acme_product_1 + +| root_id | root_tstamp | name | price | ... | +| ------- | ----------- | ---- |------ | --- | +| c6ef3124-b53a-4b13-a233-0088f79dcbcb | 2025-05-06 12:30:05.123 | Salt | 2.60 | ... | +| c6ef3124-b53a-4b13-a233-0088f79dcbcb | 2025-05-06 12:30:05.123 | Pepper | 3.10 | ... | + +
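And a similar hedged sketch for the entity table above: this is a one-to-many join, so each product attached to an event comes back as its own row (again assuming the `atomic` schema and the hypothetical `com.acme` product entity).

```sql
-- Illustrative Redshift one-to-many join: entity table back to the events table.
-- An event with two product entities (Salt and Pepper above) returns two rows.
SELECT
    ev.event_id,
    pr.name  AS product_name,
    pr.price AS product_price
FROM atomic.events ev
JOIN atomic.contexts_com_acme_product_1 pr
    ON pr.root_id = ev.event_id
   AND pr.root_tstamp = ev.collector_tstamp;
```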
+ +Note: +* "unstruct\[ured\] event" and "context" are the legacy terms for self-describing events and entities, respectively +* the `_1` suffix represents the major version of the schema (e.g. `1-x-y`) + +You can learn more [in the API reference section](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md). + +:::tip + +Check this [guide on querying](/docs/destinations/warehouses-lakes/querying-data/index.md?warehouse=redshift) Snowplow data. + +::: diff --git a/docs/destinations/warehouses-lakes/snowflake/index.md b/docs/destinations/warehouses-lakes/snowflake/index.md new file mode 100644 index 000000000..3af71318c --- /dev/null +++ b/docs/destinations/warehouses-lakes/snowflake/index.md @@ -0,0 +1,62 @@ +--- +title: "Snowflake" +sidebar_position: 10 +description: "Send Snowplow data to Snowflake for analytics and data warehousing" +--- + +```mdx-code-block +import SetupInstructions from '../_setup-instructions.mdx'; +import HowLoadingWorks from '../_how-loading-works.mdx'; +import SingleTableFormat from '../_single-table-format.mdx'; +``` + +:::info Cloud availability + +The Snowflake integration is available for Snowplow pipelines running on **AWS**, **Azure** and **GCP**. + +::: + +The Snowplow Snowflake integration allows you to load enriched event data (as well as [failed events](/docs/fundamentals/failed-events/index.md)) directly into your Snowflake warehouse for analytics, data modeling, and more. + +## What you will need + +Connecting to a destination always involves configuring cloud resources and granting permissions. It's a good idea to make sure you have sufficient privileges before you begin the setup process. + +:::tip + +The list below is just a heads-up. The Snowplow Console will guide you through the exact steps to set up the integration. + +::: + +Keep in mind that you will need to be able to: + +* Provide your Snowflake account locator URL, cloud provider and region +* Allow-list Snowplow IP addresses +* Generate a key pair for key-based authentication +* Specify the desired database and schema names, as well as a warehouse name +* Create a role with the following permissions: + * `USAGE`, `OPERATE` on warehouse (for testing the connection and monitoring, e.g. as part of the [Data Quality Dashboard](/docs/data-product-studio/data-quality/failed-events/monitoring-failed-events/index.md#data-quality-dashboard)) + * `USAGE` on database + * `ALL` privileges on the target schema + +## Getting started + +You can add a Snowflake destination through the Snowplow Console. (For self-hosted customers, please refer to the [Loader API reference](/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/index.md) instead.) + + + +## How loading works + + + +For more details on the loading flow, see the [Snowflake Streaming Loader](/docs/api-reference/loaders-storage-targets/snowflake-streaming-loader/index.md) reference page, where you will find additional information and diagrams. + +## Snowplow data format in Snowflake + +<SingleTableFormat eventsType={<>a VARIANT object</>} entitiesType={<>a VARIANT array</>}/> + +:::tip + +Check this [guide on querying](/docs/destinations/warehouses-lakes/querying-data/index.md?warehouse=snowflake) Snowplow data.
+ +::: diff --git a/docs/fundamentals/canonical-event/index.md b/docs/fundamentals/canonical-event/index.md index 6243ce4c1..857e0e5c2 100644 --- a/docs/fundamentals/canonical-event/index.md +++ b/docs/fundamentals/canonical-event/index.md @@ -359,7 +359,7 @@ For more information on this topic please check out the relevant [Tracking Docum For each type of self-describing event, there will be a dedicated column (or table, in case of Redshift and Postgres) that holds the event-specific fields. -See [querying data](/docs/destinations/warehouses-lakes/querying-data/index.md#self-describing-events) for more details on the structure and how to query it in different warehouses. You might also want to check [how schema definitions translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md). +See [querying data](/docs/destinations/warehouses-lakes/querying-data/index.md#self-describing-events) for more details on the structure and how to query it in different warehouses. You might also want to check [how schema definitions translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md). For more information on this topic please check out the relevant [Tracking Documentation](/docs/events/custom-events/self-describing-events/index.md). @@ -369,7 +369,7 @@ For more information on this topic please check out the relevant [Tracking Docum For each type of entity, there will be a dedicated column (or table, in case of Redshift and Postgres) that holds entity-specific fields. Note that an event can have any number of entities attached, including multiple entities of the same type. For this reason, the data inside the entity columns is an array. -See [querying data](/docs/destinations/warehouses-lakes/querying-data/index.md#entities) for more details on the structure and how to query it in different warehouses. You might also want to check [how schema definitions translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md). +See [querying data](/docs/destinations/warehouses-lakes/querying-data/index.md#entities) for more details on the structure and how to query it in different warehouses. You might also want to check [how schema definitions translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md). For more information on this topic please check out the relevant [Tracking Documentation](/docs/events/custom-events/context-entities/index.md). diff --git a/docs/modeling-your-data/modeling-your-data-with-dbt/package-features/passthrough-fields/index.md b/docs/modeling-your-data/modeling-your-data-with-dbt/package-features/passthrough-fields/index.md index aa9742d91..d2bd3ae01 100644 --- a/docs/modeling-your-data/modeling-your-data-with-dbt/package-features/passthrough-fields/index.md +++ b/docs/modeling-your-data/modeling-your-data-with-dbt/package-features/passthrough-fields/index.md @@ -59,7 +59,7 @@ A more useful case for the SQL block is to extract a specific field from an enti **Step 1. Making fields available in the events table** -For Redshift and Postgres users, entities and self describing events are not part of the standard events table. Instead, each type of entity/sde is in its own table. The table name and the fields in the table will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) for more details. 
+For Redshift and Postgres users, entities and self-describing events are not part of the standard events table. Instead, each type of entity/sde is in its own table. The table name and the fields in the table will be determined by the entity’s schema. See [how schemas translate to the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) for more details. In order for you to use fields from there through passthrough fields, you would need to first make sure that those fields are part of the [events this run](/docs/modeling-your-data/modeling-your-data-with-dbt/package-mechanics/this-run-tables/index.md#events-this-run) table. Any custom entities or self-describing events can be added to this table (which get de-duped by taking the earliest `collector_tstamp` record) by using the `snowplow__entities_or_sdes` variable in our package. See [modeling entities](/docs/modeling-your-data/modeling-your-data-with-dbt/package-features/modeling-entities/index.md) for more information and examples. diff --git a/static/_redirects index 774ac1b80..5d61400ef 100644 --- a/static/_redirects +++ b/static/_redirects @@ -403,3 +403,4 @@ docs/understanding-tracking-design/managing-data-structures-with-data-structures # Removing loading-process in favor of individual loader reference pages /docs/destinations/warehouses-lakes/loading-process/* /docs/destinations/warehouses-lakes/ 301 +/docs/destinations/warehouses-lakes/schemas-in-warehouse/* /docs/api-reference/loaders-storage-targets/schemas-in-warehouse/:splat 301