4 changes: 2 additions & 2 deletions docs/api-reference/elasticsearch/index.md
@@ -31,7 +31,7 @@ The `geo_latitude` and `geo_longitude` fields are combined into a single `g

### Self-describing events

Each [self-describing event](/docs/fundamentals/events/index.md#self-describing-events) gets its own field (same [naming rules](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=snowflake#location) as for Snowflake). For example:
Each [self-describing event](/docs/fundamentals/events/index.md#self-describing-events) gets its own field (same [naming rules](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=snowflake#location) as for Snowflake). For example:

```json
{
@@ -46,7 +46,7 @@ Each [self-describing event](/docs/fundamentals/events/index.md#self-describing-

### Entities

Each [entity](/docs/fundamentals/entities/index.md) type attached to the event gets its own field (same [naming rules](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=snowflake#location) as for Snowflake). The field contains an array with the data for all entities of the given type. For example:
Each [entity](/docs/fundamentals/entities/index.md) type attached to the event gets its own field (same [naming rules](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=snowflake#location) as for Snowflake). The field contains an array with the data for all entities of the given type. For example:

```json
{
@@ -31,7 +31,7 @@ The BigQuery Streaming Loader is an application that loads Snowplow events to Bi

:::tip Schemas in BigQuery

For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=bigquery).
For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=bigquery).

:::

@@ -15,7 +15,7 @@ Under the umbrella of Snowplow BigQuery Loader, we have a family of applications

:::tip Schemas in BigQuery

For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=bigquery).
For more information on how events are stored in BigQuery, check the [mapping between Snowplow schemas and the corresponding BigQuery column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=bigquery).

:::

@@ -110,4 +110,4 @@ If events with incorrectly evolved schemas never arrive, then the recovery colum

:::

You can read more about schema evolution and how recovery columns work [here](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=bigquery#versioning).
You can read more about schema evolution and how recovery columns work [here](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=bigquery#versioning).
@@ -39,7 +39,7 @@ The Databricks Streaming Loader is an application that integrates with a Databri

:::tip Schemas in Databricks

For more information on how events are stored in Databricks, check the [mapping between Snowplow schemas and the corresponding Databricks column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=databricks).
For more information on how events are stored in Databricks, check the [mapping between Snowplow schemas and the corresponding Databricks column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=databricks).

:::

@@ -1,14 +1,14 @@
---
title: "How schema definitions translate to the warehouse"
sidebar_label: "Schemas in the warehouse"
sidebar_position: 4
description: "A detailed explanation of how Snowplow data is represented in Redshift, Postgres, BigQuery, Snowflake, Databricks and Synapse Analytics"
sidebar_position: 100
description: "A detailed explanation of how Snowplow data is represented in Redshift, BigQuery, Snowflake, Databricks, Iceberg and Delta Lake"
---

```mdx-code-block
import Tabs from '@theme/Tabs';
import TabItem from '@theme/TabItem';
import ParquetRecoveryColumns from '@site/docs/destinations/warehouses-lakes/schemas-in-warehouse/_parquet-recovery-columns.md';
import ParquetRecoveryColumns from '@site/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/_parquet-recovery-columns.md';
```

[Self-describing events](/docs/fundamentals/events/index.md#self-describing-events) and [entities](/docs/fundamentals/entities/index.md) use [schemas](/docs/fundamentals/schemas/index.md) to define which fields should be present, and of what type (e.g. string, number). This page explains what happens to this information in the warehouse.
@@ -18,7 +18,7 @@ import ParquetRecoveryColumns from '@site/docs/destinations/warehouses-lakes/sch
Where can you find the data carried by a self-describing event or an entity?

<Tabs groupId="warehouse" queryString>
<TabItem value="redshift/postgres" label="Redshift, Postgres" default>
<TabItem value="redshift" label="Redshift" default>

Each type of self-describing event and each type of entity get their own dedicated tables. The name of such a table is composed of the schema vendor, schema name and its major version (more on versioning [later](#versioning)).
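
For instance, a query against one of these dedicated tables might look like the following sketch. This is illustrative only: the table follows the vendor / name / major-version pattern described above, `button_id` is a made-up schema field, and `root_id` / `root_tstamp` are assumed as the join keys back to `atomic.events`.

```sql
-- Illustrative sketch only: join a dedicated entity table back to atomic.events.
-- The entity table name and the button_id field are assumptions for this example.
SELECT
    e.event_id,
    e.collector_tstamp,
    bp.button_id
FROM atomic.events AS e
JOIN atomic.com_example_button_press_1 AS bp
  ON bp.root_id = e.event_id
 AND bp.root_tstamp = e.collector_tstamp;
```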

@@ -167,7 +167,7 @@ For example, suppose you have the following field in the schema:
It will be translated into an object with a `lastName` key that points to a value of type `VARIANT`.
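
As a rough sketch (the column name is an assumption based on the naming rules above), reading such a field in Snowflake could look like this:

```sql
-- Illustrative sketch only: extract a field from the event's column in Snowflake.
-- The column name is assumed; the lastName key keeps its original casing.
SELECT
    event_id,
    unstruct_event_com_example_button_press_1:lastName::VARCHAR AS last_name
FROM atomic.events
WHERE unstruct_event_com_example_button_press_1 IS NOT NULL;
```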

</TabItem>
<TabItem value="databricks" label="Databricks, Spark SQL">
<TabItem value="databricks" label="Databricks, Iceberg, Delta">

Each type of self-describing event and each type of entity get their own dedicated columns in the `events` table. The name of such a column is composed of the schema vendor, schema name and major schema version (more on versioning [later](#versioning)).

@@ -213,53 +213,6 @@ For example, suppose you have the following field in the schema:

It will be translated into a field called `last_name` (notice the underscore), of type `STRING`.
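
As a rough sketch (the column name is an assumption based on the naming pattern described above), reading that field in Databricks could look like this:

```sql
-- Illustrative sketch only: read a field from the STRUCT column in Databricks.
-- The column name is assumed for this example.
SELECT
    event_id,
    unstruct_event_com_example_button_press_1.last_name AS last_name
FROM events
WHERE unstruct_event_com_example_button_press_1 IS NOT NULL;
```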

</TabItem>
<TabItem value="synapse" label="Synapse Analytics">

Each type of self-describing event and each type of entity get their own dedicated columns in the underlying data lake table. The name of such a column is composed of the schema vendor, schema name and major schema version (more on versioning [later](#versioning)).

The column name is prefixed by `unstruct_event_` for self-describing events, and by `contexts_` for entities. _(In case you were wondering, those are the legacy terms for self-describing events and entities, respectively.)_

:::note

All characters are converted to lowercase and all symbols (like `.`) are replaced with an underscore.

:::

Examples:

| Kind | Schema | Resulting column |
| --------------------- | ------------------------------------------- | -------------------------------------------------- |
| Self-describing event | `com.example/button_press/jsonschema/1-0-0` | `events.unstruct_event_com_example_button_press_1` |
| Entity | `com.example/user/jsonschema/1-0-0` | `events.contexts_com_example_user_1` |

The column will be formatted as JSON — an object for self-describing events and an array of objects for entities (because an event can have more than one entity attached).

Inside the JSON object, there will be fields corresponding to the fields in the schema.

:::note

The name of each JSON field is the name of the schema field converted to snake case.

:::

:::caution

If an event or entity includes fields not defined in the schema, those fields will not be stored in the data lake, and will not be available in Synapse.

:::

For example, suppose you have the following field in the schema:

```json
"lastName": {
"type": "string",
"maxLength": 100
}
```

It will be translated into a field called `last_name` (notice the underscore) inside the JSON object.
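
A minimal sketch of reading that field from Synapse (the table and column names are assumptions for this example):

```sql
-- Illustrative sketch only: extract a JSON field with JSON_VALUE in Synapse.
SELECT
    event_id,
    JSON_VALUE(unstruct_event_com_example_button_press_1, '$.last_name') AS last_name
FROM events;
```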

</TabItem>
</Tabs>

@@ -300,31 +253,6 @@ Note that this behavior was introduced in RDB Loader 6.0.0. In older versions, b

Once the loader creates a column for a given schema version as `NULLABLE` or `NOT NULL`, it will never alter the nullability constraint for that column. For example, if a field is nullable in schema version `1-0-0` and not nullable in version `1-0-1`, the column will remain nullable. (In this example, the Enrich application will still validate data according to the schema, accepting `null` values for `1-0-0` and rejecting them for `1-0-1`.)

:::

</TabItem>
<TabItem value="postgres" label="Postgres">

Because the table name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new table:

| Schema | Resulting table |
| ------------------------------------------- | ---------------------------- |
| `com.example/button_press/jsonschema/1-0-0` | `com_example_button_press_1` |
| `com.example/button_press/jsonschema/1-2-0` | `com_example_button_press_1` |
| `com.example/button_press/jsonschema/2-0-0` | `com_example_button_press_2` |

When you evolve your schema within the same major version, (non-destructive) changes are applied to the existing table automatically. For example, if you change the `maxLength` of a `string` field, the limit of the `VARCHAR` column would be updated accordingly.

:::danger Breaking changes

If you make a breaking schema change (e.g. change the type of a field from a `string` to a `number`) without creating a new major schema version, the loader will not be able to adapt the table to receive new data. Your loading process will halt.

:::

:::info Nullability

Once the loader creates a column for a given schema version as `NULLABLE` or `NOT NULL`, it will never alter the nullability constraint for that column. For example, if a field is nullable in schema version `1-0-0` and not nullable in version `1-0-1`, the column will remain nullable. (In this example, the Enrich application will still validate data according to the schema, accepting `null` values for `1-0-0` and rejecting them for `1-0-1`.)

:::

</TabItem>
@@ -389,7 +317,7 @@ Also, creating a new major version of the schema (and hence a new column) is the
:::

</TabItem>
<TabItem value="databricks" label="Databricks, Spark SQL">
<TabItem value="databricks" label="Databricks, Iceberg, Delta">

Because the column name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new column:

@@ -405,27 +333,6 @@ When you evolve your schema within the same major version, (non-destructive) cha

<ParquetRecoveryColumns/>

Note that this behavior was introduced in RDB Loader 5.3.0.

:::

</TabItem>
<TabItem value="synapse" label="Synapse Analytics">

Because the column name for the self-describing event or entity includes the major schema version, each major version of a schema gets a new column:

| Schema | Resulting column |
| ------------------------------------------- | ------------------------------------------- |
| `com.example/button_press/jsonschema/1-0-0` | `unstruct_event_com_example_button_press_1` |
| `com.example/button_press/jsonschema/1-2-0` | `unstruct_event_com_example_button_press_1` |
| `com.example/button_press/jsonschema/2-0-0` | `unstruct_event_com_example_button_press_2` |

When you evolve your schema within the same major version, (non-destructive) changes are applied to the existing column automatically in the underlying data lake. That said, for the purposes of querying the data from Synapse Analytics, all fields are in JSON format, so these internal modifications are invisible — the new fields just appear in the JSON data.

:::info Breaking changes

<ParquetRecoveryColumns/>

:::

</TabItem>
@@ -438,7 +345,7 @@ How do schema types translate to the database types?
### Nullability

<Tabs groupId="warehouse" queryString>
<TabItem value="redshift" label="Redshift, Postgres" default>
<TabItem value="redshift" label="Redshift" default>

All non-required schema fields translate to nullable columns.

@@ -517,22 +424,17 @@ In this case, the `RECORD` field will be nullable. It does not matter if `"null"
All fields are nullable (because they are stored inside the `VARIANT` type).

</TabItem>
<TabItem value="databricks" label="Databricks, Spark SQL">
<TabItem value="databricks" label="Databricks, Iceberg, Delta">

All schema fields, including the required ones, translate to nullable fields inside the `STRUCT`.

</TabItem>
<TabItem value="synapse" label="Synapse Analytics">

All fields are nullable (because they are stored inside the JSON-formatted column).

</TabItem>
</Tabs>

### Types themselves

<Tabs groupId="warehouse" queryString>
<TabItem value="redshift" label="Redshift, Postgres" default>
<TabItem value="redshift" label="Redshift" default>

:::note

@@ -543,7 +445,7 @@ The row order in this table is important. Type lookup stops after the first matc
<table>
<thead>
<td>Json Schema</td>
<td>Redshift/Postgres Type</td>
<td>Redshift Type</td>
</thead>
<tbody>
<tr>
@@ -1253,7 +1155,7 @@ _Values will be quoted as in JSON._
All types are `VARIANT`.

</TabItem>
<TabItem value="databricks" label="Databricks, Spark SQL">
<TabItem value="databricks" label="Databricks, Iceberg, Delta">

:::note

@@ -1795,12 +1697,5 @@ _Values will be quoted as in JSON._
</tr>
</tbody>
</table>
</TabItem>
<TabItem value="synapse" label="Synapse Analytics">

All types are `NVARCHAR(4000)` when extracted with [`JSON_VALUE`](https://learn.microsoft.com/en-us/sql/t-sql/functions/json-value-transact-sql?view=azure-sqldw-latest#return-value).

With [`OPENJSON`](https://learn.microsoft.com/en-us/sql/t-sql/functions/openjson-transact-sql?view=azure-sqldw-latest), you can explicitly specify more precise types.
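
For example, a minimal sketch along these lines (the table, column and field names are assumptions) casts two fields to explicit types:

```sql
-- Illustrative sketch only: use OPENJSON with a WITH clause to get typed fields
-- instead of the default NVARCHAR(4000) returned by JSON_VALUE.
SELECT
    e.event_id,
    bp.last_name,
    bp.press_count
FROM events AS e
CROSS APPLY OPENJSON(e.unstruct_event_com_example_button_press_1)
WITH (
    last_name   NVARCHAR(100) '$.last_name',
    press_count INT           '$.press_count'
) AS bp;
```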

</TabItem>
</Tabs>
@@ -24,7 +24,7 @@ The Streaming Loader is fully compatible with the table created and managed by t

:::tip

[This page](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md) explains how Snowplow data maps to the warehouse in more detail.
[This page](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md) explains how Snowplow data maps to the warehouse in more detail.

:::

@@ -18,7 +18,7 @@ The Postgres loader is not recommended for production use, especially with large

:::tip Schemas in Postgres

For more information on how events are stored in Postgres, check the [mapping between Snowplow schemas and the corresponding Postgres column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=postgres).
For more information on how events are stored in Postgres, check the [mapping between Snowplow schemas and the corresponding Postgres column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=postgres).

:::

@@ -27,7 +27,7 @@ We use the name RDB Loader (from "relational database") for a set of application

:::tip Schemas in Redshift, Snowflake and Databricks

For more information on how events are stored in the warehouse, check the [mapping between Snowplow schemas and the corresponding warehouse column types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md).
For more information on how events are stored in the warehouse, check the [mapping between Snowplow schemas and the corresponding warehouse column types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md).

:::

@@ -144,7 +144,7 @@ In order to solve this problem, we should patch `1-0-0` with `{ "type": "integer

After identifying all the offending schemas, you should patch them to reflect the changes in the warehouse.

Schema casting rules can be found [here](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md?warehouse=redshift#types).
Schema casting rules can be found [here](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md?warehouse=redshift#types).

#### `$.featureFlags.disableRecovery` configuration

@@ -30,7 +30,7 @@ There are two differences compared to regular events.
* Likewise, any column containing the JSON for a self-describing event (`unstruct_...`) will be set to `null` if that JSON fails validation.
* Finally, for entity columns (`contexts_`), if one entity is invalid, it will be removed from the array of entities. If all entities are invalid, the whole column will be set to `null`.

For more information about the different columns in Snowplow data, see [how Snowplow data is stored in the warehouse](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md).
For more information about the different columns in Snowplow data, see [how Snowplow data is stored in the warehouse](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md).

**There is an extra column with failure details.** The column is named `contexts_com_snowplowanalytics_snowplow_failure_1`. In most cases, it will also contain the invalid data in some form. See the [next section](#example-failed-event) for an example.
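
As a quick way to spot such events, a sketch along these lines can work (Snowflake syntax; the table name and the location of failed events depend on your setup and are assumptions here):

```sql
-- Illustrative sketch only: list recent events that carry the failure entity.
SELECT
    event_id,
    collector_tstamp,
    contexts_com_snowplowanalytics_snowplow_failure_1
FROM atomic.events
WHERE contexts_com_snowplowanalytics_snowplow_failure_1 IS NOT NULL
  AND collector_tstamp > DATEADD('day', -1, CURRENT_TIMESTAMP());
```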

@@ -8,6 +8,6 @@ Different data warehouses handle schema evolution slightly differently. Use the

:::caution

In Redshift and Databricks, changing _size_ may also mean _type_ change; e.g. changing the `maximum` integer from `30000` to `100000`. See our documentation on [how schemas translate to database types](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md).
In Redshift and Databricks, changing _size_ may also mean _type_ change; e.g. changing the `maximum` integer from `30000` to `100000`. See our documentation on [how schemas translate to database types](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md).

:::
@@ -15,7 +15,7 @@ Sometimes, small mistakes creep into your schemas. For example, you might mark a

It might be tempting to somehow “overwrite” the schema without updating the version. But this can bring several problems:
* Events that were previously valid could become invalid against the new changes.
* Your warehouse loader, which updates the table [according to the schema](/docs/destinations/warehouses-lakes/schemas-in-warehouse/index.md#versioning), could get stuck if it’s not possible to cast the data in the existing table column to the new definition (e.g. if you change a field type from a string to a number).
* Your warehouse loader, which updates the table [according to the schema](/docs/api-reference/loaders-storage-targets/schemas-in-warehouse/index.md#versioning), could get stuck if it’s not possible to cast the data in the existing table column to the new definition (e.g. if you change a field type from a string to a number).
* Similarly, data models or other applications consuming the data downstream might not be able to deal with the changes.

The best approach is to just create a new schema version and update your tracking code to use it. However, there are two alternatives for when it’s not ideal.
1 change: 1 addition & 0 deletions docs/destinations/warehouses-lakes/_how-loading-works.mdx
@@ -0,0 +1 @@
The Snowplow data loading process is engineered for large volumes of data. In addition, our loader applications ensure the best representation of Snowplow events. That includes automatically adjusting the tables to account for your custom data, whether it's new event types or new fields.