Updating formatting/Vale edits in Data Lakes docs #3074

Merged on Jun 21, 2022 (6 commits)
14 changes: 7 additions & 7 deletions src/connections/storage/data-lakes/comparison.md
@@ -9,7 +9,7 @@ As Segment builds new data storage products, each product evolves from prior pro
Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you to identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.


- ## Data Freshness
+ ## Data freshness

Data Lakes and Warehouses offer different sync frequencies:
- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.
@@ -21,7 +21,7 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat

[Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system to further reduce the volume of duplicates to ensure clean data in your Warehouses and Data Lakes.
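
Deduplication in both systems is keyed on the event's `messageId`. The following is a minimal sketch of that idea, assuming a simple in-memory window; it is illustrative, not Segment's actual pipeline code:

```python
# Hypothetical sketch of messageId-based deduplication; Segment's real
# system is a distributed pipeline, not an in-memory set.
def deduplicate(events):
    """Yield each event once, keyed on its messageId."""
    seen = set()
    for event in events:
        message_id = event.get("messageId")
        if message_id in seen:
            continue  # a retried delivery; drop the duplicate
        seen.add(message_id)
        yield event

batch = [
    {"messageId": "m-1", "event": "Button Clicked"},
    {"messageId": "m-1", "event": "Button Clicked"},  # duplicate delivery
    {"messageId": "m-2", "event": "Page Viewed"},
]
print(list(deduplicate(batch)))  # "m-1" appears only once
```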

- ## Object vs Event Data
+ ## Object vs event data

Warehouses support both event and object data, while Data Lakes supports only event data.

@@ -73,7 +73,7 @@ See the table below for information about the [source](/docs/connections/sources

## Schema

- ### Data Types
+ ### Data types

Warehouses and Data Lakes both infer data types for the events each receives. Since events are received by Warehouses one by one, Warehouses look at the first event received every hour to infer the data type for subsequent events. Data Lakes uses a similar approach; however, because it receives data every hour, it can look at a group of events to infer the data type.
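
To make the difference concrete, here is a hedged sketch of type inference (assumed logic, not Segment's code): inferring from a whole batch can widen a numeric type, while inferring from only the first event cannot.

```python
# Hypothetical sketch: batch-based inference widens int to decimal when values disagree.
def infer_type(values):
    types = {("integer" if isinstance(v, int) else
              "decimal" if isinstance(v, float) else
              "string") for v in values}
    if types == {"integer"}:
        return "integer"
    if types <= {"integer", "decimal"}:
        return "decimal"  # mixed numeric values widen to decimal
    return "string"

prices = [10, 10.5, 12]          # values for one property across an hour of events
print(infer_type(prices))        # batch view (Data Lakes-style): 'decimal'
print(infer_type(prices[:1]))    # first-event view (Warehouses-style): 'integer'
```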

@@ -84,15 +84,15 @@ This approach leads to a few scenarios where the data type for an event may be d

Variance in data types between Warehouses and Data Lakes doesn't happen often for booleans, strings, and timestamps; however, it can occur for decimals and integers.

- If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best-effort conversion to cast the fields to the target data type. Fields that cannot be cast may be dropped. [Contact us](https://segment.com/contact) if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
+ If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best-effort conversion to cast the fields to the target data type. Fields that cannot be cast may be dropped. [Contact Segment Support](https://segment.com/contact){:target="_blank"} if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
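
A rough sketch of what a best-effort cast with a drop fallback could look like (illustrative only, not the actual conversion logic):

```python
# Hypothetical best-effort cast: coerce to the target type, drop the field on failure.
def cast_field(value, target):
    try:
        if target == "integer":
            return int(value)
        if target == "decimal":
            return float(value)
        return str(value)
    except (TypeError, ValueError):
        return None  # cannot be cast; caller drops the field

print(cast_field("42", "integer"))   # 42
print(cast_field("n/a", "decimal"))  # None -> field dropped
```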


### Tables

Tables between Warehouses and Data Lakes will be the same, except in the following cases:

- - `tracks` - Warehouses provide one table per specific event (`track_button_clicked`) in addition to a summary table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table. Learn more about the `tracks` table [here](/docs/connections/storage/warehouses/schema/).
- - `users` - Both Warehouses and Data Lakes create an `identifies` table (as seen [here](/docs/connections/storage/warehouses/schema/)), however Warehouses also create a `users` table just for user data. Data Lakes does not create this, since it does not support object data. The `users` table is a materialized view of users in a source, constructed by data inferred about users from the identify calls.
+ - `tracks` - Warehouses provide one table per specific event (`track_button_clicked`) in addition to a summary table listing all `track` method calls. Data Lakes also creates one table per specific event, but does not provide a summary table. Learn more about the `tracks` table [in the Warehouses schema docs](/docs/connections/storage/warehouses/schema/).
+ - `users` - Both Warehouses and Data Lakes create an `identifies` table (as seen [in the Warehouses schema docs](/docs/connections/storage/warehouses/schema/)), however Warehouses also create a `users` table just for user data. Data Lakes does not create this, since it does not support object data. The `users` table is a materialized view of users in a source, constructed by data inferred about users from the identify calls.
- `accounts` - Group calls generate the `accounts` table in Warehouses. However, because Data Lakes does not support object data (Groups are objects, not events), there is no `accounts` table in Data Lakes.
- *(Redshift only)* **Table names which begin with numbers** - Table names are not allowed to begin with numbers in the Redshift Warehouse, so they are automatically given an underscore (`_`) prefix. Glue Data Catalog does not have this restriction, so Data Lakes doesn't assign this prefix. For example, in Redshift a table may be named `_101_account_update`, while in Data Lakes it would be named `101_account_update`. While this nuance is specific to Redshift, other warehouses may show similar behavior for other reserved words.
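
As an illustration of the Redshift-only naming rule above (a sketch, not Segment's code):

```python
# Hypothetical sketch of the Redshift-only table-name rule described above.
def redshift_table_name(name):
    return f"_{name}" if name[0].isdigit() else name

print(redshift_table_name("101_account_update"))  # '_101_account_update' (Redshift)
print(redshift_table_name("button_clicked"))      # 'button_clicked' (unchanged)
```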

@@ -105,4 +105,4 @@ Similar to tables, columns between Warehouses and Data Lakes will be the same, e
- `channel`, `metadata_*`, `project_id`, `type`, `version` - These columns are Segment-internal data which are not found in Warehouses, but are found in Data Lakes. Warehouses is intentionally very detailed about its transformation logic and does not include these. Data Lakes does include them due to its more straightforward approach of flattening the whole event.
- (Redshift only) `uuid`, `uuid_ts` - Redshift customers will see columns for `uuid` and `uuid_ts`, which are used for de-duplication in Redshift; other warehouses may have similar columns. These aren't relevant for Data Lakes, so the columns won't appear there.
- `sent_at` - Warehouses computes the `sent_at` value based on timestamps found in the original event in order to account for clock skew and timestamps in the future. This was done when the Segment pipeline didn't account for this on its own; the pipeline now does, so Data Lakes doesn't need any additional computation and sends the value as-is, as computed at ingestion.
- - `integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [here](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
+ - `integrations` - Warehouses does not include the integrations object. Data Lakes flattens and includes the integrations object. You can read more about the `integrations` object [in the filtering data documentation](/docs/guides/filtering-data/#filtering-with-the-integrations-object).
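
To picture the flattening, here is a small sketch (the exact column-naming scheme is an assumption; see the linked docs for the real rules):

```python
# Hypothetical sketch: flattening a nested integrations object into columns.
def flatten(obj, prefix="integrations"):
    columns = {}
    for key, value in obj.items():
        name = f"{prefix}_{key.lower().replace(' ', '_')}"
        if isinstance(value, dict):
            columns.update(flatten(value, name))
        else:
            columns[name] = value
    return columns

print(flatten({"All": True, "Google Analytics": False}))
# {'integrations_all': True, 'integrations_google_analytics': False}
```
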
15 changes: 7 additions & 8 deletions src/connections/storage/data-lakes/data-lakes-manual-setup.md
@@ -87,11 +87,11 @@ Segment requires access to an EMR cluster to perform necessary data processing.

The following steps provide examples of the IAM Role and IAM Policy.

- ### IAM Role
+ ### IAM role

Create a `segment-data-lake-role` for Segment to assume. The trust relationship document you attach to the role will be different depending on your workspace region.

- #### IAM Role for Data Lakes created in US workspaces:
+ #### IAM role for Data Lakes created in US workspaces:

Attach the following trust relationship document to the role to create a `segment-data-lake-role` role for Segment:
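
The policy body itself is collapsed in this diff view. As a sketch of the general shape only (the account ID and workspace ID below are placeholders, not Segment's actual values), a trust relationship with an `sts:ExternalId` condition looks like:

```python
import json

# Sketch of a trust relationship document; both IDs below are placeholders.
# The actual Segment principal ARN appears in the full Data Lakes docs.
trust_relationship = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::SEGMENT_ACCOUNT_ID:root"},
            "Action": "sts:AssumeRole",
            "Condition": {
                "StringEquals": {"sts:ExternalId": ["YOUR_WORKSPACE_ID"]}
            },
        }
    ],
}
print(json.dumps(trust_relationship, indent=2))
```
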

Expand Down Expand Up @@ -125,7 +125,7 @@ Attach the following trust relationship document to the role to create a `segmen
> note ""
> Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

- #### IAM Role for Data Lakes created in EU workspaces:
+ #### IAM role for Data Lakes created in EU workspaces:

> info ""
> EU workspaces are currently in beta. If you would like to learn more about the beta, please contact your account manager.
@@ -160,7 +160,7 @@ Attach the following trust relationship document to the role to create a `segmen
> note ""
> **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

- ### IAM Policy
+ ### IAM policy

Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.
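
If you script the setup with boto3, attaching the policy might look like the sketch below (the policy name is an assumption, and the policy document shown is a placeholder for the full Glue/EMR/S3 policy on this page):

```python
import json

import boto3  # assumes AWS credentials are already configured

iam = boto3.client("iam")

# Placeholder statement; substitute the full Glue/EMR/S3 policy from the docs.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket"],
            "Resource": ["arn:aws:s3:::YOUR_DATA_LAKE_BUCKET"],
        }
    ],
}

iam.put_role_policy(
    RoleName="segment-data-lake-role",
    PolicyName="segment-data-lake-policy",  # assumed name
    PolicyDocument=json.dumps(policy_document),
)
```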

@@ -255,11 +255,10 @@ Add a policy to the role created above to give Segment access to the relevant Gl
Segment requires access to the data and schema for debugging data quality issues. The modes available for debugging are:
- Access the individual objects stored in S3 and the associated schema to understand data discrepancies
- Run an Athena query on the underlying data stored in S3 (see the boto3 sketch after this list)
- - Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade.
- ![Debugging](images/dl_setup_glueerror.png)
+ - Ensure Athena uses Glue as the data catalog. Older accounts may not have this configuration, and may require some additional steps to complete the upgrade. The Glue console typically displays a warning and provides a link to instructions on how to complete the upgrade. The warning reads: <br/> **Upgrade to the AWS Glue Data Catalog** <br/> To use the AWS Glue Data Catalog with Amazon Athena and Amazon Redshift Spectrum, you must upgrade your Athena Data Catalog to the AWS Glue Data Catalog. Without the upgrade, tables and partitions created by AWS Glue cannot be queried with Amazon Athena or Redshift Spectrum. Start the upgrade in the [Athena console](https://console.aws.amazon.com/athena/){:target="_blank"}.
- An easier alternative is to create a new account that has Athena backed by Glue as the default.
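
For the Athena debugging mode mentioned above, a minimal boto3 sketch (the database, table, and output bucket names are assumptions):

```python
import boto3

athena = boto3.client("athena")

# Assumed names: a Glue database and table created by Data Lakes,
# plus an S3 bucket for Athena query results.
response = athena.start_query_execution(
    QueryString="SELECT * FROM button_clicked LIMIT 10",
    QueryExecutionContext={"Database": "my_segment_source"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
print(response["QueryExecutionId"])
```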

- ## Updating EMR Clusters
+ ## Updating EMR clusters
You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured.

When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
@@ -273,7 +272,7 @@ When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Forma

## Procedure
1. Open your Segment app workspace and select the Data Lakes destination.
- 2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html). You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
+ 2. On the Settings tab, select the EMR Cluster ID field and replace the existing ID with the ID of your v5.33.0 EMR cluster. For help finding the cluster ID in AWS, see Amazon's [View cluster status and details](https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-manage-view-clusters.html){:target="_blank"}. You don't need to update the Glue Catalog ID, IAM Role ARN, or S3 Bucket name fields.
3. Click **Save**.
4. In the AWS EMR console, view the Events tab for your cluster to verify it is receiving data (or check the cluster state programmatically, as in the sketch below).
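
A hedged boto3 sketch for verifying the cluster (the cluster ID is a placeholder):

```python
import boto3

emr = boto3.client("emr")

# Placeholder cluster ID; copy the real one from the EMR console.
cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")
print(cluster["Cluster"]["Status"]["State"])  # for example, WAITING or RUNNING
```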
