From 7a5bf3a9bab03ff6fee0f1b50ccbc6fd52b8e3bb Mon Sep 17 00:00:00 2001
From: forstisabella <92472883+forstisabella@users.noreply.github.com>
Date: Thu, 16 Jun 2022 12:46:03 -0400
Subject: [PATCH 1/5] Updating data lakes pages with consistent FAQ format, Vale updates

---
 .../storage/data-lakes/comparison.md          |  8 +--
 .../data-lakes/data-lakes-manual-setup.md     | 10 ++--
 src/connections/storage/data-lakes/index.md   | 56 ++++++++++++-------
 .../storage/data-lakes/sync-history.md        | 21 ++++---
 .../storage/data-lakes/sync-reports.md        | 55 +++++++++++++-----
 5 files changed, 99 insertions(+), 51 deletions(-)

diff --git a/src/connections/storage/data-lakes/comparison.md b/src/connections/storage/data-lakes/comparison.md
index 9c22bf64c6..c5fb19bf53 100644
--- a/src/connections/storage/data-lakes/comparison.md
+++ b/src/connections/storage/data-lakes/comparison.md
@@ -9,7 +9,7 @@ As Segment builds new data storage products, each product evolves from prior pro

Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.

## Data freshness

Data Lakes and Warehouses offer different sync frequencies:
- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.

@@ -21,7 +21,7 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat

[Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system that further reduces the volume of duplicates, ensuring clean data in your Warehouses and Data Lakes.

## Object vs event data

Warehouses support both event and object data, while Data Lakes supports only event data.

@@ -73,7 +73,7 @@ See the table below for information about the [source](/docs/connections/sour

## Schema

### Data types

Warehouses and Data Lakes both infer data types for the events each receives. Because Warehouses receive events one by one, they look at the first event received each hour to infer the data type for subsequent events. Data Lakes uses a similar approach; however, because it receives data every hour, it can look at a group of events to infer the data type.

@@ -84,7 +84,7 @@ This approach leads to a few scenarios where the data type for an event may be d

Variance in data types between Warehouses and Data Lakes doesn't happen often for booleans, strings, and timestamps, but it can occur for decimals and integers.

If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best-effort conversion to cast the fields to the target data type. Fields that cannot be cast may be dropped. [Contact Segment Support](https://segment.com/contact){:target="_blank"} if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
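To make the difference concrete, here is a minimal sketch of the two inference strategies described above. The function names and the widening logic are illustrative assumptions, not Segment's actual implementation:

```python
# Hypothetical illustration of first-event inference (Warehouses) versus
# batch inference (Data Lakes). All names and logic here are assumptions
# chosen only to show why the inferred types can diverge.

def infer_type(value):
    """Map a single value to a coarse column type."""
    if isinstance(value, bool):   # bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "decimal"
    return "string"

def warehouse_inference(events):
    # Warehouses receive events one by one, so the first event
    # seen in the hour decides the column type.
    return infer_type(events[0])

def data_lake_inference(events):
    # Data Lakes receives the hour's data as a group, so it can widen
    # integer to decimal when the batch contains both. (Other mixes are
    # out of scope for this sketch.)
    types = {infer_type(e) for e in events}
    return "decimal" if types == {"integer", "decimal"} else types.pop()

prices = [10, 15, 19.99]  # an integer-looking field that is really a decimal
print(warehouse_inference(prices))  # integer -- the first event wins
print(data_lake_inference(prices))  # decimal -- the batch reveals the float
```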
### Tables

diff --git a/src/connections/storage/data-lakes/data-lakes-manual-setup.md b/src/connections/storage/data-lakes/data-lakes-manual-setup.md
index 63a616dc0f..476980c648 100644
--- a/src/connections/storage/data-lakes/data-lakes-manual-setup.md
+++ b/src/connections/storage/data-lakes/data-lakes-manual-setup.md
@@ -87,11 +87,11 @@ Segment requires access to an EMR cluster to perform necessary data processing.

The following steps provide examples of the IAM role and IAM policy.

### IAM role

Create a `segment-data-lake-role` for Segment to assume. The trust relationship document you attach to the role differs depending on your workspace region.

#### IAM role for Data Lakes created in US workspaces:

Attach the following trust relationship document to the role to create a `segment-data-lake-role` role for Segment:

@@ -125,7 +125,7 @@ Attach the following trust relationship document to the role to create a `segmen

> note ""
> Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

#### IAM role for Data Lakes created in EU workspaces:

> info ""
> EU workspaces are currently in beta. To learn more about the beta, contact your account manager.

@@ -160,7 +160,7 @@ Attach the following trust relationship document to the role to create a `segmen

> note ""
> **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

### IAM policy

Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.

@@ -259,7 +259,7 @@ Segment requires access to the data and schema for debugging data quality issues

![Debugging](images/dl_setup_glueerror.png)
- An easier alternative is to create a new account that has Athena backed by Glue as the default.

## Updating EMR clusters

You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured. When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
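Before associating the new cluster with your Data Lake, you may want to confirm which release it's running. The following boto3 sketch is an optional convenience, not part of the documented setup; the cluster ID and region are placeholders to replace with your own:

```python
# A small boto3 sketch (assumed cluster ID and region) to confirm that the
# cluster you plan to associate with the Data Lake runs the expected release.
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region

cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]
release = cluster["ReleaseLabel"]  # for example, "emr-5.33.0"

print(f"{cluster['Name']}: {release}")
assert release == "emr-5.33.0", f"expected emr-5.33.0, found {release}"
```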
diff --git a/src/connections/storage/data-lakes/index.md b/src/connections/storage/data-lakes/index.md
index 965a73aafb..946574f787 100644
--- a/src/connections/storage/data-lakes/index.md
+++ b/src/connections/storage/data-lakes/index.md
@@ -10,18 +10,18 @@ Segment Data Lakes sends Segment data to a cloud data store (for example AWS S3)

> info ""
> Segment Data Lakes is available to Business tier customers only.

To learn more, check out the Segment blog post, [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.

## How Segment Data Lakes work

Data Lakes stores Segment data in S3 in a read-optimized encoding format (Parquet), which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, and EMR, or machine learning vendors like Databricks and DataRobot.

![A diagram showing data flowing from Segment, through Parquet and S3, into Glue, and then into your Data Lake](images/dl_overview2.png)

Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account using an assumed role. Customers using Data Lakes own and pay AWS directly for these AWS services.

![A diagram visualizing data flowing from a Segment user into your account and into a Glue catalog/S3 bucket](images/dl_vpc.png)

Data Lakes offers 12 syncs in a 24-hour period and doesn't offer a custom sync schedule or selective sync.

@@ -44,7 +44,7 @@ For detailed instructions on how to configure Segment Data Lakes, see the [Data

Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster running at all times; however, the cluster auto-scales so it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr).

### AWS IAM role

Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
- **external_ids**: External IDs are the part of the IAM role that Segment uses to assume the role, providing access to your AWS account. Define the external ID in the IAM role as the ID of the Segment workspace that you want to connect to Data Lakes. You can retrieve the Segment workspace ID from the [Segment app](https://app.segment.com/goto-my-workspace/overview) by navigating to Settings > General Settings > ID.

The file path looks like:
`s3://<bucket>/data/<source-id>/segment_type=<event type>/day=<YYYY-MM-DD>/hr=<HH>`

Here are a few examples of what events look like:

`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=11/`
`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=12/`
`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=13/`

`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=11/`
`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=12/`
`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=13/`
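Because each hour of each event type lands under its own prefix, you can fetch a single hour's Parquet files directly. This boto3 sketch lists one such partition, reusing the same `YOUR_BUCKET` and `SOURCE_ID` placeholders as the examples above:

```python
# A sketch showing how the day/hr partition layout scopes a listing to one
# hour of one event type. Bucket name and source ID are placeholders.
import boto3

s3 = boto3.client("s3")

prefix = "segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=11/"
response = s3.list_objects_v2(Bucket="YOUR_BUCKET", Prefix=prefix)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])  # the Parquet files for that hour
```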
By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give you granular access to the S3 data. You can change the partition structure during the [set up process](/docs/connections/storage/catalog/data-lakes/), where you can choose from the following options:
- Day/Hour [YYYY-MM-DD/HH] (Default)

Data Lakes stores the inferred schema and associated metadata of the S3 data in the AWS Glue Data Catalog. This metadata includes the S3 file location, the data's conversion into Parquet format, the column names inferred from the Segment event, the nested properties and traits that are flattened, and the inferred data type.

![A screenshot of the AWS ios_prod_identify table, containing the schema for the table, information about the table, and the table version](images/dl_gluecatalog.png)
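To inspect what Data Lakes inferred, you can read the table definition back from the Glue Data Catalog. In this boto3 sketch the database name is a placeholder, and the table name is assumed from the screenshot above:

```python
# A minimal sketch for reading an inferred schema from the Glue Data Catalog.
# Replace YOUR_GLUE_DATABASE with your catalog database; "ios_prod_identify"
# is assumed from the screenshot and will differ per source and event type.
import boto3

glue = boto3.client("glue", region_name="us-west-2")  # assumed region

table = glue.get_table(DatabaseName="YOUR_GLUE_DATABASE",
                       Name="ios_prod_identify")["Table"]

for column in table["StorageDescriptor"]["Columns"]:
    print(f"{column['Name']}: {column['Type']}")  # for example, user_id: string
```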