From 7a5bf3a9bab03ff6fee0f1b50ccbc6fd52b8e3bb Mon Sep 17 00:00:00 2001
From: forstisabella <92472883+forstisabella@users.noreply.github.com>
Date: Thu, 16 Jun 2022 12:46:03 -0400
Subject: [PATCH 1/5] Updating data lakes pages with consistent FAQ format, Vale updates

---
 .../storage/data-lakes/comparison.md          |  8 +--
 .../data-lakes/data-lakes-manual-setup.md     | 10 ++--
 src/connections/storage/data-lakes/index.md   | 56 ++++++++++++-------
 .../storage/data-lakes/sync-history.md        | 21 ++++---
 .../storage/data-lakes/sync-reports.md        | 55 +++++++++++++-----
 5 files changed, 99 insertions(+), 51 deletions(-)

diff --git a/src/connections/storage/data-lakes/comparison.md b/src/connections/storage/data-lakes/comparison.md
index 9c22bf64c6..c5fb19bf53 100644
--- a/src/connections/storage/data-lakes/comparison.md
+++ b/src/connections/storage/data-lakes/comparison.md
@@ -9,7 +9,7 @@ As Segment builds new data storage products, each product evolves from prior pro

Data Lakes and Warehouses are not identical, but are compatible with a configurable mapping. This mapping helps you identify and manage the differences between the two storage solutions, so you can easily understand how the data in each is related.

## Data freshness

Data Lakes and Warehouses offer different sync frequencies:
- Warehouses can sync up to once an hour, with the ability to set a custom sync schedule and [selectively sync](/docs/connections/warehouses/selective-sync/) collections and properties within a source to Warehouses.

@@ -21,7 +21,7 @@ Segment's [99% guarantee of no duplicates](/docs/guides/duplicate-data/) for dat

[Warehouses](/docs/guides/duplicate-data/#warehouse-deduplication) and [Data Lakes](/docs/guides/duplicate-data/#data-lake-deduplication) also have a secondary deduplication system that further reduces the volume of duplicates, ensuring clean data in your Warehouses and Data Lakes.

## Object vs event data

Warehouses support both event and object data, while Data Lakes supports only event data.

@@ -73,7 +73,7 @@ See the table below for information about the [source](/docs/connections/sour

## Schema

### Data types

Warehouses and Data Lakes both infer data types for the events each receives. Because Warehouses receive events one by one, they look at the first event received each hour to infer the data type for subsequent events. Data Lakes uses a similar approach; however, because it receives data every hour, it can look at a group of events to infer the data type.

@@ -84,7 +84,7 @@ This approach leads to a few scenarios where the data type for an event may be d

Variance in data types between Warehouses and Data Lakes doesn't happen often for booleans, strings, and timestamps, but it can occur for decimals and integers.

If a bad data type is seen, such as text in place of a number or an incorrectly formatted date, Warehouses and Data Lakes attempt a best-effort conversion to cast the fields to the target data type. Fields that cannot be cast may be dropped. [Contact Segment Support](https://segment.com/contact){:target="_blank"} if you want to correct data types in the schema and perform a [replay](/docs/guides/what-is-replay/) to ensure no data is lost.
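To make the difference concrete, here is a minimal sketch of the two inference strategies described above. The function names and the widening logic are illustrative assumptions, not Segment's actual implementation:

```python
# Hypothetical illustration of first-event inference (Warehouses) versus
# batch inference (Data Lakes). All names and logic here are assumptions
# chosen only to show why the inferred types can diverge.

def infer_type(value):
    """Map a single value to a coarse column type."""
    if isinstance(value, bool):   # bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "integer"
    if isinstance(value, float):
        return "decimal"
    return "string"

def warehouse_inference(events):
    # Warehouses receive events one by one, so the first event
    # seen in the hour decides the column type.
    return infer_type(events[0])

def data_lake_inference(events):
    # Data Lakes receives the hour's data as a group, so it can widen
    # integer to decimal when the batch contains both. (Other mixes are
    # out of scope for this sketch.)
    types = {infer_type(e) for e in events}
    return "decimal" if types == {"integer", "decimal"} else types.pop()

prices = [10, 15, 19.99]  # an integer-looking field that is really a decimal
print(warehouse_inference(prices))  # integer -- the first event wins
print(data_lake_inference(prices))  # decimal -- the batch reveals the float
```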
### Tables

diff --git a/src/connections/storage/data-lakes/data-lakes-manual-setup.md b/src/connections/storage/data-lakes/data-lakes-manual-setup.md
index 63a616dc0f..476980c648 100644
--- a/src/connections/storage/data-lakes/data-lakes-manual-setup.md
+++ b/src/connections/storage/data-lakes/data-lakes-manual-setup.md
@@ -87,11 +87,11 @@ Segment requires access to an EMR cluster to perform necessary data processing.

The following steps provide examples of the IAM role and IAM policy.

### IAM role

Create a `segment-data-lake-role` for Segment to assume. The trust relationship document you attach to the role differs depending on your workspace region.

#### IAM role for Data Lakes created in US workspaces:

Attach the following trust relationship document to the role to create a `segment-data-lake-role` role for Segment:

@@ -125,7 +125,7 @@ Attach the following trust relationship document to the role to create a `segmen

> note ""
> Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

#### IAM role for Data Lakes created in EU workspaces:

> info ""
> EU workspaces are currently in beta. To learn more about the beta, contact your account manager.

@@ -160,7 +160,7 @@ Attach the following trust relationship document to the role to create a `segmen

> note ""
> **NOTE:** Replace the `ExternalID` list with the Segment `WorkspaceID` that contains the sources to sync to the Data Lake.

### IAM policy

Add a policy to the role created above to give Segment access to the relevant Glue databases and tables, EMR cluster, and S3.

@@ -259,7 +259,7 @@ Segment requires access to the data and schema for debugging data quality issues

![Debugging](images/dl_setup_glueerror.png)
- An easier alternative is to create a new account that has Athena backed by Glue as the default.

## Updating EMR clusters

You can update your existing Data Lake destination to EMR version 5.33.0 by creating a new v5.33.0 cluster in AWS and associating it with your existing Data Lake. After you update the EMR cluster, your Segment Data Lake continues to use the Glue data catalog you initially configured. When you update an EMR cluster to 5.33.0, you can participate in [AWS Lake Formation](https://aws.amazon.com/lake-formation/?whats-new-cards.sort-by=item.additionalFields.postDateTime&whats-new-cards.sort-order=desc){:target="_blank"}, use dynamic auto-scaling, and experience faster Parquet jobs.
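Before associating the new cluster with your Data Lake, you may want to confirm which release it's running. The following boto3 sketch is an optional convenience, not part of the documented setup; the cluster ID and region are placeholders to replace with your own:

```python
# A small boto3 sketch (assumed cluster ID and region) to confirm that the
# cluster you plan to associate with the Data Lake runs the expected release.
import boto3

emr = boto3.client("emr", region_name="us-west-2")  # assumed region

cluster = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]
release = cluster["ReleaseLabel"]  # for example, "emr-5.33.0"

print(f"{cluster['Name']}: {release}")
assert release == "emr-5.33.0", f"expected emr-5.33.0, found {release}"
```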
diff --git a/src/connections/storage/data-lakes/index.md b/src/connections/storage/data-lakes/index.md
index 965a73aafb..946574f787 100644
--- a/src/connections/storage/data-lakes/index.md
+++ b/src/connections/storage/data-lakes/index.md
@@ -10,18 +10,18 @@ Segment Data Lakes sends Segment data to a cloud data store (for example AWS S3)

> info ""
> Segment Data Lakes is available to Business tier customers only.

To learn more, check out the Segment blog post, [Introducing Segment Data Lakes](https://segment.com/blog/introducing-segment-data-lakes/){:target="_blank"}.

## How Segment Data Lakes work

Data Lakes stores Segment data in S3 in a read-optimized encoding format (Parquet), which makes the data more accessible and actionable. To help you zero in on the right data, Data Lakes also creates logical data partitions and event tables, and integrates metadata with existing schema management tools, such as the AWS Glue Data Catalog. The resulting data set is optimized for use with systems like Spark, Athena, and EMR, or machine learning vendors like Databricks and DataRobot.

![A diagram showing data flowing from Segment, through Parquet and S3, into Glue, and then into your Data Lake](images/dl_overview2.png)

Segment sends data to S3 by orchestrating the processing in an EMR (Elastic MapReduce) cluster within your AWS account using an assumed role. Customers using Data Lakes own and pay AWS directly for these AWS services.

![A diagram visualizing data flowing from a Segment user into your account and into a Glue catalog/S3 bucket](images/dl_vpc.png)

Data Lakes offers 12 syncs in a 24-hour period and doesn't offer a custom sync schedule or selective sync.

@@ -44,7 +44,7 @@ For detailed instructions on how to configure Segment Data Lakes, see the [Data

Data Lakes uses an EMR cluster to run jobs that load events from all sources into Data Lakes. The [AWS resources portion of the set up instructions](/docs/connections/storage/catalog/data-lakes#step-1---set-up-aws-resources) sets up an EMR cluster using the `m5.xlarge` node type. Data Lakes keeps the cluster running at all times; however, the cluster auto-scales so it's not always running at full capacity. Check the Terraform module documentation for the [EMR specifications](https://github.com/segmentio/terraform-aws-data-lake/tree/master/modules/emr).

### AWS IAM role

Data Lakes uses an IAM role to grant Segment secure access to your AWS account. The required inputs are:
- **external_ids**: External IDs are the part of the IAM role that Segment uses to assume the role, providing access to your AWS account. Define the external ID in the IAM role as the ID of the Segment workspace that you want to connect to Data Lakes. You can retrieve the Segment workspace ID from the [Segment app](https://app.segment.com/goto-my-workspace/overview) by navigating to Settings > General Settings > ID.

The file path looks like:
`s3://<bucket>/data/<source-id>/segment_type=<event type>/day=<YYYY-MM-DD>/hr=<HH>`

Here are a few examples of what events look like:

`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=11/`
`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=12/`
`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=13/`

`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=11/`
`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=12/`
`s3://YOUR_BUCKET/segment-data/data/SOURCE_ID/segment_type=page_viewed/day=2020-05-11/hr=13/`
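Because each hour of each event type lands under its own prefix, you can fetch a single hour's Parquet files directly. This boto3 sketch lists one such partition, reusing the same `YOUR_BUCKET` and `SOURCE_ID` placeholders as the examples above:

```python
# A sketch showing how the day/hr partition layout scopes a listing to one
# hour of one event type. Bucket name and source ID are placeholders.
import boto3

s3 = boto3.client("s3")

prefix = "segment-data/data/SOURCE_ID/segment_type=identify/day=2020-05-11/hr=11/"
response = s3.list_objects_v2(Bucket="YOUR_BUCKET", Prefix=prefix)

for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])  # the Parquet files for that hour
```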
By default, the date partition structure is `day=<YYYY-MM-DD>/hr=<HH>` to give you granular access to the S3 data. You can change the partition structure during the [set up process](/docs/connections/storage/catalog/data-lakes/), where you can choose from the following options:
- Day/Hour [YYYY-MM-DD/HH] (Default)

Data Lakes stores the inferred schema and associated metadata of the S3 data in the AWS Glue Data Catalog. This metadata includes the S3 file location, the data's conversion into Parquet format, the column names inferred from the Segment event, the nested properties and traits that are flattened, and the inferred data type.

![A screenshot of the AWS ios_prod_identify table, containing the schema for the table, information about the table, and the table version](images/dl_gluecatalog.png)
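To inspect what Data Lakes inferred, you can read the table definition back from the Glue Data Catalog. In this boto3 sketch the database name is a placeholder, and the table name is assumed from the screenshot above:

```python
# A minimal sketch for reading an inferred schema from the Glue Data Catalog.
# Replace YOUR_GLUE_DATABASE with your catalog database; "ios_prod_identify"
# is assumed from the screenshot and will differ per source and event type.
import boto3

glue = boto3.client("glue", region_name="us-west-2")  # assumed region

table = glue.get_table(DatabaseName="YOUR_GLUE_DATABASE",
                       Name="ios_prod_identify")["Table"]

for column in table["StorageDescriptor"]["Columns"]:
    print(f"{column['Name']}: {column['Type']}")  # for example, user_id: string
```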