Enables developers to utilize AWS technologies such as Athena, Glue, and Firehose to slice/dice data directly from S3 and transform data from json to Parquet.

Data Lake Platform Adoption with POC

The goal of this repo and its exercises is to walk engineers through ACP use cases for AWS Analytical Services. Along the way, we aim to show the rough edges of those services, performance considerations, immutability challenges, and of course some solutions to those problems.

Each use case provides step-by-step instructions to simulate common production scenarios such as CRUD and Change Data Capture, and shows how service data can be introduced into AWS Analytical Services. Each step wipes out the environment and the resources it relies on and recreates everything every time it is executed. See the Confluence page for more details: https://accoladeinc.atlassian.net/wiki/spaces/PD/pages/79364467/Dlp-Poc+Github+Repo+Documentation

The project requires three parameters: use-case, step-id, and tag.

  • use-case: a group of exercises (steps) for a particular topic
  • step-id: an exercise within a topic
  • tag: an arbitrary value used to tag each AWS resource and isolate engineers in the AWS Sandbox.

The logs on the console shall provide enough information to showcase the use cases. Please refer to AWS documentation for further explanation.

Use Case 1: Slice/Dice CSV Data

We shall use Athena to access file(s) sitting in S3

  1. Create a bucket with a /csv/ folder
  2. Copy the CSV file to this folder (see the setup sketch after this list)
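
A minimal boto3 sketch of this setup, assuming a hypothetical bucket name and local file path:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names; replace with your own bucket and file
bucket = "dlp-poc-<your-tag>"
local_csv = "data/sample.csv"

# In regions other than us-east-1, pass CreateBucketConfiguration={"LocationConstraint": region}
s3.create_bucket(Bucket=bucket)

# Copy the CSV under the /csv/ prefix
s3.upload_file(local_csv, bucket, "csv/sample.csv")
```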

Step 1: Access CSV file without metadata in Glue Catalog

Purpose: Show that Athena doesn't properly function without metadata in Glue Catalog

  1. Read the data via Athena (see the query sketch after this list)
  2. Showcase the errors
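
A minimal sketch of reading the file through Athena with boto3 (database, table, and results location are hypothetical). Without metadata in the Glue Catalog the query fails because Athena cannot resolve the table:

```python
import time
import boto3

athena = boto3.client("athena")

# Hypothetical database/table names and results location
resp = athena.start_query_execution(
    QueryString="SELECT * FROM csv_table LIMIT 10",
    QueryExecutionContext={"Database": "dlp_poc_db"},
    ResultConfiguration={"OutputLocation": "s3://dlp-poc-<your-tag>/athena-results/"},
)
query_id = resp["QueryExecutionId"]

# Poll until the query finishes; without Catalog metadata the state ends up FAILED
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
    if status["State"] in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

print(status["State"], status.get("StateChangeReason", ""))
```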

Step 2: Access CSV file with metadata in Glue Catalog

Purpose: Show how Athena properly functions with metadata in Glue Catalog

  1. Deploy Glue Crawler and Catalog (see the sketch after this list)
  2. Read the metadata
  3. Read the data via Athena
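
A sketch of deploying and running the crawler with boto3 (crawler name, IAM role, database, and S3 path are assumptions); once it finishes, the table metadata lands in the Glue Catalog and the Step 1 query succeeds:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical names; the role must allow Glue to read the bucket
glue.create_crawler(
    Name="dlp-poc-csv-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="dlp_poc_db",
    Targets={"S3Targets": [{"Path": "s3://dlp-poc-<your-tag>/csv/"}]},
)
glue.start_crawler(Name="dlp-poc-csv-crawler")

# After the crawler completes, read back the metadata it generated
# (the crawler typically names the table after the S3 folder, here assumed to be "csv")
table = glue.get_table(DatabaseName="dlp_poc_db", Name="csv")
print(table["Table"]["StorageDescriptor"]["Columns"])
```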

Step 3: Generate metadata via Athena

Purpose: Show how to generate metadata via an Athena query

  1. Manually generate metadata in the Catalog via Athena (see the DDL sketch after this list)
  2. Read the metadata
  3. Read the updated data via Athena
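
Instead of a crawler, the metadata can be created by hand with an Athena DDL statement. A sketch, assuming hypothetical column names, a comma delimiter, and the bucket layout from the setup above:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical schema; adjust columns and types to match the CSV file.
# Assumes the dlp_poc_db database already exists in the Glue Catalog.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS dlp_poc_db.csv_table (
    id   int,
    name string,
    city string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://dlp-poc-<your-tag>/csv/'
TBLPROPERTIES ('skip.header.line.count'='1')
"""

athena.start_query_execution(
    QueryString=ddl,
    ResultConfiguration={"OutputLocation": "s3://dlp-poc-<your-tag>/athena-results/"},
)
```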

Step 4: Update CSV file

Purpose: Show how Athena doesn't store underlying data and only properly functions with updated metadata

  1. Run Step 3
  2. Update the existing data in the backend
  3. Read the updated data via Athena
  4. Showcase the errors
  5. Deploy Glue Crawler and Catalog on the updated data
  6. Read the updated metadata
  7. Read the updated data via Athena

Step 5: Simulate Immutability of S3 Data

  1. Read the data
  2. Insert a row via Athena as a duplicate of an existing record
  3. Read the data and print it to point out that the data now contains a duplicate

Note: We learned that Athena does not support certain DML commands, like update and insert, so we can't insert a row via Athena.

Step 6: Dedup

  1. Read the data
  2. Run dedup via Athena
  3. Read the data

Note: Since Athena doesn't support numerous DML commands, we also can't showcase dedup.

Use Case 2: Slice/Dice Complex Data

The files are csv, backspace-delimited, and json files.

  1. Create a bucket with the following folders for the file types
    1. /json/
    2. /json-omitted/
    3. /complex_json/
    4. /backspace/
  2. Copy the files to their respective folders

Step 1: For a simple Json file

Purpose: Show how the built-in json classifier correctly crawls simple json

  1. Deploy Glue Crawler and Catalog
  2. Read the metadata
  3. Read the data via Athena

Step 2: For a complex Json file(Explicit metadata)

Purpose: Show how the built-in json classifier crawls the entire json file, even if it has a complex data structure. Furthermore, this step shows how to query complex data using Athena if we have the metadata explicitly predefined using a Crawler.

  1. Deploy Glue Crawler and Catalog
  2. Read the metadata
  3. Read the data via Athena (a nested-field query sketch follows this list)
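
With the crawler-generated metadata in place, nested attributes show up as struct columns that can be addressed with dot notation in Athena. A sketch, assuming a hypothetical table and nested customer record:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical table and column names produced by the crawler;
# nested json attributes are exposed as struct columns
query = """
SELECT id, customer.address.city AS city
FROM dlp_poc_db.complex_json
LIMIT 10
"""

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://dlp-poc-<your-tag>/athena-results/"},
)
```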

Step 3: For a complex Json file(Generic metadata)

Purpose: This step shows how to query complex json data without having complex, explicit metadata predefined. Basically, we can use the built-in json functions available in Presto to produce the same output as Step 2. The result is the same, but the data structure interpretation happens at query time rather than at table creation time, which permits more dynamic querying but requires users to have an understanding of Presto json functions.

  1. Explain built-in Presto json functions
  2. Run ad-hoc json queries (see the sketch after this list)
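
A sketch of the same kind of query using Presto json functions against a table whose column holds the raw json as a string (table and column names are assumptions); json_extract_scalar and json_extract pull values out by JSONPath at query time instead of relying on predefined struct metadata:

```python
import boto3

athena = boto3.client("athena")

# Hypothetical: "payload" is a string column containing the raw json document
query = """
SELECT json_extract_scalar(payload, '$.customer.address.city') AS city,
       json_extract(payload, '$.customer.orders')              AS orders
FROM dlp_poc_db.raw_json
LIMIT 10
"""

athena.start_query_execution(
    QueryString=query,
    ResultConfiguration={"OutputLocation": "s3://dlp-poc-<your-tag>/athena-results/"},
)
```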

Step 4: For a complex Json file(Custom Classifier)

Purpose: This step shows how to use a custom classifier to catalog a specific part of a complex json file instead of the entire document.

  1. Create & Deploy Custom Glue Classifier (see the sketch after this list)
  2. Read the metadata
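
A sketch of creating a custom json classifier with boto3 and attaching it to a crawler (the JsonPath, names, and role are assumptions); the classifier tells the crawler to catalog only the records under a specific path instead of the whole document:

```python
import boto3

glue = boto3.client("glue")

# Hypothetical JsonPath: classify only the objects under customer.orders
glue.create_classifier(
    JsonClassifier={
        "Name": "dlp-poc-orders-classifier",
        "JsonPath": "$.customer.orders[*]",
    }
)

glue.create_crawler(
    Name="dlp-poc-complex-json-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="dlp_poc_db",
    Classifiers=["dlp-poc-orders-classifier"],
    Targets={"S3Targets": [{"Path": "s3://dlp-poc-<your-tag>/complex_json/"}]},
)
glue.start_crawler(Name="dlp-poc-complex-json-crawler")
```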

Step 5: For a Json file with Omitted Attribute

Purpose: Show how the Crawler handles omitted attributes, like an optional address attribute, for a complex json file.

  1. Deploy Glue Crawler and Catalog
  2. Read the metadata
  3. Create & Deploy Custom Glue Classifier
  4. Read the updated metadata

Step 6: For a backspace delimited file

Purpose: Show that you need to build a custom classifier with a Grok pattern for this file format (see the sketch after the steps below)

  1. Deploy Glue Crawler and Catalog
  2. Read the metadata
  3. Read via Athena
  4. Showcase the error
  5. Create & Deploy Custom Glue Classifier (Grok Pattern)
  6. Read the updated metadata
  7. Read the data via Athena
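
A sketch of a Grok-based custom classifier (the Grok pattern, names, and role are assumptions and would need to match the actual fields and delimiter in the file):

```python
import boto3

glue = boto3.client("glue")

# Hypothetical Grok pattern: three fields separated by the backspace control character (\x08)
glue.create_classifier(
    GrokClassifier={
        "Name": "dlp-poc-backspace-classifier",
        "Classification": "backspace_delimited",
        "GrokPattern": "%{DATA:id}\\x08%{DATA:name}\\x08%{GREEDYDATA:city}",
    }
)

glue.create_crawler(
    Name="dlp-poc-backspace-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="dlp_poc_db",
    Classifiers=["dlp-poc-backspace-classifier"],
    Targets={"S3Targets": [{"Path": "s3://dlp-poc-<your-tag>/backspace/"}]},
)
```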

Use Case 3: Firehose Data Transformation

Create the following resources:

  1. S3 bucket
  2. Firehose Stream

Step 1: Transform simple Json data without metadata

Purpose: Show that metadata must be in the Glue Catalog for Firehose to properly transform the data

  1. Send a json file to Firehose through boto
  2. Have Firehose ingest and transform the simple json file into Parquet
  3. Showcase the errors

Step 2: Transform simple Json data with metadata

Purpose: Show how Firehose properly functions for a simple file with metadata present in Catalog

  1. Deploy the Crawler on the json file
  2. Send a simple json file to Firehose through boto (see the sketch after this list)
  3. Dump the Parquet file into S3
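
A sketch of sending a record to the stream with boto3 (stream name and payload are assumptions); Firehose buffers the records, converts them, and delivers the output to S3:

```python
import json
import boto3

firehose = boto3.client("firehose")

# Hypothetical stream name and record
record = {"id": 1, "name": "alice", "city": "seattle"}

firehose.put_record(
    DeliveryStreamName="dlp-poc-stream",
    # Newline-delimit records so consecutive json documents stay separable
    Record={"Data": (json.dumps(record) + "\n").encode("utf-8")},
)
```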

Step 3: Transform complex Json data with metadata

Purpose: Show how Firehose properly functions even with complex, nested json files without any custom classifiers. This is especially important since a team only has to use a Glue Crawler to generate the metadata for a complex, nested file for Firehose to be able to successfully transform it to Parquet.

  1. Deploy the Crawler on the complex json file
  2. Configure Firehose to transform the json into Parquet by pointing Firehose to the Glue Catalog (see the configuration sketch after this list)
  3. Send a complex json file to Firehose through Boto
  4. Dump the Parquet file into S3
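
A sketch of the delivery stream configuration that points Firehose at the Glue Catalog for record format conversion (ARNs, names, region, and buffering values are assumptions; format conversion requires a buffer size of at least 64 MB):

```python
import boto3

firehose = boto3.client("firehose")

# Hypothetical ARNs and names; the role must be able to read the Glue table and write to the bucket
firehose.create_delivery_stream(
    DeliveryStreamName="dlp-poc-parquet-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
        "BucketARN": "arn:aws:s3:::dlp-poc-<your-tag>",
        "Prefix": "parquet/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "DataFormatConversionConfiguration": {
            "Enabled": True,
            "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
            "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
            "SchemaConfiguration": {
                "RoleARN": "arn:aws:iam::123456789012:role/FirehoseDeliveryRole",
                "DatabaseName": "dlp_poc_db",
                "TableName": "complex_json",
                "Region": "us-west-2",
            },
        },
    },
)
```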

Step 4: Transform complex Json data with metadata & classifier

Purpose: Show how Firehose could also work with a custom Glue classifier if that is necessary.

  1. Deploy the Crawler with the custom Classifier on the complex json file
  2. Configure Firehose to transform the json into Parquet by pointing Firehose to the Glue Catalog
  3. Transform the json to only include the fields specified on the classifier.
  4. Send the transformed json file to Firehose through Boto
  5. Dump the Parquet file into S3

Step 5: Transform updated Json data with metadata

Purpose: Show how data transformation still works even when the underlying data is updated and you want to maintain previously created resources. The method below is the best option if you want to maintain the "Version 1" resources and not delete them. If you never plan on using the "Version 1" resources, you only have to re-run the same Crawler on the updated data to update the schema in the Glue Catalog; everything else works as is.

  1. Update json data and create a "Version 2" for every resource previously created
  2. Deploy Crawler Version 2 on the updated json file
  3. Send a complex json file to Firehose Version 2 through boto
  4. Point Firehose Version 2 to the Glue Catalog Version 2
  5. Have Firehose Version 2 ingest and transform the json file into Parquet
  6. Dump the Parquet file into S3 bucket version 2

Use Case 4: Glue Data Transformation

Important Note: Glue ETL runs Spark, which may be more expensive than using a Firehose stream to convert data to Parquet. Please refer to Use Case 3 to learn more about Firehose data transformation.

Create the following resources:

  1. S3 bucket
  2. Upload a json file into this bucket

Step 1: Transform Json data without metadata

Purpose: Show that metadata must be in the Glue Catalog for a Glue ETL job to properly transform the data

  1. Run an ETL job without pointing to Catalog
  2. Showcase the errors

Step 2: Transform simple Json data with metadata

Purpose: Show how a Glue ETL job properly transforms data when metadata is in the Glue Catalog (a job script sketch follows the steps below)

  1. Deploy Glue Catalog and Crawler
  2. Run an ETL job pointing to Catalog that transforms the json file to parquet
  3. Read the transformed file
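
A minimal Glue ETL script sketch that reads from the Catalog and writes Parquet back to S3 (database, table, and output path are assumptions); this is roughly the kind of job a step like this would run:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the source data using the metadata the crawler put in the Glue Catalog
# (hypothetical database and table names)
dyf = glue_context.create_dynamic_frame.from_catalog(
    database="dlp_poc_db",
    table_name="json",
)

# Write the records back to S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://dlp-poc-<your-tag>/parquet/"},
    format="parquet",
)

job.commit()
```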

Step 3: Transform complex Json data with metadata

Purpose: Show how a Glue ETL job properly transforms complex data when metadata is in the Glue Catalog

  1. Deploy Glue Catalog and Crawler with Custom Classifier
  2. Run an ETL job pointing to Catalog that transforms the json file to parquet
  3. Read the transformed file
