Gift, N. (2020). Python for DevOps. Sebastopol, CA: O'Reilly. ISBN: 9781492057697.
Gift, N., Jordan, R., & Behrman, K. Data Engineering with Python and AWS Lambda LiveLessons. https://learning.oreilly.com/videos/data-engineering-with/9780135964330
https://github.com/noahgift/awslambda
This is an attempt to recreate the reference architecture above, with the intention of completing a data engineering capstone project while also learning the services and technologies along the way.
This is a submission for the individual project under category IV, per the definition below.
Project: Serverless Data Engineering Pipeline. Reproduce the architecture of the example serverless data engineering project. Enhance the project by extending the functionality of the NLP analysis: adding entity extraction, key phrase extraction, or some other NLP feature.
- CloudWatch is scheduled to run at a defined interval and triggers the producer Lambda (see the scheduling sketch after this list).
- The producer Lambda pulls the reference data from the DynamoDB table and pushes it to SQS.
- SQS triggers the consumer Lambda.
- The consumer Lambda searches Wikipedia articles with the keyword and extracts the summary.
- The summary is passed to Comprehend for sentiment analysis and entity recognition.
- The consumer Lambda then writes a dataframe with the required attributes to Amazon S3.
- Athena sits on top of Amazon S3 to query the data.
- QuickSight connects to Athena as the data source for its dataset and publishes a dashboard with the required analytics.
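For reference, the scheduled trigger in the first step can be wired up with boto3 as well as through the console. This is a minimal sketch, not the project's actual setup: the rule name, rate expression, and function ARN are placeholders.

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholder names; substitute the real function ARN and rule name.
rule_name = "trigger-producer"
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:producer"

# Scheduled rule that fires at a fixed interval (interval is illustrative).
rule_arn = events.put_rule(
    Name=rule_name,
    ScheduleExpression="rate(1 hour)",
)["RuleArn"]

# Allow CloudWatch Events to invoke the producer Lambda.
lambda_client.add_permission(
    FunctionName="producer",
    StatementId="allow-cloudwatch-trigger",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)

# Point the rule at the producer Lambda.
events.put_targets(
    Rule=rule_name,
    Targets=[{"Id": "producer-target", "Arn": function_arn}],
)
```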
- AWS console access
- IAM role for Lambda
- S3 bucket creation (a provisioning sketch follows this list)
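These prerequisites can also be provisioned from code. A rough boto3 sketch follows; the bucket name is the one used by the Athena DDL later in this document, while the role name and the exact set of managed policies are assumptions, kept deliberately broad for brevity:

```python
import json
import boto3

iam = boto3.client("iam")
s3 = boto3.client("s3")

# Bucket the consumer Lambda writes to (us-east-1; other regions
# additionally need a CreateBucketConfiguration).
s3.create_bucket(Bucket="fangsentiment3387")

# Trust policy letting Lambda assume the execution role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(
    RoleName="lambda-wikipedia-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# CloudWatch Logs plus the services the pipeline touches; broad managed
# policies are used here for brevity.
for arn in [
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    "arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess",
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/ComprehendReadOnly",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]:
    iam.attach_role_policy(RoleName="lambda-wikipedia-role", PolicyArn=arn)
```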
- Create a Cloud9 IDE environment for developing and deploying the AWS Lambda function and application.
- Walk through the environment-creation wizard, accepting the default settings.
- The lookup table with keywords to search for the articles is stored in a DynamoDB table.
- Create a table named 'keywords' through the console; items can be added via 'Create item' on the console screen (a boto3 equivalent follows this list).
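For reference, the same table and a sample item can be created with boto3. The partition key name 'name' is an assumption, since only the table name is specified above:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# 'keywords' table with a simple string partition key; the key name
# 'name' is assumed and should match whatever the console setup used.
table = dynamodb.create_table(
    TableName="keywords",
    KeySchema=[{"AttributeName": "name", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "name", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Seed a lookup keyword, equivalent to 'Create item' on the console.
table.put_item(Item={"name": "apple"})
```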
- Create a queue through the console and name it 'producer' (a one-call boto3 equivalent follows).
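The console step for the queue has a one-call boto3 equivalent, using default queue attributes:

```python
import boto3

sqs = boto3.client("sqs")

# Standard queue named 'producer'; defaults are fine for this pipeline.
queue_url = sqs.create_queue(QueueName="producer")["QueueUrl"]
```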
Using the Cloud9 IDE, the Lambda function and application can be easily created and deployed; we can develop, execute, and test the functionality from Cloud9 before deploying. Required libraries and packages are installed in the virtual environment created within the Lambda application folder by activating it and installing with pip:

```bash
source venv/bin/activate
pip install --upgrade <packagename>
```
The producer Lambda does two things (a sketch follows this list):
- connect to DynamoDB and pull the keywords
- push the message to the SQS queue
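A minimal sketch of such a producer handler, assuming one keyword per item under a 'name' attribute and a JSON message body; the real function may differ in its details:

```python
import json
import boto3

DYNAMODB = boto3.resource("dynamodb")
SQS = boto3.client("sqs")


def lambda_handler(event, context):
    """Scan the keywords table and fan the keywords out to SQS."""
    table = DYNAMODB.Table("keywords")
    keywords = [item["name"] for item in table.scan()["Items"]]

    queue_url = SQS.get_queue_url(QueueName="producer")["QueueUrl"]
    for keyword in keywords:
        SQS.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"name": keyword}),
        )
    return {"sent": len(keywords)}
```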
- The consumer Lambda connects to the Wikipedia API and pulls the summary of the article, limited to a few sentences, as the snippet.
- It passes the snippet to the Comprehend service to detect the sentiment, and extends the original example by also obtaining the classification of the article via entity detection (a sketch of the consumer follows this list).
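Put together, the consumer might look roughly like the sketch below. It assumes the third-party wikipedia package is bundled with the deployment, that messages carry the JSON body shown in the producer sketch, and that output lands in the bucket used by the Athena table; the column names match the DDL further down, and the object key naming is illustrative:

```python
import io
import json
import boto3
import pandas as pd
import wikipedia

COMPREHEND = boto3.client("comprehend")
S3 = boto3.resource("s3")


def lambda_handler(event, context):
    """Summarize each keyword's article, score it, and write a CSV to S3."""
    rows = []
    for record in event["Records"]:  # batch of SQS messages
        name = json.loads(record["body"])["name"]
        # First few sentences of the article serve as the snippet.
        snippet = wikipedia.summary(name, sentences=3)
        sentiment = COMPREHEND.detect_sentiment(
            Text=snippet, LanguageCode="en"
        )["Sentiment"]
        entities = COMPREHEND.detect_entities(
            Text=snippet, LanguageCode="en"
        )["Entities"]
        # Taking the first detected entity's type is a simplification.
        entity_type = entities[0]["Type"] if entities else ""
        rows.append({
            "id": record["messageId"],
            "names": name,
            "wikipedia_snippit": snippet,
            "Sentiment": sentiment,
            "Type": entity_type,
        })

    # Serialize the dataframe as CSV and push it to the S3 bucket.
    buffer = io.StringIO()
    pd.DataFrame(rows).to_csv(buffer, index=False)
    S3.Object(
        "fangsentiment3387",
        f"sentiment_{context.aws_request_id}.csv",  # illustrative key
    ).put(Body=buffer.getvalue())
```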
Database and external table are created using the syntax below; the table is qualified with the database created in the first statement:

```sql
CREATE DATABASE wikianalytics;

CREATE EXTERNAL TABLE wikianalytics.article_sentiments (
  id STRING,
  names STRING,
  wikipedia_snippit STRING,
  Sentiment STRING,
  Type STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://fangsentiment3387/'
TBLPROPERTIES ("skip.header.line.count"="1");
```
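Once the table exists, it can be queried from the Athena console or programmatically. A small boto3 sketch follows, with the result output location as a placeholder:

```python
import boto3

athena = boto3.client("athena")

# Count articles per sentiment; query results land in the placeholder path.
execution = athena.start_query_execution(
    QueryString=(
        "SELECT Sentiment, COUNT(*) AS articles "
        "FROM wikianalytics.article_sentiments "
        "GROUP BY Sentiment"
    ),
    ResultConfiguration={"OutputLocation": "s3://fangsentiment3387/results/"},
)
print(execution["QueryExecutionId"])
```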
Via the console, you can sign up for Amazon QuickSight, map the Athena data source to build the dashboard, and publish it to the users.