Gift, N. (2020). Python for DevOps. Sebastopol, CA: O'Reilly. ISBN: 9781492057697.
Gift, N., Jordan, R., & Behrman, K. Data Engineering with Python and AWS Lambda LiveLessons. https://learning.oreilly.com/videos/data-engineering-with/9780135964330
https://github.com/noahgift/awslambda
This is an attempt to recreate the reference architecture above, with the intention of completing a data engineering capstone project while also learning the services and technologies along the way.
This is a submission for the individual project under category IV, per the definition below.
Project: Serverless Data Engineering Pipeline. Reproduce the architecture of the example serverless data engineering project. Enhance the project by extending the functionality of the NLP analysis: adding entity extraction, key phrase extraction, or some other NLP feature.
- CloudWatch is scheduled to run at a defined interval and triggers the producer Lambda (see the scheduling sketch after this list).
- The producer Lambda pulls the reference data from the DynamoDB table and pushes it to SQS.
- SQS triggers the consumer Lambda.
- The consumer Lambda searches Wikipedia articles with the keyword and extracts the summary.
- The summary is passed to Comprehend for sentiment analysis and entity recognition.
- The consumer Lambda then writes a dataframe with the required attributes to Amazon S3.
- Athena sits on top of Amazon S3 to query the data.
- QuickSight connects to Athena as the data source for its dataset and publishes a dashboard with the required analytics.
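For reference, the scheduled trigger in the first step can be wired up with boto3 as well as through the console. This is a minimal sketch, not the project's actual setup: the rule name, rate expression, and function ARN are placeholders.

```python
import boto3

events = boto3.client("events")
lambda_client = boto3.client("lambda")

# Placeholder names; substitute the real function ARN and rule name.
rule_name = "trigger-producer"
function_arn = "arn:aws:lambda:us-east-1:123456789012:function:producer"

# Scheduled rule that fires at a fixed interval (interval is illustrative).
rule_arn = events.put_rule(
    Name=rule_name,
    ScheduleExpression="rate(1 hour)",
)["RuleArn"]

# Allow CloudWatch Events to invoke the producer Lambda.
lambda_client.add_permission(
    FunctionName="producer",
    StatementId="allow-cloudwatch-trigger",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)

# Point the rule at the producer Lambda.
events.put_targets(
    Rule=rule_name,
    Targets=[{"Id": "producer-target", "Arn": function_arn}],
)
```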
- AWS console access
- IAM role for Lambda
- S3 bucket creation (a provisioning sketch follows this list)
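These prerequisites can also be provisioned from code. A rough boto3 sketch follows; the bucket name is the one used by the Athena DDL later in this document, while the role name and the exact set of managed policies are assumptions, kept deliberately broad for brevity:

```python
import json
import boto3

iam = boto3.client("iam")
s3 = boto3.client("s3")

# Bucket the consumer Lambda writes to (us-east-1; other regions
# additionally need a CreateBucketConfiguration).
s3.create_bucket(Bucket="fangsentiment3387")

# Trust policy letting Lambda assume the execution role.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
iam.create_role(
    RoleName="lambda-wikipedia-role",  # hypothetical role name
    AssumeRolePolicyDocument=json.dumps(trust_policy),
)

# CloudWatch Logs plus the services the pipeline touches; broad managed
# policies are used here for brevity.
for arn in [
    "arn:aws:iam::aws:policy/service-role/AWSLambdaBasicExecutionRole",
    "arn:aws:iam::aws:policy/AmazonDynamoDBReadOnlyAccess",
    "arn:aws:iam::aws:policy/AmazonSQSFullAccess",
    "arn:aws:iam::aws:policy/ComprehendReadOnly",
    "arn:aws:iam::aws:policy/AmazonS3FullAccess",
]:
    iam.attach_role_policy(RoleName="lambda-wikipedia-role", PolicyArn=arn)
```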
- Create a Cloud9 IDE environment for developing and deploying the AWS Lambda function and application.
- Walk through the environment-creation wizard, accepting the default settings.
- The lookup table with keywords to search for the articles is stored in a DynamoDB table.
- Create a table named 'keywords' through the console; items can be added via 'Create item' on the console screen (a boto3 equivalent follows this list).
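For reference, the same table and a sample item can be created with boto3. The partition key name 'name' is an assumption, since only the table name is specified above:

```python
import boto3

dynamodb = boto3.resource("dynamodb")

# 'keywords' table with a simple string partition key; the key name
# 'name' is assumed and should match whatever the console setup used.
table = dynamodb.create_table(
    TableName="keywords",
    KeySchema=[{"AttributeName": "name", "KeyType": "HASH"}],
    AttributeDefinitions=[{"AttributeName": "name", "AttributeType": "S"}],
    BillingMode="PAY_PER_REQUEST",
)
table.wait_until_exists()

# Seed a lookup keyword, equivalent to 'Create item' on the console.
table.put_item(Item={"name": "apple"})
```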
- Create a queue through the console and name it 'producer' (a one-call boto3 equivalent follows).
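The console step for the queue has a one-call boto3 equivalent, using default queue attributes:

```python
import boto3

sqs = boto3.client("sqs")

# Standard queue named 'producer'; defaults are fine for this pipeline.
queue_url = sqs.create_queue(QueueName="producer")["QueueUrl"]
```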
Using the Cloud9 IDE, the Lambda function and application can be easily created and deployed; we can develop, execute, and test the functionality from Cloud9 before deploying. Required libraries and packages are installed in the virtual environment created within the Lambda application folder by activating it and installing with pip:

```bash
source venv/bin/activate
pip install --upgrade <packagename>
```
The producer Lambda does two things (a sketch follows this list):
- connect to DynamoDB and pull the keywords
- push the message to the SQS queue
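A minimal sketch of such a producer handler, assuming one keyword per item under a 'name' attribute and a JSON message body; the real function may differ in its details:

```python
import json
import boto3

DYNAMODB = boto3.resource("dynamodb")
SQS = boto3.client("sqs")


def lambda_handler(event, context):
    """Scan the keywords table and fan the keywords out to SQS."""
    table = DYNAMODB.Table("keywords")
    keywords = [item["name"] for item in table.scan()["Items"]]

    queue_url = SQS.get_queue_url(QueueName="producer")["QueueUrl"]
    for keyword in keywords:
        SQS.send_message(
            QueueUrl=queue_url,
            MessageBody=json.dumps({"name": keyword}),
        )
    return {"sent": len(keywords)}
```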
- The consumer Lambda connects to the Wikipedia API and pulls the summary of the article, limited to a few sentences, as the snippet.
- It passes the snippet to the Comprehend service to detect the sentiment, and extends the original example by also obtaining the classification of the article via entity detection (a sketch of the consumer follows this list).
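Put together, the consumer might look roughly like the sketch below. It assumes the third-party wikipedia package is bundled with the deployment, that messages carry the JSON body shown in the producer sketch, and that output lands in the bucket used by the Athena table; the column names match the DDL further down, and the object key naming is illustrative:

```python
import io
import json
import boto3
import pandas as pd
import wikipedia

COMPREHEND = boto3.client("comprehend")
S3 = boto3.resource("s3")


def lambda_handler(event, context):
    """Summarize each keyword's article, score it, and write a CSV to S3."""
    rows = []
    for record in event["Records"]:  # batch of SQS messages
        name = json.loads(record["body"])["name"]
        # First few sentences of the article serve as the snippet.
        snippet = wikipedia.summary(name, sentences=3)
        sentiment = COMPREHEND.detect_sentiment(
            Text=snippet, LanguageCode="en"
        )["Sentiment"]
        entities = COMPREHEND.detect_entities(
            Text=snippet, LanguageCode="en"
        )["Entities"]
        # Taking the first detected entity's type is a simplification.
        entity_type = entities[0]["Type"] if entities else ""
        rows.append({
            "id": record["messageId"],
            "names": name,
            "wikipedia_snippit": snippet,
            "Sentiment": sentiment,
            "Type": entity_type,
        })

    # Serialize the dataframe as CSV and push it to the S3 bucket.
    buffer = io.StringIO()
    pd.DataFrame(rows).to_csv(buffer, index=False)
    S3.Object(
        "fangsentiment3387",
        f"sentiment_{context.aws_request_id}.csv",  # illustrative key
    ).put(Body=buffer.getvalue())
```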
Database and external table are created using the syntax below; the table is qualified with the database created in the first statement:

```sql
CREATE DATABASE wikianalytics;

CREATE EXTERNAL TABLE wikianalytics.article_sentiments (
  id STRING,
  names STRING,
  wikipedia_snippit STRING,
  Sentiment STRING,
  Type STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
LOCATION 's3://fangsentiment3387/'
TBLPROPERTIES ("skip.header.line.count"="1");
```
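Once the table exists, it can be queried from the Athena console or programmatically. A small boto3 sketch follows, with the result output location as a placeholder:

```python
import boto3

athena = boto3.client("athena")

# Count articles per sentiment; query results land in the placeholder path.
execution = athena.start_query_execution(
    QueryString=(
        "SELECT Sentiment, COUNT(*) AS articles "
        "FROM wikianalytics.article_sentiments "
        "GROUP BY Sentiment"
    ),
    ResultConfiguration={"OutputLocation": "s3://fangsentiment3387/results/"},
)
print(execution["QueryExecutionId"])
```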
Via the console, you can sign up for Amazon QuickSight, map the Athena data source to build the dashboard, and publish it to the users.