Rearc Data Quest

Q. What is this quest?

It is a fun way to assess your data skills. It is also a good representative sample of the work we do at Rearc.

Q. So what skills should I have?

Data management / data engineering concepts.
A programming language (Python, Java, Scala, etc.).
AWS knowledge (Lambda, SQS, CloudWatch Logs).
Infrastructure as code (Terraform, CloudFormation, etc.).

Q. What do I have to do?

This quest consists of four parts. Put together, the four parts form a data pipeline architecture.

Part 1 and Part 2 will showcase your skills with data management, AWS concepts, and your overall data engineering skillset. The goal is to source data from different places and store it in-house.
Part 3 will showcase your data analytics skills. The goal is to find some interesting insights with data.
Lastly, Part 4 will put all the pieces together. The goal here is to showcase your experience with automation and AWS services.

Part 1: AWS S3 & Sourcing Datasets

Republish this open dataset in Amazon S3 and share a link with us.
Script this process so the files in the S3 bucket are kept in sync with the source when data on the website is updated, added, or deleted.
Don't rely on hard coded names - the script should be able to handle added or removed files.
Ensure the script doesn't upload the same file more than once.
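
The sync rules above can be sketched as a comparison of two listings, one from the source website and one from the bucket. This is only a sketch under stated assumptions: `plan_sync` and `SyncPlan` are hypothetical names, and the "fingerprint" stands for any value that changes when a file changes (a content hash, or size plus last-modified time). How the listings are obtained is left out.

```python
from typing import Dict, List, NamedTuple


class SyncPlan(NamedTuple):
    upload: List[str]  # new at the source, or changed since the last upload
    delete: List[str]  # still in the bucket but gone from the source
    skip: List[str]    # unchanged, so not uploaded a second time


def plan_sync(source: Dict[str, str], bucket: Dict[str, str]) -> SyncPlan:
    """Compare two {file name: fingerprint} maps and decide what to do.

    No file name is hard coded: whatever appears in the listings is handled.
    """
    upload = sorted(n for n, fp in source.items() if bucket.get(n) != fp)
    delete = sorted(n for n in bucket if n not in source)
    skip = sorted(n for n, fp in source.items() if bucket.get(n) == fp)
    return SyncPlan(upload, delete, skip)


if __name__ == "__main__":
    # Hypothetical listings; the names and fingerprints are made up,
    # apart from pr.data.0.Current, which Part 3 mentions.
    source = {"pr.data.0.Current": "v2", "pr.series": "v1", "pr.added": "v1"}
    bucket = {"pr.data.0.Current": "v1", "pr.series": "v1", "pr.removed": "v1"}
    print(plan_sync(source, bucket))
```

Keeping the decision separate from the network calls makes the "upload only once" rule easy to check without touching S3.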

Part 2: APIs

Create a script that will fetch data from this API. You can read the documentation here.
Save the result of this API call as a JSON file in S3.
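
One way to approach these two steps is sketched below, using only the standard library for the HTTP call. The function names are hypothetical, and `s3_client` is assumed to behave like `boto3.client("s3")`; only its `put_object` method is used. The API URL, bucket, and key are left to the caller.

```python
import json
import urllib.request
from typing import Any


def fetch_json(url: str, timeout: float = 30.0) -> Any:
    """GET a URL and decode the response body as JSON."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


def save_json_to_s3(s3_client: Any, bucket: str, key: str, payload: Any) -> None:
    """Serialize payload and write it to S3 as a single JSON object.

    s3_client is an assumption: anything with a boto3-style put_object works.
    """
    s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=json.dumps(payload).encode("utf-8"),
        ContentType="application/json",
    )
```

A call would look like `save_json_to_s3(boto3.client("s3"), bucket_name, "api.json", fetch_json(api_url))`, where `bucket_name` and `api_url` are placeholders.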

Part 3: Data Analytics

Load both the CSV file from Part 1 (pr.data.0.Current) and the JSON file from Part 2 as dataframes (Spark, PySpark, pandas, Koalas, etc.).
Using the dataframe from the population data API (Part 2), generate the mean and the standard deviation of the US population across the years [2013, 2018] inclusive.
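
The calculation can be illustrated without a dataframe library. The sketch below assumes each record has `year` and `population` fields (hypothetical names, not taken from the API) and uses made-up toy values, not real population figures. It returns the sample standard deviation (n - 1 in the denominator), which is also what pandas `Series.std()` computes by default; `statistics.pstdev` would give the population form instead.

```python
import statistics
from typing import Any, Dict, Iterable, Tuple


def population_stats(
    rows: Iterable[Dict[str, Any]], first_year: int = 2013, last_year: int = 2018
) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of population for the years
    first_year..last_year, both ends included."""
    values = [
        float(row["population"])
        for row in rows
        # int() also accepts years that arrive as strings such as "2018".
        if first_year <= int(row["year"]) <= last_year
    ]
    return statistics.mean(values), statistics.stdev(values)


if __name__ == "__main__":
    # Toy values for illustration only; 2012 and 2019 fall outside the range.
    toy = [999, 10, 20, 30, 40, 50, 60, 999]
    rows = [{"year": str(y), "population": p} for y, p in zip(range(2012, 2020), toy)]
    print(population_stats(rows))
```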

Using the dataframe from the time series (Part 1), find the best year for every series_id: the year with the largest sum of "value" across all quarters in that year. Generate a report with each series_id, the best year for that series, and the summed value for that year. For example, if the table had the following values:

series_id	year	period	value
PRS30006011	1995	Q01	1
PRS30006011	1995	Q02	2
PRS30006011	1996	Q01	3
PRS30006011	1996	Q02	4
PRS30006012	2000	Q01	0
PRS30006012	2000	Q02	8
PRS30006012	2001	Q01	2
PRS30006012	2001	Q02	3

the report would generate the following table:

series_id	year	value
PRS30006011	1996	7
PRS30006012	2000	8
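
The example above can be reproduced with a small sketch in plain Python: sum the values per (series_id, year), then keep the year with the largest total for each series. `best_years` is a hypothetical name, and sending ties to the earlier year is an assumption, since the task does not say how to break them.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int, str, float]  # (series_id, year, period, value)


def best_years(rows: Iterable[Row]) -> List[Tuple[str, int, float]]:
    """For each series_id, return (series_id, best year, summed value)."""
    totals: Dict[Tuple[str, int], float] = defaultdict(float)
    for series_id, year, _period, value in rows:
        totals[(series_id.strip(), year)] += value

    best: Dict[str, Tuple[int, float]] = {}
    for (series_id, year), total in totals.items():
        current = best.get(series_id)
        # Larger total wins; on a tie the earlier year is kept (an assumption).
        if current is None or (total, -year) > (current[1], -current[0]):
            best[series_id] = (year, total)
    return [(sid, year, total) for sid, (year, total) in sorted(best.items())]


if __name__ == "__main__":
    # The rows from the example table above.
    rows = [
        ("PRS30006011", 1995, "Q01", 1), ("PRS30006011", 1995, "Q02", 2),
        ("PRS30006011", 1996, "Q01", 3), ("PRS30006011", 1996, "Q02", 4),
        ("PRS30006012", 2000, "Q01", 0), ("PRS30006012", 2000, "Q02", 8),
        ("PRS30006012", 2001, "Q01", 2), ("PRS30006012", 2001, "Q02", 3),
    ]
    for line in best_years(rows):
        print(*line, sep="\t")
```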

Using both dataframes from Part 1 and Part 2, generate a report that provides the value for series_id = PRS30006032 and period = Q01, together with the population for that year (if available in the population dataset):

series_id	year	period	value	Population
PRS30006032	2018	Q01	1.9	327167439
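
A join of this kind might look like the sketch below, written in plain Python rather than a dataframe library. `series_with_population` is a hypothetical name. Two assumptions are baked in: raw text fields may carry padding, so they are trimmed before comparison, and a year missing from the population data is kept with `None` rather than dropped. Only the 2018 row comes from the example above; the other rows are made up.

```python
from typing import Dict, Iterable, List, Optional, Tuple

SeriesRow = Tuple[str, int, str, float]  # (series_id, year, period, value)
ReportRow = Tuple[str, int, str, float, Optional[int]]


def series_with_population(
    rows: Iterable[SeriesRow],
    population_by_year: Dict[int, int],
    series_id: str = "PRS30006032",
    period: str = "Q01",
) -> List[ReportRow]:
    """Filter to one series and period, then attach the population by year."""
    report: List[ReportRow] = []
    for sid, year, per, value in rows:
        # Trim padding before comparing, otherwise the filter can miss rows.
        if sid.strip() == series_id and per.strip() == period:
            report.append((series_id, year, period, value, population_by_year.get(year)))
    return report


if __name__ == "__main__":
    rows = [
        ("PRS30006032   ", 2018, "Q01", 1.9),  # from the example above, padded
        ("PRS30006032", 2018, "Q02", 5.0),     # made up: wrong period
        ("PRS30006011", 2018, "Q01", 3.0),     # made up: wrong series
        ("PRS30006032", 1900, "Q01", 2.0),     # made up: no population for this year
    ]
    for line in series_with_population(rows, {2018: 327167439}):
        print(*line, sep="\t")
```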

Hint: public datasets sometimes need cleaning first. For example, you might find it useful to trim whitespace before doing any filtering or joins.
Submit your analysis, your queries, and the outcome of the reports as a .ipynb file.

Part 4: Infrastructure as Code & Data Pipeline with AWS CDK

Using AWS CloudFormation, AWS CDK or Terraform, create a data pipeline that will automate the steps above.
The deployment should include a Lambda function that executes Part 1 and Part 2 (you can combine both in one Lambda function). The Lambda function will be scheduled to run daily.
The deployment should include an SQS queue that will be populated every time the JSON file is written to S3. (Hint: S3 - Notifications)
For every message on the queue, execute a Lambda function that outputs the reports from Part 3 (logging the query results is enough; no .ipynb is required).
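
The queue-driven Lambda could start from a handler like the sketch below. It assumes the SQS message bodies are S3 event notifications delivered directly to the queue, and it only logs which object arrived; `run_reports` is a hypothetical placeholder for the Part 3 queries. The function names are illustrative, not part of any AWS API.

```python
import json
import logging
from typing import Any, Dict, List, Tuple
from urllib.parse import unquote_plus

logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


def s3_objects_from_sqs_event(event: Dict[str, Any]) -> List[Tuple[str, str]]:
    """Extract (bucket, key) pairs from an SQS event whose message bodies
    are S3 event notifications."""
    objects: List[Tuple[str, str]] = []
    for message in event.get("Records", []):
        body = json.loads(message["body"])
        # The s3:TestEvent sent when notifications are configured has no
        # "Records" entry, so it is skipped here.
        for record in body.get("Records", []):
            bucket = record["s3"]["bucket"]["name"]
            # Object keys arrive URL-encoded in S3 notifications.
            key = unquote_plus(record["s3"]["object"]["key"])
            objects.append((bucket, key))
    return objects


def handler(event: Dict[str, Any], context: Any) -> Dict[str, int]:
    """Lambda entry point: log one line per new object."""
    objects = s3_objects_from_sqs_event(event)
    for bucket, key in objects:
        logger.info("New object s3://%s/%s; running the Part 3 reports", bucket, key)
        # run_reports(bucket, key)  # hypothetical: would log the Part 3 results
    return {"processed": len(objects)}
```

Anything written through `logger` ends up in the function's CloudWatch log group, which covers the "just logging the results" requirement.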

Q. Do I have to do all these?

You can do as many as you like. We suspect though that once you start you won't be able to stop. It's addictive.

Q. What do I have to submit?

Link to data in S3 and source code (Step 1): https://questdata.s3.amazonaws.com/api.json
Source code (Step 2)
Source code in .ipynb file format and results (Step 3)
Source code of the data pipeline infrastructure (Step 4)

Q. What if I successfully complete all the steps?

We have many more for you to solve as a member of the Rearc team!

Q. What if I fail?

Do. Or do not. There is no fail.

Q. Can I share this quest with others?

No.

tnorlund/rearc
