senior-data-engineer-coding-interview

In Git, please create a feature/ branch off the development branch, using your name and surname as the branch name.

Submit a pull request into the development branch once you're done coding.

Problem Statement

You are consulting on a Banking group's Data Engineering project and need to create a monthly extract of data from their Aurora cluster into a CSV file on S3, to be consumed by an external system.

There are three different banks in the group.

An Amazon Aurora cluster (postgres flavour) houses their Operational Data Store.

Connection details:

Host: mycluster.cluster-123456789012.us-east-1.rds.amazonaws.com
Port: 5432 (the default PostgreSQL port)
Username: postgres
Password: 5Y67bg#r#

Note: Unfortunately this cluster no longer runs, so its connection and the resulting use of data will need to be theoretical. Get those imaginations warmed up! And don't worry, we are not concerned with detailed correctness; we want to understand HOW you code and think about the problem. :)
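
Since the cluster is fictional, any connection code is purely illustrative. Below is a minimal sketch of how the Glue job might read from the ODS over JDBC, assuming a public.loans table and a Secrets Manager secret named ods/postgres (both invented for this sketch), so the password above never appears in code:

```python
# Glue/etl.py (excerpt) -- illustrative read from the Aurora Postgres ODS.
# The cluster is fictional; the secret name and table name are assumptions.
import json

import boto3
from pyspark.sql import SparkSession


def get_db_password(secret_id: str, region: str = "us-east-1") -> str:
    """Fetch the database password from AWS Secrets Manager instead of hardcoding it."""
    client = boto3.client("secretsmanager", region_name=region)
    secret = json.loads(client.get_secret_value(SecretId=secret_id)["SecretString"])
    return secret["password"]


spark = SparkSession.builder.appName("monthly-loan-extract").getOrCreate()

JDBC_URL = (
    "jdbc:postgresql://mycluster.cluster-123456789012"
    ".us-east-1.rds.amazonaws.com:5432/postgres"  # 5432 is the default Postgres port
)

loans = (
    spark.read.format("jdbc")
    .option("url", JDBC_URL)
    .option("dbtable", "public.loans")                    # assumed table name
    .option("user", "postgres")
    .option("password", get_db_password("ods/postgres"))  # assumed secret name
    .option("driver", "org.postgresql.Driver")
    .load()
)
```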

Given the following data model:

[data model diagram]

Perform the following tasks:

  1. Write code to deploy the following resources via Terraform (put this code in the file Terraform/main.tf):

  • A Glue Crawler to crawl this data source
  • A Glue Catalog Database to persist the metadata
  • A Glue Job which reads data from this data source and writes it to S3
  • An S3 bucket to house the transformed dataset
  • Any other Terraform resources you might require
  2. Write a Glue ETL script (use the file Glue/etl.py) containing scalable business logic to calculate the moving average of loan amounts taken out over the last three months, per branch (a sketch of this calculation follows the task list).

    1. Create a separate monthly output file for each bank in the group.
    2. Partition the files by Bank Name, Year and Month, with filenames of the format BankName_YYYYMMDD.csv (see the write sketch after this list).
  3. Remember, bonus points can be earned, if you really want to impress your client, by adding some of the following:

    1. Create an SFTP server on EC2 (using Terraform) which makes the contents of the S3 bucket containing the extracts available to the external system.
    2. Build in some form of scheduling and an orchestration layer, including notifications to a support email address (data-support@mybigbank.co.za) upon job failure.
    3. Ensure idempotency of the ETL system.
    4. Keep passwords and connection details secure. In fact, passwords should probably be rotated and S3 data should probably be encrypted...
    5. Add comments and logging to your code.
    6. Documentation is always nice, especially if you're good at drawing architectural diagrams. Please add any documentation you create to a separate README file.
    7. Anything else we did not specify in the task outline, but which you think will add value to your solution (keep templating in mind, so other team members can reuse your code).
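
Here is a minimal sketch of the moving-average logic for Glue/etl.py, continuing from the JDBC read above. The column names (bank_name, branch_id, loan_amount, loan_date) are assumptions, since the real data model is not reproduced here:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# One row per bank/branch/month: total loan amount taken out that month.
# Column names are assumptions; `loans` is the DataFrame from the JDBC read.
monthly = (
    loans
    .withColumn("month_start", F.trunc("loan_date", "month"))
    .groupBy("bank_name", "branch_id", "month_start")
    .agg(F.sum("loan_amount").alias("monthly_loan_amount"))
)

# Three-month moving average per branch: the current month plus the two
# preceding rows. This equates "last three rows" with "last three months",
# which holds only if every branch has loans every month; sparse branches
# would need gap-filling first.
three_month_window = (
    Window.partitionBy("bank_name", "branch_id")
    .orderBy("month_start")
    .rowsBetween(-2, 0)
)

result = monthly.withColumn(
    "moving_avg_3m", F.avg("monthly_loan_amount").over(three_month_window)
)
```

A window function keeps the logic scalable: Spark computes each branch's average in a single distributed pass, with no self-joins.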
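And a sketch of the partitioned, idempotent write with the required filename convention. Spark picks its own part-file names, so a boto3 rename pass follows the write; the bucket name is a placeholder, and using the run date for the DD portion of BankName_YYYYMMDD.csv is an assumption:

```python
import datetime as dt

import boto3
from pyspark.sql import functions as F

# `spark` and `result` come from the earlier sketches.
BUCKET = "bank-group-monthly-extracts"  # placeholder; created by the Terraform stack

# Dynamic partition overwrite keeps reruns idempotent: re-extracting a month
# replaces only that month's partition instead of appending duplicates.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

out = (
    result
    .withColumn("year", F.year("month_start"))
    .withColumn("month", F.month("month_start"))
)

# Repartitioning on the partition columns yields one CSV per bank/year/month.
(
    out.repartition("bank_name", "year", "month")
    .write.mode("overwrite")
    .partitionBy("bank_name", "year", "month")
    .option("header", "true")
    .csv(f"s3://{BUCKET}/extracts/")
)

# Rename each Spark part file to BankName_YYYYMMDD.csv within its partition
# (run date used for the DD portion -- an assumption).
s3 = boto3.client("s3")
run_date = dt.date.today().strftime("%Y%m%d")
for obj in s3.list_objects_v2(Bucket=BUCKET, Prefix="extracts/").get("Contents", []):
    key = obj["Key"]
    if not (key.endswith(".csv") and "/part-" in key):
        continue  # skip anything already renamed
    bank = key.split("bank_name=")[1].split("/")[0]
    prefix = key.rsplit("/", 1)[0]
    s3.copy_object(
        Bucket=BUCKET,
        CopySource={"Bucket": BUCKET, "Key": key},
        Key=f"{prefix}/{bank}_{run_date}.csv",
    )
    s3.delete_object(Bucket=BUCKET, Key=key)
```

The dynamic partition overwrite is also what addresses bonus item 3: rerunning the job for a given month replaces that month's output rather than duplicating it.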
