We built this application as a capstone for our coursework on the Northcoders Data/Cloud Engineering bootcamp, to gain practical experience creating an ETL pipeline from the ground up and as a demonstration of the skills we learned during the 13-week course.
Deployed to AWS as a monitored system using infrastructure as code, this application first extracts continually updated data from a live operational database. Using AWS Lambda functions, it then stores the data in an S3 bucket and transforms it into organised Parquet format. Within 30 minutes of appearing in the original database, the processed data is loaded into a data warehouse, ready for visualisation with Power BI, Tableau, or similar tools.
Technologies used
- Python 3.12.6
- AWS (Amazon Web Services)
- Terraform
- PostgreSQL
- pg8000
- Pandas
- PyArrow
- GitHub Actions
The Lambda functions are written in Python, using pg8000 to connect to PostgreSQL databases. The application is then deployed to AWS using Terraform and GitHub Actions. Our code is fully tested, and the project conforms to PEP 8 standards.
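To give a feel for how a Lambda function might pull only recently changed rows through pg8000, here is a hedged sketch. It assumes each source table has a `last_updated` timestamp column, and the function name `fetch_rows_since` is hypothetical; neither is confirmed by this README. The connection argument is expected to behave like `pg8000.native.Connection`, which exposes `run()` for queries and a `columns` attribute describing the last result.

```python
from datetime import datetime
from typing import Any


def fetch_rows_since(conn: Any, table: str, since: datetime) -> list[dict]:
    """Return rows from `table` changed after `since`, as a list of dicts.

    `conn` should behave like pg8000.native.Connection.
    The `last_updated` column is an assumption made for this sketch.
    """
    # Table names cannot be passed as bind parameters, so reject
    # anything that is not a plain identifier before interpolating it.
    if not table.isidentifier():
        raise ValueError(f"Unsafe table name: {table!r}")

    rows = conn.run(
        f'SELECT * FROM "{table}" WHERE last_updated > :since',
        since=since,
    )
    # After run(), conn.columns holds one metadata dict per result column.
    names = [col["name"] for col in conn.columns]
    return [dict(zip(names, row)) for row in rows]
```

In use, `conn` would be created with something like `pg8000.native.Connection(user=..., password=..., host=..., database=...)`, with the credentials supplied by whatever secrets mechanism the deployment uses. Accepting the connection as an argument also makes the function straightforward to unit test with a stand-in object.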
System requirements
- An AWS account with appropriate credentials
Instructions
- Clone the repository.
- Enter the repository and ensure you are working in the correct directory (the folder is named 'terrific-totes-data-pipeline').
- Once you have cloned and entered the repo, run the following commands in the terminal one at a time, waiting for each to finish before entering the next:
    make requirements
    make dev-setup
    make run-checks
- To initialise Terraform, first change into the 'terraform' folder of the repository. The correct path is ~/terrific-totes-data-pipeline/terraform.
- Enter the following command:
    terraform init
- If Terraform initialised successfully, enter:
    terraform plan
- Finally, enter:
    terraform apply
  and type 'yes' when prompted. This deploys the code to AWS.
Many thanks to our Northcoders 'product owner' Paul Copley for his guidance during the build phase of this project.
- Max Downer (@MaxDowner)
- Georgina Hardcastle (@xandriska)
- Charlotte Hooson (@CharlotteHooson)
- Morgan Lamb (@CoachLamb92)
- Andrew Rudge (@AndrewFudge)
- Hamzah Saeid (@hamzahsaeid)