The project consists of a batch-processing data pipeline for a data-intensive machine learning application that:
- reads data from a set of CSV files containing daily temperature data for cities around the world (city_temperature.csv from Kaggle) with 2.9 million rows;
- processes the source data by aggregating temperatures to a monthly average per city;
- incrementally loads the processed data to a database;
- reads the processed data and labels each record with the temperature level of each city in each quarter, using a KMeans machine learning model to cluster the data (a minimal sketch of these transformations is shown after this list);
- loads the clustered data to the database every quarter;
- presents the processed and clustered data in HTML web pages.
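For illustration only, the two transformations described above (monthly aggregation and quarterly KMeans labeling) could be sketched as follows. This is a minimal example assuming pandas and scikit-learn, with column names taken from the Kaggle city_temperature.csv file and an assumed number of clusters; it is not the project's actual code.

```python
# Minimal sketch of the aggregation and clustering steps (assumptions noted above).
import pandas as pd
from sklearn.cluster import KMeans

# 1. Aggregate daily temperatures to a monthly average per city.
daily = pd.read_csv("city_temperature.csv")
daily = daily[daily["AvgTemperature"] > -90]  # drop the -99 missing-value sentinels used in the Kaggle file
monthly = (
    daily.groupby(["City", "Year", "Month"], as_index=False)["AvgTemperature"]
    .mean()
    .rename(columns={"AvgTemperature": "MonthlyAvgTemperature"})
)

# 2. Label each city/quarter with a temperature level via KMeans clustering.
monthly["Quarter"] = (monthly["Month"] - 1) // 3 + 1
quarterly = monthly.groupby(["City", "Year", "Quarter"], as_index=False)["MonthlyAvgTemperature"].mean()
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)  # three levels is an assumption
quarterly["TemperatureLevel"] = kmeans.fit_predict(quarterly[["MonthlyAvgTemperature"]])
```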
The application architecture is presented below:
The application is implemented by deploying four Docker containers:
- ETL: Prefect is used to manage the data pipeline flow and execute the ETL tasks written in Python (see code and details here). This job is scheduled to run every month and incrementally add new data; a simplified flow sketch is shown after this list.
- Database: A PostgreSQL database is used with a Docker named volume to persist data outside the container, and port 5432 is exposed for external access.
- ML: Prefect is used to manage the flow of extracting the processed data and running the machine learning model to cluster the data, also written in Python (see code and details here). The process is scheduled to label the data and recreate the table every quarter.
- Visualization: Prefect is used to manage the flow that creates the HTML pages presenting the processed data from the ETL and ML processes, also using Python and the Bokeh library (see code and details here). An Nginx service exposes the generated pages on port 80.
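As a rough illustration of how one of these Prefect-managed jobs could be wired together, the sketch below shows an ETL flow using Prefect 2-style decorators. The task names, table name, and connection string are placeholders and may differ from the repository's actual code.

```python
# Rough sketch of the ETL orchestration, assuming Prefect 2-style decorators;
# names and connection details are illustrative placeholders.
import pandas as pd
from prefect import flow, task
from sqlalchemy import create_engine

@task
def extract(csv_path: str) -> pd.DataFrame:
    # Read the raw daily temperature data.
    return pd.read_csv(csv_path)

@task
def transform(daily: pd.DataFrame) -> pd.DataFrame:
    # Aggregate daily temperatures to a monthly average per city.
    return daily.groupby(["City", "Year", "Month"], as_index=False)["AvgTemperature"].mean()

@task
def load(monthly: pd.DataFrame, connection_string: str) -> None:
    # Incrementally append the processed data to the database.
    engine = create_engine(connection_string)
    monthly.to_sql("monthly_temperature", engine, if_exists="append", index=False)

@flow
def etl_flow(csv_path: str = "city_temperature.csv",
             connection_string: str = "postgresql://user:password@database:5432/weather"):
    load(transform(extract(csv_path)), connection_string)

if __name__ == "__main__":
    # In the real deployment the flow would run on a monthly schedule
    # (e.g. via a Prefect deployment) instead of being invoked directly.
    etl_flow()
```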
The project deployment leverages Infrastructure as Code via docker-compose, with dependencies set for each container (an illustrative excerpt is shown after this list).
- All containers depend on the successful initialization of the database container, and the ML job is only executed after the ETL job is done.
- A status table is used to control the flow execution between containers.
- A Docker network is created as part of the deployment, allowing communication between the containers.
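A docker-compose file in this spirit could look roughly like the excerpt below. The service names, images, build paths, credentials, and port mappings are illustrative assumptions; refer to the repository's docker-compose.yml for the real configuration.

```yaml
# Illustrative docker-compose excerpt (names and values are assumptions).
version: "3.8"

services:
  database:
    image: postgres:13
    environment:
      POSTGRES_PASSWORD: example
    volumes:
      - pgdata:/var/lib/postgresql/data   # named volume persists data outside the container
    ports:
      - "5432:5432"                       # exposed for external access

  etl:
    build: ./etl
    depends_on:
      - database                          # wait for the database container

  ml:
    build: ./ml
    depends_on:
      - database                          # ETL -> ML ordering is enforced via the status table

  visualization:
    build: ./visualization
    depends_on:
      - database
    ports:
      - "8080:80"                         # Nginx serves the generated HTML pages

volumes:
  pgdata:

# All services share the default network created by docker-compose,
# which allows container-to-container communication by service name.
```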
The following tools are required to run the application:
- Docker
- Docker Compose
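Both can be checked from a terminal, for example:

```
docker --version
docker-compose --version
```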
In order to successfully execute the application and verify the results, follow the steps below:
1. Clone the GitHub repository by running the following command in the terminal/command line:

   ```
   git clone https://github.com/vcmueller/project-data-engineering.git
   ```

2. Go into the cloned repository folder:

   ```
   cd project-data-engineering
   ```

   Note: An alternative to steps 1 and 2 is to download the source files here, go into the downloaded folder and continue with Step 3 from the terminal.

3. Execute the following docker-compose command to build the Docker images:

   ```
   docker-compose build
   ```

4. Execute the following docker-compose command to start the containers:

   ```
   docker-compose up -d
   ```

   The containers' execution status can be monitored via:

   ```
   docker ps
   ```

5. Once the containers are initiated and running, open the following link in the browser:

   ```
   localhost:8080
   ```

6. The results can be verified by following the links within the webpage.

   - Note that it might take a few minutes until the jobs are completed, so it's recommended to refresh the pages to see updated information.
   - By default, the ETL and ML jobs are executed in test mode (run every 15 minutes). Production mode can be enabled by following the steps documented in ETL and ML.
   - The "Process Status" page reflects the jobs' execution details, so it can be used as a monitoring tool.
   - More information regarding the results displayed can be found here.

7. To stop the application, run the following docker-compose command:

   ```
   docker-compose down
   ```
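Optionally, while the containers are still up (before running `docker-compose down`), the loaded tables can be inspected directly over the exposed PostgreSQL port. The user, database, and table names below are placeholders; the actual values are defined in the repository's configuration:

```
psql -h localhost -p 5432 -U <user> -d <database> -c "SELECT * FROM <table> LIMIT 10;"
```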

