A fully automated data pipeline that ingests daily trending books from Project Gutenberg, enriches them with metadata from Open Library, transforms the data using dbt, and indexes it into Elasticsearch for visualization in Kibana. The entire workflow is orchestrated using Apache Airflow.
Before running this project, ensure you have the following installed on your machine:
- uv: Fast Python package manager.
- Docker Desktop (or Docker Engine + Docker Compose).
To install dependencies, start the databases, and launch the Airflow orchestrator, simply run:
    make start

Note: The script automatically pauses and waits for Elasticsearch to become healthy before initializing Airflow.
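The wait-for-healthy step can be pictured as a small polling loop. The sketch below is not the project's actual script. It assumes Elasticsearch is reachable at http://localhost:9200 without authentication, and it treats a "green" or "yellow" status from the `_cluster/health` endpoint as healthy. The function names, retry count, and delay are illustrative choices.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_cluster_status(base_url="http://localhost:9200"):
    """Return the cluster status ("green", "yellow", "red"), or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
            return json.load(resp).get("status")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON reply: treat as "not ready yet".
        return None


def wait_for_elasticsearch(fetch=fetch_cluster_status, retries=30, delay=2.0, sleep=time.sleep):
    """Poll until the cluster reports green or yellow; return True on success."""
    for attempt in range(retries):
        if fetch() in ("green", "yellow"):
            return True
        if attempt < retries - 1:
            sleep(delay)  # back off before the next probe
    return False
```

Calling `wait_for_elasticsearch()` before starting Airflow would block for up to roughly a minute with these defaults. Accepting "yellow" is a common choice for single-node local clusters, where replica shards cannot be assigned.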
Once make start completes, you can access the following local services:
- Apache Airflow: http://localhost:8080
  - User: admin
  - Password: admin
  - Turn on the daily_books_pipeline DAG to trigger the data flow.
- Kibana (Dashboard): http://localhost:5601
- Elasticsearch (Database): http://localhost:9200
Kibana dashboards are stored inside the database, so they must be imported manually the first time you spin up the project.
- Open Kibana at http://localhost:5601.
- Open the main menu (top left) and scroll down to Stack Management.
- Under the "Kibana" section on the left sidebar, click Saved Objects.
- Click the blue Import button in the top right corner.
- Click the import box and select the trending_books_dashboard.ndjson file located in the dashboards/ directory of this repository.
- Click Import at the bottom of the flyout menu.
- Navigate back to the Dashboard page from the main menu to view the charts!
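The manual import above can also be scripted against Kibana's saved objects import API. This is an optional sketch, not part of the repository. It assumes Kibana runs at http://localhost:5601 with no authentication, and the helper name `build_import_request` is hypothetical. The endpoint expects a multipart form field named `file` and a `kbn-xsrf` header.

```python
import urllib.request
import uuid


def build_import_request(ndjson_bytes,
                         filename="trending_books_dashboard.ndjson",
                         kibana_url="http://localhost:5601"):
    """Build (but do not send) a multipart POST for Kibana's saved objects import API."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/ndjson",
        b"",
        ndjson_bytes,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        f"{kibana_url}/api/saved_objects/_import",
        data=body,
        method="POST",
        headers={
            "kbn-xsrf": "true",  # Kibana rejects API writes without this header
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, read dashboards/trending_books_dashboard.ndjson in binary mode and pass the resulting request to `urllib.request.urlopen`. If security is enabled on your Kibana instance, credentials would need to be added as well.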
This project uses a Makefile to simplify operations.
- make start: Full setup and launch of the Airflow scheduler.
- make stop: Safely spin down the Docker containers.
- make reset: Nuclear option. Deletes the virtual environment, Airflow database, and dbt cache for a completely fresh start.
- make pipeline: Runs the Python/dbt scripts manually in sequence (bypassing Airflow).
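To illustrate what running the steps in sequence without Airflow amounts to, here is a rough Python equivalent of a sequential runner that stops at the first failure. The Makefile's real commands are not shown in this README, so the step list is an assumption: `scripts/ingest.py` is a hypothetical file name, and the dbt invocation is a guess based on the dbt_project/ directory.

```python
import subprocess

# Assumed commands; only scripts/index_to_elastic.py and dbt_project/ appear in this README.
STEPS = [
    ("ingest", ["python", "scripts/ingest.py"]),  # hypothetical script name
    ("transform", ["dbt", "run", "--project-dir", "dbt_project"]),
    ("index", ["python", "scripts/index_to_elastic.py"]),
]


def run_pipeline(steps=STEPS, run=subprocess.call):
    """Run each step in order; stop at the first non-zero exit code and return it."""
    for name, cmd in steps:
        code = run(cmd)
        if code != 0:
            print(f"step {name!r} failed with exit code {code}")
            return code
    return 0
```

Stopping early matters here because each stage consumes the previous stage's output, so indexing after a failed transformation would push stale or missing data.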
- Ingestion (scripts/): Fetches trending books and API metadata.
- Transformation (dbt_project/): Cleans, joins, and calculates the classic_score using DuckDB, outputting a "Gold" Parquet file.
- Indexing (scripts/index_to_elastic.py): Pushes the structured Parquet data into Elasticsearch.
- Orchestration (airflow/dags/): Automates the daily execution.
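For the indexing stage, one plausible approach is Elasticsearch's `_bulk` API, which accepts newline-delimited JSON where each document is preceded by an action line. The sketch below only builds that request body from rows already loaded as dictionaries. It does not claim to mirror scripts/index_to_elastic.py. The index name `trending_books` and the sample fields are assumptions.

```python
import json


def build_bulk_body(rows, index_name="trending_books"):
    """Build an NDJSON body for the _bulk API: one action line plus one document line per row."""
    lines = []
    for row in rows:
        lines.append(json.dumps({"index": {"_index": index_name}}))  # action line
        lines.append(json.dumps(row))                                # document source
    # The bulk API requires the body to end with a newline.
    return ("\n".join(lines) + "\n") if lines else ""


# Made-up sample row for illustration only; real rows would come from the Gold Parquet file.
sample = [{"title": "Example Title", "classic_score": 0.5}]
body = build_bulk_body(sample)
```

The resulting string would be sent as a POST to http://localhost:9200/_bulk with the `Content-Type: application/x-ndjson` header.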