Book Data Lake

A fully automated data pipeline that ingests daily trending books from Project Gutenberg, enriches them with metadata from Open Library, transforms the data using dbt, and indexes it into Elasticsearch for visualization in Kibana. The entire workflow is orchestrated using Apache Airflow.

Prerequisites

Before running this project, ensure you have the following installed on your machine:

- uv: a fast Python package manager.
- Docker Desktop (or Docker Engine + Docker Compose).

Quickstart (One-Command Setup)

To install dependencies, start the databases, and launch the Airflow orchestrator, simply run:

make start

Note: The script waits for Elasticsearch to report a healthy status before it initializes Airflow, so the first start can take a little while.
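
The wait described in the note can be sketched as a small polling loop. This is an illustration, not the project's actual script: the function names, retry count, and delay are assumptions, and it assumes Elasticsearch is reachable at http://localhost:9200 without authentication. It calls the cluster health endpoint and treats "green" or "yellow" as healthy, since a single-node cluster commonly reports "yellow".

```python
import json
import time
import urllib.error
import urllib.request


def cluster_status(base_url="http://localhost:9200", timeout=5.0):
    """Return the cluster status ("green", "yellow", "red"), or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=timeout) as resp:
            return json.load(resp).get("status")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, or a non-JSON response: not ready yet.
        return None


def wait_until_healthy(get_status=cluster_status, attempts=30, delay=2.0, sleep=time.sleep):
    """Poll get_status until it reports green/yellow; return False if attempts run out."""
    for _ in range(attempts):
        if get_status() in ("green", "yellow"):
            return True
        sleep(delay)
    return False
```

Calling wait_until_healthy() before starting Airflow would block for at most attempts × delay seconds (plus request timeouts) and return False if the cluster never became healthy.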

Accessing the Services

Once make start completes, you can access the following local services:

- Apache Airflow: http://localhost:8080
  - User: admin
  - Password: admin
  - Turn on the daily_books_pipeline DAG to trigger the data flow.
- Kibana (Dashboard): http://localhost:5601
- Elasticsearch (Database): http://localhost:9200

Importing the Dashboard

Kibana stores dashboards as saved objects inside Elasticsearch rather than in the repository, so a fresh environment starts without them. Import the dashboard manually the first time you spin up the project:

1. Open Kibana at http://localhost:5601.
2. Open the main menu (top left) and scroll down to Stack Management.
3. Under the "Kibana" section in the left sidebar, click Saved Objects.
4. Click the blue Import button in the top right corner.
5. Click the import box and select the trending_books_dashboard.ndjson file located in the dashboard/ directory of this repository.
6. Click Import at the bottom of the flyout menu.
7. Navigate back to the Dashboard page from the main menu to view the charts.
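
The same import can probably be scripted through Kibana's saved objects import endpoint (POST /api/saved_objects/_import), which expects a multipart upload and a kbn-xsrf header. The sketch below is not part of this repository: the function names are hypothetical, and it assumes Kibana runs at http://localhost:5601 without authentication. It uses only the standard library.

```python
import urllib.request
import uuid


def build_import_request(ndjson_bytes, kibana_url="http://localhost:5601",
                         filename="trending_books_dashboard.ndjson"):
    """Build a multipart/form-data POST for Kibana's saved objects import API."""
    boundary = uuid.uuid4().hex
    body = b"\r\n".join([
        f"--{boundary}".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"'.encode(),
        b"Content-Type: application/ndjson",
        b"",
        ndjson_bytes,
        f"--{boundary}--".encode(),
        b"",
    ])
    return urllib.request.Request(
        # overwrite=true replaces an already imported copy instead of failing.
        f"{kibana_url}/api/saved_objects/_import?overwrite=true",
        data=body,
        method="POST",
        headers={
            "kbn-xsrf": "true",  # Kibana rejects API writes without this header
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def import_dashboard(path, kibana_url="http://localhost:5601"):
    """Upload the .ndjson export at `path` and return Kibana's raw JSON response."""
    with open(path, "rb") as handle:
        request = build_import_request(handle.read(), kibana_url)
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.read().decode("utf-8")
```

With the stack running, import_dashboard("path/to/trending_books_dashboard.ndjson") would send the file; the manual steps above remain the documented route.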

Useful Commands

This project uses a Makefile to simplify operations.

- make start: Runs the full setup and launches the Airflow scheduler.
- make stop: Safely spins down the Docker containers.
- make reset: Nuclear option. Deletes the virtual environment, Airflow database, and dbt cache for a completely fresh start.
- make pipeline: Runs the Python/dbt scripts manually in sequence (bypassing Airflow).
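
To illustrate what running the steps "in sequence" without Airflow might look like, here is a minimal sketch of a sequential runner. It is not the contents of the Makefile: the step list is hypothetical (only scripts/index_to_elastic.py and dbt_project/ are named in this README, and the ingestion script name is a placeholder), and the dbt invocation may need extra flags such as a profiles directory.

```python
import subprocess
import sys


def run_steps(steps, runner=subprocess.run):
    """Run each (name, command) pair in order; stop at the first failure."""
    completed = []
    for name, command in steps:
        print(f"[pipeline] running {name}")
        # check=True raises CalledProcessError on a non-zero exit code,
        # so later steps never run against incomplete data.
        runner(command, check=True)
        completed.append(name)
    return completed


# Hypothetical step list; the ingestion script name is a placeholder.
STEPS = [
    ("ingest", [sys.executable, "scripts/ingest_books.py"]),
    ("transform", ["dbt", "run", "--project-dir", "dbt_project"]),
    ("index", [sys.executable, "scripts/index_to_elastic.py"]),
]
```

Calling run_steps(STEPS) from the repository root would execute the three stages in order and abort as soon as one of them exits with an error.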

Project Architecture

- Ingestion (scripts/): Fetches trending books and API metadata.
- Transformation (dbt_project/): Cleans, joins, and calculates the classic_score using DuckDB, outputting a "Gold" Parquet file.
- Indexing (scripts/index_to_elastic.py): Pushes the structured Parquet data into Elasticsearch.
- Orchestration (dags/): Automates the daily execution.
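
As an illustration of the indexing stage, the sketch below turns rows (for example, dictionaries read from the Gold Parquet file) into the newline-delimited body that Elasticsearch's _bulk endpoint accepts. It is not the code in scripts/index_to_elastic.py: the index name trending_books, the book_id field, and the sample row are made up for the example.

```python
import json


def to_bulk_body(rows, index="trending_books", id_field="book_id"):
    """Serialize rows as _bulk NDJSON: one action line, then one document line."""
    lines = []
    for row in rows:
        action = {"index": {"_index": index}}
        if id_field in row:
            # A stable _id makes daily re-runs overwrite documents instead of duplicating them.
            action["index"]["_id"] = str(row[id_field])
        lines.append(json.dumps(action))
        lines.append(json.dumps(row))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""


# Made-up sample row; the real columns come from the dbt model.
sample = [{"book_id": 345, "title": "Dracula", "classic_score": 0.5}]
body = to_bulk_body(sample)
```

The resulting string would be sent as a POST to http://localhost:9200/_bulk with the header Content-Type: application/x-ndjson. The official Python client's bulk helpers cover the same ground and are likely the more practical choice in a real script.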
