regkhalil/Data-Lake-Project

Book Data Lake

A fully automated data pipeline that ingests daily trending books from Project Gutenberg, enriches them with metadata from Open Library, transforms the data using dbt, and indexes it into Elasticsearch for visualization in Kibana. The entire workflow is orchestrated using Apache Airflow.

Prerequisites

Before running this project, ensure you have the following installed on your machine:

  • uv: Fast Python package manager.
  • Docker Desktop (or Docker Engine + Docker Compose).

Quickstart (One-Command Setup)

To install dependencies, start the databases, and launch the Airflow orchestrator, simply run:

make start

Note: make start waits for Elasticsearch to report healthy before it initializes Airflow, so the command may appear to pause for a while.
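As an illustration of what such a health wait can look like, here is a minimal Python sketch that polls the Elasticsearch cluster health endpoint until the status is yellow or green. It is not taken from this repository's Makefile. The base URL http://localhost:9200, the timeout values, and the function names are assumptions, and it presumes Elasticsearch is reachable without authentication.

```python
import json
import time
import urllib.error
import urllib.request


def fetch_cluster_status(base_url):
    """Return the cluster status ("green", "yellow", "red"), or None if unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/_cluster/health", timeout=5) as resp:
            return json.load(resp).get("status")
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, HTTP error, or a non-JSON body: treat as "not ready yet".
        return None


def wait_for_elasticsearch(base_url="http://localhost:9200", timeout_s=120.0,
                           interval_s=2.0, fetch=fetch_cluster_status,
                           sleep=time.sleep):
    """Poll until the cluster is yellow or green. Return True on success, False on timeout."""
    # base_url is an assumed default; adjust it to match your Docker Compose setup.
    deadline = time.monotonic() + timeout_s
    while True:
        if fetch(base_url) in ("yellow", "green"):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Calling wait_for_elasticsearch() before starting Airflow would block until the cluster answers or the timeout expires. A single-node development cluster often stays yellow, which is why yellow is accepted here.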

Accessing the Services

Once make start completes, the services are available on localhost. Kibana, for example, is served at http://localhost:5601.

Importing the Dashboard

Kibana stores dashboards as saved objects inside Elasticsearch, not as files, so the dashboard must be imported manually the first time you spin up the project.

  1. Open Kibana at http://localhost:5601.
  2. Open the main menu (top left) and scroll down to Stack Management.
  3. Under the "Kibana" section on the left sidebar, click Saved Objects.
  4. Click the blue Import button in the top right corner.
  5. Click the import box and select the trending_books_dashboard.ndjson file located in the dashboards/ directory of this repository.
  6. Click Import at the bottom of the flyout menu.
  7. Navigate back to the Dashboard page from the main menu to view the charts!
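The manual steps above could also be scripted against Kibana's Saved Objects import API (POST /api/saved_objects/_import with a kbn-xsrf header and a multipart "file" field). The sketch below only builds the request. It assumes Kibana runs at http://localhost:5601 without authentication, and build_import_request is a hypothetical helper, not part of this repository.

```python
import urllib.request
import uuid


def build_import_request(ndjson_bytes, filename="trending_books_dashboard.ndjson",
                         kibana_url="http://localhost:5601", overwrite=False):
    """Build a multipart POST request for Kibana's saved objects import endpoint."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        # Kibana expects the export in a form field named "file".
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        ndjson_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    url = f"{kibana_url}/api/saved_objects/_import"
    if overwrite:
        # Replace saved objects that already exist with the same IDs.
        url += "?overwrite=true"
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "kbn-xsrf": "true",  # required by Kibana for API write requests
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )
```

To send it, read dashboards/trending_books_dashboard.ndjson as bytes, pass them to build_import_request, and hand the result to urllib.request.urlopen. If security is enabled on your stack, credentials would have to be added to the headers.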

Useful Commands

This project uses a Makefile to simplify operations.

  • make start - Full setup and launch of the Airflow scheduler.
  • make stop - Safely spin down the Docker containers.
  • make reset - Nuclear option. Deletes the virtual environment, Airflow database, and dbt cache for a completely fresh start.
  • make pipeline - Runs the Python/dbt scripts manually in sequence (bypassing Airflow).
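To make the idea behind make pipeline concrete, here is a small Python sketch of running steps in sequence and stopping at the first failure. It is not the actual Makefile recipe. The step list is an assumption: scripts/ingest.py is a hypothetical file name, and invoking the tools through uv run is a guess based on the prerequisites. Only scripts/index_to_elastic.py and dbt_project/ are named in this README.

```python
import subprocess
import sys

# Hypothetical step list; adjust the commands to the real script names.
EXAMPLE_STEPS = [
    ("ingest", ["uv", "run", "python", "scripts/ingest.py"]),  # assumed file name
    ("transform", ["uv", "run", "dbt", "run", "--project-dir", "dbt_project"]),
    ("index", ["uv", "run", "python", "scripts/index_to_elastic.py"]),
]


def run_steps(steps):
    """Run each command in order; stop at the first non-zero exit code and return it."""
    for name, command in steps:
        print(f"[pipeline] running {name}")
        result = subprocess.run(command)
        if result.returncode != 0:
            print(f"[pipeline] {name} failed with exit code {result.returncode}")
            return result.returncode
    return 0


if __name__ == "__main__":
    sys.exit(run_steps(EXAMPLE_STEPS))
```

Stopping on the first failure matters here because each stage consumes the previous stage's output: indexing a stale Parquet file after a failed dbt run would silently publish old data.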

Project Architecture

  1. Ingestion (scripts/): Fetches trending books and API metadata.
  2. Transformation (dbt_project/): Cleans, joins, and calculates the classic_score using DuckDB, outputting a "Gold" Parquet file.
  3. Indexing (scripts/index_to_elastic.py): Pushes the structured Parquet data into Elasticsearch.
  4. Orchestration (airflow/dags/): Automates the daily execution.
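For the indexing stage, one common approach is to send rows to Elasticsearch's _bulk endpoint, which takes newline-delimited JSON: an action line followed by the document itself. The sketch below builds such a body from plain dictionaries. It is an assumption about how scripts/index_to_elastic.py might work, not a copy of it, and the index name trending_books and the field names in the usage note are hypothetical.

```python
import json


def build_bulk_body(records, index_name, id_field=None):
    """Serialize records into the NDJSON body expected by the Elasticsearch _bulk API."""
    lines = []
    for record in records:
        action = {"index": {"_index": index_name}}
        if id_field is not None:
            # A stable _id makes daily re-runs overwrite documents instead of duplicating them.
            action["index"]["_id"] = str(record[id_field])
        lines.append(json.dumps(action))
        lines.append(json.dumps(record))
    # The bulk API requires the body to end with a newline.
    return "\n".join(lines) + "\n" if lines else ""
```

For example, build_bulk_body([{"book_id": 1, "title": "A"}], "trending_books", id_field="book_id") yields two lines: the index action and the document. The result would be POSTed to /_bulk with the Content-Type application/x-ndjson.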
