
Streaming Data Loader



Overview

Streaming Data Loader is a high-performance, async-powered microservice for processing data streams from Kafka and bulk-loading them into Elasticsearch. It combines modern Python tooling with observability best practices to provide reliable, scalable, and debuggable data pipelines.

It’s fast, resilient, and production-ready, making it an ideal lightweight alternative to complex ETL systems.


Key Features

  • Asynchronous processing with asyncio, aiokafka, and aiohttp (see the consumer sketch after this list)
  • Batch insertions for throughput efficiency
  • Retry & fault-tolerant logic for Kafka and Elasticsearch
  • Configurable via .env and pydantic-settings
  • Docker & Kubernetes ready
  • Prometheus + Grafana monitoring included
  • Tested with pytest, including integration scenarios
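
A minimal sketch of the async batch-consume loop (the topic name, broker address, consumer group, and the bulk_insert sink are all illustrative assumptions, not the project's actual API):

import asyncio
import json

from aiokafka import AIOKafkaConsumer

async def bulk_insert(docs: list[dict]) -> None:
    ...  # hypothetical Elasticsearch sink; see the adapter sketch under Architecture

async def consume_batches() -> None:
    consumer = AIOKafkaConsumer(
        "events",                            # assumed topic name
        bootstrap_servers="localhost:9092",  # assumed broker address
        group_id="streaming-data-loader",    # assumed consumer group
        value_deserializer=json.loads,
        enable_auto_commit=False,
    )
    await consumer.start()
    try:
        while True:
            # getmany() drains up to max_records across all partitions in one call
            batches = await consumer.getmany(timeout_ms=1000, max_records=500)
            docs = [record.value for records in batches.values() for record in records]
            if docs:
                await bulk_insert(docs)  # one bulk request per batch, not per message
                await consumer.commit()  # commit offsets only after a successful insert
    finally:
        await consumer.stop()

if __name__ == "__main__":
    asyncio.run(consume_batches())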

Quick Start

🐳 Docker Compose

docker-compose up --build

☸️ Kubernetes

Step 1: Deploy

./k8s-deploy.sh

Step 2: Cleanup

./k8s-clean.sh

Architecture

This project uses Hexagonal Architecture (Ports and Adapters), ensuring modularity, extensibility, and clean separation of concerns.

Kafka → KafkaConsumerService → EventService → ElasticsearchClientService → Elasticsearch
                                    │
                                    └─→ Metrics + Logging (Prometheus + JSON logs)

Layers

  • Input Ports: Kafka Consumer (aiokafka), deserialization, batching
  • Application Core: Event transformation, validation, retry logic
  • Output Ports: Async Elasticsearch client, bulk insert, failure handling (sketched below)
  • Infrastructure: Docker, Kubernetes, logging, metrics, monitoring
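
To make the ports-and-adapters split concrete, here is a minimal sketch of an output port and an Elasticsearch adapter (class names are illustrative assumptions; the project's real interfaces presumably live under src/ports/ and src/services/):

import json
from typing import Iterable, Protocol

import aiohttp

class EventSinkPort(Protocol):
    # Output port: the application core depends on this interface,
    # never on Elasticsearch directly
    async def bulk_insert(self, docs: Iterable[dict]) -> None: ...

class ElasticsearchBulkAdapter:
    # Adapter: satisfies the port using the Elasticsearch Bulk API over aiohttp
    def __init__(self, base_url: str, index: str) -> None:
        self.base_url = base_url
        self.index = index

    async def bulk_insert(self, docs: Iterable[dict]) -> None:
        lines = []
        for doc in docs:
            lines.append(json.dumps({"index": {"_index": self.index}}))  # action line
            lines.append(json.dumps(doc))                                # document line
        payload = "\n".join(lines) + "\n"  # the Bulk API requires a trailing newline
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"{self.base_url}/_bulk",
                data=payload,
                headers={"Content-Type": "application/x-ndjson"},
            ) as resp:
                resp.raise_for_status()

Because the core only sees EventSinkPort, the adapter can be swapped for a fake in unit tests.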

πŸ” Why Choose This Over Logstash, Flume, etc.?

  • ✅ True async data pipeline: lower latency, better throughput
  • ✅ No heavyweight config DSL: just Python code, pyproject.toml, and .env
  • ✅ Built-in retries & fault handling: robust out of the box (see the sketch after this list)
  • ✅ JSON logging and metric labeling for full observability
  • ✅ Open-source & customizable: perfect for modern data teams
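
As a flavor of that fault handling, a minimal retry helper with exponential backoff (an illustrative policy, not the project's actual implementation):

import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")

async def with_retries(op: Callable[[], Awaitable[T]], attempts: int = 3, backoff: float = 0.5) -> T:
    # Retry a transient failure with exponential backoff: 0.5 s, 1 s, 2 s, ...
    for attempt in range(1, attempts + 1):
        try:
            return await op()
        except Exception:
            if attempt == attempts:
                raise  # attempts exhausted: let the error propagate
            await asyncio.sleep(backoff * 2 ** (attempt - 1))
    raise AssertionError("unreachable for attempts >= 1")

Wrapping a bulk insert then looks like await with_retries(lambda: bulk_insert(docs)).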

Observability

Prometheus scrapes metrics on /metrics (port 8000). Dashboards are automatically provisioned in Grafana.

Metric                              Description
messages_processed_total            Total number of processed messages
errors_total                        Total errors during processing
consume_duration_seconds            Time spent reading from Kafka
response_duration_seconds           Time to insert into Elasticsearch
transform_duration_seconds          Time spent transforming messages
batch_processing_duration_seconds   Full batch processing time
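
A sketch of how these metrics might be declared with prometheus_client (the real definitions live in src/metrics.py and may differ; note that prometheus_client appends _total to Counter names):

from prometheus_client import Counter, Histogram, start_http_server

# Counter names get a "_total" suffix automatically, matching the table above
MESSAGES_PROCESSED = Counter("messages_processed", "Total number of processed messages")
ERRORS = Counter("errors", "Total errors during processing")
CONSUME_DURATION = Histogram("consume_duration_seconds", "Time spent reading from Kafka")
RESPONSE_DURATION = Histogram("response_duration_seconds", "Time to insert into Elasticsearch")
TRANSFORM_DURATION = Histogram("transform_duration_seconds", "Time spent transforming messages")
BATCH_DURATION = Histogram("batch_processing_duration_seconds", "Full batch processing time")

start_http_server(8000)  # expose /metrics on port 8000 for Prometheus to scrape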

Testing

pytest -v

Includes:

  • ✅ Unit tests
  • ✅ Integration tests (Kafka → ES)
  • ✅ Metrics verification (see the example below)
  • ✅ Config validation
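
For instance, a metrics-verification test might look like this (the src.metrics import path and the MESSAGES_PROCESSED object are assumptions based on the project layout):

from prometheus_client import REGISTRY

from src.metrics import MESSAGES_PROCESSED  # assumed module path and metric object

def test_messages_processed_counter_increments():
    # Sample the counter before and after so the test doesn't depend on prior state
    before = REGISTRY.get_sample_value("messages_processed_total") or 0.0
    MESSAGES_PROCESSED.inc()
    after = REGISTRY.get_sample_value("messages_processed_total")
    assert after == before + 1.0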

Technologies

  • Python 3.12 + asyncio
  • Kafka + aiokafka
  • Elasticsearch Bulk API
  • Pydantic, dotenv, Poetry
  • Prometheus + Grafana
  • Docker + docker-compose
  • Kubernetes-ready
  • JSON logging (python-json-logger)

Project Structure

streaming-data-loader/
├── configs/                     # Prometheus / Grafana
├── src/                         # Main source code
│   ├── domain/
│   ├── ports/
│   ├── services/                # Event processing logic
│   ├── config.py                # Settings & env config
│   ├── logger.py                # JSON logger setup
│   └── metrics.py               # Prometheus metrics
├── tests/                       # Unit & integration tests
├── k8s/                         # Kubernetes manifests
├── docker-compose.yml
├── Dockerfile
├── k8s-deploy.sh / k8s-clean.sh # Deploy & cleanup scripts
└── pyproject.toml

If you find this useful...

...give it a star, fork it, or mention it in your next data project!

Author

Anatoly Dudko
GitHub @aDudko • LinkedIn
