sp-202/cloud-native-bigdata-stack

# 🚀 Cloud-Native Big Data Platform on Kubernetes (Raw K8s / AWS)


An enterprise-grade, cloud-native orchestration framework for distributed big data workloads. Built on self-managed Kubernetes (kubeadm) on AWS EC2 with Cilium CNI, this platform provides a decoupled, elastic environment for Apache Spark, Delta Lake, and Airflow, featuring a unified suite of modern interactive notebook environments.


👉 View the v0.3.0 Changelog | Release Notes

Architecture Diagram

## 📖 Introduction

This repository contains a Data Platform as Code (DPaC) implementation, designed to modernize distributed computing by enforcing a strict separation of compute and storage. Leveraging Kubernetes as the primary orchestration plane, the platform eliminates infrastructure silos, enabling teams to deploy and scale production-ready data ecosystems elastically.

Architectural Core Principles:

  • Decoupled Compute/Storage: Persistence is offloaded to S3-compatible object storage (MinIO), allowing compute resources (Spark Executors) to remain ephemeral and cost-efficient.
  • GitOps-Centric Design: Every component, from networking routes to database schemas, is defined as declarative Kubernetes manifests for reproducible deployments. Docker images are built and pushed automatically via GitHub Actions CI/CD on every Dockerfile change.
  • Zero-Trust Ingress: External access is routed through Cloudflare Tunnel (cloudflared) — no inbound firewall ports needed. Traefik runs as a pure internal ClusterIP service.
  • High Observability: Integrated telemetry across the stack provides deep visibility into job performance, resource utilization, and system health.
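With compute and storage decoupled, Spark reaches MinIO through Hadoop's S3A connector. The sketch below shows the handful of `fs.s3a.*` options that matter for an S3-compatible store; the endpoint and credentials are placeholders, not this platform's actual values:

```python
# Sketch: typical spark.hadoop.fs.s3a.* settings for pointing Spark at an
# S3-compatible object store such as MinIO. All values are placeholders.
def s3a_conf(endpoint: str, access_key: str, secret_key: str) -> dict:
    """Build the Spark options needed for S3A access to a custom endpoint."""
    return {
        "spark.hadoop.fs.s3a.endpoint": endpoint,
        "spark.hadoop.fs.s3a.access.key": access_key,
        "spark.hadoop.fs.s3a.secret.key": secret_key,
        # MinIO is usually addressed by path, not virtual-host-style buckets.
        "spark.hadoop.fs.s3a.path.style.access": "true",
        # Plain HTTP is common for in-cluster traffic; enable SSL if exposed.
        "spark.hadoop.fs.s3a.connection.ssl.enabled": "false",
    }

conf = s3a_conf("http://minio.default.svc:9000", "minioadmin", "minioadmin")
print(conf["spark.hadoop.fs.s3a.endpoint"])
```

Because executors hold no state, any pod started with these options can read and write the same Delta tables, which is what makes them safe to scale down aggressively.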

## 🚦 Project Status

| Feature | Status | Notes |
|---|---|---|
| JupyterHub / Spark | ✅ Stable | Core interactive environment |
| Spark Connect | ✅ Stable | Shared Spark gateway for all clients |
| Delta Lake | ✅ Stable | ACID transactions and Time Travel on S3 |
| Hive Metastore | ✅ Stable | Centralized metadata management (Thrift) — Hive 4.1.0 |
| Cilium CNI | ✅ Stable | AWS ENI IPAM mode for native VPC networking |
| StarRocks | ✅ Stable | Verified with Native Delta Catalog (OLAP) |
| Airflow + Git-Sync | ✅ Stable | DAGs auto-synced from Git repository |
| Monitoring Stack | 🏗 Beta | Prometheus, Grafana, Loki, Hubble UI |
| Marimo Notebooks | 🧪 Exp | Reactive Python UI integration |

> [!IMPORTANT]
> Features marked as Experimental (🧪 Exp) are in the development phase. They may have incomplete functionality or require additional configuration.


πŸ— Architecture & Components

The platform is divided into three logical domains:

### 1️⃣ Ingress & Networking

  • Cilium CNI: Pod networking with AWS ENI IPAM mode. Pods receive real VPC IPs for full AWS compatibility.
  • MetalLB: Provides a network load-balancer implementation, assigning a dedicated Elastic IP (44.203.26.241) for internal cluster use.
  • Cloudflare Tunnel (cloudflared): Replaces public LoadBalancer exposure. A 3-replica HA deployment of cloudflared in its own namespace connects outbound to Cloudflare's edge, routing external HTTPS traffic securely to Traefik without opening inbound firewall ports. Includes PodDisruptionBudget, topology spread, NetworkPolicy, RBAC, and Prometheus ServiceMonitor.
  • Traefik Proxy: The unified ingress controller. Operates as ClusterIP (not LoadBalancer) — receives traffic exclusively from cloudflared. Routes requests to internal services. No hostNetwork.
  • Hubble UI: Cilium's observability dashboard for real-time network flow visibility.
  • SSLIP.IO: Automatic DNS resolution for LoadBalancer IPs (internal cluster access).

### 2️⃣ Application Layer (Blue Domain)

  • Apache Airflow (2.x): The workflow orchestrator. It schedules DAGs that trigger Spark jobs, move data, and manage dependencies. Configured with the KubernetesExecutor, which runs each task in its own pod for elastic scaling.
  • Notebook Suite:
    • JupyterHub: Standard interactive environment with Zeppelin features (SQL magic, Scala kernel, z.show()).
    • Marimo: Reactive Python notebooks with high-performance UI components.
    • Polynote: IDE-focused notebook for Scala and multi-language Spark development.
  • Apache Spark (4.1.1): The distributed compute engine, pre-configured with Delta Lake and Hadoop 3.4.1 support.
  • Apache Superset: Enterprise-ready BI. Connects to the platform for data visualization.
  • Hive Metastore (HMS): Standalone Thrift service acting as the central catalog for Spark and StarRocks.
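Clients reach the shared Spark Connect gateway and the Metastore through two URI schemes. A minimal sketch of how those endpoints are composed — the in-cluster service names here are assumptions, while 15002 and 9083 are the stock Spark Connect and Hive Metastore ports:

```python
# Sketch: composing client-side URIs for the shared services.
# Service hostnames are illustrative, not taken from the manifests.
def spark_connect_uri(host: str, port: int = 15002) -> str:
    """Spark Connect gRPC endpoint, e.g. for SparkSession.builder.remote(...)."""
    return f"sc://{host}:{port}"

def metastore_uri(host: str, port: int = 9083) -> str:
    """Hive Metastore Thrift endpoint (the hive.metastore.uris setting)."""
    return f"thrift://{host}:{port}"

print(spark_connect_uri("spark-connect.default.svc"))
print(metastore_uri("hive-metastore.default.svc"))
```

Because every notebook flavor (JupyterHub, Marimo, Polynote) and StarRocks resolve the same Metastore URI, they all see one consistent catalog over the Delta tables in MinIO.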

### 3️⃣ Data & Persistence (Green Domain)

  • OpenEBS (Hostpath): Dynamic storage provisioner that manages local node storage. Replaces static PersistentVolumes for an automated storage lifecycle.
  • MinIO: High-performance Object Storage (S3 Compatible). Acts as the "Data Lake" storage layer.
  • PostgreSQL: The relational metadata backbone. Stores state for Airflow, Superset, and Hive.
  • Redis: In-memory cache used by Superset.
  • StarRocks: High-performance analytical (OLAP) database. Reads directly from MinIO via Delta Native Catalog.
  • Kong Gateway (Experimental): Secondary API gateway for external service management.

## 🛠 Tech Stack

| Component | Version | Role | Usage |
|---|---|---|---|
| Apache Airflow | 2.10.x | Orchestrator | Scheduling ETL pipelines |
| Spark / Delta | 4.1.1 / 4.0.1 | Compute / Format | Distributed processing & ACID tables |
| Hadoop / AWS SDK | 3.4.1 / 1.12.367 | Storage Access | S3A FileSystem optimizations |
| JupyterHub | 4.0.7 | Notebooks | Standard Data Engineering workflow |
| Marimo / Polynote | latest | Notebooks | Reactive & multi-language environments |
| Hive Metastore | 4.1.0 | Catalog | Metadata persistence (arm64 native, JDK 17+) |
| StarRocks | v3.x | OLAP Database | Sub-second queries on large datasets |
| Apache Superset | 4.0.x | BI / Viz | Dashboards & Analytics |
| MinIO | RELEASE.2024 | Object Store | Data Lake (S3 API) |
| Traefik / Kong | v2.10 / v3.x | Ingress / API Gateway | Load balancing & service routing |
| Prometheus / Loki | Custom Helm | Observability | Metrics & centralized logging |
| Grafana | latest | Dashboards | Visualizing cluster health & job metrics |

## ⚡ Deployment Guide

### Prerequisites

  1. AWS EC2 Cluster: Self-managed Kubernetes via kubeadm on EC2 instances (ARM64 Graviton recommended).
  2. Tools: kubectl, helm installed locally.
  3. Permissions: Admin access to the cluster (KUBECONFIG configured).

### Step 1: Clone & Configure

```bash
git clone https://github.com/your-repo/k8s-big-data-platform.git
cd k8s-big-data-platform
```

### Step 2: Build Custom Images (Crucial)

The platform uses optimized images for notebooks and executors. Build and push them to your registry. To customize these images (e.g., adding specific Spark dependencies or Python libraries), explore and modify the Dockerfiles under the docker/ folder before running these scripts:

```bash
# Hive Metastore (arm64-native, Hive 4.1.0)
docker/hive/build.sh

# Spark Executor & Driver Base
docker/spark/build.sh

# User Interfaces
docker/jupyterhub/build.sh
docker/marimo/build.sh
```

> [!TIP]
> CI/CD Auto-Build: Any push to main that modifies a docker/*/Dockerfile will automatically trigger a multi-arch (linux/amd64 + linux/arm64) build and push via GitHub Actions. You only need to run these scripts manually for local testing. See .github/workflows/docker-build.yml.

### Step 3: Deploy Platform

Run the main deployment script. This automation handles namespace creation, CRD installation, and Helm chart deployments.

```bash
chmod +x deploy-v2.sh
./deploy-v2.sh
```

Wait for the script to complete. It may take 5-10 minutes for the LoadBalancer IP to provision.

### Step 4: Access Services

The script will output the dynamic URLs for your services. The base domain $INGRESS_DOMAIN is constructed automatically using the LoadBalancer IP (e.g., 44.203.26.241.sslip.io).

| Service | URL Pattern | Default Credentials |
|---|---|---|
| Airflow | `http://airflow.<INGRESS_DOMAIN>` | admin / admin |
| JupyterHub | `http://jupyterhub.<INGRESS_DOMAIN>` | No token (Dev Mode) |
| Superset | `http://superset.<INGRESS_DOMAIN>` | admin / admin |
| MinIO | `http://minio.<INGRESS_DOMAIN>` | minioadmin / minioadmin |
| Grafana | `http://grafana.<INGRESS_DOMAIN>` | admin / prom-operator |
| Spark UI | `http://spark.<INGRESS_DOMAIN>` | - |
| Spark History | `http://spark-history.<INGRESS_DOMAIN>` | - |
| Hubble UI | `http://hubble.<INGRESS_DOMAIN>` | - |
| Headlamp UI | `http://headlamp.<INGRESS_DOMAIN>` | See token below |
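The URL patterns above all follow the same rule: sslip.io resolves any hostname of the form `<name>.<ip>.sslip.io` to `<ip>`, so each service just gets its own subdomain of the LoadBalancer IP. A small sketch of that derivation (the IP matches the example in this guide; the helper itself is illustrative):

```python
# Sketch: deriving per-service URLs from the LoadBalancer IP via
# sslip.io wildcard DNS. <name>.<ip>.sslip.io resolves to <ip>.
def service_url(service: str, lb_ip: str, scheme: str = "http") -> str:
    return f"{scheme}://{service}.{lb_ip}.sslip.io"

for svc in ("airflow", "jupyterhub", "superset", "minio", "grafana"):
    print(service_url(svc, "44.203.26.241"))
```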

### Headlamp Cluster Admin Token

Generate a token for Headlamp UI access:

```bash
# One-time setup: create an admin-user service account
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: default
EOF

# Generate a token (maximum duration is capped by cluster policy; 48h here)
kubectl create token admin-user -n default --duration=48h
```

Copy the output token and paste it into the Headlamp login page.


## 📊 Observability

The platform comes with a pre-configured monitoring stack:

  • Prometheus Operator: Automatically scrapes metrics from Spark applications and system components.
  • ServiceMonitors: Defines what to monitor (Spark Driver/Executors, Airflow scheduler, Nodes).
  • Grafana Dashboards: Custom JSON dashboards are provided to visualize:
    • JVM Heap usage
    • Active Tasks / Executors
    • CPU/Memory saturation
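For reference, a ServiceMonitor that tells the Prometheus Operator what to scrape has roughly this shape. The names, labels, and port below are illustrative placeholders, not copied from the platform's actual manifests:

```yaml
# Illustrative ServiceMonitor shape; name, labels, and port are placeholders.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-driver
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: spark-driver        # matches the Service exposing driver metrics
  namespaceSelector:
    matchNames: ["default"]
  endpoints:
    - port: metrics            # named port on the target Service
      interval: 30s
      path: /metrics
```

The Operator watches these objects and regenerates the Prometheus scrape configuration automatically, so adding a new target never requires editing Prometheus itself.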

👉 Read the Full Monitoring Guide


## 🔌 Connecting to Data (Superset)

Superset is pre-connected to the internal Postgres and Hive Metastore.

  • To query Data Lake files: Use the Hive connector.
  • To query Metadata: Use the Postgres connector.
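Under the hood, each Superset data source is defined by a SQLAlchemy URI. A hedged sketch of the two URI shapes involved — hostnames, ports, and credentials are placeholders, and the Hive URI targets a HiveServer2-style endpoint whose actual service name depends on the deployment:

```python
from urllib.parse import urlparse

# Placeholder URIs illustrating the two connector shapes Superset expects.
# Hosts, ports, and credentials here are hypothetical, not the platform's.
hive_uri = "hive://hiveserver2.default.svc:10000/default"                         # Data Lake tables
postgres_uri = "postgresql://superset:secret@postgres.default.svc:5432/superset"  # metadata

for uri in (hive_uri, postgres_uri):
    parts = urlparse(uri)
    print(parts.scheme, parts.hostname, parts.port, parts.path)
```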

👉 Read the Superset Connection Guide


## 📂 Repository Structure

```
├── .github/workflows/        # CI/CD — auto-build Docker images on Dockerfile change
│   └── docker-build.yml      # Per-image build jobs (hive, spark, jupyterhub, marimo, k8s-git-sync)
├── docker/                   # Custom image source code and Dockerfiles (Customize here!)
│   ├── hive/                 # Hive 4.1.0 + Postgres JDBC + AWS JARs (arm64 native)
│   ├── jupyterhub/           # Notebook environment with Spark & Scala
│   ├── k8s-git-sync/         # Git-sync sidecar for Airflow DAGs
│   ├── marimo/               # Reactive Python notebook
│   └── spark/                # Golden Spark image (4.1.1, multi-arch)
├── deploy-v2.sh              # Main automation script
├── k8s_diagram.drawio.svg    # Architecture Diagram
├── k8s-platform-v2/          # V2 Source of Truth (Kustomize)
│   ├── 00-core/              # Namespaces, OpenEBS StorageClasses, PVCs
│   ├── 01-networking/        # Cilium, MetalLB, Traefik (ClusterIP), Cloudflare Tunnel, Hubble UI
│   ├── 02-database/          # Postgres, MinIO (S3), Redis
│   ├── 03-apps/              # Airflow, Spark Connect, JupyterHub, Superset
│   ├── 04-configs/           # Global configs, Spark defaults, Ingress domain
│   └── 05-monitoring/        # Prometheus, Grafana, Loki
├── docs/                     # Detailed technical guides
│   ├── notebooks.md          # Guide: JupyterHub, Marimo
│   ├── delta_lake.md         # Guide: ACID tables on S3
│   ├── spark_on_k8s.md       # Deep Dive: Spark Client vs Cluster mode
│   └── airflow.md            # Workflow orchestration
├── airflow-dags/             # Airflow DAG definitions
├── scripts/                  # Utility scripts
├── CHANGELOG.md              # Version history with detailed changes
├── ISSUES.md                 # Known issues and resolutions
├── MONITORING_GUIDE.md       # Observability instructions
├── README.md                 # Entry point (this file)
└── SUPERSET_CONNECTION_GUIDE.md # BI connectivity instructions
```

## 📚 Documentation & References

| Document | Description |
|---|---|
| Changelog | Version history with detailed changes per release |
| Issues & Resolutions | Troubleshooting log of known bugs and fixes |
| Deployment Guide | Step-by-step installation instructions |
| JupyterHub Guide | PySpark jobs and executor configuration |
| Monitoring Guide | Prometheus, Grafana, and Loki setup |
| Superset Connection | BI tool data source connections |
| Lakehouse Architecture | HMS + StarRocks + Spark architecture |
| Docker Images | Build, customize, and version Docker images |
| Platform Docs | Full documentation index |

## 🔧 Manual DAG Deployment (Bypass Git-Sync)

For rapid development and testing, you can bypass the Git synchronizer and manually upload DAGs directly to the cluster. This is useful when you want to test changes immediately without committing to the repository.

### 1. Identify the Git-Sync Pod

The airflow-git-sync pod has write access to the DAGs volume.

```bash
kubectl get pods -n default -l app=airflow-git-sync
# Example output: airflow-git-sync-5669c94965-t52rx
```

### 2. Upload Files

Use kubectl exec to pipe file contents directly to the pod (this bypasses some read-only/ownership issues with kubectl cp).

Syntax:

```bash
cat <local-file> | kubectl exec -i -n default <git-sync-pod-name> -- tee /dags/repo/dags/<filename> > /dev/null
```

Example:

```bash
# Upload DAG file
cat airflow-dags/dags/my_dag.py | kubectl exec -i -n default airflow-git-sync-5669c94965-t52rx -- tee /dags/repo/dags/my_dag.py > /dev/null

# Upload Spark manifest
cat airflow-dags/dags/my_manifest.yaml | kubectl exec -i -n default airflow-git-sync-5669c94965-t52rx -- tee /dags/repo/dags/my_manifest.yaml > /dev/null
```

> [!WARNING]
> Changes made this way are ephemeral and will be overwritten the next time the Git-Sync sidecar pulls from the remote repository. Always commit your final changes to Git.


## 🔧 Spark Configuration Management

The spark-production-defaults ConfigMap provides global defaults for all Spark applications. When you make changes to production-spark-defaults.conf, you must sync them to the cluster:

```bash
# Update ConfigMap from local file
kubectl create configmap spark-production-defaults \
  --from-file=spark-defaults.conf=production-spark-defaults.conf \
  --dry-run=client -o yaml | kubectl apply -f -
```
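The spark-defaults.conf format is simply one `key value` pair per line, with `#` comments and blank lines ignored. A small sketch that parses that format — handy as a local sanity check before syncing; the sample content is illustrative, not the platform's real defaults:

```python
# Sketch: parse spark-defaults.conf-style text into a dict for a quick
# local sanity check. The SAMPLE content below is illustrative only.
SAMPLE = """
# comment lines and blanks are ignored
spark.executor.memory        2g
spark.sql.extensions         io.delta.sql.DeltaSparkSessionExtension
"""

def parse_spark_defaults(text: str) -> dict:
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Key and value are separated by whitespace; value may contain spaces.
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf

print(parse_spark_defaults(SAMPLE))
```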

## About

Cloud-native Big Data suite on Kubernetes: Apache Spark 4.1.1, Delta Lake, Airflow, JupyterHub, and StarRocks. Fully decoupled architecture on self-managed Kubernetes (kubeadm) on AWS.
