An enterprise-grade, cloud-native orchestration framework for distributed big data workloads. Built on self-managed Kubernetes (kubeadm) on AWS EC2 with Cilium CNI, this platform provides a decoupled, elastic environment for Apache Spark, Delta Lake, and Airflow, featuring a unified suite of modern interactive notebook environments.
📖 View the v0.3.0 Changelog | Release Notes
This repository contains a Data Platform as Code (DPaC) implementation, designed to modernize distributed computing by enforcing a strict separation of compute and storage. Leveraging Kubernetes as the primary orchestration plane, the platform eliminates infrastructure silos, enabling teams to deploy and scale production-ready data ecosystems elastically.
- Decoupled Compute/Storage: Persistence is offloaded to S3-compatible object storage (MinIO), allowing compute resources (Spark Executors) to remain ephemeral and cost-efficient.
- GitOps-Centric Design: Every component, from networking routes to database schemas, is defined as declarative Kubernetes manifests for reproducible deployments. Docker images are built and pushed automatically via GitHub Actions CI/CD on every Dockerfile change.
- Zero-Trust Ingress: External access is routed through Cloudflare Tunnel (`cloudflared`); no inbound firewall ports are needed. Traefik runs as a purely internal `ClusterIP` service.
- High Observability: Integrated telemetry across the stack provides deep visibility into job performance, resource utilization, and system health.
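To sketch how the zero-trust path is wired, a minimal `cloudflared` configuration routing all external traffic to an internal Traefik service might look like the following. The tunnel ID, hostname, and service address are placeholders, not the repository's actual values:

```yaml
# config.yaml for cloudflared -- illustrative only; tunnel ID, hostname,
# and Traefik service address are placeholders, not this repo's values
tunnel: <TUNNEL_ID>
credentials-file: /etc/cloudflared/creds/credentials.json
ingress:
  # Route matching external HTTPS traffic to the internal Traefik ClusterIP service
  - hostname: "*.example.com"
    service: http://traefik.traefik.svc.cluster.local:80
  # Catch-all: reject anything that does not match an ingress rule
  - service: http_status:404
```

Because `cloudflared` dials out to Cloudflare's edge, the cluster never needs to accept inbound connections for web traffic.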
| Feature | Status | Notes |
|---|---|---|
| JupyterHub / Spark | ✅ Stable | Core interactive environment |
| Spark Connect | ✅ Stable | Shared Spark gateway for all clients |
| Delta Lake | ✅ Stable | ACID transactions and Time Travel on S3 |
| Hive Metastore | ✅ Stable | Centralized metadata management (Thrift), Hive 4.1.0 |
| Cilium CNI | ✅ Stable | AWS ENI IPAM mode for native VPC networking |
| StarRocks | ✅ Stable | Verified with Native Delta Catalog (OLAP) |
| Airflow + Git-Sync | ✅ Stable | DAGs auto-synced from Git repository |
| Monitoring Stack | 🚧 Beta | Prometheus, Grafana, Loki, Hubble UI |
| Marimo Notebooks | 🧪 Exp | Reactive Python UI integration |
> [!IMPORTANT]
> Features marked as Experimental (🧪 Exp) are in the development phase. They may have incomplete functionality or require additional configuration.
The platform is divided into three logical domains:
- Cilium CNI: Pod networking with AWS ENI IPAM mode. Pods receive real VPC IPs for full AWS compatibility.
- MetalLB: Provides a network load-balancer implementation, assigning a dedicated Elastic IP (`44.203.26.241`) for internal cluster use.
- Cloudflare Tunnel (`cloudflared`): Replaces public LoadBalancer exposure. A 3-replica HA deployment of `cloudflared` in its own namespace connects outbound to Cloudflare's edge, routing external HTTPS traffic securely to Traefik without opening inbound firewall ports. Includes a PodDisruptionBudget, topology spread constraints, NetworkPolicy, RBAC, and a Prometheus ServiceMonitor.
- Traefik Proxy: The unified ingress controller. Operates as `ClusterIP` (not `LoadBalancer`) and receives traffic exclusively from `cloudflared`, routing requests to internal services. No `hostNetwork`.
- Hubble UI: Cilium's observability dashboard for real-time network flow visibility.
- SSLIP.IO: Automatic DNS resolution for LoadBalancer IPs (internal cluster access).
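The sslip.io convention is simple enough to demonstrate locally: any hostname that embeds an IP address resolves to that IP, so every service URL can be derived from the LoadBalancer address alone, with no DNS records to manage. A small sketch using the Elastic IP mentioned above:

```shell
# sslip.io resolves <name>.<ip>.sslip.io to <ip>, so service URLs are
# derived purely from the LoadBalancer IP -- no DNS records needed.
LB_IP="44.203.26.241"
INGRESS_DOMAIN="${LB_IP}.sslip.io"

for svc in airflow jupyterhub superset grafana; do
  echo "http://${svc}.${INGRESS_DOMAIN}"
done
# → http://airflow.44.203.26.241.sslip.io (and one URL per service)
```

Swap `LB_IP` for whatever address MetalLB assigns in your environment; the derived hostnames resolve immediately.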
- Apache Airflow (2.x): The workflow orchestrator. It schedules DAGs that trigger Spark jobs, move data, and manage dependencies, and is configured with the KubernetesExecutor for scaling tasks.
- Notebook Suite:
  - JupyterHub: Standard interactive environment with Zeppelin-style features (SQL magic, Scala kernel, `z.show()`).
  - Marimo: Reactive Python notebooks with high-performance UI components.
  - Polynote: IDE-focused notebook for Scala and multi-language Spark development.
- Apache Spark (4.1.1): The distributed compute engine, pre-configured with Delta Lake and Hadoop 3.4.1 support.
- Apache Superset: Enterprise-ready BI. Connects to the platform for data visualization.
- Hive Metastore (HMS): Standalone Thrift service acting as the central catalog for Spark and StarRocks.
- OpenEBS (Hostpath): Dynamic storage provisioner that manages local node storage. Replaces static PersistentVolumes for an automated storage lifecycle.
- MinIO: High-performance Object Storage (S3 Compatible). Acts as the "Data Lake" storage layer.
- PostgreSQL: The relational metadata backbone. Stores state for Airflow, Superset, and Hive.
- Redis: In-memory cache used by Superset.
- StarRocks: High-performance analytical (OLAP) database. Reads directly from MinIO via Delta Native Catalog.
- Kong Gateway (Experimental): Secondary API gateway for external service management.
| Component | Version | Role | Usage |
|---|---|---|---|
| Apache Airflow | 2.10.x | Orchestrator | Scheduling ETL pipelines |
| Spark / Delta | 4.1.1 / 4.0.1 | Compute / Format | Distributed processing & ACID tables |
| Hadoop / AWS SDK | 3.4.1 / 1.12.367 | Storage Access | S3A FileSystem optimizations |
| JupyterHub | 4.0.7 | Notebooks | Standard Data Engineering workflow |
| Marimo / Polynote | latest | Notebooks | Reactive & multi-language environments |
| Hive Metastore | 4.1.0 | Catalog | Metadata persistence (arm64 native, JDK 17+) |
| StarRocks | v3.x | OLAP Database | Sub-second queries on large datasets |
| Apache Superset | 4.0.x | BI / Viz | Dashboards & Analytics |
| MinIO | RELEASE.2024 | Object Store | Data Lake (S3 API) |
| Traefik / Kong | v2.10 / v3.x | Ingress / API Gateway | Load balancing & service routing |
| Prometheus / Loki | Custom Helm | Observability | Metrics & centralized logging |
| Grafana | latest | Dashboards | Visualizing cluster health & job metrics |
- AWS EC2 Cluster: Self-managed Kubernetes via `kubeadm` on EC2 instances (ARM64 Graviton recommended).
- Tools: `kubectl` and `helm` installed locally.
- Permissions: Admin access to the cluster (`KUBECONFIG` configured).
```bash
git clone https://github.com/your-repo/k8s-big-data-platform.git
cd k8s-big-data-platform
```

The platform uses optimized images for notebooks and executors. Build and push them to your registry. If you want to customize these images (e.g., adding specific Spark dependencies or Python libraries), explore and modify the Dockerfiles within the `docker/` folder before running these scripts:
```bash
# Hive Metastore (arm64-native, Hive 4.1.0)
docker/hive/build.sh

# Spark Executor & Driver Base
docker/spark/build.sh

# User Interfaces
docker/jupyterhub/build.sh
docker/marimo/build.sh
```

> [!TIP]
> CI/CD Auto-Build: Any push to `main` that modifies a `docker/*/Dockerfile` will automatically trigger a multi-arch (linux/amd64 + linux/arm64) build and push via GitHub Actions. You only need to run these scripts manually for local testing. See `.github/workflows/docker-build.yml`.
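The trigger for such a workflow can be sketched as follows. This is illustrative, not copied from the repository's actual `docker-build.yml`; the context path, registry, and tags are placeholders:

```yaml
# Illustrative workflow sketch -- the real .github/workflows/docker-build.yml may differ
on:
  push:
    branches: [main]
    paths:
      - "docker/**/Dockerfile"   # only rebuild when an image definition changes

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-qemu-action@v3    # emulation needed for cross-arch builds
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v6
        with:
          context: docker/spark              # placeholder: one job per image in practice
          platforms: linux/amd64,linux/arm64 # multi-arch, as described above
          push: true
          tags: your-registry/spark:latest
```

The `paths` filter is what limits rebuilds to Dockerfile changes; everything else pushes to `main` without triggering an image build.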
Run the main deployment script. This automation handles namespace creation, CRD installation, and Helm chart deployments.
```bash
chmod +x deploy-v2.sh
./deploy-v2.sh
```

Wait for the script to complete. It may take 5-10 minutes for the LoadBalancer IP to provision.
The script will output the dynamic URLs for your services. The base domain $INGRESS_DOMAIN is constructed automatically using the LoadBalancer IP (e.g., 44.203.26.241.sslip.io).
| Service | URL Pattern | Default Credentials |
|---|---|---|
| Airflow | http://airflow.&lt;INGRESS_DOMAIN&gt; | `admin` / `admin` |
| JupyterHub | http://jupyterhub.&lt;INGRESS_DOMAIN&gt; | No token (Dev Mode) |
| Superset | http://superset.&lt;INGRESS_DOMAIN&gt; | `admin` / `admin` |
| MinIO | http://minio.&lt;INGRESS_DOMAIN&gt; | `minioadmin` / `minioadmin` |
| Grafana | http://grafana.&lt;INGRESS_DOMAIN&gt; | `admin` / `prom-operator` |
| Spark UI | http://spark.&lt;INGRESS_DOMAIN&gt; | - |
| Spark History | http://spark-history.&lt;INGRESS_DOMAIN&gt; | - |
| Hubble UI | http://hubble.&lt;INGRESS_DOMAIN&gt; | - |
| Headlamp UI | http://headlamp.&lt;INGRESS_DOMAIN&gt; | See token below |
Generate a token for Headlamp UI access:

```bash
# One-time setup: create the admin-user service account
kubectl apply -f - <<EOF
apiVersion: v1
kind: ServiceAccount
metadata:
  name: admin-user
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: admin-user-binding
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: admin-user
  namespace: default
EOF

# Generate a token (API servers typically cap the duration; 48h is used here)
kubectl create token admin-user -n default --duration=48h
```

Copy the output token and paste it into the Headlamp login page.
The platform comes with a pre-configured monitoring stack:
- Prometheus Operator: Automatically scrapes metrics from Spark applications and system components.
- ServiceMonitors: Defines what to monitor (Spark Driver/Executors, Airflow scheduler, Nodes).
- Grafana Dashboards: Custom JSON dashboards are provided to visualize:
- JVM Heap usage
- Active Tasks / Executors
- CPU/Memory saturation
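As a sketch of how the Prometheus Operator discovers Spark targets, a ServiceMonitor for driver metrics might look like this. The label selectors, port name, and metrics path are assumptions for illustration, not the platform's actual manifests:

```yaml
# Illustrative ServiceMonitor -- selector labels and port name are assumptions
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: spark-driver
  labels:
    release: prometheus        # must match the Prometheus Operator's label selector
spec:
  selector:
    matchLabels:
      spark-role: driver       # assumes driver Services carry this label
  endpoints:
    - port: metrics            # assumes a named port exposing Prometheus metrics
      interval: 30s
```

Any Service matching the selector is scraped automatically; no Prometheus configuration reload is required.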
📖 Read the Full Monitoring Guide
Superset is pre-connected to the internal Postgres and Hive Metastore.
- To query Data Lake files: Use the Hive connector.
- To query Metadata: Use the Postgres connector.
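In Superset's database connection dialog, the SQLAlchemy URIs follow the usual connector patterns. The hostnames, ports, and credentials below are illustrative defaults and may not match this deployment:

```text
# Hive connector (Data Lake tables) -- host, port, and database are placeholders
hive://hive-server.default.svc.cluster.local:10000/default

# Postgres connector (platform metadata) -- credentials are placeholders
postgresql://superset:superset@postgres.default.svc.cluster.local:5432/superset
```

Verify the actual in-cluster service names with `kubectl get svc` before saving a connection.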
📖 Read the Superset Connection Guide
```text
├── .github/workflows/            # CI/CD: auto-build Docker images on Dockerfile change
│   └── docker-build.yml          # Per-image build jobs (hive, spark, jupyterhub, marimo, k8s-git-sync)
├── docker/                       # Custom image source code and Dockerfiles (customize here!)
│   ├── hive/                     # Hive 4.1.0 + Postgres JDBC + AWS JARs (arm64 native)
│   ├── jupyterhub/               # Notebook environment with Spark & Scala
│   ├── k8s-git-sync/             # Git-sync sidecar for Airflow DAGs
│   ├── marimo/                   # Reactive Python notebook
│   └── spark/                    # Golden Spark image (4.1.1, multi-arch)
├── deploy-v2.sh                  # Main automation script
├── k8s_diagram.drawio.svg        # Architecture diagram
├── k8s-platform-v2/              # V2 source of truth (Kustomize)
│   ├── 00-core/                  # Namespaces, OpenEBS StorageClasses, PVCs
│   ├── 01-networking/            # Cilium, MetalLB, Traefik (ClusterIP), Cloudflare Tunnel, Hubble UI
│   ├── 02-database/              # Postgres, MinIO (S3), Redis
│   ├── 03-apps/                  # Airflow, Spark Connect, JupyterHub, Superset
│   ├── 04-configs/               # Global configs, Spark defaults, Ingress domain
│   └── 05-monitoring/            # Prometheus, Grafana, Loki
├── docs/                         # Detailed technical guides
│   ├── notebooks.md              # Guide: JupyterHub, Marimo
│   ├── delta_lake.md             # Guide: ACID tables on S3
│   ├── spark_on_k8s.md           # Deep dive: Spark client vs cluster mode
│   └── airflow.md                # Workflow orchestration
├── airflow-dags/                 # Airflow DAG definitions
├── scripts/                      # Utility scripts
├── CHANGELOG.md                  # Version history with detailed changes
├── ISSUES.md                     # Known issues and resolutions
├── MONITORING_GUIDE.md           # Observability instructions
├── README.md                     # Entry point (this file)
└── SUPERSET_CONNECTION_GUIDE.md  # BI connectivity instructions
```

| Document | Description |
|---|---|
| Changelog | Version history with detailed changes per release |
| Issues & Resolutions | Troubleshooting log of known bugs and fixes |
| Deployment Guide | Step-by-step installation instructions |
| JupyterHub Guide | PySpark jobs and executor configuration |
| Monitoring Guide | Prometheus, Grafana, and Loki setup |
| Superset Connection | BI tool data source connections |
| Lakehouse Architecture | HMS + StarRocks + Spark architecture |
| Docker Images | Build, customize, and version Docker images |
| Platform Docs | Full documentation index |
For rapid development and testing, you can bypass the Git synchronizer and manually upload DAGs directly to the cluster. This is useful when you want to test changes immediately without committing to the repository.
The airflow-git-sync pod has write access to the DAGs volume.
```bash
kubectl get pods -n default -l app=airflow-git-sync
# Example output: airflow-git-sync-5669c94965-t52rx
```

Use `kubectl exec` to pipe file contents directly to the pod (this bypasses some read-only/ownership issues with `kubectl cp`).
Syntax:

```bash
cat <local-file> | kubectl exec -i -n default <git-sync-pod-name> -- tee /dags/repo/dags/<filename> > /dev/null
```

Example:
```bash
# Upload a DAG file
cat airflow-dags/dags/my_dag.py | kubectl exec -i -n default airflow-git-sync-5669c94965-t52rx -- tee /dags/repo/dags/my_dag.py > /dev/null

# Upload a Spark manifest
cat airflow-dags/dags/my_manifest.yaml | kubectl exec -i -n default airflow-git-sync-5669c94965-t52rx -- tee /dags/repo/dags/my_manifest.yaml > /dev/null
```

> [!WARNING]
> Changes made this way are ephemeral and will be overwritten the next time the Git-Sync sidecar pulls from the remote repository. Always commit your final changes to Git.
The spark-production-defaults ConfigMap provides global defaults for all Spark applications. When you make changes to production-spark-defaults.conf, you must sync them to the cluster:
```bash
# Update the ConfigMap from the local file
kubectl create configmap spark-production-defaults \
  --from-file=spark-defaults.conf=production-spark-defaults.conf \
  --dry-run=client -o yaml | kubectl apply -f -
```
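As a hedged illustration of what such a defaults file typically contains, the entries below wire Spark to MinIO over S3A, enable Delta Lake, and point at the Hive Metastore. The endpoints and credentials are placeholders, not the repository's actual values:

```properties
# Illustrative spark-defaults entries -- endpoints and credentials are placeholders
spark.hadoop.fs.s3a.endpoint            http://minio.default.svc.cluster.local:9000
spark.hadoop.fs.s3a.path.style.access   true
spark.hadoop.fs.s3a.access.key          minioadmin
spark.hadoop.fs.s3a.secret.key          minioadmin
spark.sql.extensions                    io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog         org.apache.spark.sql.delta.catalog.DeltaCatalog
spark.hadoop.hive.metastore.uris        thrift://hive-metastore.default.svc.cluster.local:9083
```

Note that Spark reads `spark-defaults.conf` at application startup, so already-running drivers and executors keep their old configuration until restarted.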