⚡ Spark Data Dashboard

A data dashboard prototype that pairs Apache Spark for large-scale data processing with a Python visualization layer built on Plotly Dash.

📸 Preview

Overview Tab — KPI Cards + Revenue Trend

Four KPI metric cards, a revenue time series with a Spark rolling-window average, a region breakdown, and a category donut chart.

Trends Tab — Activity Heatmap + Device Split

Session activity heatmap (day-of-week × hour), stacked device area chart, and month-over-month revenue growth powered by Spark lag window functions.

Deep Dive Tab — Product Analytics

Discount vs. revenue bubble scatter (500 products), inventory health status, and a top-10 product leaderboard ranked by the Spark rank() window function.

🏗 Architecture

┌──────────────────────────────────────────────────────────┐
│               Synthetic Data (130k+ rows)                │
│   transactions.csv · sessions.csv · inventory.csv        │
└────────────────────┬─────────────────────────────────────┘
                     │
┌────────────────────▼─────────────────────────────────────┐
│             Apache Spark 3.5 (PySpark)                   │
│  • Ingestion & schema enforcement                         │
│  • groupBy aggregations (revenue, sessions, inventory)    │
│  • Window functions: rolling avg, lag (MoM), rank (top-N) │
│  • Filter pushdown for interactive filtering              │
│  • DataFrames cached for fast repeated queries           │
└────────────────────┬─────────────────────────────────────┘
                     │  Arrow-backed toPandas()
┌────────────────────▼─────────────────────────────────────┐
│          Plotly Dash 2.x (Visualization Layer)           │
│  • 3-tab dashboard: Overview · Trends · Deep Dive        │
│  • Real-time filter callbacks (date, region, category)    │
│  • Dark glassmorphism UI with animated KPI cards         │
└──────────────────────────────────────────────────────────┘

✨ Features

Data Processing (Spark)

Spark Feature	Used For
`groupBy` + `agg`	Revenue, session & inventory aggregations
`Window.rowsBetween`	3-month rolling revenue average
`lag()` window function	Month-over-month growth % calculation
`rank()` window function	Top-10 product leaderboard
Filter pushdown	All interactive dashboard filters
Arrow `toPandas()`	High-speed Spark → Pandas bridge
`DataFrame.cache()`	Warm-up for fast repeated queries
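
The three window calculations in the table come down to simple arithmetic over ordered rows. The sketch below reproduces that arithmetic in plain Python rather than Spark, so it is not the code in transforms.py; the function names and sample figures are hypothetical. In PySpark the rough counterparts would be avg() over Window.rowsBetween(-2, 0) for the 3-month rolling average, lag() with an offset of 1 for the previous month, and rank() over a descending revenue ordering.

```python
from typing import Dict, List, Optional


def rolling_average(values: List[float], window: int = 3) -> List[float]:
    """Trailing mean over the current row and up to window-1 preceding rows."""
    out = []
    for i in range(len(values)):
        frame = values[max(0, i - window + 1): i + 1]
        out.append(sum(frame) / len(frame))
    return out


def mom_growth_pct(values: List[float]) -> List[Optional[float]]:
    """Percent change versus the previous row; None when there is no usable previous row."""
    out: List[Optional[float]] = []
    for i, current in enumerate(values):
        previous = values[i - 1] if i > 0 else None
        if previous is None or previous == 0:
            out.append(None)
        else:
            out.append((current - previous) / previous * 100.0)
    return out


def rank_desc(scores: Dict[str, float]) -> Dict[str, int]:
    """Rank keys by descending score; ties share a rank and the next rank is skipped."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    ranks: Dict[str, int] = {}
    for position, (name, score) in enumerate(ordered, start=1):
        if position > 1 and score == ordered[position - 2][1]:
            ranks[name] = ranks[ordered[position - 2][0]]
        else:
            ranks[name] = position
    return ranks


if __name__ == "__main__":
    monthly_revenue = [100.0, 200.0, 300.0, 400.0]  # made-up figures
    print(rolling_average(monthly_revenue))
    print(mom_growth_pct(monthly_revenue))
    print(rank_desc({"widget": 300.0, "gadget": 300.0, "gizmo": 100.0}))
```

Like Spark's rank(), rank_desc gives tied rows the same rank and skips the following position; dense_rank() would not skip.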

Visualization (Plotly Dash)

KPI Cards — Total Revenue, Orders, Avg Order Value, Conversion Rate
Line chart — Monthly revenue with rolling average overlay
Heatmap — Session activity by day-of-week and hour
Stacked Area — Device split (Desktop / Mobile / Tablet) over time
Bar charts — Revenue by region, MoM growth
Bubble Scatter — Discount % vs Revenue (500 products)
Donut charts — Category mix, Inventory health
Data table — Top 10 ranked products
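
As a rough illustration of what the four KPI cards summarize, here is a minimal plain-Python sketch. The README does not define the metrics, so the formulas are assumptions: average order value is taken as revenue divided by order count, and conversion rate as orders divided by sessions. The Kpis class and compute_kpis function are hypothetical names, not part of the project.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Kpis:
    total_revenue: float
    orders: int
    avg_order_value: float
    conversion_rate_pct: float


def compute_kpis(order_revenues: Sequence[float], session_count: int) -> Kpis:
    """Compute the four headline metrics from per-order revenue and a session count."""
    orders = len(order_revenues)
    total_revenue = float(sum(order_revenues))
    # Guard against division by zero when a filter selects no orders or sessions.
    avg_order_value = total_revenue / orders if orders else 0.0
    conversion_rate_pct = orders / session_count * 100.0 if session_count else 0.0
    return Kpis(total_revenue, orders, avg_order_value, conversion_rate_pct)


if __name__ == "__main__":
    # Made-up inputs: three orders observed across sixty sessions.
    print(compute_kpis([20.0, 30.0, 50.0], session_count=60))
```

The zero guards matter in an interactive dashboard, where a narrow date or region filter can legitimately return an empty selection.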

📁 Project Structure

spark-data-dashboard/
├── app.py                    # Dash app entry point
├── run.sh                    # Startup script (sets correct JAVA_HOME)
├── requirements.txt          # All Python dependencies
├── spark_engine/
│   ├── session.py            # SparkSession factory (Java 11 config)
│   ├── data_generator.py     # Synthetic dataset generator (Faker + NumPy)
│   └── transforms.py         # All Spark transformations
├── viz/
│   ├── components.py         # Reusable UI components (KPI card, filters)
│   ├── layouts.py            # 3-tab chart layouts (Plotly)
│   └── callbacks.py          # Dash callbacks (filter → Spark → chart)
└── assets/
    └── styles.css            # Dark glassmorphism CSS theme

Note: data/ is excluded from git via .gitignore. Raw CSVs are auto-generated on first run using data_generator.py.
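
The generate-on-first-run behavior can be sketched as a "create the file only if it is missing" check. This is not the project's data_generator.py, which uses Faker and NumPy; the version below substitutes the standard library's random and csv modules so it runs without extra dependencies. The column names, regions, categories, and value ranges are all hypothetical.

```python
import csv
import random
from datetime import date, timedelta
from pathlib import Path

# Hypothetical values; the real generator's schema is not shown in this README.
REGIONS = ["North", "South", "East", "West"]
CATEGORIES = ["Electronics", "Apparel", "Home"]


def ensure_transactions_csv(path: Path, rows: int = 1000, seed: int = 42) -> bool:
    """Write a synthetic transactions CSV if none exists. Returns True if a file was generated."""
    if path.exists():
        return False
    path.parent.mkdir(parents=True, exist_ok=True)
    rng = random.Random(seed)  # seeded so repeated generation is reproducible
    start = date(2023, 1, 1)
    with path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["order_id", "order_date", "region", "category", "revenue"])
        for order_id in range(1, rows + 1):
            order_date = start + timedelta(days=rng.randrange(365))
            writer.writerow([
                order_id,
                order_date.isoformat(),
                rng.choice(REGIONS),
                rng.choice(CATEGORIES),
                round(rng.uniform(5.0, 500.0), 2),
            ])
    return True


if __name__ == "__main__":
    generated = ensure_transactions_csv(Path("data") / "transactions.csv")
    print("generated" if generated else "already present")
```

Because the function returns early when the file exists, deleting the data/ directory is enough to force regeneration in this sketch.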

🚀 Quick Start

Prerequisites

Requirement	Version	Notes
Python	3.9+
Java	11 (LTS)	⚠️ PySpark 3.5 officially supports Java 8, 11, and 17; Java 21+ is not supported
pip	Latest

Download Java 11: Adoptium Temurin 11

Installation

# 1. Clone the repo
git clone https://github.com/YOUR_USERNAME/spark-data-dashboard.git
cd spark-data-dashboard

# 2. Install dependencies
pip install -r requirements.txt

# 3. Run the dashboard
bash run.sh

Then open http://localhost:8050 in your browser.

The first run generates 130,000+ rows of synthetic data automatically.

Manual Run (if not using run.sh)

# Set Java 11 explicitly (adjust path for your OS)
# macOS:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/temurin-11.jdk/Contents/Home
# Linux:
# export JAVA_HOME=/usr/lib/jvm/java-11-openjdk-amd64

python app.py

🛠 Tech Stack

Layer	Technology
Data Processing	Apache Spark 3.5.1 (PySpark)
Visualization	Plotly Dash 2.17
Charts	Plotly Express + Graph Objects
UI Components	Dash Bootstrap Components
Styling	Vanilla CSS (dark glassmorphism)
Data Generation	Faker + NumPy
Python	3.9+

⚠️ Known Limitations

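Java 11 required — The project is configured for Java 11. PySpark 3.5.x officially supports only Java 8, 11, and 17; on newer JDKs, startup can fail in javax.security.auth.Subject.getSubject(), which is deprecated for removal and throws UnsupportedOperationException on recent releases. Run with Java 11.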
Local mode only — Runs Spark in local[*] mode; not configured for a cluster.
Synthetic data — Dataset is procedurally generated; swap data_generator.py with a real data source for production use.
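
Because the Java requirement is the most likely cause of a failed start, a small pre-flight check can report the active Java version before Spark is launched. The sketch below is an illustration only and is not part of run.sh or session.py; parse_java_major and detect_java_major are hypothetical helper names. It relies on the fact that java -version prints its banner to stderr.

```python
import re
import subprocess
from typing import Optional


def parse_java_major(version_output: str) -> Optional[int]:
    """Extract the major Java version from the text printed by `java -version`."""
    match = re.search(r'version "(\d+)(?:\.(\d+))?', version_output)
    if match is None:
        return None
    first = int(match.group(1))
    # Java 8 and earlier report themselves as 1.x, for example "1.8.0_292".
    if first == 1 and match.group(2) is not None:
        return int(match.group(2))
    return first


def detect_java_major(java_cmd: str = "java") -> Optional[int]:
    """Run `java -version` and parse the result; None if Java is missing or unparseable."""
    try:
        result = subprocess.run(
            [java_cmd, "-version"], capture_output=True, text=True, check=False
        )
    except FileNotFoundError:
        return None
    return parse_java_major(result.stderr or result.stdout)


if __name__ == "__main__":
    major = detect_java_major()
    if major is None:
        print("Java not found on PATH; install Java 11 and set JAVA_HOME.")
    elif major != 11:
        print(f"Found Java {major}; this project expects Java 11.")
    else:
        print("Java 11 detected.")
```

Note that this checks the java binary on PATH, which may differ from the one JAVA_HOME points to.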

📄 License

MIT License — see LICENSE for details.

Built with ⚡ Apache Spark · 📊 Plotly Dash · 🐍 Python
