🚀 Spark + Jupyter on Docker

This project sets up an Apache Spark cluster (1 master, 2 workers) and a Jupyter Notebook environment for running PySpark code. It uses Docker Compose for easy orchestration.

📂 Project Structure

.
├── docker-compose.yml   # Compose file to start all services
├── apps/                # Your Spark / PySpark jobs (mounted into containers)
├── data/                # Data shared between Spark services
└── requirements.txt     # Python dependencies for Jupyter

⚙️ Services

1️⃣ Spark Master

Image: bitnami/spark:3.5.0
Ports:
- 8080 → Spark Master Web UI (http://localhost:8080)
- 7077 → Spark Master RPC (cluster communication)
Environment:
- Runs in master mode.
- Exposes itself as spark://spark-master:7077.
Volumes:
- ./apps → /opt/spark-apps
- ./data → /opt/spark-data

2️⃣ Spark Workers (2 replicas)

Image: bitnami/spark:3.5.0
Mode: Worker
Connects to: spark://spark-master:7077
Environment:
- SPARK_WORKER_MEMORY=2000m
- SPARK_DRIVER_MEMORY=2000m
Scale up or down by adding or removing worker services in docker-compose.yml.
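
The worker memory setting caps what a single executor can request from that worker. A minimal sketch of that check, assuming Spark's usual size suffixes (m for mebibytes, g for gibibytes); the helper names parse_spark_memory_mib and executor_fits are hypothetical and not part of Spark:

```python
import re

# Multipliers to MiB for the size suffixes handled by this sketch.
_UNIT_TO_MIB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def parse_spark_memory_mib(value: str) -> float:
    """Convert a Spark-style memory string such as '2000m' or '1G' to MiB."""
    match = re.fullmatch(r"(\d+)([kmgt])b?", value.strip().lower())
    if match is None:
        raise ValueError(f"unsupported memory string: {value!r}")
    amount, unit = match.groups()
    return int(amount) * _UNIT_TO_MIB[unit]


def executor_fits(worker_memory: str, executor_memory: str) -> bool:
    """Return True if one executor of the requested size fits on one worker."""
    requested = parse_spark_memory_mib(executor_memory)
    available = parse_spark_memory_mib(worker_memory)
    return requested <= available


if __name__ == "__main__":
    # Worker memory from this README versus two possible executor requests.
    print(executor_fits("2000m", "1G"))  # 1024 MiB requested, 2000 MiB available
    print(executor_fits("2000m", "2G"))  # 2048 MiB requested, 2000 MiB available
```

If the request does not fit, a standalone cluster typically leaves the application waiting for resources rather than failing outright, so a check like this can save some debugging time.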

3️⃣ Jupyter Notebook

Image: jupyter/pyspark-notebook:latest
Port: 8888 → localhost:8888
Volumes:
- ./apps → /home/jovyan/work (your notebooks and scripts)
- ./requirements.txt → /tmp/requirements.txt (extra dependencies)

Environment:

Configured to connect directly to the Spark cluster:

SPARK_MASTER=spark://spark-master:7077
SPARK_LOCAL_IP=jupyter
SPARK_DRIVER_HOST=jupyter
SPARK_DRIVER_BIND_ADDRESS=0.0.0.0
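
These variables can also be read explicitly when building a session, which makes the driver's advertised address visible in the notebook. A minimal sketch, assuming the variables are set as listed above; spark_conf_from_env and build_session are hypothetical helpers, and whether the image already applies some of these values automatically is not confirmed here. spark.driver.host and spark.driver.bindAddress are standard Spark properties.

```python
import os


def spark_conf_from_env(env=None):
    """Map the Jupyter service's environment variables to Spark properties."""
    env = os.environ if env is None else env
    # Fall back to local mode when SPARK_MASTER is not set (an assumption of this sketch).
    conf = {"spark.master": env.get("SPARK_MASTER", "local[*]")}
    if env.get("SPARK_DRIVER_HOST"):
        conf["spark.driver.host"] = env["SPARK_DRIVER_HOST"]
    if env.get("SPARK_DRIVER_BIND_ADDRESS"):
        conf["spark.driver.bindAddress"] = env["SPARK_DRIVER_BIND_ADDRESS"]
    return conf


def build_session(app_name="EnvConfiguredApp"):
    """Create a SparkSession from the environment (requires pyspark)."""
    # Imported lazily so spark_conf_from_env can be used without pyspark installed.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.appName(app_name)
    for key, value in spark_conf_from_env().items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

In a notebook, calling build_session() would then replace a hard-coded master URL.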

▶️ Usage

1. Start the cluster

docker compose up -d

2. Access UIs

Spark Master UI: http://localhost:8080
Jupyter UI: http://localhost:8888

Jupyter prints a token URL (e.g. http://127.0.0.1:8888/?token=...) in its container logs; view it with docker logs jupyter.
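
To confirm from a script that both workers have registered, the master's status can be read as JSON. A minimal sketch, assuming the standalone master serves its state at /json/ on the Web UI port and that the payload contains a "status" field and a "workers" list whose entries have a "state" field; the helper names are hypothetical and the field names should be checked against your Spark version:

```python
import json
from urllib.request import urlopen


def summarize_master_state(state: dict) -> dict:
    """Reduce the master's JSON state to a few health indicators."""
    workers = state.get("workers", [])
    alive = [w for w in workers if w.get("state") == "ALIVE"]
    return {
        "status": state.get("status"),
        "workers_total": len(workers),
        "workers_alive": len(alive),
    }


def fetch_master_state(base_url="http://localhost:8080", timeout=5.0) -> dict:
    """Download and parse the master's JSON state (needs the cluster running)."""
    with urlopen(f"{base_url}/json/", timeout=timeout) as response:
        return json.load(response)
```

With the cluster up, summarize_master_state(fetch_master_state()) should report two alive workers for the setup described here.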

📦 Installing Custom Python Packages

Add dependencies to requirements.txt.

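Install them into the running Jupyter container:

docker exec -it jupyter pip install -r /tmp/requirements.txt

Packages installed this way are lost when the container is recreated. To make them persistent, bake them into a custom Jupyter image, and keep them listed in requirements.txt so the command above can be re-run.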
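
After installing, it can be useful to confirm from a notebook that everything in requirements.txt is actually present. A rough sketch using only the standard library; the helper names are hypothetical, and the parsing handles only simple lines such as name==1.0 or name>=2 (not URLs or every pip option):

```python
import re
from importlib import metadata


def parse_requirement_names(lines):
    """Extract bare package names from simple requirements.txt lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line or line.startswith("-"):  # skip blanks and pip options like -r
            continue
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in the current environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


def check_requirements_file(path="/tmp/requirements.txt"):
    """Report which packages listed in the mounted requirements file are missing."""
    with open(path, encoding="utf-8") as handle:
        return missing_packages(parse_requirement_names(handle))
```

Calling check_requirements_file() in a notebook cell returns an empty list when every listed package is installed.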

⚡ Connecting to Spark in Jupyter

Example code in notebook:

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("ExampleApp")
    .master("spark://spark-master:7077")
    .config("spark.executor.memory", "1G")
    .config("spark.driver.memory", "1G")
    .getOrCreate()
)

print(spark.version)
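
If getOrCreate() appears to hang, a quick check that the master's RPC port is reachable from the Jupyter container can narrow things down. A minimal sketch using only the standard library; parse_master_url and master_is_reachable are hypothetical helpers, not part of PySpark:

```python
import socket
from urllib.parse import urlparse


def parse_master_url(master_url: str):
    """Split a spark://host:port URL into (host, port)."""
    parsed = urlparse(master_url)
    if parsed.scheme != "spark" or not parsed.hostname or parsed.port is None:
        raise ValueError(f"expected spark://host:port, got {master_url!r}")
    return parsed.hostname, parsed.port


def master_is_reachable(master_url: str, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to the master's RPC port succeeds."""
    host, port = parse_master_url(master_url)
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

For this setup, master_is_reachable("spark://spark-master:7077") would be run from inside the Jupyter container, since the spark-master hostname only resolves on the Compose network.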

❗ Python Version Mismatch (Important)

Jupyter (pyspark-notebook) uses Python 3.12.
Spark workers (bitnami/spark) use Python 3.11.

This causes errors like:

PySparkRuntimeError: [PYTHON_VERSION_MISMATCH]
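
PySpark compares only the major and minor version between driver and workers, so 3.11.4 and 3.11.9 are compatible while 3.12 and 3.11 are not. A small sketch of that comparison; the helper names are hypothetical, and the worker version has to be looked up separately (for example by running python --version in a worker container):

```python
import sys


def major_minor(version: str) -> tuple:
    """Return (major, minor) from a version string such as '3.12.1'."""
    major, minor = version.split(".")[:2]
    return int(major), int(minor)


def versions_match(driver_version: str, worker_version: str) -> bool:
    """True when driver and worker agree on major.minor, as PySpark requires."""
    return major_minor(driver_version) == major_minor(worker_version)


def driver_python_version() -> str:
    """Major.minor version of the interpreter running this notebook kernel."""
    return "%d.%d" % sys.version_info[:2]
```

In a notebook, versions_match(driver_python_version(), "3.11") tells you whether the current kernel would be accepted by workers running Python 3.11.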

🔧 Fix Options

Quick fix: Use Python 3.11 kernel in Jupyter (pin version in Conda).
Long-term fix: Rebuild Spark worker images with Python 3.12 to match Jupyter.

🔑 Setting Jupyter Password Instead of Token

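Create a password hash (run this where Jupyter is installed, e.g. inside the Jupyter container; on Notebook 6 and earlier the import is notebook.auth instead of jupyter_server.auth):

python -c "from jupyter_server.auth import passwd; print(passwd())"

Example: sha1:abcd1234... (recent versions print an argon2:... hash instead)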

Add it to your Jupyter service in docker-compose.yml:

environment:
    - JUPYTER_TOKEN=
    - JUPYTER_PASSWORD=sha1:abcd1234...

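Recreate the Jupyter container so the new environment takes effect (docker compose restart does not pick up changes to docker-compose.yml):
```
docker compose up -d jupyter
```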

Now you can log in with your chosen password. Jupyter's login page asks only for the password; jovyan is the container user, not a login name.

🛑 Stopping the Cluster

docker compose down
