This project sets up an Apache Spark cluster (1 master, 2 workers) and a Jupyter Notebook environment for running PySpark code. It uses Docker Compose for easy orchestration.
.
├── docker-compose.yml # Compose file to start all services
├── apps/ # Your Spark / PySpark jobs (mounted into containers)
├── data/ # Data shared between Spark services
├── requirements.txt # Python dependencies for Jupyter
Spark master (spark-master):
- Image: bitnami/spark:3.5.0
- Ports:
  - 8080 → Spark Master Web UI (http://localhost:8080)
  - 7077 → Spark Master RPC (cluster communication)
- Environment:
  - Runs in master mode.
  - Exposes itself as spark://spark-master:7077.
- Volumes:
  - ./apps → /opt/spark-apps
  - ./data → /opt/spark-data
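Once the stack is running, the master's state can also be read programmatically rather than through the web page. The sketch below assumes the standalone master serves a JSON view of its state at `/json/` on the web UI port, with a `workers` list whose entries carry a `state` field; the helper names are hypothetical and the field names should be checked against your Spark version.

```python
import json
from urllib.request import urlopen

# Assumption: the standalone master's web UI exposes its state as JSON here.
MASTER_JSON_URL = "http://localhost:8080/json/"


def summarize_master_state(state: dict) -> dict:
    """Reduce the master's JSON state to a few fields worth checking."""
    workers = state.get("workers", [])
    alive = [w for w in workers if w.get("state") == "ALIVE"]
    return {
        "master_url": state.get("url"),
        "status": state.get("status"),
        "alive_workers": len(alive),
    }


def fetch_master_state(url: str = MASTER_JSON_URL, timeout: float = 5.0) -> dict:
    """Download and decode the master's JSON state (needs a running cluster)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    try:
        summary = summarize_master_state(fetch_master_state())
    except (OSError, ValueError) as exc:
        # OSError covers connection failures; ValueError covers non-JSON replies.
        print(f"Could not read master state: {exc}")
    else:
        print(summary)
```

With the two workers described in this README registered, `alive_workers` would be expected to read 2.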
Spark workers:
- Image: bitnami/spark:3.5.0
- Mode: Worker
- Connects to: spark://spark-master:7077
- Environment:
  - SPARK_WORKER_MEMORY=2000m
  - SPARK_DRIVER_MEMORY=2000m
- Scale up or down by adding or removing worker services in docker-compose.yml.
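The worker memory limit determines how large an executor a job can request. As a rough sketch, the helper below converts Spark-style size strings to MiB and counts how many executors of a given size fit into one worker's memory. It is a simplification that assumes binary units (1g = 1024m) and ignores cores and any memory overhead; the function names are hypothetical.

```python
import re

# Multipliers expressed in MiB, assuming binary units (1g = 1024m).
_UNIT_TO_MIB = {"k": 1 / 1024, "m": 1, "g": 1024, "t": 1024 * 1024}


def to_mib(size: str) -> float:
    """Convert a size string such as '2000m' or '1G' to MiB."""
    match = re.fullmatch(r"\s*(\d+)\s*([kmgt])b?\s*", size.lower())
    if match is None:
        raise ValueError(f"unsupported memory string: {size!r}")
    number, unit = match.groups()
    return int(number) * _UNIT_TO_MIB[unit]


def executors_per_worker(worker_memory: str, executor_memory: str) -> int:
    """Count how many executors of the requested size fit in one worker."""
    return int(to_mib(worker_memory) // to_mib(executor_memory))


if __name__ == "__main__":
    # Worker limit from this README versus two candidate executor sizes.
    for requested in ("1G", "2G"):
        fits = executors_per_worker("2000m", requested)
        print(f"executor {requested}: {fits} per worker")
```

Under these assumptions a 2G executor does not fit on a 2000m worker, because 2G is 2048 MiB; a request like that would leave the application waiting for resources.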
Jupyter:
- Image: jupyter/pyspark-notebook:latest
- Port: 8888 → localhost:8888
- Volumes:
  - ./apps → /home/jovyan/work (your notebooks and scripts)
  - ./requirements.txt → /tmp/requirements.txt (extra dependencies)
- Environment, configured to connect directly to the Spark cluster:
  - SPARK_MASTER=spark://spark-master:7077
  - SPARK_LOCAL_IP=jupyter
  - SPARK_DRIVER_HOST=jupyter
  - SPARK_DRIVER_BIND_ADDRESS=0.0.0.0
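Whether Spark picks these variables up on its own depends on the image, so a notebook can read them explicitly instead of hard-coding the master URL. The following is a minimal sketch: the variable names come from the Jupyter service above, while the mapping onto `spark.driver.host` and `spark.driver.bindAddress`, the `local[*]` fallback, and the helper names are assumptions made for illustration.

```python
import os

# Environment variable -> Spark property. The mapping is an assumption of
# this sketch, not something the compose file is known to enforce.
ENV_TO_SPARK_CONF = {
    "SPARK_DRIVER_HOST": "spark.driver.host",
    "SPARK_DRIVER_BIND_ADDRESS": "spark.driver.bindAddress",
}


def spark_settings_from_env(env=os.environ):
    """Return (master_url, conf_dict) derived from the environment."""
    # Fall back to local mode when SPARK_MASTER is unset (assumed default).
    master = env.get("SPARK_MASTER", "local[*]")
    conf = {
        prop: env[name]
        for name, prop in ENV_TO_SPARK_CONF.items()
        if name in env
    }
    return master, conf


def build_session(app_name="EnvConfiguredApp"):
    """Create a SparkSession from the environment (requires pyspark)."""
    # Imported lazily so the helper above works without pyspark installed.
    from pyspark.sql import SparkSession

    master, conf = spark_settings_from_env()
    builder = SparkSession.builder.appName(app_name).master(master)
    for key, value in conf.items():
        builder = builder.config(key, value)
    return builder.getOrCreate()
```

Inside the Jupyter container, `spark = build_session()` would then connect to whatever master the compose file points at.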
Start the cluster:
    docker compose up -d
- Spark Master UI: http://localhost:8080
- Jupyter UI: http://localhost:8888
Jupyter prints a token URL in its startup logs (e.g. http://127.0.0.1:8888/?token=...).
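A quick way to confirm that both UIs came up is to test whether their published ports accept TCP connections. The sketch below uses only the standard library; the port numbers are the ones listed in this README, and the helper name is hypothetical. It only shows that something is listening, not that the service behind the port is healthy.

```python
import socket

# Host ports published by the stack described in this README.
SERVICES = {"Spark Master UI": 8080, "Jupyter UI": 8888}


def is_port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds in time."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution errors.
        return False


if __name__ == "__main__":
    for name, port in SERVICES.items():
        state = "up" if is_port_open("localhost", port) else "not reachable"
        print(f"{name} (localhost:{port}): {state}")
```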
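Installing extra Python packages:
- Add dependencies to requirements.txt.
- Install them in the running Jupyter container:
      docker exec -it jupyter pip install -r /tmp/requirements.txt
- Packages installed this way are lost when the container is recreated. To make installs persistent, bake them into a custom Jupyter image; otherwise keep them listed in requirements.txt and rerun the install command.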
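After recreating the container it can be handy to see which listed packages are no longer installed. This is a rough sketch using `importlib.metadata` from the standard library; it only extracts package names, ignores version specifiers and pip options, and assumes the file is mounted at `/tmp/requirements.txt` as described above. The function names are hypothetical.

```python
import re
from importlib import metadata


def requirement_names(lines):
    """Extract bare package names from requirements.txt-style lines."""
    names = []
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop comments
        if not line or line.startswith("-"):
            continue  # skip blanks and pip options such as -r or -e
        match = re.match(r"[A-Za-z0-9][A-Za-z0-9._-]*", line)
        if match:
            names.append(match.group(0))
    return names


def missing_packages(names):
    """Return the names that are not installed in this environment."""
    missing = []
    for name in names:
        try:
            metadata.version(name)
        except metadata.PackageNotFoundError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    path = "/tmp/requirements.txt"
    try:
        with open(path) as handle:
            names = requirement_names(handle)
    except FileNotFoundError:
        print(f"{path} not found; run this inside the Jupyter container")
    else:
        print("Missing:", missing_packages(names) or "none")
```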
Example code in notebook:

    from pyspark.sql import SparkSession

    # Connect to the standalone master started by docker-compose.yml.
    spark = (
        SparkSession.builder
        .appName("ExampleApp")
        .master("spark://spark-master:7077")
        .config("spark.executor.memory", "1G")
        .config("spark.driver.memory", "1G")
        .getOrCreate()
    )
    print(spark.version)

Python version mismatch between Jupyter and the workers:
- Jupyter (pyspark-notebook) uses Python 3.12.
- Spark workers (bitnami/spark) use Python 3.11.
- This causes errors like:
      PySparkRuntimeError: [PYTHON_VERSION_MISMATCH]
- Quick fix: use a Python 3.11 kernel in Jupyter (pin the version in Conda).
- Long-term fix: rebuild the Spark worker images with Python 3.12 to match Jupyter.
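PySpark compares the major and minor version of the driver's Python with the workers', so the mismatch can be detected in the notebook before any job is submitted. The sketch below takes the worker version (3.11) from this README as a constant rather than querying the cluster; the helper names are hypothetical.

```python
import sys

# Assumption taken from this README: the Spark workers run Python 3.11.
WORKER_PYTHON = "3.11"


def major_minor(version: str) -> str:
    """Reduce a version string such as '3.11.6' to '3.11'."""
    return ".".join(version.split(".")[:2])


def matches_workers(driver_version: str, worker_version: str = WORKER_PYTHON) -> bool:
    """True when driver and workers share the same major.minor version."""
    return major_minor(driver_version) == major_minor(worker_version)


if __name__ == "__main__":
    driver = "%d.%d.%d" % sys.version_info[:3]
    if matches_workers(driver):
        print(f"Driver Python {driver} matches workers ({WORKER_PYTHON}).")
    else:
        print(
            f"Driver Python {driver} differs from workers ({WORKER_PYTHON}); "
            "expect PYTHON_VERSION_MISMATCH errors."
        )
```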
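Setting a Jupyter password:
- Create a password hash:
      python -c "from notebook.auth import passwd; print(passwd())"
  Example: sha1:abcd1234...
  On newer images where notebook.auth is no longer available, import passwd from jupyter_server.auth instead.
- Add it to your Jupyter service in docker-compose.yml:
      environment:
        - JUPYTER_TOKEN=
        - JUPYTER_PASSWORD=sha1:abcd1234...
- Recreate the Jupyter container so the new environment takes effect (docker compose restart does not apply changes made to docker-compose.yml):
      docker compose up -d jupyter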
Now you can log in with username jovyan and your chosen password.
Stop the cluster:
    docker compose down