A simple repo for running PySpark locally using VS Code, Jupyter Notebooks, JetBrains IDEs, or any other local IDE. NOTE: this repo is intended for Ulster University MSc Data Science students (Big Data Technologies, COM739); however, if you find it useful then please feel free to use it. This repository now includes a streamlined setup that handles everything for you:
```python
from pyspark_local import initialize_pyspark

spark = initialize_pyspark()
```

See SETUP_INSTRUCTIONS.md for complete details, or jump to the Quick Start section below.
- Java 8 or 11 - Required for PySpark (Installation guide)
- Python 3.11 - Check with `python --version`
```bash
# Option 1: Using UV (recommended)
uv pip install -e ".[dev]"

# Option 2: Using pip
pip install -e ".[dev]"
```

```python
from pyspark_local import initialize_pyspark

spark = initialize_pyspark()

# Now use spark normally
df = spark.range(10)
df.show()

# Remember to stop when done
spark.stop()
```

See student_example.ipynb for a complete tutorial!
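The repo also ships sample data under `data/` (see the project structure below). Here is a hedged sketch of loading one of those files with the one-line setup; it assumes you run it from the repository root so the relative path resolves:

```python
from pyspark_local import initialize_pyspark

spark = initialize_pyspark()

# Load one of the bundled sample files and inspect it,
# without assuming anything about its columns
people = spark.read.json("data/people.json")
people.printSchema()
people.show(5)

spark.stop()
```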
As we have seen over the past few weeks, there is an issue on Google Colab due to Java versions. It's not entirely fair to pin the blame exclusively on Java: the incompatibility comes from the combination of Java, Python 3.13, and PySpark 4.0.1. I've found that things work best with Java 11 and Python 3.11, which lets us continue to use PySpark 3.5.1.

I originally tried to avoid introducing virtual environments because they add complexity; however, this is the real world of software engineering. If you want to learn more about virtual environments and the challenges of dependency and version management, the UV resources linked below are a good starting point. The tool I prefer for managing complex dependency relationships is UV.

To set up PySpark locally and ensure the dependencies are managed, we go through the following process: (1) install UV, (2) clone the GitHub repo, (3) create and activate the venv, and finally (4) sync the dependencies. A quick way to confirm the versions on your machine match the recommended combination is sketched below.
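The following is a minimal sketch (not part of the repo) for checking which Java, Python, and PySpark versions your environment is actually picking up. It assumes `java` is on your PATH and that PySpark is already installed:

```python
import subprocess
import sys

# Python version (we want 3.11.x)
print(f"Python: {sys.version.split()[0]}")

# Java version (we want 11.x); java prints its version to stderr
java_out = subprocess.run(
    ["java", "-version"], capture_output=True, text=True
).stderr.strip().splitlines()
print(f"Java: {java_out[0] if java_out else 'not found'}")

# PySpark version (we want 3.5.x)
import pyspark
print(f"PySpark: {pyspark.__version__}")
```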
```
PySpark_Local/
├── src/pyspark_local/ # Main package (NEW!)
│ ├── __init__.py # Package exports
│ ├── setup.py # Environment checking & Spark initialization
│ ├── validation.py # Validation tests
│ └── cli.py # Command-line interface
├── tests/ # Pytest test suite (NEW!)
│ ├── conftest.py # Pytest configuration
│ ├── test_environment.py # Environment tests
│ ├── test_spark_session.py # Spark session tests
│ └── test_validation.py # Validation tests
├── data/ # Sample data files
│ ├── employees.json
│ └── people.json
├── student_example.ipynb # Tutorial notebook (NEW!)
├── quick_start.ipynb # Quick start guide
├── SETUP_INSTRUCTIONS.md # Detailed setup guide (NEW!)
├── pyproject.toml # Project configuration
├── uv.lock # Dependency lock file
└── README.md # This file
```
| File/Directory | Description |
|---|---|
| src/pyspark_local/ | NEW! Main package with setup, validation, and CLI tools |
| tests/ | NEW! Comprehensive pytest test suite |
| student_example.ipynb | NEW! Complete tutorial showing one-line setup |
| SETUP_INSTRUCTIONS.md | NEW! Detailed setup and usage instructions |
| uv.lock | Locks your Python project's dependencies to specific versions to ensure reproducible and cross-platform environments |
| pyproject.toml | Project metadata, dependencies, build systems, and tool configurations |
| README.md | This file - main project overview |
| LICENSE | Standard MIT license details |
| data/ | Sample data files (employees.json, people.json) for testing PySpark |
You will first need to install UV. This is fundamentally a package manager like pip (Python), but it has the added benefit of letting us manage virtual environments and Python versions as we would any other package.
Windows (PowerShell):

```powershell
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"
```

macOS/Linux:

```bash
curl -LsSf https://astral.sh/uv/install.sh | sh
```

You can find more documentation on getting started with UV (and installation) here.
There is more information below, plus some useful tutorials, on what UV is and some common commands for using it. Some of them are decent, but I've also attached specific notes on UV in general. Only review these as required; they are not essential, but they are useful for getting an understanding of the tool and of any issues that might arise.
- UV Docs: official documentation
- UV introduction - Tech with Tim: Decent enough overview of what UV is and how we can use it.
- UV more details - Another YouTube tutorial: Another decent tutorial; it goes into more depth than the previous one.
With UV installed, we can proceed to setting up the virtual environment and initializing Spark locally (this should work for VS Code, Jupyter notebooks, and any other local development environment). We could also do this in Colab if we wanted to, although it's not really necessary.
1. Install UV: we should have this done already; we can check by running `uv --version`.
2. Clone the repo (`https://github.com/tonserrobo/PySpark_Local.git`) and make sure you open a terminal within this directory.
3. Set up the UV virtual environment with `uv venv`. This will create a Python 3.11 venv.
4. Activate the venv with `.venv\Scripts\activate` (Windows) or `source .venv/bin/activate` (macOS/Linux). This activates the venv and allows us to install the dependencies.
5. Sync the environment: run `uv sync` to pull all the required dependencies. This installs everything into the venv.
NOTE: I've placed the package files on Blackboard as well, so that essentially replaces step 2 if you don't want to clone the GitHub repo I created. Lastly, when you are finished working in the venv, you can deactivate it by running the command `deactivate`.
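If you are not sure whether your notebook or terminal is actually using the venv's interpreter, here is a quick sketch (not part of the repo) you can run from Python:

```python
import sys

# Both paths should point inside the project's .venv folder,
# e.g. ...\PySpark_Local\.venv\Scripts\python.exe on Windows
print(sys.executable)
print(sys.prefix)
```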
Once that is done, we can import PySpark and initialize our Spark session:
```python
import sys, os, findspark
from pyspark.sql import SparkSession

os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

findspark.init()
spark = SparkSession.builder.appName("SparkTest").getOrCreate()

print(f"findspark version: {findspark.__version__}")
print(f"pyspark version: {spark.version}")
```

This should output the following:

```
findspark version: 2.0.1
pyspark version: 3.5.7
```
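As a quick sanity check that the session from the snippet above can actually run jobs, here is a small sketch using only standard PySpark DataFrame APIs (the data and column names are made up for the example):

```python
# Build a tiny DataFrame in memory and run a simple filter and aggregation
data = [("Alice", 34), ("Bob", 45), ("Cara", 29)]
df = spark.createDataFrame(data, ["name", "age"])

df.filter(df.age > 30).show()
print("Average age:", df.agg({"age": "avg"}).collect()[0][0])
```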
To check that your environment is set up correctly, you can run the bundled checks from either Python or the CLI:

```bash
# Using Python
python -c "from pyspark_local import check_environment, print_environment_report; print_environment_report(check_environment())"

# Using CLI
python -m pyspark_local.cli check
```

To run the test suite:

```bash
# Run all pytest tests
pytest
# Run with verbose output
pytest -v
# Run only quick tests (skip slow tests)
pytest -m "not slow"
# Run with coverage report
pytest --cov=pyspark_local
```
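If you want a rough idea of what a test in `tests/` might look like, here is a minimal hedged sketch; the fixture and assertions are illustrative and not copied from the repo's actual test files:

```python
import pytest
from pyspark.sql import SparkSession


@pytest.fixture(scope="session")
def spark():
    # A small local session shared across tests, stopped at the end of the run
    session = (
        SparkSession.builder
        .master("local[1]")
        .appName("pytest-spark")
        .getOrCreate()
    )
    yield session
    session.stop()


def test_range_count(spark):
    # A trivial job proves the session can schedule and run tasks
    assert spark.range(10).count() == 10
```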
You can also create a Spark session with custom settings:

```python
from pyspark_local import create_spark_session

spark = create_spark_session(
    app_name="MyProject",
    master="local[2]",  # Use only 2 cores
    log_level="WARN",
    # Add custom Spark configs
    **{
        "spark.sql.shuffle.partitions": "10",
        "spark.driver.memory": "2g"
    }
)
```
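To confirm the custom settings were actually applied, you can read them back from the session created above. This is a sketch using the standard Spark `conf` API; that `create_spark_session` forwards every option exactly as shown is an assumption based on the example:

```python
# Read back the values the running session is actually using
print(spark.conf.get("spark.sql.shuffle.partitions"))    # expect "10"
print(spark.conf.get("spark.driver.memory", "not set"))  # expect "2g"
print(spark.sparkContext.master)                         # expect "local[2]"

spark.stop()
```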
To check the environment programmatically:

```python
from pyspark_local import check_environment

env = check_environment()
print(env)
```

To run the built-in validation tests against a live session:

```python
from pyspark_local import create_spark_session, run_validation_tests
spark = create_spark_session()
results = run_validation_tests(spark, verbose=True)
print(f"Passed: {results['passed']}/{results['total']}")
spark.stop()
```

I've put together a help document in `Docs > common_pitfalls.md` to help you identify and resolve any errors that might come up during the installation and sync process. It doesn't cover PySpark-specific problems, just the virtual environment management issues related to PySpark in our use case.
Additional troubleshooting can be found in SETUP_INSTRUCTIONS.md.