
PySpark Local: A simple guide to getting started using PySpark locally

Simple repo for running PySpark locally using VS Code, Jupyter Notebooks, JetBrains IDEs, or any other local IDE. Note: this repo is intended for Ulster University MSc Data Science students (Big Data Technologies, COM739), but if you find it useful, please feel free to use it. This repository now includes a streamlined setup that handles everything for you:

from pyspark_local import initialize_pyspark
spark = initialize_pyspark()

See SETUP_INSTRUCTIONS.md for complete details, or jump to the Quick Start section below.

Quick Start

Prerequisites

  1. Java 8 or 11 - Required for PySpark (Installation guide)
  2. Python 3.11 - Check with python --version

Installation

# Option 1: Using UV (recommended)
uv pip install -e ".[dev]"

# Option 2: Using pip
pip install -e ".[dev]"
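
To confirm the editable install is visible from the environment you plan to use, a quick check like the following should work (this assumes the install succeeded and that you are running the same interpreter you installed into):

# Verify the package is importable from the active environment
import pyspark_local
print(pyspark_local.__file__)  # should point into src/pyspark_local/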

Usage in Notebooks

from pyspark_local import initialize_pyspark
spark = initialize_pyspark()

# Now use spark normally
df = spark.range(10)
df.show()

# Remember to stop when done
spark.stop()

See student_example.ipynb for a complete tutorial!

Installing PySpark locally with Notebooks (Manual Setup)

As we have seen over the past few weeks, there is an issue on Google Colab due to Java versions. It's not entirely fair to blame Java exclusively, as the incompatibility comes from the combination of the installed Java version, Python 3.13, and PySpark 4.0.1. I've found it works best if we use Java 11 and Python 3.11, which lets us continue using PySpark 3.5.1; a quick version check is shown below. I originally tried to avoid introducing virtual environments because they add complexity, but this is the real world of software engineering. If you want to learn more about virtual environments and the challenges of version management, the tool I prefer for handling complex dependency relationships is UV. To set up PySpark locally and keep the dependencies managed, we go through the following process: (1) install UV, (2) clone the GitHub repo, (3) activate the venv, and finally (4) sync the dependencies.
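
As a rough sanity check, the snippet below prints the Python interpreter version and whichever Java is on the PATH, so you can confirm you are on the recommended Python 3.11 / Java 11 combination before going any further:

import sys, subprocess

# The recommended interpreter for this setup is Python 3.11
print("Python:", sys.version.split()[0])

# PySpark 3.5.x expects Java 8 or 11; `java -version` prints to stderr
subprocess.run(["java", "-version"])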

Project Structure

PySpark_Local/
├── src/pyspark_local/          # Main package (NEW!)
│   ├── __init__.py             # Package exports
│   ├── setup.py                # Environment checking & Spark initialization
│   ├── validation.py           # Validation tests
│   └── cli.py                  # Command-line interface
├── tests/                       # Pytest test suite (NEW!)
│   ├── conftest.py             # Pytest configuration
│   ├── test_environment.py     # Environment tests
│   ├── test_spark_session.py  # Spark session tests
│   └── test_validation.py     # Validation tests
├── data/                        # Sample data files
│   ├── employees.json
│   └── people.json
├── student_example.ipynb       # Tutorial notebook (NEW!)
├── quick_start.ipynb           # Quick start guide
├── SETUP_INSTRUCTIONS.md       # Detailed setup guide (NEW!)
├── pyproject.toml              # Project configuration
├── uv.lock                     # Dependency lock file
└── README.md                   # This file

File Descriptions

| File/Directory | Description |
| --- | --- |
| src/pyspark_local/ | NEW! Main package with setup, validation, and CLI tools |
| tests/ | NEW! Comprehensive pytest test suite |
| student_example.ipynb | NEW! Complete tutorial showing one-line setup |
| SETUP_INSTRUCTIONS.md | NEW! Detailed setup and usage instructions |
| uv.lock | Locks the project's dependencies to specific versions to ensure reproducible, cross-platform environments |
| pyproject.toml | Project metadata, dependencies, build system, and tool configurations |
| README.md | This file - main project overview |
| LICENSE | Standard MIT license details |
| data/ | Sample data files (employees.json, people.json) for testing PySpark |
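
The sample files under data/ are handy for a first read test. Assuming you already have a spark session running (see the setup sections below) and your working directory is the repo root, something like this should load them; nothing is assumed about their schema, so the snippet just prints whatever Spark infers:

# Read one of the bundled sample files and inspect the inferred schema
df = spark.read.json("data/employees.json")
df.printSchema()
df.show(5)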

Installing UV

You will first need to install UV. This is fundamentally a package manager like pip, but with the added benefit of letting us manage virtual environments and Python versions just as we would any other package.

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

You can find more documentation on getting started with UV (and installation) here

There is more information, along with some useful tutorials on what UV is and its common commands, here. Some of them are decent, but I've also attached specific notes on UV in general. Only review these as required; they are not essential, but they are useful for getting an understanding of the tool and of any issues that might arise.

Setting up venv

With UV installed, we can proceed to setting up the virtual environment and initializing Spark locally (this should work for VS Code, Jupyter Notebooks, and any other local development environment). We could also do this in Colab if we wanted, although it's not really necessary.

  1. Install UV: we should have this done already; check by running uv --version.
  2. Clone the repo (https://github.com/tonserrobo/PySpark_Local.git) and make sure you open a terminal within this directory.
  3. Set up the UV virtual environment (cmd: uv venv). This will create a Python 3.11 venv.
  4. Activate the venv (cmd: .venv\Scripts\activate on Windows, or source .venv/bin/activate on macOS/Linux). This activates the venv so the dependencies install into it.
  5. Sync the environment (cmd: uv sync). This will install all the required dependencies into the venv.

NOTE: I've placed the package files on Blackboard as well, so this essentially replaces step 2 if you don't want to clone the GitHub repo I created. Lastly, when you are finished working in the venv, you can deactivate it by typing deactivate.
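
Once the venv is activated and synced, it is worth confirming that your notebook or script is actually running the venv's interpreter rather than a system Python:

import sys

# The path should point inside the project's .venv directory,
# and the version should be 3.11.x
print(sys.executable)
print(sys.version)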

Setting up Spark Session

Once we have that done, we can import PySpark and initialize our Spark session:

import sys, os, findspark
from pyspark.sql import SparkSession

# Point the Spark driver and workers at the venv's Python interpreter
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# Locate the PySpark installation and add it to sys.path
findspark.init()

spark = SparkSession.builder.appName("SparkTest").getOrCreate()

print(f"findspark version: {findspark.__version__}")
print(f"pyspark version: {spark.version}")

This should output the following:

findspark version: 2.0.1
pyspark version: 3.5.7
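
With the session up, a small end-to-end job is a good way to confirm Spark can actually execute work locally (this reuses the spark variable from the block above):

# Build a tiny DataFrame and run an action to force a real Spark job
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()
print("row count:", df.count())  # expect 2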

Testing Your Setup

Run Environment Check

# Using Python
python -c "from pyspark_local import check_environment, print_environment_report; print_environment_report(check_environment())"

# Using CLI
python -m pyspark_local.cli check

Run Validation Tests

# Run all pytest tests
pytest

# Run with verbose output
pytest -v

# Run only quick tests (skip slow tests)
pytest -m "not slow"

# Run with coverage report
pytest --cov=pyspark_local
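
If you want to add your own tests alongside the bundled suite, a test carrying the slow marker (which pytest -m "not slow" skips) might look like the sketch below; the test name and body are illustrative, not part of the existing suite:

import pytest
from pyspark_local import create_spark_session

@pytest.mark.slow
def test_local_count():
    # Start a throwaway local session, run a tiny job, and always tear it down
    spark = create_spark_session(app_name="SlowExample")
    try:
        assert spark.range(1000).count() == 1000
    finally:
        spark.stop()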

Advanced Usage

Custom Spark Configuration

from pyspark_local import create_spark_session

spark = create_spark_session(
    app_name="MyProject",
    master="local[2]",  # Use only 2 cores
    log_level="WARN",
    # Add custom Spark configs
    **{
        "spark.sql.shuffle.partitions": "10",
        "spark.driver.memory": "2g"
    }
)
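
To double-check that the custom configuration actually took effect, you can read the values straight off the running session:

# Inspect the live session's configuration
print(spark.conf.get("spark.sql.shuffle.partitions"))  # expect "10"
print(spark.sparkContext.master)                       # expect "local[2]"
spark.stop()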

Manual Environment Check

from pyspark_local import check_environment

env = check_environment()
print(env)

Run Validation Tests Programmatically

from pyspark_local import create_spark_session, run_validation_tests

spark = create_spark_session()
results = run_validation_tests(spark, verbose=True)
print(f"Passed: {results['passed']}/{results['total']}")
spark.stop()

Common errors

I've put together a help document at Docs > common_pitfalls.md to help you identify and resolve errors that might come up during the installation and sync process. It doesn't cover PySpark-specific problems, only the virtual environment management issues related to PySpark in our use case.

Additional troubleshooting can be found in SETUP_INSTRUCTIONS.md.
