
PySpark Local: A simple guide to getting started using PySpark locally

Simple repo for running PySpark locally using VS Code, Jupyter Notebooks, JetBrains IDEs, or any other local IDE. Note: this repo is intended for Ulster University MSc Data Science students (Big Data Technologies, COM739), but if you find it useful, please feel free to use it. This repository now includes a streamlined setup that handles everything for you:

from pyspark_local import initialize_pyspark
spark = initialize_pyspark()

See SETUP_INSTRUCTIONS.md for complete details, or jump to the Quick Start section below.

Quick Start

Prerequisites

  1. Java 8 or 11 - Required for PySpark (Installation guide)
  2. Python 3.11 - Check with python --version

Installation

# Option 1: Using UV (recommended)
uv pip install -e ".[dev]"

# Option 2: Using pip
pip install -e ".[dev]"
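
To confirm the editable install is visible from the environment you plan to use, a quick check like the following should work (this assumes the install succeeded and that you are running the same interpreter you installed into):

# Verify the package is importable from the active environment
import pyspark_local
print(pyspark_local.__file__)  # should point into src/pyspark_local/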

Usage in Notebooks

from pyspark_local import initialize_pyspark
spark = initialize_pyspark()

# Now use spark normally
df = spark.range(10)
df.show()

# Remember to stop when done
spark.stop()

See student_example.ipynb for a complete tutorial!

Installing PySpark locally with Notebooks (Manual Setup)

As we have seen over the past few weeks, there is an issue on Google Colab due to Java versions. It's not entirely fair to blame Java exclusively, as the incompatibility comes from the combination of the installed Java version, Python 3.13, and PySpark 4.0.1. I've found it works best if we use Java 11 and Python 3.11, which lets us continue using PySpark 3.5.1; a quick version check is shown below. I originally tried to avoid introducing virtual environments because they add complexity, but this is the real world of software engineering. If you want to learn more about virtual environments and the challenges of version management, the tool I prefer for handling complex dependency relationships is UV. To set up PySpark locally and keep the dependencies managed, we go through the following process: (1) install UV, (2) clone the GitHub repo, (3) activate the venv, and finally (4) sync the dependencies.
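
As a rough sanity check, the snippet below prints the Python interpreter version and whichever Java is on the PATH, so you can confirm you are on the recommended Python 3.11 / Java 11 combination before going any further:

import sys, subprocess

# The recommended interpreter for this setup is Python 3.11
print("Python:", sys.version.split()[0])

# PySpark 3.5.x expects Java 8 or 11; `java -version` prints to stderr
subprocess.run(["java", "-version"])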

Project Structure

PySpark_Local/
├── src/pyspark_local/          # Main package (NEW!)
│   ├── __init__.py             # Package exports
│   ├── setup.py                # Environment checking & Spark initialization
│   ├── validation.py           # Validation tests
│   └── cli.py                  # Command-line interface
├── tests/                       # Pytest test suite (NEW!)
│   ├── conftest.py             # Pytest configuration
│   ├── test_environment.py     # Environment tests
│   ├── test_spark_session.py  # Spark session tests
│   └── test_validation.py     # Validation tests
├── data/                        # Sample data files
│   ├── employees.json
│   └── people.json
├── student_example.ipynb       # Tutorial notebook (NEW!)
├── quick_start.ipynb           # Quick start guide
├── SETUP_INSTRUCTIONS.md       # Detailed setup guide (NEW!)
├── pyproject.toml              # Project configuration
├── uv.lock                     # Dependency lock file
└── README.md                   # This file

File Descriptions

| File/Directory | Description |
| --- | --- |
| src/pyspark_local/ | NEW! Main package with setup, validation, and CLI tools |
| tests/ | NEW! Comprehensive pytest test suite |
| student_example.ipynb | NEW! Complete tutorial showing one-line setup |
| SETUP_INSTRUCTIONS.md | NEW! Detailed setup and usage instructions |
| uv.lock | Locks the project's dependencies to specific versions to ensure reproducible, cross-platform environments |
| pyproject.toml | Project metadata, dependencies, build system, and tool configurations |
| README.md | This file - main project overview |
| LICENSE | Standard MIT license details |
| data/ | Sample data files (employees.json, people.json) for testing PySpark |
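
The sample files under data/ are handy for a first read test. Assuming you already have a spark session running (see the setup sections below) and your working directory is the repo root, something like this should load them; nothing is assumed about their schema, so the snippet just prints whatever Spark infers:

# Read one of the bundled sample files and inspect the inferred schema
df = spark.read.json("data/employees.json")
df.printSchema()
df.show(5)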

Installing UV

You will first need to install UV. This is fundamentally a package manager like pip, but with the added benefit of letting us manage virtual environments and Python versions just as we would any other package.

# Windows (PowerShell)
powershell -ExecutionPolicy ByPass -c "irm https://astral.sh/uv/install.ps1 | iex"

# macOS / Linux
curl -LsSf https://astral.sh/uv/install.sh | sh

You can find more documentation on getting started with UV (and installation) here

There is more information, along with some useful tutorials on what UV is and its common commands, here. Some of them are decent, but I've also attached specific notes on UV in general. Only review these as required; they are not essential, but they are useful for getting an understanding of the tool and of any issues that might arise.

Setting up venv

With UV installed, we can proceed to setting up the virtual environment and initializing Spark locally (this should work for VS Code, Jupyter Notebooks, and any other local development environment). We could also do this in Colab if we wanted, although it's not really necessary.

  1. Install UV: we should have this done already; check by running uv --version.
  2. Clone the repo (https://github.com/tonserrobo/PySpark_Local.git) and make sure you open a terminal within this directory.
  3. Set up the UV virtual environment (cmd: uv venv). This will create a Python 3.11 venv.
  4. Activate the venv (cmd: .venv\Scripts\activate on Windows, or source .venv/bin/activate on macOS/Linux). This activates the venv so the dependencies install into it.
  5. Sync the environment (cmd: uv sync). This will install all the required dependencies into the venv.

NOTE: I've placed the package files on Blackboard as well, so this essentially replaces step 2 if you don't want to clone the GitHub repo I created. Lastly, when you are finished working in the venv, you can deactivate it by typing deactivate.
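
Once the venv is activated and synced, it is worth confirming that your notebook or script is actually running the venv's interpreter rather than a system Python:

import sys

# The path should point inside the project's .venv directory,
# and the version should be 3.11.x
print(sys.executable)
print(sys.version)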

Setting up Spark Session

Once we have that done, we can import PySpark and initialize our Spark session:

import sys, os, findspark
from pyspark.sql import SparkSession

# Point the Spark driver and workers at the venv's Python interpreter
os.environ["PYSPARK_PYTHON"] = sys.executable
os.environ["PYSPARK_DRIVER_PYTHON"] = sys.executable

# Locate the PySpark installation and add it to sys.path
findspark.init()

spark = SparkSession.builder.appName("SparkTest").getOrCreate()

print(f"findspark version: {findspark.__version__}")
print(f"pyspark version: {spark.version}")

This should output the following:

findspark version: 2.0.1
pyspark version: 3.5.7
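
With the session up, a small end-to-end job is a good way to confirm Spark can actually execute work locally (this reuses the spark variable from the block above):

# Build a tiny DataFrame and run an action to force a real Spark job
df = spark.createDataFrame([(1, "alice"), (2, "bob")], ["id", "name"])
df.show()
print("row count:", df.count())  # expect 2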

Testing Your Setup

Run Environment Check

# Using Python
python -c "from pyspark_local import check_environment, print_environment_report; print_environment_report(check_environment())"

# Using CLI
python -m pyspark_local.cli check

Run Validation Tests

# Run all pytest tests
pytest

# Run with verbose output
pytest -v

# Run only quick tests (skip slow tests)
pytest -m "not slow"

# Run with coverage report
pytest --cov=pyspark_local
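
If you want to add your own tests alongside the bundled suite, a test carrying the slow marker (which pytest -m "not slow" skips) might look like the sketch below; the test name and body are illustrative, not part of the existing suite:

import pytest
from pyspark_local import create_spark_session

@pytest.mark.slow
def test_local_count():
    # Start a throwaway local session, run a tiny job, and always tear it down
    spark = create_spark_session(app_name="SlowExample")
    try:
        assert spark.range(1000).count() == 1000
    finally:
        spark.stop()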

Advanced Usage

Custom Spark Configuration

from pyspark_local import create_spark_session

spark = create_spark_session(
    app_name="MyProject",
    master="local[2]",  # Use only 2 cores
    log_level="WARN",
    # Add custom Spark configs
    **{
        "spark.sql.shuffle.partitions": "10",
        "spark.driver.memory": "2g"
    }
)
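
To double-check that the custom configuration actually took effect, you can read the values straight off the running session:

# Inspect the live session's configuration
print(spark.conf.get("spark.sql.shuffle.partitions"))  # expect "10"
print(spark.sparkContext.master)                       # expect "local[2]"
spark.stop()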

Manual Environment Check

from pyspark_local import check_environment

env = check_environment()
print(env)

Run Validation Tests Programmatically

from pyspark_local import create_spark_session, run_validation_tests

spark = create_spark_session()
results = run_validation_tests(spark, verbose=True)
print(f"Passed: {results['passed']}/{results['total']}")
spark.stop()

Common errors

I've put together a help document at Docs > common_pitfalls.md to help you identify and resolve errors that might come up during the installation and sync process. It doesn't cover PySpark-specific problems, only the virtual environment management issues related to PySpark in our use case.

Additional troubleshooting can be found in SETUP_INSTRUCTIONS.md.
