Skip to content

tomz/spark-rust

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

4 Commits
 
 

Repository files navigation

spark-rust: A Rust-Native Spark Engine That's Both Embeddable and Distributed

An MVP-grade Rust implementation of Apache Spark that runs as a single 73 MiB binary in-process (think DuckDB) or scales out across a cluster as a Spark-compatible distributed compute engine. Passes 75.5% of Apache Spark's own SQL test suite. TPC-DS 99/99 at SF=1000 single-node. Ships with a real-time topology dashboard that visualizes data flow across the cluster as queries run.

Standing on the shoulders of Apache DataFusion, DuckDB, and Apache Spark.


What's the pitch?

The data engine landscape has two giants worth celebrating: DuckDB has set the bar for embedded analytical engines — a single binary, vectorized execution, and beautiful ergonomics that have made it the go-to for laptop-scale analytics. Apache Spark has been the foundation of modern big-data computing for over a decade, defining the API surface (DataFrames, SQL, MLlib, Structured Streaming) and the deployment model (drivers, executors, shuffle) that every other distributed engine is measured against. Both projects are remarkable engineering achievements, and spark-rust stands on their shoulders.

What spark-rust adds: the same engine in both lanes. One Rust query engine, three packagings:

  1. In-process Python module: pip install sparkrust, then con = sparkrust.connect(). No JVM, no daemon, no network — the engine runs inside your Python process and reads parquet directly. DuckDB-shaped ergonomics with the Spark API surface, within ~2x of DuckDB's wall time at SF=1000 TPC-DS.

  2. Single-node Spark Connect server: same spark-connect-server binary on localhost, for when you want multiple Notebook / PySpark / JDBC clients to share one engine.

  3. Distributed Spark Connect cluster: same code, same query plans, pointed at a cluster driver. Plug in unmodified PySpark from pip install pyspark and your existing notebooks just work.

All three share the same query engine (DataFusion + spark-rust's own analyzer / optimizer rules / spill operators), the same physical plans, the same MLlib bindings.


Where it is in its lifecycle

Advanced MVP. All the major Spark capabilities are implemented and verified end-to-end:

  • Spark SQL — full ANSI surface plus Spark-flavored extensions. Verified against Apache Spark's own SQL test suite (9,508 golden output files): 7,177 / 9,508 = 75.5% pass rate.
  • DataFrames + Datasets — via spark-dataframe crate. Window functions, UDFs, aggregates, grouping sets, ROLLUP/CUBE, lateral joins, recursive CTEs.
  • DataFrameWriter / df.writeTo(...) — full Connect surface, saveModes (append / overwrite / errorIfExists / ignore), WriteOperationV2 create/replace/append/overwrite variants.
  • Spark Connect — full gRPC server. Unmodified pyspark from PyPI (both 3.5 and 4.0 dialects) connects to sc://host:port and runs.
  • MLlib via Spark Connect — unmodified PySpark MLlib pipelines work through the Connect protocol. LogisticRegression, RandomForest, LinearRegression, k-means, FPGrowth, PrefixSpan, feature transformers, cross-validation, Pipeline/CrossValidator/TrainValidationSplit walker, and Spark-compatible on-disk model persistence.
  • Joins — broadcast hash, sort-merge, shuffle hash, broadcast nested loop, semi/anti, mark joins. Adaptive runtime filter (Bloom filter) pushdown rule.
  • Adaptive Query Execution — coalesce partitions, skew join split (with 4-phase build-side replication), runtime broadcast switch, SMJ→BHJ runtime strategy swap.
  • Operator + shuffle spill — hash aggregation, sort, hash join (including multi-partition build + probe replay), cross-join, window operator, symmetric hash join. Spills to local disk via Arrow IPC.
  • Distributed shuffle — Tonic-based gRPC service, Arrow IPC over HTTP/2, ShuffleManager + ShuffleClient/Server pair.
  • Metastore + DDL persistence — CREATE / DROP / ALTER TABLE persist to the metastore backend across process restarts. Iceberg catalog backend with snapshot writes. Delta and Hive Metastore integration in progress.
  • Storage formats — Parquet (read/write with stats, bloom-filter pruning, column projection, predicate pushdown), ORC (read with predicate pushdown), Iceberg, Delta, CSV, JSON, Avro.
  • Object stores — S3, Azure Blob, GCS, local FS.
  • Statistics + cost model — column-level histograms (equi-width + min/max synthesis), ANALYZE TABLE FOR COLUMNS, planner hooks consuming collected stats.
  • Structured Streaming — micro-batch engine with readStream / writeStream, watermarks (event-time + session windows), late-event policies, state stores (in-memory, file-backed, RocksDB), checkpoint manager, Kafka / Socket / Rate / Memory / File sources, Console / Memory / File / Parquet sinks, trigger modes (Once, Processing-Time, Available- Now, Continuous skeleton). Full Spark-compatible API coverage on the roadmap.
  • Deployment — Standalone gRPC executors, Kubernetes manifests, dynamic allocation primitive, KEDA setup, graceful shutdown, error-budget stage flagging in the history server.

Real-time topology dashboard

spark-rust ships with a live topology page that visualizes the cluster as queries run. Open it in a browser and watch:

  • Executors appear as nodes, sized by active task count, colored by state (registered / running / draining / dead).
  • Shuffle edges light up between executors with animated particles showing the direction and volume of in-flight data transfers.
  • Per-executor counters for active tasks, shuffle read/write bytes, result bytes, cumulative since-start metrics.
  • Cluster-wide cumulative pills at the top: rows scanned, shuffle bytes, result bytes, query count.
  • Stale-ring indicators when an executor misses heartbeats.
  • Playback pause for capturing screenshots of the moment a particular query was executing.

Powered by a TopologyService gRPC server on the driver that accepts executor registrations + per-heartbeat counter snapshots, fanned out via JSONL event logs to the history server's dashboard. See crates/spark-history-server/src/dashboard.html and the topology-demo-* workload generators in crates/spark-connect/src/bin/.


The benchmark headline

Single-node TPC-DS, 32 cores, 251 GB RAM, 1.2 TB local NVMe, parquet on disk. Cold cache, single-shot. Full 99-query sweep at four scale factors.

Scale spark-rust total DuckDB total Ratio spark-rust completion DuckDB completion
SF=1 18.1s 6.2s 2.90x 99/99 99/99
SF=10 61.4s 23.0s 2.67x 99/99 99/99
SF=100 358.7s 201.5s 1.78x 99/99 99/99
SF=1000 4,146.6s 1,859.0s 2.23x 99/99 98/99*

* DuckDB hits a working-set limit at SF=1000 Q85 on this hardware — the intermediate exceeds the 547 GB local disk available for spill. DuckDB's overall optimizer is excellent and faster than spark-rust on most queries in this sweep; this particular case is a plan-shape difference where spark-rust's join-order selectivity keeps the intermediate small enough to finish in 8 seconds without spilling.

spark-rust runs the full TPC-DS suite at SF=1000 single-node and tracks within 2-3x of DuckDB's wall time — and DuckDB is a remarkable benchmark to track against. That's the headline as a single-node embeddable engine.

As a distributed engine, the same binary scales to a cluster: at SF=1000 the Q72 plan that takes 156s single-node ran in 4,315s on a 20-executor cluster previously, with the cluster shape being preferable for the queries that genuinely don't fit single-node (TPC-DS at SF=10,000+).

Per-query glimpse (seconds)

A few illustrative queries showing the shape of the engine across scales:

Query SF=1 SR/Duck SF=10 SR/Duck SF=100 SR/Duck SF=1000 SR/Duck
Q01 0.06 / 0.04 0.14 / 0.06 0.57 / 0.53 4.60 / 1.76
Q23 0.35 / 0.15 1.86 / 0.71 16.87 / 13.81 506 / 633
Q41 0.07 / 0.01 0.10 / 0.03 0.15 / 0.09 0.10 / 0.02
Q67 0.26 / 0.23 2.39 / 1.77 21.25 / 32.33 363 / 233
Q72 2.88 / 0.09 9.36 / 0.31 24.0 / 1.33 155 / 19
Q85 0.35 / 0.05 0.49 / 0.14 1.06 / 0.66 8.0 / — *
Q97 0.10 / 0.06 0.72 / 0.20 2.18 / 7.17 22.25 / 12.74
Q99 0.08 / 0.03 0.22 / 0.08 1.56 / 0.54 6.74 / 3.42

* Same Q85 disk-limit note as above.

Full per-query table in docs/tpcds-single-node-benchmark.md.

The pattern: DuckDB is consistently fast across the board — particularly strong on rollup and grouping-set queries where its operator fusion shines. spark-rust matches or beats it on a handful of shapes (subqueries, selective filter patterns) and stays within 2x for most of the middle. Both engines are doing real, sophisticated work; the comparison is a healthy benchmark to keep tracking.


Binary footprint

spark-rust ships in three forms, all dynamically linked against the same standard system libs (libc, libstdc++, libm, libpthread, libdl). No JVM, no bundled deps, no installer.

Artifact What it is Size
spark-connect-server Full engine + Spark Connect protocol (Rust binary, stripped) 72.9 MiB
sparkrust._sparkrust.abi3.so In-process Python extension (CPython abi3-py39, release-stripped) ~75 MiB
duckdb-1.5.2 (reference) DuckDB CLI for benchmark comparison 58.9 MiB

The Connect server + Python extension are different packagings of the same Rust engine. Pick the one that matches the deployment shape: pip install sparkrust for embedded in-process usage, the standalone binary for laptop/cluster Spark Connect.

That 73 MiB binary includes the entire Spark Connect protocol, MLlib bindings, adaptive query execution rules, spill operators, distributed shuffle service, metastore writer, history server, and topology visualization. DuckDB's 59 MiB single-binary is a remarkable engineering achievement and the benchmark we measure against.

For context, Apache Spark 4.2's distribution is ~330 MB compressed (~1 GB extracted) plus the JVM runtime — that footprint reflects Spark's much broader feature surface (Structured Streaming, Scala/Java APIs, SparkR, full MLlib, the Catalyst optimizer, history server, web UI, the entire Hadoop ecosystem integration). spark-rust trades that breadth for a Rust single-binary deployment and a much narrower feature surface (SQL + DataFrames + MLlib + Spark Connect + Iceberg/Delta writes).


How does this exist?

spark-rust is built on top of Apache DataFusion — the remarkable Rust query engine that already powers Polars, ParadeDB, InfluxDB v3, Sail, and several others. DataFusion gives us:

  • Arrow-columnar execution with first-class vectorization.
  • A working SQL parser, optimizer, and physical planner.
  • Parquet read/write, object_store, the core operators.

The DataFusion community has done extraordinary work building this foundation. spark-rust uses a local fork of DataFusion (we contribute back where we can) and adds the Spark-shaped pieces on top:

  • A Spark-flavored SQL parser layer (24 crates handling Spark dialect rewrites — make_ym_interval, array_compact, from_unixtime, etc.)
  • An analyzer pass that injects Spark-specific rewrites between SQL parsing and DataFusion's optimizer.
  • Operator + shuffle spill (the hash-aggregate, sort, and hash-join spill series we contributed back to DataFusion's branch-53 integration line).
  • Skew join split (4-phase build-side replication).
  • A gRPC shuffle service for distributed mode.
  • A Spark Connect gRPC server.
  • MLlib over the Connect protocol — pyspark.ml.classification.LogisticRegression and friends, fitting and persisting models via Rust-side linfa + custom code.
  • A metastore writer with iceberg / delta / hive backends.
  • A real-time topology gRPC service + browser dashboard.
  • 22+ crates of glue, totaling ~600 Rust source files and ~250k LOC.

The result is a 73 MiB binary that runs PySpark notebooks against parquet on local disk or against a distributed cluster.


Three ways to use it

The same engine surface is reachable through three entry points, all pulling from one Rust codebase:

1. In-process Python (no daemon, DuckDB-shaped)

# pip install sparkrust
import sparkrust

con = sparkrust.connect()
con.sql("CREATE VIEW sales AS SELECT * FROM read_parquet('sales/*.parquet')")
df = con.sql("SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region")

# Pick your result shape:
df.show()                  # PySpark-style tabular print
rows = df.collect()        # list of tuples
table = df.arrow()         # pyarrow.Table (zero-copy ready)
pdf = df.toPandas()        # pandas DataFrame

Or with a PySpark-flavored entry point for code transitioning from PySpark:

from sparkrust import SparkSession
spark = SparkSession()              # in-process, no Connect server needed
spark.sql("SELECT 1 AS x").show()

The Python extension module (sparkrust._sparkrust) is a CPython abi3-py39 binding over the same Rust query engine. No JVM, no network hop, no daemon — the engine runs inside your Python process. Arrow data crosses the FFI boundary via Arrow IPC; conversions to pandas/pyarrow are zero-copy where possible.

2. Single-node Spark Connect server (laptop-scale)

# Same `spark-connect-server` binary, run it locally
# (it's the same engine; this mode is useful when you want to
# share one engine between multiple Python/Notebook/Java clients)
import pyspark.sql
spark = pyspark.sql.SparkSession.builder.remote("sc://localhost:50051").getOrCreate()

df = spark.read.parquet("s3://my-bucket/sales/")
df.groupBy("region").agg({"revenue": "sum"}).show()

3. Distributed Spark Connect cluster

# Same code, point at a cluster driver
spark = pyspark.sql.SparkSession.builder.remote("sc://driver.example.com:50051").getOrCreate()
# ...identical pipeline, now executed across N executors...

The driver routes operations through Spark Connect; executors receive Substrait-encoded plans and execute them locally with their own Arrow-batched DataFusion. Local-disk shuffle, gRPC fetch over Arrow IPC, the same memory manager rules apply on both sides.

What this means in practice: you can prototype in-process with the Python API on a laptop, then deploy the exact same SQL or DataFrame code through Spark Connect to a cluster without changing a line.


AI-native features ahead

Expect AI-native features baked into spark-rust going forward:

  • Natural-language SQL: ask for queries in English, get them planned and run.
  • Self-tuning execution: the engine learns from prior runs (cost-model population, plan-shape histograms, adaptive bloom-filter gating) and applies it to new queries without manual configuration.
  • AI-assisted plan explanation: ask why a query is slow and get a plain-language summary of the physical plan + per-operator metrics.
  • Embedded model serving: load a model artifact alongside parquet and call it from SQL.

These are not in the v0 release. They're the natural next direction once the core engine ships.


Release status

spark-rust is currently in private development. Open-source release is planned near future at github.com/tomz/spark-rust. The instructions in this article — cargo build, the PySpark client examples — will be runnable from the public repository once that lands.

If you're interested in early access or a specific use case, reach out before then.


Where to start (once released)

  • In-process Python: pip install sparkrust, then import sparkrust; con = sparkrust.connect(); con.sql("...").show(). No daemon, no JVM, no network — the engine runs inside your Python process.
  • Single-node Connect server: clone github.com/tomz/spark-rust, cargo build --release -p spark-connect-server, then pyspark --remote sc://localhost:50051 against a local parquet directory.
  • Topology dashboard: launch the connect server, run bash scripts/topology-demo.sh (or topology-demo-driver / topology-demo-shuffle / topology-demo-heavy directly), then open the history-server dashboard URL.
  • MLlib demo: examples/demo/pyspark/connect_ml.py — unmodified PySpark fits a LogisticRegression against our engine.
  • TPC-DS bench: docs/tpcds-single-node-benchmark.md for the numbers above plus the harness to reproduce them.
  • Spark SQL coverage: docs/golden-report.json is the live golden test result (7,177 / 9,508 ok, refreshed each commit).

The 30-second summary

spark-rust is an advanced MVP of a Rust-native Spark engine that runs three ways from one codebase: as an in-process Python module (think DuckDB ergonomics — pip install sparkrust), as a single-node Spark Connect server, and as a distributed Spark Connect cluster. It tracks within 2-3x of DuckDB single-node at TPC-DS SF=1000 (and runs the full 99-query sweep, including queries with disk-limit sensitivity), passes 75.5% of Apache Spark's own SQL test suite, ships MLlib via the Connect protocol so unmodified PySpark ML pipelines work, ships a Structured Streaming engine with watermarks, state stores, and Kafka/file/socket sources, and ships a real-time topology dashboard that visualizes data flow across the cluster as queries run — all in 73 MiB of dynamically-linked binary.

If you've valued DuckDB's ergonomics for embedded analytics and Apache Spark's API surface for distributed compute, spark-rust offers both shapes in a single Rust-native binary.

About

A Rust-native Spark engine: embeddable single-binary like DuckDB, or distributed Spark Connect cluster. MVP-stage, 74.3% Spark SQL test pass rate, TPC-DS 99/99 at SF=1000 single-node.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors