An MVP-grade Rust implementation of Apache Spark that runs as a single 73 MiB binary in-process (think DuckDB) or scales out across a cluster as a Spark-compatible distributed compute engine. Passes 75.5% of Apache Spark's own SQL test suite. TPC-DS 99/99 at SF=1000 single-node. Ships with a real-time topology dashboard that visualizes data flow across the cluster as queries run.
Standing on the shoulders of Apache DataFusion, DuckDB, and Apache Spark.
The data engine landscape has two giants worth celebrating: DuckDB has set the bar for embedded analytical engines — a single binary, vectorized execution, and beautiful ergonomics that have made it the go-to for laptop-scale analytics. Apache Spark has been the foundation of modern big-data computing for over a decade, defining the API surface (DataFrames, SQL, MLlib, Structured Streaming) and the deployment model (drivers, executors, shuffle) that every other distributed engine is measured against. Both projects are remarkable engineering achievements, and spark-rust stands on their shoulders.
What spark-rust adds: the same engine in both lanes. One Rust query engine, three packagings:
- In-process Python module: `pip install sparkrust`, then `con = sparkrust.connect()`. No JVM, no daemon, no network: the engine runs inside your Python process and reads parquet directly. DuckDB-shaped ergonomics with the Spark API surface, within ~2x of DuckDB's wall time at SF=1000 TPC-DS.
- Single-node Spark Connect server: same `spark-connect-server` binary on `localhost`, for when you want multiple Notebook / PySpark / JDBC clients to share one engine.
- Distributed Spark Connect cluster: same code, same query plans, pointed at a cluster driver. Plug in unmodified PySpark from `pip install pyspark` and your existing notebooks just work.
All three share the same query engine (DataFusion + spark-rust's own analyzer / optimizer rules / spill operators), the same physical plans, the same MLlib bindings.
Advanced MVP. All the major Spark capabilities are implemented and verified end-to-end:
- Spark SQL — full ANSI surface plus Spark-flavored extensions. Verified against Apache Spark's own SQL test suite (9,508 golden output files): 7,177 / 9,508 = 75.5% pass rate.
- DataFrames + Datasets: via the `spark-dataframe` crate. Window functions, UDFs, aggregates, grouping sets, ROLLUP/CUBE, lateral joins, recursive CTEs. `DataFrameWriter` / `df.writeTo(...)`: full Connect surface, saveModes (append / overwrite / errorIfExists / ignore), `WriteOperationV2` create/replace/append/overwrite variants.
- Spark Connect: full gRPC server. Unmodified `pyspark` from PyPI (both 3.5 and 4.0 dialects) connects to `sc://host:port` and runs.
- MLlib via Spark Connect: unmodified PySpark MLlib pipelines work through the Connect protocol. `LogisticRegression`, `RandomForest`, `LinearRegression`, k-means, FPGrowth, PrefixSpan, feature transformers, cross-validation, Pipeline/CrossValidator/TrainValidationSplit walker, and Spark-compatible on-disk model persistence.
- Joins: broadcast hash, sort-merge, shuffle hash, broadcast nested loop, semi/anti, mark joins. Adaptive runtime filter (Bloom filter) pushdown rule.
- Adaptive Query Execution — coalesce partitions, skew join split (with 4-phase build-side replication), runtime broadcast switch, SMJ→BHJ runtime strategy swap.
- Operator + shuffle spill — hash aggregation, sort, hash join (including multi-partition build + probe replay), cross-join, window operator, symmetric hash join. Spills to local disk via Arrow IPC.
- Distributed shuffle — Tonic-based gRPC service, Arrow IPC over HTTP/2, ShuffleManager + ShuffleClient/Server pair.
- Metastore + DDL persistence — CREATE / DROP / ALTER TABLE persist to the metastore backend across process restarts. Iceberg catalog backend with snapshot writes. Delta and Hive Metastore integration in progress.
- Storage formats — Parquet (read/write with stats, bloom-filter pruning, column projection, predicate pushdown), ORC (read with predicate pushdown), Iceberg, Delta, CSV, JSON, Avro.
- Object stores — S3, Azure Blob, GCS, local FS.
- Statistics + cost model — column-level histograms (equi-width + min/max synthesis), ANALYZE TABLE FOR COLUMNS, planner hooks consuming collected stats.
- Structured Streaming: micro-batch engine with `readStream`/`writeStream`, watermarks (event-time + session windows), late-event policies, state stores (in-memory, file-backed, RocksDB), checkpoint manager, Kafka / Socket / Rate / Memory / File sources, Console / Memory / File / Parquet sinks, trigger modes (Once, Processing-Time, Available-Now, Continuous skeleton). Full Spark-compatible API coverage on the roadmap.
- Deployment: Standalone gRPC executors, Kubernetes manifests, dynamic allocation primitive, KEDA setup, graceful shutdown, error-budget stage flagging in the history server.
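To make the statistics item in the list above more concrete, here is a minimal sketch of how an equi-width histogram can feed a selectivity estimate for a range predicate. It is illustrative Python, not spark-rust's Rust implementation: the names `EquiWidthHistogram`, `build_equi_width`, and `estimate_lt` are hypothetical, and the uniform-within-bucket assumption is the textbook one rather than anything confirmed about the engine.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class EquiWidthHistogram:
    lo: float          # column minimum
    hi: float          # column maximum
    counts: List[int]  # row count per equal-width bucket

    @property
    def width(self) -> float:
        return (self.hi - self.lo) / len(self.counts)


def build_equi_width(values: Sequence[float], buckets: int) -> EquiWidthHistogram:
    """Hypothetical helper: bucket values into `buckets` equal-width ranges."""
    lo, hi = min(values), max(values)
    counts = [0] * buckets
    span = (hi - lo) or 1.0  # avoid division by zero for constant columns
    for v in values:
        # The maximum value lands in the last bucket rather than one past it.
        idx = min(int((v - lo) / span * buckets), buckets - 1)
        counts[idx] += 1
    return EquiWidthHistogram(lo, hi, counts)


def estimate_lt(hist: EquiWidthHistogram, x: float) -> float:
    """Estimated fraction of rows with value < x, assuming values are
    spread uniformly inside each bucket."""
    total = sum(hist.counts)
    if total == 0 or x <= hist.lo:
        return 0.0
    if x > hist.hi:
        return 1.0
    pos = (x - hist.lo) / hist.width             # fractional bucket position
    full = min(int(pos), len(hist.counts) - 1)   # buckets entirely below x
    partial = pos - full                         # share of the bucket holding x
    rows = sum(hist.counts[:full]) + partial * hist.counts[full]
    return rows / total


hist = build_equi_width(list(range(100)), buckets=10)
print(f"estimated selectivity of col < 49.5: {estimate_lt(hist, 49.5):.3f}")
```

A planner hook could use an estimate like this to decide, for example, whether a filtered join input is small enough to broadcast.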
spark-rust ships with a live topology page that visualizes the cluster as queries run. Open it in a browser and watch:
- Executors appear as nodes, sized by active task count, colored by state (registered / running / draining / dead).
- Shuffle edges light up between executors with animated particles showing the direction and volume of in-flight data transfers.
- Per-executor counters for active tasks, shuffle read/write bytes, result bytes, cumulative since-start metrics.
- Cluster-wide cumulative pills at the top: rows scanned, shuffle bytes, result bytes, query count.
- Stale-ring indicators when an executor misses heartbeats.
- Playback pause for capturing screenshots of the moment a particular query was executing.
Powered by a `TopologyService` gRPC server on the driver that accepts
executor registrations + per-heartbeat counter snapshots, fanned out via
JSONL event logs to the history server's dashboard. See
`crates/spark-history-server/src/dashboard.html` and the
`topology-demo-*` workload generators in `crates/spark-connect/src/bin/`.
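The heartbeat-to-dashboard flow described above can be sketched as a small JSONL consumer. This is a guess at the general shape, not the actual `TopologyService` schema: the field names (`executor_id`, `ts`, `active_tasks`, `shuffle_write_bytes`) and the 10-second staleness threshold are assumptions made for illustration.

```python
import json
from typing import Dict, Iterable, Set

STALE_AFTER_SECS = 10.0  # assumed threshold; the real value is not stated here


def latest_snapshots(lines: Iterable[str]) -> Dict[str, dict]:
    """Keep the most recent heartbeat snapshot per executor from JSONL lines."""
    latest: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the log
        event = json.loads(line)
        prev = latest.get(event["executor_id"])
        if prev is None or event["ts"] >= prev["ts"]:
            latest[event["executor_id"]] = event
    return latest


def cluster_totals(latest: Dict[str, dict]) -> Dict[str, int]:
    """Sum per-executor counters into cluster-wide totals."""
    return {
        "active_tasks": sum(e["active_tasks"] for e in latest.values()),
        "shuffle_write_bytes": sum(e["shuffle_write_bytes"] for e in latest.values()),
    }


def stale_executors(latest: Dict[str, dict], now: float) -> Set[str]:
    """Executors whose last heartbeat is older than the staleness threshold."""
    return {eid for eid, e in latest.items() if now - e["ts"] > STALE_AFTER_SECS}


# Hypothetical event log: exec-1 reports twice, exec-2 once.
log = [
    '{"executor_id": "exec-1", "ts": 100.0, "active_tasks": 4, "shuffle_write_bytes": 1024}',
    '{"executor_id": "exec-2", "ts": 101.0, "active_tasks": 2, "shuffle_write_bytes": 2048}',
    '{"executor_id": "exec-1", "ts": 115.0, "active_tasks": 1, "shuffle_write_bytes": 4096}',
]
snap = latest_snapshots(log)
print(cluster_totals(snap))
print(stale_executors(snap, now=116.0))
```

In this sketch the dashboard would render `exec-2` with a stale ring, since its last heartbeat is 15 seconds old, while the cluster-wide pills show the summed counters.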
Single-node TPC-DS, 32 cores, 251 GB RAM, 1.2 TB local NVMe, parquet on disk. Cold cache, single-shot. Full 99-query sweep at four scale factors.
| Scale | spark-rust total | DuckDB total | Ratio | spark-rust completion | DuckDB completion |
|---|---|---|---|---|---|
| SF=1 | 18.1s | 6.2s | 2.90x | 99/99 | 99/99 |
| SF=10 | 61.4s | 23.0s | 2.67x | 99/99 | 99/99 |
| SF=100 | 358.7s | 201.5s | 1.78x | 99/99 | 99/99 |
| SF=1000 | 4,146.6s | 1,859.0s | 2.23x | 99/99 | 98/99* |
* DuckDB hits a working-set limit at SF=1000 Q85 on this hardware — the intermediate exceeds the 547 GB local disk available for spill. DuckDB's overall optimizer is excellent and faster than spark-rust on most queries in this sweep; this particular case is a plan-shape difference where spark-rust's join-order selectivity keeps the intermediate small enough to finish in 8 seconds without spilling.
spark-rust runs the full TPC-DS suite at SF=1000 single-node and tracks within 2-3x of DuckDB's wall time — and DuckDB is a remarkable benchmark to track against. That's the headline as a single-node embeddable engine.
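As a quick sanity check, the Ratio column can be recomputed from the rounded totals in the table. The snippet below is only arithmetic on the published numbers. The SF=1 result comes out at 2.92x rather than the tabulated 2.90x, presumably because the table's ratio was computed from unrounded timings (an assumption; the article does not say).

```python
# Totals in seconds, copied from the table above: (spark-rust, DuckDB).
totals = {
    "SF=1": (18.1, 6.2),
    "SF=10": (61.4, 23.0),
    "SF=100": (358.7, 201.5),
    "SF=1000": (4146.6, 1859.0),
}

# Ratio = spark-rust wall time divided by DuckDB wall time.
ratios = {scale: round(sr / duck, 2) for scale, (sr, duck) in totals.items()}
for scale, ratio in ratios.items():
    print(f"{scale}: {ratio:.2f}x")
```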
As a distributed engine, the same binary scales to a cluster. At SF=1000, the Q72 plan that takes 156s single-node previously ran in 4,315s on a 20-executor cluster; the cluster shape is preferable for the queries that genuinely don't fit on a single node (TPC-DS at SF=10,000+).
A few illustrative queries showing the shape of the engine across scales:
| Query | SF=1 SR/Duck | SF=10 SR/Duck | SF=100 SR/Duck | SF=1000 SR/Duck |
|---|---|---|---|---|
| Q01 | 0.06 / 0.04 | 0.14 / 0.06 | 0.57 / 0.53 | 4.60 / 1.76 |
| Q23 | 0.35 / 0.15 | 1.86 / 0.71 | 16.87 / 13.81 | 506 / 633 |
| Q41 | 0.07 / 0.01 | 0.10 / 0.03 | 0.15 / 0.09 | 0.10 / 0.02 |
| Q67 | 0.26 / 0.23 | 2.39 / 1.77 | 21.25 / 32.33 | 363 / 233 |
| Q72 | 2.88 / 0.09 | 9.36 / 0.31 | 24.0 / 1.33 | 155 / 19 |
| Q85 | 0.35 / 0.05 | 0.49 / 0.14 | 1.06 / 0.66 | 8.0 / — * |
| Q97 | 0.10 / 0.06 | 0.72 / 0.20 | 2.18 / 7.17 | 22.25 / 12.74 |
| Q99 | 0.08 / 0.03 | 0.22 / 0.08 | 1.56 / 0.54 | 6.74 / 3.42 |
* Same Q85 disk-limit note as above.
Full per-query table in docs/tpcds-single-node-benchmark.md.
The pattern: DuckDB is consistently fast across the board — particularly strong on rollup and grouping-set queries where its operator fusion shines. spark-rust matches or beats it on a handful of shapes (subqueries, selective filter patterns) and stays within 2x for most of the middle. Both engines are doing real, sophisticated work; the comparison is a healthy benchmark to keep tracking.
spark-rust ships in three forms, all dynamically linked against the same
standard system libs (libc, libstdc++, libm, libpthread, libdl).
No JVM, no bundled deps, no installer.
| Artifact | What it is | Size |
|---|---|---|
| `spark-connect-server` | Full engine + Spark Connect protocol (Rust binary, stripped) | 72.9 MiB |
| `sparkrust._sparkrust.abi3.so` | In-process Python extension (CPython abi3-py39, release-stripped) | ~75 MiB |
| `duckdb-1.5.2` (reference) | DuckDB CLI for benchmark comparison | 58.9 MiB |
The Connect server + Python extension are different packagings of the
same Rust engine. Pick the one that matches the deployment shape:
pip install sparkrust for embedded in-process usage, the standalone
binary for laptop/cluster Spark Connect.
That 73 MiB binary includes the entire Spark Connect protocol, MLlib bindings, adaptive query execution rules, spill operators, distributed shuffle service, metastore writer, history server, and topology visualization. DuckDB's 59 MiB single-binary is a remarkable engineering achievement and the benchmark we measure against.
For context, Apache Spark 4.2's distribution is ~330 MB compressed (~1 GB extracted) plus the JVM runtime — that footprint reflects Spark's much broader feature surface (Structured Streaming, Scala/Java APIs, SparkR, full MLlib, the Catalyst optimizer, history server, web UI, the entire Hadoop ecosystem integration). spark-rust trades that breadth for a Rust single-binary deployment and a much narrower feature surface (SQL + DataFrames + MLlib + Spark Connect + Iceberg/Delta writes).
spark-rust is built on top of Apache DataFusion, the remarkable Rust query engine that already powers ParadeDB, InfluxDB v3, Sail, and several others. DataFusion gives us:
- Arrow-columnar execution with first-class vectorization.
- A working SQL parser, optimizer, and physical planner.
- Parquet read/write, object_store, the core operators.
The DataFusion community has done extraordinary work building this foundation. spark-rust uses a local fork of DataFusion (we contribute back where we can) and adds the Spark-shaped pieces on top:
- A Spark-flavored SQL parser layer (24 crates handling Spark dialect rewrites: `make_ym_interval`, `array_compact`, `from_unixtime`, etc.)
- An analyzer pass that injects Spark-specific rewrites between SQL parsing and DataFusion's optimizer.
- Operator + shuffle spill (the hash-aggregate, sort, and hash-join spill series we contributed back to DataFusion's `branch-53` integration line).
- Skew join split (4-phase build-side replication).
- A gRPC shuffle service for distributed mode.
- A Spark Connect gRPC server.
- MLlib over the Connect protocol: `pyspark.ml.classification.LogisticRegression` and friends, fitting and persisting models via Rust-side `linfa` + custom code.
- A metastore writer with iceberg / delta / hive backends.
- A real-time topology gRPC service + browser dashboard.
- 22+ crates of glue, totaling ~600 Rust source files and ~250k LOC.
The result is a 73 MiB binary that runs PySpark notebooks against parquet on local disk or against a distributed cluster.
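For readers unfamiliar with the skew join split listed above, the general idea can be sketched in a few lines. The snippet follows the detection rule that Apache Spark's AQE documents (a partition counts as skewed when it is larger than a factor times the median partition size and also larger than an absolute threshold), then splits the skewed partition into roughly target-sized chunks. It is not spark-rust's 4-phase build-side replication, whose details are not described in this article; the function names and the choice of split target are assumptions.

```python
import math
from statistics import median
from typing import Dict, List

MIB = 1024 * 1024
SKEW_FACTOR = 5.0           # mirrors Spark's skewedPartitionFactor default
SKEW_THRESHOLD = 256 * MIB  # mirrors Spark's skewedPartitionThresholdInBytes default
TARGET_SPLIT = 64 * MIB     # assumed split target (Spark's advisory partition size default)


def find_skewed(sizes: List[int]) -> List[int]:
    """Indices of partitions that are large both relative to the median
    and in absolute terms."""
    med = median(sizes)
    return [
        i for i, size in enumerate(sizes)
        if size > SKEW_FACTOR * med and size > SKEW_THRESHOLD
    ]


def plan_splits(sizes: List[int]) -> Dict[int, int]:
    """Map each skewed partition index to the number of sub-tasks to split it into."""
    return {i: math.ceil(sizes[i] / TARGET_SPLIT) for i in find_skewed(sizes)}


# Hypothetical shuffle partition sizes: one 900 MiB outlier among ~45 MiB peers.
sizes = [40 * MIB, 50 * MIB, 45 * MIB, 900 * MIB, 48 * MIB]
print(plan_splits(sizes))  # → {3: 15}
```

In Spark's own AQE, each chunk of the split side is then joined against a duplicated read of the matching partition on the other side, so the join result is unchanged while the work is spread over more tasks.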
The same engine surface is reachable through three entry points, all pulling from one Rust codebase:
```python
# pip install sparkrust
import sparkrust

con = sparkrust.connect()
con.sql("CREATE VIEW sales AS SELECT * FROM read_parquet('sales/*.parquet')")
df = con.sql("SELECT region, SUM(revenue) AS rev FROM sales GROUP BY region")

# Pick your result shape:
df.show()            # PySpark-style tabular print
rows = df.collect()  # list of tuples
table = df.arrow()   # pyarrow.Table (zero-copy ready)
pdf = df.toPandas()  # pandas DataFrame
```

Or with a PySpark-flavored entry point for code transitioning from PySpark:

```python
from sparkrust import SparkSession

spark = SparkSession()  # in-process, no Connect server needed
spark.sql("SELECT 1 AS x").show()
```

The Python extension module (`sparkrust._sparkrust`) is a CPython
abi3-py39 binding over the same Rust query engine. No JVM, no
network hop, no daemon: the engine runs inside your Python process.
Arrow data crosses the FFI boundary via Arrow IPC; conversions to
pandas/pyarrow are zero-copy where possible.
```python
# Same `spark-connect-server` binary, run it locally
# (it's the same engine; this mode is useful when you want to
# share one engine between multiple Python/Notebook/Java clients)
import pyspark.sql

spark = pyspark.sql.SparkSession.builder.remote("sc://localhost:50051").getOrCreate()
df = spark.read.parquet("s3://my-bucket/sales/")
df.groupBy("region").agg({"revenue": "sum"}).show()
```

```python
# Same code, point at a cluster driver
spark = pyspark.sql.SparkSession.builder.remote("sc://driver.example.com:50051").getOrCreate()
# ...identical pipeline, now executed across N executors...
```

The driver routes operations through Spark Connect; executors receive Substrait-encoded plans and execute them locally with their own Arrow-batched DataFusion. Local-disk shuffle, gRPC fetch over Arrow IPC, the same memory manager rules apply on both sides.
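To illustrate the shuffle step mentioned above, here is a minimal sketch of hash partitioning, the standard way a map-side task decides which reducer receives each row. It is plain Python rather than the engine's Arrow-batched Rust, and it uses `zlib.crc32` as a stand-in hash: Spark itself uses Murmur3, and which hash spark-rust uses is not stated in this article. The row shape and function names are hypothetical.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Row = Tuple[str, int]  # (key, value); a hypothetical row shape for illustration


def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic partition id for a key (crc32 is a stand-in hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def shuffle_write(rows: Iterable[Row], num_partitions: int) -> Dict[int, List[Row]]:
    """Map side: bucket rows by target partition, as a shuffle writer
    would before handing them to local disk."""
    buckets: Dict[int, List[Row]] = defaultdict(list)
    for key, value in rows:
        buckets[partition_for(key, num_partitions)].append((key, value))
    return buckets


def reduce_sum(rows: Iterable[Row]) -> Dict[str, int]:
    """Reduce side: aggregate the rows fetched for one partition."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in rows:
        totals[key] += value
    return dict(totals)


rows = [("emea", 10), ("apac", 5), ("emea", 7), ("amer", 3), ("apac", 1)]
buckets = shuffle_write(rows, num_partitions=4)

merged: Dict[str, int] = {}
for part_rows in buckets.values():
    # A key never spans two partitions, so merging per-partition results is safe.
    merged.update(reduce_sum(part_rows))
print(merged)
```

Because every row with the same key hashes to the same partition, each reducer can finish its aggregation without coordinating with the others.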
What this means in practice: you can prototype in-process with the Python API on a laptop, then deploy the exact same SQL or DataFrame code through Spark Connect to a cluster without changing a line.
Expect AI-native features baked into spark-rust going forward:
- Natural-language SQL: ask for queries in English, get them planned and run.
- Self-tuning execution: the engine learns from prior runs (cost-model population, plan-shape histograms, adaptive bloom-filter gating) and applies it to new queries without manual configuration.
- AI-assisted plan explanation: ask why a query is slow and get a plain-language summary of the physical plan + per-operator metrics.
- Embedded model serving: load a model artifact alongside parquet and call it from SQL.
These are not in the v0 release. They're the natural next direction once the core engine ships.
spark-rust is currently in private development. An open-source release is
planned for the near future at
github.com/tomz/spark-rust. The
instructions in this article (`cargo build`, the PySpark client examples)
will be runnable from the public repository once that lands.
If you're interested in early access or a specific use case, reach out before then.
- In-process Python: `pip install sparkrust`, then `import sparkrust; con = sparkrust.connect(); con.sql("...").show()`. No daemon, no JVM, no network: the engine runs inside your Python process.
- Single-node Connect server: clone `github.com/tomz/spark-rust`, `cargo build --release -p spark-connect-server`, then `pyspark --remote sc://localhost:50051` against a local parquet directory.
- Topology dashboard: launch the connect server, run `bash scripts/topology-demo.sh` (or `topology-demo-driver` / `topology-demo-shuffle` / `topology-demo-heavy` directly), then open the history-server dashboard URL.
- MLlib demo: `examples/demo/pyspark/connect_ml.py`, in which unmodified PySpark fits a `LogisticRegression` against our engine.
- TPC-DS bench: `docs/tpcds-single-node-benchmark.md` for the numbers above plus the harness to reproduce them.
- Spark SQL coverage: `docs/golden-report.json` is the live golden test result (7,177 / 9,508 ok, refreshed each commit).
spark-rust is an advanced MVP of a Rust-native Spark engine that runs
three ways from one codebase: as an in-process Python module (think
DuckDB ergonomics — pip install sparkrust), as a single-node Spark
Connect server, and as a distributed Spark Connect cluster. It tracks
within 2-3x of DuckDB single-node at TPC-DS SF=1000 (and runs the full
99-query sweep, including queries with disk-limit sensitivity), passes
75.5% of Apache Spark's own SQL test suite, ships MLlib via the
Connect protocol so unmodified PySpark ML pipelines work, ships a
Structured Streaming engine with watermarks, state stores, and
Kafka/file/socket sources, and ships a real-time topology dashboard
that visualizes data flow across the cluster as queries run — all in
73 MiB of dynamically-linked binary.
If you've valued DuckDB's ergonomics for embedded analytics and Apache Spark's API surface for distributed compute, spark-rust offers both shapes in a single Rust-native binary.