# The Bleeding Edge: SmallPond

So if DuckDB excels at single-node processing, even up to terabytes of data, and Ray excels at multi-node processing, particularly for the last mile, surely there's something to be gained by distributing DuckDB processing across a cluster with Ray!

That was probably the thought behind DeepSeek's recent (2 March) announcement of `smallpond` (https://github.com/deepseek-ai/smallpond). Combined with their 3FS ("Fire-Flyer File System") distributed file system, this is a distributed data processing system meant for terabytes to petabytes of data.

And it all sounds great...
... until you try to use it.

As several blog posts point out, you probably don't need this, and in most cases it's probably slower than DuckDB or Spark. As http://definite.app/blog/smallpond notes,

> Is smallpond for me?
>
> tl;dr: probably not.
>
> Whether you'd want to use smallpond depends on several factors:
>
> Your Data Scale: If your dataset is under 10TB, smallpond adds unnecessary complexity and overhead.

Similarly, https://dataengineeringcentral.substack.com/p/smallpond-distributed-duckdb notes:

> It seems very early stage to me, reminds of a recent incubating Apache Project we looked at. Tons of work to do in the areas of …
>
> - documentation
> - expanded functionality (read/write/transform)
> - increased usability
> - first class cloud integration (like s3)

So. There's basically *no* documentation. It doesn't read directly from S3. It doesn't read Hive-partitioned data directly and easily (there's some related code in there, but I couldn't get it to do what I asked, as it only wanted to partition on keys with base-10 integer values).

So we're not going to do much with it beyond a simple test and some discussion. In fact, because it doesn't directly handle Hive partitioning, the first thing we're going to do is undo much of our good work by unhiving our previous hive, that is, flattening the partitioned dataset back into a single Parquet file.

In [None]:
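import duckdb

# Read every file in the hive-partitioned dataset (partition columns are recovered
# from the directory names) and write it back out as a single flat Parquet file.
duckdb.sql("""
COPY
(SELECT * FROM read_parquet(['../3_etl_with_duckdb_part_2/minilob/**/*parquet'], hive_partitioning=true))
TO 'minilob.parquet'
(FORMAT 'parquet', COMPRESSION 'zstd')
""")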

In [None]:
import polars as pl

# Sanity check: peek at the first few rows of the flattened file.
print(
    pl
    .scan_parquet("minilob.parquet")
    .limit(10)
    .collect())

In [None]:
import time

# let's make a simple little profiler
class Print_time:
    def __enter__(self):
        self.start = time.perf_counter()
        return self
    
    def __exit__(self, exc_type, exc_value, traceback):
        self.end = time.perf_counter()
        self.elapsed = self.end - self.start
        print(f"Time taken: {self.elapsed:.4f} seconds")
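
As an aside, the same idea can be sketched with the standard library's `contextlib.contextmanager`. This is only an alternative illustration, not something smallpond or the rest of this notebook depends on; the name `timed` and its `label` and `sink` parameters are made up for this example.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, sink=print):
    """Time the enclosed block and report it through `sink` (hypothetical helper)."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Report even if the block raised, so failed runs are still timed.
        elapsed = time.perf_counter() - start
        sink(f"{label}: {elapsed:.4f} seconds")


# Usage: the label makes it easier to tell several timings apart.
with timed("sum of squares"):
    total = sum(i * i for i in range(100_000))
```

Passing a different `sink` (for example `some_list.append`) collects the timings instead of printing them, which is handy when comparing several runs.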


In [None]:
import smallpond

# Connect to the already-running Ray cluster.
sp = smallpond.init(ray_address="auto")

df = sp.read_parquet("minilob.parquet")

# The magic of smallpond lies in partitioning. Hashing on symbol would send all rows for a given
# symbol to the same partition, so that e.g. CSCO would run on one node, LSTR on another, etc.
# However, this will likely fill up all available memory in the workshop because we are running on laptops!
# df = df.repartition(6, hash_by="symbol")

# So instead, partition evenly by row count. Try running once as written,
# then repartition into two sets and compare.
df = df.repartition(6, by_rows=True)
df = sp.partial_sql(
    "SELECT symbol, min(buyside_price), max(buyside_price) FROM {0} GROUP BY symbol",
    df,
    enable_temp_directory=True,
)

# Show results
with Print_time():
    print(df.to_pandas().head())
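
To make the difference between the two partitioning strategies concrete, here is a minimal pure-Python sketch. It does not use smallpond at all: the sample rows are made up, the helper names are hypothetical, and it assumes (as I understand `partial_sql`) that the query runs independently on each partition. Under that assumption, partitioning by rows can leave a symbol spread over several partitions, so the per-partition results would still need one final merge, whereas hashing on `symbol` keeps each symbol in exactly one partition.

```python
import zlib

# Made-up sample rows (symbol, buyside_price); not taken from the workshop data.
rows = [
    ("CSCO", 50.0), ("LSTR", 170.0), ("CSCO", 49.5),
    ("LSTR", 171.5), ("CSCO", 51.0), ("LSTR", 169.0),
]


def partition_by_rows(rows, n):
    """Split rows into at most n contiguous chunks, ignoring their values."""
    size = -(-len(rows) // n)  # ceiling division
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def partition_by_hash(rows, n):
    """Send every row with the same symbol to the same partition."""
    parts = [[] for _ in range(n)]
    for symbol, price in rows:
        # crc32 is used because it is stable across runs, unlike hash() on str.
        parts[zlib.crc32(symbol.encode()) % n].append((symbol, price))
    return parts


def min_max_per_symbol(part):
    """Per-partition analogue of SELECT symbol, min(price), max(price) ... GROUP BY symbol."""
    out = {}
    for symbol, price in part:
        lo, hi = out.get(symbol, (price, price))
        out[symbol] = (min(lo, price), max(hi, price))
    return out


def merge(partials):
    """Combine per-partition results into one global answer."""
    out = {}
    for partial in partials:
        for symbol, (lo, hi) in partial.items():
            cur_lo, cur_hi = out.get(symbol, (lo, hi))
            out[symbol] = (min(cur_lo, lo), max(cur_hi, hi))
    return out


by_rows = [min_max_per_symbol(p) for p in partition_by_rows(rows, 2)]
by_hash = [min_max_per_symbol(p) for p in partition_by_hash(rows, 2)]

print(by_rows)         # a symbol can show up in several partial results
print(by_hash)         # each symbol shows up in exactly one partial result
print(merge(by_rows))  # a final merge recovers the global min/max
```

If the assumption holds, the row-partitioned run in the cell above may return more than one row per symbol, and a final aggregation over the result (for instance in pandas or DuckDB) would be needed to get a single min/max per symbol.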
