picomap makes it easy to store and load datasets for machine learning. It is tiny (<200 LOC) but works well whenever I have a non-standard dataset and want efficient loading.
✅ Fast — writes arrays directly to disk in binary form
✅ Reproducible — per-item hashing for content verification
✅ Simple — one Python file, only dependencies are numpy, xxhash, and tqdm. Tbh you probably don't need the last two.
```
pip install picomap
```

```python
import numpy as np
import picomap as pm

# Build a ragged dataset from a generator of arrays
lens = np.random.randint(16, 302, size=(101,))
arrs = [np.random.randn(l, 4, 16, 3) for l in lens]
pm.build_map(arrs, "ds/test")
assert pm.verify_hash("ds/test.dat")

# Load individual items on demand
load, N = pm.get_loader_fn("ds/test.dat")
for i in range(N):
    assert np.allclose(arrs[i], load(i))
```

This writes three files and creates the directory `ds`:
```
ds/test.dat         # raw binary data
ds/test.starts.npy  # index offsets
ds/test.json        # metadata + hash
```
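If you want to peek at what ends up on disk, the sidecars are an ordinary `.npy` array and a JSON file. The sketch below only prints what's there rather than assuming any particular metadata schema:

```python
import json
import numpy as np

# The offsets sidecar is a standard .npy file of per-item start positions.
starts = np.load("ds/test.starts.npy")
print(len(starts), "offsets, first few:", starts[:5])

# The metadata sidecar is plain JSON; field names may vary between versions,
# so just list the keys instead of relying on a specific layout.
with open("ds/test.json") as f:
    meta = json.load(f)
print(sorted(meta.keys()))
```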
| Function | Purpose |
|---|---|
| `build_map(gen, path)` | Stream arrays → build dataset on disk |
| `verify_hash(path)` | Recompute & validate hash |
| `get_loader_fn(path)` | Return `(loader_fn, count)` for random access |
| `update_hash_with_array(h, arr)` | Internal helper (streamed hashing) |
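Because `build_map` streams its input, you can pass a generator instead of a pre-built list and avoid holding the whole dataset in memory. A minimal sketch (the shapes and the `ds/streamed` path are just illustrative):

```python
import numpy as np
import picomap as pm

def make_items(n=1000, seed=0):
    """Yield ragged arrays one at a time so nothing is held in memory."""
    rng = np.random.default_rng(seed)
    for _ in range(n):
        length = rng.integers(16, 302)           # ragged first dimension
        yield rng.standard_normal((length, 4, 16, 3))

# Stream the generator straight to disk, then verify the content hash.
pm.build_map(make_items(), "ds/streamed")
assert pm.verify_hash("ds/streamed.dat")

load, N = pm.get_loader_fn("ds/streamed.dat")
print(N, load(0).shape)
```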
- All arrays must share the same dtype and trailing dimensions.
- The first dimension can be ragged across the dataset (i.e., you can have sequences with shapes `(*, d1, d2, ..., dn)`).
- Use `load(i, copy=True)` to materialize a slice if you need to modify it; I generally copy the tensor to the GPU in a training loop anyway (see the sketch below).
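For example, a minimal random-access loop that takes a writable copy of each item before touching it (the normalization is just a placeholder for whatever preprocessing or device transfer you actually do):

```python
import numpy as np
import picomap as pm

load, N = pm.get_loader_fn("ds/test.dat")
rng = np.random.default_rng(0)

for i in rng.permutation(N)[:8]:        # visit a few items in random order
    x = load(int(i), copy=True)         # copy=True → safe to modify in place
    x -= x.mean()                       # placeholder preprocessing
    x /= x.std() + 1e-8
    # ... feed x to your model / move it to the GPU here
```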
