Reason or Problem
`_write_vrt_tiled` in `xrspatial/geotiff/__init__.py` (line 1708) writes a dask-backed DataArray to a directory of tiled GeoTIFFs by building one `dask.delayed` task per tile, then executing them all with:

```python
dask.compute(*delayed_tasks, scheduler='synchronous')
```
The synchronous scheduler runs every task one at a time on the current thread. Tile writes are independent (different chunks, different output files, no shared mutable state inside `_write_single_tile`), so parallelism is left on the table.
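For concreteness, a minimal self-contained sketch of that pattern follows; the trivial tile writer, chunk list, and output paths are placeholders, not the actual xrspatial internals:

```python
import os
import tempfile

import dask
import numpy as np


def _write_single_tile(chunk, path):
    # Placeholder for the real per-tile GeoTIFF writer: each task gets its own
    # chunk and its own output file, so tasks share no mutable state.
    np.save(path, chunk)


out_dir = tempfile.mkdtemp()
chunks = [np.random.rand(256, 256).astype('float32') for _ in range(16)]
tile_paths = [os.path.join(out_dir, f'tile_{i}') for i in range(16)]

delayed_tasks = [
    dask.delayed(_write_single_tile)(chunk, path)
    for chunk, path in zip(chunks, tile_paths)
]

# Current behaviour: the synchronous scheduler writes every tile serially on
# the calling thread, even though the tasks are independent.
dask.compute(*delayed_tasks, scheduler='synchronous')
```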
Microbench on a single 16-thread machine, 4096x4096 float32 random data, chunks=256, zstd compression, 256 output tiles:
| Scheduler | Wall time |
| --- | --- |
| synchronous (current) | 0.49 s |
| threads (monkey-patched) | 0.33 s |
That is a ~33% reduction with zero correctness risk on this path. The gain grows with tile count and with codec cost (zstd level 9 / LERC spend more CPU per tile).
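A rough way to reproduce the comparison is sketched below, using a toy compression-only tile task rather than xrspatial's real write path; absolute numbers will differ from the table, but the relative gap between schedulers is the point.

```python
import time
import zlib

import dask
import numpy as np


def fake_tile_write(chunk):
    # Stand-in for per-tile codec work; zlib at level 9 approximates the
    # zstd/LERC CPU cost and, like those codecs, releases the GIL while compressing.
    return len(zlib.compress(chunk.tobytes(), 9))


chunks = [np.random.rand(256, 256).astype('float32') for _ in range(256)]
tasks = [dask.delayed(fake_tile_write)(chunk) for chunk in chunks]

for scheduler in ('synchronous', 'threads'):
    start = time.perf_counter()
    dask.compute(*tasks, scheduler=scheduler)
    print(f'{scheduler:12s} {time.perf_counter() - start:.2f} s')
```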
Proposal
Switch to the threaded scheduler explicitly:
```python
dask.compute(*delayed_tasks, scheduler='threads')
```
Considerations:
- `_write_single_tile` opens a fresh file path per tile and never mutates shared Python state, so threading is safe.
- zstd/zlib/LZW release the GIL during compression, so threading delivers real parallelism on the compression stage.
- File-write concurrency is bounded by the dask thread-pool default (a few threads on a typical box). Local filesystems handle that fine; the OS write-back cache coalesces the writes.
- On dask+cupy data, each thread lands in `chunk_data.get()` independently. cupy releases the GIL on D2H transfers, so threading is still safe.
If a future caller wants to write to a slow networked filesystem and serialise writes intentionally, the scheduler should be exposed as an argument rather than left to the environment: an explicit `scheduler=` keyword passed to `dask.compute` takes precedence over `DASK_SCHEDULER=synchronous`, so an environment override alone would not take effect. A sketch of that escape hatch follows.
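A minimal sketch of the escape hatch, assuming a keyword argument on the write helper (the simplified signature and the default are hypothetical, not the current xrspatial API):

```python
import dask


def _write_vrt_tiled(delayed_tasks, scheduler='threads'):
    # Hypothetical simplified signature: default to the threaded scheduler, but
    # let callers pass scheduler='synchronous' to serialise writes deliberately.
    # The explicit keyword wins over DASK_SCHEDULER / dask.config, so this
    # override is reliable where the environment variable would not be.
    dask.compute(*delayed_tasks, scheduler=scheduler)
```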
Acceptance criteria
- `_write_vrt_tiled` uses `scheduler='threads'`.
- Existing tests in `xrspatial/geotiff/tests/test_vrt_tiled_metadata_1606.py`, `test_polish_1488.py`, and the dask-backed VRT write paths continue to pass.
- Microbench (`to_geotiff(da, 'out.vrt', compression='zstd')` on a 4096x4096, chunks=256 dask DataArray) shows a wall-time reduction comparable to the ~33% measured above; a timing sketch follows this list.
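A rough timing sketch for that check, assuming `to_geotiff` is importable from `xrspatial.geotiff` and accepts the `compression=` keyword as quoted above (both are assumptions about the package API):

```python
import time

import dask.array as dsa
import xarray as xr

from xrspatial.geotiff import to_geotiff  # import path assumed from the issue text

# 4096x4096 float32 dask-backed DataArray with 256x256 chunks, as in the microbench.
data = dsa.random.random((4096, 4096), chunks=256).astype('float32')
agg = xr.DataArray(data, dims=('y', 'x'), name='data')

start = time.perf_counter()
to_geotiff(agg, 'out.vrt', compression='zstd')  # call form quoted from the acceptance criteria
print(f'wall time: {time.perf_counter() - start:.2f} s')
```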
Context
Found via a deep-sweep performance audit on 2026-05-12. Cat 2 (Dask chunking): a synchronous scheduler on an embarrassingly parallel write loop.
The original code from #1083 / #1085 (May 2025) used `synchronous` without a documented rationale; the comment block above the call does not mention threading. It looks like a defensive default that no longer needs to be defensive.