Numba/CUDA projection kernels for reproject, README update #1046
Merged
brendancol merged 49 commits into geotiff-reader-writer on Mar 23, 2026
Conversation
The intro, GDAL caveat, and docs note were written when the library was much smaller. Now there are 100+ functions, native GeoTIFF I/O, 4-backend dispatch, and 33+ user guide notebooks. Updated the prose to say what the library actually does instead of underselling it.
:gpu: doesn't render on GitHub -- switched to 🖥️. Added a quick start snippet showing the read_geotiff -> reproject -> write_geotiff workflow.
Switched quick start to 'import xrspatial as xrs', moved the read/reproject/write flow into the main example using the .xrs accessor, and simplified the standalone function example to use the xrs namespace.
Three fixes to _grid.py:
- Resolution estimation now transforms each axis independently and uses the geometric mean for square pixels (was a diagonal step)
- Edge sampling increased from 21 to 101 points, plus a 21x21 interior grid for better bounds coverage
- ceil() replaced with round() in the dimension calculation to prevent floating-point noise from adding spurious rows/columns
Ports six projections from the PROJ library to Numba (CPU) and Numba CUDA (GPU), bypassing pyproj for common CRS pairs:
- Web Mercator (EPSG:3857) -- spherical, 3 lines per direction
- Transverse Mercator / UTM (326xx, 327xx, 269xx) -- 6th-order Krueger series (Karney 2011), closed-form forward and inverse
- Ellipsoidal Mercator (EPSG:3395) -- Newton inverse
- Lambert Conformal Conic (e.g. EPSG:2154) -- Newton inverse
- Albers Equal Area (e.g. EPSG:5070) -- authalic latitude series
- Cylindrical Equal Area (e.g. EPSG:6933) -- authalic latitude series
CPU Numba kernels are 6-9x faster than pyproj; CUDA kernels are 40-165x faster. Unsupported CRS pairs fall back to pyproj: _transform_coords tries Numba first, then pyproj, and the CuPy chunk worker tries CUDA first, keeping coordinates on-device.
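The spherical Web Mercator pair really is about three lines per direction. A minimal NumPy sketch (the real kernels are Numba-jitted; the function names here are illustrative, not the PR's):

```python
import numpy as np

R = 6378137.0  # WGS84 semi-major axis; spherical Web Mercator ignores flattening

def webmerc_forward(lon_deg, lat_deg):
    # lon/lat degrees -> EPSG:3857 metres
    lam, phi = np.radians(lon_deg), np.radians(lat_deg)
    return R * lam, R * np.log(np.tan(np.pi / 4 + phi / 2))

def webmerc_inverse(x, y):
    # EPSG:3857 metres -> lon/lat degrees
    lon = np.degrees(x / R)
    lat = np.degrees(2 * np.arctan(np.exp(y / R)) - np.pi / 2)
    return lon, lat
```

The closed-form forward/inverse is what makes this projection so cheap to fuse into a per-pixel kernel.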
Compares xrspatial (approx, exact, numba) against rioxarray on synthetic grids and real-world GeoTIFFs. Measures both performance (median of 5 runs) and pixel-level consistency (RMSE, R^2, NaN agreement) by forcing both libraries onto the same output grid.
Updated Reproject description and added a table listing the six projections with native Numba CPU and CUDA GPU kernels.
Three new CPU Numba projections for bathymetric/ocean use cases:
- Sinusoidal (ellipsoidal): MODIS ocean/land products. Uses pj_mlfn meridional arc length with a 5th-order series. Forward: sub-micrometer vs pyproj. Roundtrip: exact.
- Polar Stereographic (N/S): IBCAO Arctic (3996/3413), IBCSO Antarctic (3031), UPS. Iterative inverse (15 iterations max). Forward: sub-nanometer. Roundtrip: exact.
- LAEA: implemented but disabled in dispatch pending investigation of a ~940m oblique-mode error (the kernels stay in the file for a future fix, just not wired into the dispatch table).
The oblique-mode LAEA had xmf and ymf swapped (rq/dd vs rq*dd). Research agent found the bug by comparing against PROJ's laea.cpp source. Forward accuracy is now sub-millimeter vs pyproj.
State Plane zones that use Transverse Mercator (Maine, New Hampshire, Wisconsin, etc.) now hit the Numba fast path. Uses the same Krueger 6th-order series as UTM, with a Zb offset for non-zero lat_0. Meter-based zones only; US survey foot zones fall back to pyproj.
State Plane zones in us-ft (136 zones) and ft (28 zones) now use the Numba fast path. The Krueger/LCC kernels compute in metres internally, then divide by the unit conversion factor (0.3048006 for us-ft, 0.3048 for ft) on output. x_0/y_0 from PROJ4 dicts are already in metres regardless of the units parameter.
…#1045)
- Benchmark table showing Numba vs pyproj for all 12 supported projections
- Fixed dispatch to work with custom PROJ strings (no EPSG code), needed for Sinusoidal and other non-registered CRS definitions
- Fixed _utm_params to handle a None epsg_code
All 12 projections now have GPU CUDA kernels. Performance on an A6000:
- Sinusoidal: 18ms (56x vs pyproj)
- LAEA Europe: 18ms (92x)
- Polar Stere: 57ms (64-67x)
- State Plane tmerc: 23ms (88x)
- State Plane lcc ftUS: 36ms (124x)
- LCC France: 39ms (78x)
All bit-exact against the CPU Numba kernels. Updated the README benchmark table and projection support matrix.
CRS on non-WGS84/GRS80 ellipsoids (Airy for BNG, Clarke 1866 for NAD27, Bessel for Tokyo, etc.) now fall back to pyproj instead of silently skipping the datum transformation. Without this, BNG had ~100m error, NAD27 ~24m, Tokyo ~900m. Each _*_params() function now checks _is_wgs84_compatible_ellipsoid() before returning parameters. EPSG-specific paths (UTM, Web Mercator) were already safe since they only match WGS84/NAD83 codes.
NAD27 (EPSG:4267) sources and targets now go through the Numba fast path. After the projection kernel runs in WGS84, a 3-parameter geocentric Helmert transform (dx=-8, dy=160, dz=176 for Clarke 1866) shifts coordinates to/from NAD27. Accuracy: mean 2.9m, p95 5.8m vs pyproj across CONUS. This matches PROJ's own accuracy when NADCON grids aren't installed. The framework is extensible to other datums by adding entries to _DATUM_PARAMS. Also broadened geographic CRS detection from WGS84/NAD83-only to include NAD27, so NAD27 -> UTM and NAD27 -> State Plane dispatch correctly.
Native CUDA nearest, bilinear, and cubic (Catmull-Rom) resampling kernels replace cupyx.scipy.ndimage.map_coordinates. When the CUDA projection path produces on-device coordinates, the entire pipeline now stays on GPU with no CPU roundtrip.
Full reproject pipeline (2048x2048, bilinear, 4326 -> UTM):
- GPU end-to-end: 78ms
- CPU Numba: 161ms
- Speedup: 2.1x
Merges 4 overlapping WGS84 tiles and compares timing and pixel-level consistency against rioxarray.merge_arrays. Baseline results (xrspatial is currently slower on merge):
- 512x512 tiles: 59ms vs 56ms (1.1x)
- 1024x1024: 293ms vs 114ms (2.6x)
- 2048x2048: 2.52s vs 656ms (3.8x)
Consistency: RMSE < 0.012, NaN agreement > 99.8%.
Two optimizations that make merge 4.5-7.3x faster:
1. Same-CRS tiles skip reprojection entirely. When source and target CRS match, tiles are placed into the output grid by direct coordinate alignment (array slicing). Falls back to full reprojection if resolutions differ by more than 1%.
2. try_numba_transform now bails before allocating coordinate arrays when neither CRS side is a supported geographic system. This saved ~100ms per tile for unsupported pairs.
Merge benchmark (4 overlapping WGS84 tiles):
- 512x512: 13ms (was 59ms, now 2.3x faster than rioxarray)
- 1024x1024: 48ms (was 293ms, now 2.6x faster than rioxarray)
- 2048x2048: 344ms (was 2520ms, now 1.6x faster than rioxarray)
…bles (#1045) README now shows full pipeline times (transform + resampling) and merge times, both compared against rioxarray. More useful than the previous coordinate-transform-only table since users care about total wall time.
For dask+cupy inputs, eagerly compute the source to GPU memory and run the in-memory CuPy reproject in a single pass. This avoids the per-chunk overhead of 64+ dask.delayed calls, each creating a pyproj Transformer and launching small CUDA kernels.
- Before: 958ms (64 delayed chunks, 512x512 each)
- After: 43ms (single CuPy pass, pixel-exact same output)
- Speedup: 22x
The output is a plain CuPy array. For truly out-of-core GPU data that doesn't fit in GPU memory, the old dask.delayed path remains available by passing the data as dask+numpy.
Replaces the eager .compute() approach with a chunked GPU pipeline that fetches only the needed source window per output chunk. This handles sources larger than GPU memory while still being 8-20x faster than the old dask.delayed path. The key optimizations vs dask.delayed:
- CRS objects and transformer created once (not per chunk)
- CUDA projection + native CUDA resampling per chunk
- Default 2048x2048 GPU chunks (not 512x512)
- Sequential loop avoids dask scheduler overhead
Performance (4096x4096 WGS84 -> UTM, bilinear):
- CuPy single pass: 34ms
- Dask+CuPy (2048): 49ms (was 958ms)
- Dask+CuPy (512): 71ms
- Dask+CuPy (256): 124ms
All chunk sizes are pixel-exact vs plain CuPy (max_err < 1e-11).
Vendored two NOAA shift grids into the package (306KB total): - us_noaa_conus.tif: NADCON classic (121x273, 0.25° resolution) - us_noaa_nadcon5_nad27_nad83_1986_conus.tif: NADCON5 (105x237) The grid loader checks the vendored directory first, then a user cache, then downloads from the PROJ CDN as a last resort. Numba JIT bilinear interpolation applies the lat/lon arc-second offsets per pixel, with an iterative inverse for target->source direction. When a grid covers the data, it replaces the Helmert shift (which had ~3-5m accuracy). The grid gives sub-meter accuracy matching PROJ with NADCON grids installed. Points outside grid coverage fall back to Helmert automatically.
Shipped 8.4MB of NOAA/NTv2 shift grids covering:
- US: NAD27 CONUS (NADCON + NADCON5), Alaska, Hawaii, Puerto Rico
- UK: OSGB36 -> ETRS89 (Ordnance Survey OSTN15)
- Australia: AGD66 -> GDA94 (NT region)
- Europe: Germany (DHDN), Austria (MGI), Spain (ED50, eastern coast), Netherlands (RD), Belgium (BD72), Switzerland (CH1903), Portugal (D73)
Added Helmert fallback parameters for all 14 datums plus Tokyo. Grid lookup automatically selects the best grid covering a point, falling back to Helmert for regions without grid coverage. All grids are Public Domain or Open Government Licence.
New public API for ellipsoidal <-> orthometric height conversion:
geoid_height(lon, lat)              # geoid undulation N (metres)
geoid_height_raster(da)             # N for every pixel
ellipsoidal_to_orthometric(h, ...)  # GPS height -> map height
orthometric_to_ellipsoidal(H, ...)  # map height -> GPS height
depth_to_ellipsoidal(depth, ...)    # chart datum depth -> ellipsoidal
ellipsoidal_to_depth(h, ...)        # ellipsoidal -> chart datum depth
Vendored EGM96 global geoid model (2.6MB, 15-arcmin / ~28km grid, 721x1440 pixels). EGM2008 (77MB, 2.5-arcmin) available via CDN download on first use. Numba JIT bilinear interpolation with longitude wrapping for the global grid. Performance: 80 Mpix/s on CPU (1M points in 12ms).
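Bilinear sampling with longitude wrapping can be sketched as below, assuming the EGM96 layout described above (721x1440 at 0.25°, rows running north to south). The function name, grid origin, and NaN-at-the-poles behaviour are illustrative assumptions, not the library's actual internals:

```python
import numpy as np

def interp_geoid(grid, lon, lat, lat0=90.0, dlat=0.25, lon0=0.0, dlon=0.25):
    """Bilinear sample of a global geoid grid with longitude wrapping.
    Rows run north->south from lat0; columns run east from lon0 and wrap
    modulo 360 degrees. Returns NaN outside the grid's latitude rows."""
    nrows, ncols = grid.shape
    r = (lat0 - lat) / dlat
    c = ((lon - lon0) % 360.0) / dlon
    r0 = int(np.floor(r))
    if r0 < 0 or r0 + 1 >= nrows:
        return float('nan')      # no extrapolation beyond the polar rows
    c0 = int(np.floor(c))
    fr, fc = r - r0, c - c0
    c1 = (c0 + 1) % ncols        # wrap across the antimeridian
    c0 %= ncols
    top = grid[r0, c0] * (1 - fc) + grid[r0, c1] * fc
    bot = grid[r0 + 1, c0] * (1 - fc) + grid[r0 + 1, c1] * fc
    return float(top * (1 - fr) + bot * fr)
```

The modulo on the column index is what lets a query just west of 0° interpolate between the last and first columns instead of falling off the grid.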
14-parameter Helmert (7 static + 7 rates) for converting between
ITRF frames at a given observation epoch. Parameters parsed from
the PROJ data files shipped with pyproj.
Available frames: ITRF2000, ITRF2008, ITRF2014, ITRF2020 (and
all intermediate frames back to ITRF89).
Usage:
lon2, lat2, h2 = itrf_transform(
-74.0, 40.7, 10.0,
src='ITRF2014', tgt='ITRF2020', epoch=2024.0,
)
Typical shifts: 2-4mm horizontal, 1-3mm vertical between
ITRF2014 and ITRF2020 at epoch 2024. The rates capture tectonic
plate motion (~mm/year) which accumulates over years.
Numba JIT parallelized for batch transforms.
Helmert upgrade (3-param -> 7-param Bursa-Wolf):
- Added rx/ry/rz rotations (arcsec) and ds scale (ppm) to the geocentric datum shift kernel
- Updated OSGB36, DHDN, MGI, ED50, BD72 with published 7-param values from the EPSG dataset
- OSGB36 Helmert fallback improved from 15.73m to 1.44m vs pyproj
- The same kernel handles 3-param datums (rx=ry=rz=ds=0)
Authalic latitude series (3-term -> 6-term):
- Extended _authalic_apa to 6 coefficients (10th-order in e²)
- Updated _authalic_inv and the CUDA _d_authalic_inv to evaluate 5 terms
- Theoretical improvement: sub-mm for the series itself, though the q->beta->phi roundtrip error is dominated by the asin(q/qp) step at high latitudes (~4.8m at 89°, <0.1m at mid-latitudes)
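A small-angle 7-parameter Bursa-Wolf transform can be sketched in plain Python as below (the position-vector sign convention is an assumption here -- coordinate-frame conventions flip the rotation signs -- and the function name is illustrative, not the PR's kernel). With rotations and scale at zero it reduces to the 3-parameter shift:

```python
import math

def bursa_wolf(x, y, z, dx, dy, dz, rx_as=0.0, ry_as=0.0, rz_as=0.0, ds_ppm=0.0):
    """7-parameter Helmert on geocentric coordinates: translations in
    metres, rotations in arc-seconds, scale in ppm (small-angle form,
    position-vector convention). rx=ry=rz=ds=0 gives the 3-param shift."""
    s = math.pi / (180.0 * 3600.0)          # arcsec -> radians
    rx, ry, rz = rx_as * s, ry_as * s, rz_as * s
    m = 1.0 + ds_ppm * 1e-6
    x2 = dx + m * (x - rz * y + ry * z)
    y2 = dy + m * (rz * x + y - rx * z)
    z2 = dz + m * (-ry * x + rx * y + z)
    return x2, y2, z2
```

With all rotations and scale zero and the NAD27 translations quoted earlier (dx=-8, dy=160, dz=176), this is exactly the geocentric shift the earlier commit describes.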
Numba kernels for two additional projections:
- Oblique Stereographic: Gauss conformal sphere + stereographic. The kernel roundtrips perfectly, but the forward differs from PROJ's specific Gauss-Schreiber conformal mapping by ~50km. Needs alignment with PROJ's sterea.cpp double-projection approach.
- Oblique Mercator (Hotine): rotation matrix + Mercator on the oblique cylinder. The forward has errors in the u/v -> x/y rotation. Needs closer alignment with PROJ's omerc.cpp variant handling.
Both kernels are implemented and compile, but are disabled in the dispatch table pending accuracy fixes; they fall through to pyproj. Also: Equidistant Conic skipped -- it has zero EPSG definitions in the PROJ database and is essentially unused in practice.
The oblique stereographic now uses the correct PROJ double projection:
1. Gauss conformal mapping: phi -> chi via scaling factor C = sqrt(1 + e²cos⁴φ₀/(1-e²)) and normalization constant K
2. Standard stereographic on the conformal sphere
Forward accuracy: sub-nanometre vs pyproj. Roundtrip: exact (1.4e-14 degrees).
Also fixed the R scaling: R_metric = a * k0 * R_conformal, where R_conformal is the dimensionless conformal radius from PROJ.
Oblique Mercator (Hotine) remains disabled -- the u/v rotation and variant handling need more work.
#1045) New lcc_inverse_2d and tmerc_inverse_2d kernels take 1D x/y arrays and write directly to 2D output, avoiding:
- np.tile (199ms for 4096x4096)
- np.repeat (357ms for 4096x4096)
- a separate unit-division pass (115ms for ftUS)
The unit conversion (feet -> metres) is fused into the inner loop, operating on scalars instead of 16.8M-element arrays.
California zone 5 ftUS (4096x4096, bilinear):
- Before: 915ms (0.9x vs rioxarray)
- After: 712ms (1.2x vs rioxarray)
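The fused-2D pattern can be sketched generically in plain Python (the real kernels are Numba-jitted per projection; the name, signature, and callback here are illustrative). The point is that the 1D axes are expanded inside the loop, so no tiled 2D coordinate arrays are ever materialised and the unit conversion happens on scalars:

```python
import numpy as np

def inverse_2d(xs, ys, inv_point, to_metre=1.0):
    """Evaluate a per-point inverse projection over the grid defined by
    1D x/y axes, writing straight into 2D outputs. Avoids np.tile/np.repeat
    of full coordinate arrays and fuses the unit conversion (e.g. ftUS ->
    metres) into the scalar inner loop."""
    ny, nx = len(ys), len(xs)
    lon = np.empty((ny, nx))
    lat = np.empty((ny, nx))
    for j in range(ny):
        y_m = ys[j] * to_metre          # convert once per row
        for i in range(nx):
            lon[j, i], lat[j, i] = inv_point(xs[i] * to_metre, y_m)
    return lon, lat
```

Under Numba with parallel=True, the outer row loop is exactly the shape that prange parallelises well.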
Added _norm_lon_rad() and applied it to all inverse functions that compute lam + lon0. Without normalization, projections with non-zero lon0 (e.g. EPSG:3413 Arctic Stere with lon0=-45°) could return longitudes outside [-180, 180], causing 360° errors and false NaN pixels in the source lookup.
Polar Stere N (EPSG:3413) consistency:
- Before: RMSE=8.29, NaN agree=90.4%, 1.1M extra NaN
- After: RMSE=0.008, NaN agree=100%, 79 extra NaN (edge pixels)
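A plausible implementation of a helper like _norm_lon_rad (illustrative sketch; the PR's exact code is not shown here) wraps a longitude in radians back into (-pi, pi]:

```python
import math

def norm_lon_rad(lam):
    """Wrap a longitude in radians into (-pi, pi]."""
    lam = math.fmod(lam, 2.0 * math.pi)
    if lam > math.pi:
        lam -= 2.0 * math.pi
    elif lam <= -math.pi:
        lam += 2.0 * math.pi
    return lam
```

Applied after lam + lon0, an inverse result like 200° (possible when lon0=-45° as in EPSG:3413) becomes -160° instead of landing outside the source raster's coordinate range.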
Changed the out-of-bounds threshold from -0.5/sh-0.5 to -1.0/sh in all resampling kernels (nearest, bilinear, cubic, and CUDA). Pixels within one pixel of the source edge are now clamped to the nearest valid pixel instead of being set to nodata. This matches GDAL/rasterio's boundary convention, fixing 2568 false-NaN pixels at the edges of LAEA Europe reprojections.
- LAEA NaN agreement: 99.8% -> 100.0%
- All other projections: unchanged or improved to 100.0%
Changed bilinear resampling (CPU Numba + CUDA) from clamp-and-use-all to skip-and-renormalize, matching GDAL's GWKBilinearResample4Sample: - Out-of-bounds neighbors: skipped, weight redistributed to valid ones (was: clamped to edge pixel, counted at full weight) - NaN neighbors: skipped, interpolated from remaining valid pixels (was: any NaN contaminates the whole output pixel) This eliminates the one-pixel NaN halo around nodata regions that the old approach produced, and gives mathematically exact results on linear surfaces at raster boundaries. The ~0.017 RMSE vs rioxarray on gradient rasters is unchanged -- it comes from sub-pixel floating-point coordinate differences in the interior, not edge handling.
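The skip-and-renormalize rule can be sketched for a single output pixel (plain Python; the real kernels are Numba/CUDA and this helper name is made up). Each of the four neighbours contributes only if it is in bounds and finite, and the accumulated weight is divided out at the end:

```python
import numpy as np

def bilinear_renorm(src, xf, yf, nodata=np.nan):
    """Bilinear sample at fractional (xf, yf). Out-of-bounds and NaN
    neighbours are skipped and their weight redistributed to the valid
    ones by renormalising; all-invalid neighbourhoods return nodata."""
    x0, y0 = int(np.floor(xf)), int(np.floor(yf))
    dx, dy = xf - x0, yf - y0
    acc, wsum = 0.0, 0.0
    for j, i, w in ((y0,     x0,     (1 - dx) * (1 - dy)),
                    (y0,     x0 + 1, dx * (1 - dy)),
                    (y0 + 1, x0,     (1 - dx) * dy),
                    (y0 + 1, x0 + 1, dx * dy)):
        if 0 <= j < src.shape[0] and 0 <= i < src.shape[1]:
            v = src[j, i]
            if np.isfinite(v):
                acc += w * v
                wsum += w
    return acc / wsum if wsum > 0 else nodata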
Thread safety:
- Added a threading.Lock to the global caches in _datum_grids.py, _vertical.py, and _itrf.py. Concurrent callers no longer race on grid loading or ITRF parameter parsing.
All-NaN raster crash:
- np.nanmin on an all-NaN array returns NaN, and int(NaN) is undefined. Added np.isfinite guards in both the numpy and cupy chunk workers.
uint8 cubic overflow:
- Cubic resampling can ring outside [0, 255] on sharp edges. Added np.clip clamping before the dtype cast for all integer source types (uint8, int16, etc.) across nearest/bilinear/cubic.
Geoid poles:
- _interp_geoid_point now returns NaN (not 0.0) outside the grid's latitude range, preventing silently wrong values at the poles.
Exception narrowing:
- Bare "except Exception: pass" blocks around the Numba/CUDA fast paths narrowed to except (ImportError, ModuleNotFoundError). Actual bugs now propagate instead of silently falling back to pyproj.
New tests:
- test_reproject_1x1_raster: single-pixel source
- test_reproject_all_nan: 100% NaN source
- test_reproject_uint8_cubic_no_overflow: cubic on a uint8 sharp edge
Cubic NaN handling:
- When any of the 16 Catmull-Rom neighbors is NaN, fall back to bilinear with weight renormalization instead of returning nodata. Eliminates the one-pixel nodata halo around NaN regions that cubic resampling previously produced.
Merge strategy tests:
- Added end-to-end tests for the last, max, and min strategies (previously only tested at the internal _merge_arrays_numpy level).
Datum grid validation:
- load_grid() now validates that band shapes match, the grid is at least 2x2, and the bounds are sensible. Invalid grids return None (Helmert fallback) instead of producing garbage.
Documentation:
- reproject() and merge() docstrings now note that the output CRS is stored as WKT in attrs['crs'], and merge() documents CRS selection when target_crs=None.
New parameters src_vertical_crs and tgt_vertical_crs on reproject()
enable automatic vertical datum transformation during reprojection:
reproject(dem, 'EPSG:32633',
src_vertical_crs='EGM96', # input is MSL heights
tgt_vertical_crs='ellipsoidal') # want GPS heights
Supported vertical CRS: 'EGM96', 'EGM2008', 'ellipsoidal'.
Implementation:
- After horizontal reprojection, output pixel coordinates are
inverse-projected to geographic (lon/lat) for the geoid lookup
- Geoid undulation N is interpolated from the vendored EGM96 grid
- Heights are adjusted: h = H + N (ortho->ell) or H = h - N (ell->ortho)
- Processes in 128-row strips to bound memory (12MB peak vs 768MB
for a 4096x4096 raster)
- Zero roundtrip error (ortho -> ell -> ortho recovers exact input)
Also handles cross-geoid transforms (EGM96 -> EGM2008) by
applying both shifts: H2 = H1 + N1 - N2.
…1045) For supported CRS pairs, the chunk worker now tries the Numba fast path BEFORE creating a pyproj.Transformer. If Numba succeeds (which it does for all 10+ supported projections), the Transformer is never created, saving ~15ms of pyproj setup per chunk. Before: 2 Transformer.from_crs calls per reproject (grid + chunk) After: 1 call (grid only, ~500 points for boundary estimation) The grid computation still uses pyproj for boundary transforms since it's only ~500 points and runs once. The per-pixel transform (millions of points) is handled entirely by Numba.
…1045) README Reproject section updated:
- All 11 projection families listed (added oblique stereographic)
- Full pipeline benchmark table (read+reproject+write, all backends)
- Datum shift coverage (14 grids, 10 Helmert fallbacks)
- Vertical datum support (EGM96/EGM2008, integrated into reproject)
- ITRF time-dependent frame transforms
- pyproj usage documented (metadata only, Numba does the math)
- Merge performance table updated
benchmarks/reproject_benchmark.md:
- 254-line comprehensive benchmark document with 6 sections
- Full pipeline: NumPy 2.7s, CuPy 348ms, rioxarray 418ms
- 13 projections tested for accuracy vs pyproj
- Datum, vertical, and ITRF coverage documented
- All numbers from actual benchmark runs
write_geotiff WKT fix:
- reproject() stores the CRS as a WKT string in attrs['crs']
- write_geotiff assumed an integer EPSG code and crashed with a TypeError
- Added an isinstance check to parse WKT via _wkt_to_epsg()
Separated reproject-only from full-pipeline timing. With warm Numba/CUDA kernels:
- CuPy reproject: 73ms (2.0x faster than rioxarray)
- rioxarray reproject: 144ms
- NumPy reproject: 413ms
The full pipeline (read + reproject + write) is dominated by I/O for compressed GeoTIFFs, where rioxarray's C-level rasterio beats our Python/Numba reader. Added a note about the ~4.5s JIT warmup on the first call.
Three optimizations to the GeoTIFF writer:
1. Default compression changed from deflate to ZSTD: same file size (40MB), 6x faster single-threaded compression. ZSTD is the modern standard; deflate is still available via parameter.
2. Parallel tile compression via ThreadPoolExecutor: tiles are independent, and zlib/zstd/LZW all release the GIL. Uses os.cpu_count() threads; falls back to sequential for uncompressed output or very few tiles (< 4).
3. Optimized uncompressed path: pre-allocates a contiguous buffer for all tiles.
Combined results (3600x3600 float32):
- Write with the new default (zstd, parallel): 101ms (was 1388ms deflate sequential)
- Write deflate (parallel): 155ms (was 1388ms)
- vs rasterio: zstd 2.0x faster, deflate 3.0x faster
- Full pipeline (read + reproject + write), NumPy: 890ms (was 2907ms)
Also fixed a write_geotiff crash when attrs['crs'] contains a WKT string (produced by reproject()) -- added an isinstance check to parse WKT via _wkt_to_epsg().
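The parallel-compression idea can be sketched with stdlib zlib standing in for zstd (the helper name and the exact small-job threshold are illustrative; the PR's writer follows the same shape). Because zlib releases the GIL during compression, threads give real parallelism on independent tiles:

```python
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_tiles(tiles, level=6):
    """Compress independent tiles in parallel. zlib (and zstandard,
    the new default codec) release the GIL, so a thread pool scales;
    very small jobs stay sequential to avoid pool overhead."""
    def one(raw: bytes) -> bytes:
        return zlib.compress(raw, level)
    if len(tiles) < 4:
        return [one(t) for t in tiles]
    with ThreadPoolExecutor() as ex:
        return list(ex.map(one, tiles))
```

ex.map preserves input order, so the compressed tiles can be written back in the same IFD order they were produced.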
Tile decompression (deflate, LZW, ZSTD) now runs in parallel using ThreadPoolExecutor, the same approach as the writer. zlib, zstandard, and Numba LZW all release the GIL.
Read performance (Copernicus 3600x3600 deflate):
- Before: 291ms (sequential)
- After: 101ms (parallel) -- 2.9x faster
- rasterio: 189ms -- we're now 1.9x FASTER than rasterio
Full pipeline improvement (read + reproject + write), NumPy: 2907ms -> 697ms (4.2x faster total)
Multi-band rasters (y, x, band) now reproject correctly: - Each band is reprojected independently using shared coordinates (coordinate transform computed once, reused for all bands) - Output preserves the band dimension name and coordinates - Works with any dtype (float32, uint8 with clamping, etc.) - Custom band dim names (e.g. 'channel') preserved Also fixed spatial dimension detection to use name-based lookup (_find_spatial_dims) instead of hardcoded dims[-2]/dims[-1], which failed for 3D rasters where the band dim was last. Previously crashed with TypingError on 3D input.
New TestEdgeCases class covering:
- Multi-band RGB/RGBA reprojection (float32, uint8)
- Antimeridian crossing
- Y-ascending coordinate order
- Checkerboard NaN pattern (bilinear renormalization)
- UTM -> geographic (reverse projection direction)
- Projected -> projected (LCC -> UTM)
- Sentinel nodata (-9999)
- Integer EPSG as target CRS
- Explicit resolution and width/height parameters
- Merge with non-overlapping tiles
- Merge with a single tile
Three safeguards for datasets that exceed available RAM:
1. Auto-chunk large non-dask inputs (reproject): if the source array exceeds 512MB, automatically wrap it in a dask.array with the configured chunk_size (default 512x512). This routes it through the chunked dask path instead of the in-memory path, which would call .values and OOM.
2. Auto-promote merge to the dask path: if the combined output size (output_shape * n_tiles * 8 bytes) exceeds 512MB, use the dask merge path even if no input is dask. This prevents _merge_inmemory from calling .values on each tile.
3. Cap the source window size in chunk workers: if a single output chunk maps to a source window larger than 64 Mpixels (~512MB for float64), return nodata instead of materializing the window. This prevents extreme projections (e.g. polar stereographic edge pixels mapping to the entire source hemisphere) from OOMing individual chunk workers.
A 30TB dataset on a 16GB-RAM machine would now:
- Auto-chunk into 512x512 tiles
- Process each tile independently (~2MB working memory per tile)
- Never materialize more than 512MB in a single operation
For a 30TB raster at 2048x2048 chunks, dask's task graph would be
1.9GB -- larger than many machines' RAM. The streaming path bypasses
dask entirely and processes output tiles in a sequential loop:
for each output tile:
compute source coordinates (Numba)
read source window (lazy slice, no full materialization)
resample
write tile to output array
free tile
Memory usage: O(tile_size^2) per tile, ~16MB at 2048x2048.
No graph overhead. No scheduler overhead.
The routing logic:
- Source < 512MB: in-memory (fastest)
- Source > 512MB, graph < 1GB: auto-chunk to dask (parallel)
- Source > 512MB, graph > 1GB: streaming (bounded memory)
The streaming path produces results identical to the in-memory
path (max error ~5e-13, floating-point noise only).
The streaming path (for datasets > ~1TB) now uses ThreadPoolExecutor with bounded concurrency based on the max_memory budget:
reproject(huge_raster, 'EPSG:3857', max_memory='4GB')
The budget controls how many output tiles can be processed in parallel. Numba kernels release the GIL, so threads give real parallelism. Memory stays bounded:
- max_memory='4GB', tile=2048x2048: ~42 concurrent tiles
- max_memory='16GB', tile=2048x2048: ~170 concurrent tiles
Accepts int (bytes) or strings: '512MB', '4GB', '1TB'. Default is 1GB when not specified.
On a 512x512 test with 256x256 tiles:
- Sequential (32MB budget): 233ms
- Parallel (4GB budget): 24ms -- 10x faster, identical output
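Budget parsing for a parameter like this can be sketched as below. The helper name and the exact set of accepted suffixes are assumptions; the PR documents int bytes and '512MB'/'4GB'/'1TB' strings with a 1GB default:

```python
def parse_memory(value, default=1 << 30):
    """Parse a max_memory budget: int byte count, or a string like
    '512MB', '4GB', '1TB'. None falls back to the default (1GB)."""
    if value is None:
        return default
    if isinstance(value, int):
        return value
    s = value.strip().upper()
    # Longest suffixes first so '4GB' isn't consumed by the bare 'B' case.
    for suffix, mult in (('TB', 1 << 40), ('GB', 1 << 30),
                         ('MB', 1 << 20), ('KB', 1 << 10), ('B', 1)):
        if s.endswith(suffix):
            return int(float(s[:-len(suffix)]) * mult)
    return int(s)
```

The resulting byte count would then be divided by the per-tile working-set size to pick the thread-pool concurrency.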
When dask.distributed is active, the streaming path uses dask.bag
to distribute tile batches across workers instead of processing
everything in one process:
Local (no cluster):
ThreadPoolExecutor within one process, max_memory bounded
Distributed (dask cluster active):
1. Partition 2M tiles into N batches (one per worker)
2. dask.bag.from_sequence(batches, npartitions=N)
3. bag.map(process_batch) -- each worker gets its batch
4. Within each worker, ThreadPoolExecutor for intra-worker
parallelism (Numba releases GIL)
5. Assemble results
Graph size comparison for 30TB:
Old dask.array approach: 1,968,409 nodes (1.9GB graph, OOM)
New dask.bag approach: 4-64 nodes (one per worker)
Each worker's memory bounded by max_memory parameter.
Auto-detects distributed via get_client().
Aligns with the base branch rename (PR #1056) to match xarray conventions (open_* for readers, to_* for writers). Updated in: - xrspatial/geotiff/__init__.py (function defs, __all__) - xrspatial/reproject/_datum_grids.py, _vertical.py (grid loading) - README.md (all code examples and feature matrix) - benchmarks/reproject_benchmark.md - All geotiff test files (test_cog.py, test_edge_cases.py, test_features.py, bench_vs_rioxarray.py) All 343 tests pass (271 geotiff + 72 reproject).
Resolve merge conflicts keeping both branches' contributions: - Function names: open_geotiff/to_geotiff (base branch convention) - Default compression: zstd (our branch) - Parallel tile compression via ThreadPoolExecutor (our branch) - JPEG 2000 codec support integrated into _prepare_tile (base branch) - Accessor methods .xrs.to_geotiff/.xrs.open_geotiff (base branch) - WKT CRS string handling (our branch) - _read_geo_info and _extent_to_window helpers (base branch) - Import style: import xrspatial as xrs (our branch) - Removed deprecated open_cog shim (base branch)
brendancol added a commit that referenced this pull request on Mar 23, 2026
* Add lightweight GeoTIFF/COG reader and writer
Reads and writes GeoTIFF and Cloud Optimized GeoTIFF files using only
numpy, numba, xarray, and the standard library. No GDAL required.
What it does:
- Parses TIFF/BigTIFF headers and IFDs via struct
- Reads GeoTIFF tags: CRS (EPSG), affine transforms, GeoKeys,
PixelIsArea vs PixelIsPoint
- Deflate (zlib), LZW (Numba JIT), horizontal and floating-point
predictor codecs
- Strip and tiled layouts, windowed reads
- COG writer with overview generation and IFD-first layout
- HTTP range-request reader for remote COGs
- mmap for local file access (zero-copy)
- Nodata sentinel masking to NaN on read
- Metadata round-trip: CRS, transform, nodata, raster type, pixels
Read performance vs rioxarray/GDAL:
- 5-7x faster on uncompressed data
- On par for compressed COGs
- 100% pixel-exact across all tested formats
Write performance:
- On par for uncompressed
- 2-4x slower for compressed (zlib is C-native in GDAL)
Tested against Landsat 8, Copernicus DEM, USGS 1-arc-second, and
USGS 1-meter DEMs. 154 tests cover codecs, header parsing, geo
metadata, round-trips across 8 dtypes x 3 compressions, edge cases
(corrupt files, NaN/Inf, extreme shapes, PixelIsPoint), and the
public API.
* Add multi-band write, integer nodata, PackBits, dask reads, BigTIFF write
Six features filling the main gaps for real-world use:
1. Multi-band write: 3D arrays (height, width, bands) now write as
multi-band GeoTIFFs with correct BitsPerSample, SampleFormat, and
PhotometricInterpretation (RGB for 3+ bands). Overviews work for
multi-band too. read_geotiff returns all bands by default (band=None)
with a 'band' dimension.
2. Integer nodata masking: uint8/uint16/int16 arrays with nodata values
are promoted to float64 and masked with NaN on read, matching
rioxarray behavior. Previously only float arrays were masked.
3. PackBits compression (tag 32773): simple RLE codec, both read and
write. Common in older TIFF files.
4. JPEG decompression (tag 7): read support via Pillow for
JPEG-compressed tiles/strips. Import is optional and lazy.
5. BigTIFF write: auto-detects when output exceeds ~4GB and switches
to BigTIFF format (16-byte header, 20-byte IFD entries, 8-byte
offsets). Prevents silent offset overflow corruption on large files.
6. Dask lazy reads: read_geotiff_dask() returns a dask-backed
DataArray using windowed reads per chunk. Works for single-band
and multi-band files with nodata masking per chunk.
178 tests passing.
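The PackBits codec mentioned in item 3 is simple enough to sketch in a few lines. This decoder follows the TIFF 6.0 PackBits rules (literal runs, replicate runs, and the 0x80 no-op); the function name is illustrative and the write side (the encoder) is not shown:

```python
def packbits_decode(data: bytes) -> bytes:
    """Decode TIFF PackBits RLE (compression tag 32773)."""
    out = bytearray()
    i = 0
    while i < len(data):
        n = data[i]
        i += 1
        if n < 128:                  # literal run: copy the next n+1 bytes
            out += data[i:i + n + 1]
            i += n + 1
        elif n > 128:                # replicate the next byte 257-n times
            out += data[i:i + 1] * (257 - n)
            i += 1
        # n == 128 is a no-op per the spec
    return bytes(out)
```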
* Skip unneeded strips in windowed reads
Strip-based windowed reads now only decompress strips that overlap the
requested row range. Previously, all strips were decompressed into a
full image buffer and then sliced.
For a 4096x512 deflate file with 256-row strips, reading a 10x10
window from the top-left goes from 31 ms to 1.9 ms (16x). On a
100,000-row file the savings scale linearly with the number of
strips skipped.
* Add ZSTD compression support (tag 50000)
Read and write Zstandard-compressed GeoTIFFs using the zstandard
package (lazy import, clear error if missing).
On a 2048x2048 float32 raster, ZSTD vs deflate:
- Write: 39 ms vs 420 ms (10.7x faster)
- Read: 14 ms vs 66 ms (4.7x faster)
- Size: 15.5 MB vs 15.5 MB (comparable)
9 new tests covering codec round-trips, stripped/tiled layouts,
uint16, predictor, multi-band, and the public API.
* Handle planar configuration (separate band planes) on read
TIFF PlanarConfiguration=2 stores each band as a separate set of
strips or tiles (RRR...GGG...BBB) instead of interleaved
(RGBRGB...). The reader now detects this from the IFD and iterates
band-by-band through the strip/tile offset array, placing each
single-band chunk into the correct slice of the output.
Both strip and tile layouts are handled. Windowed reads and single-
band selection work correctly with planar files.
6 new tests: planar strips (RGB, 2-band), planar tiles, windowed
read, band selection, and public API.
* Handle sub-byte bit depths: 1-bit, 2-bit, 4-bit, 12-bit
Adds read support for non-byte-aligned pixel data, common in bilevel
masks (1-bit), palette images (4-bit), and medical/scientific sensors
(12-bit).
Changes:
- _dtypes.py: Map 1/2/4-bit to uint8 and 12-bit to uint16
- _compression.py: Add unpack_bits() for MSB-first bit unpacking
at 1, 2, 4, and 12 bits per sample
- _reader.py: Add _decode_strip_or_tile() helper that handles the
full decompress -> predictor -> unpack -> reshape pipeline,
detecting sub-byte depths automatically. Both strip and tile
readers refactored to use it.
7 new tests: 1-bit bilevel, 1-bit non-byte-aligned width, 4-bit,
4-bit odd width, 12-bit, and direct codec tests.
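The MSB-first unpacking for the byte-aligned depths can be sketched as below (illustrative, assuming each byte holds a whole number of samples; the 12-bit case straddles byte boundaries and needs a separate path, as does the actual unpack_bits in _compression.py):

```python
import numpy as np

def unpack_subbyte(buf: bytes, bits: int, count: int) -> np.ndarray:
    """Unpack MSB-first packed 1-, 2-, or 4-bit samples into uint8.
    The first sample of each byte occupies the most significant bits."""
    packed = np.frombuffer(buf, dtype=np.uint8)
    per_byte = 8 // bits
    mask = (1 << bits) - 1
    out = np.empty(packed.size * per_byte, dtype=np.uint8)
    for k in range(per_byte):
        shift = 8 - bits * (k + 1)   # MSB-first: sample k in the high bits
        out[k::per_byte] = (packed >> shift) & mask
    return out[:count]               # trim row padding for odd widths
```

Trimming to count is what handles non-byte-aligned widths such as the 1-bit and 4-bit odd-width cases in the tests.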
* Add palette/indexed-color TIFF support with automatic colormap
Reads TIFF files with Photometric=3 (Palette) and ColorMap tag (320).
The TIFF color table (uint16 R/G/B arrays) is converted to a
matplotlib ListedColormap and stored in da.attrs['cmap'].
New plot_geotiff() convenience function uses the embedded colormap
with BoundaryNorm so that integer class indices map to the correct
palette colors when plotted. Works out of the box:
da = read_geotiff('landcover.tif')
plot_geotiff(da) # colors match the TIFF's palette
Also stores raw RGBA tuples in attrs['colormap_rgba'] for custom use.
Supports both 8-bit (256-color) and 4-bit (16-color) palettes.
5 new tests: 8-bit palette read, 4-bit palette, colormap object
verification, plot_geotiff smoke test, and non-palette attr check.
* Move palette plot to da.xrs.plot() accessor
The .xrs accessor (registered on all DataArrays by xrspatial) now has
a plot() method that checks for an embedded TIFF colormap in attrs.
If present, it applies BoundaryNorm with the ListedColormap so that
integer class indices map to the correct palette colors.
da = read_geotiff('landcover.tif')
da.xrs.plot() # palette colors applied automatically
For non-palette DataArrays, falls through to the standard da.plot().
The old plot_geotiff() function is kept as a thin wrapper.
* Thread-safe reads via reference-counted mmap cache
Multiple threads reading the same file now share a single read-only
mmap instead of each opening their own. A module-level _MmapCache
protected by a threading.Lock manages reference counts per file path.
The mmap is closed when the last reader releases it.
Read-only mmap slicing (which is what the strip/tile readers do) is
thread-safe at the OS level -- no seek or file position involved.
Tested with 16 concurrent threads reading different windows from the
same deflate+tiled file, and a stress test of 400 reads across 8
threads. Zero errors, cache drains properly.
For dask lazy reads, this means all chunk-read tasks for the same
file share one mmap instead of opening/closing the file per chunk.
* Atomic writes via temp file + os.replace
Writes now go to a temporary file in the same directory, then
os.replace() atomically swaps it over the target path. This gives:
- No interleaved output when multiple threads write the same path
- Readers never see a half-written file
- No corrupt file left behind if the process crashes mid-write
- Temp file cleaned up on any exception
os.replace is atomic on POSIX (single rename syscall) and
near-atomic on Windows (MoveFileEx with MOVEFILE_REPLACE_EXISTING).
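The pattern is small enough to show in full. A minimal sketch (the function name `atomic_write` is illustrative, not the library's internal name):

```python
import os
import tempfile

def atomic_write(path: str, payload: bytes) -> None:
    """Write to a temp file in the target directory, then atomically rename."""
    dirname = os.path.dirname(os.path.abspath(path))
    # Temp file must live on the same filesystem for os.replace to be atomic.
    fd, tmp = tempfile.mkstemp(dir=dirname, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(payload)
            f.flush()
            os.fsync(f.fileno())  # data on disk before the rename
        os.replace(tmp, path)  # readers see the old file or the new, never half
    except BaseException:
        try:
            os.unlink(tmp)  # clean up the temp file on any failure
        except OSError:
            pass
        raise
```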
* Add overview resampling options: nearest, min, max, median, mode, cubic
_make_overview() now accepts a method parameter instead of hardcoding
2x2 block averaging. Available methods:
- mean (default): nanmean of each 2x2 block, same as before
- nearest: top-left pixel of each block (no interpolation)
- min/max: nanmin/nanmax of each block
- median: nanmedian of each block
- mode: most frequent value per block (for classified rasters)
- cubic: scipy.ndimage.zoom with order=3 (requires scipy)
All methods work on both 2D and 3D (multi-band) arrays. Exposed via
the overview_resampling= parameter on write() and write_geotiff().
12 new tests covering each method, NaN handling, multi-band, COG
round-trips with nearest and mode, the public API, and error on
invalid method names.
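The core 2x2 block reduction can be sketched like this (2D arrays with even dimensions assumed; only three of the methods shown, and the helper name is illustrative):

```python
import numpy as np

def downsample_2x2(arr: np.ndarray, method: str = "mean") -> np.ndarray:
    """Build one overview level by reducing each 2x2 block (sketch)."""
    h, w = arr.shape
    # Group pixels into (row, col, 4) where the last axis holds one block:
    # top-left, top-right, bottom-left, bottom-right.
    blocks = arr[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2)
    blocks = blocks.transpose(0, 2, 1, 3).reshape(h // 2, w // 2, 4)
    if method == "mean":
        return np.nanmean(blocks, axis=-1)
    if method == "nearest":
        return blocks[..., 0]  # top-left pixel, no interpolation
    if method == "max":
        return np.nanmax(blocks, axis=-1)
    raise ValueError(f"unsupported method: {method}")
```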
* Read and write resolution/DPI tags (282, 283, 296)
Adds support for TIFF resolution metadata used in print and
cartographic workflows:
- Tag 282 (XResolution): pixels per unit, stored as RATIONAL
- Tag 283 (YResolution): pixels per unit, stored as RATIONAL
- Tag 296 (ResolutionUnit): 1=none, 2=inch, 3=centimeter
Read: resolution values are stored in DataArray attrs as
x_resolution, y_resolution (float), and resolution_unit (string:
'none', 'inch', or 'centimeter').
Write: accepted as keyword args on write() and write_geotiff(),
or extracted automatically from DataArray attrs. Written as
RATIONAL tags (numerator/denominator pairs).
Also adds RATIONAL type serialization to the writer's tag encoder.
5 new tests: DPI round-trip, centimeter unit, no-resolution check,
DataArray attrs preservation, unit='none'.
* Expose full GeoKey metadata: CRS names, units, datum, ellipsoid, vertical CRS
GeoInfo and DataArray attrs now include all commonly used GeoKeys
parsed from the GeoKeyDirectory:
- crs_name: full CRS name (e.g. "NAD83 / UTM zone 18N")
- geog_citation: geographic CRS name (e.g. "WGS 84", "NAD83")
- datum_code: EPSG geodetic datum code
- angular_units / angular_units_code: e.g. "degree" (9102)
- linear_units / linear_units_code: e.g. "metre" (9001)
- semi_major_axis, inv_flattening: ellipsoid parameters
- projection_code: EPSG projection method code
- vertical_crs, vertical_citation, vertical_units: for compound CRS
EPSG unit codes are resolved to human-readable names via lookup
tables (ANGULAR_UNITS, LINEAR_UNITS). Raw geokeys dict is still
available for anything not covered by a named field.
6 new tests covering geographic and projected CRS extraction,
real-file verification, no-CRS baseline, and unit lookups.
* Reuse HTTP connections via urllib3 pool for COG range requests
_HTTPSource now uses a module-level urllib3.PoolManager that reuses
TCP connections (and TLS sessions) across range requests to the same
host. For a COG with 64 tiles, this eliminates 63 TCP handshakes.
On localhost: 1.7x faster for 16 range requests. Over real networks
the benefit is much larger since each TLS handshake adds 50-200ms.
Falls back to stdlib urllib.request if urllib3 is not installed.
The pool is created lazily on first use with retry support
(2 retries, 0.1s backoff).
* Add WKT/PROJ CRS support via pyproj
CRS can now be specified as WKT strings, PROJ strings, or EPSG
integers. pyproj (lazy import) resolves between them:
Read side:
- crs_wkt attr is populated by resolving EPSG -> WKT via pyproj
- Falls back gracefully if pyproj is not installed (EPSG still works)
Write side:
- crs= parameter on write_geotiff accepts int (EPSG), WKT string,
or PROJ string. String inputs are resolved to EPSG via
pyproj.CRS.from_user_input().to_epsg().
- DataArray with crs_wkt attr (no integer crs) is also handled:
the WKT is resolved to EPSG for the GeoKeyDirectory.
This means files with user-defined CRS no longer lose their spatial
reference when round-tripped, as long as pyproj can resolve the
WKT/PROJ string to an EPSG code.
5 new tests: WKT from EPSG, write with WKT string, write with PROJ
string, crs_wkt attr round-trip, and no-CRS baseline.
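The write-side resolution step can be sketched as a small helper (hypothetical name; the pyproj call is the one named above, behind a lazy import so EPSG integers keep working without pyproj):

```python
def resolve_epsg(crs):
    """Resolve int / WKT / PROJ-string CRS input to an EPSG code (sketch)."""
    if isinstance(crs, int):
        return crs  # already an EPSG code, no pyproj needed
    try:
        from pyproj import CRS  # lazy import: pyproj is optional
    except ImportError:
        return None  # graceful fallback when pyproj is not installed
    # from_user_input accepts WKT, PROJ strings, "EPSG:xxxx", etc.
    return CRS.from_user_input(crs).to_epsg()
```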
* Preserve GDALMetadata XML (tag 42112) through read/write
Band names, statistics, and other GDAL-specific metadata stored in
the GDALMetadata XML tag (42112) are now read, exposed, and written
back.
Read: the XML is parsed into a dict stored in attrs['gdal_metadata'].
Dataset-level items use string keys ('DataType'), per-band items use
(name, band_int) tuple keys (('STATISTICS_MAXIMUM', 0)). The raw XML
is also available in attrs['gdal_metadata_xml'].
Write: accepts gdal_metadata_xml on write(), or extracts from
DataArray attrs on write_geotiff(). If attrs has a gdal_metadata
dict but no raw XML, it's re-serialized automatically.
Round-trip verified on the USGS 1-meter DEM which has statistics,
layer type, and data type metadata -- all items survive intact.
7 new tests: XML parsing, dict serialization, file round-trip,
DataArray attrs preservation, no-metadata baseline, real-file read,
and real-file round-trip.
* Preserve arbitrary TIFF tags through read/write round-trip
Any IFD tag that the writer doesn't explicitly manage (Software,
DateTime, ImageDescription, Copyright, custom private tags, etc.)
is now collected on read, stored in attrs['extra_tags'], and
re-emitted on write.
Read: extract_geo_info collects (tag_id, type_id, count, value)
tuples for all tags not in the _MANAGED_TAGS set (structural tags
that the writer builds from scratch: dimensions, compression,
offsets, geo tags, etc.). Stored in attrs['extra_tags'].
Write: extra_tags are appended to the IFD, skipping any tag_id
that was already written to avoid duplicates. The tag values are
serialized using the same type-aware encoder as built-in tags.
Tested with a hand-crafted TIFF containing Software (305) and
DateTime (306) tags. Both survive read -> write -> read intact.
3 new tests: read detection, round-trip preservation, and
no-extra-tags baseline.
* Fix BigTIFF auto-detection and add bigtiff= parameter
The auto-detection now estimates total file size (header + IFDs +
overflow + pixel data) instead of only checking compressed pixel
data size, and compares against UINT32_MAX (4,294,967,295) instead
of a hardcoded 3.9 GB threshold.
Also adds a bigtiff= parameter to write() and write_geotiff():
- bigtiff=None (default): auto-detect based on estimated file size
- bigtiff=True: force BigTIFF even for small files
- bigtiff=False: force classic TIFF (user's responsibility if >4GB)
3 new tests: force BigTIFF via public API, small file stays classic,
force classic via bigtiff=False.
* Handle big-endian pixel data correctly on read
Big-endian TIFFs (byte order marker 'MM') now byte-swap pixel data
to native order after decompression. Previously, the reader did
.view(dtype) with a native-order dtype, producing garbage values
for multi-byte types (uint16, int32, float32, float64).
Fix: _decode_strip_or_tile uses dtype.newbyteorder(file_byte_order)
for the view, then .astype(native_dtype) if a swap is needed.
Single-byte types (uint8) need no swap. The COG HTTP reader path
has the same fix.
Also fixed the test conftest: make_minimal_tiff(big_endian=True) now
actually writes pixel bytes in big-endian order.
7 new tests: float32, uint16, int32, float64, uint8 (no swap),
windowed read, and public API -- all with big-endian TIFFs.
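The fix amounts to viewing with the file's byte order before converting, which can be sketched as (function name illustrative):

```python
import numpy as np

def decode_pixels(raw: bytes, dtype, byte_order: str) -> np.ndarray:
    """View raw bytes in the file's byte order, then swap to native (sketch).

    byte_order is '>' for big-endian ('MM') files, '<' for little-endian ('II').
    """
    file_dtype = np.dtype(dtype).newbyteorder(byte_order)
    arr = np.frombuffer(raw, dtype=file_dtype)
    if not arr.dtype.isnative:
        # Multi-byte types from the foreign order need a swap; uint8 never does.
        arr = arr.astype(arr.dtype.newbyteorder("="))
    return arr
```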
* Add cloud storage support via fsspec (S3, GCS, Azure)
Read and write GeoTIFFs to/from cloud storage using fsspec as the
filesystem abstraction layer. Any URI with a :// scheme (that isn't
http/https) is routed through fsspec, which delegates to the
appropriate backend:
- s3://bucket/key.tif (requires s3fs)
- gs://bucket/key.tif (requires gcsfs)
- az://container/blob.tif (requires adlfs)
- abfs://container/blob.tif (requires adlfs)
- Any other fsspec-supported scheme (memory://, ftp://, etc.)
Read: _CloudSource uses fsspec.core.url_to_fs() then fs.open()
for full reads and range reads. Falls through to the same TIFF
parsing pipeline as local files.
Write: _write_bytes detects fsspec URIs and writes via fs.open()
instead of the local atomic-rename path (which doesn't apply to
cloud storage).
If fsspec or the backend library isn't installed, a clear ImportError
is raised with install instructions.
5 new tests using fsspec's memory:// filesystem for integration
testing without real cloud credentials.
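The fsspec routing boils down to a pair of calls, sketched here with illustrative helper names (the real `_write_bytes` also keeps the local atomic-rename path):

```python
import fsspec

def write_bytes(uri: str, payload: bytes) -> None:
    """Route any fsspec URI through the matching backend (sketch)."""
    fs, path = fsspec.core.url_to_fs(uri)  # e.g. s3:// -> S3FileSystem
    with fs.open(path, "wb") as f:
        f.write(payload)

def read_bytes(uri: str) -> bytes:
    fs, path = fsspec.core.url_to_fs(uri)
    with fs.open(path, "rb") as f:
        return f.read()
```

The `memory://` filesystem used by the tests exercises exactly this path without credentials.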
* Add VRT (Virtual Raster) reader
Reads GDAL .vrt files by parsing the XML and assembling pixel data
from the referenced source GeoTIFFs using windowed reads.
Supported VRT features:
- SimpleSource: direct pixel copy with source/destination rects
- ComplexSource: scaling (ScaleRatio) and offset (ScaleOffset)
- Source nodata masking
- Multiple bands
- GeoTransform and SRS/CRS propagation
- Relative and absolute source file paths
- Windowed reads (only fetches overlapping source regions)
Usage:
da = read_geotiff('mosaic.vrt') # auto-detected by extension
da = read_vrt('mosaic.vrt') # explicit function
da = read_vrt('mosaic.vrt', window=(0, 0, 100, 100)) # windowed
read_geotiff auto-detects .vrt files and routes them through the
VRT reader. The DataArray gets coordinates from the VRT's
GeoTransform and CRS from the SRS tag.
New module: _vrt.py with parse_vrt() and read_vrt() functions.
8 new tests: single tile, 2x1 mosaic, 2x2 mosaic, windowed read,
CRS propagation, nodata, read_vrt API, and XML parser unit test.
* Fix 8 remaining gaps for production readiness
1. Band-first DataArray (CRITICAL): write_geotiff now detects
(band, y, x) dimension order and transposes to (y, x, band).
Prevents silent data corruption from rasterio-style arrays.
2. HTTP COG sub-byte support (CRITICAL): the COG HTTP reader now
routes through _decode_strip_or_tile like the local readers,
so 1-bit/4-bit/12-bit COGs over HTTP work correctly.
3. Dask VRT support (USEFUL): read_geotiff_dask detects .vrt files
and reads eagerly then chunks, since VRT windowed reads need
the virtual dataset's source layout.
4. VRT writer (USEFUL): write_vrt() generates a VRT XML file from
multiple source GeoTIFFs, computing the mosaic layout from their
geo transforms. Supports relative paths and CRS/nodata.
5. ExtraSamples tag (USEFUL): RGBA writes now include tag 338 with
value 2 (unassociated alpha). Multi-band with >3 bands also
gets ExtraSamples for bands beyond RGB.
6. MinIsWhite (USEFUL): photometric=0 (MinIsWhite) single-band
files are now inverted on read so 0=black, 255=white. Integer
values are inverted via max-value, floats via negation.
7. Post-write validation (POLISH): after writing, the header bytes
are parsed to verify the output is a valid TIFF. Emits a
warning if the header is corrupt.
8. Float16/bool auto-promotion (POLISH): float16 arrays are
promoted to float32, bool arrays to uint8, instead of raising
ValueError.
275 tests passing. 7 new tests for the fixes plus updated edge
case tests.
* Replace rioxarray with xrspatial.geotiff in examples
Removes the rioxarray dependency from all example notebooks:
- multispectral.ipynb: rioxarray.open_rasterio -> read_geotiff
- classification-methods.ipynb: same
- viewshed_gpu.ipynb: same
- 25_GLCM_Texture.ipynb: rioxarray.open_rasterio COG read ->
read_geotiff with window= and band= parameters. Also removes
GDAL-specific env vars (AWS_NO_SIGN_REQUEST, etc.) since our
reader doesn't use GDAL.
Also updates reproject/_crs_utils.py to check attrs['crs'] and
attrs['crs_wkt'] (xrspatial.geotiff convention) before falling
back to .rio.crs (rioxarray). This means DataArrays from
read_geotiff work directly with xrspatial.reproject without
needing rioxarray installed.
The rioxarray fallback is kept in _crs_utils.py for backwards
compatibility with users who pass rioxarray-decorated DataArrays.
* Add matplotlib and zstandard as core dependencies
Both are now required (not optional):
- matplotlib: needed for palette colormap (ListedColormap) and
da.xrs.plot() with palette TIFFs
- zstandard: needed for ZSTD compression (tag 50000), increasingly
common in modern COGs
This fixes the CI failures where these packages weren't installed.
* Add GPU-accelerated TIFF reader via Numba CUDA
read_geotiff_gpu() decodes tiled GeoTIFFs on the GPU and returns a
CuPy-backed DataArray that stays on device memory. No CPU->GPU
transfer needed for downstream xrspatial GPU operations (slope,
aspect, hillshade, etc.).
CUDA kernels implemented:
- LZW decode: one thread block per tile, LZW table in shared memory
(20KB per block, fast on-chip SRAM)
- Predictor decode (pred=2): one thread per row, horizontal cumsum
- Float predictor (pred=3): one thread per row, byte-lane undiff +
un-transpose
- Tile assembly: one thread per pixel, copies from decompressed
tile buffer to output image
Supports LZW and uncompressed tiled TIFFs. Falls back to CPU for
unsupported compression types or stripped files.
100% pixel-exact match with CPU reader on all tested files (USGS
LZW+pred3 3612x3612, synthetic LZW tiled).
Performance: GPU LZW is comparable to CPU (~330ms vs 270ms for
3612x3612) because LZW is inherently sequential per-stream. The
value is in keeping data on GPU for end-to-end pipelines without
CPU->GPU transfer overhead. Future work: CUDA inflate (deflate)
kernel would unlock the parallel decompression win since deflate
tiles are much more common in COGs.
* Add CUDA inflate (deflate decompression) kernel
Implements RFC 1951 deflate decompression as a Numba @cuda.jit kernel
for GPU-accelerated TIFF tile decoding. One thread block per tile,
all tiles decompress in parallel.
Supports all three deflate block types:
- BTYPE=0: stored (no compression)
- BTYPE=1: fixed Huffman codes
- BTYPE=2: dynamic Huffman codes (most common in real files)
Uses a two-level Huffman decode:
- Fast path: 10-bit shared-memory lookup table (1024 entries)
- Slow path: overflow array scan for codes > 10 bits (up to 15)
Fixes the infinite loop bug where 14-bit lit/len codes exceeded
the original 10-bit table size.
Tested: 100% pixel-exact match on Copernicus deflate+pred3 COG
(3600x3600, 16 tiles) vs CPU zlib.
Performance: GPU inflate is ~20x slower than CPU zlib for this file
size (16 tiles). Deflate is inherently sequential per-stream, so
each thread block runs a long serial loop while most SMs sit idle.
The value is keeping data on GPU for end-to-end pipelines. For
files with hundreds of tiles, the parallelism would help more.
* Add nvCOMP batch decompression fast path for GPU reads
gpu_decode_tiles() now tries kvikio.nvcomp.DeflateManager for batch
deflate decompression before falling back to the Numba CUDA inflate
kernel. nvCOMP is NVIDIA's optimized batched compression library
that decompresses all tiles in a single GPU API call.
Fallback chain for GPU decompression:
1. nvCOMP via kvikio (if installed) -- optimized CUDA kernels
2. Numba @cuda.jit inflate kernel -- pure Python/Numba implementation
3. CPU zlib fallback -- if GPU decode raises any error
kvikio is an optional dependency (pip install kvikio-cu12 or
conda install -c rapidsai kvikio). When not installed, the Numba
kernels are used transparently.
* Fix nvCOMP ctypes binding: ZSTD batch decompress working
Fixed the nvCOMP C API ctypes binding to pass opts structs by value
using proper ctypes.Structure subclasses. The previous byte-array
approach caused the struct to be misinterpreted by nvCOMP.
Working: nvCOMP ZSTD batch decompress (nvcompBatchedZstdDecompressAsync)
- 100% pixel-exact match on all tested files
- 1.5x end-to-end speedup on 8192x8192 ZSTD with 1024 tiles
(GPU pipeline: 404ms vs CPU+transfer: 620ms)
Not working on Ampere: nvCOMP deflate returns nvcompErrorNotSupported
(status 11). Deflate GPU decompression requires Ada Lovelace or
newer GPU with HW decompression engine. Falls back to the Numba
CUDA inflate kernel on Ampere.
nvCOMP is auto-detected by searching for libnvcomp.so in
CONDA_PREFIX and sibling conda environments. When found, ZSTD
tiles are batch-decompressed in a single GPU API call.
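The underlying ctypes issue is generic: a C function that takes or returns a struct by value needs a real `ctypes.Structure` subclass; a raw byte buffer gets passed as a pointer and the callee misreads it. The snippet below illustrates the mechanism with libc's `div()` (which returns a struct by value) rather than nvCOMP's actual opts structs, whose field layouts are not reproduced here:

```python
import ctypes

# div_t mirrors the C struct returned BY VALUE by libc's div().
class div_t(ctypes.Structure):
    _fields_ = [("quot", ctypes.c_int), ("rem", ctypes.c_int)]

libc = ctypes.CDLL(None)  # the running process's libc (POSIX)
libc.div.restype = div_t  # without this, ctypes would misread the return
libc.div.argtypes = [ctypes.c_int, ctypes.c_int]

r = libc.div(17, 5)
```

Declaring a `Structure` in `argtypes` is the by-value-argument counterpart of the same rule, which is what the nvCOMP opts fix required.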
* Add KvikIO GDS (GPUDirect Storage) path for GPU reads
When kvikio is installed, read_geotiff_gpu() can read compressed tile
bytes directly from NVMe SSD to GPU VRAM via GPUDirect Storage,
bypassing CPU memory entirely:
Normal: SSD -> CPU (mmap) -> cupy.asarray (CPU->GPU copy)
With GDS: SSD -> GPU VRAM (direct DMA, no CPU involved)
The full pipeline for a ZSTD COG with GDS + nvCOMP:
SSD --(GDS)--> GPU compressed tiles --(nvCOMP)--> GPU decompressed
--> GPU predictor decode --> GPU tile assembly --> CuPy DataArray
Fallback chain in read_geotiff_gpu:
1. KvikIO GDS file read + nvCOMP batch decompress (fastest)
2. CPU mmap tile extract + nvCOMP batch decompress
3. CPU mmap tile extract + Numba CUDA kernels
4. CPU read_to_array + cupy.asarray transfer (slowest)
Also adds:
- gpu_decode_tiles_from_file(): accepts file path + offsets
instead of pre-extracted bytes, enabling the GDS path
- _try_nvcomp_from_device_bufs(): nvCOMP on tiles already in GPU
memory (from GDS), avoiding a device-to-host round-trip
- _apply_predictor_and_assemble(): shared GPU post-processing
used by both GDS and mmap paths
KvikIO is optional: conda install -c rapidsai kvikio
GDS requires: NVMe SSD + NVIDIA kernel module (nvidia-fs)
* Fix KvikIO GDS error handling and ZSTD GPU fallback
- GDS tile read: added sync + verification after each pread to catch
partial reads and CUDA errors early. Catches exception and tries
to reset CUDA state before falling back.
- gpu_decode_tiles: unsupported GPU codecs (ZSTD without nvCOMP, etc.)
now decompress on CPU then transfer to GPU instead of raising
ValueError. This keeps the predictor + assembly on GPU.
- Fixes cudaErrorIllegalAddress from kvikio version mismatch
(26.02 C lib vs 26.06 Python bindings) by catching the error
gracefully instead of poisoning the GPU state.
* Fix nvCOMP deflate: use CUDA backend (backend=2) instead of DEFAULT
nvCOMP deflate decompression now works on all CUDA GPUs by using
backend=2 (CUDA software implementation) instead of backend=0
(DEFAULT, which tries hardware decompression first and fails on
pre-Ada GPUs).
Benchmarks (read + slope, A6000 GPU, nvCOMP via libnvcomp.so):
Deflate:
8192x8192 (1024 tiles): GPU 769ms vs CPU 1364ms = 1.8x
16384x16384 (4096 tiles): GPU 2417ms vs CPU 5788ms = 2.4x
ZSTD:
8192x8192 (1024 tiles): GPU 349ms vs CPU 404ms = 1.2x
16384x16384 (4096 tiles): GPU 1325ms vs CPU 2087ms = 1.6x
Both codecs decompress entirely on GPU via nvCOMP batch API.
No CPU decompression fallback needed when nvCOMP is available.
100% pixel-exact match verified.
* Update README with GeoTIFF I/O feature matrix and GPU benchmarks
Adds a GeoTIFF / COG I/O section to the feature matrix covering:
- read_geotiff, write_geotiff, read_geotiff_gpu, VRT, open_cog
- Compression codecs (deflate, LZW, ZSTD, PackBits, JPEG)
- GPU decompression via nvCOMP (2.4x speedup at 16K x 16K)
- Cloud storage, GDS, metadata preservation, sub-byte support
- Overview resampling modes
Updates Quick Start to use read_geotiff instead of synthetic data.
Updates Notes on GDAL to reflect native reader capabilities.
Updates Dependencies to list core and optional packages.
* Reorder README feature matrix by GIS workflow frequency
Sections now go from most-used to least-used in a typical workflow:
1. GeoTIFF / COG I/O (read your data first)
2. Surface (slope, aspect, hillshade -- the basics)
3. Hydrology (flow direction, accumulation, watersheds)
4. Flood (downstream from hydrology)
5. Multispectral (satellite imagery)
6. Classification (binning results)
7. Focal (neighborhood analysis)
8. Proximity (distance)
9. Zonal (zonal stats)
10. Reproject / Merge
11. Interpolation
12. Morphological
13. Fire
14. Raster / Vector Conversion
15. Utilities
16. Multivariate, Pathfinding, Diffusion, Dasymetric (specialized)
Previously alphabetical, which put Classification first and buried
Surface and Hydrology in the middle.
* Move Reproject to #2 and Utilities to #3 in feature matrix
* Add GPU-accelerated GeoTIFF write via nvCOMP batch compress
write_geotiff_gpu() compresses tiles on the GPU and writes a valid
GeoTIFF. The CuPy array stays on device throughout -- only the
compressed bytes transfer to CPU for file assembly.
GPU pipeline:
CuPy array → tile extraction (CUDA kernel) → predictor encode
(CUDA kernel) → nvCOMP batch compress → CPU file assembly
CUDA kernels added:
- _extract_tiles_kernel: image → per-tile buffers (1 thread/pixel)
- _predictor_encode_kernel: horizontal differencing (1 thread/row)
- _fp_predictor_encode_kernel: float predictor (1 thread/row)
- _nvcomp_batch_compress: deflate + ZSTD via nvCOMP C API
Deflate write performance (tiled 256, A6000):
2048x2048: GPU 135ms vs CPU 424ms = 3.1x faster
4096x4096: GPU 302ms vs CPU 1678ms = 5.6x faster
8192x8192: GPU 1114ms vs CPU 6837ms = 6.1x faster
GPU deflate is also 1.5-1.8x faster than rioxarray/GDAL at 4K+.
All round-trips verified pixel-exact (deflate, ZSTD, uncompressed).
* Update README benchmarks and enable all backend write support
README:
- Updated feature matrix: write_geotiff now shows Dask ✅ and
CuPy 🔄 (fallback). Added write_geotiff_gpu and read_geotiff_dask
rows. Updated VRT to show Dask support.
- Added comprehensive benchmark tables for read (real-world + synthetic)
and write (CPU vs GPU vs rioxarray) across all sizes and codecs.
- 100% consistency verified across all tested files.
Backend support for write_geotiff:
- NumPy: direct write (existing)
- Dask DataArray: .compute() then write (existing, now documented)
- CuPy raw array: .get() to numpy then write (new)
- CuPy DataArray: .data.get() then write (new)
- Dask+CuPy: .compute().get() then write (new)
- Python list: np.asarray() then write (existing)
For GPU-native compression (no CPU transfer), use write_geotiff_gpu.
* Enable Dask+CuPy for GPU read and write
read_geotiff_gpu:
- New chunks= parameter returns a Dask+CuPy DataArray
- read_geotiff_gpu('dem.tif', chunks=512) decompresses on GPU
then chunks the result for out-of-core GPU pipelines
write_geotiff_gpu:
- Accepts Dask+CuPy DataArrays (.compute() then compress on GPU)
- Accepts Dask+NumPy DataArrays (.compute() then transfer to GPU)
- Accepts raw CuPy, numpy, or list inputs
All 7 input combinations verified:
read_geotiff_gpu -> CuPy DataArray (existing)
read_geotiff_gpu(chunks=N) -> Dask+CuPy DataArray (new)
write_geotiff_gpu(cupy_array) (existing)
write_geotiff_gpu(cupy_DataArray) (existing)
write_geotiff_gpu(dask_cupy_DataArray) (new)
write_geotiff_gpu(numpy_array) (auto-transfer)
write_geotiff_gpu(dask_numpy_DataArray) (auto-compute+transfer)
Also fixed write_geotiff CuPy fallback for raw arrays and
Dask+CuPy DataArrays (compute then .get() to numpy).
* Unified API: read_geotiff/write_geotiff auto-dispatch CPU/GPU/Dask
read_geotiff and write_geotiff now dispatch to the correct backend
automatically:
read_geotiff('dem.tif') # NumPy (default)
read_geotiff('dem.tif', gpu=True) # CuPy via nvCOMP
read_geotiff('dem.tif', chunks=512) # Dask lazy
read_geotiff('dem.tif', gpu=True, chunks=512) # Dask+CuPy
write_geotiff(numpy_arr, 'out.tif') # CPU write
write_geotiff(cupy_arr, 'out.tif') # auto-detects CuPy -> GPU write
write_geotiff(data, 'out.tif', gpu=True) # force GPU write
Auto-detection: write_geotiff checks isinstance(data, cupy.ndarray)
to decide whether to use GPU compression. Falls back to CPU if
cupy is not installed or nvCOMP fails.
read_vrt also supports gpu= and chunks= parameters for all four
backend combinations.
Users no longer need to call read_geotiff_gpu/write_geotiff_gpu
directly -- the main functions handle everything.
* Update README: unified API with all 5 backends in feature matrix
* Pass chunks= and gpu= through open_cog to read_geotiff
* Deprecate open_cog -- read_geotiff handles all sources
read_geotiff already accepts HTTP URLs, cloud URIs (s3://, gs://,
az://), local files, and VRT files. open_cog is now a thin
deprecated wrapper. Users just use read_geotiff for everything:
read_geotiff('https://example.com/cog.tif')
read_geotiff('s3://bucket/cog.tif')
read_geotiff('/local/dem.tif')
read_geotiff('mosaic.vrt')
All with gpu=, chunks=, window=, band= options.
Removed open_cog from the README feature matrix.
* Simplify public API to 3 functions
The public API is now:
read_geotiff(source, ...) # read anything: file, URL, cloud, VRT
write_geotiff(data, path, ...) # write any backend
write_vrt(path, sources) # generate VRT mosaic XML
read_geotiff auto-detects:
- .vrt extension -> VRT reader
- http:// / https:// -> COG range-request reader
- s3:// / gs:// / az:// -> cloud via fsspec
- gpu=True -> nvCOMP GPU decompression
- chunks=N -> Dask lazy windowed reads
Removed from __all__: open_cog (deprecated wrapper),
read_vrt (called internally), read_geotiff_dask (use chunks=),
read_geotiff_gpu / write_geotiff_gpu (use gpu=True).
All these functions still exist for backwards compatibility but
are no longer the recommended entry points.
* Add JPEG 2000 codec with optional nvJPEG2000 GPU acceleration (#1049)
* Add JPEG 2000 codec with optional nvJPEG2000 GPU acceleration (#1048)
CPU path via glymur (same pattern as JPEG/Pillow and ZSTD/zstandard).
GPU path via nvJPEG2000 ctypes bindings (same pattern as nvCOMP).
Both are optional -- graceful fallback if libraries aren't installed.
* Add JPEG 2000 test coverage and fix glymur numres constraint (#1048)
- 14 new tests: codec roundtrip, TIFF write/read roundtrip, public API,
availability checks, and ImportError fallback
- Fix jpeg2000_compress: calculate numres from tile dimensions, remove
pre-existing temp file before glymur write
- Update test_edge_cases: use 'webp' as unsupported compression example
since 'jpeg2000' is now supported
* Add JPEG 2000 compression user guide notebook (#1048)
Covers write/read roundtrip, file size comparison with deflate,
multi-band RGB example, and GPU acceleration notes.
* Update README: add JPEG 2000 / nvJPEG2000 to codec lists (#1048)
* Rename GeoTIFF API to xarray conventions (#1047) (#1056)
* Rename GeoTIFF API to xarray conventions (#1047)
open_geotiff replaces read_geotiff, to_geotiff replaces write_geotiff.
Adds .xrs.to_geotiff() accessor on DataArray and Dataset, and
.xrs.open_geotiff() on Dataset for spatially-windowed reads.
* Add tests for .xrs.to_geotiff and .xrs.open_geotiff accessors (#1047)
* Update README and notebooks to open_geotiff / to_geotiff (#1047)
* Add GeoTIFF I/O user guide notebook (#1047)
* Rename old API refs in JPEG 2000 tests and notebook (#1047)
* Numba/CUDA projection kernels for reproject, README update (#1046)
* Update README narrative to match current scope (#1045)
The intro, GDAL caveat, and docs note were written when the library
was much smaller. Now there are 100+ functions, native GeoTIFF I/O,
4-backend dispatch, and 33+ user guide notebooks. Updated the prose
to say what the library actually does instead of underselling it.
* Fix GPU emoji shortcode, add read/reproject/write example (#1045)
:gpu: doesn't render on GitHub -- switched to :desktop_computer:.
Added a quick start snippet showing the read_geotiff -> reproject ->
write_geotiff workflow.
* Use import xrspatial as xrs, prioritize accessor methods (#1045)
Switched quick start to 'import xrspatial as xrs', moved the
read/reproject/write flow into the main example using the .xrs
accessor, and simplified the standalone function example to use
the xrs namespace.
* Fix output grid computation for reproject (#1045)
Three fixes to _grid.py:
- Resolution estimation now transforms each axis independently and
uses the geometric mean for square pixels (was diagonal step)
- Edge sampling increased from 21 to 101 points plus a 21x21
interior grid for better bounds coverage
- ceil() replaced with round() for dimension calculation to prevent
floating-point noise from adding spurious rows/columns
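The per-axis resolution estimate can be sketched as follows, assuming a `transform` callable mapping coordinate arrays from the source CRS to the target CRS (helper name and signature are illustrative):

```python
import math

def estimate_resolution(transform, xs, ys) -> float:
    """Square output pixel size from independent per-axis steps (sketch)."""
    cx, cy = xs[len(xs) // 2], ys[len(ys) // 2]
    dx_src = xs[1] - xs[0]
    dy_src = ys[1] - ys[0]
    # Transform one step along each axis independently (not the diagonal,
    # which overestimates the step by up to sqrt(2)).
    (x0, x1), (y0, y1) = transform([cx, cx + dx_src], [cy, cy])
    res_x = math.hypot(x1 - x0, y1 - y0)
    (x0, x1), (y0, y1) = transform([cx, cx], [cy, cy + dy_src])
    res_y = math.hypot(x1 - x0, y1 - y0)
    # Geometric mean yields square pixels that preserve total pixel count.
    return math.sqrt(res_x * res_y)
```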
* Add Numba JIT and CUDA projection kernels for reproject (#1045)
Ports six projections from the PROJ library to Numba (CPU) and
Numba CUDA (GPU), bypassing pyproj for common CRS pairs:
- Web Mercator (EPSG:3857) -- spherical, 3 lines per direction
- Transverse Mercator / UTM (326xx, 327xx, 269xx) -- 6th-order
Krueger series (Karney 2011), closed-form forward and inverse
- Ellipsoidal Mercator (EPSG:3395) -- Newton inverse
- Lambert Conformal Conic (e.g. EPSG:2154) -- Newton inverse
- Albers Equal Area (e.g. EPSG:5070) -- authalic latitude series
- Cylindrical Equal Area (e.g. EPSG:6933) -- authalic latitude series
CPU Numba kernels are 6-9x faster than pyproj. CUDA kernels are
40-165x faster. Unsupported CRS pairs fall back to pyproj.
_transform_coords now tries Numba first, then pyproj. The CuPy
chunk worker tries CUDA first, keeping coordinates on-device.
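The spherical Web Mercator case is short enough to show. This is a plain-Python sketch of the same three lines per direction that the Numba/CUDA kernels compile (in the library these would be `@numba.njit`/`@cuda.jit` functions operating on whole coordinate arrays):

```python
import math

R = 6378137.0  # WGS84 semi-major axis; Web Mercator treats the earth as a sphere

def webmerc_forward(lon_deg: float, lat_deg: float):
    """EPSG:4326 -> EPSG:3857 (sketch)."""
    x = R * math.radians(lon_deg)
    y = R * math.log(math.tan(math.pi / 4 + math.radians(lat_deg) / 2))
    return x, y

def webmerc_inverse(x: float, y: float):
    """EPSG:3857 -> EPSG:4326 (sketch)."""
    lon = math.degrees(x / R)
    lat = math.degrees(2 * math.atan(math.exp(y / R)) - math.pi / 2)
    return lon, lat
```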
* Add reproject benchmark vs rioxarray (#1045)
Compares xrspatial (approx, exact, numba) against rioxarray on
synthetic grids and real-world GeoTIFFs. Measures both performance
(median of 5 runs) and pixel-level consistency (RMSE, R^2, NaN
agreement) by forcing both libraries onto the same output grid.
* Update README with Numba/CUDA projection table (#1045)
Updated Reproject description and added a table listing the six
projections with native Numba CPU and CUDA GPU kernels.
* Add Sinusoidal and Polar Stereographic Numba kernels (#1045)
Three new CPU Numba projections for bathymetric/ocean use cases:
- Sinusoidal (ellipsoidal): MODIS ocean/land products. Uses
pj_mlfn meridional arc length with 5th-order series.
Forward: sub-micrometer vs pyproj. Roundtrip: exact.
- Polar Stereographic (N/S): IBCAO Arctic (3996/3413),
IBCSO Antarctic (3031), UPS. Iterative inverse (15 iter max).
Forward: sub-nanometer. Roundtrip: exact.
LAEA implemented but disabled in dispatch pending investigation
of ~940m oblique-mode error (kernels are in the file for future
fix, just not wired into the dispatch table).
* Fix LAEA xmf/ymf swap, re-enable in dispatch (#1045)
The oblique-mode LAEA had xmf and ymf swapped (rq/dd vs rq*dd).
Research agent found the bug by comparing against PROJ's laea.cpp
source. Forward accuracy is now sub-millimeter vs pyproj.
* Add generic tmerc dispatch for State Plane zones (#1045)
State Plane zones that use Transverse Mercator (Maine, New Hampshire,
Wisconsin, etc.) now hit the Numba fast path. Uses the same Krueger
6th-order series as UTM, with a Zb offset for non-zero lat_0.
Meter-based zones only; US survey foot zones fall back to pyproj.
* Support US survey foot and international foot units (#1045)
State Plane zones in us-ft (136 zones) and ft (28 zones) now use
the Numba fast path. The Krueger/LCC kernels compute in metres
internally, then divide by the unit conversion factor (0.3048006
for us-ft, 0.3048 for ft) on output. x_0/y_0 from PROJ4 dicts
are already in metres regardless of the units parameter.
* Add projection benchmark table to README, fix dispatch for custom CRS (#1045)
- Benchmark table showing Numba vs pyproj for all 12 supported projections
- Fixed dispatch to work with custom PROJ strings (no EPSG code),
needed for Sinusoidal and other non-registered CRS definitions
- Fixed _utm_params to handle None epsg_code
* Add GPU column and inline speedups to projection benchmark table (#1045)
* Add CUDA kernels for Sinusoidal, LAEA, Polar Stere, State Plane (#1045)
All 12 projections now have GPU CUDA kernels. Performance on A6000:
- Sinusoidal: 18ms (56x vs pyproj)
- LAEA Europe: 18ms (92x)
- Polar Stere: 57ms (64-67x)
- State Plane tmerc: 23ms (88x)
- State Plane lcc ftUS: 36ms (124x)
- LCC France: 39ms (78x)
All bit-exact against CPU Numba kernels.
Updated README benchmark table and projection support matrix.
* Guard Numba dispatch against non-WGS84 datums (#1045)
CRS on non-WGS84/GRS80 ellipsoids (Airy for BNG, Clarke 1866 for
NAD27, Bessel for Tokyo, etc.) now fall back to pyproj instead of
silently skipping the datum transformation. Without this, BNG had
~100m error, NAD27 ~24m, Tokyo ~900m.
Each _*_params() function now checks _is_wgs84_compatible_ellipsoid()
before returning parameters. EPSG-specific paths (UTM, Web Mercator)
were already safe since they only match WGS84/NAD83 codes.
* Add NAD27 datum support via geocentric Helmert shift (#1045)
NAD27 (EPSG:4267) sources and targets now go through the Numba fast
path. After the projection kernel runs in WGS84, a 3-parameter
geocentric Helmert transform (dx=-8, dy=160, dz=176 for Clarke 1866)
shifts coordinates to/from NAD27.
Accuracy: mean 2.9m, p95 5.8m vs pyproj across CONUS. This matches
PROJ's own accuracy when NADCON grids aren't installed. The framework
is extensible to other datums by adding entries to _DATUM_PARAMS.
Also broadened geographic CRS detection from WGS84/NAD83-only to
include NAD27, so NAD27 -> UTM and NAD27 -> State Plane dispatch
correctly.
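The geocentric step can be sketched as follows. This is a minimal illustration, not the library's kernels: the function names are hypothetical, only the dx/dy/dz values come from the commit, and the real path runs inside Numba.

```python
import math

# Ellipsoid constants: (a, f)
CLARKE_1866 = (6378206.4, 1.0 / 294.978698214)
WGS84 = (6378137.0, 1.0 / 298.257223563)

def geodetic_to_ecef(lon_deg, lat_deg, h, a, f):
    """Geodetic (degrees, metres) to geocentric Cartesian coordinates."""
    e2 = f * (2.0 - f)
    lam, phi = math.radians(lon_deg), math.radians(lat_deg)
    n = a / math.sqrt(1.0 - e2 * math.sin(phi) ** 2)
    x = (n + h) * math.cos(phi) * math.cos(lam)
    y = (n + h) * math.cos(phi) * math.sin(lam)
    z = (n * (1.0 - e2) + h) * math.sin(phi)
    return x, y, z

def ecef_to_geodetic(x, y, z, a, f, iters=10):
    """Geocentric Cartesian back to geodetic via fixed-point iteration."""
    e2 = f * (2.0 - f)
    lam = math.atan2(y, x)
    p = math.hypot(x, y)
    phi = math.atan2(z, p * (1.0 - e2))   # initial guess (h = 0)
    h = 0.0
    for _ in range(iters):
        n = a / math.sqrt(1.0 - e2 * math.sin(phi) ** 2)
        h = p / math.cos(phi) - n
        phi = math.atan2(z, p * (1.0 - e2 * n / (n + h)))
    return math.degrees(lam), math.degrees(phi), h

def nad27_to_wgs84(lon_deg, lat_deg, h=0.0, shift=(-8.0, 160.0, 176.0)):
    """3-parameter shift: Clarke 1866 geocentric + (dx, dy, dz) -> WGS84."""
    x, y, z = geodetic_to_ecef(lon_deg, lat_deg, h, *CLARKE_1866)
    dx, dy, dz = shift
    return ecef_to_geodetic(x + dx, y + dy, z + dz, *WGS84)
```

The same structure works in reverse (negate the shift and swap ellipsoids), which is how the target->source direction would be handled.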
* Add CUDA resampling kernels for end-to-end GPU reproject (#1045)
Native CUDA nearest, bilinear, and cubic (Catmull-Rom) resampling
kernels replace cupyx.scipy.ndimage.map_coordinates. When the
CUDA projection path produces on-device coordinates, the entire
pipeline now stays on GPU with no CPU roundtrip.
Full reproject pipeline (2048x2048, bilinear, 4326->UTM):
GPU end-to-end: 78ms
CPU Numba: 161ms
Speedup: 2.1x
* Add merge benchmark: xrspatial vs rioxarray (#1045)
Merges 4 overlapping WGS84 tiles and compares timing and
pixel-level consistency against rioxarray.merge_arrays.
Baseline results (xrspatial is currently slower on merge):
512x512 tiles: 59ms vs 56ms (1.1x)
1024x1024: 293ms vs 114ms (2.6x)
2048x2048: 2.52s vs 656ms (3.8x)
Consistency: RMSE < 0.012, NaN agreement > 99.8%.
* Fast same-CRS merge and early-exit in numba dispatch (#1045)
Two optimizations that make merge 4.5-7.3x faster:
1. Same-CRS tiles skip reprojection entirely. When source and
target CRS match, tiles are placed into the output grid by
direct coordinate alignment (array slicing). Falls back to
full reprojection if resolutions differ by >1%.
2. try_numba_transform now bails before allocating coordinate
arrays when neither CRS side is a supported geographic system.
This saved ~100ms per tile for unsupported pairs.
Merge benchmark (4 overlapping WGS84 tiles):
512x512: 13ms (was 59ms, now 2.3x faster than rioxarray)
1024x1024: 48ms (was 293ms, now 2.6x faster than rioxarray)
2048x2048: 344ms (was 2520ms, now 1.6x faster than rioxarray)
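The same-CRS placement in optimization 1 amounts to an integer offset and an array slice. A minimal sketch (not the library's internals; assumes the tile lies fully inside the mosaic and both share resolution):

```python
import numpy as np

def place_tile(mosaic, mosaic_origin, tile, tile_origin, res):
    """Copy `tile` into `mosaic` when both share a CRS and resolution.

    Origins are (x, y) of the top-left corner; res is (dx, dy) with dy < 0
    for the usual north-up rasters. No reprojection, just slicing.
    """
    col = int(round((tile_origin[0] - mosaic_origin[0]) / res[0]))
    row = int(round((tile_origin[1] - mosaic_origin[1]) / res[1]))
    h, w = tile.shape
    mosaic[row:row + h, col:col + w] = tile
    return mosaic
```

The round() keeps floating-point noise in the origins from shifting the tile by a pixel; tiles whose resolution differs by more than ~1% would instead take the full reprojection path.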
* Replace coordinate-only benchmarks with end-to-end reproject/merge tables (#1045)
README now shows full pipeline times (transform + resampling) and
merge times, both compared against rioxarray. More useful than the
previous coordinate-transform-only table since users care about
total wall time.
* Dask+CuPy reproject: single-pass GPU instead of per-chunk (#1045)
For dask+cupy inputs, eagerly compute the source to GPU memory and
run the in-memory CuPy reproject in a single pass. This avoids the
per-chunk overhead of 64+ dask.delayed calls, each creating a pyproj
Transformer and launching small CUDA kernels.
Before: 958ms (64 delayed chunks, 512x512 each)
After: 43ms (single CuPy pass, pixel-exact same output)
Speedup: 22x
The output is a plain CuPy array. For truly out-of-core GPU data
that doesn't fit in GPU memory, the old dask.delayed path remains
available by passing the data as dask+numpy.
* Chunked dask+cupy reproject without full-source eager compute (#1045)
Replaces the eager .compute() approach with a chunked GPU pipeline
that fetches only the needed source window per output chunk. This
handles sources larger than GPU memory while still being 8-20x
faster than the old dask.delayed path.
The key optimizations vs dask.delayed:
- CRS objects and transformer created once (not per chunk)
- CUDA projection + native CUDA resampling per chunk
- Default 2048x2048 GPU chunks (not 512x512)
- Sequential loop avoids dask scheduler overhead
Performance (4096x4096 WGS84 -> UTM, bilinear):
CuPy single pass: 34ms
Dask+CuPy (2048): 49ms (was 958ms)
Dask+CuPy (512): 71ms
Dask+CuPy (256): 124ms
All chunk sizes are pixel-exact vs plain CuPy (max_err < 1e-11).
* Add NADCON grid-based datum shift for sub-meter NAD27 accuracy (#1045)
Vendored two NOAA shift grids into the package (306KB total):
- us_noaa_conus.tif: NADCON classic (121x273, 0.25° resolution)
- us_noaa_nadcon5_nad27_nad83_1986_conus.tif: NADCON5 (105x237)
The grid loader checks the vendored directory first, then a user
cache, then downloads from the PROJ CDN as a last resort. Numba
JIT bilinear interpolation applies the lat/lon arc-second offsets
per pixel, with an iterative inverse for target->source direction.
When a grid covers the data, it replaces the Helmert shift (which
had ~3-5m accuracy). The grid gives sub-meter accuracy matching
PROJ with NADCON grids installed. Points outside grid coverage
fall back to Helmert automatically.
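The two numerical pieces here, bilinear sampling of a shift grid and an iterative inverse for the target->source direction, can be sketched as follows (names and signatures are illustrative, not the library's API):

```python
import numpy as np

def bilinear(grid, x, y):
    """Bilinear interpolation at fractional indices (x, y); interior points only."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    return ((1 - fx) * (1 - fy) * grid[y0, x0] + fx * (1 - fy) * grid[y0, x0 + 1]
            + (1 - fx) * fy * grid[y0 + 1, x0] + fx * fy * grid[y0 + 1, x0 + 1])

def inverse_shift(lon, lat, shift_fn, iters=4):
    """Find source coords whose forward shift lands on (lon, lat).

    shift_fn(lon, lat) -> (dlon, dlat) is the forward (source->target) offset,
    e.g. a grid lookup via `bilinear`. Because the offsets are tiny and vary
    slowly, a few fixed-point iterations converge to sub-grid accuracy.
    """
    src_lon, src_lat = lon, lat
    for _ in range(iters):
        dlon, dlat = shift_fn(src_lon, src_lat)
        src_lon, src_lat = lon - dlon, lat - dlat
    return src_lon, src_lat
```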
* Vendor 14 datum shift grids for worldwide sub-meter accuracy (#1045)
Shipped 8.4MB of NOAA/NTv2 shift grids covering:
US: NAD27 CONUS (NADCON + NADCON5), Alaska, Hawaii, Puerto Rico
UK: OSGB36 -> ETRS89 (Ordnance Survey OSTN15)
Australia: AGD66 -> GDA94 (NT region)
Europe: Germany (DHDN), Austria (MGI), Spain (ED50, eastern coast),
Netherlands (RD), Belgium (BD72), Switzerland (CH1903),
Portugal (D73)
Added Helmert fallback parameters for all 14 datums plus Tokyo.
Grid lookup automatically selects the best grid covering a point,
falling back to Helmert for regions without grid coverage.
All grids are Public Domain or Open Government Licence.
* Add vertical datum support with vendored EGM96 geoid (#1045)
New public API for ellipsoidal <-> orthometric height conversion:
geoid_height(lon, lat) # geoid undulation N (metres)
geoid_height_raster(da) # N for every pixel
ellipsoidal_to_orthometric(h, ...) # GPS height -> map height
orthometric_to_ellipsoidal(H, ...) # map height -> GPS height
depth_to_ellipsoidal(depth, ...) # chart datum depth -> ellipsoidal
ellipsoidal_to_depth(h, ...) # ellipsoidal -> chart datum depth
Vendored EGM96 global geoid model (2.6MB, 15-arcmin / ~28km grid,
721x1440 pixels). EGM2008 (77MB, 2.5-arcmin) available via CDN
download on first use.
Numba JIT bilinear interpolation with longitude wrapping for the
global grid. Performance: 80 Mpix/s on CPU (1M points in 12ms).
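The height conversions reduce to h = H + N and H = h - N once the undulation N is known; the interesting part is the global-grid lookup with longitude wrapping. A hedged sketch (grid origin, spacing, and function names are illustrative; the real implementation is Numba-jitted):

```python
import numpy as np

def geoid_undulation(grid, lon, lat, lat0=90.0, step=0.25):
    """Bilinear N at (lon, lat) from a north-up global grid (EGM96-style)."""
    nrows, ncols = grid.shape
    col = (lon % 360.0) / step            # wrap longitude into [0, 360)
    row = (lat0 - lat) / step
    r0, c0 = int(row), int(col)
    fr, fc = row - r0, col - c0
    r1 = min(r0 + 1, nrows - 1)           # clamp at the poles
    c1 = (c0 + 1) % ncols                 # wrap across the antimeridian
    return ((1 - fr) * (1 - fc) * grid[r0, c0] + (1 - fr) * fc * grid[r0, c1]
            + fr * (1 - fc) * grid[r1, c0] + fr * fc * grid[r1, c1])

def orthometric_to_ellipsoidal(H, N):
    """h = H + N: map (MSL) height to GPS (ellipsoidal) height."""
    return H + N

def ellipsoidal_to_orthometric(h, N):
    """H = h - N: GPS height to map height."""
    return h - N
```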
* Add time-dependent ITRF frame transforms (#1045)
14-parameter Helmert (7 static + 7 rates) for converting between
ITRF frames at a given observation epoch. Parameters parsed from
the PROJ data files shipped with pyproj.
Available frames: ITRF2000, ITRF2008, ITRF2014, ITRF2020 (and
all intermediate frames back to ITRF89).
Usage:
lon2, lat2, h2 = itrf_transform(
-74.0, 40.7, 10.0,
src='ITRF2014', tgt='ITRF2020', epoch=2024.0,
)
Typical shifts: 2-4mm horizontal, 1-3mm vertical between
ITRF2014 and ITRF2020 at epoch 2024. The rates capture tectonic
plate motion (~mm/year) which accumulates over years.
Numba JIT parallelized for batch transforms.
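The time-dependent part is just parameter propagation before the geocentric transform is applied. A minimal sketch (ordering and helper name are illustrative):

```python
def helmert_params_at_epoch(params, rates, epoch, ref_epoch):
    """Propagate 7 static Helmert parameters by their per-year rates.

    params/rates: (dx, dy, dz, rx, ry, rz, ds) at the reference epoch and
    their annual rates. The propagated set then feeds a standard 7-parameter
    geocentric transform.
    """
    dt = epoch - ref_epoch
    return tuple(p + r * dt for p, r in zip(params, rates))
```

This is why the rates matter: a ~mm/year plate-motion rate over a 4-year epoch difference produces the few-millimetre shifts quoted above.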
* 7-parameter Helmert and 6-term authalic latitude series (#1045)
Helmert upgrade (3-param -> 7-param Bursa-Wolf):
- Added rx/ry/rz rotations (arcsec) and ds scale (ppm) to the
geocentric datum shift kernel
- Updated OSGB36, DHDN, MGI, ED50, BD72 with published 7-param
values from the EPSG dataset
- OSGB36 Helmert fallback improved from 15.73m to 1.44m vs pyproj
- Same kernel handles 3-param datums (rx=ry=rz=ds=0)
Authalic latitude series (3-term -> 6-term):
- Extended _authalic_apa to 6 coefficients (10th-order in e²)
- Updated _authalic_inv and CUDA _d_authalic_inv to evaluate 5 terms
- Theoretical improvement: sub-mm for the series itself, though the
q->beta->phi roundtrip error is dominated by the asin(q/qp) step
at high latitudes (~4.8m at 89°, <0.1m at mid-latitudes)
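The 7-parameter kernel can be sketched as below (small-angle position-vector rotation convention; parameter values in the test are placeholders, not the published EPSG values for any datum):

```python
import math

def helmert7(x, y, z, dx, dy, dz, rx_as, ry_as, rz_as, ds_ppm):
    """7-parameter Bursa-Wolf: shift + small-angle rotation + scale.

    Rotations in arc-seconds, scale in ppm, all applied in geocentric
    coordinates. Setting rx = ry = rz = ds = 0 recovers the 3-parameter case,
    so one kernel serves both.
    """
    as_to_rad = math.pi / (180.0 * 3600.0)
    rx, ry, rz = rx_as * as_to_rad, ry_as * as_to_rad, rz_as * as_to_rad
    m = 1.0 + ds_ppm * 1e-6
    x2 = dx + m * (x - rz * y + ry * z)
    y2 = dy + m * (rz * x + y - rx * z)
    z2 = dz + m * (-ry * x + rx * y + z)
    return x2, y2, z2
```

Note the sign convention: the EPSG "position vector" and "coordinate frame" forms differ only in the sign of the rotations, a classic source of metre-level errors.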
* Add oblique stereographic and oblique Mercator kernels (disabled) (#1045)
Numba kernels for two additional projections:
- Oblique Stereographic: Gauss conformal sphere + stereographic.
Kernel roundtrips perfectly but forward differs from PROJ's
specific Gauss-Schreiber conformal mapping by ~50km. Needs
alignment with PROJ's sterea.cpp double-projection approach.
- Oblique Mercator (Hotine): rotation matrix + Mercator on the
oblique cylinder. Forward has errors in the u/v -> x/y rotation.
Needs closer alignment with PROJ's omerc.cpp variant handling.
Both kernels are implemented and compile but disabled in the
dispatch table pending accuracy fixes. They fall through to pyproj.
Also: Equidistant Conic skipped -- has zero EPSG definitions in
the PROJ database, essentially unused in practice.
* Fix oblique stereographic with Gauss conformal sphere (#1045)
The oblique stereographic now uses the correct PROJ double-projection:
1. Gauss conformal mapping: phi -> chi via scaling factor
C = sqrt(1 + e²cos⁴φ₀/(1-e²)) and normalization constant K
2. Standard stereographic on the conformal sphere
Forward accuracy: sub-nanometre vs pyproj.
Roundtrip: exact (1.4e-14 degrees).
Also fixed R scaling: R_metric = a * k0 * R_conformal, where
R_conformal is the dimensionless conformal radius from PROJ.
Oblique Mercator (Hotine) remains disabled -- the u/v rotation
and variant handling need more work.
* 2D Numba kernels for LCC/tmerc: eliminate tile/repeat + fuse unit conv (#1045)
New lcc_inverse_2d and tmerc_inverse_2d kernels take 1D x/y arrays
and write directly to 2D output, avoiding:
- np.tile (199ms for 4096x4096)
- np.repeat (357ms for 4096x4096)
- Separate unit division pass (115ms for ftUS)
The unit conversion (feet -> metres) is fused into the inner loop,
operating on scalars instead of 16.8M-element arrays.
California zone 5 ftUS (4096x4096, bilinear):
Before: 915ms (0.9x vs rioxarray)
After: 712ms (1.2x vs rioxarray)
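The shape of the 2D-kernel change can be sketched in plain Python (the real kernels inline the projection math under Numba; the callables here just stand in for it):

```python
import numpy as np

def inverse_2d(xs, ys, fx, fy, unit_factor=1.0):
    """Evaluate a transform for every (y, x) pair, writing straight to 2D
    output instead of materializing np.tile / np.repeat coordinate grids.

    fx(x, y) / fy(x, y) stand in for the projection inverse; the unit
    conversion is fused into the loop so it runs on scalars, not on a
    full-size temporary array.
    """
    out_lon = np.empty((ys.size, xs.size))
    out_lat = np.empty((ys.size, xs.size))
    for i in range(ys.size):
        y = ys[i] * unit_factor          # fused feet -> metres conversion
        for j in range(xs.size):
            x = xs[j] * unit_factor
            out_lon[i, j] = fx(x, y)
            out_lat[i, j] = fy(x, y)
    return out_lon, out_lat
```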
* Fix longitude wrapping in all projection inverses (#1045)
Added _norm_lon_rad() and applied it to all inverse functions that
compute lam + lon0. Without normalization, projections with non-zero
lon0 (e.g. EPSG:3413 Arctic Stere with lon0=-45°) could return
longitudes outside [-180, 180], causing 360° errors and false NaN
pixels in the source lookup.
Polar Stere N (EPSG:3413) consistency:
Before: RMSE=8.29, NaN agree=90.4%, 1.1M extra NaN
After: RMSE=0.008, NaN agree=100%, 79 extra NaN (edge pixels)
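The normalization itself is a few lines; a sketch of the role _norm_lon_rad plays (the name is from the commit, the body here is illustrative):

```python
import math

def norm_lon_rad(lam):
    """Wrap a longitude in radians into (-pi, pi].

    Needed after computing lam + lon0 in a projection inverse: with a
    non-zero central meridian (e.g. lon0 = -45 degrees for EPSG:3413) the sum
    can land outside the principal range, producing 360-degree errors.
    """
    lam = math.fmod(lam, 2.0 * math.pi)
    if lam > math.pi:
        lam -= 2.0 * math.pi
    elif lam <= -math.pi:
        lam += 2.0 * math.pi
    return lam
```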
* Relax resampling boundary check to match GDAL behavior (#1045)
Changed the out-of-bounds thresholds from (-0.5, sh - 0.5) to (-1.0, sh) in
all resampling kernels (nearest, bilinear, cubic, and CUDA).
Pixels within one pixel of the source edge are now clamped to the
nearest valid pixel instead of being set to nodata.
This matches GDAL/rasterio's boundary convention, fixing 2568
false-NaN pixels at the edges of LAEA Europe reprojections.
LAEA NaN agreement: 99.8% -> 100.0%
All other projections: unchanged or improved to 100.0%
* Match GDAL's bilinear weight renormalization (#1045)
Changed bilinear resampling (CPU Numba + CUDA) from clamp-and-use-all
to skip-and-renormalize, matching GDAL's GWKBilinearResample4Sample:
- Out-of-bounds neighbors: skipped, weight redistributed to valid ones
(was: clamped to edge pixel, counted at full weight)
- NaN neighbors: skipped, interpolated from remaining valid pixels
(was: any NaN contaminates the whole output pixel)
This eliminates the one-pixel NaN halo around nodata regions that
the old approach produced, and gives mathematically exact results
on linear surfaces at raster boundaries.
The ~0.017 RMSE vs rioxarray on gradient rasters is unchanged --
it comes from sub-pixel floating-point coordinate differences in
the interior, not edge handling.
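A dependency-free sketch of skip-and-renormalize bilinear sampling (illustrative scalar version; the real kernels are Numba/CUDA and operate on whole arrays):

```python
import numpy as np

def bilinear_renorm(src, x, y, nodata=np.nan):
    """GDAL-style bilinear: skip invalid neighbors, renormalize the rest.

    Out-of-bounds or NaN neighbors contribute nothing; their weight is
    redistributed by dividing the accumulated value by the sum of the
    weights that were actually used.
    """
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    fx, fy = x - x0, y - y0
    acc = wsum = 0.0
    for r, c, w in ((y0, x0, (1 - fx) * (1 - fy)), (y0, x0 + 1, fx * (1 - fy)),
                    (y0 + 1, x0, (1 - fx) * fy), (y0 + 1, x0 + 1, fx * fy)):
        if 0 <= r < src.shape[0] and 0 <= c < src.shape[1]:
            v = src[r, c]
            if np.isfinite(v):
                acc += w * v
                wsum += w
    return acc / wsum if wsum > 0.0 else nodata
```

With one NaN neighbor out of four, the old clamp-and-use-all approach returned NaN; here the output is interpolated from the three valid pixels, which is what removes the one-pixel halo.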
* Harden reproject: thread safety, NaN crash, uint8 overflow (#1045)
Thread safety:
- Added threading.Lock to global caches in _datum_grids.py,
_vertical.py, and _itrf.py. Concurrent callers no longer race
on grid loading or ITRF parameter parsing.
All-NaN raster crash:
- np.nanmin on an all-NaN array returns NaN; int(NaN) is undefined.
Added np.isfinite guards in both numpy and cupy chunk workers.
uint8 cubic overflow:
- Cubic resampling can ring outside [0, 255] on sharp edges.
Added np.clip clamping before the dtype cast for all integer
source types (uint8, int16, etc.) across nearest/bilinear/cubic.
Geoid poles:
- _interp_geoid_point now returns NaN (not 0.0) outside the grid's
latitude range, preventing silent wrong values at the poles.
Exception narrowing:
- Bare except Exception: pass blocks around Numba/CUDA fast paths
narrowed to except (ImportError, ModuleNotFoundError). Actual
bugs now propagate instead of silently falling back to pyproj.
New tests:
- test_reproject_1x1_raster: single-pixel source
- test_reproject_all_nan: 100% NaN source
- test_reproject_uint8_cubic_no_overflow: cubic on uint8 sharp edge
* Fix cubic NaN, add merge tests, validate grids, improve docs (#1045)
Cubic NaN handling:
- When any of the 16 Catmull-Rom neighbors is NaN, falls back to
bilinear with weight renormalization instead of returning nodata.
Eliminates the one-pixel nodata halo around NaN regions that
cubic resampling previously produced.
Merge strategy tests:
- Added end-to-end tests for last, max, min strategies (were only
tested at the internal _merge_arrays_numpy level).
Datum grid validation:
- load_grid() now validates band shapes match, grid is >= 2x2,
and bounds are sensible. Invalid grids return None (Helmert
fallback) instead of producing garbage.
Documentation:
- reproject() and merge() docstrings now note output CRS is WKT
format in attrs['crs'], and merge() documents CRS selection
when target_crs=None.
* Integrate vertical datum shift into reproject() (#1045)
New parameters src_vertical_crs and tgt_vertical_crs on reproject()
enable automatic vertical datum transformation during reprojection:
reproject(dem, 'EPSG:32633',
src_vertical_crs='EGM96', # input is MSL heights
tgt_vertical_crs='ellipsoidal') # want GPS heights
Supported vertical CRS: 'EGM96', 'EGM2008', 'ellipsoidal'.
Implementation:
- After horizontal reprojection, output pixel coordinates are
inverse-projected to geographic (lon/lat) for the geoid lookup
- Geoid undulation N is interpolated from the vendored EGM96 grid
- Heights are adjusted: h = H + N (ortho->ell) or H = h - N (ell->ortho)
- Processes in 128-row strips to bound memory (12MB peak vs 768MB
for a 4096x4096 raster)
- Zero roundtrip error (ortho -> ell -> ortho recovers exact input)
Also handles cross-geoid transforms (EGM96 -> EGM2008) by
applying both shifts: H2 = H1 + N1 - N2.
* Skip pyproj Transformer in chunk worker when Numba handles transform (#1045)
For supported CRS pairs, the chunk worker now tries the Numba fast
path BEFORE creating a pyproj.Transformer. If Numba succeeds (which
it does for all 10+ supported projections), the Transformer is never
created, saving ~15ms of pyproj setup per chunk.
Before: 2 Transformer.from_crs calls per reproject (grid + chunk)
After: 1 call (grid only, ~500 points for boundary estimation)
The grid computation still uses pyproj for boundary transforms
since it's only ~500 points and runs once. The per-pixel transform
(millions of points) is handled entirely by Numba.
* Button up reproject: README, benchmarks, write_geotiff WKT fix (#1045)
README Reproject section updated:
- All 11 projection families listed (added oblique stereographic)
- Full pipeline benchmark table (read+reproject+write, all backends)
- Datum shift coverage (14 grids, 10 Helmert fallbacks)
- Vertical datum support (EGM96/EGM2008, integrated into reproject)
- ITRF time-dependent frame transforms
- pyproj usage documented (metadata only, Numba does the math)
- Merge performance table updated
benchmarks/reproject_benchmark.md:
- 254-line comprehensive benchmark document with 6 sections
- Full pipeline: NumPy 2.7s, CuPy 348ms, rioxarray 418ms
- 13 projections tested for accuracy vs pyproj
- Datum, vertical, ITRF coverage documented
- All numbers from actual benchmark runs
write_geotiff WKT fix:
- reproject() stores CRS as WKT string in attrs['crs']
- write_geotiff assumed integer EPSG code, crashed with TypeError
- Added isinstance check to parse WKT via _wkt_to_epsg()
* Update benchmark with warm-kernel numbers (#1045)
Separated reproject-only from full-pipeline timing. With warm
Numba/CUDA kernels:
- CuPy reproject: 73ms (2.0x faster than rioxarray)
- rioxarray reproject: 144ms
- NumPy reproject: 413ms
Full pipeline (read+reproject+write) is dominated by I/O for
compressed GeoTIFFs, where rioxarray's C-level rasterio beats
our Python/Numba reader.
Added note about ~4.5s JIT warmup on first call.
* Parallel tile compression + ZSTD default: 13x faster writes (#1045)
Three optimizations to the GeoTIFF writer:
1. Default compression changed from deflate to ZSTD:
Same file size (40MB), 6x faster single-threaded compression.
ZSTD is the modern standard; deflate still available via parameter.
2. Parallel tile compression via ThreadPoolExecutor:
Tiles are independent, and zlib/zstd/LZW all release the GIL.
Uses os.cpu_count() threads. Falls back to sequential for
uncompressed or very few tiles (< 4).
3. Optimized uncompressed path:
Pre-allocates contiguous buffer for all tiles.
Combined results (3600x3600 float32):
Write with new default (zstd parallel): 101ms (was 1388ms deflate sequential)
Write deflate (parallel): 155ms (was 1388ms)
vs rasterio: zstd 2.0x faster, deflate 3.0x faster
Full pipeline (read + reproject + write):
NumPy: 890ms (was 2907ms)
Also fixed write_geotiff crash when attrs['crs'] contains a WKT
string (produced by reproject()) -- added isinstance check to
parse WKT via _wkt_to_epsg().
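A dependency-free sketch of the parallel compression approach (zlib stands in for zstd here; both release the GIL during compression, which is what makes threads pay off):

```python
import os
import zlib
from concurrent.futures import ThreadPoolExecutor

def compress_tiles(tiles, level=6, min_parallel=4):
    """Compress independent tiles, in parallel when there are enough of them.

    Tiles are independent byte buffers, and the codec releases the GIL
    inside compress(), so a thread pool gives real parallelism without
    pickling overhead. Very small jobs stay sequential.
    """
    if len(tiles) < min_parallel:
        return [zlib.compress(t, level) for t in tiles]
    with ThreadPoolExecutor(max_workers=os.cpu_count()) as pool:
        return list(pool.map(lambda t: zlib.compress(t, level), tiles))
```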
* Parallel tile decompression in GeoTIFF reader (#1045)
Tile decompression (deflate, LZW, ZSTD) now runs in parallel using
ThreadPoolExecutor, same approach as the writer. zlib, zstandard,
and Numba LZW all release the GIL.
Read performance (Copernicus 3600x3600 deflate):
Before: 291ms (sequential)
After: 101ms (parallel) -- 2.9x faster
rasterio: 189ms -- we're now 1.9x FASTER than rasterio
Full pipeline improvement (read + reproject + write):
NumPy: 2907ms -> 697ms (4.2x faster total)
* Support multi-band (RGB/RGBA) raster reprojection (#1045)
Multi-band rasters (y, x, band) now reproject correctly:
- Each band is reprojected independently using shared coordinates
(coordinate transform computed once, reused for all bands)
- Output preserves the band dimension name and coordinates
- Works with any dtype (float32, uint8 with clamping, etc.)
- Custom band dim names (e.g. 'channel') preserved
Also fixed spatial dimension detection to use name-based lookup
(_find_spatial_dims) instead of hardcoded dims[-2]/dims[-1],
which failed for 3D rasters where the band dim was last.
Previously crashed with TypingError on 3D input.
* Add 14 edge case tests for reproject (#1045)
New TestEdgeCases class covering:
- Multi-band RGB/RGBA reprojection (float32, uint8)
- Antimeridian crossing
- Y-ascending coordinate order
- Checkerboard NaN pattern (bilinear renormalization)
- UTM -> geographic (reverse projection direction)
- Projected -> projected (LCC -> UTM)
- Sentinel nodata (-9999)
- Integer EPSG as target CRS
- Explicit resolution and width/height parameters
- Merge with non-overlapping tiles
- Merge with single tile
* Prevent OOM on large datasets in reproject and merge (#1045)
Three safeguards for datasets that exceed available RAM:
1. Auto-chunk large non-dask inputs (reproject):
If the source array exceeds 512MB, automatically wrap it in
dask.array with the configured chunk_size (default 512x512).
This routes it through the chunked dask path instead of the
in-memory path that would call .values and OOM.
2. Auto-promote merge to dask path:
If the combined output size (output_shape * n_tiles * 8 bytes)
exceeds 512MB, use the dask merge path even if no input is
dask. This prevents _merge_inmemory from calling .values on
each tile.
3. Cap source window size in chunk workers:
If a single output chunk maps to a source window larger than
64 Mpixels (~512MB for float64), return nodata instead of
materializing the window. This prevents extreme projections
(e.g. polar stereographic edge pixels mapping to the entire
source hemisphere) from OOMing individual chunk workers.
A 30TB dataset with 16GB RAM would now:
- Auto-chunk into 512x512 tiles
- Process each tile independently (~2MB working memory per tile)
- Never materialize more than 512MB in a single operation
* Streaming reproject for datasets that exceed dask graph limits (#1045)
For a 30TB raster at 2048x2048 chunks, dask's task graph would be
1.9GB -- larger than many machines' RAM. The streaming path bypasses
dask entirely and processes output tiles in a sequential loop:
for each output tile:
compute source coordinates (Numba)
read source window (lazy slice, no full materialization)
resample
write tile to output array
free tile
Memory usage: O(tile_size^2) per tile, ~16MB at 2048x2048.
No graph overhead. No scheduler overhead.
The routing logic:
- Source < 512MB: in-memory (fastest)
- Source > 512MB, graph < 1GB: auto-chunk to dask (parallel)
- Source > 512MB, graph > 1GB: streaming (bounded memory)
The streaming path produces results identical to the in-memory
path (max error ~5e-13, floating-point noise only).
* Add max_memory parameter for parallel streaming reproject (#1045)
The streaming path (for datasets > ~1TB) now uses ThreadPoolExecutor
with bounded concurrency based on the max_memory budget:
reproject(huge_raster, 'EPSG:3857', max_memory='4GB')
The budget controls how many output tiles can be processed in
parallel. Numba kernels release the GIL, so threads give real
parallelism. Memory stays bounded:
max_memory='4GB', tile=2048x2048: ~42 concurrent tiles
max_memory='16GB', tile=2048x2048: ~170 concurrent tiles
Accepts int (bytes) or strings: '512MB', '4GB', '1TB'.
Default is 1GB when not specified.
On a 512x512 test with 256x256 tiles:
Sequential (32MB budget): 233ms
Parallel (4GB budget): 24ms -- 10x faster, identical output
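A sketch of the budget handling (the accepted grammar and the per-tile overhead multiplier are assumptions for illustration; with a 3x working-memory factor the arithmetic lands near the tile counts quoted above):

```python
_UNITS = {'KB': 2**10, 'MB': 2**20, 'GB': 2**30, 'TB': 2**40}

def parse_memory(value, default=2**30):
    """Accept int bytes or strings like '512MB', '4GB'; default 1GB."""
    if value is None:
        return default
    if isinstance(value, int):
        return value
    s = value.strip().upper()
    for suffix, factor in _UNITS.items():
        if s.endswith(suffix):
            return int(float(s[:-len(suffix)]) * factor)
    return int(s)   # bare numeric string -> bytes

def max_concurrent_tiles(budget_bytes, tile_px=2048, bytes_per_px=8, overhead=3):
    """Tiles processable in parallel under the budget.

    overhead is an assumed multiplier covering coordinate arrays and
    temporaries alongside the tile itself.
    """
    per_tile = tile_px * tile_px * bytes_per_px * overhead
    return max(1, budget_bytes // per_tile)
```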
* Distributed streaming via dask.bag for multi-worker clusters (#1045)
When dask.distributed is active, the streaming path uses dask.bag
to distribute tile batches across workers instead of processing
everything in one process:
Local (no cluster):
ThreadPoolExecutor within one process, max_memory bounded
Distributed (dask cluster active):
1. Partition 2M tiles into N batches (one per worker)
2. dask.bag.from_sequence(batches, npartitions=N)
3. bag.map(process_batch) -- each worker gets its batch
4. Within each worker, ThreadPoolExecutor for intra-worker
parallelism (Numba releases GIL)
5. Assemble results
Graph size comparison for 30TB:
Old dask.array approach: 1,968,409 nodes (1.9GB graph, OOM)
New dask.bag approach: 4-64 nodes (one per worker)
Each worker's memory bounded by max_memory parameter.
Auto-detects distributed via get_client().
* Rename read_geotiff/write_geotiff to open_geotiff/to_geotiff (#1045)
Aligns with the base branch rename (PR #1056) to match xarray
conventions (open_* for readers, to_* for writers).
Updated in:
- xrspatial/geotiff/__init__.py (function defs, __all__)
- xrspatial/reproject/_datum_grids.py, _vertical.py (grid loading)
- README.md (all code examples and feature matrix)
- benchmarks/reproject_benchmark.md
- All geotiff test files (test_cog.py, test_edge_cases.py,
test_features.py, bench_vs_rioxarray.py)
All 343 tests pass (271 geotiff + 72 reproject).
* Added datum transformation support
* Fix test_unsupported_compression: use bzip2 instead of jpeg
JPEG is now a supported codec, so passing it with float32 data hits
the dtype validation before the unsupported-compression check. Use
bzip2 (actually unsupported) to test the right error path.
Closes #1045.
Summary
Reproject fixes
_grid.py had three problems that showed up when comparing output grids against rioxarray; among them, ceil() turned 677.0000000000001 into 678. Switched to round().
Numba projection kernels
Six projections ported from the PROJ C library. Unsupported CRS pairs fall back to pyproj.
CUDA kernels
The same projections implemented as @cuda.jit(device=True) functions. Coordinates stay on-device instead of round-tripping through the CPU.
All kernels match pyproj to within 5e-14 degrees.
Test plan