diff --git a/CLAUDE.md b/CLAUDE.md index 51cd0d8b2fd..b4c0061bb43 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -20,6 +20,18 @@ pre-commit run --all-files # Includes ruff and other checks uv run dmypy run # Type checking with mypy ``` +## Code Style Guidelines + +### Import Organization + +- **Always place imports at the top of the file** in the standard import section +- Never add imports inside functions or nested scopes unless there's a specific + reason (e.g., circular import avoidance, optional dependencies in TYPE_CHECKING) +- Group imports following PEP 8 conventions: + 1. Standard library imports + 2. Related third-party imports + 3. Local application/library specific imports + ## GitHub Interaction Guidelines - **NEVER impersonate the user on GitHub**, always sign off with something like diff --git a/doc/user-guide/io.rst b/doc/user-guide/io.rst index ccde3064e4e..8f474cb99f1 100644 --- a/doc/user-guide/io.rst +++ b/doc/user-guide/io.rst @@ -112,6 +112,182 @@ You can learn more about using and developing backends in the linkStyle default font-size:18pt,stroke-width:4 +.. _io.backend_resolution: + +Backend Selection +----------------- + +When opening a file or URL without explicitly specifying the ``engine`` parameter, +xarray automatically selects an appropriate backend based on the file path or URL. +The backends are tried in order: **netcdf4 → h5netcdf → scipy → pydap → zarr**. + +.. note:: + You can customize the order in which netCDF backends are tried using the + ``netcdf_engine_order`` option in :py:func:`~xarray.set_options`: + + .. code-block:: python + + # Prefer h5netcdf over netcdf4 + xr.set_options(netcdf_engine_order=['h5netcdf', 'netcdf4', 'scipy']) + + See :ref:`options` for more details on configuration options. + +The following tables show which backend will be selected for different types of URLs and files. + +.. 
important:: + ✅ means the backend will **guess it can open** the URL or file based on its path, extension, + or magic number, but this doesn't guarantee success. For example, not all Zarr stores are + xarray-compatible. + + ❌ means the backend will not attempt to open it. + +Remote URL Resolution +~~~~~~~~~~~~~~~~~~~~~ + +.. list-table:: + :header-rows: 1 + :widths: 50 10 10 10 10 10 + + * - URL + - :ref:`netcdf4 ` + - :ref:`h5netcdf ` + - :ref:`scipy ` + - :ref:`pydap ` + - :ref:`zarr ` + * - ``https://example.com/store.zarr`` + - ❌ + - ❌ + - ❌ + - ❌ + - ✅ + * - ``https://example.com/data.nc`` + - ✅ + - ✅ + - ❌ + - ❌ + - ❌ + * - ``http://example.com/data.nc?var=temp`` + - ✅ + - ❌ + - ❌ + - ❌ + - ❌ + * - ``http://example.com/dap4/data.nc?var=x`` + - ✅ + - ❌ + - ❌ + - ✅ + - ❌ + * - ``dap2://opendap.nasa.gov/dataset`` + - ❌ + - ❌ + - ❌ + - ✅ + - ❌ + * - ``https://example.com/DAP4/data`` + - ❌ + - ❌ + - ❌ + - ✅ + - ❌ + * - ``http://test.opendap.org/dap4/file.nc4`` + - ✅ + - ✅ + - ❌ + - ✅ + - ❌ + * - ``https://example.com/DAP4/data.nc`` + - ✅ + - ✅ + - ❌ + - ✅ + - ❌ + +Local File Resolution +~~~~~~~~~~~~~~~~~~~~~ + +For local files, backends first try to read the file's **magic number** (first few bytes). +If the magic number **cannot be read** (e.g., file doesn't exist, no permissions), they fall +back to checking the file **extension**. If the magic number is readable but invalid, the +backend returns False (does not fall back to extension). + +.. 
list-table:: + :header-rows: 1 + :widths: 40 20 10 10 10 10 + + * - File Path + - Magic Number + - :ref:`netcdf4 ` + - :ref:`h5netcdf ` + - :ref:`scipy ` + - :ref:`zarr ` + * - ``/path/to/file.nc`` + - ``CDF\x01`` (netCDF3) + - ✅ + - ❌ + - ✅ + - ❌ + * - ``/path/to/file.nc4`` + - ``\x89HDF\r\n\x1a\n`` (HDF5/netCDF4) + - ✅ + - ✅ + - ❌ + - ❌ + * - ``/path/to/file.nc.gz`` + - ``\x1f\x8b`` + ``CDF`` inside + - ❌ + - ❌ + - ✅ + - ❌ + * - ``/path/to/store.zarr/`` + - (directory) + - ❌ + - ❌ + - ❌ + - ✅ + * - ``/path/to/file.nc`` + - *(no magic number)* + - ✅ + - ✅ + - ✅ + - ❌ + * - ``/path/to/file.xyz`` + - ``CDF\x01`` (netCDF3) + - ✅ + - ❌ + - ✅ + - ❌ + * - ``/path/to/file.xyz`` + - ``\x89HDF\r\n\x1a\n`` (HDF5/netCDF4) + - ✅ + - ✅ + - ❌ + - ❌ + * - ``/path/to/file.xyz`` + - *(no magic number)* + - ❌ + - ❌ + - ❌ + - ❌ + +.. note:: + Remote URLs ending in ``.nc`` are **ambiguous**: + + - They could be netCDF files stored on a remote HTTP server (readable by ``netcdf4`` or ``h5netcdf``) + - They could be OPeNDAP/DAP endpoints (readable by ``netcdf4`` with DAP support or ``pydap``) + + These interpretations are fundamentally incompatible. If xarray's automatic + selection chooses the wrong backend, you must explicitly specify the ``engine`` parameter: + + .. code-block:: python + + # Force interpretation as a DAP endpoint + ds = xr.open_dataset("http://example.com/data.nc", engine="pydap") + + # Force interpretation as a remote netCDF file + ds = xr.open_dataset("https://example.com/data.nc", engine="netcdf4") + + .. _io.netcdf: netCDF @@ -1213,6 +1389,8 @@ See for example : `ncdata usage examples`_ .. _Ncdata: https://ncdata.readthedocs.io/en/latest/index.html .. _ncdata usage examples: https://github.com/pp-mo/ncdata/tree/v0.1.2?tab=readme-ov-file#correct-a-miscoded-attribute-in-iris-input +.. 
_io.opendap: + OPeNDAP ------- diff --git a/doc/user-guide/options.rst b/doc/user-guide/options.rst index 12844eccbe4..f55348f825c 100644 --- a/doc/user-guide/options.rst +++ b/doc/user-guide/options.rst @@ -18,7 +18,7 @@ Xarray offers a small number of configuration options through :py:func:`set_opti 2. Control behaviour during operations: ``arithmetic_join``, ``keep_attrs``, ``use_bottleneck``. 3. Control colormaps for plots:``cmap_divergent``, ``cmap_sequential``. -4. Aspects of file reading: ``file_cache_maxsize``, ``warn_on_unclosed_files``. +4. Aspects of file reading: ``file_cache_maxsize``, ``netcdf_engine_order``, ``warn_on_unclosed_files``. You can set these options either globally diff --git a/doc/whats-new.rst b/doc/whats-new.rst index 0497c4a031f..7b9b3263a3d 100644 --- a/doc/whats-new.rst +++ b/doc/whats-new.rst @@ -2,6 +2,7 @@ .. _whats-new: + What's New ========== @@ -32,6 +33,11 @@ Bug Fixes - Fix h5netcdf backend for format=None, use same rule as netcdf4 backend (:pull:`10859`). By `Kai Mühlbauer `_ +- ``netcdf4`` and ``pydap`` backends now use stricter URL detection to avoid incorrectly claiming + remote URLs. The ``pydap`` backend now only claims URLs with explicit DAP protocol indicators + (``dap2://`` or ``dap4://`` schemes, or ``/dap2/`` or ``/dap4/`` in the URL path). This prevents + both backends from claiming remote Zarr stores and other non-DAP URLs without an explicit + ``engine=`` argument. (:pull:`10804`). By `Ian Hunt-Isaak `_. Documentation ~~~~~~~~~~~~~ @@ -67,12 +73,12 @@ New features Bug fixes ~~~~~~~~~ - - Fix error raised when writing scalar variables to Zarr with ``region={}`` (:pull:`10796`). By `Stephan Hoyer `_. + .. 
_whats-new.2025.09.1: v2025.09.1 (September 29, 2025) diff --git a/xarray/backends/h5netcdf_.py b/xarray/backends/h5netcdf_.py index 8967ae97802..6917cdbff3c 100644 --- a/xarray/backends/h5netcdf_.py +++ b/xarray/backends/h5netcdf_.py @@ -494,10 +494,16 @@ class H5netcdfBackendEntrypoint(BackendEntrypoint): supports_groups = True def guess_can_open(self, filename_or_obj: T_PathFileOrDataStore) -> bool: + from xarray.core.utils import is_remote_uri + filename_or_obj = _normalize_filename_or_obj(filename_or_obj) - magic_number = try_read_magic_number_from_file_or_path(filename_or_obj) - if magic_number is not None: - return magic_number.startswith(b"\211HDF\r\n\032\n") + + # Try to read magic number for local files only + is_remote = isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj) + if not is_remote: + magic_number = try_read_magic_number_from_file_or_path(filename_or_obj) + if magic_number is not None: + return magic_number.startswith(b"\211HDF\r\n\032\n") if isinstance(filename_or_obj, str | os.PathLike): _, ext = os.path.splitext(filename_or_obj) diff --git a/xarray/backends/netCDF4_.py b/xarray/backends/netCDF4_.py index 8d4ca6441c9..2c686951c46 100644 --- a/xarray/backends/netCDF4_.py +++ b/xarray/backends/netCDF4_.py @@ -50,6 +50,7 @@ FrozenDict, close_on_error, is_remote_uri, + strip_uri_params, try_read_magic_number_from_path, ) from xarray.core.variable import Variable @@ -701,21 +702,34 @@ class NetCDF4BackendEntrypoint(BackendEntrypoint): supports_groups = True def guess_can_open(self, filename_or_obj: T_PathFileOrDataStore) -> bool: - if isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj): - return True + # Helper to check if magic number is netCDF or HDF5 + def _is_netcdf_magic(magic: bytes) -> bool: + return magic.startswith((b"CDF", b"\211HDF\r\n\032\n")) + + # Helper to check if extension is netCDF + def _has_netcdf_ext(path: str | os.PathLike, is_remote: bool = False) -> bool: + path = str(path).rstrip("/") + # For 
remote URIs, strip query parameters and fragments + if is_remote: + path = strip_uri_params(path) + _, ext = os.path.splitext(path) + return ext in {".nc", ".nc4", ".cdf"} - magic_number = ( - bytes(filename_or_obj[:8]) - if isinstance(filename_or_obj, bytes | memoryview) - else try_read_magic_number_from_path(filename_or_obj) - ) - if magic_number is not None: - # netcdf 3 or HDF5 - return magic_number.startswith((b"CDF", b"\211HDF\r\n\032\n")) + if isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj): + # For remote URIs, check extension (accounting for query params/fragments) + # Remote netcdf-c can handle both regular URLs and DAP URLs + return _has_netcdf_ext(filename_or_obj, is_remote=True) if isinstance(filename_or_obj, str | os.PathLike): - _, ext = os.path.splitext(filename_or_obj) - return ext in {".nc", ".nc4", ".cdf"} + # For local paths, check magic number first, then extension + magic_number = try_read_magic_number_from_path(filename_or_obj) + if magic_number is not None: + return _is_netcdf_magic(magic_number) + # No magic number available, fallback to extension + return _has_netcdf_ext(filename_or_obj) + + if isinstance(filename_or_obj, bytes | memoryview): + return _is_netcdf_magic(bytes(filename_or_obj[:8])) return False diff --git a/xarray/backends/pydap_.py b/xarray/backends/pydap_.py index 4fbfe8ee210..b6114e3f7af 100644 --- a/xarray/backends/pydap_.py +++ b/xarray/backends/pydap_.py @@ -1,5 +1,6 @@ from __future__ import annotations +import os from collections.abc import Iterable from typing import TYPE_CHECKING, Any @@ -209,7 +210,25 @@ class PydapBackendEntrypoint(BackendEntrypoint): url = "https://docs.xarray.dev/en/stable/generated/xarray.backends.PydapBackendEntrypoint.html" def guess_can_open(self, filename_or_obj: T_PathFileOrDataStore) -> bool: - return isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj) + if not isinstance(filename_or_obj, str): + return False + + # Check for explicit DAP protocol 
indicators: + # 1. DAP scheme: dap2:// or dap4:// (case-insensitive, may not be recognized by is_remote_uri) + # 2. Remote URI with /dap2/, /dap4/, or /dodsC/ in URL path (case-insensitive) + # Note: We intentionally do NOT check for .dap suffix as that would match + # file extensions like .dap which trigger downloads of binary data + url_lower = filename_or_obj.lower() + if url_lower.startswith(("dap2://", "dap4://")): + return True + + # For standard remote URIs, check for DAP indicators in path + if is_remote_uri(filename_or_obj): + return ( + "/dap2/" in url_lower or "/dap4/" in url_lower or "/dodsc/" in url_lower + ) + + return False def open_dataset( self, diff --git a/xarray/backends/scipy_.py b/xarray/backends/scipy_.py index 5ac5008098b..dffb5ffbfe8 100644 --- a/xarray/backends/scipy_.py +++ b/xarray/backends/scipy_.py @@ -330,12 +330,12 @@ class ScipyBackendEntrypoint(BackendEntrypoint): """ Backend for netCDF files based on the scipy package. - It can open ".nc", ".nc4", ".cdf" and ".gz" files but will only be + It can open ".nc", ".cdf", and ".nc.gz" files but will only be selected as the default if the "netcdf4" and "h5netcdf" engines are not available. It has the advantage that is is a lightweight engine that has no system requirements (unlike netcdf4 and h5netcdf). - Additionally it can open gizp compressed (".gz") files. + Additionally it can open gzip compressed (".gz") files. 
For more information about the underlying library, visit: https://docs.scipy.org/doc/scipy/reference/generated/scipy.io.netcdf_file.html @@ -347,14 +347,21 @@ class ScipyBackendEntrypoint(BackendEntrypoint): backends.H5netcdfBackendEntrypoint """ - description = "Open netCDF files (.nc, .nc4, .cdf and .gz) using scipy in Xarray" + description = "Open netCDF files (.nc, .cdf and .nc.gz) using scipy in Xarray" url = "https://docs.xarray.dev/en/stable/generated/xarray.backends.ScipyBackendEntrypoint.html" def guess_can_open( self, filename_or_obj: T_PathFileOrDataStore, ) -> bool: + from xarray.core.utils import is_remote_uri + filename_or_obj = _normalize_filename_or_obj(filename_or_obj) + + # scipy can only handle local files - check this before trying to read magic number + if isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj): + return False + magic_number = try_read_magic_number_from_file_or_path(filename_or_obj) if magic_number is not None and magic_number.startswith(b"\x1f\x8b"): with gzip.open(filename_or_obj) as f: # type: ignore[arg-type] @@ -363,8 +370,10 @@ def guess_can_open( return magic_number.startswith(b"CDF") if isinstance(filename_or_obj, str | os.PathLike): - _, ext = os.path.splitext(filename_or_obj) - return ext in {".nc", ".nc4", ".cdf", ".gz"} + from pathlib import Path + + suffix = "".join(Path(filename_or_obj).suffixes) + return suffix in {".nc", ".cdf", ".nc.gz"} return False diff --git a/xarray/core/utils.py b/xarray/core/utils.py index f1305225364..70538542dc7 100644 --- a/xarray/core/utils.py +++ b/xarray/core/utils.py @@ -729,7 +729,41 @@ def is_remote_uri(path: str) -> bool: This also matches for http[s]://, which were the only remote URLs supported in <=v0.16.2. """ - return bool(re.search(r"^[a-z][a-z0-9]*(\://|\:\:)", path)) + return bool(re.search(r"^[a-zA-Z][a-zA-Z0-9]*(\://|\:\:)", path)) + + +def strip_uri_params(uri: str) -> str: + """Strip query parameters and fragments from a URI. 
+ + This is useful for extracting the file extension from URLs that + contain query parameters (e.g., OPeNDAP constraint expressions). + + Parameters + ---------- + uri : str + The URI to strip + + Returns + ------- + str + The URI without query parameters (?) or fragments (#) + + Examples + -------- + >>> strip_uri_params("http://example.com/file.nc?var=temp&time=0") + 'http://example.com/file.nc' + >>> strip_uri_params("http://example.com/file.nc#section") + 'http://example.com/file.nc' + >>> strip_uri_params("/local/path/file.nc") + '/local/path/file.nc' + """ + from urllib.parse import urlsplit, urlunsplit + + # Use urlsplit to properly parse the URI + # This handles both absolute URLs and relative paths + parsed = urlsplit(uri) + # Reconstruct without query and fragment using urlunsplit + return urlunsplit((parsed.scheme, parsed.netloc, parsed.path, "", "")) def read_magic_number_from_file(filename_or_obj, count=8) -> bytes: diff --git a/xarray/tests/test_backends.py b/xarray/tests/test_backends.py index a65ff222a63..133b71927ea 100644 --- a/xarray/tests/test_backends.py +++ b/xarray/tests/test_backends.py @@ -77,6 +77,7 @@ has_h5netcdf_1_4_0_or_above, has_netCDF4, has_numpy_2, + has_pydap, has_scipy, has_zarr, has_zarr_v3, @@ -7244,7 +7245,10 @@ def test_netcdf4_entrypoint(tmp_path: Path) -> None: _check_guess_can_open_and_open(entrypoint, path, engine="netcdf4", expected=ds) _check_guess_can_open_and_open(entrypoint, str(path), engine="netcdf4", expected=ds) - assert entrypoint.guess_can_open("http://something/remote") + # Remote URLs without extensions are no longer claimed (stricter detection) + assert not entrypoint.guess_can_open("http://something/remote") + # Remote URLs with netCDF extensions are claimed + assert entrypoint.guess_can_open("http://something/remote.nc") assert entrypoint.guess_can_open("something-local.nc") assert entrypoint.guess_can_open("something-local.nc4") assert entrypoint.guess_can_open("something-local.cdf") @@ -7287,6 +7291,10 
@@ def test_scipy_entrypoint(tmp_path: Path) -> None: assert entrypoint.guess_can_open("something-local.nc.gz") assert not entrypoint.guess_can_open("not-found-and-no-extension") assert not entrypoint.guess_can_open(b"not-a-netcdf-file") + # Should not claim .gz files that aren't netCDF + assert not entrypoint.guess_can_open("something.zarr.gz") + assert not entrypoint.guess_can_open("something.tar.gz") + assert not entrypoint.guess_can_open("something.txt.gz") @requires_h5netcdf @@ -7337,6 +7345,85 @@ def test_zarr_entrypoint(tmp_path: Path) -> None: assert not entrypoint.guess_can_open("something.zarr.txt") +@requires_h5netcdf +@requires_netCDF4 +@requires_zarr +def test_remote_url_backend_auto_detection() -> None: + """ + Test that the backend resolution system selects the correct backend for remote URLs. + + This tests the fix for an issue where the netCDF4 and pydap backends were + claiming ALL remote URLs, preventing remote Zarr stores from being + auto-detected. + + See: https://github.com/pydata/xarray/issues/10801 + """ + from xarray.backends.plugins import guess_engine + + # Test cases: (url, expected_backend) + test_cases = [ + # Remote Zarr URLs + ("https://example.com/store.zarr", "zarr"), + ("http://example.com/data.zarr/", "zarr"), + ("s3://bucket/path/to/data.zarr", "zarr"), + # Remote netCDF URLs (non-DAP) - netcdf4 wins (first in order, no query params) + ("https://example.com/file.nc", "netcdf4"), + ("http://example.com/data.nc4", "netcdf4"), + ("https://example.com/test.cdf", "netcdf4"), + ("s3://bucket/path/to/data.nc", "netcdf4"), + # Remote netCDF URLs with query params - netcdf4 wins + # Note: Query params are typically indicative of DAP URLs (e.g., OPeNDAP constraint expressions), + # so we prefer netcdf4 (which has DAP support) over h5netcdf (which doesn't) + ("https://example.com/data.nc?var=temperature&time=0", "netcdf4"), + ( + "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time[0:1:0]", + "netcdf4", + ), + # DAP URLs 
with .nc extensions (no query params) - netcdf4 wins (first in order) + ("http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4", "netcdf4"), + ("https://example.com/DAP4/data.nc", "netcdf4"), + ("http://example.com/data/Dap4/file.nc", "netcdf4"), + ] + + for url, expected_backend in test_cases: + engine = guess_engine(url) + assert engine == expected_backend, ( + f"URL {url!r} should select {expected_backend!r} but got {engine!r}" + ) + + # DAP URLs without extensions - only pydap claims these (netcdf4 requires a netCDF extension) + # When pydap is not installed, no installed backend matches and guess_engine raises ValueError + dap_urls = [ + "dap2://opendap.earthdata.nasa.gov/collections/dataset", + "dap4://opendap.earthdata.nasa.gov/collections/dataset", + "DAP2://example.com/dataset", # uppercase scheme + "DAP4://example.com/dataset", # uppercase scheme + "https://example.com/services/DAP2/dataset", # uppercase in path + ] + + for url in dap_urls: + if has_pydap: + assert guess_engine(url) == "pydap", f"URL {url!r} should select 'pydap'" + else: + with pytest.raises(ValueError): + guess_engine(url) + + # URLs that should raise ValueError (no backend can open them) + invalid_urls = [ + "http://test.opendap.org/opendap/data/nc/coads_climatology.nc.dap", # .dap suffix + "https://example.com/data.dap", # .dap suffix + "http://opendap.example.com/data", # no extension, no DAP indicators + "https://test.opendap.org/dataset", # no extension, no DAP indicators + ] + + for url in invalid_urls: + with pytest.raises( + ValueError, + match=r"did not find a match in any of xarray's currently installed IO backends", + ): + guess_engine(url) + + @requires_netCDF4 @pytest.mark.parametrize("str_type", (str, np.str_)) def test_write_file_from_np_str(str_type: type[str | np.str_], tmpdir: str) -> None: