ianhi
Contributor

@ianhi ianhi commented Sep 30, 2025

Previously, pydap or netcdf4 (if installed) would claim any remote URL, depending on the order of backend resolution.

@ianhi ianhi changed the title fix: be more more caution when claiming a backend can open a URL fix: be more cautious when claiming a backend can open a URL Sep 30, 2025
Member

@shoyer shoyer left a comment

Looks great, thanks @ianhi !

Comment on lines 213 to 220
if not (isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj)):
    return False

# Check file extension to avoid claiming non-OPeNDAP URLs (e.g., remote Zarr stores)
_, ext = os.path.splitext(filename_or_obj.rstrip("/"))
# Pydap handles OPeNDAP endpoints, which typically have no extension or .nc/.nc4
# Reject URLs with non-OPeNDAP extensions like .zarr
return ext not in {".zarr", ".zip", ".tar", ".gz"}
Contributor Author

Not 100% sure on this. We could go further and require "dap" to be in the URL

Member

I don't think there's a standard extension for OpenDAP URLs. @Mikejmnez do you know?

Contributor Author

I checked with a co-worker on slack. He said:

There's no standard extension for DAP URLs. Explicitly excluding .zarr seems good enough for this disambiguation.

Contributor

@Mikejmnez Mikejmnez Oct 1, 2025

Interesting. Yes, there is no standard extension for opendap urls. OPeNDAP servers produce urls with the filename at the end, but for example NASA does something completely different. Excluding .zarr should be good.
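
For readers following along, here is a minimal sketch of how the extension-based exclusion behaves on a few URLs. The function name `looks_like_opendap_url` and the simple `http(s)` prefix test (a stand-in for xarray's `is_remote_uri`) are assumptions made for illustration; the excluded extensions are the ones shown in the diff under review.

```python
import os

# Extensions rejected in the diff under review.
NON_OPENDAP_EXTENSIONS = {".zarr", ".zip", ".tar", ".gz"}


def looks_like_opendap_url(url: str) -> bool:
    """Hypothetical helper: a remote URL that does not end in an excluded extension."""
    # Simplified stand-in for is_remote_uri (assumption: only http/https count).
    if not url.startswith(("http://", "https://")):
        return False
    # Strip a trailing slash so "store.zarr/" is treated like "store.zarr".
    _, ext = os.path.splitext(url.rstrip("/"))
    return ext not in NON_OPENDAP_EXTENSIONS


zarr_claimed = looks_like_opendap_url("https://example.com/store.zarr/")
bare_claimed = looks_like_opendap_url("https://example.com/dataset")
nc_claimed = looks_like_opendap_url("https://example.com/data.nc")
local_claimed = looks_like_opendap_url("/local/file.nc")
```

Under these assumptions, only the extension-less URL and the `.nc` URL are claimed; the Zarr store and the local path are not.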

What I am trying to push for is a "protocol-ization" of OPeNDAP via the URL scheme: "dap2://<file_url>" vs. "dap4://<file_url>". I already added this to the documentation a while back (dap2vdap4). Right now, if an OPeNDAP URL begins with http, it is assumed to be DAP2. This happens entirely on the client side and is not a server feature, but pydap and python-netcdf4 support it, and some NASA subsetting tools do as well. Perhaps this can help separate OPeNDAP URLs from non-OPeNDAP URLs.

Contributor

Well, actually, Thredds (TDS) does have this "standard" way to specify the protocol that may help to discern between opendap url vs non-opendap url: a TDS dap2 url will have a dodsC in its urls. A TDS dap4 url will have a dap4 in its url. (see here). However, an organization running an opendap server may decide how their own urls are exposed.
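
A rough sketch of the TDS convention described above, for illustration only. The function name `guess_tds_dap_protocol` and the example URLs are hypothetical, and because each organization decides how its URLs are exposed, this is a heuristic rather than a reliable test.

```python
from typing import Optional
from urllib.parse import urlsplit


def guess_tds_dap_protocol(url: str) -> Optional[str]:
    """Hypothetical heuristic: look for the TDS path segments "dodsC" or "dap4"."""
    segments = urlsplit(url).path.split("/")
    if "dodsC" in segments:
        return "dap2"
    if "dap4" in segments:
        return "dap4"
    # No TDS-style marker found; the URL may still be OPeNDAP.
    return None


# Hypothetical example URLs following the TDS layout.
dap2_guess = guess_tds_dap_protocol("https://thredds.example.org/thredds/dodsC/some/data.nc")
dap4_guess = guess_tds_dap_protocol("https://thredds.example.org/thredds/dap4/some/data.nc")
zarr_guess = guess_tds_dap_protocol("https://example.com/store.zarr")
```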

@shoyer
Member

shoyer commented Oct 1, 2025

Conceptually, I think there is an ambiguity about guess_can_open. Does it mean that a backend possibly or definitely can open a dataset?

Based on how it's used in open_dataset, I think we should pick defaults closer to "definitely" (which is what you do in this PR). It's a better user experience to require an explicit engine than to guess wrong and raise a less informative downstream error.

@ianhi
Contributor Author

ianhi commented Oct 1, 2025

I think there is an ambiguity about guess_can_open. Does it mean that a backend possibly or definitely can open a dataset?

Based on how it's used in open_dataset, I think we should pick defaults closer to "definitely" (which is what you do in this PR). It's a better user experience to require an explicit engine than to guess wrong and raise a less informative downstream error.

I agree. I can make a follow-up PR to the backends page to try to make this more explicit.

For the case of the DAP protocol, I skimmed the specification https://opendap.github.io/dap4-specification/DAP4.html and didn't see anything in particular that made me feel confident about searching the URL for dap or opendap. Even if those are likely common, I think they are too restrictive a requirement on a URL. But this would also be a great question for @dcherian when he is back from vacation.

But for this PR I feel pretty happy with where it is and would defer further dap improvements to later, modulo a small fix to lower case the extension to be slightly more robust that I'll push shortly

@ianhi
Contributor Author

ianhi commented Oct 1, 2025

modulo a small fix to lower case the extension to be slightly more robust that I'll push shortly

Actually, I'm not sure that's reasonable. Other backends do a similar check without thinking about case:

if isinstance(filename_or_obj, str | os.PathLike):
    _, ext = os.path.splitext(filename_or_obj)
    return ext in {".nc", ".nc4", ".cdf", ".gz"}

is that more correct?

@shoyer
Member

shoyer commented Oct 1, 2025

modulo a small fix to lower case the extension to be slightly more robust that I'll push shortly

Actually, I'm not sure that's reasonable. Other backends do a similar check without thinking about case:

if isinstance(filename_or_obj, str | os.PathLike):
    _, ext = os.path.splitext(filename_or_obj)
    return ext in {".nc", ".nc4", ".cdf", ".gz"}

is that more correct?

Really hard to say.

Honestly, I think even this list of extensions is pretty generous:

  • Scipy doesn't read netCDF v4 files, so .nc4 seems unlikely
  • Scipy can only read .gz files if they contain a netCDF v3 file. So maybe .nc.gz would be more appropriate than allowing any .gz file.
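
One way the two points above could translate into a stricter extension check is sketched below. This only illustrates the suggestion and is not necessarily what was merged; the function name `scipy_extension_guess` is hypothetical, and the check stays case-sensitive like the existing one.

```python
import os


def scipy_extension_guess(path: str) -> bool:
    """Hypothetical stricter check: netCDF3-style extensions, plus gzipped .nc only."""
    if path.endswith(".gz"):
        # Accept .gz only when the name underneath ends in .nc (i.e. ".nc.gz").
        inner = path[: -len(".gz")]
        return os.path.splitext(inner)[1] == ".nc"
    # .nc4 is deliberately left out, since scipy does not read netCDF v4 files.
    return os.path.splitext(path)[1] in {".nc", ".cdf"}


nc_ok = scipy_extension_guess("/path/to/file.nc")
nc_gz_ok = scipy_extension_guess("/path/to/file.nc.gz")
zarr_gz_ok = scipy_extension_guess("/path/to/file.zarr.gz")
nc4_ok = scipy_extension_guess("/path/to/file.nc4")
bare_gz_ok = scipy_extension_guess("/path/to/file.gz")
```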

@ianhi ianhi force-pushed the fix-netcdf4-remote-zarr-detection branch from b06a0a5 to 7ed1f0a Compare October 1, 2025 15:21
@Mikejmnez
Contributor

Mikejmnez commented Oct 1, 2025

For the case of the DAP protocol, I skimmed the specification https://opendap.github.io/dap4-specification/DAP4.html and didn't see anything in particular that made me feel confident about searching the URL for dap or opendap. Even if those are likely common, I think they are too restrictive a requirement on a URL. But this would also be a great question for @dcherian when he is back from vacation.

.das, .dds, and .dods (standard DAP2 extensions) and .dmr and .dap (standard DAP4 extensions) trigger downloads of either metadata or the entire file's binary data (.dods and .dap). URLs should not have those extensions.
These extensions are added by the backend (pydap or python-netcdf4), to create the python dataset objects. A url exposed by a server like THREDDS or Hyrax is likely not going to have this extension (it is possible only when a user manually creates it, which should not be entirely ruled out).

The only thing I can think of that is OPeNDAP-specific and part of the (DAP4) spec appears when a constraint expression is used in the URL. These can often be part of the user-provided URL. For example:

url = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4"  # full data
url_ce1 = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time[0:1:0];/Y[0:1:39];/X[0:1:39];/Eta[0:1:0][0:1:39][0:1:39]"
url_ce2 = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time=[0:1:0];/Y=[0:1:39];/X=[0:1:39];/time;/X;/Y;/Eta"

url_ce1 and url_ce2 use the two ways to subset via the URL, and as such dap4.ce= must appear in the query parameters. That is an exclusively OPeNDAP feature, implemented by any DAP4 server.
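
As a minimal sketch, the presence of a DAP4 constraint expression could be detected with the standard library alone. The helper name `has_dap4_constraint` is an assumption for illustration; as noted in the thread, a URL without `dap4.ce=` can still be a valid OPeNDAP URL, so a positive match is informative but a negative one is not.

```python
from urllib.parse import parse_qs, urlsplit


def has_dap4_constraint(url: str) -> bool:
    """Hypothetical helper: True if the query string carries a dap4.ce parameter."""
    query = urlsplit(url).query
    return "dap4.ce" in parse_qs(query)


full_url = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4"
ce_url = (
    "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4"
    "?dap4.ce=/time[0:1:0];/Y[0:1:39];/X[0:1:39];/Eta[0:1:0][0:1:39][0:1:39]"
)

full_has_ce = has_dap4_constraint(full_url)
ce_has_ce = has_dap4_constraint(ce_url)
other_query_has_ce = has_dap4_constraint("http://example.com/data.nc?var=temp")
```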

@ianhi
Contributor Author

ianhi commented Oct 2, 2025

dap4.ce= must appear in the query parameters. That is an exclusively OPeNDAP feature, implemented by any DAP4 server.

These can often be part of the user-provided URL.

Since it is "often" but not "always", would you recommend conditioning a true guess on these? It would be a significant increase in strictness over what is there now; the current changes are a medium increase. I'm not a DAP user, so I don't have a strong opinion.

@shoyer
Member

shoyer commented Oct 3, 2025 via email

@Mikejmnez
Contributor

Mikejmnez commented Oct 3, 2025

If the URL has either dap2 or dap4 as its scheme, it will 100% be an OPeNDAP URL. For example, dap4://opendap.earthdata.nasa.gov/collections/... and dap2://opendap.earthdata.... Both pydap and netcdf4-python understand this (as do other non-Python tools). In NASA land, there is a push for all tutorials to have dap4 URLs, for example. (These are not valid HTTP schemes, but a client-side, parseable way to specify the DAP protocol that somebody came up with some time ago.)
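
A small sketch of scheme-based detection along these lines, using only the standard library. The helper name `dap_protocol_from_scheme` and the `example.org` URLs are hypothetical. Plain `http`/`https` returns `None` here because, per the discussion, such URLs are ambiguous.

```python
from typing import Optional
from urllib.parse import urlsplit


def dap_protocol_from_scheme(url: str) -> Optional[str]:
    """Hypothetical helper: return "dap2" or "dap4" when the scheme says so."""
    # urlsplit already lower-cases the scheme; .lower() just makes that explicit.
    scheme = urlsplit(url).scheme.lower()
    return scheme if scheme in {"dap2", "dap4"} else None


dap4_result = dap_protocol_from_scheme("dap4://example.org/some/dataset")
upper_result = dap_protocol_from_scheme("DAP2://example.org/some/dataset")
http_result = dap_protocol_from_scheme("https://example.org/some/dataset")
```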

@ianhi
Contributor Author

ianhi commented Oct 3, 2025

Both pydap and netcdf4-python understand this

So the NETCDF4 backend should also say yes to urls with dap in them?

I tried out your URL with both the pydap and netcdf4 backends, but unfortunately ran into an error with both.

backend=netcdf4

OSError: [Errno -70] NetCDF: DAP server error: 'http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4'

backend=pydap

seems to work but then fails with

RetryError: HTTPConnectionPool(host='test.opendap.org', port=80): Max retries exceeded with url: /opendap/dap4/StaggeredGrid.nc4.dds (Caused by ResponseError('too many 503 error responses'))

So I wasn't able to verify this manually.

@shoyer
Member

shoyer commented Oct 3, 2025

So the NETCDF4 backend should also say yes to urls with dap in them?

I'll defer to @Mikejmnez, but my impression is that we should definitely be preferring pydap to netCDF4 for DAP. I think DAP support is optional in netCDF-C.

So I would lean towards not claiming DAP urls in the netcdf4 backend, or maybe just being sure we try pydap before netcdf4.

@Mikejmnez
Contributor

So the NETCDF4 backend should also say yes to urls with dap in them?

I tried out your URL with both the pydap and netcdf4 backends, but unfortunately ran into an error with both.

It looks like the test server was down overnight, and got restarted this morning. I tried it and it worked for me.

  • With pydap, all http, dap2, and dap4 work. http | https defaults to dap2.
  • With netcdf4 only http and dap4 work. dap2 does not, but http | https defaults to dap2.

My preference would be for any URL with a dap2 | dap4 scheme to be automatically assigned to pydap.

@dopplershift
Contributor

dopplershift commented Oct 3, 2025

@shoyer While it's possible to build netcdf-c without DAP support, it's almost always built with it enabled, and it is frequently used. For instance, the netcdf4-python wheels are built with DAP support turned on in netCDF-C, as are the packages on conda-forge. So, PLEASE leave netcdf4-python able to read DAP by default.

EDIT: My default environment has neither h5netcdf nor pydap, just netcdf4. I'd really like for that environment to not suddenly start breaking my existing code/examples.

@Mikejmnez
Contributor

And I am up for keeping the defaults as is, and definitely NOT break people's workflows. I think the issue at hand is how to identify dap urls.

@ianhi
Contributor Author

ianhi commented Oct 3, 2025

So I would lean towards not claiming DAP urls in the netcdf4 backend, or maybe just being sure we try pydap before netcdf4.

I had a thought the other day when working on a custom zarr backend: it would be nice to have a robust, cross-file-type system for expressing user preference for backend resolution order. It currently seems to be alphabetical, with a special case that reorders the three built-in netCDF backends. So the default guessing order will be:

  1. h5netcdf (from netcdf_engine_order[0])
  2. scipy (from netcdf_engine_order[1])
  3. netcdf4 (from netcdf_engine_order[2])
  4. pydap (alphabetically after netcdf4)
  5. store (alphabetically)
  6. zarr (alphabetically)

from the sorting here:

for be_name in OPTIONS["netcdf_engine_order"]:
    if backend_entrypoints.get(be_name) is not None:
        ordered_backends_entrypoints[be_name] = backend_entrypoints.pop(be_name)
ordered_backends_entrypoints.update(
    {name: backend_entrypoints[name] for name in sorted(backend_entrypoints)}
)
return ordered_backends_entrypoints

But if I add a backend for a subtype of Zarr, it will only come first because of alphabetical order; if I had a backend named something like z-zarr, it would not be used.
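
The ordering logic quoted above can be tried in isolation with a standalone sketch. Here the engine order is passed in as an argument instead of being read from `OPTIONS`, placeholder objects stand in for real entrypoints, and the third-party backend names `my_zarr` and `zarr_subtype` are hypothetical.

```python
def sort_backends(backend_entrypoints, netcdf_engine_order):
    """Standalone copy of the ordering logic: preferred netCDF engines, then the rest sorted."""
    remaining = dict(backend_entrypoints)  # copy, so the caller's dict is untouched
    ordered = {}
    for be_name in netcdf_engine_order:
        if remaining.get(be_name) is not None:
            ordered[be_name] = remaining.pop(be_name)
    # Everything else falls back to plain alphabetical order.
    ordered.update({name: remaining[name] for name in sorted(remaining)})
    return ordered


installed = {
    name: object()  # placeholder entrypoint
    for name in ["zarr", "pydap", "netcdf4", "store", "scipy", "h5netcdf", "zarr_subtype", "my_zarr"]
}
order = list(sort_backends(installed, ("h5netcdf", "scipy", "netcdf4")))
# order: h5netcdf, scipy, netcdf4, my_zarr, pydap, store, zarr, zarr_subtype
```

In this sketch, whether a custom Zarr-flavored backend is consulted before the built-in `zarr` backend depends purely on how its name sorts.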

So, PLEASE leave netcdf4-python able to read DAP by default.

I will make sure to not remove the current situation where netcdf4 can report as being able to a.

This is just about the automatic engine resolution, not changing anything for an explicit engine=... but that definitely could break things

@ianhi
Contributor Author

ianhi commented Oct 3, 2025

This is also related to this issue, I guess: #10810 (comment)

@shoyer
Member

shoyer commented Oct 3, 2025

Given that netcdf4 is certainly the most "flexible" of our backends (also supporting Zarr and DAP), I wonder if it was perhaps a mistake to move it later in precedence, and perhaps would be better to roll-back for now. This definitely seems to be causing some turmoil. Let's talk about that back in #10657.

It seems pretty clear that it's not possible to handle ordering of backends just for netCDF in isolation. Ideally, we might choose the preferred backend on a file-specific basis, but at the least, we should consider making the entire precedence order of backends customizable.

@ianhi
Contributor Author

ianhi commented Oct 3, 2025

I used Claude to iterate on a summary table of before and after this PR and came up with this description (honestly, the "after" part might be nice to put in the docs).

Backend Resolution Order Summary

Before PR

Problems:

  • netcdf4 claimed ALL remote URIs as its first check, preventing zarr and pydap from being auto-selected
  • h5netcdf and scipy claimed .nc/.nc4/.cdf remote URLs but couldn't handle query parameters (OPeNDAP constraint expressions)
  • scipy incorrectly claimed remote .nc/.nc4/.cdf/.gz URLs even though it can only handle local files
  • scipy incorrectly claimed .zarr.gz files (any .gz extension)
  • pydap was never auto-selected because netcdf4 claimed all remote URIs first

Remote URL Resolution Examples (Before)

First available backend from left to right is selected.

URL h5netcdf scipy netcdf4 pydap zarr
https://example.com/store.zarr
https://example.com/data.nc
http://example.com/data.nc?var=temp
http://test.opendap.org/dap4/file.nc4?dap4.ce=/time[0]
dap2://opendap.nasa.gov/dataset
https://example.com/DAP4/data
http://test.opendap.org/dap4/file.nc4
https://example.com/data.dap
https://example.com/dataset

Result:

  • URLs with .nc/.nc4/.cdf extensions: h5netcdf selected (first in order)
  • All other remote URLs: netcdf4 selected (came before pydap/zarr)
  • Remote Zarr stores, DAP URLs, and URLs with query params all incorrectly went to netcdf4

Local File Resolution Examples (Before)

First available backend from left to right is selected.

Files with readable magic numbers

File Path Magic Number h5netcdf scipy netcdf4 zarr
/path/to/file.nc CDF\x01 (netCDF3)
/path/to/file.nc4 \x89HDF\r\n\x1a\n (HDF5/netCDF4)
/path/to/file.nc.gz \x1f\x8b + CDF inside (gzipped netCDF3)

Files without readable magic numbers (extension-based)

File Path h5netcdf scipy netcdf4 zarr
/path/to/file.nc
/path/to/file.nc3
/path/to/file.nc4
/path/to/file.cdf
/path/to/file.zarr.gz
/path/to/store.zarr/

Result: Local file resolution was mostly correct, but scipy incorrectly claimed any .gz file (including .zarr.gz).

After PR

Backends are tried in order: h5netcdf → scipy → netcdf4 → pydap → zarr

The first backend that returns True from guess_can_open() is selected.
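
The first-match rule can be sketched as a small loop. The guess functions below are simplified, hypothetical stand-ins (not xarray's actual `guess_can_open` implementations), chosen only to show how a greedy early backend shadows a later one.

```python
from typing import Callable, Dict


def guess_engine(target: str, backends: Dict[str, Callable[[str], bool]]) -> str:
    """Return the name of the first backend whose guess function claims the target."""
    for name, guess_can_open in backends.items():
        if guess_can_open(target):
            return name
    raise ValueError(f"no backend claimed {target!r}; pass engine=... explicitly")


def is_remote(target: str) -> bool:
    return target.startswith(("http://", "https://"))


def greedy_remote(target: str) -> bool:
    return is_remote(target)  # claims every remote URL


def cautious_remote(target: str) -> bool:
    return is_remote(target) and not target.rstrip("/").endswith(".zarr")


def zarr_guess(target: str) -> bool:
    return target.rstrip("/").endswith(".zarr")


greedy_order = {"netcdf4": greedy_remote, "zarr": zarr_guess}
cautious_order = {"netcdf4": cautious_remote, "zarr": zarr_guess}

greedy_pick = guess_engine("https://example.com/store.zarr", greedy_order)
cautious_pick = guess_engine("https://example.com/store.zarr", cautious_order)
plain_pick = guess_engine("https://example.com/dataset", cautious_order)
```

With the greedy stand-in, the Zarr URL never reaches the `zarr` guess function; with the cautious one it does, and a target that nobody claims raises an error asking for an explicit engine.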

Remote URL Resolution Examples

First available backend from left to right is selected.

URL h5netcdf scipy netcdf4 pydap zarr
https://example.com/store.zarr
https://example.com/data.nc
http://example.com/data.nc?var=temp
http://test.opendap.org/dap4/file.nc4?dap4.ce=/time[0]
dap2://opendap.nasa.gov/dataset
https://example.com/DAP4/data
http://test.opendap.org/dap4/file.nc4
https://example.com/data.dap
https://example.com/dataset

Local File Resolution Examples

First available backend from left to right is selected.

Files with readable magic numbers

File Path Magic Number h5netcdf scipy netcdf4 zarr
/path/to/file.nc CDF\x01 (netCDF3)
/path/to/file.nc4 \x89HDF\r\n\x1a\n (HDF5/netCDF4)
/path/to/file.nc.gz \x1f\x8b + CDF inside (gzipped netCDF3)

Files without readable magic numbers (extension-based)

File Path h5netcdf scipy netcdf4 zarr
/path/to/file.nc
/path/to/file.nc3
/path/to/file.nc4
/path/to/file.cdf
/path/to/file.zarr.gz
/path/to/store.zarr/

@ianhi ianhi changed the title fix: be more cautious when claiming a backend can open a URL fix: be more cautious when guessing what a backend can open Oct 3, 2025
    return magic_number.startswith(b"\211HDF\r\n\032\n")

if isinstance(filename_or_obj, str | os.PathLike):
    _, ext = os.path.splitext(filename_or_obj)
Contributor Author

Intentionally not stripping any query params that might be present in a DAP query, so that h5netcdf does not claim to be able to open it, as it's my understanding that it cannot.
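
A short sketch of why leaving the query string in place has this effect: `os.path.splitext` treats everything after the last dot in the final path component as the extension, so a trailing query becomes part of it. The extension set and the name `extension_guess` are assumptions for illustration.

```python
import os

# Assumed set of netCDF-style extensions, for illustration only.
NETCDF_EXTENSIONS = {".nc", ".nc4", ".cdf"}


def extension_guess(filename: str) -> bool:
    """Hypothetical extension-only check that does not strip query parameters."""
    _, ext = os.path.splitext(filename)
    return ext in NETCDF_EXTENSIONS


plain_ext = os.path.splitext("https://example.com/data.nc")[1]
query_ext = os.path.splitext("http://example.com/data.nc?var=temp")[1]

plain_claimed = extension_guess("https://example.com/data.nc")
query_claimed = extension_guess("http://example.com/data.nc?var=temp")
```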

# Note: We intentionally do NOT check for .dap suffix as that would match
# file extensions like .dap which trigger downloads of binary data
url_lower = filename_or_obj.lower()
if url_lower.startswith(("dap2://", "dap4://")):
Contributor Author

@Mikejmnez is it ok that this will accept both DAP2:// and dap2://?

Contributor

Yes, I tried it and it works with pydap. The same goes for DAP4 vs. dap4.


Successfully merging this pull request may close these issues.

netcdf4 backend claims **all** remote files - preventing reading zarr
4 participants