-
-
Notifications
You must be signed in to change notification settings - Fork 1.2k
fix: be more cautious when guessing what a backend can open #10804
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great, thanks @ianhi !
xarray/backends/pydap_.py
Outdated
if not (isinstance(filename_or_obj, str) and is_remote_uri(filename_or_obj)): | ||
return False | ||
|
||
# Check file extension to avoid claiming non-OPeNDAP URLs (e.g., remote Zarr stores) | ||
_, ext = os.path.splitext(filename_or_obj.rstrip("/")) | ||
# Pydap handles OPeNDAP endpoints, which typically have no extension or .nc/.nc4 | ||
# Reject URLs with non-OPeNDAP extensions like .zarr | ||
return ext not in {".zarr", ".zip", ".tar", ".gz"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not 100% sure on this. We could go further and require "dap" to be in the URL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think there's a standard extension for OpenDAP URLs. @Mikejmnez do you know?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked with a co-worker on slack. He said:
There's no standard extension for DAP URLs. Explicitly excluding .zarr seems good enough for this disambiguation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Interesting. Yes, there is no standard extension for opendap urls. OPeNDAP servers produce urls with the filename at the end, but for example NASA does something completely different. Excluding .zarr
should be good.
What I am trying to push for this, is an opendap protocol-ization via the URL scheme. This is "dap2://<file_url>" vs "dap4://<file_url>". I already added it to the documentation back then dap2vdap4 Right now, if an opendap begins with http, then it is assumed to be dap2
. This is completely on the client side and not a server thing. But pydap
and python-netcdf4
support this, some NASA subsetting tools do this. Perhaps this may help separating opendap urls from non-opendap urls
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, actually, Thredds (TDS) does have this "standard" way to specify the protocol that may help to discern between opendap url vs non-opendap url: a TDS dap2 url will have a dodsC
in its urls. A TDS dap4 url will have a dap4
in its url. (see here). However, an organization running an opendap server may decide how their own urls are exposed.
Conceptually, I think there is an ambiguity about Based on how it's used in open_dataset, I think we should pick defaults closer to "definitely" (which is what you do in this PR). It's a better user experience to require an explicit |
I agree. I can make a follow up PR to the backends page to try make this more explicit. For the case of the dap protocol I skimmed the specification https://opendap.github.io/dap4-specification/DAP4.html and didn't see anything in particular that made me feel confident about searching the URL for But for this PR I feel pretty happy with where it is and would defer further dap improvements to later, modulo a small fix to lower case the extension to be slightly more robust that I'll push shortly |
Actually im not sure that's reasonable. Other backends do a similar check without thinking about case: xarray/xarray/backends/scipy_.py Lines 365 to 367 in eed12c4
is that more correct? |
Really hard to say. Honestly, I think even this list of extensions is pretty generous:
|
b06a0a5
to
7ed1f0a
Compare
The only think that I can think of, that would be opendap specific, and part of the spec (dap4), appears when using a constraint expression in the url. These can often be part of the user-provided URL. For example: url = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4" # full data
url_ce2 = "http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time[0:1:0];/Y[0:1:39];/X[0:1:39];/Eta[0:1:0][0:1:39][0:1:39]"
url_ce2 ="http://test.opendap.org/opendap/dap4/StaggeredGrid.nc4?dap4.ce=/time=[0:1:0];/Y=[0:1:39];/X=[0:1:39];/time;/X;/Y;/Eta"
|
Since it is often but not "always" what would you recommend conditioning guessing true on these? It would be a signfiicant increase in strictness over what is there now. The current changes are a medium increase. I'm not an dap user so I don't have a strong opinion. |
guess_can_open should be pretty conservative, something like 95% confident
that the given backend can open the given file. It is a way better user
experience to ask them to explicitly supply an engine than to guess wrong
and get a mysterious error message. So pydap should require something
explict to identify DAP, rather than guessing it can open any HTTP url.
…On Thu, Oct 2, 2025 at 11:13 AM Ian Hunt-Isaak ***@***.***> wrote:
*ianhi* left a comment (pydata/xarray#10804)
<#10804 (comment)>
dap4.ce= must appear in the query parameter. That is a exclusively opendap
that is implemented by any dap4 server.
hese can often be part of the user-provided URL.
Since it is often but not "always" what would you recommend conditioning
guessing true on these? It would be a signfiicant increase in strictness
over what is there now. The current changes are a medium increase. I'm not
an dap user so I don't have a strong opinion.
—
Reply to this email directly, view it on GitHub
<#10804 (comment)>, or
unsubscribe
<https://github.com/notifications/unsubscribe-auth/AAJJFVRUMOHILDO2VR5JWMD3VVTMXAVCNFSM6AAAAACH5N2T7SVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGNRSGQYTOMJVGY>
.
You are receiving this because you commented.Message ID:
***@***.***>
|
If the url has either |
So the NETCDF4 backend should also say yes to urls with dap in them? I tried our your URL with both pydap and netcdf4 backends, but ran into an error for both, unfortunately
seems to work but then fails with
So i wasn't able to manually verify |
I'll defer to @Mikejmnez, but my impression is that we should definitely be preferring pydap to netCDF4 for DAP. I think DAP support is optional in netCDF-C. So I would lean towards not claiming DAP urls in the netcdf4 backend, or maybe just being sure we try pydap before netcdf4. |
It looks like the test server was down overnight, and got restarted this morning. I tried it and it worked for me.
My preference, would be to have any |
@shoyer While it's possible to build netcdf-c without DAP support, it's almost always built on and is frequently used. For instance, the netcdf4-python wheels are built with DAP support turned on in netCDF-C, as are the pacakges on conda-forge. So, PLEASE leave netcdf4-python able to read DAP by default. EDIT: My default environment has neither h5netcdf nor pydap, just netcdf4. I'd really like for that environment to not suddenly start breaking my existing code/examples. |
And I am up for keeping the defaults as is, and definitely NOT break people's workflows. I think the issue at hand is how to identify dap urls. |
I had a thought the other day when working on a custom zarr backend, that it would be nice to have a robust and cross-file-type system for expressing user preference for backend resolution order. It currently seems to be alphabetical, but with a special case of reordering the 3 built in netcdf backend. So the default guessing order will be:
from the sorting here: xarray/xarray/backends/plugins.py Lines 94 to 100 in eed12c4
but if i add a zarr backend for a subtype of zarr it will only come first due to alphbetical order, but if i had a backend named like
I will make sure to not remove the current situation where netcdf4 can report as being able to a. This is just about the automatic engine resolution, not changing anything for an explicit |
This is also related to this issue i guess: #10810 (comment) |
Given that netcdf4 is certainly the most "flexible" of our backends (also supporting Zarr and DAP), I wonder if it was perhaps a mistake to move it later in precedence, and perhaps would be better to roll-back for now. This definitely seems to be causing some turmoil. Let's talk about that back in #10657. It seems pretty clear that it's not possible to handle ordering of backends just for netCDF in isolation. Ideally, we might choose the preferred backend on a file-specific basis, but at the least, we should consider making the entire precedence order of backends customizable. |
I used to claude to iterate on a summary table of before and after this PR and came up with this description of before and after (which honestly the after might be nice to put in the docs) Backend Resolution Order SummaryBefore PRProblems:
Remote URL Resolution Examples (Before)First available backend from left to right is selected.
Result:
Local File Resolution Examples (Before)First available backend from left to right is selected. Files with readable magic numbers
Files without readable magic numbers (extension-based)
Result: Local file resolution was mostly correct, but scipy incorrectly claimed any After PRBackends are tried in order: h5netcdf → scipy → netcdf4 → pydap → zarr The first backend that returns Remote URL Resolution ExamplesFirst available backend from left to right is selected.
Local File Resolution ExamplesFirst available backend from left to right is selected. Files with readable magic numbers
Files without readable magic numbers (extension-based)
|
return magic_number.startswith(b"\211HDF\r\n\032\n") | ||
|
||
if isinstance(filename_or_obj, str | os.PathLike): | ||
_, ext = os.path.splitext(filename_or_obj) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Intentionally not stripping any query params that might be present in dap query so that h5netcdf does not claim to be able to open it, as it's my undersstanding that it cannot
# Note: We intentionally do NOT check for .dap suffix as that would match | ||
# file extensions like .dap which trigger downloads of binary data | ||
url_lower = filename_or_obj.lower() | ||
if url_lower.startswith(("dap2://", "dap4://")): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Mikejmnez is it ok that this will accept both DAP2://
and dap2://
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I tried it and it works with pydap. same for DAP4 v dap4
Previously pydap or netcdf if installed would grab any remote URL according the order of backend resolution.
whats-new.rst
api.rst