
Allow for #mode=bytes in file URLs to support NetCDF byte ranges #1349

Closed
wants to merge 1 commit into pytroll:master from djhoese:feature-allow-byte-mode-urls

Conversation

@djhoese (Member) commented Sep 7, 2020

This takes advantage of recent updates to the NetCDF C library that allow a special #mode=bytes suffix to be appended to HTTP/HTTPS URLs for anonymous access to remote data stores. This includes S3 buckets that allow anonymous access and expose a public HTTP endpoint. Because the feature lives in the C library, it is available from the netcdf4-python library and therefore from xarray (netcdf4-python is the default open_dataset engine). For example, with NOAA GOES-16 ABI data:

from satpy import Scene

url = "https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadC/2019/001/00/OR_ABI-L1b-RadC-M3C14_G16_s20190010002187_e20190010004560_c20190010005009.nc#mode=bytes"
scn = Scene(reader='abi_l1b', filenames=[url])
scn.load(['C14'])
scn['C14']
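
Since the suffix is handled at the C-library level, the same URL can also be opened directly with xarray (a minimal sketch, assuming the netcdf4 engine is installed; "Rad" is the radiance variable in ABI L1b files):

import xarray as xr

ds = xr.open_dataset(url)  # the netcdf4 engine passes the #mode=bytes suffix through to the C library
print(ds['Rad'].shape)
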
  • Tests added
  • Passes flake8 satpy
  • Fully documented

I'm not sure where to document something like this to advertise it more. The quick start page? Maybe add a remote data access section? Or maybe we should have a whole separate page for remote data access at this point?

CC @raybellwaves and @carloshorn who have been working with similar features or features adjacent to this. Also @gerritholl who originally added the file system support to find_files_and_readers.

@djhoese self-assigned this Sep 7, 2020
@djhoese added the component:readers and enhancement (code enhancements, features, improvements) labels Sep 7, 2020
@coveralls

Coverage Status

Coverage increased (+0.002%) to 90.27% when pulling ea5697c on djhoese:feature-allow-byte-mode-urls into 477136b on pytroll:master.

codecov bot commented Sep 7, 2020

Codecov Report

Merging #1349 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1349   +/-   ##
=======================================
  Coverage   90.26%   90.27%           
=======================================
  Files         226      226           
  Lines       32718    32724    +6     
=======================================
+ Hits        29534    29540    +6     
  Misses       3184     3184           
Impacted Files                    Coverage Δ
satpy/readers/yaml_reader.py      95.08% <100.00%> (+<0.01%) ⬆️
satpy/tests/test_yaml_reader.py   99.83% <100.00%> (+<0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -69,6 +69,9 @@ def _get_filebase(path, pattern):
     """Get the end of *path* of same length as *pattern*."""
     # convert any `/` on Windows to `\\`
     path = os.path.normpath(path)
+    # remove possible #mode=bytes URL suffix to support HTTP byte range
+    # requests for NetCDF
+    path = path.split('#')[0]
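
For a plain URL the split removes the suffix as intended (illustrative, with a made-up URL):

>>> "https://host/path/file.nc#mode=bytes".split('#')[0]
'https://host/path/file.nc'
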

@carloshorn commented:

Instead of splitting the string, I would consider using urllib (as someone else also mentioned on Slack):

from urllib.parse import urlparse

url = "https://mydomain.com/path/to/file.nc#mode=bytes"
url_components = urlparse(url)
url_components.path  # == "/path/to/file.nc"

because you never know whether the URL also contains other components, such as a query string (which adds a question mark) or other special characters.
However, I am still not convinced that the yaml reader should have the responsibility of extracting the filename from the file path. Maybe the yaml reader should expect objects with a .name attribute, so that all the logic of extracting the filename is the responsibility of that object, which, if we only look for a .name attribute, could be any duck type of a file object, a pathlib.Path, a yarl.URL, or ...
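
A case where the naive split and urlparse differ (illustrative, made-up URL): a query string survives the split but not the path extraction:

from urllib.parse import urlparse

url = "https://mydomain.com/path/to/file.nc?token=abc#mode=bytes"
url.split('#')[0]   # 'https://mydomain.com/path/to/file.nc?token=abc' -- query string remains
urlparse(url).path  # '/path/to/file.nc' -- clean path regardless of query or fragment
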

Member commented:

I agree, just using split is a bit risky here.

@djhoese (Member, Author) commented:

I like the idea of depending on .path, but the main concern with this was speed. This method is called a lot, and when given many filenames (like when using the MultiScene) it can add seconds of processing just to sort through the files. @gerritholl originally noticed this, and a small change to where filename parsing happened cut off a lot of time, iirc.

I'll try to look at it later.

@djhoese (Member, Author) commented:

So the function I modified is indeed called a lot (inside a tight for loop); however, the operation to get .path would only have to be done once per filename. If we force filenames to be converted to Path objects, or do some other "pre-processing", then I could see this .path approach working out pretty well.
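
A rough sketch of that one-time pre-processing pass (hypothetical code, not from the PR; filenames is a placeholder):

from urllib.parse import urlparse

def _to_matchable_path(f):
    # hypothetical helper: reduce a str/Path/URL to a plain path once,
    # before the tight matching loop (ignores Windows drive-letter paths,
    # where urlparse would misread "C:" as a URL scheme)
    return urlparse(str(f)).path if "://" in str(f) else str(f)

matchable_paths = [_to_matchable_path(f) for f in filenames]
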

@djhoese (Member, Author) commented:

@carloshorn I was playing around with this idea and ran into one issue with the Path objects that I couldn't figure out. For some of our file patterns we need more than just the filename: some readers encode the file type structure into directories, so we really need most of the path. So for pathlib.Path, .name won't work. I could have all filenames parsed with urllib, but it might be nice to duck type as you mentioned. Besides throwing in some if statements, any idea if there is a pathlib feature I'm missing here that could work for this? I suppose:

path = path_obj.path if hasattr(path_obj, 'path') else str(path_obj)

@djhoese (Member, Author) commented:

To be clear, I'm not against trying something that isn't filename based, but I have tried a lot of things in the past with my own projects (the ones that led to Satpy). Everything other than the filename-based approach seemed really hacky or didn't provide all of the features we currently have in Satpy. If we take away the functionality that lets Satpy take any file and find a reader for it, then we have some more options, but that is also useful functionality for some people.

@carloshorn commented:

@djhoese, please do not get me wrong, I am not against using the patterns; I just do not like that it constrains the data storage architecture. The current implementation only knows file systems, but imagine a situation where you are dealing with an object store, which is designed to quickly provide the relevant metadata and a way to access the data itself. And even if you store the files on a file system, you might have a database containing the relevant metadata. In this case the pattern matching might be pure overhead, because it is certainly faster to query the database for the metadata than to decompose a filename. For such use cases, it would be great to have an alternative way in.
For a comprehensive implementation, it would be good to separate the extraction of required metadata by pattern matching from the subsequent sorting, grouping, etc. operations, which I hope (and guess) can be done without adding too much overhead. When the metadata is already present, it should be fine to skip the extraction step.

Having the user provide this information is possible, but it is not something non-programmer users can do.

I know that Satpy aims to be a convenience product, and I personally appreciate that, but adding such an alternative metadata interface would not break any existing workflow. It only provides an interface for advanced use cases.

I hope to find some time these days to present an implementation proposal.
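
A rough sketch of what such a separation could look like (hypothetical interface, not the proposal mentioned above; trollsift is the pattern-matching library Satpy's yaml reader already uses, and files/my_pattern are placeholders):

from trollsift import parse

def collect_metadata(filenames, pattern, metadata=None):
    # hypothetical: skip pattern matching entirely when the caller already
    # has the metadata, e.g. from a database or object-store query
    if metadata is not None:
        return metadata
    return [parse(pattern, fname) for fname in filenames]

# the subsequent sorting/grouping operates on plain dicts either way
infos = collect_metadata(files, my_pattern)
infos.sort(key=lambda info: info["start_time"])  # assumes the pattern captures start_time
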

@djhoese (Member, Author) commented:

I like the idea (or the goal) of separating out the data source and getting the metadata from that data source. Your example of object storage is a little difficult because xarray and zarr are already able to read NetCDF-like data structures from remote object stores like S3, and they treat them like filenames: they load the initial metadata they need, then make additional requests to the remote storage when you ask for actual chunks of data. Something like a URL API, where a URL with query parameters lets data/metadata be retrieved or accessed, could be closer to what you are talking about. There are more formal standards that the Pangeo group is dealing with more and more, like STAC (https://stacspec.org/), that we could look into leveraging, but at that point we should be using the Intake library (a Pangeo project working towards better STAC support).

Maybe the main question is where these responsibilities land: who or what does the metadata extraction and data retrieval, and how does the user interact with Satpy in these more complex cases?
  • Does the user provide a file-like object with a .path property for sorting, plus other base metadata (.start_time, etc.)? (Sketched below.)
  • Does the user provide a "key" that Satpy readers should access from a provided "file system" (or "data store" or something)?
  • Should readers (specifically the current "file handler" concept) be agnostic to where the data came from (in general yes, but some may be difficult)?
  • Is the current file handler structure even flexible enough to handle this, or should it be generalized into something like a "data repository" that provides datasets, with Satpy having "dataset readers" instead of file handlers? This goes along with xarray's idea of .open_dataset("one_netcdf.nc") returning an xarray.Dataset object.
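
One way to picture that first option (a minimal sketch assuming a duck-typed interface; the class and attribute names are made up):

from datetime import datetime
from typing import Protocol

class DataSourceFile(Protocol):
    """Hypothetical duck type a reader could accept instead of a filename string."""
    path: str             # used for pattern-based matching and sorting
    start_time: datetime  # base metadata available without opening the file

    def open(self):
        """Return a file-like object for the actual read."""
        ...
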

🤯 I should maybe wait until later on a Monday morning before thinking about these things.

Collaborator commented:

Have we compared the performance of split vs. urllib?
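
A quick way to check (hypothetical micro-benchmark; no numbers were measured here):

import timeit
from urllib.parse import urlparse

url = "https://mydomain.com/path/to/file.nc#mode=bytes"
print(timeit.timeit(lambda: url.split('#')[0], number=100_000))
print(timeit.timeit(lambda: urlparse(url).path, number=100_000))
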

@djhoese (Member, Author) commented:

No, but given the results that @ghiggi found in https://github.com/ghiggi/goes_benchmarks, I don't think there is much of a point in supporting this. Plus, any support for this should probably be handled through FSFile-style usage. Maybe I'll just close this PR.
