
Allow for #mode=bytes in file URLs to support NetCDF byte ranges #1349

Closed
wants to merge 1 commit into pytroll:master from djhoese:feature-allow-byte-mode-urls

Conversation

@djhoese (Member) commented Sep 7, 2020

This takes advantage of recent updates to the NetCDF C library that allow a special #mode=bytes suffix to be appended to HTTP/HTTPS URLs for anonymous access to remote data stores. This includes S3 buckets that allow anonymous access and expose a public HTTP endpoint. Because the feature lives in the C library, it is available from the netcdf4-python library and therefore from xarray (netcdf4-python is the default open_dataset engine). For example, with NOAA GOES-16 ABI data:

from satpy import Scene

url = "https://noaa-goes16.s3.amazonaws.com/ABI-L1b-RadC/2019/001/00/OR_ABI-L1b-RadC-M3C14_G16_s20190010002187_e20190010004560_c20190010005009.nc#mode=bytes"
scn = Scene(reader='abi_l1b', filenames=[url])
scn.load(['C14'])
scn['C14']
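
Since the suffix is handled at the C-library level, the same URL can also be opened directly with xarray (a minimal sketch, assuming the netcdf4 engine is installed; "Rad" is the radiance variable in ABI L1b files):

import xarray as xr

ds = xr.open_dataset(url)  # the netcdf4 engine passes the #mode=bytes suffix through to the C library
print(ds['Rad'].shape)
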
  • Tests added
  • Passes flake8 satpy
  • Fully documented

I'm not sure where to document something like this to advertise it more. The quick start page? Maybe add a remote data access section? Or maybe we should have a whole separate page for remote data access at this point?

CC @raybellwaves and @carloshorn who have been working with similar features or features adjacent to this. Also @gerritholl who originally added the file system support to find_files_and_readers.

@djhoese self-assigned this Sep 7, 2020
@djhoese added the component:readers and enhancement (code enhancements, features, improvements) labels Sep 7, 2020
@coveralls

Coverage Status

Coverage increased (+0.002%) to 90.27% when pulling ea5697c on djhoese:feature-allow-byte-mode-urls into 477136b on pytroll:master.

codecov bot commented Sep 7, 2020

Codecov Report

Merging #1349 into master will increase coverage by 0.00%.
The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master    #1349   +/-   ##
=======================================
  Coverage   90.26%   90.27%           
=======================================
  Files         226      226           
  Lines       32718    32724    +6     
=======================================
+ Hits        29534    29540    +6     
  Misses       3184     3184           
Impacted Files                    Coverage Δ
satpy/readers/yaml_reader.py      95.08% <100.00%> (+<0.01%) ⬆️
satpy/tests/test_yaml_reader.py   99.83% <100.00%> (+<0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@@ -69,6 +69,9 @@ def _get_filebase(path, pattern):
     """Get the end of *path* of same length as *pattern*."""
     # convert any `/` on Windows to `\\`
     path = os.path.normpath(path)
+    # remove possible #mode=bytes URL suffix to support HTTP byte range
+    # requests for NetCDF
+    path = path.split('#')[0]
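
For a plain URL the split removes the suffix as intended (illustrative, with a made-up URL):

>>> "https://host/path/file.nc#mode=bytes".split('#')[0]
'https://host/path/file.nc'
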

@carloshorn commented:

Instead of splitting the string, I would consider using urllib (as someone else also mentioned on Slack):

from urllib.parse import urlparse

url = "https://mydomain.com/path/to/file.nc#mode=bytes"
url_components = urlparse(url)
url_components.path  # == "/path/to/file.nc"

because you never know whether the URL also contains other components, such as a query string (which adds a question mark) or other special characters.
However, I am still not convinced that the yaml reader should have the responsibility of extracting the filename from the file path. Maybe the yaml reader should expect objects with a .name attribute, so that all the logic of extracting the filename is the responsibility of that object, which, if we only look for a .name attribute, could be any duck type of a file object, a pathlib.Path, a yarl.URL, or ...
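
A case where the naive split and urlparse differ (illustrative, made-up URL): a query string survives the split but not the path extraction:

from urllib.parse import urlparse

url = "https://mydomain.com/path/to/file.nc?token=abc#mode=bytes"
url.split('#')[0]   # 'https://mydomain.com/path/to/file.nc?token=abc' -- query string remains
urlparse(url).path  # '/path/to/file.nc' -- clean path regardless of query or fragment
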

Member commented:

I agree, just using split is a bit risky here.

@djhoese (Member, Author) commented:

I like the idea of depending on .path, but the main concern with this was speed. This method is called a lot, and when given many filenames (like when using the MultiScene) it can add seconds of processing just to sort through the files. @gerritholl originally noticed this, and a small change to where filename parsing happened cut off a lot of time, iirc.

I'll try to look at it later.

@djhoese (Member, Author) commented:

So the function I modified is indeed called a lot (inside a tight for loop); however, the operation to get .path would only have to be done once per filename. If we force filenames to be converted to Path objects, or do some other "pre-processing", then I could see this .path approach working out pretty well.
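
A rough sketch of that one-time pre-processing pass (hypothetical code, not from the PR; filenames is a placeholder):

from urllib.parse import urlparse

def _to_matchable_path(f):
    # hypothetical helper: reduce a str/Path/URL to a plain path once,
    # before the tight matching loop (ignores Windows drive-letter paths,
    # where urlparse would misread "C:" as a URL scheme)
    return urlparse(str(f)).path if "://" in str(f) else str(f)

matchable_paths = [_to_matchable_path(f) for f in filenames]
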

@djhoese (Member, Author) commented:

@carloshorn I was playing around with this idea and ran into one issue with the Path objects that I couldn't figure out. For some of our file patterns we need more than just the filename: some readers encode the file type structure into directories, so we really need most of the path. So for pathlib.Path, .name won't work. I could have all filenames parsed with urllib, but it might be nice to duck type as you mentioned. Besides throwing in some if statements, any idea if there is a pathlib feature I'm missing here that could work for this? I suppose:

path = path_obj.path if hasattr(path_obj, 'path') else str(path_obj)

@djhoese (Member, Author) commented:

To be clear, I'm not against trying something that isn't filename based, but I have tried a lot of things in the past with my own projects (the ones that led to Satpy). Everything other than the filename-based approach seemed really hacky or didn't provide all of the features we currently have in Satpy. If we take away the functionality that lets Satpy take any file and find a reader for it, then we have some more options, but that is also useful functionality for some people.

@carloshorn commented:

@djhoese, please do not get me wrong, I am not against using the patterns; I just do not like that it constrains the data storage architecture. The current implementation only knows file systems, but imagine a situation where you are dealing with an object store, which is designed to quickly provide the relevant metadata and a way to access the data itself. And even if you store the files on a file system, you might have a database containing the relevant metadata. In this case the pattern matching might be pure overhead, because it is certainly faster to query the database for the metadata than to decompose a filename. For such use cases, it would be great to have an alternative way in.
For a comprehensive implementation, it would be good to separate the extraction of required metadata by pattern matching from the subsequent sorting, grouping, etc. operations, which I hope (and guess) can be done without adding too much overhead. When the metadata is already present, it should be fine to skip the extraction step.

Having the user provide this information is possible, but it is not something non-programmer users can do.

I know that Satpy aims to be a convenience product, and I personally appreciate that, but adding such an alternative metadata interface would not break any existing workflow. It only provides an interface for advanced use cases.

I hope to find some time these days to present an implementation proposal.
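
A rough sketch of what such a separation could look like (hypothetical interface, not the proposal mentioned above; trollsift is the pattern-matching library Satpy's yaml reader already uses, and files/my_pattern are placeholders):

from trollsift import parse

def collect_metadata(filenames, pattern, metadata=None):
    # hypothetical: skip pattern matching entirely when the caller already
    # has the metadata, e.g. from a database or object-store query
    if metadata is not None:
        return metadata
    return [parse(pattern, fname) for fname in filenames]

# the subsequent sorting/grouping operates on plain dicts either way
infos = collect_metadata(files, my_pattern)
infos.sort(key=lambda info: info["start_time"])  # assumes the pattern captures start_time
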

@djhoese (Member, Author) commented:

I like the idea (or the goal) of separating out the data source and getting the metadata from that data source. Your example of object storage is a little difficult because xarray and zarr are already able to read NetCDF-like data structures from remote object stores like S3, and they treat them like filenames: they load the initial metadata they need, then make additional requests to the remote storage when you ask for actual chunks of data. Something like a URL API, where a URL with query parameters lets data/metadata be retrieved or accessed, could be closer to what you are talking about. There are more formal standards that the Pangeo group is dealing with more and more, like STAC (https://stacspec.org/), that we could look into leveraging, but at that point we should be using the Intake library (a Pangeo project working towards better STAC support).

Maybe the main question is where these responsibilities land: who or what does the metadata extraction and data retrieval, and how does the user interact with Satpy in these more complex cases?
  • Does the user provide a file-like object with a .path property for sorting, plus other base metadata (.start_time, etc.)? (Sketched below.)
  • Does the user provide a "key" that Satpy readers should access from a provided "file system" (or "data store" or something)?
  • Should readers (specifically the current "file handler" concept) be agnostic to where the data came from (in general yes, but some may be difficult)?
  • Is the current file handler structure even flexible enough to handle this, or should it be generalized into something like a "data repository" that provides datasets, with Satpy having "dataset readers" instead of file handlers? This goes along with xarray's idea of .open_dataset("one_netcdf.nc") returning an xarray.Dataset object.
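
One way to picture that first option (a minimal sketch assuming a duck-typed interface; the class and attribute names are made up):

from datetime import datetime
from typing import Protocol

class DataSourceFile(Protocol):
    """Hypothetical duck type a reader could accept instead of a filename string."""
    path: str             # used for pattern-based matching and sorting
    start_time: datetime  # base metadata available without opening the file

    def open(self):
        """Return a file-like object for the actual read."""
        ...
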

🤯 I should maybe wait until later on a Monday morning before thinking about these things.

Collaborator commented:

Have we compared the performance of split vs. urllib?
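
A quick way to check (hypothetical micro-benchmark; no numbers were measured here):

import timeit
from urllib.parse import urlparse

url = "https://mydomain.com/path/to/file.nc#mode=bytes"
print(timeit.timeit(lambda: url.split('#')[0], number=100_000))
print(timeit.timeit(lambda: urlparse(url).path, number=100_000))
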

@djhoese (Member, Author) commented:

No, but given the results that @ghiggi found in https://github.com/ghiggi/goes_benchmarks, I don't think there is much of a point in supporting this. Plus, any support for this should probably be handled through FSFile-style usage. Maybe I'll just close this PR.
