enable loading remote hdf5 files #2782

scottyhq · 2019-02-20T21:51:02Z

Enable loading remote hdf5 files. Will require h5py>=2.9.0 and some changes to https://github.com/shoyer/h5netcdf. I've currently just made a quick hack change to backends/api.py, so further tests are needed. Pinging @jhamman, @mrocklin, and @rabernat for thoughts on this.

Here is a short notebook demonstrating how this works:
https://gist.github.com/scottyhq/790bf19c7811b5c6243ce37aae252ca1

Closes #2781 (enable reading of file-like HDF5 objects)
Tests added
Fully documented, including whats-new.rst for all changes and api.rst for new API

mrocklin · 2019-02-20T21:54:15Z

I'm glad to see this. I'll also be curious to see what the performance will look like.

cc @llllllllll

shoyer · 2019-02-22T03:27:56Z

This looks great!

I'll note one minor extension: you could look at the first few bytes of the file (the "magic number") to determine if it's a netCDF3 or netCDF4 file, and hence whether it can be opened with scipy or h5netcdf:

CDF\001 or CDF\002 would indicate netCDF3 (use scipy)
\211HDF\r\n\032\n would indicate netCDF4/HDF5 (use h5netcdf)
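The magic-number check described above can be sketched roughly as follows. This is a minimal illustration, not the code that ended up in xarray: the helper name `guess_engine` and the constant names are hypothetical, and only the byte signatures and engine names come from the comment.

```python
import io

# Signatures quoted in the comment above, written with hex escapes:
# 'CDF\001' / 'CDF\002' for netCDF3 and '\211HDF\r\n\032\n' for HDF5.
NETCDF3_MAGIC = (b'CDF\x01', b'CDF\x02')
HDF5_MAGIC = b'\x89HDF\r\n\x1a\n'


def guess_engine(fileobj):
    """Hypothetical helper: pick a backend name from the first 8 bytes."""
    start = fileobj.tell()
    header = fileobj.read(8)
    # Rewind so the backend later reads from the original position.
    fileobj.seek(start)
    if header.startswith(NETCDF3_MAGIC):
        return 'scipy'
    if header.startswith(HDF5_MAGIC):
        return 'h5netcdf'
    raise ValueError('not a netCDF3 or netCDF4/HDF5 file')
```

Rewinding after the read matters because the backend that opens the object afterwards expects to start at the beginning of the file.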

scottyhq · 2019-03-05T06:27:51Z

@shoyer, it would be great to have your feedback on these recent changes now that h5netcdf 0.7 is out. There's a bit more logic required in api.py now that scipy isn't the only backend that is able to read file-like objects (and people may not specify engine= when opening datasets).

test_backends.py passes locally for me except for TestValidateAttrs.test_validating_attrs... not sure why.

Also, per your comment here: h5netcdf/h5netcdf#51 (comment), I think it would be great to get a few small netcdf4/hdf test files in https://github.com/pydata/xarray-data.

shoyer

I have some very minor suggestions but generally this looks good to me.

shoyer · 2019-03-05T06:53:55Z

xarray/backends/api.py

+                                               lock=lock, **backend_kwargs)
+            else:
+                raise ValueError("byte header doesn't match netCDF3 or "
+                                 "netCDF4/HDF5: {}".format(magic_number))


I suspect this is one of those rare cases where it's best not to report all the details -- most users probably don't know about magic numbers. Maybe something like:

"file-like object is not a netCDF file: {}".format(filename_or_obj), or

"bytes do not represent in-memory netCDF file: {}. (Pass a string or pathlib.Path object to read a filename from disk.)".format(filename_or_obj[:80] + (b'...' if len(filename_or_obj) > 80 else b''))

scottyhq · Mar 6, 2019

Went with the first, more concise option.

shoyer · 2019-03-05T06:56:05Z

xarray/tests/test_backends.py

@@ -1955,6 +1955,38 @@ def test_dump_encodings_h5py(self):
            assert actual.x.encoding['compression_opts'] is None


+# Requires h5py>2.9.0


Can you add a pytest.mark.skipif based on the version number? (The test on Travis-CI is failing on Python 3.5 because it has an old version of h5py installed)

scottyhq · Mar 6, 2019

I think I did this correctly (added some lines to tests/__init__.py).

shoyer · 2019-03-05T06:57:48Z

xarray/tests/test_backends.py

+            ds['scalar'] = v
+        bio.seek(0)
+        with xr.open_dataset(bio) as ds:
+            v = ds['scalar']


prefer using assert_identical and comparing to another expected dataset object.

scottyhq · Mar 6, 2019

I've changed that test to use assert_identical, and am using a with raises_regex() block to make sure the new exceptions are raised.

shoyer · 2019-03-05T06:58:53Z

xarray/tests/test_backends.py

+    def test_h5bytes(self):
+        import h5py
+        bio = BytesIO()
+        with h5py.File(bio) as ds:


Wouldn't it be nice if we supported writing to file-like objects, too? :)

(But don't do that now, this PR is a nice logical size already.)

scottyhq · Mar 6, 2019

Agreed. Hopefully someone else could pick that up!

pep8speaks · 2019-03-06T01:46:54Z

Hello @scottyhq! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2019-03-15 23:35:28 UTC


shoyer · 2019-03-06T02:17:59Z

xarray/backends/api.py

+            else:
+                print(magic_number)
+                raise ValueError("file-like object is not a netCDF file: {}"
+                                 .format(filename_or_obj))


I'm a little concerned about giving users an error message about file-like objects if they passed in a bytes object, e.g., consider xarray.open_dataset(b'garbage').

Ideally this should give a useful error message, something like: ValueError: b'garbage' is not a valid netCDF file, did you mean to pass a string for a path instead?, not ValueError: file-like object is not a netCDF file: <_io.BytesIO at 0x105663888>.
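One way to produce the two different messages discussed above is to branch on the input type before formatting. The sketch below is illustrative only; the function name `not_netcdf_error` is hypothetical, and the wording of the messages is borrowed from the review comments rather than from the merged code.

```python
import io


def not_netcdf_error(filename_or_obj):
    """Hypothetical helper: build a ValueError suited to the input type."""
    if isinstance(filename_or_obj, bytes):
        # Show at most 80 bytes so a large in-memory file does not flood the message.
        shown = filename_or_obj[:80] + (
            b'...' if len(filename_or_obj) > 80 else b'')
        return ValueError(
            '{!r} is not a valid netCDF file, did you mean to pass a '
            'string for a path instead?'.format(shown))
    return ValueError(
        'file-like object is not a netCDF file: {!r}'.format(filename_or_obj))
```

With this split, `not_netcdf_error(b'garbage')` mentions the bytes themselves, while a `BytesIO` input still gets the file-like wording.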


shoyer · 2019-03-06T02:27:49Z

I don't think it's essential to have an integration test doing real network access in xarray, so I would consider just dropping that part instead.

On Tue, Mar 5, 2019 at 6:10 PM Ryan Abernathey wrote:

In xarray/tests/test_backends.py:

@@ -1955,6 +1955,39 @@ def test_dump_encodings_h5py(self):
            assert actual.x.encoding['compression_opts'] is None

***@***.***_h5fileobj
+class TestH5NetCDFFileObject(TestH5NetCDFData):
+    h5py = pytest.importorskip('h5py', minversion='2.9.0')
+    engine = 'h5netcdf'
+
+    @network
+    def test_h5remote(self):
+        # alternative: http://era5-pds.s3.amazonaws.com/2008/01/main.nc
+        import requests
+        url = ('https://www.unidata.ucar.edu/'
+               'software/netcdf/examples/test_hgroups.nc')

Rather than going over the network, it might be quite easy to fire up a http.server.SimpleHTTPRequestHandler (https://docs.python.org/3/library/http.server.html#http.server.SimpleHTTPRequestHandler) as part of a fixture. This would allow us to test the remote capability without internet (and without depending on a third party to host a file.)
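The local-server idea from the quoted review could be sketched with the standard library alone, as below. This is an assumption-laden illustration rather than anything from the PR: `serve_directory` and `demo_fetch` are hypothetical names, the `directory` argument of SimpleHTTPRequestHandler requires Python 3.7 or later, and the served file is just eight placeholder bytes, not a real netCDF file.

```python
import contextlib
import functools
import http.server
import os
import tempfile
import threading
import urllib.request


@contextlib.contextmanager
def serve_directory(directory):
    """Serve `directory` over HTTP on a free loopback port (hypothetical helper)."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory)
    server = http.server.HTTPServer(('127.0.0.1', 0), handler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        yield 'http://127.0.0.1:{}'.format(server.server_address[1])
    finally:
        server.shutdown()
        server.server_close()
        thread.join()


def demo_fetch():
    """Write a small file, serve it locally, and download it again."""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, 'example.bin'), 'wb') as f:
            f.write(b'\x89HDF\r\n\x1a\n')
        # Bypass any system proxy so the loopback request stays local.
        opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
        with serve_directory(tmp) as base_url:
            with opener.open(base_url + '/example.bin') as response:
                return response.read()
```

In a pytest suite the context manager could be wrapped in a fixture, which would exercise the remote code path without depending on a third-party host.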

scottyhq · 2019-03-08T01:13:16Z

Thanks for the input @shoyer. I attempted to tidy up a bit and in the process reordered some things, such as adding an 'engine' check at the top of open_dataset(). Backend tests are passing locally on my machine. Hopefully I didn't add too much here or overstep!

jhamman · 2019-03-08T06:58:20Z

@scottyhq - can you add a note to the what's new page?

From what I can tell, I don't think the failing tests are related to this PR.

shoyer

I like the look of this, thanks for refactoring this logic!

shoyer · 2019-03-14T05:46:50Z

xarray/backends/api.py

+    elif magic_number.startswith(b'\211HDF\r\n\032\n'):
+        engine = 'h5netcdf'
+        if isinstance(filename_or_obj, bytes):
+            raise ValueError("can't open netCDF4/HDF5 as bytes "


Just a note: we could support this in the future, by wrapping bytes in an io.BytesIO object (like we do for the scipy backend). But no need to add it now -- I like explicitly providing file objects.
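For reference, the wrapping mentioned above amounts to very little code. The following is a minimal sketch with a hypothetical helper name, `as_file_object`; it is not part of this PR, which deliberately raises an error for netCDF4/HDF5 bytes instead.

```python
import io


def as_file_object(filename_or_obj):
    """Hypothetical helper: wrap raw bytes so callers always see a file-like object."""
    if isinstance(filename_or_obj, bytes):
        return io.BytesIO(filename_or_obj)
    # Paths and existing file objects pass through unchanged.
    return filename_or_obj
```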

shoyer · 2019-03-15T01:33:02Z

xarray/tests/test_backends.py

+        with raises_regex(ValueError, 'read/write pointer not at zero'):
+            with create_tmp_file() as tmp_file:
+                expected.to_netcdf(tmp_file, engine='h5netcdf')
+                f = open(tmp_file, 'rb')


There is a real test failure on Windows (see the Appveyor CI results), likely because this file never gets closed. You should use a context manager here instead.

scottyhq · Mar 15, 2019

That test was for the case where the file isn't closed before reopening, but it looks like on Windows the error is different from Linux (PermissionError versus ValueError), so I added a check for the Windows error: PermissionError: [WinError 32] The process cannot access the file because it is being used by another process
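The pattern under discussion, checking that a file object's pointer is at zero while still guaranteeing the file is closed, can be sketched as follows. The names `check_pointer_at_start` and `demo` are hypothetical; only the 'read/write pointer not at zero' wording is taken from the test shown above.

```python
import os
import tempfile


def check_pointer_at_start(fileobj):
    """Hypothetical check: refuse file objects that were already partly read."""
    if fileobj.tell() != 0:
        raise ValueError(
            'file-like object read/write pointer not at zero; '
            'call seek(0) first')


def demo():
    """Trigger the error once, then rewind and pass the check."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, 'data.bin')
        with open(path, 'wb') as f:
            f.write(b'\x89HDF\r\n\x1a\n payload')
        # The with block closes the file even if an exception escapes,
        # which avoids the lingering handle that upsets Windows.
        with open(path, 'rb') as f:
            f.read(8)
            try:
                check_pointer_at_start(f)
            except ValueError as err:
                message = str(err)
            f.seek(0)
            check_pointer_at_start(f)
    return message
```

Because every handle is closed before the temporary directory is removed, the same code should behave identically on Windows and Linux.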


jhamman · 2019-03-16T00:13:18Z

I think we're good here. I made one minor tweak to the windows fix @scottyhq implemented. I plan to merge this on Monday if I don't hear any objections.

shoyer · 2019-03-16T00:36:12Z

thanks @scottyhq !


attempt at loading remote hdf5

08aba0b

scottyhq mentioned this pull request Feb 20, 2019

enable loading h5py file like objects h5netcdf/h5netcdf#51

Merged

added a couple tests

8ec34a6

scottyhq mentioned this pull request Feb 27, 2019

trouble loading netcdf4 files with xarray on s3 fsspec/s3fs#168

Closed

scottyhq added 2 commits March 4, 2019 21:02

Merge remote-tracking branch 'upstream/master' into fileobj

b88b06e

rewind bytes after reading header

48b23b6

shoyer reviewed Mar 5, 2019


addressed comments for tests and error message

4a7e560

fixed pep8 formatting

2aa7349

rabernat reviewed Mar 6, 2019


shoyer reviewed Mar 6, 2019


created _get_engine_from_magic_number function, new tests

1a4c4f3

added description in whats-new

94a3afe

shoyer approved these changes Mar 14, 2019


Merge branch 'master' into fileobj

7e82959

shoyer reviewed Mar 15, 2019


scottyhq and others added 3 commits March 15, 2019 12:41

fixed test failure on windows

c067fa0

Merge branch 'fileobj' of https://github.com/scottyhq/xarray into fileobj

c99e8a6

same error on windows and nix

73c022e

shoyer merged commit 225868d into pydata:master Mar 16, 2019

spencerahill mentioned this pull request Mar 30, 2019

Failing tests in CI, but for some builds still come back as green spencerahill/aospy#319

Closed

scottyhq mentioned this pull request Apr 4, 2019

tiledb pangeo-data/pangeo#120

Closed

zbruick mentioned this pull request Oct 15, 2019

Issues with xarray v0.14 Unidata/MetPy#1203

Closed

