Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: correctly pad netCDF files with null bytes #8170

Merged
merged 3 commits into from
Dec 11, 2017

Conversation

shoyer
Copy link
Contributor

@shoyer shoyer commented Nov 18, 2017

Per the netCDF spec, variable names and attributes should be padded with null
padding to the nearest 4-byte boundary, but scipy has been padding with b'0'
instead:
http://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html

This surfaced recently after the netCDF-C library added more stringent checks
in version 4.5.0 which rendered it unable to read files written with SciPy:
Unidata/netcdf-c#657

This change to netCDF-C is likely going to be rolled back, but it's still a
good idea for us to write compliant files. Apparently there are other netCDF
implementations that also insist on padding with nulls.

Per the netCDF spec, variable names and attributes should be padded with null
padding to the nearest 4-byte boundary, but scipy has been padding with b'0'
instead:
http://www.unidata.ucar.edu/software/netcdf/docs/file_format_specifications.html

This surfaced recently after the netCDF-C library added more stringent checks
in version 4.5.0 which rendered it unable to read files written with SciPy:
Unidata/netcdf-c#657

This change to netCDF-C is likely going to be rolled back, but it's still a
good idea for us to write compliant files. Apparently there are other netCDF
implementations that also insist on padding with nulls.
@shoyer
Copy link
Contributor Author

shoyer commented Nov 18, 2017

Note: in general I would always add a unit test for bug fixes, but I'm not really sure how to do that in this case short of manually verifying the bytes for a full serialized file (which seems excessive).

@ghost
Copy link

ghost commented Dec 4, 2017

@shoyer Why not install netcdf from pypi and then use that to read the files for the test?

@shoyer
Copy link
Contributor Author

shoyer commented Dec 4, 2017

Why not install netcdf from pypi and then use that to read the files for the test?

This seems rather heavy-weight -- libnetcdf is a big binary dependency that can be tricky to install. Though I suppose if we make the test dependency optional it could still be a good way to pick-up incompatibilities.

I think a better testing approach would be to hand-write a minimal netCDF file according to spec, and verify that scipy produces a bit-wise identical result. This shouldn't be too onerous: a minimal example with xarray is only 108 bytes: len(xarray.Dataset(coords={'x': [-9999]}, attrs={'a': 'b'}).to_netcdf())

@ghost
Copy link

ghost commented Dec 4, 2017

This seems rather heavy-weight -- libnetcdf is a big binary dependency that can be tricky to install.

Not tricky to install at all:

pip install netcdf4

Also there are no issues with including additional test dependencies, because the test will be skipped if the module cannot be imported.

@shoyer
Copy link
Contributor Author

shoyer commented Dec 4, 2017

Okay, I suppose we do have binary wheels these days :).

This still isn't the best test for this issue -- netCDF4 will read old files created by SciPy just fine, unless you are using libnetcdf=4.5.0.

@shoyer shoyer force-pushed the netcdf-nullpadding branch 4 times, most recently from 4faafd1 to 69b9e38 Compare December 5, 2017 01:59
@rgommers rgommers added maintenance Items related to regular maintenance tasks scipy.io labels Dec 9, 2017
@rgommers rgommers added this to the 1.1.0 milestone Dec 9, 2017
@rgommers rgommers added the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Dec 9, 2017
@rgommers
Copy link
Member

rgommers commented Dec 9, 2017

Thanks @shoyer. Test that was added LGTM. We have _FillValue support (since a release or two IIRC), but I agree it can be left to another PR.

Might warrant backporting if that change to netCDF-C is not rolled back.

Will merge in a day or so unless there are further comments.

@shoyer
Copy link
Contributor Author

shoyer commented Dec 9, 2017

We have _FillValue support (since a release or two IIRC), but I agree it can be left to another PR.

_FillValue support is still not quite right. Per spec, we probably should add an optional fill_value keyword argument to createVariable, and if provided use it to pre-fill a variable's data (and set the _FillValue attribute) rather than using np.empty(). Currently, _FillValue is only used for masking (if scaleandmask=True).

I might add this in another PR, but I'm not sure it's really worth the trouble. As far as we know, there are no netCDF applications that check trailing bytes on variables, and pre-filling variable data would entail a slight performance hit over just using np.empty().

Might warrant backporting if that change to netCDF-C is not rolled back.

We are good here, the netCDF-C change was indeed rolled back.

@rgommers rgommers removed the backport-candidate This fix should be ported by a maintainer to previous SciPy versions. label Dec 9, 2017
@shoyer shoyer force-pushed the netcdf-nullpadding branch 3 times, most recently from 1eb975a to b9ba9c3 Compare December 10, 2017 00:31
@rgommers
Copy link
Member

Test failures are real.

@shoyer
Copy link
Contributor Author

shoyer commented Dec 10, 2017

Tests are passing now. As you can probably tell, I have not yet gone to the trouble of setting up a local development environment :).

I realized that I could indeed check the _FillValue attribute when writing data, which is exactly what we do now. This should make netCDF files produced by scipy fully compatible with the spec.

For invalid fill values, I decided to use the default fill value for the data type instead. In principle, we might raise an error for this instead, but this should be the least disruptive change for scipy users.

@rgommers rgommers merged commit 306a2f0 into scipy:master Dec 11, 2017
@rgommers
Copy link
Member

LGTM, thanks Stephan.

@shoyer shoyer deleted the netcdf-nullpadding branch December 11, 2017 16:52
@bderembl
Copy link

I wonder if it is related to this issue but ncview will no longer open netcdf files created with scipy.io
"ncview: can't recognize format of input file avges.nc"
it may be an issue with ncview though.. (but it used to work well a few months ago)

@shoyer
Copy link
Contributor Author

shoyer commented Feb 12, 2018

@bderembl I think this should only be a problem with netCDF-C version 4.5.0. If you're using either an older or newer release netCDF should be able to read these files.

@bderembl
Copy link

indeed.
thanks

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
maintenance Items related to regular maintenance tasks scipy.io
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants