
“ValueError: chunksize cannot exceed dimension size” when trying to write xarray to netcdf #1225

Closed
shoyer opened this issue Jan 24, 2017 · 11 comments


@shoyer
Member

shoyer commented Jan 24, 2017

Reported on StackOverflow: http://stackoverflow.com/questions/39900011/valueerror-chunksize-cannot-exceed-dimension-size-when-trying-to-write-xarray

Unfortunately, the given example is not self-contained:

import numpy as np
import xarray as xr

ds = xr.open_dataset("somefile.nc", chunks={'lat': 72, 'lon': 144})
# ds is 335 (time) x 720 (lat) x 1440 (lon) and has variable 'var'
myds = ds.copy()

def some_function(x):
    return x * 2

myds['newvar'] = xr.DataArray(np.apply_along_axis(some_function, 0, ds['var']))
myds = myds.drop('var')
myds.to_netcdf("somenewfile.nc")

Apparently this works if engine='scipy' is passed to to_netcdf!

Something strange is definitely going on, I suspect a bug.

@jgerardsimcock

I've also just encountered this. I'll try to put together a self-contained example that reproduces it.

@tbohn

tbohn commented Jun 9, 2017

I've been encountering this as well, and I don't want to use the scipy engine workaround. If you can tell me what a "self-contained" example means, I can also try to provide one.

@shoyer
Member Author

shoyer commented Jun 9, 2017

@tbohn "self-contained" just means something that I can run on my machine. For example, the code above plus the "somefile.nc" netCDF file that I can load to reproduce this example.

Thinking about this a little more, I think the issue is somehow related to the encoding['chunksizes'] property on the Dataset variables loaded from the original netCDF file. Something like this should work as a work-around:

del myds['var'].encoding['chunksizes']

The bug is somewhere in our handling of chunksize encoding for netCDF4, but it is difficult to fix it without being able to run code that reproduces it.
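As a minimal sketch (assuming the same file and variable names as the original report), the workaround applied end to end would look like this:

import xarray as xr

ds = xr.open_dataset("somefile.nc")  # assumed input file
myds = ds.copy()
# The encoding dict carries the chunk sizes read from the source file:
print(myds['var'].encoding.get('chunksizes'))
# Dropping them lets the netCDF4 backend choose chunk sizes that fit the data
# (pop avoids a KeyError if the entry is absent).
myds['var'].encoding.pop('chunksizes', None)
myds.to_netcdf("somenewfile.nc")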

@tbohn

tbohn commented Jun 9, 2017

OK, here's my code and the file that it works (fails) on.

Code:

import os.path
import numpy as np
import xarray as xr
ds = xr.open_dataset('veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates.nc')
ds_out = ds.isel(lat=slice(0,16),lon=slice(0,16))
#ds_out.encoding['unlimited_dims'] = 'time'
ds_out.to_netcdf('test.out.nc')

Note that I commented out the attempt to make 'time' unlimited - if I attempt it, I get a slightly different chunk size error ('NetCDF: Bad chunk sizes').
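For reference, a hedged sketch of what the unlimited-dimension request was presumably meant to look like: the encoding entry expects a collection of dimension names rather than a bare string, and to_netcdf also accepts an unlimited_dims argument in recent xarray versions (whether that keyword is available here is an assumption, and neither form addresses the chunk size error itself):

import xarray as xr

ds = xr.open_dataset('veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates.nc')
ds_out = ds.isel(lat=slice(0, 16), lon=slice(0, 16))
# Request the unlimited dimension via dataset-level encoding (a collection, not a string)...
ds_out.encoding['unlimited_dims'] = {'time'}
ds_out.to_netcdf('test.out.nc')
# ...or pass it directly to to_netcdf, if the keyword is supported:
# ds_out.to_netcdf('test.out.nc', unlimited_dims=['time'])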

I realize that for now I can use 'ncks' as a workaround, but it seems to me that xarray should be able to do this too.

File (attached)
veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates.nc.zip

@tbohn

tbohn commented Jun 9, 2017

(Note also that for the example nc file I provided, the slice my example code takes contains nothing but null values, but that's irrelevant: the error also happens for other slices that do contain non-null values.)

@jhamman
Member

jhamman commented Aug 30, 2017

@tbohn - What is happening here is that xarray is storing the netCDF4 chunk size from the input file. For the LAI variable in your example, that is LAI:_ChunkSizes = 19, 1, 160, 160; (you can see this with ncdump -h -s filename.nc).

$ ncdump -s -h veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates.nc
netcdf veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates {
dimensions:
	veg_class = 19 ;
	lat = 160 ;
	lon = 160 ;
	time = UNLIMITED ; // (5 currently)
variables:
	float Cv(veg_class, lat, lon) ;
		Cv:_FillValue = -1.f ;
		Cv:units = "-" ;
		Cv:longname = "Area Fraction" ;
		Cv:missing_value = -1.f ;
		Cv:_Storage = "contiguous" ;
		Cv:_Endianness = "little" ;
	float LAI(veg_class, time, lat, lon) ;
		LAI:_FillValue = -1.f ;
		LAI:units = "m2/m2" ;
		LAI:longname = "Leaf Area Index" ;
		LAI:missing_value = -1.f ;
		LAI:_Storage = "chunked" ;
		LAI:_ChunkSizes = 19, 1, 160, 160 ;
		LAI:_Endianness = "little" ;
...

Those integers correspond to the dimensions from LAI. When you slice your dataset, you end up with lat/lon dimensions that are now smaller than the _ChunkSizes. When writing this back to netCDF, xarray is still trying to use the original encoding attribute.

The logical fix is to validate this encoding attribute and either (1) throw an informative error if something isn't going to work, or (2) change the _ChunkSizes, as sketched below.
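A rough sketch (not xarray's actual implementation) of option 2, clamping each stored chunk size to the variable's current dimension size before writing:

def clamp_chunksizes(ds):
    """Shrink any encoded netCDF4 chunk sizes that exceed the variable's current shape."""
    for var in ds.variables.values():
        chunksizes = var.encoding.get('chunksizes')
        if chunksizes is not None:
            var.encoding['chunksizes'] = tuple(
                min(chunk, size) for chunk, size in zip(chunksizes, var.shape)
            )
    return ds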

@tbohn

tbohn commented Aug 30, 2017 via email

@cwerner

cwerner commented Nov 9, 2017

Is there any news on this? I have the same problem. A reset_chunksizes() method would be very helpful. Also, what is the cleanest way to remove all chunk size info? I have a very long computation, and it fails at the very end with the mentioned error message. My file is patched together from many sources...

cheers

@shoyer
Member Author

shoyer commented Nov 10, 2017

@cwerner Sorry to hear about your trouble; I will take another look at this.

Right now, your best bet is probably something like:

def clean_dataset(ds):
    for var in ds.variables.values():
        if 'chunksizes' in var.encoding:
            del var.encoding['chunksizes']
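For a dataset patched together from many sources, usage would just be (output file name is hypothetical):

clean_dataset(ds)
ds.to_netcdf('patched_output.nc')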

@cwerner

cwerner commented Nov 10, 2017

Thanks for that Stephan.

The workaround looks good for the moment ;-)...
Detecting a mismatch (and maybe even correcting it) automatically would be very useful.

cheers,
C

@shoyer
Member Author

shoyer commented Nov 10, 2017

After some digging, it turns out this came up quite a while ago back in #156, where we added some code to fix it.

Looking at @tbohn's dataset, the problem variable is actually the coordinate variable 'time' corresponding to the unlimited dimension:

In [7]: ds.variables['time']
Out[7]:
<class 'netCDF4._netCDF4.Variable'>
int32 time(time)
    units: days since 2000-01-01 00:00:00.0
unlimited dimensions: time
current shape = (5,)
filling on, default _FillValue of -2147483647 used

In [8]: ds.variables['time'].chunking()
Out[8]: [1048576]

In [9]: 2 ** 20
Out[9]: 1048576

In [10]: ds.dimensions
Out[10]:
OrderedDict([('veg_class',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'veg_class', size = 19),
             ('lat',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 160),
             ('lon',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 160),
             ('time',
              <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 5)])

For some reason netCDF4 gives it a chunking of 2 ** 20, even though it only has length 5. This leads to an error when we write a file back with the original chunking.
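A small sketch of the kind of check xarray could perform (a detection pass along the lines of option 1 above, here just reporting and dropping the stale encoding rather than raising), assuming ds is the dataset opened with xarray rather than with netCDF4 directly; for this file it flags the 'time' coordinate:

for name, var in ds.variables.items():
    chunksizes = var.encoding.get('chunksizes')
    if chunksizes is not None and any(c > s for c, s in zip(chunksizes, var.shape)):
        # would report something like: time: chunksizes (1048576,) exceed shape (5,)
        print(f"{name}: chunksizes {chunksizes} exceed shape {var.shape}")
        del var.encoding['chunksizes']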
