
“ValueError: chunksize cannot exceed dimension size” when trying to write xarray to netcdf #1225

Closed
shoyer opened this issue Jan 24, 2017 · 11 comments


@shoyer
Member

shoyer commented Jan 24, 2017

Reported on StackOverflow: http://stackoverflow.com/questions/39900011/valueerror-chunksize-cannot-exceed-dimension-size-when-trying-to-write-xarray

Unfortunately, the given example is not self-contained:

import numpy as np
import xarray as xr

ds = xr.open_dataset("somefile.nc", chunks={'lat': 72, 'lon': 144})
# ds is 335 (time) x 720 (lat) x 1440 (lon) and has variable 'var'
myds = ds.copy()

def some_function(x):
    return x * 2

myds['newvar'] = xr.DataArray(np.apply_along_axis(some_function, 0, ds['var']))
myds = myds.drop('var')
myds.to_netcdf("somenewfile.nc")

Apparently this works if engine='scipy' is passed to to_netcdf!

Something strange is definitely going on, I suspect a bug.

@jgerardsimcock

I've also just encountered this. I'll try to put together a self-contained example that reproduces it.

@tbohn

tbohn commented Jun 9, 2017

I've been encountering this as well, and I don't want to use the scipy engine workaround. If you can tell me what a "self-contained" example means, I can also try to provide one.

@shoyer
Member Author

shoyer commented Jun 9, 2017

@tbohn "self-contained" just means something that I can run on my machine. For example, the code above plus the "somefile.nc" netCDF file that I can load to reproduce this example.

Thinking about this a little more, I think the issue is somehow related to the encoding['chunksizes'] property on the Dataset variables loaded from the original netCDF file. Something like this should work as a work-around:

del myds['var'].encoding['chunksizes']

The bug is somewhere in our handling of chunksize encoding for netCDF4, but it is difficult to fix it without being able to run code that reproduces it.
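As a minimal sketch (assuming the same file and variable names as the original report), the workaround applied end to end would look like this:

import xarray as xr

ds = xr.open_dataset("somefile.nc")  # assumed input file
myds = ds.copy()
# The encoding dict carries the chunk sizes read from the source file:
print(myds['var'].encoding.get('chunksizes'))
# Dropping them lets the netCDF4 backend choose chunk sizes that fit the data
# (pop avoids a KeyError if the entry is absent).
myds['var'].encoding.pop('chunksizes', None)
myds.to_netcdf("somenewfile.nc")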

@tbohn

tbohn commented Jun 9, 2017

OK, here's my code and the file that it works (fails) on.

Code:

import os.path
import numpy as np
import xarray as xr
ds = xr.open_dataset('veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates.nc')
ds_out = ds.isel(lat=slice(0,16),lon=slice(0,16))
#ds_out.encoding['unlimited_dims'] = 'time'
ds_out.to_netcdf('test.out.nc')

Note that I commented out the attempt to make 'time' unlimited - if I attempt it, I get a slightly different chunk size error ('NetCDF: Bad chunk sizes').
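For reference, a hedged sketch of what the unlimited-dimension request was presumably meant to look like: the encoding entry expects a collection of dimension names rather than a bare string, and to_netcdf also accepts an unlimited_dims argument in recent xarray versions (whether that keyword is available here is an assumption, and neither form addresses the chunk size error itself):

import xarray as xr

ds = xr.open_dataset('veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates.nc')
ds_out = ds.isel(lat=slice(0, 16), lon=slice(0, 16))
# Request the unlimited dimension via dataset-level encoding (a collection, not a string)...
ds_out.encoding['unlimited_dims'] = {'time'}
ds_out.to_netcdf('test.out.nc')
# ...or pass it directly to to_netcdf, if the keyword is supported:
# ds_out.to_netcdf('test.out.nc', unlimited_dims=['time'])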

I realize that for now I can use 'ncks' as a workaround, but it seems to me that xarray should be able to do this too.

File (attached)
veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates.nc.zip

@tbohn

tbohn commented Jun 9, 2017

(Note also that for the example nc file I provided, the slice my example code takes contains nothing but null values, but that's irrelevant: the error also happens for other slices that do contain non-null values.)

@jhamman
Member

jhamman commented Aug 30, 2017

@tbohn - What is happening here is that xarray is storing the netCDF4 chunk size from the input file. For the LAI variable in your example, that is LAI:_ChunkSizes = 19, 1, 160, 160; (you can see this with ncdump -h -s filename.nc).

$ ncdump -s -h veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates.nc
netcdf veg_hist.0_10n.90_80w.2000_2016.mode_PFT.5dates {
dimensions:
	veg_class = 19 ;
	lat = 160 ;
	lon = 160 ;
	time = UNLIMITED ; // (5 currently)
variables:
	float Cv(veg_class, lat, lon) ;
		Cv:_FillValue = -1.f ;
		Cv:units = "-" ;
		Cv:longname = "Area Fraction" ;
		Cv:missing_value = -1.f ;
		Cv:_Storage = "contiguous" ;
		Cv:_Endianness = "little" ;
	float LAI(veg_class, time, lat, lon) ;
		LAI:_FillValue = -1.f ;
		LAI:units = "m2/m2" ;
		LAI:longname = "Leaf Area Index" ;
		LAI:missing_value = -1.f ;
		LAI:_Storage = "chunked" ;
		LAI:_ChunkSizes = 19, 1, 160, 160 ;
		LAI:_Endianness = "little" ;
...

Those integers correspond to the dimensions from LAI. When you slice your dataset, you end up with lat/lon dimensions that are now smaller than the _ChunkSizes. When writing this back to netCDF, xarray is still trying to use the original encoding attribute.

The logical fix is to validate this encoding attribute and either (1) throw an informative error if something isn't going to work, or (2) change the _ChunkSizes, as sketched below.
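A rough sketch (not xarray's actual implementation) of option 2, clamping each stored chunk size to the variable's current dimension size before writing:

def clamp_chunksizes(ds):
    """Shrink any encoded netCDF4 chunk sizes that exceed the variable's current shape."""
    for var in ds.variables.values():
        chunksizes = var.encoding.get('chunksizes')
        if chunksizes is not None:
            var.encoding['chunksizes'] = tuple(
                min(chunk, size) for chunk, size in zip(chunksizes, var.shape)
            )
    return ds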

@tbohn

tbohn commented Aug 30, 2017 via email

@cwerner

cwerner commented Nov 9, 2017

Is there any news on this? I have the same problem. A reset_chunksizes() method would be very helpful. Also, what is the cleanest way to remove all chunk size info? I have a very long computation, and it fails at the very end with the mentioned error message. My file is patched together from many sources...

cheers

@shoyer
Member Author

shoyer commented Nov 10, 2017

@cwerner Sorry to hear about your trouble; I will take another look at this.

Right now, your best bet is probably something like:

def clean_dataset(ds):
    for var in ds.variables.values():
        if 'chunksizes' in var.encoding:
            del var.encoding['chunksizes']
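For a dataset patched together from many sources, usage would just be (output file name is hypothetical):

clean_dataset(ds)
ds.to_netcdf('patched_output.nc')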

@cwerner

cwerner commented Nov 10, 2017

Thanks for that Stephan.

The workaround looks good for the moment ;-)...
Detecting a mismatch (and maybe even correcting it) automatically would be very useful.

cheers,
C

@shoyer
Member Author

shoyer commented Nov 10, 2017

After some digging, it turns out this came up quite a while ago back in #156, where we added some code to fix it.

Looking at @tbohn's dataset, the problem variable is actually the coordinate variable 'time' corresponding to the unlimited dimension:

In [7]: ds.variables['time']
Out[7]:
<class 'netCDF4._netCDF4.Variable'>
int32 time(time)
    units: days since 2000-01-01 00:00:00.0
unlimited dimensions: time
current shape = (5,)
filling on, default _FillValue of -2147483647 used

In [8]: ds.variables['time'].chunking()
Out[8]: [1048576]

In [9]: 2 ** 20
Out[9]: 1048576

In [10]: ds.dimensions
Out[10]:
OrderedDict([('veg_class',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'veg_class', size = 19),
             ('lat',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'lat', size = 160),
             ('lon',
              <class 'netCDF4._netCDF4.Dimension'>: name = 'lon', size = 160),
             ('time',
              <class 'netCDF4._netCDF4.Dimension'> (unlimited): name = 'time', size = 5)])

For some reason netCDF4 gives it a chunking of 2 ** 20, even though it only has length 5. This leads to an error when we write a file back with the original chunking.
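A small sketch of the kind of check xarray could perform (a detection pass along the lines of option 1 above, here just reporting and dropping the stale encoding rather than raising), assuming ds is the dataset opened with xarray rather than with netCDF4 directly; for this file it flags the 'time' coordinate:

for name, var in ds.variables.items():
    chunksizes = var.encoding.get('chunksizes')
    if chunksizes is not None and any(c > s for c, s in zip(chunksizes, var.shape)):
        # would report something like: time: chunksizes (1048576,) exceed shape (5,)
        print(f"{name}: chunksizes {chunksizes} exceed shape {var.shape}")
        del var.encoding['chunksizes']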
