Rolling mean of dask array conflicting sizes for data and coordinate in rolling operation #2113

Closed
raybellwaves opened this issue May 10, 2018 · 4 comments · Fixed by #2122

@raybellwaves
Contributor

Code Sample, a copy-pastable example if possible

import xarray as xr
remote_data = xr.open_dataarray('http://iridl.ldeo.columbia.edu/SOURCES/.Models'\
                                '/.SubX/.RSMAS/.CCSM4/.hindcast/.zg/dods',
                                chunks={'L': 1, 'S': 1})
da = remote_data.isel(P=0,L=0,M=0,X=0,Y=0)
da_day_clim = da.groupby('S.dayofyear').mean('S')
da_day_clim2 = da_day_clim.chunk({'dayofyear': 366})
da_day_clim_smooth = da_day_clim2.rolling(dayofyear=31, center=True).mean()

Problem description

Initially discussed on SO: https://stackoverflow.com/questions/50265586/xarray-rolling-mean-of-dask-array-conflicting-sizes-for-data-and-coordinate-in

The rolling operation raises "ValueError: conflicting sizes for dimension 'dayofyear': length 351 on the data but length 366 on coordinate 'dayofyear'". The length of 351 on the data is created inside the rolling operation.

Here's the full traceback:

ValueError                                Traceback (most recent call last)
<ipython-input-57-6acf382cdd3d> in <module>()
      4 da_day_clim = da.groupby('S.dayofyear').mean('S')
      5 da_day_clim2 = da_day_clim.chunk({'dayofyear': 366})
----> 6 da_day_clim_smooth = da_day_clim2.rolling(dayofyear=31, center=True).mean()

~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/rolling.py in wrapped_func(self, **kwargs)
    307             if self.center:
    308                 values = values[valid]
--> 309             result = DataArray(values, self.obj.coords)
    310 
    311             return result

~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/dataarray.py in __init__(self, data, coords, dims, name, attrs, encoding, fastpath)
    224 
    225             data = as_compatible_data(data)
--> 226             coords, dims = _infer_coords_and_dims(data.shape, coords, dims)
    227             variable = Variable(dims, data, attrs, encoding, fastpath=True)
    228 

~/anaconda/envs/SubXNAO/lib/python3.6/site-packages/xarray/core/dataarray.py in _infer_coords_and_dims(shape, coords, dims)
     79                 raise ValueError('conflicting sizes for dimension %r: '
     80                                  'length %s on the data but length %s on '
---> 81                                  'coordinate %r' % (d, sizes[d], s, k))
     82 
     83         if k in sizes and v.shape != (sizes[k],):

ValueError: conflicting sizes for dimension 'dayofyear': length 351 on the data but length 366 on coordinate 'dayofyear'

Expected Output

The rolling operation should work on the dask array as it does on a DataArray that is not backed by dask, e.g.:

import pandas as pd
import xarray as xr
import numpy as np

dates = pd.date_range('1/1/1980', '31/12/2000', freq='D')
data = np.linspace(1, len(dates), num=len(dates), dtype=float)  # np.float was removed in newer NumPy
da = xr.DataArray(data, coords=[dates], dims='time')
da_day_clim = da.groupby('time.dayofyear').mean('time')
da_day_clim_smooth = da_day_clim.rolling(dayofyear=31, center=True).mean()

Output of xr.show_versions()

/Users/Ray/anaconda/envs/SubXNAO/lib/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`. from ._conv import register_converters as _register_converters

INSTALLED VERSIONS

commit: None
python: 3.6.5.final.0
python-bits: 64
OS: Darwin
OS-release: 17.5.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

xarray: 0.10.3
pandas: 0.22.0
numpy: 1.14.2
scipy: 1.0.1
netCDF4: 1.3.1
h5netcdf: 0.5.1
h5py: 2.7.1
Nio: None
zarr: None
bottleneck: 1.2.1
cyordereddict: None
dask: 0.17.4
distributed: 1.21.8
matplotlib: 2.2.2
cartopy: 0.16.0
seaborn: None
setuptools: 39.1.0
pip: 9.0.3
conda: None
pytest: None
IPython: 6.3.1
sphinx: None

@raybellwaves
Contributor Author

This probably isn't a good first issue, but I would like to spend some time on it. Suggestions for places to look and things to try are welcome.

@shoyer
Member

shoyer commented May 10, 2018

@raybellwaves thanks for the bug report!

Here's a slightly simplified version that shows the issue for dask:

import pandas as pd
import xarray as xr
import numpy as np

da_day_clim = xr.DataArray(np.arange(1, 367), coords=[np.arange(1, 367)], dims='dayofyear')
print(da_day_clim.rolling(dayofyear=31, center=True).mean())  # works
print(da_day_clim.chunk().rolling(dayofyear=31, center=True).mean())  # raises ValueError

The traceback is probably the place to start looking into this. I like to drop into a Python debugger (type %debug after the line that raised the error), navigate to different levels in the traceback with u/d, and then print out variables until I can identify which specific function/block of code seems to be doing the wrong thing.
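For anyone who prefers a non-interactive variant of the same idea, here is a minimal sketch that walks the traceback frames and lists the local variables available in each one. It uses only the standard library; `build` and `rolling_step` are hypothetical stand-ins that mimic the shape of the failure, not xarray code.

```python
import traceback


def summarize_frames(exc):
    """Return (function name, sorted local variable names) for each frame."""
    summary = []
    for frame, _lineno in traceback.walk_tb(exc.__traceback__):
        summary.append((frame.f_code.co_name, sorted(frame.f_locals)))
    return summary


def build(values, coords):
    # Hypothetical stand-in for a constructor that checks sizes.
    if len(values) != len(coords):
        raise ValueError(
            "conflicting sizes: %d on the data but %d on the coordinate"
            % (len(values), len(coords))
        )
    return list(zip(coords, values))


def rolling_step(values, coords):
    # Hypothetical stand-in for the step that slices the data.
    valid = slice(15, None)
    values = values[valid]
    return build(values, coords)


try:
    rolling_step(list(range(366)), list(range(366)))
except ValueError as err:
    frames = summarize_frames(err)
    for name, local_names in frames:
        print(name, local_names)
```

Each printed line corresponds to one level you would reach with u/d in the debugger, so it gives a quick overview of which frame holds the variable that changed length.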

@shoyer shoyer added the bug label May 10, 2018
@raybellwaves
Contributor Author

raybellwaves commented May 11, 2018

I realized there is an issue even before that: without center=True on the dask array it doesn't raise an error, but it returns rubbish:

import xarray as xr
import numpy as np

a = xr.DataArray(np.arange(1,4), coords=[np.arange(1,4)], dims='x')
print(a.rolling(x=3, center=True).mean())
Out[2]: 
<xarray.DataArray (x: 3)>
array([ nan,   2.,  nan])
Coordinates:
  * x        (x) int64 1 2 3
print(a.chunk().rolling(x=3).mean().values)
Out[3]: array([ -6.14891469e+18,  -9.22337204e+18,  -1.22978294e+19])

The culprit lies in a dask function

out = ag.map_blocks(moving_func, window, min_count=min_count,

Not sure if this is an issue with the function or the way the data is going into the function.

For the center=True issue:

values = values[valid]

is slicing the data.
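A pure-Python sketch of why that slice could produce the 351 in the error message. It assumes (this is a guess, not something the traceback shows) that the centered path is meant to pad the end of the data by window // 2 before computing a trailing mean, and then slice the same number of elements off the front. If the padding is skipped, the slice shortens the result. `trailing_mean` and `centered_mean` are hypothetical helpers, not xarray functions.

```python
import math


def trailing_mean(values, window):
    """Trailing-window mean; positions without a full, NaN-free window are NaN."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        if len(chunk) < window or any(math.isnan(v) for v in chunk):
            out.append(math.nan)
        else:
            out.append(sum(chunk) / window)
    return out


def centered_mean(values, window, pad=True):
    """Center a trailing mean by padding the end, then slicing the front."""
    shift = window // 2  # 15 for window=31
    data = values + [math.nan] * shift if pad else values
    return trailing_mean(data, window)[shift:]


doy = [float(d) for d in range(1, 367)]
print(len(centered_mean(doy, 31, pad=True)))   # 366, matches the coordinate
print(len(centered_mean(doy, 31, pad=False)))  # 351, matches the error message
print(centered_mean([1.0, 2.0, 3.0], 3))       # [nan, 2.0, nan]
```

With padding the result keeps its original length; without it, 366 - 15 = 351 elements remain, which is consistent with the ValueError above.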

@fujiisoup
Member

fujiisoup commented May 11, 2018

I noticed that this bug arises when bottleneck is installed.

EDIT
The test suite was not checking center=True. I will look inside the code.
