
Relax logic for comparing grid attributes between Run and Model objs #199

Conversation

spencerkclark
Collaborator

Fixes #187

The way I tested this was by adding a grid attribute ('zsurf') of all zeros to the files used in some test Run objects:

  • 00040101.precip_monthly.nc
  • 00050101.precip_monthly.nc
  • 00060101.precip_monthly.nc

but leaving it off the files used in constructing their Model object:

  • 00060101.sphum_monthly.nc

The tests do not pass without the one-line change I made in calc.py.
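Roughly speaking, the relaxed check amounts to preferring the Model's copy of a grid attribute when it has one and otherwise falling back to the Run's. A minimal sketch with hypothetical names (the real change is a single line inside Calc._add_grid_attributes):

def reconcile_grid_attr(run_attr, model_attr):
    """Use the Model's grid attribute when it exists; otherwise fall
    back to the values found only in the Run, instead of raising."""
    if model_attr is None:
        # Attribute exists only in the Run; use it as-is.
        return run_attr
    return model_attr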

Fixes spencerahill#187

Previously an error would be raised if a grid attribute found in a Run
object did not exist in the Run's corresponding Model.  This allows
aospy to use grid attributes found only in Run objects.
@spencerahill
Owner

Gotta love a one-line fix! Any idea what's going on in the tests with the Notebook? Otherwise this looks good to me.

@spencerkclark
Collaborator Author

@spencerahill I need to run now, but it seems that the notebook exposed a valid use-case that was problematic. My commit fixes things, but I don't think it is the ideal solution. I'll try and lay out the problem in more detail tomorrow so we can discuss the options.

@spencerahill
Owner

Sounds good. No rush.

@spencerkclark
Collaborator Author

spencerkclark commented Sep 2, 2017

The notebook sets up some aospy objects in the following manner:

...
from aospy.data_loader import DictDataLoader
file_map = {'monthly': os.path.join(rootdir, '000[4-6]0101.precip_monthly.nc')}
data_loader = DictDataLoader(file_map)

from aospy import Run
example_run = Run(
    name='example_run',
    description='Control simulation of the idealized moist model',
    data_loader=data_loader
)

from aospy import Model
example_model = Model(
    name='example_model',
    grid_file_paths=(
        os.path.join(rootdir, '00040101.precip_monthly.nc'),
        os.path.join(rootdir, 'im.landmask.nc')
    ),
    runs=[example_run]  # only one Run in our case, but could be more
)
...

As of this PR each *precip_monthly.nc file now contains a 2D zsurf variable:

In [1]: import xarray as xr

In [2]: single_file = xr.open_dataset('00040101.precip_monthly.nc', decode_cf=False)

In [3]: single_file.zsurf
Out[3]:
<xarray.DataArray 'zsurf' (lat: 64, lon: 128)>
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)
Coordinates:
  * lon      (lon) float64 0.0 2.812 5.625 8.438 11.25 14.06 16.88 19.69 ...
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 -76.74 -73.95 -71.16 ...
Attributes:
    _FillValue:     1e+20
    units:          m
    long_name:      Surface height
    time_avg_info:  average_T1,average_T2,average_DT
    cell_methods:   time: mean

The issue stems from how concatenation proceeds using xr.open_mfdataset in data_loader.py:

aospy/aospy/data_loader.py, lines 185 to 190 (at ce7c784):

apply_preload_user_commands(file_set)
func = _preprocess_and_rename_grid_attrs(preprocess_func, **kwargs)
return xr.open_mfdataset(file_set, preprocess=func,
                         concat_dim=internal_names.TIME_STR,
                         decode_times=False, decode_coords=False,
                         mask_and_scale=True)

When the three files are concatenated together using xr.open_mfdataset, a time dimension is added to zsurf:

In [4]: ds = xr.open_mfdataset('000[4-6]0101.precip_monthly.nc', decode_cf=False, concat_dim='time').load()

In [5]: ds.zsurf
Out[5]:
<xarray.DataArray 'zsurf' (time: 36, lat: 64, lon: 128)>
array([[[ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.],
        ...,
        [ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.]],

       [[ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.],
        ...,
        [ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.]],

       ...,
       [[ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.],
        ...,
        [ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.]],

       [[ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.],
        ...,
        [ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.]]], dtype=float32)
Coordinates:
  * lon      (lon) float64 0.0 2.812 5.625 8.438 11.25 14.06 16.88 19.69 ...
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 -76.74 -73.95 -71.16 ...
  * time     (time) float64 1.111e+03 1.139e+03 1.17e+03 1.2e+03 1.231e+03 ...
Attributes:
    _FillValue:     1e+20
    units:          m
    long_name:      Surface height
    time_avg_info:  average_T1,average_T2,average_DT
    cell_methods:   time: mean

Note that three files are specified in the Run object created in the notebook, and one *precip_monthly.nc file is specified in the Model object. So the zsurf grid attribute in the Run object has a time dimension, but the zsurf grid attribute in the Model object does not. When these reach the Calc._add_grid_attributes step, they are compared (the key line is 283):

aospy/aospy/calc.py, lines 282 to 295 (at ce7c784):

if not np.array_equal(ds[name_int], model_attr):
    if np.allclose(ds[name_int], model_attr):
        msg = ("Values for '{0}' are nearly (but not exactly) "
               "the same in the Run {1} and the Model {2}. "
               "Therefore replacing Run's values with the "
               "model's.".format(name_int, self.run,
                                 self.model))
        logging.info(msg)
        ds[name_int].values = model_attr.values
    else:
        msg = ("Model coordinates for '{0}' do not match those"
               " in Run: {1} vs. {2}"
               "".format(name_int, ds[name_int], model_attr))
        logging.info(msg)

For some reason, np.allclose returns True in this case, even though the arrays have differing shapes:

In [6]: import numpy as np

In [7]: np.allclose(single_file.zsurf, ds.zsurf)
Out[7]: True

So when assignment is attempted on line 290, it fails because the arrays have different shapes. The workaround I've used for now is to replace np.allclose with xarray's version, which does check the dimensions and shapes of the arrays but works as an assertion statement (rather than returning True or False). Then assignment is no longer attempted, and everything works.
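For illustration, a boolean wrapper around xarray's assertion-style comparison could look like this (a sketch assuming xr.testing.assert_allclose; the actual one-line change is in calc.py):

import xarray as xr

def shape_aware_allclose(run_attr, model_attr):
    """Closeness check that respects dimensions and shapes, returning a
    bool instead of raising, unlike np.allclose's broadcasting behavior."""
    try:
        xr.testing.assert_allclose(run_attr, model_attr)
    except AssertionError:
        return False
    return True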

This workaround is OK, but it does not address the underlying issue, which is that zsurf should not acquire a time dimension in the first place. I'm not aware of a way to prevent this with xr.open_mfdataset; with xr.concat this can be done by using the data_vars='minimal' keyword argument:

In [8]: xr.concat([xr.open_dataset('000{}0101.precip_monthly.nc'.format(year)) for year in [4, 5, 6]], dim='time', data_vars='minimal').zsurf
/nbhome/skc/miniconda3/envs/research/lib/python3.6/site-packages/xarray/conventions.py:393: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
  result = decode_cf_datetime(example_value, units, calendar)
/nbhome/skc/miniconda3/envs/research/lib/python3.6/site-packages/xarray/conventions.py:412: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
  calendar=self.calendar)
Out[8]:
<xarray.DataArray 'zsurf' (lat: 64, lon: 128)>
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
Coordinates:
  * lon      (lon) float64 0.0 2.812 5.625 8.438 11.25 14.06 16.88 19.69 ...
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 -76.74 -73.95 -71.16 ...
Attributes:
    units:          m
    long_name:      Surface height
    time_avg_info:  average_T1,average_T2,average_DT
    cell_methods:   time: mean

My impression is that the ideal solution would be if we could provide an argument like data_vars='minimal' to xr.open_mfdataset, but that's not an available option. Does that make sense? Am I missing a way of doing this that already exists?

@spencerahill
Owner

Thanks for digging into this.

For some reason, np.allclose returns True in this case, even though the arrays have differing shapes

I stumbled upon this myself in #181 (although I'm not sure I made a written note of it). This sure seems like a bug to me: the allclose documentation only gives examples of same-shaped arrays, and there's an open issue on the NumPy tracker about basically the same thing. But it's >2 years old with no movement, so we might consider giving it a ping.

Here's the MWE I came up with, including the numpy version to show the behavior still exists in the latest release:

In [2]: np.version.version
Out[2]: '1.13.1'

In [5]: np.allclose(np.ones([5, 1]), np.ones([1]))
Out[5]: True

And there's some further odd behavior:

In [7]: np.allclose(np.ones([5, 1]), np.ones([5, 2]))
Out[7]: True

In [8]: np.allclose(np.ones([5, 1]), np.ones([2]))
Out[8]: True

In [9]: np.allclose(np.ones([5, 1]), np.ones([]))
Out[9]: True

I would expect all of these to raise a ValueError, just like the following does:

In [12]: np.allclose(np.ones([5, 1]), np.ones([12, 34]))
...
ValueError: operands could not be broadcast together with shapes (5,1) (12,34)

The workaround I've used now is to replace np.allclose with xarray's version, which does check the dimensions and shape of the arrays.

I think that's a good decision even if there weren't this bug; we should strive to use xarray versions whenever possible (which we mostly already do anyways).

My impression is that the ideal solution would be if we could provide an argument like data_vars='minimal' to xr.open_mfdataset, but that's not an available option. Does that make sense? Am I missing a way of doing this that already exists?

Is there a way to use the preprocess kwarg to accomplish this? Feels a bit hackish though even if it could work.

@spencerkclark
Collaborator Author

spencerkclark commented Sep 6, 2017

Is there a way to use the preprocess kwarg to accomplish this? Feels a bit hackish though even if it could work.

Since preprocess acts on individual datasets before they are combined, this might be tricky. In looking over the code for xr.open_mfdataset, I might be inclined to write our own custom version (leaving most of the logic the same), using something involving xr.concat / xr.merge in place of xr.auto_combine.
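As a rough sketch of that idea (hypothetical function name and simplified logic; the real open_mfdataset also handles chunking, locking, and file cleanup):

import glob
import xarray as xr

def open_mfdataset_minimal(path_pattern, concat_dim='time', **open_kwargs):
    # Open each file individually, then concatenate with data_vars='minimal'
    # so time-invariant variables like zsurf are not broadcast along time.
    paths = sorted(glob.glob(path_pattern))
    datasets = [xr.open_dataset(p, **open_kwargs) for p in paths]
    return xr.concat(datasets, dim=concat_dim, data_vars='minimal')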

@spencerahill
Owner

I might be inclined to write our own custom version (leaving most of the logic the same)

Probably worth at least asking at the xarray repo or mailing list before rolling our own. I suspect others have/will come across the same problem.

It turns out that if variables are set as coordinates before concatenation
in xr.open_mfdataset then they are not broadcast along the concat_dim.
This solves the underlying issue (so we can optionally revert to the original
logic in calc.py)
@spencerkclark
Collaborator Author

@spencerahill I did a little more investigating -- it turns out we can obtain the desired behavior if we set grid attributes as coordinates before concatenating in xr.open_mfdataset. Compare the following minimal examples:

Concatenation with two variables (original behavior):

import numpy as np
import xarray as xr

for i in range(3):
    da = xr.DataArray(np.ones((2, 3)),
                      dims=['x', 't'],
                      coords=[np.arange(2), np.arange(3 * i, 3 * (i + 1))],
                      name='a')
    coord = xr.DataArray(2. * np.ones(2), dims=['x'], coords=[np.arange(2)], name='area')
    ds = da.to_dataset()
    ds['area'] = coord
    ds.to_netcdf('example-case-{:d}.nc'.format(i))

xr.open_mfdataset('example-case-*.nc') returns:

<xarray.Dataset>
Dimensions:  (t: 9, x: 2)
Coordinates:
  * x        (x) int64 0 1
  * t        (t) int64 0 1 2 3 4 5 6 7 8
Data variables:
    a        (x, t) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    area     (t, x) float64 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 ...

Concatenation with a non-index coordinate (desired / current behavior):

for i in range(3):
    da = xr.DataArray(np.ones((2, 3)),
                      dims=['x', 't'],
                      coords=[np.arange(2), np.arange(3 * i, 3 * (i + 1))],
                      name='a')
    coord = xr.DataArray(2. * np.ones(2), dims=['x'], coords=[np.arange(2)], name='area')
    da['area'] = coord
    da.to_netcdf('example-case-{:d}.nc'.format(i))

xr.open_mfdataset('example-case-*.nc') returns:

<xarray.Dataset>
Dimensions:  (t: 9, x: 2)
Coordinates:
  * x        (x) int64 0 1
    area     (x) float64 2.0 2.0
  * t        (t) int64 0 1 2 3 4 5 6 7 8
Data variables:
    a        (x, t) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...

Setting the time attributes as coordinates before calling data_loader._prep_time_data causes problems, so I hold off on setting those until after it runs; the rest I set in the preprocessing function we pass to xr.open_mfdataset. This solution could still probably use some cleanup, but I think it's a better approach to solving the underlying issue than writing our own open_mfdataset function.
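In rough form, the preprocessing step amounts to the following (a sketch; set_grid_attrs_as_coords and GRID_ATTRS are the real names in aospy's data_loader.py, but the attribute list here is an illustrative subset and the body is simplified):

import xarray as xr

GRID_ATTR_NAMES = ['zsurf', 'land_mask', 'sfc_area']  # illustrative subset

def set_grid_attrs_as_coords(ds):
    """Promote grid attributes present in ds to (non-index) coordinates
    so xr.open_mfdataset does not broadcast them along the concat dim."""
    grid_vars = [name for name in GRID_ATTR_NAMES if name in ds.data_vars]
    return ds.set_coords(grid_vars)

# Passed via the preprocess kwarg, this runs on each file before concatenation:
# ds = xr.open_mfdataset(files, preprocess=set_grid_attrs_as_coords,
#                        concat_dim='time')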

@@ -91,7 +92,11 @@ def set_grid_attrs_as_coords(ds):
    Dataset
        Dataset with grid attributes set as coordinates
    """
    int_names = GRID_ATTRS.keys()
    if not set_time_vars:
@spencerahill
Owner

Slight preference to use the affirmative, i.e. if set_time_vars: as the main if statement, with the else corresponding to not set_time_vars.

@spencerahill
Owner

Thanks @spencerkclark, nice find of a pretty simple workaround.

Just confirming, the only difference between the two examples is:

ds = da.to_dataset()
ds['area'] = coord
ds.to_netcdf('example-case-{:d}.nc'.format(i))

versus

da['area'] = coord
da.to_netcdf('example-case-{:d}.nc'.format(i))

I don't fully understand why this results in the behavior that it does...is this intuitive to you? (I.e. should xarray consider tweaking or clarifying?)

This solution could still probably use some cleanup, but I think it's a better approach to solving the underlying issue than writing our own open_mfdataset function.

It actually looks pretty tight already to me, although as I mentioned I'm not seeing it 100% clearly. Need some tests of course; those might actually help me understand also.

@spencerkclark
Collaborator Author

@spencerahill yes, that is exactly the difference; it seems that xr.auto_combine treats data variables and coordinates differently.

I don't fully understand why this results in the behavior that it does...is this intuitive to you? (I.e. should xarray consider tweaking or clarifying?)

No, this isn't super intuitive to me either; however, it is consistent with the default behavior of xr.concat, which is to have data_vars='all' and coords='different' (and xr.auto_combine calls xr.concat underneath, with only default values for these arguments). Perhaps this could be addressed by passing those arguments down through one's call to xr.open_mfdataset (so that one could have control over them)?

@spencerahill
Owner

Also FYI, it turns out that the np.allclose behavior isn't a bug: the shapes in the examples above are broadcast-compatible, so np.allclose simply compares the arrays after broadcasting.

@spencerahill
Owner

Perhaps this could be addressed by passing those arguments down through one's call to xr.open_mfdataset (so that one could have control over them)?

Yes, that sounds like a good approach.

@spencerkclark
Collaborator Author

Also FYI it turns out that the np.allclose behavior isn't a bug

That thread makes sense; thanks for double checking with the numpy folks on that.

Yes, that sounds like a good approach.

Remarkably it seems someone else is keen on implementing the same thing: pydata/xarray#1580.

@spencerahill
Owner

spencerahill commented Sep 19, 2017

Remarkably it seems someone else is keen on implementing the same thing

Haha go figure!

If that makes it into xarray 0.10, we have the option of waiting for that and then bumping our required xarray version to 0.10. Conversely, we can continue with this PR using your workaround. What do you think?

@spencerkclark
Collaborator Author

If that makes it into xarray 0.10, we have the option of waiting for that and then bumping our required xarray version to 0.10. Conversely, we can continue with this PR using your workaround. What do you think?

I'll go ahead and continue with this PR. If the change does make it into xarray 0.10, then switching over will be simple. (This way we'll be prepared even if it doesn't.)

@spencerahill
Owner

Great, in that case just ping me when you're ready for another review

@spencerkclark
Collaborator Author

@spencerahill great, this should be ready for another review when you get a chance.

@spencerahill
Owner

Looks good! Thanks @spencerkclark

@spencerahill spencerahill merged commit b5f5048 into spencerahill:develop Sep 22, 2017
@spencerkclark spencerkclark deleted the only-compare-coord-if-it-exists branch September 22, 2017 23:51
@spencerahill
Owner

pydata/xarray#1580 has now been merged, although xarray v0.10 in which it will appear has not yet been released.

So if/when we come across something like this again, we should use the new xarray functionality and go ahead and bump the minimum requirement of xarray to >=0.10.
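For reference, a sketch of what that call would look like against xarray >= 0.10 (assuming the kwargs land as merged in pydata/xarray#1580):

import xarray as xr

# data_vars='minimal' keeps time-invariant variables like zsurf from being
# concatenated, and therefore broadcast, along the time dimension.
ds = xr.open_mfdataset('000[4-6]0101.precip_monthly.nc', concat_dim='time',
                       decode_cf=False, data_vars='minimal')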
