
Relax logic for comparing grid attributes between Run and Model objs #199

Conversation

spencerkclark
Collaborator

Fixes #187

The way I tested this was by adding a grid attribute ('zsurf') of all zeros to the files used in some test Run objects:

  • 00040101.precip_monthly.nc
  • 00050101.precip_monthly.nc
  • 00060101.precip_monthly.nc

but leaving it off the files used in constructing their Model object:

  • 00060101.sphum_monthly.nc

The tests do not pass without the one-line change I made in calc.py.
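Roughly speaking, the relaxed check amounts to preferring the Model's copy of a grid attribute when it has one and otherwise falling back to the Run's. A minimal sketch with hypothetical names (the real change is a single line inside Calc._add_grid_attributes):

def reconcile_grid_attr(run_attr, model_attr):
    """Use the Model's grid attribute when it exists; otherwise fall
    back to the values found only in the Run, instead of raising."""
    if model_attr is None:
        # Attribute exists only in the Run; use it as-is.
        return run_attr
    return model_attr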

Fixes spencerahill#187

Previously an error would be raised if a grid attribute found in a Run
object did not exist in the Run's corresponding Model.  This allows
aospy to use grid attributes found only in Run objects.
@spencerahill
Owner

Gotta love a one-line fix! Any idea what's going on in the tests with the Notebook? Otherwise this looks good to me.

@spencerkclark
Collaborator Author

@spencerahill I need to run now, but it seems that the notebook exposed a valid use-case that was problematic. My commit fixes things, but I don't think it is the ideal solution. I'll try and lay out the problem in more detail tomorrow so we can discuss the options.

@spencerahill
Owner

Sounds good. No rush.

@spencerkclark
Collaborator Author

spencerkclark commented Sep 2, 2017

The notebook sets up some aospy objects in the following manner:

...
from aospy.data_loader import DictDataLoader
file_map = {'monthly': os.path.join(rootdir, '000[4-6]0101.precip_monthly.nc')}
data_loader = DictDataLoader(file_map)

from aospy import Run
example_run = Run(
    name='example_run',
    description='Control simulation of the idealized moist model',
    data_loader=data_loader
)

from aospy import Model
example_model = Model(
    name='example_model',
    grid_file_paths=(
        os.path.join(rootdir, '00040101.precip_monthly.nc'),
        os.path.join(rootdir, 'im.landmask.nc')
    ),
    runs=[example_run]  # only one Run in our case, but could be more
)
...

As of this PR each *precip_monthly.nc file now contains a 2D zsurf variable:

In [1]: import xarray as xr

In [2]: single_file = xr.open_dataset('00040101.precip_monthly.nc', decode_cf=False)

In [3]: single_file.zsurf
Out[3]:
<xarray.DataArray 'zsurf' (lat: 64, lon: 128)>
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]], dtype=float32)
Coordinates:
  * lon      (lon) float64 0.0 2.812 5.625 8.438 11.25 14.06 16.88 19.69 ...
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 -76.74 -73.95 -71.16 ...
Attributes:
    _FillValue:     1e+20
    units:          m
    long_name:      Surface height
    time_avg_info:  average_T1,average_T2,average_DT
    cell_methods:   time: mean

The issue stems from how concatenation proceeds using xr.open_mfdataset in data_loader.py:

aospy/aospy/data_loader.py, lines 185 to 190 (at ce7c784):

apply_preload_user_commands(file_set)
func = _preprocess_and_rename_grid_attrs(preprocess_func, **kwargs)
return xr.open_mfdataset(file_set, preprocess=func,
                         concat_dim=internal_names.TIME_STR,
                         decode_times=False, decode_coords=False,
                         mask_and_scale=True)

When the three files are concatenated together using xr.open_mfdataset, a time dimension is added to zsurf:

In [4]: ds = xr.open_mfdataset('000[4-6]0101.precip_monthly.nc', decode_cf=False, concat_dim='time').load()

In [5]: ds.zsurf
Out[5]:
<xarray.DataArray 'zsurf' (time: 36, lat: 64, lon: 128)>
array([[[ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.],
        ...,
        [ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.]],

       [[ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.],
        ...,
        [ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.]],

       ...,
       [[ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.],
        ...,
        [ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.]],

       [[ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.],
        ...,
        [ 0.,  0., ...,  0.,  0.],
        [ 0.,  0., ...,  0.,  0.]]], dtype=float32)
Coordinates:
  * lon      (lon) float64 0.0 2.812 5.625 8.438 11.25 14.06 16.88 19.69 ...
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 -76.74 -73.95 -71.16 ...
  * time     (time) float64 1.111e+03 1.139e+03 1.17e+03 1.2e+03 1.231e+03 ...
Attributes:
    _FillValue:     1e+20
    units:          m
    long_name:      Surface height
    time_avg_info:  average_T1,average_T2,average_DT
    cell_methods:   time: mean

Note that three files are specified in the Run object created in the notebook, and one *precip_monthly.nc file is specified in the Model object. So the zsurf grid attribute in the Run object has a time dimension, but the zsurf grid attribute in the Model object does not. When these reach the Calc._add_grid_attributes step, they are compared (the key line is 283):

aospy/aospy/calc.py, lines 282 to 295 (at ce7c784):

if not np.array_equal(ds[name_int], model_attr):
    if np.allclose(ds[name_int], model_attr):
        msg = ("Values for '{0}' are nearly (but not exactly) "
               "the same in the Run {1} and the Model {2}. "
               "Therefore replacing Run's values with the "
               "model's.".format(name_int, self.run,
                                 self.model))
        logging.info(msg)
        ds[name_int].values = model_attr.values
    else:
        msg = ("Model coordinates for '{0}' do not match those"
               " in Run: {1} vs. {2}"
               "".format(name_int, ds[name_int], model_attr))
        logging.info(msg)

For some reason, np.allclose returns True in this case, even though the arrays have differing shapes:

In [6]: import numpy as np

In [7]: np.allclose(single_file.zsurf, ds.zsurf)
Out[7]: True

So when assignment is attempted on line 290, it fails because the arrays have different shapes. The workaround I've used for now is to replace np.allclose with xarray's version, which does check the dimensions and shapes of the arrays but works as an assertion statement (rather than returning True or False). Then assignment is no longer attempted, and everything works.
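For illustration, a boolean wrapper around xarray's assertion-style comparison could look like this (a sketch assuming xr.testing.assert_allclose; the actual one-line change is in calc.py):

import xarray as xr

def shape_aware_allclose(run_attr, model_attr):
    """Closeness check that respects dimensions and shapes, returning a
    bool instead of raising, unlike np.allclose's broadcasting behavior."""
    try:
        xr.testing.assert_allclose(run_attr, model_attr)
    except AssertionError:
        return False
    return True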

This workaround is OK, but it does not address the underlying issue, which is that zsurf should not acquire a time dimension in the first place. I'm not aware of a way to prevent this with xr.open_mfdataset; with xr.concat this can be done by using the data_vars='minimal' keyword argument:

In [8]: xr.concat([xr.open_dataset('000{}0101.precip_monthly.nc'.format(year)) for year in [4, 5, 6]], dim='time', data_vars='minimal').zsurf
/nbhome/skc/miniconda3/envs/research/lib/python3.6/site-packages/xarray/conventions.py:393: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
  result = decode_cf_datetime(example_value, units, calendar)
/nbhome/skc/miniconda3/envs/research/lib/python3.6/site-packages/xarray/conventions.py:412: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
  calendar=self.calendar)
Out[8]:
<xarray.DataArray 'zsurf' (lat: 64, lon: 128)>
array([[ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       ...,
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ...,  0.,  0.,  0.]])
Coordinates:
  * lon      (lon) float64 0.0 2.812 5.625 8.438 11.25 14.06 16.88 19.69 ...
  * lat      (lat) float64 -87.86 -85.1 -82.31 -79.53 -76.74 -73.95 -71.16 ...
Attributes:
    units:          m
    long_name:      Surface height
    time_avg_info:  average_T1,average_T2,average_DT
    cell_methods:   time: mean

My impression is that the ideal solution would be if we could provide an argument like data_vars='minimal' to xr.open_mfdataset, but that's not an available option. Does that make sense? Am I missing a way of doing this that already exists?

@spencerahill
Owner

Thanks for digging into this.

For some reason, np.allclose returns True in this case, even though the arrays have differing shapes

I stumbled upon this myself in #181 (although I'm not sure I made a written note of it). This sure seems like a bug to me: the allclose documentation only gives examples of same-shaped arrays, and there's an open issue on the NumPy tracker about basically the same thing. But it's >2 years old with no movement, so we might consider giving it a ping.

Here's the MWE I came up with, including the numpy version to show the behavior still exists in the latest release:

In [2]: np.version.version
Out[2]: '1.13.1'

In [5]: np.allclose(np.ones([5, 1]), np.ones([1]))
Out[5]: True

And there's some further odd behavior:

In [7]: np.allclose(np.ones([5, 1]), np.ones([5, 2]))
Out[7]: True

In [8]: np.allclose(np.ones([5, 1]), np.ones([2]))
Out[8]: True

In [9]: np.allclose(np.ones([5, 1]), np.ones([]))
Out[9]: True

I would expect all of these to raise a ValueError, just like the following does:

In [12]: np.allclose(np.ones([5, 1]), np.ones([12, 34]))
...
ValueError: operands could not be broadcast together with shapes (5,1) (12,34)

The workaround I've used now is to replace np.allclose with xarray's version, which does check the dimensions and shape of the arrays.

I think that's a good decision even if there weren't this bug; we should strive to use xarray versions whenever possible (which we mostly already do anyways).

My impression is that the ideal solution would be if we could provide an argument like data_vars='minimal' to xr.open_mfdataset, but that's not an available option. Does that make sense? Am I missing a way of doing this that already exists?

Is there a way to use the preprocess kwarg to accomplish this? Feels a bit hackish though even if it could work.

@spencerkclark
Collaborator Author

spencerkclark commented Sep 6, 2017

Is there a way to use the preprocess kwarg to accomplish this? Feels a bit hackish though even if it could work.

Since preprocess acts on individual datasets before they are combined, this might be tricky. In looking over the code for xr.open_mfdataset, I might be inclined to write our own custom version (leaving most of the logic the same), using something involving xr.concat / xr.merge in place of xr.auto_combine.
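As a rough sketch of that idea (hypothetical function name and simplified logic; the real open_mfdataset also handles chunking, locking, and file cleanup):

import glob
import xarray as xr

def open_mfdataset_minimal(path_pattern, concat_dim='time', **open_kwargs):
    # Open each file individually, then concatenate with data_vars='minimal'
    # so time-invariant variables like zsurf are not broadcast along time.
    paths = sorted(glob.glob(path_pattern))
    datasets = [xr.open_dataset(p, **open_kwargs) for p in paths]
    return xr.concat(datasets, dim=concat_dim, data_vars='minimal')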

@spencerahill
Owner

I might be inclined to write our own custom version (leaving most of the logic the same)

Probably worth at least asking at the xarray repo or mailing list before rolling our own. I suspect others have/will come across the same problem.

It turns out that if variables are set as coordinates before concatenation
in xr.open_mfdataset then they are not broadcast along the concat_dim.
This solves the underlying issue (so we can optionally revert to the original
logic in calc.py)
@spencerkclark
Collaborator Author

@spencerahill I did a little more investigating -- it turns out we can obtain the desired behavior if we set grid attributes as coordinates before concatenating in xr.open_mfdataset. Compare the following minimal examples:

Concatenation with two variables (original behavior):

import numpy as np
import xarray as xr

for i in range(3):
    da = xr.DataArray(np.ones((2, 3)),
                      dims=['x', 't'],
                      coords=[np.arange(2), np.arange(3 * i, 3 * (i + 1))],
                      name='a')
    coord = xr.DataArray(2. * np.ones(2), dims=['x'], coords=[np.arange(2)], name='area')
    ds = da.to_dataset()
    ds['area'] = coord
    ds.to_netcdf('example-case-{:d}.nc'.format(i))

xr.open_mfdataset('example-case-*.nc') returns:

<xarray.Dataset>
Dimensions:  (t: 9, x: 2)
Coordinates:
  * x        (x) int64 0 1
  * t        (t) int64 0 1 2 3 4 5 6 7 8
Data variables:
    a        (x, t) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...
    area     (t, x) float64 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 2.0 ...

Concatenation with a non-index coordinate (desired / current behavior):

for i in range(3):
    da = xr.DataArray(np.ones((2, 3)),
                      dims=['x', 't'],
                      coords=[np.arange(2), np.arange(3 * i, 3 * (i + 1))],
                      name='a')
    coord = xr.DataArray(2. * np.ones(2), dims=['x'], coords=[np.arange(2)], name='area')
    da['area'] = coord
    da.to_netcdf('example-case-{:d}.nc'.format(i))

xr.open_mfdataset('example-case-*.nc') returns:

<xarray.Dataset>
Dimensions:  (t: 9, x: 2)
Coordinates:
  * x        (x) int64 0 1
    area     (x) float64 2.0 2.0
  * t        (t) int64 0 1 2 3 4 5 6 7 8
Data variables:
    a        (x, t) float64 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 ...

Setting the time attributes as coordinates before calling data_loader._prep_time_data causes problems, so I hold off on setting those until after it runs; the rest I set in the preprocessing function we pass to xr.open_mfdataset. This solution could still probably use some cleanup, but I think it's a better approach to solving the underlying issue than writing our own open_mfdataset function.
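In rough form, the preprocessing step amounts to the following (a sketch; set_grid_attrs_as_coords and GRID_ATTRS are the real names in aospy's data_loader.py, but the attribute list here is an illustrative subset and the body is simplified):

import xarray as xr

GRID_ATTR_NAMES = ['zsurf', 'land_mask', 'sfc_area']  # illustrative subset

def set_grid_attrs_as_coords(ds):
    """Promote grid attributes present in ds to (non-index) coordinates
    so xr.open_mfdataset does not broadcast them along the concat dim."""
    grid_vars = [name for name in GRID_ATTR_NAMES if name in ds.data_vars]
    return ds.set_coords(grid_vars)

# Passed via the preprocess kwarg, this runs on each file before concatenation:
# ds = xr.open_mfdataset(files, preprocess=set_grid_attrs_as_coords,
#                        concat_dim='time')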

@@ -91,7 +92,11 @@ def set_grid_attrs_as_coords(ds):
    Dataset
        Dataset with grid attributes set as coordinates
    """
    int_names = GRID_ATTRS.keys()
    if not set_time_vars:
@spencerahill
Owner

Slight preference to use the affirmative, i.e. if set_time_vars: as the main if statement, with the else corresponding to not set_time_vars.

@spencerahill
Owner

Thanks @spencerkclark, nice find of a pretty simple workaround.

Just confirming, the only difference between the two examples is:

ds = da.to_dataset()
ds['area'] = coord
ds.to_netcdf('example-case-{:d}.nc'.format(i))

versus

da['area'] = coord
da.to_netcdf('example-case-{:d}.nc'.format(i))

I don't fully understand why this results in the behavior that it does...is this intuitive to you? (I.e. should xarray consider tweaking or clarifying?)

This solution could still probably use some cleanup, but I think it's a better approach to solving the underlying issue than writing our own open_mfdataset function.

It actually looks pretty tight already to me, although as I mentioned I'm not seeing it 100% clearly. Need some tests of course; those might actually help me understand also.

@spencerkclark
Collaborator Author

@spencerahill yes, that is exactly the difference; it seems that xr.auto_combine treats data variables and coordinates differently.

I don't fully understand why this results in the behavior that it does...is this intuitive to you? (I.e. should xarray consider tweaking or clarifying?)

No, this isn't super intuitive to me either; however, it is consistent with the default behavior of xr.concat, which is to have data_vars='all' and coords='different' (and xr.auto_combine calls xr.concat underneath, with only default values for these arguments). Perhaps this could be addressed by passing those arguments down through one's call to xr.open_mfdataset (so that one could have control over them)?

@spencerahill
Owner

Also FYI, it turns out that the np.allclose behavior isn't a bug: the shapes in the examples above are broadcast-compatible, so np.allclose simply compares the arrays after broadcasting.

@spencerahill
Owner

Perhaps this could be addressed by passing those arguments down through one's call to xr.open_mfdataset (so that one could have control over them)?

Yes, that sounds like a good approach.

@spencerkclark
Collaborator Author

Also FYI it turns out that the np.allclose behavior isn't a bug

That thread makes sense; thanks for double checking with the numpy folks on that.

Yes, that sounds like a good approach.

Remarkably it seems someone else is keen on implementing the same thing: pydata/xarray#1580.

@spencerahill
Owner

spencerahill commented Sep 19, 2017

Remarkably it seems someone else is keen on implementing the same thing

Haha go figure!

If that makes it into xarray 0.10, we have the option of waiting for that and then bumping our required xarray version to 0.10. Conversely, we can continue with this PR using your workaround. What do you think?

@spencerkclark
Collaborator Author

If that makes it into xarray 0.10, we have the option of waiting for that and then bumping our required xarray version to 0.10. Conversely, we can continue with this PR using your workaround. What do you think?

I'll go ahead and continue with this PR. If the change does make it into xarray 0.10, then switching over will be simple. (This way we'll be prepared even if it doesn't.)

@spencerahill
Owner

Great, in that case just ping me when you're ready for another review

@spencerkclark
Collaborator Author

@spencerahill great, this should be ready for another review when you get a chance.

@spencerahill
Owner

Looks good! Thanks @spencerkclark

@spencerahill spencerahill merged commit b5f5048 into spencerahill:develop Sep 22, 2017
@spencerkclark spencerkclark deleted the only-compare-coord-if-it-exists branch September 22, 2017 23:51
@spencerahill
Owner

pydata/xarray#1580 has now been merged, although xarray v0.10 in which it will appear has not yet been released.

So if/when we come across something like this again, we should use the new xarray functionality and go ahead and bump the minimum requirement of xarray to >=0.10.
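For reference, a sketch of what that call would look like against xarray >= 0.10 (assuming the kwargs land as merged in pydata/xarray#1580):

import xarray as xr

# data_vars='minimal' keeps time-invariant variables like zsurf from being
# concatenated, and therefore broadcast, along the time dimension.
ds = xr.open_mfdataset('000[4-6]0101.precip_monthly.nc', concat_dim='time',
                       decode_cf=False, data_vars='minimal')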
