Overwrite time array as average of time bounds if time bounds present #200

Merged: 7 commits into develop from overwrite-timestamps, Sep 8, 2017

Conversation

spencerahill (Owner) commented Sep 1, 2017:

Closes #185.

Maybe _average_time_bounds should go in utils.times?

Also still need:
- [ ] More informative message for the NotImplementedError
- [ ] tests
- [ ] what's new

spencerkclark (Collaborator) left a comment:

Thanks for putting this together. I agree; I do think _average_time_bounds should go in utils.times.

If I understand things correctly, I think there might be an issue in the logic for the case when time bounds is a 1D array (see my suggestion below). Other than that, there are just a few minor things.

bounds = ds[TIME_BOUNDS_STR]
if bounds.ndim == 1:
    assert len(bounds) == len(ds[TIME_STR]) + 1
    avg = 0.5 * (bounds[:-1] + bounds[1:])
spencerkclark (Collaborator) commented:

Would this work? I think since bounds is indexed by a 'time_bounds' coordinate, xarray will try to align bounds[:-1] with bounds[1:] so in this case avg would end up being the equivalent of bounds.isel(time_bounds=slice(1, -1)) (which is not the desired result).

I think something like the following is more what we're after?

avg = (0.5 * (bounds.shift(**{TIME_BOUNDS_STR: -1}) + bounds)).dropna(TIME_BOUNDS_STR)

E.g.

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: example = xr.DataArray(np.arange(5), coords=[np.arange(5)], dims=['time_bounds'])

In [4]: example
Out[4]:
<xarray.DataArray (time_bounds: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
  * time_bounds  (time_bounds) int64 0 1 2 3 4

In [5]: 0.5 * (example[1:] + example[:-1])
Out[5]:
<xarray.DataArray (time_bounds: 3)>
array([ 1.,  2.,  3.])
Coordinates:
  * time_bounds  (time_bounds) int64 1 2 3

In [6]: example.isel(time_bounds=slice(1, -1))
Out[6]:
<xarray.DataArray (time_bounds: 3)>
array([1, 2, 3])
Coordinates:
  * time_bounds  (time_bounds) int64 1 2 3

In [7]: (0.5 * (example.shift(time_bounds=-1) + example)).dropna('time_bounds')
Out[7]:
<xarray.DataArray (time_bounds: 4)>
array([ 0.5,  1.5,  2.5,  3.5])
Coordinates:
  * time_bounds  (time_bounds) int64 0 1 2 3

spencerahill (Owner, Author) commented:

You're right; for some reason I was thinking about these as if they were just numpy arrays. Good catch!
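The numpy intuition is understandable, since with plain numpy arrays the slice-and-add expression really does give midpoints; it is xarray's automatic label alignment that changes the meaning. A minimal numpy-only sketch (bound values invented for illustration):

```python
import numpy as np

# Hypothetical 1D bounds array: edges of three consecutive intervals.
bounds = np.array([0.0, 31.0, 59.0, 90.0])

# numpy slicing is purely positional, so this yields the interval midpoints.
avg = 0.5 * (bounds[:-1] + bounds[1:])
print(avg)  # [15.5 45.  74.5]
```

In xarray, the same expression first aligns `bounds[:-1]` with `bounds[1:]` on their shared coordinate labels, which is what produces the unintended `isel(time_bounds=slice(1, -1))` result described above.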

    assert len(bounds) == len(ds[TIME_STR]) + 1
    avg = 0.5 * (bounds[:-1] + bounds[1:])
elif bounds.ndim == 2:
    assert set([2, len(ds[TIME_STR])]) == set(bounds.shape)
spencerkclark (Collaborator) commented:

I think

assert bounds.sizes == {TIME_STR: ds.sizes[TIME_STR], BOUNDS_STR: 2}

might be slightly more idiomatic xarray (because it takes into account the dimension label-size correspondence).

spencerkclark (Collaborator) followed up:

Actually this should be:

assert dict(bounds.sizes) == {TIME_STR: ds.sizes[TIME_STR], BOUNDS_STR: 2}

because bounds.sizes is an OrderedDict (so we should cast it as an ordinary dict first).
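To illustrate why the mapping-based check is stronger than comparing shape sets, a stdlib-only sketch with made-up dimension names and sizes:

```python
from collections import OrderedDict

# Hypothetical: the time axis happens to have length 2, the same as the
# bounds dimension, so sizes alone cannot identify the dimensions.
sizes = OrderedDict([('time', 2), ('bounds', 2)])

# A shape-set check cannot tell which size belongs to which dimension:
assert set(sizes.values()) == {2}

# The mapping comparison ties each size to its dimension label, and
# casting to a plain dict keeps the comparison straightforward:
assert dict(sizes) == {'time': 2, 'bounds': 2}
```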

spencerahill (Owner, Author) commented:

@spencerkclark thanks for the review. All of your suggestions are good; will implement.

"""
bounds = ds[TIME_BOUNDS_STR]
if bounds.ndim == 1:
    assert len(bounds) == len(ds[TIME_STR]) + 1
spencerkclark (Collaborator) commented:

If this (or the other assert statement below) raises an AssertionError, should we provide a more informative error description?

spencerahill (Owner, Author) commented:

I now realize that the 1-D time_bounds use case has never actually arisen for me. I was conflating it with the same behavior in AM2.1, but for lat and lon bounds:

/archive/s1h/am2/am2clim_reyoi/gfdl.ncrc3-intel-prod/pp/atmos/ts/monthly/30yr/$ ncdump -h atmos.198301-201212.t_surf.nc
netcdf atmos.198301-201212.t_surf {
dimensions:
	time = UNLIMITED ; // (360 currently)
	lat = 90 ;
	latb = 91 ;
	lon = 144 ;
	lonb = 145 ;
	nv = 2 ;
variables:
	double average_DT(time) ;
		average_DT:long_name = "Length of average period" ;
		average_DT:units = "days" ;
		average_DT:missing_value = 1.e+20 ;
		average_DT:_FillValue = 1.e+20 ;
	double average_T1(time) ;
		average_T1:long_name = "Start time for average period" ;
		average_T1:units = "days since 1982-01-01 00:00:00" ;
		average_T1:missing_value = 1.e+20 ;
		average_T1:_FillValue = 1.e+20 ;
	double average_T2(time) ;
		average_T2:long_name = "End time for average period" ;
		average_T2:units = "days since 1982-01-01 00:00:00" ;
		average_T2:missing_value = 1.e+20 ;
		average_T2:_FillValue = 1.e+20 ;
	double lat(lat) ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_N" ;
		lat:cartesian_axis = "Y" ;
		lat:edges = "latb" ;
	double latb(latb) ;
		latb:long_name = "latitude edges" ;
		latb:units = "degrees_N" ;
		latb:cartesian_axis = "Y" ;
	double lon(lon) ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_E" ;
		lon:cartesian_axis = "X" ;
		lon:edges = "lonb" ;
	double lonb(lonb) ;
		lonb:long_name = "longitude edges" ;
		lonb:units = "degrees_E" ;
		lonb:cartesian_axis = "X" ;
	double nv(nv) ;
		nv:long_name = "vertex number" ;
		nv:units = "none" ;
		nv:cartesian_axis = "N" ;
	float t_surf(time, lat, lon) ;
		t_surf:long_name = "surface temperature" ;
		t_surf:units = "deg_k" ;
		t_surf:valid_range = 100.f, 400.f ;
		t_surf:missing_value = 1.e+20f ;
		t_surf:_FillValue = 1.e+20f ;
		t_surf:cell_methods = "time: mean" ;
		t_surf:time_avg_info = "average_T1,average_T2,average_DT" ;
	double time(time) ;
		time:long_name = "time" ;
		time:units = "days since 1982-01-01 00:00:00" ;
		time:cartesian_axis = "T" ;
		time:calendar_type = "NOLEAP" ;
		time:calendar = "NOLEAP" ;
		time:bounds = "time_bounds" ;
	double time_bounds(time, nv) ;
		time_bounds:long_name = "time axis boundaries" ;
		time_bounds:units = "days" ;
		time_bounds:missing_value = 1.e+20 ;
		time_bounds:_FillValue = 1.e+20 ;

So, unless somebody else has such a use case, I'm inclined to remove this logic, and only accept bounds arrays with shape (len(times), 2). That makes the function simpler and the testing much easier. @spencerkclark what do you think?

spencerkclark (Collaborator) commented:

So, unless somebody else has such a use case, I'm inclined to remove this logic, and only accept bounds arrays with shape (len(times), 2). That makes the function simpler and the testing much easier.

I agree; I also have not encountered 1D time bounds, so this sounds like a good idea.

spencerahill (Owner, Author) commented:

Great. Also FWIW, CF conventions only permit the 2D format:

To represent cells we add the attribute bounds to the appropriate coordinate variable(s). The value of bounds is the name of the variable that contains the vertices of the cell boundaries. We refer to this type of variable as a "boundary variable." A boundary variable will have one more dimension than its associated coordinate or auxiliary coordinate variable. The additional dimension should be the most rapidly varying one, and its size is the maximum number of cell vertices.
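As a concrete sketch of the CF-sanctioned layout (values invented for illustration): a boundary variable for a length-3 time axis has shape (3, 2), with the vertex dimension varying fastest, and the time array is recovered by averaging over that extra dimension:

```python
import numpy as np

# Invented monthly-mean bounds, in units like "days since 1982-01-01":
# one (start, end) pair per time step, vertex dimension last per CF.
time_bounds = np.array([[0.0, 31.0],
                        [31.0, 59.0],
                        [59.0, 90.0]])

# Averaging over the vertex axis yields the midpoint timestamps.
new_times = time_bounds.mean(axis=1)
print(new_times)  # [15.5 45.  74.5]
```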

spencerahill (Owner, Author) commented:

CF conventions only permit the 2D format

Given this, I'm wondering if we even need this check at all. It's hard to imagine an input dataset wherein the time bounds and time arrays are different lengths. So I'm inclined to just remove the check entirely.

(Motivated by coveralls tipping me off that my test doesn't cover this mis-shaped case.)

spencerkclark (Collaborator) left a comment:

@spencerahill thanks for taking the time to switch test_utils_times.py to use pytest! I have one optional suggestion that could simplify things a bit more, but it's totally up to you.

Regarding checking the shape of time_bounds; I agree it is probably not necessary.

len(ds[TIME_STR]))
raise ValueError(msg)

start_bounds = bounds.isel(**{BOUNDS_STR: 0})
spencerkclark (Collaborator) commented Sep 8, 2017:

The function you've written here is nice, because it can be applied to both dates and floats. That said, we only use this function on floats (i.e., before we decode times), so optionally the logic (and tests) could be simplified slightly:

def average_time_bounds(ds):
    bounds = ds[TIME_BOUNDS_STR]
    new_times = bounds.mean(dim=BOUNDS_STR, keep_attrs=True).drop('time').rename(TIME_STR)
    new_times[TIME_STR] = new_times
    return new_times

spencerkclark (Collaborator) followed up:

Note the slight edit above (the addition of .drop)

ds = times.ensure_time_avg_has_cf_metadata(ds)
ds[TIME_STR].values = times.average_time_bounds(ds).values
spencerkclark (Collaborator) commented:

If you went with my suggestion for the simpler method below, I think you could simplify this to:

ds[TIME_STR] = times.average_time_bounds(ds)

@@ -530,5 +534,29 @@ def test_yearly_average_masked_data():
xr.testing.assert_allclose(actual, desired)


if __name__ == '__main__':
sys.exit(unittest.main())
def test_average_time_bounds():
spencerkclark (Collaborator) commented:

Regardless of whether or not you switch to a float-only method, could you add a test for the use case where floats are used rather than dates, since that is the use case in data_loader.py?

spencerahill (Owner, Author) commented:

we only use this function on floats (i.e. before we decode times)

Wow, you're absolutely right. Total mistake on my part. Lesson: it pays to stop and think before you start hacking away.

Thanks for catching that. I implemented your suggested replacement, and it seems to work; I'll push when tests are good.

spencerahill (Owner, Author) commented:

I took the liberty of creating a pytest.fixture for the new test and rewriting one existing test, test_ensure_time_avg_has_cf_metadata, to use it. There's room for further improvement using fixtures in this module, if/when somebody has time/inclination.

spencerahill (Owner, Author) commented:

We're green. @spencerkclark I'll wait for your final approval, given how many careless mistakes I've made so far in this PR! 😝

spencerkclark (Collaborator) commented:

Thanks, this looks good to me!

@spencerahill spencerahill merged commit 94b6b9c into develop Sep 8, 2017
@spencerahill spencerahill deleted the overwrite-timestamps branch September 8, 2017 20:01