Overwrite time array as average of time bounds if time bounds present #200

Merged: 7 commits into develop from overwrite-timestamps, Sep 8, 2017

Conversation

spencerahill (Owner) commented Sep 1, 2017:

Closes #185.

Maybe _average_time_bounds should go in utils.times?

Also still need:
- [ ] More informative message for the NotImplementedError
- [ ] tests
- [ ] what's new

spencerkclark (Collaborator) left a comment:

Thanks for putting this together. I agree; I do think _average_time_bounds should go in utils.times.

If I understand things correctly, I think there might be an issue in the logic for the case when time bounds is a 1D array (see my suggestion below). Other than that, there are just a few minor things.

bounds = ds[TIME_BOUNDS_STR]
if bounds.ndim == 1:
    assert len(bounds) == len(ds[TIME_STR]) + 1
    avg = 0.5 * (bounds[:-1] + bounds[1:])
spencerkclark (Collaborator) commented:

Would this work? I think since bounds is indexed by a 'time_bounds' coordinate, xarray will try to align bounds[:-1] with bounds[1:] so in this case avg would end up being the equivalent of bounds.isel(time_bounds=slice(1, -1)) (which is not the desired result).

I think something like the following is more what we're after?

avg = (0.5 * (bounds.shift(**{TIME_BOUNDS_STR: -1}) + bounds)).dropna(TIME_BOUNDS_STR)

E.g.

In [1]: import numpy as np

In [2]: import xarray as xr

In [3]: example = xr.DataArray(np.arange(5), coords=[np.arange(5)], dims=['time_bounds'])

In [4]: example
Out[4]:
<xarray.DataArray (time_bounds: 5)>
array([0, 1, 2, 3, 4])
Coordinates:
  * time_bounds  (time_bounds) int64 0 1 2 3 4

In [5]: 0.5 * (example[1:] + example[:-1])
Out[5]:
<xarray.DataArray (time_bounds: 3)>
array([ 1.,  2.,  3.])
Coordinates:
  * time_bounds  (time_bounds) int64 1 2 3

In [6]: example.isel(time_bounds=slice(1, -1))
Out[6]:
<xarray.DataArray (time_bounds: 3)>
array([1, 2, 3])
Coordinates:
  * time_bounds  (time_bounds) int64 1 2 3

In [7]: (0.5 * (example.shift(time_bounds=-1) + example)).dropna('time_bounds')
Out[7]:
<xarray.DataArray (time_bounds: 4)>
array([ 0.5,  1.5,  2.5,  3.5])
Coordinates:
  * time_bounds  (time_bounds) int64 0 1 2 3

spencerahill (Owner, Author) commented:

You're right; for some reason I was thinking about these as if they were just numpy arrays. Good catch!
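The numpy intuition is understandable, since with plain numpy arrays the slice-and-add expression really does give midpoints; it is xarray's automatic label alignment that changes the meaning. A minimal numpy-only sketch (bound values invented for illustration):

```python
import numpy as np

# Hypothetical 1D bounds array: edges of three consecutive intervals.
bounds = np.array([0.0, 31.0, 59.0, 90.0])

# numpy slicing is purely positional, so this yields the interval midpoints.
avg = 0.5 * (bounds[:-1] + bounds[1:])
print(avg)  # [15.5 45.  74.5]
```

In xarray, the same expression first aligns `bounds[:-1]` with `bounds[1:]` on their shared coordinate labels, which is what produces the unintended `isel(time_bounds=slice(1, -1))` result described above.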

    assert len(bounds) == len(ds[TIME_STR]) + 1
    avg = 0.5 * (bounds[:-1] + bounds[1:])
elif bounds.ndim == 2:
    assert set([2, len(ds[TIME_STR])]) == set(bounds.shape)
spencerkclark (Collaborator) commented:

I think

assert bounds.sizes == {TIME_STR: ds.sizes[TIME_STR], BOUNDS_STR: 2}

might be slightly more idiomatic xarray (because it takes into account the dimension label-size correspondence).

spencerkclark (Collaborator) followed up:

Actually this should be:

assert dict(bounds.sizes) == {TIME_STR: ds.sizes[TIME_STR], BOUNDS_STR: 2}

because bounds.sizes is an OrderedDict (so we should cast it as an ordinary dict first).
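To illustrate why the mapping-based check is stronger than comparing shape sets, a stdlib-only sketch with made-up dimension names and sizes:

```python
from collections import OrderedDict

# Hypothetical: the time axis happens to have length 2, the same as the
# bounds dimension, so sizes alone cannot identify the dimensions.
sizes = OrderedDict([('time', 2), ('bounds', 2)])

# A shape-set check cannot tell which size belongs to which dimension:
assert set(sizes.values()) == {2}

# The mapping comparison ties each size to its dimension label, and
# casting to a plain dict keeps the comparison straightforward:
assert dict(sizes) == {'time': 2, 'bounds': 2}
```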

spencerahill (Owner, Author) commented:

@spencerkclark thanks for the review. All of your suggestions are good; will implement.

"""
bounds = ds[TIME_BOUNDS_STR]
if bounds.ndim == 1:
    assert len(bounds) == len(ds[TIME_STR]) + 1
spencerkclark (Collaborator) commented:

If this (or the other assert statement below) raises an AssertionError, should we provide a more informative error description?

spencerahill (Owner, Author) commented:

I now realize that the 1-D time_bounds use case has never actually arisen for me. I was conflating it with the same behavior in AM2.1, but for lat and lon bounds:

/archive/s1h/am2/am2clim_reyoi/gfdl.ncrc3-intel-prod/pp/atmos/ts/monthly/30yr/$ ncdump -h atmos.198301-201212.t_surf.nc
netcdf atmos.198301-201212.t_surf {
dimensions:
	time = UNLIMITED ; // (360 currently)
	lat = 90 ;
	latb = 91 ;
	lon = 144 ;
	lonb = 145 ;
	nv = 2 ;
variables:
	double average_DT(time) ;
		average_DT:long_name = "Length of average period" ;
		average_DT:units = "days" ;
		average_DT:missing_value = 1.e+20 ;
		average_DT:_FillValue = 1.e+20 ;
	double average_T1(time) ;
		average_T1:long_name = "Start time for average period" ;
		average_T1:units = "days since 1982-01-01 00:00:00" ;
		average_T1:missing_value = 1.e+20 ;
		average_T1:_FillValue = 1.e+20 ;
	double average_T2(time) ;
		average_T2:long_name = "End time for average period" ;
		average_T2:units = "days since 1982-01-01 00:00:00" ;
		average_T2:missing_value = 1.e+20 ;
		average_T2:_FillValue = 1.e+20 ;
	double lat(lat) ;
		lat:long_name = "latitude" ;
		lat:units = "degrees_N" ;
		lat:cartesian_axis = "Y" ;
		lat:edges = "latb" ;
	double latb(latb) ;
		latb:long_name = "latitude edges" ;
		latb:units = "degrees_N" ;
		latb:cartesian_axis = "Y" ;
	double lon(lon) ;
		lon:long_name = "longitude" ;
		lon:units = "degrees_E" ;
		lon:cartesian_axis = "X" ;
		lon:edges = "lonb" ;
	double lonb(lonb) ;
		lonb:long_name = "longitude edges" ;
		lonb:units = "degrees_E" ;
		lonb:cartesian_axis = "X" ;
	double nv(nv) ;
		nv:long_name = "vertex number" ;
		nv:units = "none" ;
		nv:cartesian_axis = "N" ;
	float t_surf(time, lat, lon) ;
		t_surf:long_name = "surface temperature" ;
		t_surf:units = "deg_k" ;
		t_surf:valid_range = 100.f, 400.f ;
		t_surf:missing_value = 1.e+20f ;
		t_surf:_FillValue = 1.e+20f ;
		t_surf:cell_methods = "time: mean" ;
		t_surf:time_avg_info = "average_T1,average_T2,average_DT" ;
	double time(time) ;
		time:long_name = "time" ;
		time:units = "days since 1982-01-01 00:00:00" ;
		time:cartesian_axis = "T" ;
		time:calendar_type = "NOLEAP" ;
		time:calendar = "NOLEAP" ;
		time:bounds = "time_bounds" ;
	double time_bounds(time, nv) ;
		time_bounds:long_name = "time axis boundaries" ;
		time_bounds:units = "days" ;
		time_bounds:missing_value = 1.e+20 ;
		time_bounds:_FillValue = 1.e+20 ;

So, unless somebody else has such a use case, I'm inclined to remove this logic, and only accept bounds arrays with shape (len(times), 2). That makes the function simpler and the testing much easier. @spencerkclark what do you think?

spencerkclark (Collaborator) commented:

So, unless somebody else has such a use case, I'm inclined to remove this logic, and only accept bounds arrays with shape (len(times), 2). That makes the function simpler and the testing much easier.

I agree; I also have not encountered 1D time bounds, so this sounds like a good idea.

spencerahill (Owner, Author) commented:

Great. Also FWIW, CF conventions only permit the 2D format:

To represent cells we add the attribute bounds to the appropriate coordinate variable(s). The value of bounds is the name of the variable that contains the vertices of the cell boundaries. We refer to this type of variable as a "boundary variable." A boundary variable will have one more dimension than its associated coordinate or auxiliary coordinate variable. The additional dimension should be the most rapidly varying one, and its size is the maximum number of cell vertices.
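As a concrete sketch of the CF-sanctioned layout (values invented for illustration): a boundary variable for a length-3 time axis has shape (3, 2), with the vertex dimension varying fastest, and the time array is recovered by averaging over that extra dimension:

```python
import numpy as np

# Invented monthly-mean bounds, in units like "days since 1982-01-01":
# one (start, end) pair per time step, vertex dimension last per CF.
time_bounds = np.array([[0.0, 31.0],
                        [31.0, 59.0],
                        [59.0, 90.0]])

# Averaging over the vertex axis yields the midpoint timestamps.
new_times = time_bounds.mean(axis=1)
print(new_times)  # [15.5 45.  74.5]
```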

spencerahill (Owner, Author) commented:

CF conventions only permit the 2D format

Given this, I'm wondering if we even need this check at all. It's hard to imagine an input dataset wherein the time bounds and time arrays are different lengths. So I'm inclined to just remove the check entirely.

(Motivated by coveralls tipping me off that my test doesn't cover this mis-shaped case.)

spencerkclark (Collaborator) left a comment:

@spencerahill thanks for taking the time to switch test_utils_times.py to use pytest! I have one optional suggestion that could simplify things a bit more, but it's totally up to you.

Regarding checking the shape of time_bounds; I agree it is probably not necessary.

len(ds[TIME_STR]))
raise ValueError(msg)

start_bounds = bounds.isel(**{BOUNDS_STR: 0})
spencerkclark (Collaborator) commented Sep 8, 2017:

The function you've written here is nice, because it can be applied to both dates and floats. That said, we only use this function on floats (i.e., before we decode times), so optionally the logic (and tests) could be simplified slightly:

def average_time_bounds(ds):
    bounds = ds[TIME_BOUNDS_STR]
    new_times = bounds.mean(dim=BOUNDS_STR, keep_attrs=True).drop('time').rename(TIME_STR)
    new_times[TIME_STR] = new_times
    return new_times

spencerkclark (Collaborator) followed up:

Note the slight edit above (the addition of .drop)

ds = times.ensure_time_avg_has_cf_metadata(ds)
ds[TIME_STR].values = times.average_time_bounds(ds).values
spencerkclark (Collaborator) commented:

If you went with my suggestion for the simpler method below, I think you could simplify this to:

ds[TIME_STR] = times.average_time_bounds(ds)

@@ -530,5 +534,29 @@ def test_yearly_average_masked_data():
xr.testing.assert_allclose(actual, desired)


if __name__ == '__main__':
sys.exit(unittest.main())
def test_average_time_bounds():
spencerkclark (Collaborator) commented:

Regardless of whether or not you switch to a float-only method, could you add a test for the use case where floats are used rather than dates, since that is the use case in data_loader.py?

spencerahill (Owner, Author) commented:

we only use this function on floats (i.e. before we decode times)

Wow, you're absolutely right. Total mistake on my part. Lesson: it pays to stop and think before you start hacking away.

Thanks for catching that. I implemented your suggested replacement, and it seems to work; I'll push when tests are good.

spencerahill (Owner, Author) commented:

I took the liberty of creating a pytest.fixture for the new test and rewriting one existing test, test_ensure_time_avg_has_cf_metadata, to use it. There's room for further improvement using fixtures in this module, if/when somebody has time/inclination.

spencerahill (Owner, Author) commented:

We're green. @spencerkclark I'll wait for your final approval, given how many careless mistakes I've made so far in this PR! 😝

spencerkclark (Collaborator) commented:

Thanks, this looks good to me!

@spencerahill spencerahill merged commit 94b6b9c into develop Sep 8, 2017
@spencerahill spencerahill deleted the overwrite-timestamps branch September 8, 2017 20:01