
Fix workaround for years less than 1678 for non-0001 reference dates #189

Merged · 8 commits · Jun 12, 2017

Conversation

spencerkclark
Collaborator

Closes #188

@spencerahill I can confirm the tests added here failed prior to making this change.

@@ -198,13 +198,14 @@ def numpy_datetime_workaround_encode_cf(ds):
    time = ds[internal_names.TIME_STR]
    units = time.attrs['units']
    units_yr = units.split(' since ')[1].split('-')[0]
    ref_yr = int(units_yr)
    min_yr_decoded = xr.decode_cf(time.to_dataset(name='dummy'))
    min_date = min_yr_decoded[internal_names.TIME_STR].values[0]
    if isinstance(min_date, np.datetime64):
        return ds, pd.Timestamp(min_date).year
    else:
        min_yr = min_date.year
Owner

After defining min_yr, let's define offset = int(units_yr) - min_yr + 1 rather than ref_yr and include a comment above it describing it, e.g. # Offset such that decoded dates start in the earliest valid whole year. Then the line below becomes new_units_yr = pd.Timestamp.min.year + offset.
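The suggested change can be sketched in isolation like this (a minimal, hypothetical illustration: the units string and decoded minimum year are example values, and the variable names follow the review suggestion rather than the merged code):

```python
import pandas as pd

# Hypothetical example inputs: the reference year parsed from a CF
# units string, and the minimum year found in the decoded dates.
units = 'days since 0700-01-01'
units_yr = units.split(' since ')[1].split('-')[0]
min_yr = 700

# Offset such that decoded dates start in the earliest valid whole year.
offset = int(units_yr) - min_yr + 1
new_units_yr = pd.Timestamp.min.year + offset

# pd.Timestamp.min falls in 1677, so dates here start in 1678.
print(new_units_yr)  # 1678
```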

@spencerahill
Owner

Thanks for the quick turnaround on this. Can you also add a test with non-January 1st month and day?

Potential edge cases that come to mind: dates in 1678, both in and out of the valid range; same for dates in the last valid year.

@spencerkclark
Collaborator Author

Potential edge cases that come to mind: dates in 1678, both in and out of the valid range; same for dates in the last valid year.

Yeah, a minimum date with the same year as the Timestamp minimum date would cause problems (I think it's actually 1677). In its current form, the logic that we use to transform dates from user input to the workaround versions (numpy_datetime_range_workaround) would also fail for dates larger than the Timestamp maximum. Do you think we should address that or wait until someone has a use case for it?
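For reference, the exact bounds can be checked directly; this is a quick sketch, and the years printed reflect pandas' nanosecond-resolution Timestamp:

```python
import pandas as pd

# pandas stores Timestamps as nanosecond counts in a signed 64-bit
# integer, which bounds the representable dates to 1677-09-21 through
# 2262-04-11.
print(pd.Timestamp.min.year)  # 1677
print(pd.Timestamp.max.year)  # 2262

# Hence 1678 is the earliest whole year that fits entirely in range.
earliest_whole_year = pd.Timestamp.min.year + 1
```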

@@ -144,57 +144,60 @@ def test_numpy_datetime_range_workaround(self):
    )

    def test_numpy_datetime_workaround_encode_cf(self):
        def create_test_data(
                days, ref_units, expected_units, expected_date):
Owner

expected_date isn't used.

Nit: non-standard to start the func args on the next line. I would just break after ref_units to stay within 80 chars.

@spencerahill
Owner

Do you think we should address that or wait until someone has a use case for it?

Your call based on how easy the fix would be and your time available. That said, if we're saying we've identified a bug, we should at least warn/document it somehow.

Switch to using the minimum and maximum years in the loaded dataset
to determine whether dates should be offset.  This is what ultimately
determines whether dates are initially offset in
numpy_datetime_workaround_encode_cf, so it is important to be
consistent.

Originally this method used the user-input subset bounds.  This
would have the potential of being an issue if the dataset included
data from times before 1678, say 1600, but the user only wanted to subset
data between say 1679 and 1750.  In that case internally, dates
would be offset to start in 1678, but the subset dates would not be
offset.  In addition this method would not catch if dates were
outside the Timestamp upper bound.
@spencerkclark
Collaborator Author

@spencerahill I think I fixed it (see the commit message for a more detailed explanation), but I think we should think carefully about:

  • When this workaround breaks down (so we can raise an informative exception)
  • Whether what we have implemented currently is the absolute best we can do (as far as a workaround goes)
  • Whether this has adequate tests

before merging. Hopefully I'll have some time over the next few days, but I obviously appreciate your feedback as well.

@spencerahill
Owner

Originally this method used the user-input subset bounds. This
would have the potential of being an issue if the dataset included
data from times before 1678, say 1600, but the user only wanted to subset
data between say 1679 and 1750. In that case internally, dates
would be offset to start in 1678, but the subset dates would not be
offset.

Good catch!

Whether what we have implemented currently is the absolute best we can do (as far as a workaround goes)

Let's also keep in mind that all of this is temporary; it will all go away once #98 / pydata/xarray#1084 are addressed. As such, I think it's reasonable to fix bugs in actual use cases as they come up, but potentially just warn/raise when we recognize edge cases that aren't actually affecting users so far, e.g. this edge case of subsetting within the valid range from input files extending outside the valid range.

That's my take. What's the latest with #98 and pydata/xarray#1084, and based on that do you think my take seems reasonable?

I'll give a more careful code review once we converge on these bigger picture aspects.

@spencerkclark
Collaborator Author

Let's also keep in mind that all of this is temporary; it will all go away once #98 / pydata/xarray#1084 are addressed.

Right. pydata/xarray#1252 is the place to look for progress on that. The updates I pushed about a month ago now are a first pass at resetting the logic in xarray to use a NetCDFTimeIndex whenever a non-standard calendar is used or dates with standard calendar are outside the Timestamp-valid range. I think the gist of what we set out to accomplish is there (e.g. partial-string indexing, saving to netCDF, and field accessors), but it needs a good deal of cleaning up (and I could probably use some guidance on how best to do that).

As such, I think it's reasonable to fix bugs in actual use cases as they come up, but potentially just warn/raise when we recognize edge cases that aren't actually affecting users so far, e.g. this edge case of subsetting within the valid range from input files extending outside the valid range.

For this particular edge case it may actually be just as easy to fix it as it would be to warn/raise upon encountering it :). To warn/raise I think we would still need to check both min_yr and max_yr in numpy_datetime_range_workaround (checks which I've introduced as part of the fix). Bigger picture, though, I agree: for future edge cases we may think of (that aren't impacting users), warning/raising is probably plenty.

@spencerahill
Owner

it needs a good deal of cleaning up (and I could probably use some guidance on how best to do that)

Indeed. I just pinged that thread to hopefully rekindle the discussion 😄

For this particular edge case it may actually be just as easy to fix it as it would be to warn/raise upon encountering it :).

That's totally fine... of course, all else equal, a fix is better than a warning! I'll review your code now.

@spencerahill (Owner) left a comment

A few minor things, otherwise this looks good to me. Thanks!

if date < pd.Timestamp.min:
min_yr_in_range = pd.Timestamp.min.year < min_year < pd.Timestamp.max.year
max_yr_in_range = pd.Timestamp.min.year < max_year < pd.Timestamp.max.year
if not (min_yr_in_range and max_yr_in_range):
Owner

Just confirming I'm understanding the situation correctly: if any part of the loaded data is outside of the valid range, the decoding will be unable to use the standard numpy datetimes. In that case, we apply our offset so that it "starts" near the beginning of the valid date range.

Importantly, the loaded data includes the whole range of data loaded onto disk, even if the date range we're interested in is a subset of that, and even if that subset is itself entirely within the valid range.

If that's correct then I think your logic is correct.

Collaborator Author

That's exactly right
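The check under discussion can be sketched as a standalone function (a minimal version, under the assumption that min_year and max_year describe the full on-disk dataset rather than a user-requested subset; the function name here is mine, not the project's):

```python
import pandas as pd

def years_in_valid_range(min_year, max_year):
    """Return True only if both extreme years lie strictly between the
    pd.Timestamp bounds, i.e. within the whole years 1678-2261."""
    min_yr_in_range = pd.Timestamp.min.year < min_year < pd.Timestamp.max.year
    max_yr_in_range = pd.Timestamp.min.year < max_year < pd.Timestamp.max.year
    return min_yr_in_range and max_yr_in_range

# A file spanning 1600-1750 fails the check (so the offset is applied)
# even if the user only asks for a 1679-1750 subset.
print(years_in_valid_range(1600, 1750))  # False
print(years_in_valid_range(1679, 1750))  # True
```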

@@ -442,6 +443,26 @@ def test_load_variable(self):
expected = xr.open_dataset(filepath)['condensation_rain']
np.testing.assert_array_equal(result.values, expected.values)

    def test_load_variable_non_0001_refdate(self):
        def preprocess(ds, **kwargs):
            ds['time'] = ds['time'] - 1095.
Owner

Can you add a comment explaining the significance/reasoning for the 1095 value? And/or define it as a constant.

            desired[TIME_STR].attrs['calendar'] = 'noleap'
            return actual, min_yr, max_yr, desired

# 255169 days from 0001-01-01 corresponds to date 700-02-04.
Owner

Something like this line is what I have in mind in the above comment re: the 1095 value
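For reference, a minimal sketch of the arithmetic behind both magic numbers in the tests, assuming a noleap calendar in which every year has exactly 365 days:

```python
# In a noleap calendar, subtracting 1095 days shifts every date back
# exactly three years, which changes the effective reference date
# without editing the units attribute.
DAYS_PER_NOLEAP_YEAR = 365
shift_days = 3 * DAYS_PER_NOLEAP_YEAR
print(shift_days)  # 1095

# 255169 days after 0001-01-01 is 699 full years plus 34 days, i.e.
# the 35th day of year 700, which is 700-02-04 in a noleap calendar.
full_years, day_offset = divmod(255169, DAYS_PER_NOLEAP_YEAR)
print(full_years, day_offset)  # 699 34
```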

@spencerkclark
Collaborator Author

@spencerahill thanks for having a look and for pinging the NetCDFTimeIndex thread; I think I took care of what you were looking for here.

Let me know if you have any more comments!

@spencerahill
Owner

Awesome, thanks @spencerkclark! Nice work as usual.

@spencerahill spencerahill merged commit 829dd69 into spencerahill:develop Jun 12, 2017
@spencerkclark spencerkclark deleted the ref-date-fix branch June 12, 2017 20:03