Fix workaround for years less than 1678 for non-0001 reference dates #189
Conversation
aospy/utils/times.py
Outdated
@@ -198,13 +198,14 @@ def numpy_datetime_workaround_encode_cf(ds):
    time = ds[internal_names.TIME_STR]
    units = time.attrs['units']
    units_yr = units.split(' since ')[1].split('-')[0]
    ref_yr = int(units_yr)
    min_yr_decoded = xr.decode_cf(time.to_dataset(name='dummy'))
    min_date = min_yr_decoded[internal_names.TIME_STR].values[0]
    if isinstance(min_date, np.datetime64):
        return ds, pd.Timestamp(min_date).year
    else:
        min_yr = min_date.year
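The dispatch in the diff above can be sketched in isolation (a minimal sketch; `min_year` is a hypothetical helper name, not from the PR). xarray decodes in-range times to `np.datetime64`, while out-of-range times decode to calendar-date objects that carry a `.year` attribute directly:

```python
import numpy as np
import pandas as pd

def min_year(min_date):
    """Return the year of a decoded date, whatever type it decoded to."""
    # Dates inside the Timestamp-valid range decode to np.datetime64;
    # convert via pd.Timestamp to get at the year.
    if isinstance(min_date, np.datetime64):
        return pd.Timestamp(min_date).year
    # Out-of-range dates decode to calendar objects exposing .year directly.
    return min_date.year
```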
After defining `min_yr`, let's define `offset = int(units_yr) - min_yr + 1` rather than `ref_yr`, and include a comment above it describing it, e.g. `# Offset such that decoded dates start in the earliest valid whole year`. Then the line below becomes `new_units_yr = pd.Timestamp.min.year + offset`.
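A minimal sketch of the suggested arithmetic (the sample `units_yr` and `min_yr` values here are hypothetical, chosen to mirror the year-700 test case discussed below):

```python
import pandas as pd

units_yr = '0001'  # hypothetical reference year parsed from the units string
min_yr = 700       # hypothetical earliest decoded year in the data

# Offset such that decoded dates start in the earliest valid whole year
offset = int(units_yr) - min_yr + 1
new_units_yr = pd.Timestamp.min.year + offset

# Relative to the new reference year, the earliest date lands in the
# first whole Timestamp-valid year (1678):
shifted_min_yr = new_units_yr + (min_yr - int(units_yr))
```

With these values, `offset` is -698, `new_units_yr` is 979, and the earliest decoded date shifts to year 1678.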
Thanks for the quick turnaround on this. Can you also add a test with a non-January-1st month and day? Potential edge cases that come to mind: dates in 1678, both in and out of the valid range; same for dates in the last valid year.
Yeah, a minimum date with the same year as the Timestamp minimum date would cause problems (I think it's actually 1677). In its current form, the logic that we use to transform dates from user input to the workaround versions (…)
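For reference, the `Timestamp`-valid range in question is straightforward to check in pandas:

```python
import pandas as pd

# pandas Timestamps are nanosecond-resolution, which bounds the valid range
# to roughly 1677-09-21 through 2262-04-11.  1678 and 2261 are therefore
# the first and last whole years fully inside the range.
print(pd.Timestamp.min)  # 1677-09-21 ...
print(pd.Timestamp.max)  # 2262-04-11 ...
```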
aospy/test/test_utils_times.py
Outdated
@@ -144,57 +144,60 @@ def test_numpy_datetime_range_workaround(self):
    )

def test_numpy_datetime_workaround_encode_cf(self):
    def create_test_data(
            days, ref_units, expected_units, expected_date):
`expected_date` isn't used.

Nit: it's non-standard to start the func args on the next line. I would just break after `ref_units` to stay within 80 chars.
Your call based on how easy the fix would be and your time available. Although if we're saying we've identified a bug, we should at least warn/document it somehow.
Switch to using the minimum and maximum years in the loaded dataset to determine whether dates should be offset. This is what ultimately determines whether dates are initially offset in numpy_datetime_workaround_encode_cf, so it is important to be consistent. Originally this method used the user-input subset bounds. That could be an issue if the dataset included data from times before 1678, say 1600, but the user only wanted to subset data between, say, 1679 and 1750. In that case, internally, dates would be offset to start in 1678, but the subset dates would not be offset. In addition, this method would not catch dates outside the Timestamp upper bound.
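The scenario from this commit message can be sketched with the dataset-level check (the year bounds here are hypothetical, matching the example in the message):

```python
import pandas as pd

# Hypothetical year bounds of the full dataset on disk, NOT of the
# user's requested 1679-1750 subset.
min_year, max_year = 1600, 1750

min_yr_in_range = pd.Timestamp.min.year < min_year < pd.Timestamp.max.year
max_yr_in_range = pd.Timestamp.min.year < max_year < pd.Timestamp.max.year

# True here: the 1600 start forces an offset even though the requested
# subset lies entirely within the Timestamp-valid range.
needs_offset = not (min_yr_in_range and max_yr_in_range)
```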
@spencerahill I think I fixed it (see the commit message for a more detailed explanation), but I think we should think carefully about:
before merging. Hopefully I'll have some time over the next few days, but I obviously appreciate your feedback as well.
Good catch!
Let's also keep in mind that all of this is temporary; it will all go away once #98 / pydata/xarray#1084 are addressed. As such, I think it's reasonable to fix bugs in actual use cases as they come up, but potentially just warn/raise when we recognize edge cases that aren't actually affecting users so far, e.g. this edge case of subsetting within the valid range from input files extending outside the valid range.

That's my take. What's the latest with #98 and pydata/xarray#1084, and based on that do you think my take seems reasonable? I'll give a more careful code review once we converge on these bigger-picture aspects.
Right. pydata/xarray#1252 is the place to look for progress on that. The updates I pushed about a month ago now are a first pass at resetting the logic in xarray to use a NetCDFTimeIndex whenever a non-standard calendar is used or dates with standard calendar are outside the Timestamp-valid range. I think the gist of what we set out to accomplish is there (e.g. partial-string indexing, saving to netCDF, and field accessors), but it needs a good deal of cleaning up (and I could probably use some guidance on how best to do that).
For this particular edge case, it may actually be just as easy to fix it as it would be to warn/raise upon encountering it :). To warn/raise I think we would still need to check both …
Indeed. I just pinged that thread to hopefully rekindle the discussion 😄
That's totally fine... of course, all else equal, a fix is better than a warning! I'll review your code now.
A few minor things, otherwise this looks good to me. Thanks!
if date < pd.Timestamp.min:

min_yr_in_range = pd.Timestamp.min.year < min_year < pd.Timestamp.max.year
max_yr_in_range = pd.Timestamp.min.year < max_year < pd.Timestamp.max.year
if not (min_yr_in_range and max_yr_in_range):
Just confirming I'm understanding the situation correctly: if any part of the loaded data is outside of the valid range, the decoding will be unable to use the standard numpy datetimes. In that case, we apply our offset so that it "starts" near the beginning of the valid date range.
Importantly, the loaded data includes the whole range of data stored on disk, even if the date range we're interested in is a subset of that, and even if that subset is itself entirely within the valid range.
If that's correct then I think your logic is correct.
That's exactly right
aospy/test/test_data_loader.py
Outdated
@@ -442,6 +443,26 @@ def test_load_variable(self):
    expected = xr.open_dataset(filepath)['condensation_rain']
    np.testing.assert_array_equal(result.values, expected.values)

def test_load_variable_non_0001_refdate(self):
    def preprocess(ds, **kwargs):
        ds['time'] = ds['time'] - 1095.
Can you add a comment explaining the significance/reasoning for the 1095 value? And/or define it as a constant.
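One way to address this (a sketch only; the constant names are hypothetical, and the three-noleap-year interpretation is an assumption based on the 'noleap' calendar used in this test):

```python
# Assumption: 1095 = 3 * 365, i.e. exactly three years in the 'noleap'
# (365-day) calendar, which shifts the effective reference date back
# three years.  Constant names are hypothetical.
DAYS_PER_NOLEAP_YEAR = 365
THREE_NOLEAP_YEARS = 3. * DAYS_PER_NOLEAP_YEAR  # 1095.

def preprocess(ds, **kwargs):
    # Shift the raw 'days since <ref>' values back by three noleap years.
    ds['time'] = ds['time'] - THREE_NOLEAP_YEARS
    return ds
```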
desired[TIME_STR].attrs['calendar'] = 'noleap'
return actual, min_yr, max_yr, desired
# 255169 days from 0001-01-01 corresponds to date 700-02-04. |
Something like this line is what I have in mind in the above comment re: the 1095 value
@spencerahill thanks for having a look and for pinging the NetCDFTimeIndex thread; I think I took care of what you were looking for here. Let me know if you have any more comments!
Awesome, thanks @spencerkclark! Nice work as usual.
Closes #188
@spencerahill I can confirm the tests added here failed prior to making this change.