
Fix workaround for years less than 1678 for non-0001 reference dates #189

Merged · 8 commits · Jun 12, 2017

Conversation

spencerkclark
Collaborator

Closes #188

@spencerahill I can confirm the tests added here failed prior to making this change.

@@ -198,13 +198,14 @@ def numpy_datetime_workaround_encode_cf(ds):
    time = ds[internal_names.TIME_STR]
    units = time.attrs['units']
    units_yr = units.split(' since ')[1].split('-')[0]
    ref_yr = int(units_yr)
    min_yr_decoded = xr.decode_cf(time.to_dataset(name='dummy'))
    min_date = min_yr_decoded[internal_names.TIME_STR].values[0]
    if isinstance(min_date, np.datetime64):
        return ds, pd.Timestamp(min_date).year
    else:
        min_yr = min_date.year
Owner

After defining min_yr, let's define offset = int(units_yr) - min_yr + 1 rather than ref_yr and include a comment above it describing it, e.g. # Offset such that decoded dates start in the earliest valid whole year. Then the line below becomes new_units_yr = pd.Timestamp.min.year + offset.
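The suggested change can be sketched in isolation like this (a minimal, hypothetical illustration: the units string and decoded minimum year are example values, and the variable names follow the review suggestion rather than the merged code):

```python
import pandas as pd

# Hypothetical example inputs: the reference year parsed from a CF
# units string, and the minimum year found in the decoded dates.
units = 'days since 0700-01-01'
units_yr = units.split(' since ')[1].split('-')[0]
min_yr = 700

# Offset such that decoded dates start in the earliest valid whole year.
offset = int(units_yr) - min_yr + 1
new_units_yr = pd.Timestamp.min.year + offset

# pd.Timestamp.min falls in 1677, so dates here start in 1678.
print(new_units_yr)  # 1678
```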

@spencerahill
Owner

Thanks for the quick turnaround on this. Can you also add a test with non-January 1st month and day?

Potential edge cases that come to mind: dates in 1678, both in and out of the valid range; same for dates in the last valid year.

@spencerkclark
Collaborator Author

Potential edge cases that come to mind: dates in 1678, both in and out of the valid range; same for dates in the last valid year.

Yeah, a minimum date with the same year as the Timestamp minimum date would cause problems (I think it's actually 1677). In its current form, the logic that we use to transform dates from user input to the workaround versions (numpy_datetime_range_workaround) would also fail for dates larger than the Timestamp maximum. Do you think we should address that or wait until someone has a use case for it?
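For reference, the exact bounds can be checked directly; this is a quick sketch, and the years printed reflect pandas' nanosecond-resolution Timestamp:

```python
import pandas as pd

# pandas stores Timestamps as nanosecond counts in a signed 64-bit
# integer, which bounds the representable dates to 1677-09-21 through
# 2262-04-11.
print(pd.Timestamp.min.year)  # 1677
print(pd.Timestamp.max.year)  # 2262

# Hence 1678 is the earliest whole year that fits entirely in range.
earliest_whole_year = pd.Timestamp.min.year + 1
```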

@@ -144,57 +144,60 @@ def test_numpy_datetime_range_workaround(self):
    )

    def test_numpy_datetime_workaround_encode_cf(self):
        def create_test_data(
                days, ref_units, expected_units, expected_date):
Owner

expected_date isn't used.

Nit: non-standard to start the func args on the next line. I would just break after ref_units to stay within 80 chars.

@spencerahill
Owner

Do you think we should address that or wait until someone has a use case for it?

Your call based on how easy the fix would be and your time available. That said, if we're saying we've identified a bug, we should at least warn/document it somehow.

Switch to using the minimum and maximum years in the loaded dataset
to determine whether dates should be offset.  This is what ultimately
determines whether dates are initially offset in
numpy_datetime_workaround_encode_cf, so it is important to be
consistent.

Originally this method used the user-input subset bounds.  This
would have the potential of being an issue if the dataset included
data from times before 1678, say 1600, but the user only wanted to subset
data between say 1679 and 1750.  In that case internally, dates
would be offset to start in 1678, but the subset dates would not be
offset.  In addition this method would not catch if dates were
outside the Timestamp upper bound.
@spencerkclark
Collaborator Author

@spencerahill I think I fixed it (see the commit message for a more detailed explanation), but I think we should think carefully about:

  • When this workaround breaks down (so we can raise an informative exception)
  • Whether what we have implemented currently is the absolute best we can do (as far as a workaround goes)
  • Whether this has adequate tests

before merging. Hopefully I'll have some time over the next few days, but I obviously appreciate your feedback as well.

@spencerahill
Owner

Originally this method used the user-input subset bounds. This
would have the potential of being an issue if the dataset included
data from times before 1678, say 1600, but the user only wanted to subset
data between say 1679 and 1750. In that case internally, dates
would be offset to start in 1678, but the subset dates would not be
offset.

Good catch!

Whether what we have implemented currently is the absolute best we can do (as far as a workaround goes)

Let's also keep in mind that all of this is temporary; it will all go away once #98 / pydata/xarray#1084 are addressed. As such, I think it's reasonable to fix bugs in actual use cases as they come up, but potentially just warn/raise when we recognize edge cases that aren't actually affecting users so far, e.g. this edge case of subsetting within the valid range from input files extending outside the valid range.

That's my take. What's the latest with #98 and pydata/xarray#1084, and based on that do you think my take seems reasonable?

I'll give a more careful code review once we converge on these bigger picture aspects.

@spencerkclark
Collaborator Author

Let's also keep in mind that all of this is temporary; it will all go away once #98 / pydata/xarray#1084 are addressed.

Right. pydata/xarray#1252 is the place to look for progress on that. The updates I pushed about a month ago now are a first pass at resetting the logic in xarray to use a NetCDFTimeIndex whenever a non-standard calendar is used or dates with standard calendar are outside the Timestamp-valid range. I think the gist of what we set out to accomplish is there (e.g. partial-string indexing, saving to netCDF, and field accessors), but it needs a good deal of cleaning up (and I could probably use some guidance on how best to do that).

As such, I think it's reasonable to fix bugs in actual use cases as they come up, but potentially just warn/raise when we recognize edge cases that aren't actually affecting users so far, e.g. this edge case of subsetting within the valid range from input files extending outside the valid range.

For this particular edge case it may actually be just as easy to fix it as it would be to warn/raise upon encountering it :). To warn/raise I think we would still need to check both min_yr and max_yr in numpy_datetime_range_workaround (checks which I've introduced as part of the fix). Bigger picture, though, I agree: for future edge cases we may think of (that aren't impacting users), warning/raising is probably plenty.

@spencerahill
Owner

it needs a good deal of cleaning up (and I could probably use some guidance on how best to do that)

Indeed. I just pinged that thread to hopefully rekindle the discussion 😄

For this particular edge case it may actually be just as easy to fix it as it would be to warn/raise upon encountering it :).

That's totally fine... of course, all else equal, a fix is better than a warning! I'll review your code now.

@spencerahill (Owner) left a comment

A few minor things, otherwise this looks good to me. Thanks!

if date < pd.Timestamp.min:
min_yr_in_range = pd.Timestamp.min.year < min_year < pd.Timestamp.max.year
max_yr_in_range = pd.Timestamp.min.year < max_year < pd.Timestamp.max.year
if not (min_yr_in_range and max_yr_in_range):
Owner

Just confirming I'm understanding the situation correctly: if any part of the loaded data is outside of the valid range, the decoding will be unable to use the standard numpy datetimes. In that case, we apply our offset so that it "starts" near the beginning of the valid date range.

Importantly, the loaded data includes the whole range of data loaded onto disk, even if the date range we're interested in is a subset of that, and even if that subset is itself entirely within the valid range.

If that's correct then I think your logic is correct.

Collaborator Author

That's exactly right
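The check under discussion can be sketched as a standalone function (a minimal version, under the assumption that min_year and max_year describe the full on-disk dataset rather than a user-requested subset; the function name here is mine, not the project's):

```python
import pandas as pd

def years_in_valid_range(min_year, max_year):
    """Return True only if both extreme years lie strictly between the
    pd.Timestamp bounds, i.e. within the whole years 1678-2261."""
    min_yr_in_range = pd.Timestamp.min.year < min_year < pd.Timestamp.max.year
    max_yr_in_range = pd.Timestamp.min.year < max_year < pd.Timestamp.max.year
    return min_yr_in_range and max_yr_in_range

# A file spanning 1600-1750 fails the check (so the offset is applied)
# even if the user only asks for a 1679-1750 subset.
print(years_in_valid_range(1600, 1750))  # False
print(years_in_valid_range(1679, 1750))  # True
```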

@@ -442,6 +443,26 @@ def test_load_variable(self):
expected = xr.open_dataset(filepath)['condensation_rain']
np.testing.assert_array_equal(result.values, expected.values)

    def test_load_variable_non_0001_refdate(self):
        def preprocess(ds, **kwargs):
            ds['time'] = ds['time'] - 1095.
Owner

Can you add a comment explaining the significance/reasoning for the 1095 value? And/or define it as a constant.

            desired[TIME_STR].attrs['calendar'] = 'noleap'
            return actual, min_yr, max_yr, desired

# 255169 days from 0001-01-01 corresponds to date 700-02-04.
Owner

Something like this line is what I have in mind in the above comment re: the 1095 value
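For reference, a minimal sketch of the arithmetic behind both magic numbers in the tests, assuming a noleap calendar in which every year has exactly 365 days:

```python
# In a noleap calendar, subtracting 1095 days shifts every date back
# exactly three years, which changes the effective reference date
# without editing the units attribute.
DAYS_PER_NOLEAP_YEAR = 365
shift_days = 3 * DAYS_PER_NOLEAP_YEAR
print(shift_days)  # 1095

# 255169 days after 0001-01-01 is 699 full years plus 34 days, i.e.
# the 35th day of year 700, which is 700-02-04 in a noleap calendar.
full_years, day_offset = divmod(255169, DAYS_PER_NOLEAP_YEAR)
print(full_years, day_offset)  # 699 34
```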

@spencerkclark
Collaborator Author

@spencerahill thanks for having a look and for pinging the NetCDFTimeIndex thread; I think I took care of what you were looking for here.

Let me know if you have any more comments!

@spencerahill
Owner

Awesome, thanks @spencerkclark! Nice work as usual.

@spencerahill spencerahill merged commit 829dd69 into spencerahill:develop Jun 12, 2017
@spencerkclark spencerkclark deleted the ref-date-fix branch June 12, 2017 20:03