Make datetime_to_numeric more robust to overflow errors #3642

Closed · wants to merge 2 commits
43 changes: 31 additions & 12 deletions xarray/core/duck_array_ops.py
@@ -372,43 +372,62 @@ def _datetime_nanmin(array):


 def datetime_to_numeric(array, offset=None, datetime_unit=None, dtype=float):
-    """Convert an array containing datetime-like data to an array of floats.
+    """Convert an array containing datetime-like data to numerical values.
+
+    Convert the datetime array to a timedelta relative to an offset.
 
     Parameters
     ----------
-    da : np.array
-        Input data
-    offset: Scalar with the same type of array or None
-        If None, subtract minimum values to reduce round off error
-    datetime_unit: None or any of {'Y', 'M', 'W', 'D', 'h', 'm', 's', 'ms',
-        'us', 'ns', 'ps', 'fs', 'as'}
-    dtype: target dtype
+    da : array-like
+        Input data
+    offset : None, datetime or cftime.datetime
+        Datetime offset. If None, this is set by default to the array's minimum
+        value to reduce round off errors.
+    datetime_unit : {None, 'Y', 'M', 'W', 'D', 'h', 'm', 's', 'ms', 'us', 'ns', 'ps', 'fs', 'as'}
+        If not None, convert output to a given datetime unit. Note that some
+        conversions are not allowed due to non-linear relationships between units.
+    dtype : dtype
+        Output dtype.
 
     Returns
     -------
     array
+        Numerical representation of datetime object relative to an offset.
+
+    Notes
+    -----
+    Some datetime unit conversions won't work, for example from days to years, even
+    though some calendars would allow for them (e.g. no_leap). This is because there
+    is no `cftime.timedelta` object.
     """
     # TODO: make this function dask-compatible?
+    # Set offset to minimum if not given
     if offset is None:
         if array.dtype.kind in "Mm":
             offset = _datetime_nanmin(array)
         else:
             offset = min(array)
 
+    # Compute timedelta object.
+    # For np.datetime64, this can silently yield garbage due to overflow.
Member:

Ah good catch; this is another bug (we end up with a NaN in this example):

In [1]: import numpy as np; import xarray as xr

In [2]: times = [np.datetime64('1690-01-01', 'ns'), np.datetime64('2200-01-01', 'ns')]

In [3]: da = xr.DataArray(range(2), dims=['time'], coords=[times])

In [4]: da
Out[4]:
<xarray.DataArray (time: 2)>
array([0, 1])
Coordinates:
  * time     (time) datetime64[ns] 1690-01-01 2200-01-01

In [5]: da.interp(time=['1700-01-01'])
Out[5]:
<xarray.DataArray (time: 1)>
array([nan])
Coordinates:
  * time     (time) datetime64[ns] 1700-01-01

This is another potential plus of using a universal offset. In the case of NumPy dates, I think if we always use 1970-01-01T00:00:00 as the offset we are guaranteed this will never return garbage.
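For illustration, a minimal sketch of why this holds, assuming datetime64[ns] values are stored as int64 nanosecond counts since 1970-01-01:

    import numpy as np

    t = np.datetime64("2200-01-01", "ns")
    epoch = np.datetime64("1970-01-01", "ns")
    # The subtraction simply recovers the stored int64 nanosecond count, so it
    # cannot overflow for any value that was representable as datetime64[ns].
    print((t - epoch) / np.timedelta64(1, "D"))  # 84006.0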

Contributor Author:

We'd also have to enforce a microsecond resolution, which could break some code.

Member:

For the issue I raised in the comment above, I think switching to 1970-01-01T00:00:00 as the offset would work for NumPy dates at nanosecond resolution (maybe there is something I'm missing?). Let me think a little more about the implications of casting the datetime.timedelta objects to timedelta64[us] objects, though. I think you're right that we should think carefully about that.

     array = array - offset
 
-    if not hasattr(array, "dtype"):  # scalar is converted to 0d-array
+    # Scalar is converted to 0d-array
+    if not hasattr(array, "dtype"):
         array = np.array(array)
 
+    # Convert timedelta objects to timedelta64[ms] dtype.
     if array.dtype.kind in "O":
-        # possibly convert object array containing datetime.timedelta
-        array = np.asarray(pd.Series(array.ravel())).reshape(array.shape)
+        array = array.astype("timedelta64[ms]")
Member:

Here we may want to use microsecond units ("timedelta64[us]"); using millisecond resolution may lead to unnecessary truncation of precision.
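A small sketch of the truncation in question, relying on NumPy's object-to-timedelta64 casting:

    import datetime
    import numpy as np

    td = np.array([datetime.timedelta(microseconds=1500)], dtype=object)
    # Casting to millisecond resolution silently truncates the extra 500 us,
    # while microsecond resolution keeps datetime.timedelta values exact.
    print(td.astype("timedelta64[ms]"))  # [1]    -> 1 ms, precision lost
    print(td.astype("timedelta64[us]"))  # [1500] -> exact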

Contributor Author:

Will do. In get_clean_interp_index, should we also use datetime_to_numeric for pd.DatetimeIndex with the us resolution?

    if isinstance(index, (CFTimeIndex, pd.DatetimeIndex)):
        offset = type(index[0])(1970, 1, 1)
        index = Variable(
            data=datetime_to_numeric(index, offset=offset, datetime_unit="us"),
            dims=(dim,),
        )

Contributor Author:

Switching to "us" breaks interpolate_na because max_gap is ns by default.

Throughout xarray, "ns" is the default resolution. I'm a bit worried that switching this to "us" in a couple of places is a recipe for future bugs.

Member:

It occurs to me that we could use nanosecond units (or units even finer) for cftime dates, so long as we do the unit conversion step after converting to floats. E.g. one could write a function like the following to aid in that:

SUB_MICROSECOND_UNIT_CONVERSION_FACTORS = {
    "ns": 1e3,
    "ps": 1e6,
    "fs": 1e9,
    "as": 1e12
}

def _py_timedelta_to_float(array, datetime_unit="ns"):
    array = array.astype("timedelta64[us]")
    if datetime_unit in SUB_MICROSECOND_UNIT_CONVERSION_FACTORS:
        factor = SUB_MICROSECOND_UNIT_CONVERSION_FACTORS[datetime_unit]
        return factor * array.astype(np.float64)
    else:
        return array / np.timedelta64(1, datetime_unit)

I have not thought carefully yet about how to handle max_gap in interpolate_na in the case of cftime dates, but this seems like a possible compromise to allow us to continue to default to nanosecond resolution internally for these functions. @huard do you think that could help?

Member:

Here's perhaps a cleaner version:

def _py_timedelta_to_float(array, datetime_unit="ns"):
    array = array.astype("timedelta64[us]").astype(np.float64)
    conversion_factor = np.timedelta64(1, "us") / np.timedelta64(1, datetime_unit)
    return conversion_factor * array
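As a quick usage sketch for either variant above (hypothetical values, assuming one of the _py_timedelta_to_float definitions is in scope):

    import datetime
    import numpy as np

    arr = np.array([datetime.timedelta(days=292000)], dtype=object)  # ~800 years
    # A direct arr.astype("timedelta64[ns]") would overflow int64 for spans over
    # ~292 years; going through float microseconds keeps the value representable.
    print(_py_timedelta_to_float(arr, "ns"))  # [2.52288e+19] as float64
    print(_py_timedelta_to_float(arr, "D"))   # [292000.0]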

Member:

Yes. Keep in mind though the default units for differentiate and integrate have been nanoseconds, and that is user-facing; should we preserve that? Also should we preserve the ability to specify sub-microsecond max_gap's in interpolate_na?

Contributor:

ah I see. Thanks @spencerkclark

> Keep in mind though the default units for differentiate and integrate have been nanoseconds, and that is user-facing; should we preserve that?

Yes I think we should. Or we have to go through a deprecation cycle.

> Also should we preserve the ability to specify sub-microsecond max_gap's in interpolate_na?

From a user perspective, it seems like we should allow all units that may be used for the time coordinate. I admit I don't understand the subtleties of precision though.

Member:

Thanks @dcherian -- I agree on all points.

So @huard, I think it makes sense to do the following (feel free to merge your PRs or not as you see fit):

  1. Maintain the ability to output floats with nanosecond units (or finer) from datetime_to_numeric by converting timedelta-like objects to floats before doing the unit conversion.
  2. Use a similar unit conversion approach when converting max_gap to a float, so that we again get the best of both worlds. I.e. continue to support nanosecond-resolution max_gap timedeltas, but at the same time allow for > 292-year max_gap timedeltas (specified, e.g., using datetime.timedelta objects or np.timedelta64 objects of microsecond or coarser resolution).

(2) might require some careful logic, but in principle I think it would be doable. Does that make sense? Thanks already for your help in identifying/sorting all these issues out. Sorry for taking a while to think all this through.
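As a rough sketch of what (2) could look like (a hypothetical helper, not the final implementation):

    import datetime
    import numpy as np

    def _max_gap_to_float(max_gap, datetime_unit="ns"):
        # Route timedelta-like values through a float conversion so that both
        # sub-microsecond units and > 292-year gaps remain representable.
        if isinstance(max_gap, datetime.timedelta):
            max_gap = np.timedelta64(max_gap)  # exact: timedelta has us resolution
        seconds = max_gap / np.timedelta64(1, "s")
        return seconds * (np.timedelta64(1, "s") / np.timedelta64(1, datetime_unit))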

Contributor Author:

I've merged my two PRs into #3631 and implemented the py_timedelta_to_float function.
It seems to work except on earlier NumPy versions, which I've special-cased since I don't think I can fix that.

Member:

Thanks a lot @huard -- I'll have a closer look over the coming days.


+    # Convert to specified timedelta units.
     if datetime_unit:
         array = array / np.timedelta64(1, datetime_unit)
 
-    # convert np.NaT to np.nan
+    # Convert np.NaT to np.nan
     if array.dtype.kind in "mM":
         return np.where(isnull(array), np.nan, array.astype(dtype))
 
     return array.astype(dtype)
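For illustration, a sketch of the overflow workaround this diff is tested against (mirroring the test below; the construction of times here is an assumption):

    import numpy as np
    from xarray.core import duck_array_ops

    times = np.arange(
        np.datetime64("2000-01-01"), np.datetime64("2000-02-05"), np.timedelta64(7, "D")
    ).astype("datetime64[ns]")
    # Subtracting a year-1 offset from datetime64[ns] values overflows int64, so
    # the values are first cast to millisecond resolution before converting.
    ms = times.astype("datetime64[ms]")
    result = duck_array_ops.datetime_to_numeric(
        ms, offset=np.datetime64("0001-01-01"), datetime_unit="D", dtype=int
    )
    print(result)  # [730119 730126 730133 730140 730147]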


22 changes: 21 additions & 1 deletion xarray/tests/test_duck_array_ops.py
@@ -669,10 +669,22 @@ def test_datetime_to_numeric_datetime64():
     expected = 24 * np.arange(0, 35, 7).astype(dtype)
     np.testing.assert_array_equal(result, expected)
 
+    # Test overflow. Need to convert to ms first, otherwise we get garbage.
+    # TODO: make datetime_to_numeric more robust to overflow errors.
+    offset = np.datetime64("0001-01-01")
+    ms = times.astype("datetime64[ms]")
+    result = duck_array_ops.datetime_to_numeric(
+        ms, offset=offset, datetime_unit="D", dtype=int
+    )
+    expected = 730119 + np.arange(0, 35, 7)
+    np.testing.assert_array_equal(result, expected)


 @requires_cftime
 def test_datetime_to_numeric_cftime():
-    times = cftime_range("2000", periods=5, freq="7D").values
+    import cftime
+
+    times = cftime_range("2000", periods=5, freq="7D", calendar="standard").values
     result = duck_array_ops.datetime_to_numeric(times, datetime_unit="h")
     expected = 24 * np.arange(0, 35, 7)
     np.testing.assert_array_equal(result, expected)
@@ -686,3 +698,11 @@ def test_datetime_to_numeric_cftime():
result = duck_array_ops.datetime_to_numeric(times, datetime_unit="h", dtype=dtype)
expected = 24 * np.arange(0, 35, 7).astype(dtype)
np.testing.assert_array_equal(result, expected)

# Test potential overflow (outputs are different from timedelta64. Expected ?)
offset = cftime.DatetimeGregorian(1, 1, 1)
result = duck_array_ops.datetime_to_numeric(
times, offset=offset, datetime_unit="D", dtype=int
)
expected = 730121 + np.arange(0, 35, 7)
np.testing.assert_array_equal(result, expected)
7 changes: 7 additions & 0 deletions xarray/tests/test_interp.py
@@ -662,3 +662,10 @@ def test_datetime_interp_noerror():
coords={"time": pd.date_range("01-01-2001", periods=50, freq="H")},
)
a.interp(x=xi, time=xi.time) # should not raise an error


@requires_cftime
def test_3641():
times = xr.cftime_range("0001", periods=3, freq="500Y")
da = xr.DataArray(range(3), dims=["time"], coords=[times])
da.interp(time=["0002-05-01"])