Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use cftime to avoid out of bounds datetime when decoding non-CF time coordinates #283

Merged
merged 13 commits into from
Aug 2, 2022

Conversation

pochedls
Copy link
Collaborator

@pochedls pochedls commented Jul 28, 2022

Description

This PR is intended to address an issue in which datasets with non-cf-compliant times will not open if they go too far into the future.

Checklist

  • My code follows the style guidelines of this project
  • I have performed a self-review of my own code
  • My changes generate no new warnings
  • Any dependent changes have been merged and published in downstream modules

If applicable:

  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass with my changes (locally and CI/CD build)
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have noted that this is a breaking change for a major release (fix or feature that would cause existing functionality to not work as expected)

xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
@pochedls
Copy link
Collaborator Author

@tomvothecoder – I added this PR to get things rolling; I think you'll have some insights to improve the code in places.

I got stuck resolving a mypi issue. If you are okay with the direction I am going, I can fix the unit tests.

@tomvothecoder
Copy link
Collaborator

Hi @pochedls, I'm taking a look right now. I'll also rebase this branch on the latest main.

add unit test

add comments for test, cleanup extraneous code

initial work on #282

Bugfix/278 cannot generate bounds (#281)

* initial solution for #278

* add unit test

* add comments for test, cleanup extraneous code
@tomvothecoder tomvothecoder force-pushed the bugfix/282-OutOfBoundsDatetime branch from 26ced7f to 3050a8d Compare July 28, 2022 18:08
Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @pochedls, I reviewed your changes and rebased the branch on the latest main.

There are some outdated review comments because I ended up doing some refactoring since I was debugging directly on your branch anyways.

The next thing to for you to do to review my changes and fix the tests.

xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
.pre-commit-config.yaml Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
- Add `types-python-dateutil` to `mypy` dependencies
- Update `ref_date` var to `ref_dt_obj` to avoid mypy error `error: Unsupported operand types for + ("str" and "relativedelta")`
- Use dictionary unpacking for units variable
- Use `datetime.strptime` instead of `pd.datetime()` which runs into the `pd.Timestamp` limitation
- Add logger.warning when non-CF compliant time coords cannot be decoded
@tomvothecoder tomvothecoder force-pushed the bugfix/282-OutOfBoundsDatetime branch from 5a2dcc9 to a681787 Compare July 28, 2022 20:16
@tomvothecoder tomvothecoder added type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors. Priority: High labels Jul 28, 2022
xcdat/dataset.py Outdated Show resolved Hide resolved
@codecov-commenter
Copy link

codecov-commenter commented Jul 28, 2022

Codecov Report

Merging #283 (c0d8703) into main (2169b07) will not change coverage.
The diff coverage is 100.00%.

@@            Coverage Diff            @@
##              main      #283   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files           14        14           
  Lines         1146      1169   +23     
=========================================
+ Hits          1146      1169   +23     
Impacted Files Coverage Δ
xcdat/dataset.py 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2169b07...c0d8703. Read the comment docs.

@tomvothecoder tomvothecoder force-pushed the bugfix/282-OutOfBoundsDatetime branch from 1320b50 to a2868e8 Compare July 28, 2022 22:29
xcdat/dataset.py Outdated Show resolved Hide resolved
@tomvothecoder tomvothecoder force-pushed the bugfix/282-OutOfBoundsDatetime branch from a2868e8 to 4c33bf9 Compare July 28, 2022 22:32
Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My second pass, which includes test fixes and additional refactoring.

tests/test_dataset.py Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
@pochedls
Copy link
Collaborator Author

I modified decode_non_cf_time to return more specific cftime.Datetime objects (e.g, cftime.Datetime360Day and cftime.DatetimeGregorian), which seems to be what xarray does:

import xarray as xr

fn = '/p/css03/esgf_publish/CMIP6/CMIP/MOHC/UKESM1-1-LL/historical/r1i1p1f2/Amon/tas/gn/v20220512/tas_Amon_UKESM1-1-LL_historical_r1i1p1f2_gn_185001-194912.nc'
print(ds.time)

<xarray.DataArray 'time' (time: 1200)>
array([cftime.Datetime360Day(1850, 1, 16, 0, 0, 0, 0, has_year_zero=True),
cftime.Datetime360Day(1850, 2, 16, 0, 0, 0, 0, has_year_zero=True),
cftime.Datetime360Day(1850, 3, 16, 0, 0, 0, 0, has_year_zero=True), ...,
cftime.Datetime360Day(1949, 10, 16, 0, 0, 0, 0, has_year_zero=True),
cftime.Datetime360Day(1949, 11, 16, 0, 0, 0, 0, has_year_zero=True),
cftime.Datetime360Day(1949, 12, 16, 0, 0, 0, 0, has_year_zero=True)],
dtype=object)
Coordinates:

  • time (time) object 1850-01-16 00:00:00 ... 1949-12-16 00:00:00
    height float64 ...
    Attributes:
    bounds: time_bnds
    axis: T
    long_name: time
    standard_name: time

This also fixes some issues I had in subsetting. I had been using the file /p/user_pub/climate_work/pochedley1/cmip6_msu/ttt_Amon_UKESM1-0-LL_historical_r10i1p1f2_gr_185001-201412.nc. In applying ds.sel(time=('1850-01-01', '2014-12-31'), I got the following error:

TypeError: cannot compare cftime.datetime(1932, 7, 1, 0, 0, 0, 0, calendar='360_day', has_year_zero=True) and cftime.datetime(1850, 1, 1, 0, 0, 0, 0, calendar='standard', has_year_zero=False)

I could fix this by using cftime object in my .sel statement, but this now works with the string/slice arguments (because the time vector has the appropriate cftime types).

- Move try and except statements into `decode_non_cf_time()`
- Extract function `_get_cftime_coords()` to encapsulate related logic from `decode_non_cf_time()`
Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @pochedls, here's another pass of refactoring. I think we're in good shape now.

Along with xarray's get_date_type() that you cited, xarray also has convert_times() that we can use. After refactoring with these methods, the existing tests still passed.

Here's a diff of your last commit and this one:
3e9d09c (#283)

Side-note: I am going to try silence the logger warnings in the test suite.

Comment on lines +707 to +708
date_type = get_date_type(calendar)
coords = convert_times(offsets, date_type=date_type)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of implementing our own logic, we can reuse xarray's get_date_type() (you cited this one) and convert_times() methods.

xarray.coding.cftime_offsets.get_date_type()
xarray.coding.times.convert_times()

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't mind this. I initially used the xr function, but then I wasn't sure if we wanted/needed their _is_standard_calendar check. I guess this will never run, because we set use_cftime = True.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think convert_times is hitting problems for years far into the future (convert_times is based on pandas):

import xcdat

fn = '/p/user_pub/climate_work/pochedley1/cmip6_msu/spliced/ttt_Amon_MRI-ESM2-0_historical-ssp585_r1i1p1f1_gn_185001-230012.nc'
ds = xcdat.open_dataset(fn)

File ~/code/xc

dat/xcdat/dataset.py:719, in _get_cftime_coords(ref_date, offsets, calendar, units)
716 # Convert the array of datetime objects into cftime objects based on
717 # the calendar type.
718 date_type = get_date_type(calendar)
--> 719 coords = convert_times(offsets, date_type=date_type)
721 return coords

File ~/bin/anaconda3/envs/xcdat_dev/lib/python3.9/site-packages/xarray/coding/times.py:451, in convert_times(times, date_type, raise_on_invalid)
448 return cftime_to_nptime(times, raise_on_invalid=raise_on_invalid)
449 if is_np_datetime_like(times.dtype):
450 # Convert datetime64 objects to Timestamps since those have year, month, day, etc. attributes
--> 451 times = pd.DatetimeIndex(times)
452 new = np.empty(times.shape, dtype="O")
453 for i, t in enumerate(times):

.
.
.

OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-01 00:00:00

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refer to my suggestion to update the dtype of the offsets np.ndarray from datetime64 to object.

I validated this fix using your example.

xcdat/dataset.py Show resolved Hide resolved
xcdat/dataset.py Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated Show resolved Hide resolved
xcdat/dataset.py Outdated
ref_datetime: datetime = parser.parse(ref_date, default=datetime(2000, 1, 1))
offsets = np.array(
[ref_datetime + rd.relativedelta(**{units: offset}) for offset in offsets],
dtype="datetime64",
Copy link
Collaborator

@tomvothecoder tomvothecoder Aug 2, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
dtype="datetime64",
dtype="object",

In the convert_times() method, there is a conditional that checks if the offsets np.ndarray dtype is_np_datetime_like (which then converts to pd.DatetimeIndex, and we don't want that due to the Timestamp limitations).

Source code: https://github.com/pydata/xarray/blob/3c98ec7d96cc4b46664850cc7a40af2bc184fea0/xarray/coding/times.py#L454-L460

Instead, by setting the dtype=object, it returns an np.ndarray of cftime objects.

Source code: https://github.com/pydata/xarray/blob/3c98ec7d96cc4b46664850cc7a40af2bc184fea0/xarray/coding/times.py#L461-L476

Copy link
Collaborator

@tomvothecoder tomvothecoder left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unless you find any more issues, we should be good to merge.

@pochedls pochedls merged commit 87790f8 into main Aug 2, 2022
@pochedls pochedls deleted the bugfix/282-OutOfBoundsDatetime branch August 2, 2022 23:59
@tomvothecoder tomvothecoder changed the title Bugfix/282 out of bounds datetime Fix out of bounds datetime due to pandas Timestamp limitation with cftime Aug 17, 2022
@tomvothecoder tomvothecoder changed the title Fix out of bounds datetime due to pandas Timestamp limitation with cftime Fix out of bounds datetime due to pandas Timestamp limitation by using cftime Aug 17, 2022
@tomvothecoder tomvothecoder changed the title Fix out of bounds datetime due to pandas Timestamp limitation by using cftime Fix out of bounds datetime when opening dataset due to pandas Timestamp limitation by using cftime Aug 17, 2022
@tomvothecoder tomvothecoder changed the title Fix out of bounds datetime when opening dataset due to pandas Timestamp limitation by using cftime Use cftime to avoid out of bounds datetime when decoding non-CF time coordinates Aug 17, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type: bug Inconsistencies or issues which will cause an issue or problem for users or implementors.
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

[Bug]: OutOfBoundsDatetime: Out of bounds nanosecond timestamp: 2262-05-01 00:00:00
3 participants