
Long term solution for limited permissible dates due to np.datetime64 nanosecond resolution #98

Closed
spencerahill opened this issue Oct 24, 2016 · 11 comments


@spencerahill
Owner

E.g. #96. This may be a wait-and-see for upstream solutions (at the xarray, pandas, or numpy level), but I wanted to have an Issue open where we can track this and discuss alternatives.

Our current workarounds limit us to fewer than 585 years of data, which for many folks in the community would be wholly insufficient.

@spencerkclark
Collaborator

spencerkclark commented Oct 25, 2016

It seems this will likely not change anytime soon in pandas. In the meantime, the workaround suggested in their docs is to use pd.Period for dates outside the valid range. On the topic of alternatives, here is a summary of the existing options (as far as I can tell, none satisfies all of our requirements).
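For reference, a Period stores an integer count of periods at a fixed frequency rather than a count of nanoseconds, so dates well outside the Timestamp-valid range are representable (shown with the component-wise constructor, since string parsing of such dates can be unreliable, as noted further down):

```python
import pandas as pd

# A Period is an integer ordinal at a fixed frequency (here daily),
# so it is not constrained by the ns-resolution Timestamp bounds.
p = pd.Period(year=1, month=1, day=15, freq='D')

# The calendar components needed for time-related grouping are available:
print(p.year, p.month, p.day)  # 1 1 15
```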

Xarray's Workaround

As we know, xarray doesn't use the pandas-suggested workaround; instead, it uses netCDF4.datetime objects in this situation. The xarray solution is a compromise: one cannot do time-related groupby operations when netCDF4.datetime objects populate the time variable. E.g.

In [1]: import xarray as xr

In [2]: da = xr.DataArray([15, 45, 74], coords=[[15, 45, 74]], dims=['time'], name='a')

In [3]: da['time'].attrs['units'] = 'days since 0001-01-01'

In [4]: da = da.to_dataset()

In [5]: da = xr.decode_cf(da)
//anaconda/lib/python2.7/site-packages/xarray/conventions.py:377: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
  result = decode_cf_datetime(example_value, units, calendar)

In [6]: da.groupby('time.month').mean('time')
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-b8680c841459> in <module>()
----> 1 da.groupby('time.month').mean('time')

//anaconda/lib/python2.7/site-packages/xarray/core/common.pyc in groupby(self, group, squeeze)
    342         """
    343         if isinstance(group, basestring):
--> 344             group = self[group]
    345         return self.groupby_cls(self, group, squeeze=squeeze)
    346

//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in __getitem__(self, key)
    518
    519         if hashable(key):
--> 520             return self._construct_dataarray(key)
    521         else:
    522             return self._copy_listed(np.asarray(key))

//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in _construct_dataarray(self, name)
    467             variable = self._variables[name]
    468         except KeyError:
--> 469             _, name, variable = _get_virtual_variable(self._variables, name)
    470
    471         coords = OrderedDict()

//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in _get_virtual_variable(variables, key)
     60         data = seasons[(month // 3) % 4]
     61     else:
---> 62         data = getattr(date, var_name)
     63     return ref_name, var_name, Variable(ref_var.dims, data)
     64

AttributeError: 'Index' object has no attribute 'month'

This is critical in our use case, because we would like to be able to compute monthly and seasonal averages. That said, subsetting data using time slices works great under the netCDF4.datetime regime in xarray. E.g.

In [7]: from netcdftime._datetime import datetime

In [8]: da.sel(time=slice(datetime(1, 1, 1), datetime(1, 2, 1)))
Out[8]:
<xarray.Dataset>
Dimensions:  (time: 1)
Coordinates:
  * time     (time) object    1-01-16 00:00:00
Data variables:
    a        (time) int64 15

Serialization fails when a Dataset is indexed by netcdftime._datetime.datetime objects:

In [9]: da.to_netcdf('test_netcdftime.nc')
//anaconda/lib/python2.7/site-packages/xarray/conventions.py:396: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
  calendar=self.calendar)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-9-efe2b453be6e> in <module>()
----> 1 da.to_netcdf('test_netcdftime.nc')

//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in to_netcdf(self, path, mode, format, group, engine, encoding)
    780         from ..backends.api import to_netcdf
    781         return to_netcdf(self, path, mode, format=format, group=group,
--> 782                          engine=engine, encoding=encoding)
    783
    784     dump = utils.function_alias(to_netcdf, 'dump')

//anaconda/lib/python2.7/site-packages/xarray/backends/api.pyc in to_netcdf(dataset, path, mode, format, group, engine, writer, encoding)
    352     store = store_cls(path, mode, format, group, writer)
    353     try:
--> 354         dataset.dump_to_store(store, sync=sync, encoding=encoding)
    355         if isinstance(path, BytesIO):
    356             return path.getvalue()

//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in dump_to_store(self, store, encoder, sync, encoding)
    726             variables, attrs = encoder(variables, attrs)
    727
--> 728         store.store(variables, attrs, check_encoding)
    729         if sync:
    730             store.sync()

//anaconda/lib/python2.7/site-packages/xarray/backends/common.pyc in store(self, variables, attributes, check_encoding_set)
    230         # All NetCDF files get CF encoded by default, without this attempting
    231         # to write times, for example, would fail.
--> 232         cf_variables, cf_attrs = cf_encoder(variables, attributes)
    233         AbstractWritableDataStore.store(self, cf_variables, cf_attrs,
    234                                         check_encoding_set)

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in cf_encoder(variables, attributes)
   1058     """
   1059     new_vars = OrderedDict((k, encode_cf_variable(v, name=k))
-> 1060                            for k, v in iteritems(variables))
   1061     return new_vars, attributes

//anaconda/python.app/Contents/lib/python2.7/collections.pyc in __init__(*args, **kwds)
     67             root[:] = [root, root, None]
     68             self.__map = {}
---> 69         self.__update(*args, **kwds)
     70
     71     def __setitem__(self, key, value, dict_setitem=dict.__setitem__):

//anaconda/python.app/Contents/lib/python2.7/_abcoll.pyc in update(*args, **kwds)
    569                     self[key] = other[key]
    570             else:
--> 571                 for key, value in other:
    572                     self[key] = value
    573         for key, value in kwds.items():

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in <genexpr>((k, v))
   1058     """
   1059     new_vars = OrderedDict((k, encode_cf_variable(v, name=k))
-> 1060                            for k, v in iteritems(variables))
   1061     return new_vars, attributes

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in encode_cf_variable(var, needs_copy, name)
    712     var, needs_copy = maybe_encode_offset_and_scale(var, needs_copy)
    713     var, needs_copy = maybe_encode_fill_value(var, needs_copy)
--> 714     var = maybe_encode_dtype(var, name)
    715     var = maybe_encode_bools(var)
    716     var = ensure_dtype_not_object(var)

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in maybe_encode_dtype(var, name)
    621                                   'any _FillValue to use for NaNs' % name,
    622                                   RuntimeWarning, stacklevel=3)
--> 623                 data = ops.around(data)[...]
    624             if dtype == 'S1' and data.dtype != 'S1':
    625                 data = string_to_char(np.asarray(data, 'S'))

//anaconda/lib/python2.7/site-packages/xarray/core/ops.pyc in f(*args, **kwargs)
     62             else:
     63                 module = eager_module
---> 64             return getattr(module, name)(*args, **kwargs)
     65     else:
     66         def f(data, *args, **kwargs):

//anaconda/lib/python2.7/site-packages/numpy/core/fromnumeric.pyc in around(a, decimals, out)
   2766     except AttributeError:
   2767         return _wrapit(a, 'round', decimals, out)
-> 2768     return round(decimals, out)
   2769
   2770

AttributeError: 'netcdftime._datetime.datetime' object has no attribute 'rint'

Using a PeriodIndex in xarray

This raises the question: what happens if we use the pandas-recommended workaround in xarray? Does the groupby issue persist? Interestingly, the answer is no -- we can indeed do time-related groupby operations on dates outside the Timestamp-valid range using a PeriodIndex in xarray:

In [1]: import xarray as xr

In [2]: import pandas as pd

In [3]: da = xr.DataArray([1, 3, 2, 5], coords=[[15, 16, 45, 74]], dims=['time'], name='a')

In [4]: da['time'].attrs['units'] = 'days since 0001-01-01'

In [5]: da = da.to_dataset()

In [6]: da = xr.decode_cf(da)
//anaconda/lib/python2.7/site-packages/xarray/conventions.py:377: RuntimeWarning: Unable to decode time axis into full numpy.datetime64 objects, continuing using dummy netCDF4.datetime objects instead, reason: dates out of range
  result = decode_cf_datetime(example_value, units, calendar)

In [7]: periods = []

In [8]: for date in da['time']:
    ...:     raw_date = date.values.item()
    ...:     periods.append(pd.Period(year=raw_date.year, month=raw_date.month, day=raw_date.day, freq='D'))
    ...:

In [9]: da['time'] = pd.PeriodIndex(periods)

In [10]: da.groupby('time.month').mean('time')
Out[10]:
<xarray.Dataset>
Dimensions:  (month: 3)
Coordinates:
  * month    (month) int64 1 2 3
Data variables:
    a        (month) float64 2.0 2.0 5.0

Critically though, subsetting based on time slices no longer works.

In [11]: da.sel(time=slice(pd.Period(year=1, month=1, day=1, freq='D'), pd.Period(year=1, month=2, d
    ...: ay=1, freq='D')))
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
<ipython-input-11-123d7c0d656c> in <module>()
----> 1 da.sel(time=slice(pd.Period(year=1, month=1, day=1, freq='D'), pd.Period(year=1, month=2, day=1, freq='D')))

//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in sel(self, method, tolerance, **indexers)
    967         """
    968         pos_indexers, new_indexes = indexing.remap_label_indexers(
--> 969             self, indexers, method=method, tolerance=tolerance
    970         )
    971         return self.isel(**pos_indexers)._replace_indexes(new_indexes)

//anaconda/lib/python2.7/site-packages/xarray/core/indexing.pyc in remap_label_indexers(data_obj, indexers, method, tolerance)
    227         index = data_obj[dim].to_index()
    228         idxr, new_idx = convert_label_indexer(index, label,
--> 229                                               dim, method, tolerance)
    230         pos_indexers[dim] = idxr
    231         if new_idx is not None:

//anaconda/lib/python2.7/site-packages/xarray/core/indexing.pyc in convert_label_indexer(index, label, index_name, method, tolerance)
    169         indexer = index.slice_indexer(_try_get_item(label.start),
    170                                       _try_get_item(label.stop),
--> 171                                       _try_get_item(label.step))
    172         if not isinstance(indexer, slice):
    173             # unlike pandas, in xarray we never want to silently convert a slice

//anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in slice_indexer(self, start, end, step, kind)
   2783         """
   2784         start_slice, end_slice = self.slice_locs(start, end, step=step,
-> 2785                                                  kind=kind)
   2786
   2787         # return a slice

//anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in slice_locs(self, start, end, step, kind)
   2962         start_slice = None
   2963         if start is not None:
-> 2964             start_slice = self.get_slice_bound(start, 'left', kind)
   2965         if start_slice is None:
   2966             start_slice = 0

//anaconda/lib/python2.7/site-packages/pandas/indexes/base.pyc in get_slice_bound(self, label, side, kind)
   2901         # For datetime indices label may be a string that has to be converted
   2902         # to datetime boundary according to its resolution.
-> 2903         label = self._maybe_cast_slice_bound(label, side, kind)
   2904
   2905         # we need to look up the label

//anaconda/lib/python2.7/site-packages/pandas/tseries/period.pyc in _maybe_cast_slice_bound(self, label, side, kind)
    724
    725         """
--> 726         assert kind in ['ix', 'loc', 'getitem']
    727
    728         if isinstance(label, datetime):

AssertionError:

Serialization also fails for Datasets indexed by a PeriodIndex:

In [12]: da.to_netcdf('test_periodindex.nc')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-12-318efb0010c7> in <module>()
----> 1 da.to_netcdf('test_periodindex.nc')

//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in to_netcdf(self, path, mode, format, group, engine, encoding)
    780         from ..backends.api import to_netcdf
    781         return to_netcdf(self, path, mode, format=format, group=group,
--> 782                          engine=engine, encoding=encoding)
    783
    784     dump = utils.function_alias(to_netcdf, 'dump')

//anaconda/lib/python2.7/site-packages/xarray/backends/api.pyc in to_netcdf(dataset, path, mode, format, group, engine, writer, encoding)
    352     store = store_cls(path, mode, format, group, writer)
    353     try:
--> 354         dataset.dump_to_store(store, sync=sync, encoding=encoding)
    355         if isinstance(path, BytesIO):
    356             return path.getvalue()

//anaconda/lib/python2.7/site-packages/xarray/core/dataset.pyc in dump_to_store(self, store, encoder, sync, encoding)
    726             variables, attrs = encoder(variables, attrs)
    727
--> 728         store.store(variables, attrs, check_encoding)
    729         if sync:
    730             store.sync()

//anaconda/lib/python2.7/site-packages/xarray/backends/common.pyc in store(self, variables, attributes, check_encoding_set)
    230         # All NetCDF files get CF encoded by default, without this attempting
    231         # to write times, for example, would fail.
--> 232         cf_variables, cf_attrs = cf_encoder(variables, attributes)
    233         AbstractWritableDataStore.store(self, cf_variables, cf_attrs,
    234                                         check_encoding_set)

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in cf_encoder(variables, attributes)
   1058     """
   1059     new_vars = OrderedDict((k, encode_cf_variable(v, name=k))
-> 1060                            for k, v in iteritems(variables))
   1061     return new_vars, attributes

//anaconda/python.app/Contents/lib/python2.7/collections.pyc in __init__(*args, **kwds)
     67             root[:] = [root, root, None]
     68             self.__map = {}
---> 69         self.__update(*args, **kwds)
     70
     71     def __setitem__(self, key, value, dict_setitem=dict.__setitem__):

//anaconda/python.app/Contents/lib/python2.7/_abcoll.pyc in update(*args, **kwds)
    569                     self[key] = other[key]
    570             else:
--> 571                 for key, value in other:
    572                     self[key] = value
    573         for key, value in kwds.items():

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in <genexpr>((k, v))
   1058     """
   1059     new_vars = OrderedDict((k, encode_cf_variable(v, name=k))
-> 1060                            for k, v in iteritems(variables))
   1061     return new_vars, attributes

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in encode_cf_variable(var, needs_copy, name)
    714     var = maybe_encode_dtype(var, name)
    715     var = maybe_encode_bools(var)
--> 716     var = ensure_dtype_not_object(var)
    717     return var
    718

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in ensure_dtype_not_object(var)
    683             data[missing] = fill_value
    684         else:
--> 685             data = data.astype(dtype=_infer_dtype(data))
    686         var = Variable(dims, data, attrs, encoding)
    687     return var

//anaconda/lib/python2.7/site-packages/xarray/conventions.pyc in _infer_dtype(array)
    653             dtype = np.dtype(dtype.kind)
    654         elif dtype.kind == 'O':
--> 655             raise ValueError('unable to infer dtype; xarray cannot '
    656                              'serialize arbitrary Python objects')
    657     return dtype

ValueError: unable to infer dtype; xarray cannot serialize arbitrary Python objects

The constructor for pd.Period is also a little flaky for unusual dates (for instance, below it parses the date as year 0016 when it should be year 0001), which makes me a bit wary:

In [13]: pd.Period('0001-01-16')
Out[13]: Period('0016-01-01', 'D')

One can work around this, however, by being more explicit in the constructor (as I did above):

In [14]: pd.Period(year=1, month=1, day=16, freq='D')
Out[14]: Period('0001-01-16', 'D')

Summary

To conclude, I do not believe there is an existing workaround other than what we have currently been doing (i.e. offsetting dates to put things in the pandas Timestamp-valid range) that grants us both time-related groupby operations AND date-range subsetting. I think the lowest-hanging fruit upstream that would enable a cleaner workaround might be fixing time subsetting with a PeriodIndex.

That being said, on the topic of upstream fixes, I would prefer the CustomDatetimeIndex approach suggested in this thread in xarray, where we could potentially build in support for custom calendars (and solve the date-range issue all at once). At least in the work I'm currently doing, noleap calendars are ubiquitous, and using datetime objects designed for the Gregorian calendar always makes me a little uncomfortable (even though, as xarray is set up, the calendar type is taken into account, at least when CF attributes are decoded and datetime objects are created).

@spencerkclark spencerkclark mentioned this issue Oct 25, 2016
@spencerahill
Owner Author

@spencerkclark, thank you! That is extremely useful. I agree with you.

My hope is that one outcome of the upcoming "other aospy" workshop will be meaningful progress on this issue at the xarray level. So more wait-and-see.

@spencerkclark
Collaborator

An alternative to our existing workaround would be to combine these two approaches: subset in time first (using netcdftime._datetime.datetime or datetime.datetime objects), and then, if dates were outside the Timestamp-valid range, convert the time index to a PeriodIndex to allow for groupby operations. This would bypass any offsetting.

@spencerahill
Owner Author

Very intrigued by this. So would there then be a post-calculation step that converts back from PeriodIndexes to the original type? Some of the functions I use rely on groupby internally, so for them to work the conversion would need to occur before the data are passed to the user's functions.

Other things that come to mind: Can we assume that groupby operations will work identically for PeriodIndexes as for the other types? Less importantly (since we can easily modify our own code), is there code of ours within aospy (i.e. not aospy-obj-lib functions) that will break/behave differently with PeriodIndexes?

If this did work, it seems like something that xarray would maybe even consider adopting into their code base.

@spencerkclark
Collaborator

So then would there be a post-calculation step that converts back from PeriodIndexes to the original type?

Upon a little more tinkering, it seems that one cannot serialize a Dataset indexed by either netcdftime._datetime.datetime objects or a pd.PeriodIndex (see my updated initial comment). That is a bit inconvenient, but perhaps not insurmountable. It's messy, but we could carry around the original undecoded time coordinate (i.e. the sequence of floats or integers representing some units since a reference date, as stored in the original netCDF file). This undecoded coordinate would be indexed by the decoded time coordinate, so after any time-reduction operations the appropriate undecoded times would remain in the DataArray or Dataset (note there could be issues here that I haven't thought through). Then, just before serializing to a netCDF file, we would drop the decoded time coordinate and reindex so that the undecoded coordinate became the true time coordinate.
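The bookkeeping could look something like the following pandas-only sketch (hypothetical; the real implementation would operate on xarray coordinates rather than a DataFrame):

```python
import pandas as pd

# Undecoded CF times as they appear in the file: "days since 0001-01-01"
raw_time = [15.0, 45.0, 74.0]

# Decoded counterparts, represented here as a daily PeriodIndex
decoded = pd.PeriodIndex([pd.Period(year=1, month=m, day=d, freq='D')
                          for m, d in [(1, 16), (2, 15), (3, 16)]])

# Carry the raw values alongside the data, indexed by the decoded times
ds = pd.DataFrame({'a': [15, 45, 74], 'raw_time': raw_time}, index=decoded)

# Time selection keeps the matching raw values automatically
january = ds[ds.index.month == 1]

# Just before serialization, restore the undecoded coordinate
out = january.set_index('raw_time')
```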

Can we assume that groupby operations will work identically for PeriodIndexes as for the other types?

As far as I can tell, yes (since like DatetimeIndexes, PeriodIndexes have year, month, and day attributes).
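A quick check of those accessors on out-of-range dates:

```python
import pandas as pd

# A PeriodIndex exposes the same year/month/day accessors that
# DatetimeIndex-based groupby operations rely on.
idx = pd.PeriodIndex([pd.Period(year=1, month=m, day=1, freq='D')
                      for m in (1, 2, 12)])
print(list(idx.year), list(idx.month))
```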

Less importantly (since we can easily modify our own code), is there code of ours within aospy (i.e. not aospy-obj-lib functions) that will break/behave differently with PeriodIndexes?

The only thing I'm aware of at the moment is serialization (which is, admittedly, kind of important), but maybe we can find a workaround.

@spencerahill
Owner Author

What are the prospects of upstream fixes on the netCDF4 side, either to (1) the lack of groupby support or (2) the inability to serialize?

@spencerahill
Owner Author

So, if I'm understanding right, the procedure would be:

  1. Determine if the date range exceeds the xarray permissible range. If yes, copy (somehow) the original time data and proceed with all subsequent steps. If no, just do steps 3, 5, and 7.
  2. Convert the time array to datetime.datetime or netcdftime._datetime.datetime.
  3. Subset the data array in time to the desired times.
  4. Convert the time array to an array of PeriodIndexes.
  5. Perform the desired computation and any subsequent time reductions.
  6. Convert back to the original time datatype using the copy saved in step 1.
  7. Serialize to netCDF files.

Is that right?

One potential snag is that, conceivably, a user's function could involve slicing on the time array (which, for out-of-range dates, requires datetime-like objects), groupby operations on the time array (which require Periods), or both (for which there is no good solution). But I'm not sure how much this matters.
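Steps 3-5 of the procedure above can be sketched in plain Python plus pandas (subset_then_group is a hypothetical function; a real version would operate on DataArrays):

```python
from datetime import datetime

import pandas as pd

def subset_then_group(values, dates, start, end):
    """Hypothetical sketch of steps 3-5: subset with datetime-like
    comparisons, convert to a PeriodIndex, then group by month."""
    # Step 3: subset using datetime comparisons, which work fine
    # outside the Timestamp-valid range.
    kept = [(v, d) for v, d in zip(values, dates) if start <= d <= end]
    # Step 4: convert the surviving dates to a PeriodIndex.
    idx = pd.PeriodIndex([pd.Period(year=d.year, month=d.month,
                                    day=d.day, freq='D')
                          for _, d in kept])
    s = pd.Series([v for v, _ in kept], index=idx)
    # Step 5: time-related grouping now works via the Period accessors.
    return s.groupby(s.index.month).mean()
```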

@spencerkclark
Collaborator

Contributors to netCDF4, xarray, and pandas would know better than I do, but...

  1. My understanding is that, at the moment, groupby support in either pandas or xarray would have to come from an externally defined custom pd.Index subclass based on netcdftime._datetime.datetime objects (this also seems outside the scope of the netCDF4 library).
  2. My sense is that serialization should be more tractable -- I think it comes down to how xarray encodes the vector of netcdftime._datetime.datetime objects as a vector of floats with a CF-compliant units attribute. It's possible this could already be done with some modifications to xarray; moreover, I believe netcdftime._datetime.datetime objects are being improved as we speak within netCDF4 (to add timedelta arithmetic support), which can't hurt either.
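The encoding itself is straightforward arithmetic once the reference date is known; a minimal Gregorian-calendar sketch (encode_cf_days is a hypothetical helper; real code would dispatch on the CF calendar attribute):

```python
from datetime import datetime

def encode_cf_days(dates, epoch):
    """Hypothetical: encode datetime-like objects as floating-point
    'days since <epoch>' values, as a CF-compliant time variable expects."""
    return [(d - epoch).total_seconds() / 86400.0 for d in dates]

# e.g. encode_cf_days([datetime(1, 1, 16)], datetime(1, 1, 1)) -> [15.0]
```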

@spencerkclark
Collaborator

Determine if the date range exceeds the xarray permissible range. If yes, copy (somehow) the original time data and proceed with all subsequent steps. If no, just do steps 3, 5, and 7....

Yes, that's exactly the procedure I had in mind -- and if serialization gets fixed, we won't have to worry about copying / storing the raw time vector.

One potential snag is that, conceivably, a user's function could involve slicing on the time array (which, for out-of-range dates, would require datetime-like), groupby operations on time array (requires Periods), or both (no good solution). But I'm not sure how much this matters.

There is a messy workaround for this (though really not ideal...), which is to carry all three representations of the time coordinate with every DataArray and use each for its appropriate purpose.

@spencerahill
Owner Author

There is a messy workaround for this (though really not ideal...), which is to carry all three representations of the time coordinate with every DataArray and use each for its appropriate purpose.

Ha, that's true! We'll keep this in our back pocket as our "nuclear option".

@spencerahill
Owner Author

My understanding is that at the moment, groupby support in either pandas or xarray would come from outside either library through an externally defined custom pd.Index subclass that was based on netcdftime._datetime.datetime objects (this is also something that seems outside the scope of the netCDF4 library).

I guess what I'm wondering is: what is it about the netCDF datetime objects that precludes groupby operations? The workshop this weekend should shed light on this.
