Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

virtual variables not available when using open_dataset #121

Closed
jhamman opened this issue May 11, 2014 · 5 comments
Closed

virtual variables not available when using open_dataset #121

jhamman opened this issue May 11, 2014 · 5 comments

Comments

@jhamman
Copy link
Member

jhamman commented May 11, 2014

The tutorial provides an example of how to use xray's virtual_variables. The same functionality is not availble from a Dataset object created by open_dataset.

Tutorial:

In [135]:
foo_values = np.random.RandomState(0).rand(3, 4)
times = pd.date_range('2000-01-01', periods=3)
ds = xray.Dataset({'time': ('time', times),
                   'foo': (['time', 'space'], foo_values)})
ds['time.dayofyear']

Out[135]:
<xray.DataArray 'time.dayofyear' (time: 3)>
array([1, 2, 3], dtype=int32)
Attributes:
    Empty

however, reading a time coordinate / variable from a netCDF4 file, and applying the same logic raises an error:

In [136]:
ds = xray.open_dataset('sample_for_xray.nc')
ds['time']

Out[136]:
<xray.DataArray 'time' (time: 4)>
array([1979-09-16 12:00:00, 1979-10-17 00:00:00, 1979-11-16 12:00:00,
       1979-12-17 00:00:00], dtype=object)
Attributes:
    dimensions: 1
    long_name: time
    type_preferred: int

In [137]:
ds['time.dayofyear']

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-137-bfe1ae778782> in <module>()
----> 1 ds['time.dayofyear']

/Users/jhamman/anaconda/lib/python2.7/site-packages/xray-0.2.0.dev_cc5e1b2-py2.7.egg/xray/dataset.pyc in __getitem__(self, key)
    408         """Access the given variable name in this dataset as a `DataArray`.
    409         """
--> 410         return data_array.DataArray._constructor(self, key)
    411 
    412     def __setitem__(self, key, value):

/Users/jhamman/anaconda/lib/python2.7/site-packages/xray-0.2.0.dev_cc5e1b2-py2.7.egg/xray/data_array.pyc in _constructor(cls, dataset, name)
     95         if name not in dataset and name not in dataset.virtual_variables:
     96             raise ValueError('name %r must be a variable in dataset %r'
---> 97                              % (name, dataset))
     98         obj._dataset = dataset
     99         obj._name = name

ValueError: name 'time.dayofyear' must be a variable in dataset <xray.Dataset>
Dimensions:     (time: 4, x: 275, y: 205)
Coordinates:
    time            X                   
    x                        X          
    y                                X  
Noncoordinates:
    Wind            0        2       1  
Attributes:

sample data for xray from RASM project

Is there a reason that the virtual time variables are only available if the dataset is created from a pandas date_range? Lastly, this could be related to the #118 .

@shoyer
Copy link
Member

shoyer commented May 12, 2014

Yes, this is certainly related to #118. Virtual variables work by using pandas.DatetimeIndex methods, but if you're not using a standard calendar, you end up with an object array of netCDF4.datetime objects instead of an array of numpy.datetime64 objects (which can be turned into a DatetimeIndex).

Unfortunately, we do need to be able to make a DatetimeIndex to be able to use its (very quick) calculations for properties like year. The alternative is to write our own implementation, which would likely mean far slower pure-python code. We could also write a function to cast an array into a DatetimeIndex from datetime objects, which I'm guessing would be your preferred solution, even though there are issues like the difference between dates, as DatetimeIndex objects and numpy's datetime64 always assume a standard gregorian calendar.

@jhamman
Copy link
Member Author

jhamman commented May 12, 2014

I think the simplest option would be to develop a function to cast the netCDF4.datetime objects into an array of numpy.datetime64 objects.

Does xray do a lot of timedelta operations? If not, I'd say its probably best to live with (for now) the discrepancies between the calendars. Just be sure that you always use the netCDF4.date2num when going back to ordinal times.

I've run into issues like this repeatedly and I think it would be really nice if the datetime module was calendar aware -- probably not going to happen. The closest thing I've found to this is the netCDF4 netcdftime.py which is written in pure python but could be expanded to handle the issues we're talking about.

@shoyer
Copy link
Member

shoyer commented May 12, 2014

Timedelta operations are used in exactly one place in xray: speeding up decoding of dates from netCDF if a standard calendar is being used. Otherwise, that sort of stuff is left up to the user.

If dates with non-standard calendars can generally be most usefully expressed as a pandas.DatetimeIndex, then let's go ahead and default to decoding them into datetime64 arrays. The relevant function to modify is here (see also here) if you'd like to make a pull request!

@jhamman
Copy link
Member Author

jhamman commented May 12, 2014

Ok, I just spent a few minutes working through a possible (although not ideal) solution for this. It works although it is a bit ugly and quite a bit slower than the standard calendar option. This option returns a numpy array with a dtype: datetime64[ns]. The downside is that it uses a datetime object as an intermediary.

In [1]:
import pandas as pd
from netCDF4 import num2date, date2num
import datetime
import numpy as np
from xray.conventions import decode_cf_datetime as decode

units = 'days since 0001-01-01'

# pandas time range
times = pd.date_range('2001-01-01-00', end='2001-06-30-23', freq='H')

# numpy array of numeric dates on noleap calendar
noleap_time = date2num(times.to_pydatetime(), units, calendar='noleap')

# numpy array of numeric dates on standard calendar
std_time = date2num(times.to_pydatetime(), units, calendar='standard')

# decoding function using datetime intermediary
def nctime_to_nptime(times):
    new = np.empty(len(times), dtype='M8[ns]')
    for i, t in enumerate(times):
        new[i] = np.datetime64(datetime.datetime(*t.timetuple()[:6]))
    return new

In [2]:
# decode noleap_time 
%timeit nctime_to_nptime(decode(noleap_time, units, calendar='noleap'))
noleap_datetimes = nctime_to_nptime(num2date(noleap_time, units, calendar='noleap'))
print 'dtype:', noleap_datetimes.dtype
print noleap_datetimes
10 loops, best of 3: 38.8 ms per loop
dtype: datetime64[ns]
['2000-12-31T16:00:00.000000000-0800' '2000-12-31T17:00:00.000000000-0800'
 '2000-12-31T18:00:00.000000000-0800' ...,
 '2001-06-30T14:00:00.000000000-0700' '2001-06-30T15:00:00.000000000-0700'
 '2001-06-30T16:00:00.000000000-0700']

In [3]:
# decode std_time using vectorized converter
%timeit decode(std_time, units, calendar='standard')
standard_datetimes = decode(std_time, units, calendar='standard')
print 'dtype:', standard_datetimes.dtype
print standard_datetimes
1000 loops, best of 3: 243 µs per loop
dtype: datetime64[ns]
['2000-12-31T16:00:00.000000000-0800' '2000-12-31T16:59:59.000000000-0800'
 '2000-12-31T17:59:59.000000000-0800' ...,
 '2001-06-30T13:59:59.000000000-0700' '2001-06-30T14:59:59.000000000-0700'
 '2001-06-30T15:59:59.000000000-0700']

Two three things to notice here:

  • the decode_cf_datetime is having a precision issue on the seconds when using the numpy timedelta approach (used by standard calendars).
  • The new function is substantially slower than the numpy vectorization approach (can't get around that though)
  • the dtype of the returned array matches regardless of the calendar

@shoyer
Copy link
Member

shoyer commented May 12, 2014

Those precision issues are unfortunate but perhaps unavoidable in this case because you are representing dates as floating point numbers -- the units are in "days" but the frequency between time points is measured in "hours".

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants