Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

cftime.real_datetime not pandas compatible #77

Closed
fmaussion opened this issue Nov 6, 2018 · 20 comments
Closed

cftime.real_datetime not pandas compatible #77

fmaussion opened this issue Nov 6, 2018 · 20 comments

Comments

@fmaussion
Copy link

See pandas-dev/pandas#23419 for context.

The issue is following:

import pandas, numpy, netCDF4
time=netCDF4.num2date([0,1,2,3,4,5,500,1000], 'days since 1801-10-01')
pandas.Series(numpy.arange(len(time)), index=time)

Fails with

AttributeError: 'real_datetime' object has no attribute 'nanosecond'

@spencerkclark says that this is due to the fact that netcdf4 now returns cftime objects per default (always). I guess it is a necessary change (having different time objects to deal with depending on the calendar is a pain), but it makes me wonder about the level of compatibility between cftime objects and the rest of the pydata stack.

I'm no specialist at all here, so I don't know what it would involve, but it would be nice if cftime objects could be handled by pandas.

The easiest fix seems to simply add a nanoseconds attribute to the cftime.real_datetime objects.

@spencerkclark
Copy link
Collaborator

@spencerkclark says that this is due to the fact that netcdf4 now returns cftime objects per default (always)

Thanks @fmaussion. I think it's still important to distinguish between objects that subclass datetime.datetime (e.g. cftime.real_datetime) and objects that subclass cftime.datetime (e.g. cftime.DatetimeGregorian). For instance for xarray's CFTimeIndex we require only cftime.datetime objects be used, and we use the only_use_cftime_datetimes flag in num2date to guarantee this when decoding times (it is still False by default). The issue here stems from #66, where the real_datetime subclass of datetime.datetime was introduced.

I would not expect cftime.datetime objects to be handled by default by pandas (because those could have non-standard calendar types), but perhaps we might try to make cftime.real_datetime objects compatible, because they are at their core ordinary datetime.datetime objects.

@TimoRoth
Copy link
Contributor

TimoRoth commented Nov 6, 2018

It looks like all they are missing is a nanosecond attribute, which can just always be 0 for it to work as expected.

@jswhit
Copy link
Collaborator

jswhit commented Nov 6, 2018

python datetime objects don't have a nanosecond attribute either

@TimoRoth
Copy link
Contributor

TimoRoth commented Nov 6, 2018

They indeed don't, but pandas assumes that whenever an object is "PyDateTime_Check(val) and not PyDateTime_CheckExact(val)", i.e. a subclass of datetime, that it has a nanosecond attribute, which is what causes the error in the first place:
https://github.com/pandas-dev/pandas/blob/7191af9b47f6f57991abdb20625d9563877370a2/pandas/_libs/tslib.pyx#L539

@jswhit
Copy link
Collaborator

jswhit commented Nov 6, 2018

So this is a bug in pandas then?

@TimoRoth
Copy link
Contributor

TimoRoth commented Nov 6, 2018 via email

@spencerkclark
Copy link
Collaborator

spencerkclark commented Nov 6, 2018

Sorry for likely prompting that issue to be closed there...one could try to fix this issue in pandas, but I understand to some extent their pushback. They do not make any guarantees that non-pandas.Timestamp subclasses of datetime.datetime objects will be supported, and the logic there regarding datetimes is already quite complicated.

Barring rolling back the change in #66 to add dayofwk and dayofyr properties to datetime.datetime via a subclass, as unsavory as it might sound, I think the most painless workaround here is in your code to convert the datetimes returned from num2date to numpy.datetime64 before passing them to the Series:

In [1]: import cftime, pandas

In [2]: times = cftime.num2date([0,1,2,3,4,5,500,1000], 'days since 1801-10-01')

In [3]: times
Out[3]:
array([real_datetime(1801, 10, 1, 0, 0), real_datetime(1801, 10, 2, 0, 0),
       real_datetime(1801, 10, 3, 0, 0), real_datetime(1801, 10, 4, 0, 0),
       real_datetime(1801, 10, 5, 0, 0), real_datetime(1801, 10, 6, 0, 0),
       real_datetime(1803, 2, 13, 0, 0), real_datetime(1804, 6, 27, 0, 0)], dtype=object)

In [4]: nptimes = times.astype('datetime64[ns]')

In [5]: nptimes
Out[5]:
array(['1801-10-01T00:00:00.000000000', '1801-10-02T00:00:00.000000000',
       '1801-10-03T00:00:00.000000000', '1801-10-04T00:00:00.000000000',
       '1801-10-05T00:00:00.000000000', '1801-10-06T00:00:00.000000000',
       '1803-02-13T00:00:00.000000000', '1804-06-27T00:00:00.000000000'], dtype='datetime64[ns]')

In [6]: pandas.Series(range(len(nptimes)), index=nptimes)
Out[6]:
1801-10-01    0
1801-10-02    1
1801-10-03    2
1801-10-04    3
1801-10-05    4
1801-10-06    5
1803-02-13    6
1804-06-27    7
dtype: int64

At least for now, NumPy seems to handle subclasses of datetime.datetime objects just fine in this conversion, and pandas is OK with numpy.datetime64.

@fmaussion
Copy link
Author

I think the most painless workaround here is in your code to convert the datetimes

Yes, I don't worry too much about workarounds - currently we are running with the nanosecond attr, but I'll consider yours. It surprises me however that I'm the only one wanting to use pandas with netcdf files output, and I find this recent evolution (and the necessary workaround) quite unfortunate for simple users.

But as I said, I just don't understand dates in python well enough to make any recommendation here. In this perspective I am like most scientists I guess: there are way too many date formats in python. They are all equivalent to me, until I find out they are not ;-)

@spencerkclark
Copy link
Collaborator

spencerkclark commented Nov 8, 2018

It surprises me however that I'm the only one wanting to use pandas with netcdf files output, and I find this recent evolution (and the necessary workaround) quite unfortunate for simple users.

You're definitely not the only one :). I'm sorry if I came across as being in support of doing nothing in response to this issue.

there are way too many date formats in python

I agree; in hindsight #66 effectively created another one, for arguably little gain (and clearly at least one regression).

From my perspective I think cftime should at least consider the option of reverting back to by default using pure datetime.datetime objects for standard calendars (and not a subclass, i.e. real_datetime) before going the route of trying to change the behavior downstream in pandas (which they seem somewhat reluctant to doing) or adding a nanosecond attribute to real_datetime (which seems somewhat fragile).

As I understand it, in addition to creating real_datetime, #66 established working dayofwk and dayofyr properties for all cftime.datetime objects. There are cftime.datetime objects for all types of calendars (including standard ones). If one would always like those properties on their datetimes, they could toggle the only_use_cftime_datetimes option in num2date.

Ultimately, I think the benefits of returning standard Python datetime objects (for interoperability with the rest of the stack) would outweigh the minor downside that by default, not all datetime objects returned by num2date would have dayofwk and dayofyr attributes (because there would still be a way to achieve that behavior; it would just not be there automatically).

That's my two cents, but I am not a core developer of cftime, so take that with a grain of salt.

@TimoRoth
Copy link
Contributor

TimoRoth commented Nov 8, 2018

I do think that the actual issue in this case is on the pandas side. As they are the ones assuming that every subclass of datetime automatically has to have a nanosecond attribute, which there is absolutely no guarantee for.

@spencerkclark
Copy link
Collaborator

@TimoRoth I understand that it would fix things in pandas if that was allowed, but I think it's important to note that pandas doesn't explicitly say they will support subclasses of datetime objects, which is the reason for their reluctance. Therefore I'm not sure I would frame it as a "bug" or "regression" in pandas; it would be more of a request for them to relax their assumptions.

The change that caused this was in cftime (and was made quite recently). For that reason I think we should at least reflect on whether that was a good decision before approaching pandas again.

@spencerkclark
Copy link
Collaborator

For what it's worth though, I will note that others apparently would like something of this sort to be supported in pandas too -- this issue seems related: pandas-dev/pandas#21142.

@fmaussion
Copy link
Author

@TimoRoth how much of an effort would it be to send a patch in Pandas? Jeff Reback seemed opened to it (pandas-dev/pandas#23419 (comment))

Let's just hope it doesn't imply messing around with C code...

@spencerkclark
Copy link
Collaborator

The problem with this, and the reason it was changed in the first place, is that returning a different kind of object depending on the calendar and time period resulting in problems for users. I'm not in favor of reverting to this just as a workaround for how pandas deals with datetime subclasses.

Sorry @jswhit I was a little confused by your comment, though I can't seem to find it on GitHub anymore.

As a subclass of datetime.datetime or as a pure datetime.datetime object, cftime.real_datetime is still not the same type as cftime.datetime, so maybe I am not following, but it is not clear to me how #66 addressed the issue of returning different kinds of objects depending on the calendar and time period. As I noted earlier, #39 is a way to work around that particular issue.

That being said, I really don't feel very strongly about this, so I don't want to belabor the discussion. I merely thought given the recency of the change that it might warrant reconsidering in this case, but I see the other side of the argument too (i.e. that one might want dayofwk and dayofyr attributes on all date types returned by num2date). Whatever you ultimately decide to do is OK with me. Since I'm responsible for bringing discussion here, if the decision is to fix things in cftime, I'm happy to volunteer to do that work, but if the decision is to push the fix downstream to pandas (as you suggested in your comment), I would rather someone else take it on, as I am not particularly invested in this problem.

@jswhit
Copy link
Collaborator

jswhit commented Nov 16, 2018

I deleted that comment after I realized it was completely off base, sorry. You are right that this particular change only enabled the extra two attributes (dayofwk, dayofyr). I would still prefer the issue be addressed in pandas, but I'm not opposed to adding a bogus nanosecond attribute as a workaround.

@spencerkclark
Copy link
Collaborator

I would still prefer the issue be addressed in pandas, but I'm not opposed to adding a bogus nanosecond attribute as a workaround.

Perfect, thanks. Hopefully things work out in pandas, but it's good to have that as a fallback option here.

@jswhit
Copy link
Collaborator

jswhit commented Dec 2, 2018

@fmaussion, can you verify that pull request #95 (branch issue77) fixes the issue?

bjlittle added a commit that referenced this issue Dec 2, 2018
add bogus nanosecond attribute to real_datetime (issue #77)
@spencerkclark
Copy link
Collaborator

I know I'm late on this, but thanks for adding the workaround @jswhit.

@Westbrook077
Copy link

Sorry for likely prompting that issue to be closed there...one could try to fix this issue in pandas, but I understand to some extent their pushback. They do not make any guarantees that non-pandas.Timestamp subclasses of datetime.datetime objects will be supported, and the logic there regarding datetimes is already quite complicated.

Barring rolling back the change in #66 to add dayofwk and dayofyr properties to datetime.datetime via a subclass, as unsavory as it might sound, I think the most painless workaround here is in your code to convert the datetimes returned from num2date to numpy.datetime64 before passing them to the Series:

In [1]: import cftime, pandas

In [2]: times = cftime.num2date([0,1,2,3,4,5,500,1000], 'days since 1801-10-01')

In [3]: times
Out[3]:
array([real_datetime(1801, 10, 1, 0, 0), real_datetime(1801, 10, 2, 0, 0),
       real_datetime(1801, 10, 3, 0, 0), real_datetime(1801, 10, 4, 0, 0),
       real_datetime(1801, 10, 5, 0, 0), real_datetime(1801, 10, 6, 0, 0),
       real_datetime(1803, 2, 13, 0, 0), real_datetime(1804, 6, 27, 0, 0)], dtype=object)

In [4]: nptimes = times.astype('datetime64[ns]')

In [5]: nptimes
Out[5]:
array(['1801-10-01T00:00:00.000000000', '1801-10-02T00:00:00.000000000',
       '1801-10-03T00:00:00.000000000', '1801-10-04T00:00:00.000000000',
       '1801-10-05T00:00:00.000000000', '1801-10-06T00:00:00.000000000',
       '1803-02-13T00:00:00.000000000', '1804-06-27T00:00:00.000000000'], dtype='datetime64[ns]')

In [6]: pandas.Series(range(len(nptimes)), index=nptimes)
Out[6]:
1801-10-01    0
1801-10-02    1
1801-10-03    2
1801-10-04    3
1801-10-05    4
1801-10-06    5
1803-02-13    6
1804-06-27    7
dtype: int64

At least for now, NumPy seems to handle subclasses of datetime.datetime objects just fine in this conversion, and pandas is OK with numpy.datetime64.

Thx for your answer, but I found something wrong:
'''
In [6]: pandas.Series(range(len(nptimes)), index=nptimes)
Out[6]:
1801-10-01 0
1801-10-02 1
1801-10-03 2
1801-10-04 3
1801-10-05 4
1801-10-06 5
1803-02-13 6
1804-06-27 7
dtype: int64
'''

Maybe the right code shoud be below:
'''
pandas.Series(nptimes, index = range(len(nptimes)))
'''

@spencerkclark
Copy link
Collaborator

I think it depends on your purpose. The motivation for using the integer range as the values and the times as the index was based on the usage in the initial comment: #77 (comment).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants