unique() broken for datetime64[ns] columns #2872

Closed
languitar opened this Issue Feb 14, 2013 · 6 comments

@languitar

I have a data frame containing a column with timestamps using the numpy datetime data type. The unique() function on that column returns obviously invalid data:

In [50]: cues[(cues.component == '/usr/bin/naoqi-bin') & (cues.thread_id == 0)].start_time
Out[50]: 
time
2013-02-14 10:11:13.480284   2013-02-14 10:09:34.640000
2013-02-14 10:11:13.728758   2013-02-14 10:09:34.640000
2013-02-14 10:11:13.979455   2013-02-14 10:09:34.640000
2013-02-14 10:11:14.240253   2013-02-14 10:09:34.640000
2013-02-14 10:11:14.484151   2013-02-14 10:09:34.640000
2013-02-14 10:11:14.718189   2013-02-14 10:09:34.640000
2013-02-14 10:11:14.969820   2013-02-14 10:09:34.640000
2013-02-14 10:11:15.237294   2013-02-14 10:09:34.640000
2013-02-14 10:11:15.471394   2013-02-14 10:09:34.640000
2013-02-14 10:11:15.729639   2013-02-14 10:09:34.640000
2013-02-14 10:11:15.980261   2013-02-14 10:09:34.640000
2013-02-14 10:11:16.241234   2013-02-14 10:09:34.640000
2013-02-14 10:11:16.497418   2013-02-14 10:09:34.640000
2013-02-14 10:11:16.738275   2013-02-14 10:09:34.640000
2013-02-14 10:11:16.981937   2013-02-14 10:09:34.640000
...
2013-02-14 10:12:53.737133   2013-02-14 10:09:34.640000
2013-02-14 10:12:53.984177   2013-02-14 10:09:34.640000
2013-02-14 10:12:54.262773   2013-02-14 10:09:34.640000
2013-02-14 10:12:54.505545   2013-02-14 10:09:34.640000
2013-02-14 10:12:54.726044   2013-02-14 10:09:34.640000
2013-02-14 10:12:55.010031   2013-02-14 10:09:34.640000
2013-02-14 10:12:55.245294   2013-02-14 10:09:34.640000
2013-02-14 10:12:55.488452   2013-02-14 10:09:34.640000
2013-02-14 10:12:55.737416   2013-02-14 10:09:34.640000
2013-02-14 10:12:55.980422   2013-02-14 10:09:34.640000
2013-02-14 10:12:56.234256   2013-02-14 10:09:34.640000
2013-02-14 10:12:56.471297   2013-02-14 10:09:34.640000
2013-02-14 10:12:56.721042   2013-02-14 10:09:34.640000
2013-02-14 10:12:56.982471   2013-02-14 10:09:34.640000
2013-02-14 10:12:57.218042   2013-02-14 10:09:34.640000
Name: start_time, Length: 416

In [51]: cues[(cues.component == '/usr/bin/naoqi-bin') & (cues.thread_id == 0)].start_time.dtype
Out[51]: dtype('datetime64[ns]')

In [52]: cues[(cues.component == '/usr/bin/naoqi-bin') & (cues.thread_id == 0)].start_time.unique()
Out[52]: array([1970-01-16 90:09:34.640000], dtype=datetime64[ns])

My version of pandas is 0.10.1.

@jreback
jreback commented Feb 14, 2013

unique() returns a numpy array, and that's how numpy displays them (an np.datetime64 bug which pandas works around).
Just wrap the result back in a Series (it's already the correct dtype):

In [2]: pd.__version__
Out[2]: '0.10.1'

In [4]: df = pd.DataFrame(dict(A = pd.Timestamp('20010101')),index=range(3))

In [5]: df.ix[2,:] = pd.Timestamp('20010102')

In [6]: df
Out[6]: 
                    A
0 2001-01-01 00:00:00
1 2001-01-01 00:00:00
2 2001-01-02 00:00:00

In [7]: df['A'].unique()
Out[7]: array([1970-01-12 72:00:00, 1970-01-12 96:00:00], dtype=datetime64[ns])

In [8]: pd.Series(df['A'].unique())
Out[8]: 
0   2001-01-01 00:00:00
1   2001-01-02 00:00:00
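
The same workaround can be sketched for a more recent pandas, where `df.ix` is no longer available. This is only an illustration under assumptions: the Series `s` is a hypothetical stand-in for the reporter's column, and the type and repr of what `unique()` returns vary between pandas and numpy versions, so the sketch checks only the wrapped result.

```python
import pandas as pd

# Hypothetical stand-in for the reporter's column: three timestamps, two distinct.
s = pd.Series(pd.to_datetime(["2001-01-01", "2001-01-01", "2001-01-02"]))

# unique() hands back an array rather than a Series; its exact type and
# repr depend on the installed pandas and numpy versions.
raw = s.unique()

# Wrapping the result in a Series restores the pandas Timestamp view.
wrapped = pd.Series(raw)
print(wrapped)
```

Iterating over `wrapped` yields `pd.Timestamp` objects, so comparisons against other timestamps behave as expected.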

@languitar

Is there a numpy issue tracked somewhere for that bug?

@jreback
jreback commented Feb 14, 2013

It's really just a display issue in how numpy displays native datetime64[ns] values, see below.
If you always interact with the data via pandas (through Timestamp objects) you should have no problems.
Wes did all the hard work.

In [9]: df['A'].values
Out[9]: array([1970-01-12 72:00:00, 1970-01-12 72:00:00, 1970-01-12 96:00:00], dtype=datetime64[ns])

good explanation here
http://pandas.pydata.org/pandas-docs/stable/whatsnew.html#numpy-datetime64-dtype-and-1-6-dependency
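
One way to convince yourself that this is a display problem rather than corrupted data is to look at the integers behind the array. A minimal sketch, assuming a datetime64[ns] array stores nanoseconds since the Unix epoch as int64 (the variable names are illustrative only):

```python
import numpy as np
import pandas as pd

# Build a small datetime64[ns] array directly in numpy.
arr = np.array(["2001-01-01", "2001-01-02"], dtype="datetime64[ns]")

# Reinterpret the same memory as int64: nanoseconds since 1970-01-01.
nanos = arr.view("int64")

# pandas reads a plain integer as nanoseconds since the epoch,
# so the original dates come back regardless of how numpy prints them.
restored = [pd.Timestamp(int(n)) for n in nanos]

print(nanos)
print(restored)
```

The two integers differ by exactly one day in nanoseconds, which matches the two dates in the example.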

@wesm
wesm commented Feb 15, 2013

I am eventually going to circumvent this stupid stuff in NumPy with a pandas-specific array implementation and data types we have control over, but it may take me another year. Wrap datetime64 arrays in Series for now.

@wesm wesm closed this Mar 12, 2013
@michaelaye

I would like to point out that there is currently an ongoing numpy discussion about datetime64, after a bug was found in numpy 1.7 concerning the fact that timezone support does not exist for pre-1970 dates.
Maybe it would be wise to address any remaining concerns that you guys have about datetime64, as it seems that they want to fix whatever is still broken with it?
http://permalink.gmane.org/gmane.comp.python.numeric.general/53906

@jreback
jreback commented Apr 19, 2013

Thanks for the link. pandas doesn't use np.datetime64 at all (except as an input type); Wes replaced it with Timestamp, which acts and works correctly. In theory it would be nice to rely on numpy for a type like this, but numpy < 1.7 is quite buggy (1.7 seems good from what I have seen).
