Indexing into Series of tz-aware datetime64s fails using __getitem__ #12089

Closed
JackKelly opened this issue Jan 19, 2016 · 8 comments
Labels
Bug · Datetime (Datetime data dtype) · Regression (Functionality that used to work in a prior pandas version) · Timezones (Timezone data dtype)

@JackKelly
Contributor

I'm a huge fan of Pandas. Thanks for all the hard work!

I believe I have stumbled across a small bug in Pandas 0.17.1 which was not present in 0.16.2. Indexing into Series of timezone-aware datetime64s fails using __getitem__ but indexing succeeds if the datetime64s are timezone-naive. Here is a minimal code example and the exception produced by Pandas 0.17.1:

In [37]: dates_with_tz = pd.date_range("2011-01-01", periods=3, tz="US/Eastern")

In [46]: dates_with_tz
Out[46]: 
DatetimeIndex(['2011-01-01 00:00:00-05:00', '2011-01-02 00:00:00-05:00',
               '2011-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [38]: s_with_tz = pd.Series(dates_with_tz, index=['a', 'b', 'c'])

In [39]: s_with_tz
Out[39]: 
a   2011-01-01 00:00:00-05:00
b   2011-01-02 00:00:00-05:00
c   2011-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]

In [40]: s_with_tz['a']
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-40-81d0bf655282> in <module>()
----> 1 s_with_tz['a']

/usr/local/lib/python2.7/dist-packages/pandas/core/series.pyc in __getitem__(self, key)
    555     def __getitem__(self, key):
    556         try:
--> 557             result = self.index.get_value(self, key)
    558 
    559             if not np.isscalar(result):

/usr/local/lib/python2.7/dist-packages/pandas/core/index.pyc in get_value(self, series, key)
   1778         s = getattr(series,'_values',None)
   1779         if isinstance(s, Index) and lib.isscalar(key):
-> 1780             return s[key]
   1781 
   1782         s = _values_from_object(series)

/usr/local/lib/python2.7/dist-packages/pandas/tseries/base.pyc in __getitem__(self, key)
     98         getitem = self._data.__getitem__
     99         if np.isscalar(key):
--> 100             val = getitem(key)
    101             return self._box_func(val)
    102         else:

IndexError: only integers, slices (`:`), ellipsis (`...`), numpy.newaxis (`None`) and integer or boolean arrays are valid indices

If the dates are timezone-aware then we can access them using loc but, as far as I'm aware, we should be able to use __getitem__ in this situation too:

In [41]: s_with_tz.loc['a']
Out[41]: Timestamp('2011-01-01 00:00:00-0500', tz='US/Eastern')

However, if the dates are timezone-naive then indexing using __getitem__ works as expected:

In [32]: dates_naive = pd.date_range("2011-01-01", periods=3)

In [33]: dates_naive
Out[33]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns]', freq='D')

In [34]: s = pd.Series(dates_naive, index=['a', 'b', 'c'])

In [35]: s
Out[35]: 
a   2011-01-01
b   2011-01-02
c   2011-01-03
dtype: datetime64[ns]

In [36]: s['a']
Out[36]: Timestamp('2011-01-01 00:00:00')

So indexing into a Series using __getitem__ works if the data is a list of timezone-naive datetime64s but indexing fails if the datetime64s are timezone-aware.
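
For anyone hitting this on an affected version, explicit accessors may serve as a workaround. The sketch below is only an illustration: the report above confirms `.loc` on 0.17.1, while `.at` and `.iloc` are included on the assumption that they behave the same way. On recent pandas releases all three return a tz-aware `Timestamp`.

```python
import pandas as pd

dates_with_tz = pd.date_range("2011-01-01", periods=3, tz="US/Eastern")
s_with_tz = pd.Series(dates_with_tz, index=["a", "b", "c"])

by_loc = s_with_tz.loc["a"]    # label-based lookup, as used in the report
by_at = s_with_tz.at["a"]      # label-based scalar accessor (assumed equivalent)
by_iloc = s_with_tz.iloc[0]    # position-based lookup (assumed equivalent)

# Each accessor should hand back a Timestamp that still carries its timezone.
for value in (by_loc, by_at, by_iloc):
    assert value.tzinfo is not None
```

Being explicit about label versus position also sidesteps the ambiguity of a bare `s[key]` lookup.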

In [47]: pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.10.final.0
python-bits: 64
OS: Linux
OS-release: 4.2.0-23-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_GB.UTF-8

pandas: 0.17.1
nose: 1.3.7
pip: 1.5.6
setuptools: 15.2
Cython: 0.23.1
numpy: 1.10.1
scipy: 0.16.1
statsmodels: 0.6.1
IPython: 4.0.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.2
pytz: 2015.7
blosc: None
bottleneck: None
tables: 3.2.2
numexpr: 2.4.6
matplotlib: 1.4.3
openpyxl: None
xlrd: 0.9.2
xlwt: None
xlsxwriter: None
lxml: None
bs4: 4.3.2
html5lib: 0.999
httplib2: 0.9
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: 2.5.3 (dt dec pq3 ext)
Jinja2: None

@jorisvandenbossche added the Bug, Datetime, Regression and Timezones labels on Jan 19, 2016
@jorisvandenbossche added this to the 0.18.0 milestone on Jan 19, 2016

@jorisvandenbossche
Member

I can confirm this bug, also with current master.

@jreback
Contributor

jreback commented Jan 19, 2016

just need a try/except around line 1780 (the `return s[key]` in `Index.get_value` from the traceback above), catching IndexError and passing if it's caught.
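
The shape of that suggestion can be sketched outside pandas. The function below is a hypothetical stand-in, not the real `Index.get_value`: `get_value_sketch`, `values` and `labels` are names invented for illustration. It tries the fast path first and, if NumPy rejects the key with `IndexError`, falls through to a label lookup.

```python
import numpy as np


def get_value_sketch(values, labels, key):
    """Toy model of a fast path guarded by try/except (not pandas internals)."""
    try:
        # Fast path: works only when `key` is a valid positional index.
        return values[key]
    except IndexError:
        # A string key on a plain ndarray lands here; fall through.
        pass

    # Slow path: resolve the label to a position, then index by position.
    try:
        position = labels.index(key)
    except ValueError:
        raise KeyError(key)
    return values[position]


values = np.array([10, 20, 30])
labels = ["a", "b", "c"]
print(get_value_sketch(values, labels, "a"))  # 10, via the slow path
print(get_value_sketch(values, labels, 1))    # 20, via the fast path
```

Swallowing the exception only helps if the fall-through path returns the right thing, which is where the real fix may need more care than this toy suggests.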

@JackKelly want to do a PR?

@jreback
Contributor

jreback commented Jan 19, 2016

tests can go in the same place as in #12054
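
A regression check for this issue might look roughly like the sketch below. The helper name `check_getitem_matches_loc` is made up here, and nothing is assumed about where it would live in the pandas test suite. It only asserts that `series[label]` and `series.loc[label]` agree and that timezone awareness survives.

```python
import pandas as pd


def check_getitem_matches_loc(tz):
    """Hypothetical check: __getitem__ and .loc agree for a given timezone."""
    dates = pd.date_range("2011-01-01", periods=3, tz=tz)
    series = pd.Series(dates, index=["a", "b", "c"])
    for label in series.index:
        via_getitem = series[label]
        via_loc = series.loc[label]
        # Both lookups should return the same instant ...
        assert via_getitem == via_loc
        # ... and the result is tz-aware exactly when the input was.
        assert (via_getitem.tzinfo is None) == (tz is None)


# Cover the naive case plus the two timezones that appear in this thread.
for tz in (None, "UTC", "US/Eastern"):
    check_getitem_matches_loc(tz)
```

On a version with the bug, the tz-aware cases would be expected to fail with the `IndexError` shown in the report.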

@JackKelly
Contributor Author

sure, I'll give it a go now...

JackKelly added a commit to JackKelly/pandas that referenced this issue Jan 19, 2016
Indexing into Series of tz-aware datetime64s fails using __getitem__
@JackKelly
Contributor Author

OK, I've attempted the fix. Here's the relevant commit on my fork of Pandas.

However, this hasn't fixed the issue and I'm not sure what's best to do. My 'fix' has revealed a new issue. The problem appears to be that, now, when we do series['a'], we get back a tz-naive Timestamp (even though the series contains a bunch of tz-aware datetime64s):

In [5]: dates = pd.date_range("2011-01-01", periods=3, tz='utc')

In [6]: dates
Out[6]: DatetimeIndex(['2011-01-01', '2011-01-02', '2011-01-03'], dtype='datetime64[ns, UTC]', freq='D')

In [7]: series = pd.Series(dates, index=['a', 'b', 'c'])

# Note the lack of timezone:
In [8]: series['a']
Out[8]: Timestamp('2011-01-01 00:00:00')

# But using `loc` we do get the timezone:
In [9]: series.loc['a']
Out[9]: Timestamp('2011-01-01 00:00:00+0000', tz='UTC')

In [10]: series
Out[10]: 
a   2011-01-01 00:00:00+00:00
b   2011-01-02 00:00:00+00:00
c   2011-01-03 00:00:00+00:00
dtype: datetime64[ns, UTC]

In [11]: series['a'] == series.loc['a']
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-11-0de902e8919c> in <module>()
----> 1 series['a'] == series.loc['a']

/home/jack/workspace/python/pandas/pandas/tslib.pyx in pandas.tslib._Timestamp.__richcmp__ (pandas/tslib.c:19258)()
    971                                 (type(self).__name__, type(other).__name__))
    972 
--> 973         self._assert_tzawareness_compat(other)
    974         return _cmp_scalar(self.value, ots.value, op)
    975 

/home/jack/workspace/python/pandas/pandas/tslib.pyx in pandas.tslib._Timestamp._assert_tzawareness_compat (pandas/tslib.c:19638)()
   1000         if self.tzinfo is None:
   1001             if other.tzinfo is not None:
-> 1002                 raise TypeError('Cannot compare tz-naive and tz-aware '
   1003                                  'timestamps')
   1004         elif other.tzinfo is None:

TypeError: Cannot compare tz-naive and tz-aware timestamps
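
To make the mismatch concrete, here is a small sketch using only public `Timestamp` methods. It does not fix the bug. It only shows that the two values differ in timezone awareness, and that localizing the naive one (or stripping the timezone from the aware one) makes them comparable. The variable names are illustrative.

```python
import pandas as pd

naive = pd.Timestamp("2011-01-01")             # what series['a'] returned above
aware = pd.Timestamp("2011-01-01", tz="UTC")   # what series.loc['a'] returned

# The two values differ only in whether they carry timezone information.
assert naive.tzinfo is None
assert aware.tzinfo is not None

# Attaching UTC to the naive value makes the two comparable and equal.
localized = naive.tz_localize("UTC")
assert localized == aware

# Going the other way: dropping the timezone from the aware value.
stripped = aware.tz_localize(None)
assert stripped == naive
```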

@jreback
Contributor

jreback commented Jan 19, 2016

yeh, prob some issues down the path. lmk if you get stuck.

@JackKelly
Contributor Author

Hmm, I think this is way over my head to be honest. I'm really not very familiar with Pandas' internals. I have had a quick shot at getting to the bottom of it. Not sure if I've found any bugs or not. Here are my notes:

Set up a debugging session in IPython like this:

dates_with_tz = pd.date_range("2011-01-01", periods=3, tz="US/Eastern")
s_with_tz = pd.Series(dates_with_tz, index=['a', 'b', 'c'])
%debug s_with_tz['a']

we find that:

In Index.get_value(), the line s = _values_from_object(series) sets s to be:

['2011-01-01T05:00:00.000000000+0000' '2011-01-02T05:00:00.000000000+0000'
 '2011-01-03T05:00:00.000000000+0000']

i.e. timezone is switched from "US/Eastern" to UTC. I've tried stepping into _values_from_object(series) to find where the timezone is switched but I'm not sure I understand what I'm looking at. My only hunch is that the following is broken (because the timezone should still be US/Eastern, surely?):

In [32]: s_with_tz._values._values
Out[32]: 
array(['2011-01-01T05:00:00.000000000+0000',
       '2011-01-02T05:00:00.000000000+0000',
       '2011-01-03T05:00:00.000000000+0000'], dtype='datetime64[ns]')

but I'm really not sure! Is s_with_tz._values._values supposed to return an array where the timezone is set to UTC instead of 'US/Eastern'? Here are the other values (which look correct to me):

In [33]: s_with_tz._values
Out[33]: 
DatetimeIndex(['2011-01-01 00:00:00-05:00', '2011-01-02 00:00:00-05:00',
               '2011-01-03 00:00:00-05:00'],
              dtype='datetime64[ns, US/Eastern]', freq='D')

In [34]: s_with_tz
Out[34]: 
a   2011-01-01 00:00:00-05:00
b   2011-01-02 00:00:00-05:00
c   2011-01-03 00:00:00-05:00
dtype: datetime64[ns, US/Eastern]
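
On the question of whether the UTC values are expected: as far as I understand it, pandas keeps tz-aware datetimes as UTC-based integers and treats the timezone as metadata attached to the index, so a raw `datetime64[ns]` array showing UTC times is not by itself a sign of corruption. The sketch below checks this with public attributes (`asi8`, `tz_convert`) instead of the private `_values._values`, on the assumption that those attributes reflect the same underlying storage.

```python
import pandas as pd

dates_with_tz = pd.date_range("2011-01-01", periods=3, tz="US/Eastern")
as_utc = dates_with_tz.tz_convert("UTC")

# Converting the timezone changes the wall-clock display, not the instants,
# so the underlying integer values are identical.
same_instants = bool((dates_with_tz.asi8 == as_utc.asi8).all())

first_local = dates_with_tz[0]   # midnight in US/Eastern
first_utc = as_utc[0]            # the same instant, shown as 05:00 UTC

print(same_instants)
print(first_local, first_utc)
```

If this reading is right, the tz-naive result from `series['a']` would come from boxing the raw UTC-based values without reattaching the timezone, not from the stored data being wrong.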

@JackKelly
Contributor Author

thank you @jreback :)


3 participants