BUG: select_column not preserving a UTC timezone #7777

Closed
alorenzo175 opened this issue Jul 17, 2014 · 7 comments · Fixed by #7790
Labels
Bug IO HDF5 read_hdf, HDFStore Timezones Timezone data dtype
Milestone

Comments

@alorenzo175
Contributor

I was losing tz-info when retrieving a DatetimeIndex from an HDF store using store.select_column('data', 'index'). I was able to track the issue down to the Index._to_embed method in tseries/index.py. The relevant code is:

    def _to_embed(self, keep_tz=False):
        """ return an array repr of this object, potentially casting to object """
        if keep_tz and self.tz is not None and str(self.tz) != 'UTC':
            return self.asobject.values
        return self.values

It looks like it explicitly rejects UTC timezones. Is there a good reason for this?
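To make the effect of that condition concrete, here is a minimal stdlib-only sketch (Python 3) that isolates the guard from the snippet above. It does not call pandas; `takes_object_path` is a hypothetical helper name introduced only for illustration, and the two `timezone` objects are stand-ins for the pytz zones used in the real code.

```python
from datetime import timedelta, timezone


def takes_object_path(tz, keep_tz=False):
    """Hypothetical helper: mirrors only the guard in the quoted _to_embed.

    Returns True when the quoted method would return the tz-aware object
    array, and False when it would fall through to the plain values.
    """
    return bool(keep_tz and tz is not None and str(tz) != 'UTC')


# Stand-ins for the real timezone objects; only str(tz) matters to the guard.
utc = timezone(timedelta(0), 'UTC')
mst = timezone(timedelta(hours=-7), 'MST')

results = {
    'naive': takes_object_path(None, keep_tz=True),
    'UTC': takes_object_path(utc, keep_tz=True),
    'MST': takes_object_path(mst, keep_tz=True),
}
print(results)
```

Under these assumptions only the MST case takes the tz-preserving branch; the UTC case is treated the same as a naive index even though keep_tz=True was requested.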

The below code reproduces the problem for me.

import pandas as pd

drange = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00', freq='1h')
drange_utc = drange.tz_localize('UTC')
drange_mst = drange.tz_localize('MST')

print drange._to_embed(keep_tz=True)
print drange_utc._to_embed(keep_tz=True)
print drange_mst._to_embed(keep_tz=True)

I'm using python 2.7.6 with the following packages:

commit: None
python: 2.7.6.final.0
python-bits: 64
OS: Linux
OS-release: 2.6.32-431.17.1.el6.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.utf8

pandas: 0.14.1
nose: 1.3.3
Cython: 0.20.1
numpy: 1.8.1
scipy: 0.14.0
statsmodels: 0.5.0
IPython: 2.0.0
sphinx: 1.2.2
patsy: 0.2.1
scikits.timeseries: None
dateutil: 2.2
pytz: 2014.2
bottleneck: None
tables: 3.1.1
numexpr: 2.4
matplotlib: 1.3.1
openpyxl: 1.8.6
xlrd: None
xlwt: None
xlsxwriter: None
lxml: 3.3.5
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: None
sqlalchemy: 0.9.4
pymysql: None
psycopg2: None
@jreback
Contributor

jreback commented Jul 17, 2014

this is an internal method

show a complete example of what u r actually doing

@alorenzo175
Contributor Author

Making a Series of a DatetimeIndex also illustrates the problem.

import pandas as pd

drange = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00', freq='1h')
drange_utc = drange.tz_localize('UTC')
drange_mst = drange.tz_localize('MST')

print pd.Series(drange)
print pd.Series(drange_utc)
print pd.Series(drange_mst)

@alorenzo175
Contributor Author

And an example of my original problem getting the index from an HDF store

import pandas as pd
import numpy as np

drange = pd.date_range('2014-07-07 00:00:00', '2014-07-07 03:00:00', freq='1h')
drange_utc = drange.tz_localize('UTC')
drange_mst = drange.tz_localize('MST')

data = np.ones((drange.size, 3))
df_utc = pd.DataFrame(data, index=drange_utc)
df_mst = pd.DataFrame(data, index=drange_mst)

store_path = 'timezone_test.h5'
with pd.get_store(store_path) as store:
    store.put('utc', df_utc, 'table')
    store.put('mst', df_mst, 'table')

with pd.get_store(store_path) as store:
    print store.select_column('utc', 'index')
    print store.select_column('mst', 'index')

@jreback
Contributor

jreback commented Jul 17, 2014

@alorenzo175 By definition a datetime64[ns] is UTC (though it technically doesn't have a tz attached). I guess technically you want to keep a tz-attached (but with UTC) series. You can do this, but will pretty much have to specify a dtype=object a lot.

That's because a UTC series is de facto equivalent to a plain-old datetime64[ns] series and is indistinguishable from it. (However, the index itself IS distinguishable.)

I guess this could be a bit confusing. A possible work-around is to store the 'UTC' data as 'GMT', which will be treated as a regular timezone.

Selecting this as a full table DOES seem to work though (e.g. store.select('utc')), so maybe there is a bug in select_column, as it's not doing the proper conversion (when it's shoved into the Series).

ok, will mark that as a bug.

interested in doing a pull-request to fix (it will be in io/pytables/Table/read_column)?
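The point above, that UTC-localized values are equivalent to plain datetime64[ns] values, can be illustrated without pandas. This is a stdlib-only sketch (Python 3) under the assumption that naive timestamps are interpreted as UTC; `epoch_ns` is a hypothetical helper introduced for illustration, not a pandas function.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1)


def epoch_ns(dt):
    """Hypothetical helper: nanoseconds since the epoch.

    Naive values are treated as UTC (the assumption discussed above);
    aware values are converted to UTC first.
    """
    if dt.tzinfo is not None:
        dt = dt.astimezone(timezone.utc).replace(tzinfo=None)
    delta = dt - EPOCH
    seconds = delta.days * 86400 + delta.seconds
    return seconds * 10**9 + delta.microseconds * 1000


naive = datetime(2014, 7, 7, 0, 0)
aware_utc = naive.replace(tzinfo=timezone.utc)

# Same wall-clock time, but localized to a fixed UTC-7 offset labelled MST.
mst = timezone(timedelta(hours=-7), 'MST')
aware_mst = naive.replace(tzinfo=mst)

# Attaching UTC does not change the stored integer; attaching MST does.
same_as_naive = epoch_ns(aware_utc) == epoch_ns(naive)
mst_differs = epoch_ns(aware_mst) != epoch_ns(naive)
print(same_as_naive, mst_differs)
```

In this sketch the UTC label changes nothing about the stored integers, so it survives only as metadata. That is why it is easy to drop when the values are moved into a Series.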

@jreback jreback added this to the 0.15.0 milestone Jul 17, 2014
@jreback jreback changed the title from "UTC timezone information lost from DatetimeIndex on _to_embed" to "BUG: select_column not preserving a UTC timezone" Jul 17, 2014
@alorenzo175
Contributor Author

After messing around a little with the Index._to_embed function, I agree that this should just be a fix for pytables.

I'll try to make a fix but this will be my virgin PR.

@jreback
Contributor

jreback commented Jul 18, 2014

@alorenzo175 np, I don't think it's that tricky, but you have to get to know the code...lmk

@jreback
Contributor

jreback commented Jul 18, 2014

https://github.com/pydata/pandas/wiki some useful tips
