
Timestamp.tz_localize() NonExistentTimeError handling #8917

Closed
mskrajnowski opened this issue Nov 28, 2014 · 16 comments · Fixed by #22644
Labels: Enhancement Timezones Timezone data dtype
Comments

@mskrajnowski

I'm trying to use pandas to speed up timezone conversions on many datetimes, but I can't get around NonExistentTimeErrors. Pandas' tz_localize() seems to ignore the ambiguous argument in the non-existent time case.

Example:

import datetime

import pytz
import pandas

tz = pytz.timezone('Europe/Warsaw')
non_existent = datetime.datetime(2015, 3, 29, 2, 30)

tz.normalize(tz.localize(non_existent))
#2015-03-29 03:30:00+02:00

tz.normalize(tz.localize(non_existent, is_dst=False))
#2015-03-29 03:30:00+02:00

pandas.Timestamp(non_existent).tz_localize(tz)
# NonExistentTimeError: 2015-03-29 02:30:00

pandas.Timestamp(non_existent).tz_localize(tz, ambiguous=0)
# NonExistentTimeError: 2015-03-29 02:30:00

It would be nice if the ambiguous argument worked the same way as is_dst in pytz. As it stands, it's impossible, as far as I know, to reliably localize a series of datetimes using pandas, since any of them might raise a NonExistentTimeError and there's no way of telling pandas what to do with such datetimes.
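For reference, a sketch of an element-wise workaround based on the pytz calls above: a hypothetical localize_pytz_style helper (not part of pandas) that applies localize plus normalize per value, so it avoids NonExistentTimeError but also gives up the vectorized speed that motivated this issue.

import datetime

import pandas
import pytz

tz = pytz.timezone('Europe/Warsaw')

# Hypothetical helper: localize a naive Timestamp the way pytz does,
# letting normalize() push a nonexistent wall time onto a valid one.
def localize_pytz_style(ts, tz):
    return pandas.Timestamp(tz.normalize(tz.localize(ts.to_pydatetime())))

naive = pandas.Series(pandas.to_datetime(['2015-03-29 01:30', '2015-03-29 02:30']))
naive.map(lambda ts: localize_pytz_style(ts, tz))
# 01:30 stays 01:30+01:00; the nonexistent 02:30 comes back as 03:30+02:00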

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.5.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-39-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.1
nose: 1.3.4
Cython: None
numpy: 1.9.1
scipy: None
statsmodels: None
IPython: 2.3.1
sphinx: 1.2.2
patsy: None
dateutil: 2.2
pytz: 2013.9
bottleneck: None
tables: None
numexpr: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: None
httplib2: 0.9
apiclient: 1.3.1
rpy2: None
sqlalchemy: None
pymysql: None
psycopg2: 2.5.4 (dt dec pq3 ext)
@jorisvandenbossche
Member

cc @rockg @ischwabacher

@rockg
Contributor

rockg commented Nov 28, 2014

ambiguous is only structured to differentiate duplicate times in the fall transition. It would be very easy to extend to the spring transition.
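For contrast, a small sketch (against a recent pandas) of what the ambiguous flag already handles for the fall transition, using Europe/Warsaw on 2014-10-26, when the wall time 02:30 occurs twice:

import pandas as pd

dup = pd.Timestamp('2014-10-26 02:30')  # ambiguous: clocks fall back from 03:00 to 02:00

dup.tz_localize('Europe/Warsaw', ambiguous=True)
# Timestamp('2014-10-26 02:30:00+0200', tz='Europe/Warsaw')  -- first (DST) occurrence

dup.tz_localize('Europe/Warsaw', ambiguous=False)
# Timestamp('2014-10-26 02:30:00+0100', tz='Europe/Warsaw')  -- second (standard-time) occurrence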

@jreback added Enhancement Timezones Timezone data dtype labels Nov 28, 2014
@jreback added this to the 0.16.0 milestone Nov 28, 2014
@rockg
Contributor

rockg commented Nov 30, 2014

@mskrajnowski I have been thinking about this more and wonder whether your data is really in a fixed standard time zone, in which case you should be localizing to that and then converting. For example, what does the data look like for that spring DST change day? Are there 24 hours or just 23? Maybe the localizing issue is masking something else.

@ischwabacher
Contributor

It seems to me that the only overlap between the sorts of options that make sense for handling ambiguous times and those that make sense for nonexistent times is NaT and raise. Here are the other possible behaviors I can come up with, though some of them are pretty wacky:

  1. Choose the time at the jump. So Timestamp('2014-03-09 02:30:00').tz_localize('America/Chicago') is Timestamp('2014-03-09 03:00:00-0500', tz='America/Chicago').
  2. Apply the (non-DST) offset before the discontinuity to the given time, then normalize (in the pytz sense). So Timestamp('2014-03-09 02:30:00').tz_localize('America/Chicago') becomes Timestamp('2014-03-09 03:30:00-0500', tz='America/Chicago').
  3. Apply the (DST) offset after the discontinuity to the given time, then normalize.
  4. Apply the "before" offset, then don't normalize. This yields a nonexistent time, but would allow date_range to be emulated by repeated subtraction of an offset from a Timestamp (so that, for instance, Timestamp('2014-03-10 02:30:00-0500', tz='America/Chicago') - Day() - Day() could equal Timestamp('2014-03-08 02:30:00-0600', tz='America/Chicago')). But this is a lot of crazy complexity for an invariant that I'm starting to think is a bad idea anyway.
  5. Apply the "after" offset, then don't normalize. This is like the previous option but for repeated addition instead of subtraction.

How do these match up against the options for ambiguous time handling? Does the knob for nonexistent time handling need to be separate from the one for ambiguous times?

@mskrajnowski
Author

@rockg I'm implementing a scheduling application. The user defines working hours in his/her own timezone, and I combine the times provided by the user with dates to get the actual work start and end UTC datetimes. I can't really forbid the user from setting 02:30 as a work start/end time (maybe he/she likes to wake up in the middle of the night and code ;) ), so I need a way to reliably localize any datetime. Even if a given time technically doesn't exist, I still need to output some logical UTC datetime.

@ischwabacher I'd go with the way pytz handles non-existent/ambiguous times.

@ischwabacher
Contributor

Unfortunately, pytz uses is_dst=True to mean option 5, is_dst=False to mean option 4, and is_dst=None to mean raise. This is partly due to pytz's workaround for limitations in the datetime API, which are tentatively scheduled to be fixed in python3.5 (woohoo!), so we will probably see pytz switch to behaviors 3 and 2, respectively.

One issue I have here is that is_dst=True returns a time that (once normalized) is not DST but would be the given time if it were DST, while is_dst=False returns one that is DST but would be the given time if it were not DST. I am not sure whether this is more or less confusing than swapping them.
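To make that concrete, here is a quick check of those pytz calls for America/Chicago's 2014 spring transition (standard time is -06:00, DST is -05:00); the commented outputs are what pytz returns for this nonexistent wall time:

import datetime
import pytz

tz = pytz.timezone('America/Chicago')
gap = datetime.datetime(2014, 3, 9, 2, 30)  # nonexistent: clocks jump from 02:00 to 03:00

tz.localize(gap, is_dst=True)                 # 2014-03-09 02:30:00-05:00 (option 5, not a real wall time)
tz.normalize(tz.localize(gap, is_dst=True))   # 2014-03-09 01:30:00-06:00 (option 3: lands on standard time)
tz.localize(gap, is_dst=False)                # 2014-03-09 02:30:00-06:00 (option 4)
tz.normalize(tz.localize(gap, is_dst=False))  # 2014-03-09 03:30:00-05:00 (option 2: lands on DST)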

Also, if your users set a time of 2:30 as a work start/end time, do you think your users will be least surprised by an alarm at 1:30, 3:00, or 3:30? Does it depend on whether it's a start or end time?

@mskrajnowski
Author

Since 2:00 becomes 3:00 in that transition, IMHO it's logical that 2:30 would become 3:30. That's what I get with pytz using localize and normalize (without passing is_dst). Of course, the question then becomes what to do with a time interval of 2:30 - 3:00, which would become 3:30 - 3:00. Still, I'd rather deal with that problem, because then I have better data to work with.

@mskrajnowski
Author

Another thing that would help is an API that would let us fix ambiguous and non-existent times. Maybe tz_localize could return NaT along with information about why a time wasn't localized? With that information I could, for example, add an hour to non-existent times and retry. At the moment tz_localize raises on the first error, which isn't very helpful when working with series.
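A rough sketch of that fix-and-retry idea as it can be done today, catching pytz's NonExistentTimeError per element and shifting the wall time forward an hour before localizing again (localize_or_shift is a hypothetical helper, not a pandas API):

import pandas
import pytz
from pytz.exceptions import NonExistentTimeError

tz = pytz.timezone('Europe/Warsaw')

# Hypothetical helper: try the normal localization; on a nonexistent
# wall time, add an hour and localize again.
def localize_or_shift(ts, tz):
    try:
        return ts.tz_localize(tz)
    except NonExistentTimeError:
        return (ts + pandas.Timedelta(hours=1)).tz_localize(tz)

naive = pandas.Series(pandas.to_datetime(['2015-03-29 01:30', '2015-03-29 02:30']))
naive.map(lambda ts: localize_or_shift(ts, tz))
# the nonexistent 02:30 is retried as 03:30 and localizes to 03:30+02:00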

@ischwabacher
Contributor

I definitely think there should be an option to return NaT instead of raising, but it doesn't seem feasible to attach any other information to that unless you just want a warning, which we should not emit unless it's explicitly requested (since nonexistent times can arise without necessarily coming from a programmer error).

As far as defaults go, I think we should keep raise as the default (or possibly raise for single operations and NaT for vectorized operations if we can do that?) because "errors should not pass silently // unless explicitly silenced".

@jreback modified the milestones: 0.16.0, Next Major Release Mar 6, 2015
@esvhd

esvhd commented Jun 23, 2017

Hi guys - was this released in 0.20.2 yet? I'm seeing the same problem here.
Thanks.

@jorisvandenbossche
Member

This issue is still open, so no, this has not been fixed in 0.20.2. Contributions are welcome.

@randomgambit

@jorisvandenbossche @esvhd Same problem here. I've got some timestamps in Central time, and trying to localize them to EST gives me this error. Is there a quick and dirty fix for that? Thanks!

@rockg
Contributor

rockg commented Jul 17, 2017

There should be nothing wrong with localizing to EST, as that zone does not have DST offsets. Can you post your example?

In [10]: ts = pd.Timestamp("2017-03-12 02:30")

In [11]: ts.tz_localize("EST")
Out[11]: Timestamp('2017-03-12 02:30:00-0500', tz='EST')

@randomgambit

@rockg No, I mean I convert from Central time to EST.

df.timestamp.map(lambda x: x.tz_localize('US/Central', ambiguous='NaT').tz_convert('US/Eastern').tz_localize(None))

@randomgambit

randomgambit commented Jul 17, 2017

@jorisvandenbossche @mskrajnowski @rockg would this be the pytz-equivalent (and correct) way to convert from US/Central to US/Eastern in a pandas DataFrame?

Assume df.timestamp is a naive datetime column produced by pd.to_datetime().

import pytz

tz_est = pytz.timezone('US/Eastern')
tz_central = pytz.timezone('US/Central')

df.timestamp.map(lambda x: tz_central.localize(x).astimezone(tz_est).tz_localize(None))
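For comparison, the same conversion can stay inside pandas via the Series.dt accessor; this is a sketch assuming df['timestamp'] is a naive datetime64 column as described above, and note that, per this issue, it will still raise NonExistentTimeError for wall times that fall inside the spring-forward gap:

# Vectorized equivalent using the .dt accessor instead of map()
converted = (df['timestamp']
             .dt.tz_localize('US/Central', ambiguous='NaT')
             .dt.tz_convert('US/Eastern')
             .dt.tz_localize(None))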

@randomgambit

@jorisvandenbossche @mskrajnowski @rockg Any ideas? Sorry for the spam, but this is an important question in my humble opinion :D
