
Off-by-one microsecond when parsing with certain timezones #74

Closed
dokai opened this issue Nov 12, 2016 · 5 comments

dokai commented Nov 12, 2016

I found a curious behavior when parsing timestamps where the microsecond value is off by one depending on the timezone used.

As a minimal example I'm attempting to parse the string 2016-11-12T02:09:39.594000.

Parsing with the default timezone works as expected:

In [3]: Pendulum.parse('2016-11-12T02:09:39.594000').microsecond
Out[3]: 594000

Parsing with a particular timezone, however, shows the off-by-one difference:

In [4]: Pendulum.parse('2016-11-12T02:09:39.594000', 'America/Panama').microsecond
Out[4]: 593999

Several timezones (though by no means all) exhibit this behavior:

In [5]: len(pytz.common_timezones)
Out[5]: 436

In [6]: len({name for name in pytz.common_timezones if Pendulum.parse('2016-11-12T02:09:39.594000', name).microsecond != 594000})
Out[6]: 182

I don't see why the particular choice of time or timezone should result in a difference of a single microsecond.

@sdispater (Owner) commented

It's not limited to parse():

>>> import pendulum
>>> pendulum.create(2016, 11, 12, 2, 9, 39, 594000, 'America/Panama').microsecond
593999

After checking, it seems that the wrong value comes from the C extension used to calculate offsets. If you install Pendulum without the C extensions, it works properly:

PENDULUM_EXTENSIONS=0 pip install pendulum


dokai commented Nov 12, 2016

It seems the difference between the Python and C versions is that Python rounds the value before casting to int, which the C version does not.

C

microsecond = (int64_t) (unix_time * 1000000) % 1000000;
if (microsecond < 0) {
    microsecond += 1000000;
}

Python

microsecond = int(round(unix_time % 1, 6) * 1e6)

Using the 594000 microseconds from above, we can see that, due to the way floats are represented, the fractional value varies:

>>> 1.594000 % 1
0.5940000000000001

>>> 123.594000 % 1
0.5939999999999941

Without the round() call, the Python version behaves like the C version:

>>> int((123.594000 % 1 ) * 1e6)
593999
>>> int(round(123.594000 % 1, 6) * 1e6)
594000
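The truncation-vs-rounding difference can be reproduced without Pendulum at all. A minimal sketch (plain Python; the function names are mine, not Pendulum's) contrasting the two strategies on the fractional part:

```python
def microsecond_truncating(unix_time: float) -> int:
    # Truncate the fractional part, as the C extension effectively does.
    return int((unix_time % 1) * 1e6)

def microsecond_rounding(unix_time: float) -> int:
    # Round to 6 decimal places first, as the pure-Python path does.
    return int(round(unix_time % 1, 6) * 1e6)

for ts in (1.594, 123.594):
    print(ts, microsecond_truncating(ts), microsecond_rounding(ts))
# 1.594 happens to truncate cleanly (594000), but 123.594 truncates
# to 593999; rounding first yields 594000 in both cases.
```

Whether a given timestamp truncates up or down depends only on which side of the exact value its float representation lands on, which is why some timezones (i.e., some offsets) are affected and others are not.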

dokai added a commit to dokai/pendulum that referenced this issue Nov 12, 2016
As described in sdispater#74 there is
a difference in the behavior of the C extension and Python
implementation in terms of handling microseconds.

The problem is that a microsecond value passed in may have an off-by-one
difference for certain values due to how floating point numbers are
handled, unless rounding is performed.

  >>> import pendulum
  >>> pendulum.create(2016, 11, 12, 2, 9, 39, 594000, 'America/Panama').microsecond
  593999

This commit adds the same rounding behavior to the C extension as used
in the Python version.
dokai added a commit to dokai/pendulum that referenced this issue Nov 12, 2016
@danilobellini commented

As @dokai pointed out, it's a floating point truncation error, and I think it's similar to what I described in #71. Always rounding instead of truncating would work around the issue, but I think Pendulum should never use floating point numbers internally unless the result itself has to be a float.

In this case, it goes back to the timezone _normalize method (pendulum.tz.Timezone._normalize), which calculates the unix timestamp as:

unix_time = tr.unix_time - (tr.pre_time - dt).total_seconds()
unix_time = tr.unix_time + (dt - tr.time).total_seconds()

The method total_seconds returns a floating point number. Every call to delta.total_seconds() inside Pendulum should be replaced by the integer expression delta.days * 86400 + delta.seconds, handling the microseconds part separately (e.g. as an extra parameter to local_time in both the C and Python implementations). That way, nothing is a float in between.
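The integer-only bookkeeping described above can be sketched in plain Python (the helper name delta_as_int_parts is illustrative, not a Pendulum API):

```python
from datetime import datetime, timedelta

def delta_as_int_parts(delta: timedelta) -> tuple[int, int]:
    # Exact integer seconds and microseconds -- no float, no rounding.
    return delta.days * 86400 + delta.seconds, delta.microseconds

dt = datetime(2016, 11, 12, 2, 9, 39, 594000)
epoch = datetime(1970, 1, 1)
seconds, microseconds = delta_as_int_parts(dt - epoch)
print(seconds, microseconds)  # microseconds is exactly 594000
```

timedelta already stores days, seconds, and microseconds as normalized integers, so this decomposition is lossless; only the call to total_seconds() collapses them into one lossy float.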

The given example:

>>> import pendulum
>>> from datetime import datetime
>>> dt = datetime(2016, 11, 12, 2, 9, 39, 594000)
>>> tz = pendulum.timezone("America/Panama")
>>> tr = tz.transitions[-1]

This method would be internally called on creation:

>>> tz._normalize(dt, "post")
(2016, 11, 12, 2, 9, 39, 593999, <TimezoneInfo [America/Panama, -18000, False]>)

And it does this:

>>> unix_time = tr.unix_time + (dt - tr.time).total_seconds()
>>> offset = tz._tzinfos[tr._transition_type_index].offset
>>> pendulum._extensions._helpers.local_time(unix_time, offset)
(2016, 11, 12, 2, 9, 39, 593999)

But for very large year values, the microsecond is simply lost (i.e., the unix_time value itself is no longer exact), no matter which implementation is used:

>>> dt = datetime(2316, 11, 12, 2, 9, 39, 857)
>>> unix_time = tr.unix_time + (dt - tr.time).total_seconds()
>>> "%.18f" % unix_time # Rounding/truncating wouldn't be enough
'10945955379.000856399536132812'
>>> dt = datetime(2222, 11, 12, 2, 9, 39, 1454)
>>> unix_time = tr.unix_time + (dt - tr.time).total_seconds()
>>> "%.18f" % unix_time # Neither rounding nor truncating would do it, again
'7979584179.001453399658203125'
>>> dt = datetime(2180, 7, 4, 13, 16, 8, 12)
>>> unix_time = tr.unix_time + (dt - tr.time).total_seconds()
>>> "%.18f" % unix_time
'6643016168.000011444091796875'
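This last point is a representation limit rather than a rounding bug: at magnitudes around 10^10, adjacent 64-bit floats are roughly one to two microseconds apart, so single-microsecond precision cannot survive a round-trip through a float unix timestamp. A quick check (assuming Python 3.9+ for math.ulp):

```python
import math

# Gap between adjacent representable doubles at each magnitude.
for ts in (1478916579.594, 7979584179.001453, 10945955379.000856):
    print(f"{ts:.6f}: ulp = {math.ulp(ts):.3e} s")
# The 2016 timestamp resolves to ~2.4e-7 s (finer than a microsecond),
# while the far-future ones are close to or above 1e-6 s.
```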

@sdispater (Owner) commented

@danilobellini I agree that working with floating point numbers is prone to errors and approximations.

I will check if I can cook up a better implementation to make sure we don't lose information.

@sdispater sdispater added this to the 0.7 milestone Nov 14, 2016
@sdispater sdispater added the bug label Nov 14, 2016
@sdispater sdispater self-assigned this Nov 14, 2016
@sdispater (Owner) commented

Commit 184b94a on the develop branch fixes the issue:

>>> import pendulum
>>> dt = pendulum.parse('2016-11-12T02:09:39.594000', 'America/Panama')
>>> dt.isoformat()
'2016-11-12T02:09:39.594000-05:00'
>>> dt.microsecond
594000
>>> dt = pendulum.create(2316, 11, 12, 2, 9, 39, 857, 'America/Panama')
>>> dt.isoformat()
'2316-11-12T02:09:39.000857-05:00'
>>> dt.microsecond
857

Basically, microseconds are now treated separately to avoid having to round the value.
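A minimal sketch of that idea (illustrative only, not the actual code from commit 184b94a): carry whole seconds and the microsecond component as separate integers, and never build a float timestamp. The transition values below are made-up placeholders:

```python
from datetime import datetime

def unix_parts(dt: datetime, transition_time: datetime, transition_unix: int):
    # Integer whole seconds plus a separate integer microsecond component.
    delta = dt - transition_time
    seconds = transition_unix + delta.days * 86400 + delta.seconds
    return seconds, delta.microseconds

sec, usec = unix_parts(
    datetime(2316, 11, 12, 2, 9, 39, 857),
    datetime(2007, 11, 4, 6, 0),  # hypothetical transition instant
    1194156000,                   # hypothetical matching unix time
)
print(sec, usec)  # the 857 microseconds survive exactly
```

Since the microseconds never enter the float arithmetic, neither the off-by-one truncation nor the far-future precision loss can occur.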
