
Off-by-one microsecond when parsing with certain timezones #74

Closed
dokai opened this issue Nov 12, 2016 · 5 comments

dokai commented Nov 12, 2016

I found a curious behavior when parsing timestamps where the microsecond value is off by one depending on the timezone used.

As a minimal example I'm attempting to parse the string 2016-11-12T02:09:39.594000.

Parsing with the default timezone works as expected:

In [3]: Pendulum.parse('2016-11-12T02:09:39.594000').microsecond
Out[3]: 594000

Parsing with a particular timezone, however, shows the off-by-one difference:

In [4]: Pendulum.parse('2016-11-12T02:09:39.594000', 'America/Panama').microsecond
Out[4]: 593999

Several timezones (though by no means all) exhibit this behavior:

In [5]: len(pytz.common_timezones)
Out[5]: 436

In [6]: len({name for name in pytz.common_timezones if Pendulum.parse('2016-11-12T02:09:39.594000', name).microsecond != 594000})
Out[6]: 182

I don't see why the particular choice of time or timezone should result in a difference of a single microsecond.

@sdispater (Owner) commented

It's not limited to parse():

>>> import pendulum
>>> pendulum.create(2016, 11, 12, 2, 9, 39, 594000, 'America/Panama').microsecond
593999

After checking, it seems that the wrong value comes from the C extension used to calculate offsets. If you install Pendulum without the C extensions, it works properly:

PENDULUM_EXTENSIONS=0 pip install pendulum


dokai commented Nov 12, 2016

It seems the difference between the Python and C versions is that Python rounds the value before casting to int, which the C version does not.

C

microsecond = (int64_t) (unix_time * 1000000) % 1000000;
if (microsecond < 0) {
    microsecond += 1000000;
}

Python

microsecond = int(round(unix_time % 1, 6) * 1e6)

Using the 594000 microseconds from above, we can see that, due to the way floats are represented, the fractional value varies:

>>> 1.594000 % 1
0.5940000000000001

>>> 123.594000 % 1
0.5939999999999941

Without the round() call, the Python version behaves like the C version:

>>> int((123.594000 % 1 ) * 1e6)
593999
>>> int(round(123.594000 % 1, 6) * 1e6)
594000
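The truncation-vs-rounding difference can be reproduced without Pendulum at all. A minimal sketch (plain Python; the function names are mine, not Pendulum's) contrasting the two strategies on the fractional part:

```python
def microsecond_truncating(unix_time: float) -> int:
    # Truncate the fractional part, as the C extension effectively does.
    return int((unix_time % 1) * 1e6)

def microsecond_rounding(unix_time: float) -> int:
    # Round to 6 decimal places first, as the pure-Python path does.
    return int(round(unix_time % 1, 6) * 1e6)

for ts in (1.594, 123.594):
    print(ts, microsecond_truncating(ts), microsecond_rounding(ts))
# 1.594 happens to truncate cleanly (594000), but 123.594 truncates
# to 593999; rounding first yields 594000 in both cases.
```

Whether a given timestamp truncates up or down depends only on which side of the exact value its float representation lands on, which is why some timezones (i.e., some offsets) are affected and others are not.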

dokai added a commit to dokai/pendulum that referenced this issue Nov 12, 2016
As described in sdispater#74 there is
a difference in the behavior of the C extension and Python
implementation in terms of handling microseconds.

The problem is that a microsecond value passed in may have an off-by-one
difference for certain values due to how floating point numbers are
handled, unless rounding is performed.

  >>> import pendulum
  >>> pendulum.create(2016, 11, 12, 2, 9, 39, 594000, 'America/Panama').microsecond
  593999

This commit adds the same rounding behavior to the C extension as used
in the Python version.
dokai added a commit to dokai/pendulum that referenced this issue Nov 12, 2016
@danilobellini commented

As @dokai pointed out, it's a floating point truncation error, and I think it's similar to what I described in #71. Always rounding instead of truncating would work around the issue, but I think Pendulum should never use floating point numbers internally unless the result itself has to be a float.

In this case, it goes back to the timezone _normalize method (pendulum.tz.Timezone._normalize), which calculates the unix timestamp as:

unix_time = tr.unix_time - (tr.pre_time - dt).total_seconds()
unix_time = tr.unix_time + (dt - tr.time).total_seconds()

The method total_seconds returns a floating point number. Every call to delta.total_seconds() inside Pendulum should be replaced by the integer expression delta.days * 86400 + delta.seconds, handling the microseconds part separately (e.g. as an extra parameter to local_time in both the C and Python implementations). That way, nothing is a float in between.
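The integer-only bookkeeping described above can be sketched in plain Python (the helper name delta_as_int_parts is illustrative, not a Pendulum API):

```python
from datetime import datetime, timedelta

def delta_as_int_parts(delta: timedelta) -> tuple[int, int]:
    # Exact integer seconds and microseconds -- no float, no rounding.
    return delta.days * 86400 + delta.seconds, delta.microseconds

dt = datetime(2016, 11, 12, 2, 9, 39, 594000)
epoch = datetime(1970, 1, 1)
seconds, microseconds = delta_as_int_parts(dt - epoch)
print(seconds, microseconds)  # microseconds is exactly 594000
```

timedelta already stores days, seconds, and microseconds as normalized integers, so this decomposition is lossless; only the call to total_seconds() collapses them into one lossy float.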

The given example:

>>> import pendulum
>>> from datetime import datetime
>>> dt = datetime(2016, 11, 12, 2, 9, 39, 594000)
>>> tz = pendulum.timezone("America/Panama")
>>> tr = tz.transitions[-1]

This method would be internally called on creation:

>>> tz._normalize(dt, "post")
(2016, 11, 12, 2, 9, 39, 593999, <TimezoneInfo [America/Panama, -18000, False]>)

And it does this:

>>> unix_time = tr.unix_time + (dt - tr.time).total_seconds()
>>> offset = tz._tzinfos[tr._transition_type_index].offset
>>> pendulum._extensions._helpers.local_time(unix_time, offset)
(2016, 11, 12, 2, 9, 39, 593999)

But for very large year values, the microsecond is simply lost (i.e., the unix_time value itself is no longer exact), no matter which implementation is used:

>>> dt = datetime(2316, 11, 12, 2, 9, 39, 857)
>>> unix_time = tr.unix_time + (dt - tr.time).total_seconds()
>>> "%.18f" % unix_time # Rounding/truncating wouldn't be enough
'10945955379.000856399536132812'
>>> dt = datetime(2222, 11, 12, 2, 9, 39, 1454)
>>> unix_time = tr.unix_time + (dt - tr.time).total_seconds()
>>> "%.18f" % unix_time # Neither rounding nor truncating would do it, again
'7979584179.001453399658203125'
>>> dt = datetime(2180, 7, 4, 13, 16, 8, 12)
>>> unix_time = tr.unix_time + (dt - tr.time).total_seconds()
>>> "%.18f" % unix_time
'6643016168.000011444091796875'
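This last point is a representation limit rather than a rounding bug: at magnitudes around 10^10, adjacent 64-bit floats are roughly one to two microseconds apart, so single-microsecond precision cannot survive a round-trip through a float unix timestamp. A quick check (assuming Python 3.9+ for math.ulp):

```python
import math

# Gap between adjacent representable doubles at each magnitude.
for ts in (1478916579.594, 7979584179.001453, 10945955379.000856):
    print(f"{ts:.6f}: ulp = {math.ulp(ts):.3e} s")
# The 2016 timestamp resolves to ~2.4e-7 s (finer than a microsecond),
# while the far-future ones are close to or above 1e-6 s.
```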

@sdispater (Owner) commented

@danilobellini I agree that working with floating point numbers is prone to errors and approximations.

I will check if I can cook up a better implementation to make sure we don't lose information.

@sdispater sdispater added this to the 0.7 milestone Nov 14, 2016
@sdispater sdispater added the bug label Nov 14, 2016
@sdispater sdispater self-assigned this Nov 14, 2016
@sdispater (Owner) commented

Commit 184b94a on the develop branch fixes the issue:

>>> import pendulum
>>> dt = pendulum.parse('2016-11-12T02:09:39.594000', 'America/Panama')
>>> dt.isoformat()
'2016-11-12T02:09:39.594000-05:00'
>>> dt.microsecond
594000
>>> dt = pendulum.create(2316, 11, 12, 2, 9, 39, 857, 'America/Panama')
>>> dt.isoformat()
'2316-11-12T02:09:39.000857-05:00'
>>> dt.microsecond
857

Basically, microseconds are now treated separately to avoid having to round the value.
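A minimal sketch of that idea (illustrative only, not the actual code from commit 184b94a): carry whole seconds and the microsecond component as separate integers, and never build a float timestamp. The transition values below are made-up placeholders:

```python
from datetime import datetime

def unix_parts(dt: datetime, transition_time: datetime, transition_unix: int):
    # Integer whole seconds plus a separate integer microsecond component.
    delta = dt - transition_time
    seconds = transition_unix + delta.days * 86400 + delta.seconds
    return seconds, delta.microseconds

sec, usec = unix_parts(
    datetime(2316, 11, 12, 2, 9, 39, 857),
    datetime(2007, 11, 4, 6, 0),  # hypothetical transition instant
    1194156000,                   # hypothetical matching unix time
)
print(sec, usec)  # the 857 microseconds survive exactly
```

Since the microseconds never enter the float arithmetic, neither the off-by-one truncation nor the far-future precision loss can occur.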
