Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please reconsider 'guessing' whether a timestamp is in milliseconds or seconds #7940

Open
1 task done
ariebovenberg opened this issue Oct 27, 2023 · 17 comments
Open
1 task done
Assignees

Comments

@ariebovenberg
Copy link
Contributor

ariebovenberg commented Oct 27, 2023

Initial Checks

  • I confirm that I'm using Pydantic V2

Description

The following surprising behavior recently tripped me up. When parsing datetimes from unix timestamps, Pydantic
determines automagically whether you mean seconds or milliseconds:

(from the docs)

int or float; assumed as Unix time, i.e. seconds (if >= -2e10 and <= 2e10) or
milliseconds (if < -2e10or > 2e10) since 1 January 1970

This seems sensible at first...but if you're working with milliseconds, Pydantic will decide that most timestamps in 1970 and 1969 are just too small, and they will be interpreted as seconds instead — leading to wildly different times.

While most Pydantic users won't encounter this, it is not a stretch to imagine some Pydantic users are handling data from the 1970s and their validation library, ironically, is silently changing their timestamps.

If you'd rather not change this, please consider adding a big red warning to the docs on the implications of this behavior on timestamps in 1970 and 1969.

Example Code

from datetime import datetime
from pydantic import BaseModel

class Foo(BaseModel):
    a: datetime

# some timestamps in ms
recent = datetime(2023, 10, 4, 12, 22).timestamp() * 1000
apollo17_launch = datetime(1972, 12, 7, 5, 33).timestamp() * 1000
apollo13_launch = datetime(1970, 4, 11, 19, 13).timestamp() * 1000

print(Foo(a=recent))           # 2023-10-04 (correct)
print(Foo(a=apollo17_launch))  # 1972-12-07 (correct)
print(Foo(a=apollo13_launch))  # 2245-11-14 (INCORRECT)

Python, Pydantic & OS Version

pydantic version: 2.4.2
        pydantic-core version: 2.10.1
          pydantic-core build: profile=release pgo=false
                 install path: /Users/arie/.pyenv/versions/3.12.0/envs/sandbox312/lib/python3.12/site-packages/pydantic
               python version: 3.12.0 (main, Oct  3 2023, 08:33:34) [Clang 15.0.0 (clang-1500.0.40.1)]
                     platform: macOS-14.0-arm64-arm-64bit
             related packages: typing_extensions-4.8.0
@ariebovenberg ariebovenberg added bug V2 Bug related to Pydantic V2 pending Awaiting a response / confirmation labels Oct 27, 2023
@sydney-runkle sydney-runkle added change Suggested alteration to pydantic, not a new feature nor a bug and removed bug V2 Bug related to Pydantic V2 pending Awaiting a response / confirmation labels Oct 27, 2023
@samuelcolvin
Copy link
Member

This has been around since the very earliest days of pydantic, it has saved me personally hours in combined time not having to thing about whether something is in seconds or milliseconds.

I won't be changing the default.

But I would consider a config switch or similar to stop using the watershed.

@sydney-runkle sydney-runkle added the awaiting author response awaiting response from issue opener label Oct 27, 2023
@ariebovenberg
Copy link
Contributor Author

Thanks for the quick reply.

I will certainly agree the behavior is more convenient 😄, and probably saves many developers a lot of time.

Although I am curious: is there another validation library or language that also does it this way? Part of the reason for my bemusement is that I've never encountered this type of behavior before 😅.

Regarding a potential config switch: how about an option called timestamp_unit which can be set to either "auto" (the current behavior and ongoing default), "s", or "ms".

@sydney-runkle sydney-runkle added feature request and removed awaiting author response awaiting response from issue opener change Suggested alteration to pydantic, not a new feature nor a bug labels Oct 28, 2023
@samuelcolvin
Copy link
Member

Solution sounds good to me, "infer" or "guess" might be better than "auto".

@samuelcolvin
Copy link
Member

I'm just realized you're the he same person who asked the question on HN, sorry for my dumb initial response. 🙏

By the way, to fix this I think you'll need to update speedate, then pydantic-core.

@ariebovenberg
Copy link
Contributor Author

ariebovenberg commented Oct 28, 2023

I'm just realized you're the he same person who asked the question on HN, sorry for my dumb initial response. 🙏

Thanks for mentioning. Combined with your twitter post today regarding issue submission quality, I was concerned I hadn't brought it up correctly.
In any case...the HN post wasn't the lively discussion I had hoped it to be 🦗.

Solution sounds good to me, "infer" or "guess" might be better than "auto".

"infer" has a nice, neutral sound to it.

I won't be changing the default.

I do feel compelled to say that I disagree with keeping the default behavior like this. If you've already made up your mind, I won't argue — it's your project after all, and there might be some context I'm missing.
But would you be open to putting this question to the community/contributors in some way?

Part of me really wants to write a blog post to convince the world 🔥 😉 , but it's a lot more constructive to work with you all on this 🤝

By the way, to fix this I think you'll need to update speedate, then pydantic-core.

Happy to submit the PRs — if you're OK with a slow pace. I'll need to brush up on Rust, and the codebase in general.

@samuelcolvin
Copy link
Member

No you definitely weren't the trigger for my twitter idea.

I'm not pretending this is a democracy, and I'm not resorting to the tyranny of the masses / hegemony of the upvote.

More or the point, we definitely can't change the behaviour anyway until V3.

Obviously happy to accept the config switch asap.

The only thing I'd consider in V3 is to move away from a single watershed, to a ranges which are accepted, with errors outside those ranges. That would mean it doesn't "suddenly" change.

E.g. timestamps are only accepted 1980-2100 or whatever in "infer" mode.

@ariebovenberg
Copy link
Contributor Author

E.g. timestamps are only accepted 1980-2100 or whatever in "infer" mode.

Indeed, this is the way. It would remove any ambiguous cases, in favor of an explicit boundary 🙌 . Actually, I think you can even move the upper bound way up 🤔. The only requirement is that there is no overlap between the numerical ranges...

Sadly we can't move it all the way to datetime.max, but at least to 3000...

  • maximum timestamp (year 9999) in seconds is: 253_402_210_800
  • minimum timestamp (year 1980) in seconds is 315_529_200
  • maximum timestamp (year 9999) in milliseconds is: 253_402_210_800_000
  • minimum timestamp (year 1980) in milliseconds is 315_529_200_000

or alternatively, put the lower bound at 1985 and you don't need an upper bound — maybe cleaner?

@ariebovenberg
Copy link
Contributor Author

I'm not pretending this is a democracy, and I'm not resorting to the tyranny of the masses / hegemony of the upvote.

Agree that decisions shouldn't be taken this way. Although, discussion could yield interesting perspectives from people that use Pydantic daily.

The only thing I'd consider in V3 is to move away from a single watershed, to a ranges which are accepted, with errors outside those ranges. That would mean it doesn't "suddenly" change.

Makes total sense to save this for a breaking release. Also logical to make the transition itself as small as possible.

Assuming that defaults are trivial to change, the big decision can be deferred if you like. What's the ballpark date for V3? This will also help me plan my work on the PRs.

@samuelcolvin
Copy link
Member

I guess V3 should be expected in the middle of next year.

But as I said, we can add the config flag asap.

@sydney-runkle
Copy link
Member

@ariebovenberg,

I'd be happy to review your PRs for the config flag in both pydantic and pydantic-core. Thanks for your work on this 😄

@PrettyWood
Copy link
Member

I'm also interested in making sure we can ensure the unit used for timestamps. Our usecase is to validate data pipelines created by clients and running with polars.

@ariebovenberg I would be more than happy to work on speedate myself if you don't have time. You tell me!
@samuelcolvin Would you also consider adding extra units to have the full range 'ns', 'us', 'ms' and 's'?

@samuelcolvin
Copy link
Member

Yes more units should be fine on e we have agreed ranges.

@ariebovenberg
Copy link
Contributor Author

ariebovenberg commented Oct 30, 2023

@PrettyWood Thanks for the offer, but I'd like to take a stab at it myself first. I'll let you know if it doesn't work out.

My 2 cents on more time units: Makes sense to support it, but note that this would make the "infer" behavior very broad. I'd myself prefer a timestamp of 1.000.000 times the usual size would raise an exception, instead of being silently interpreted as nanoseconds 🤔. An alternative would be "infer_ms_or_s" but that also feels a bit wonky 🙃

@ariebovenberg
Copy link
Contributor Author

A naive implementation of this in speedate does cause lots of breaking changes (i.e. a new parameter to parse_str, parse_bytes, etc). Let me know what you think in the PR 🤔

@ariebovenberg
Copy link
Contributor Author

@samuelcolvin could you have a look at the speedate PR? Some open questions are blocking progress at the moment.

@ariebovenberg
Copy link
Contributor Author

@PrettyWood you're welcome to take over this issue, if you like. I won't have time for it in the near future. Feel free to build on my earlier "work in progress" PRs, or to start over 👍

@PrettyWood
Copy link
Member

Ok I'll take some time to work on it soon

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants