datetime: add ability to parse RFC 3339 dates and times #60077
Comments
The datetime module supports output of dates and times as ISO 8601 strings ("2012-09-09T18:00:00-07:00") through the object method "isoformat([sep])", but it has no support for parsing such strings. A string-to-datetime class method should be provided, one capable of parsing at least the RFC 3339 subset of ISO 8601. The problem is parsing time zone information correctly. "strptime" does not understand timezone offsets. The "datetime" documentation suggests that the "z" format directive handles time zone info, but that's not actually implemented for input. PyPI has four modules for parsing ISO 8601 dates. Each has at least one major problem; iso8601 0.1.4 is among them. Thus, nothing on PyPI provides a good alternative. It would be appropriate to handle this in the datetime module. One small, correct, tested function would be better than the existing five bad alternatives.
%z format is supported, but it cannot accept a colon in the TZ offset. It can parse offsets like -0600 just fine. What OP is looking for is the GNU date %:z format, which datetime does not support. For ISO 8601 compliance, however, I think we need a way to specify a parser that will accept any valid 8601 format: with T or space separator, with or without : in time and timezone, and with or without dashes in the date. I would very much like such a promiscuous parser to be implemented in datetime.__new__, so that we can create datetime objects from strings the way we do it with numbers.
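A minimal sketch of the limitation described in this comment: the helper below (the name `parse_with_colon_offset` is hypothetical, not part of `datetime`) removes the colon from a trailing `+HH:MM` offset before handing the string to `strptime` with `%z`. It assumes Python 3.2 or later, where `%z` is handled for input, and a string that ends in a numeric offset.

```python
import re
from datetime import datetime, timedelta

# Matches a trailing numeric offset such as "-07:00" and captures its two parts.
_OFFSET_WITH_COLON = re.compile(r"([+-]\d{2}):(\d{2})$")

def parse_with_colon_offset(text):
    """Parse 'YYYY-MM-DDTHH:MM:SS+HH:MM' by rewriting the offset as +HHMM.

    Hypothetical helper for illustration only.
    """
    normalized = _OFFSET_WITH_COLON.sub(r"\1\2", text)
    return datetime.strptime(normalized, "%Y-%m-%dT%H:%M:%S%z")

dt = parse_with_colon_offset("2012-09-09T18:00:00-07:00")
print(dt.utcoffset() == timedelta(hours=-7))
```

Strings that already use the `-0700` form pass through unchanged, so the same helper covers both spellings.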
Re: "%z format is supported". That's platform-specific; the actual parsing is delegated to the C library. It's not in Python 2.7 / Win32:
    ValueError: 'z' is a bad directive in format '%Y-%m-%dT%H:%M:%S%z'
It really shouldn't be platform-specific; the underlying platform is irrelevant to this task. That's more of a documentation error; features not common to all supported Python platforms should not be mentioned in the documentation.
Re: "I would very much like such promiscuous parser to be implemented in datetime.__new__." For string input, it's probably better to do this conversion in a specific class-level function. Full ISO 8601 dates/times generally come from computer-generated data via a file or API. If invalid text shows up, it should be detected as an error, not heuristically interpreted as a date. There's already "fromtimestamp" and "fromordinal", so a "fromisoformat" class method would fit alongside them.
I'd also suggest providing a standard subclass of tzinfo in datetime for fixed offsets. That's needed to express the time zone information in an ISO 8601 date. With the new "fromisoformat", an ISO 8601 date/time would be convertible to a time-zone "aware" datetime object. If converted back to an ISO 8601 string with .isoformat(), the round trip should preserve the original data, including the time zone offset.
(Several more implementations of this conversion have turned up. In addition to the four already mentioned, there was one in xml.util, and one in feedparser. There are probably more yet to be found.)
On Thu, Sep 6, 2012 at 9:51 PM, John Nagle <report@bugs.python.org> wrote:
Python 2.x series is closed and cannot accept new features. Both %z |
I am attaching a quick python only prototype for the proposed feature. My goal is to make date/time objects behave like numeric types for which constructors accept strings produced by str(). Since str() format is ISO 8601, it is natural to accept ISO 8601 formats in constructors. |
We need to define the scope of what input strings will be accepted. ISO-8601 defines a lot of stuff which we may not wish to accept. Do we want to accept both basic format (YYYYMMDD) and extended format (YYYY-MM-DD)? Do we want to accept things like "1985-W15-5", which is (if I understand this correctly) the 5th day of the 15th week of 1985 [section 4.1.4.2]? Do we want to accept [section 4.2.2.4] "23:20,8", which is 23 hours, 20 minutes, 8 tenths of a minute? I suspect most people who have been following the recent thread (https://groups.google.com/d/topic/comp.lang.python/Q2w4R89Nq1w/discussion) would say none of the above are needed. All that's needed is that, if you have an existing datetime object, d1, you can do:
    s = str(d1)
    d2 = datetime.datetime(s)
    assert d1 == d2
for all values of d1. But, let's at least agree on that. Or, in the alternative, agree on something else. Then we know what we're shooting for.
On Sep 9, 2012, at 8:15 AM, Roy Smith <report@bugs.python.org> wrote:
Since it is easier to widen the domain of acceptable arguments than to narrow it in the future, I would say let's start by accepting str(x) only where x is date, time, timezone or datetime. I would leave out timedelta for now because its str(x) does not resemble ISO at all. Either that or full ISO 8601. Anything in between is just too hard to explain.
I see I mis-stated my example. When I wrote:
    s = str(d1)
    d2 = datetime.datetime(s)
    assert d1 == d2
what I really meant was:
    s = d1.isoformat()
    d2 = datetime.datetime(s)
    assert d1 == d2
But now I realize that, while that is certainly an absolute lower bound, it's almost certainly not sufficient. The most common use case I see on a daily basis is parsing strings that look like "2012-09-07T23:59:59+00:00". This is also John Nagle's original use case from the cited mailing list thread:
Datetime.isoformat() returns something that matches the beginning of that, but doesn't have the time zone offset. And it's the offset that makes strptime() unusable as a solution, because "%z" isn't portable. If we don't satisfy the "2012-09-07T23:59:59+00:00" case, then we won't have really done anything useful.
For what parts of ISO 8601 to accept, there's a standard: RFC 3339, "Date and Time on the Internet: Timestamps". See section 5.6:
    date-fullyear = 4DIGIT
    partial-time = time-hour ":" time-minute ":" time-second
    date-time = full-date "T" full-time
NOTE: Per [ABNF] and ISO8601, the "T" and "Z" characters in this
That's straightforward, and can be expressed as a regular expression.
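One way such a regular expression could look, as a sketch under stated assumptions rather than the eventual stdlib implementation: the function name `parse_rfc3339` is hypothetical, the pattern follows the RFC 3339 section 5.6 `date-time` rule (with an optional fractional second and a `Z` or `+HH:MM` offset), fractions beyond six digits are truncated, and a space is accepted in place of "T". Leap seconds (`:60`) are rejected here because `datetime` cannot represent them.

```python
import re
from datetime import datetime, timedelta, timezone

_RFC3339 = re.compile(
    r"(\d{4})-(\d{2})-(\d{2})"      # full-date: year, month, day
    r"[Tt ]"                         # date/time separator
    r"(\d{2}):(\d{2}):(\d{2})"      # partial-time: hour, minute, second
    r"(?:\.(\d+))?"                  # optional time-secfrac, any number of digits
    r"([Zz]|[+-]\d{2}:\d{2})"       # time-offset: Z or a numeric offset
)

def parse_rfc3339(text):
    """Return an aware datetime for an RFC 3339 date-time string (sketch)."""
    match = _RFC3339.fullmatch(text)
    if match is None:
        raise ValueError("not an RFC 3339 date-time: %r" % (text,))
    year, month, day, hour, minute, second = map(int, match.group(1, 2, 3, 4, 5, 6))
    fraction = match.group(7)
    # Assumption: truncate (not round) anything finer than a microsecond.
    microsecond = int(fraction[:6].ljust(6, "0")) if fraction else 0
    offset = match.group(8)
    if offset in ("Z", "z"):
        tz = timezone.utc
    else:
        sign = -1 if offset[0] == "-" else 1
        tz = timezone(sign * timedelta(hours=int(offset[1:3]),
                                       minutes=int(offset[4:6])))
    return datetime(year, month, day, hour, minute, second, microsecond, tz)

print(parse_rfc3339("2012-09-07T23:59:59+00:00"))
```

Out-of-range fields raise `ValueError` from the `datetime` constructor, so invalid text is reported as an error instead of being guessed at.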
This is exactly what isoformat() of an aware datetime looks like:
    >>> datetime.now(timezone.utc).isoformat()
    '2012-09-09T16:09:46.165886+00:00'
str() is the same, except that T is replaced by a space:
    >>> print(datetime.now(timezone.utc))
    2012-09-09 15:19:12.567692+00:00
This is almost indistinguishable from the idea of accepting .isoformat() and str() results. From what I see the only difference is that 't' is accepted for date/time separator and 'z' is accepted as a timezone. Let's start with this. As an ultimate solution, I would like to see something like codec registry so that we can do things like datetime(.., format='rfc3339') or date(.., format='gnu') for GNU parse_datetime. I think this will look more pythonic than strptime(). Of course, strptime format can also be accepted as the value for the format keyword. |
I've started collecting some test cases. I'll keep adding to the collection. I'm going to start trolling ISO 8601:2004(E) for more. Let me know if there are other sources I should be considering. |
Ooops, clicked the wrong button. |
there is a module that parses those strings pretty nicely, it’s called pyiso8601: http://code.google.com/p/pyiso8601/ in the context of writing a better plistlib, i also needed the capability to parse those strings, and decided not to use the sucky incomplete implementation of plistlib, but the one mentioned above. i py3ified it, eliminating quite some code, and the result is pretty terse, check it out: https://github.com/flying-sheep/plist/blob/master/iso8601.py note that that implementation returns utc-datetimes for timezoneless strings, instead of naive ones. (l.30) |
I've written a parser for ISO 8601: https://github.com/boxed/iso8601 Some basic tests are included and it supports most of the standard. Haven't gotten around to the more obscure parts like durations and intervals, but those are trivial to add... |
Are you offering the module for inclusion in the stdlib? |
Éric Araujo: absolutely. Although I think my code can be improved (speed wise, elegance, etc) since I just wrote it quickly a weekend :) |
John listed four modules with issues in the first message, and now we have proposals for two more modules. Could you work together to make a unified patch? Alexander, do you think there is a need to check python-ideas or python-dev before working on this? (I changed the title to clarify scope: ISO 8601 is huge and not easily accessible whereas W3CDTF/RFC 3339 is narrower in scope and freely accessible.) |
Éric> do you think there is a need to check python-ideas or python-dev before working on this? Yes, I think this is python-ideas material. IMHO, what should be added to the datetime module in 3.4 is the ability to construct date/time objects from their str() representation: assert time(str(t)) == t. I am not sure the same is needed for timedelta, but this can be discussed. An implementation of any standard external to Python should be vetted on PyPI first. There may be a reason why there is no rfc3339.py module on PyPI.
I had the issue today. I needed to parse a date with the following format.
and could not with strptime. I see a discussion in March 2014 http://code.activestate.com/lists/python-ideas/26883/ but no followup. For references: |
On closer inspection, Anders Hovmöller's proposal doesn't work, at least for the microseconds part. In http://tools.ietf.org/html/rfc3339#section-5.6, the fractional-second part is defined as: time-secfrac = "." 1*DIGIT In http://www.w3.org/TR/NOTE-datetime, same thing. Anders considers it to be only six digits. It can be more or it can be less. :) Will comment on github too.
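To illustrate the `1*DIGIT` point, here is a small sketch of turning a fraction of any length into whole microseconds. The helper name `secfrac_to_microseconds` is hypothetical, and truncating the extra digits is an assumption made for this example; rounding is a separate design question.

```python
import re

def secfrac_to_microseconds(digits):
    """Convert the digits after the '.' into whole microseconds.

    Short fractions are right-padded with zeros; digits beyond the
    sixth are dropped (truncation, an assumption of this sketch).
    """
    if not re.fullmatch(r"[0-9]+", digits):
        raise ValueError("time-secfrac needs at least one digit: %r" % (digits,))
    return int(digits[:6].ljust(6, "0"))

print(secfrac_to_microseconds("5"))        # one digit: half a second
print(secfrac_to_microseconds("9999995"))  # seven digits: the last is dropped
```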
Noticed some people doing the same thing https://github.com/tonyg/python-rfc3339 |
After inspections, the best library for parsing RFC3339 style date is definitely: Main code at |
So, shall we include it ? Otherwise, py8601 (https://bitbucket.org/micktwomey/pyiso8601/) looks pretty popular and well maintained (various committers, started in 2012, last commit in 2016). |
I'm working on the OpenStack project and iso8601 is heavily used.
I don't think that we should add the iso8601 module to the stdlib, but rather merge the iso8601 "features" into the datetime module. The iso8601 module supports Python 2.7 and so has to implement its own timezone classes. The datetime module has had datetime.timezone for fixed-offset timezones since Python 3.2. The iso8601 module provides functions. I would prefer datetime.datetime *methods*. Would you mind trying to implement that? It would be kind to contact the iso8601 author first. The unit tests are also an important part.
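For reference, a short example of the fixed-offset `datetime.timezone` class mentioned here (available since Python 3.2). The variable names are only illustrative; the date is the one used at the top of this issue.

```python
from datetime import datetime, timedelta, timezone

# A fixed UTC-07:00 offset; no DST rules are attached to it.
minus_seven = timezone(timedelta(hours=-7))

dt = datetime(2012, 9, 9, 18, 0, 0, tzinfo=minus_seven)
text = dt.isoformat()
print(text)  # 2012-09-09T18:00:00-07:00
```

Because the offset lives on the datetime object, `isoformat()` can write it back out, which is what a parser needs in order to round-trip the string.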
Hmm, ok. I guess I was confused by "dates and times" part of the subject. Ok, so only datetimes. My other comments still apply though.
Mathieu: Maybe you haven’t seen some of the comments on your older patches. E.g. my comment on fromisoformat4.patch about improper use of “with self.assertRaises(...)” still stands. Also, adding some documentation to the patch might help the likes of Anders figure out the scope of the change. I think we decided to parse RFC 3339’s “internet date and time format” profile of ISO 8601 with the date, time, and datetime classes, including tolerating arbitrary resolutions of fractions of seconds in the time, and parsing time zones. I don’t think we need to test every combination of the other ISO 8601 formats. There are already a couple of negative tests. Are there any in particular you think are important to add? |
I'm back on the issue. I'm currently stuck on the design. We need to store the regexes somewhere, and that's what causes the problem: I can't really find a good place to store them. We basically have two possible designs:
I am posting the two versions of the implementation as patches here. These address all the concerns expressed before (Martin). If we can't decide, I will post to the mailing list Martin suggested, python-ideas. By the way, are you sure it's the right one to ask? Wouldn't python-dev be more appropriate?
updated version with SilentGhost's concerns addressed. |
Please move _parse_isotime to _strptime so that it can be called from C implementation. Also, the new method should be documented. |
Otherwise, py8601 (https://bitbucket.org/micktwomey/pyiso8601/) looks pretty popular and well maintained (various committers, started in 2012, last commit in 2016). I don't think that we should add the iso8601 module to the stdlib, but merge iso8601 "features" into the datetime module. The iso8601 module supports Python 2.7 and so has to implement its own timezone classes. The datetime module now has datetime.timezone since Python 3.2 for fixed timezone. To me it's the finest, the most elegant, and no other one can claim to be more robust since it's probably the #1 iso parsing functions used in python. Have a look at https://docs.djangoproject.com/en/1.9/_modules/django/utils/dateparse/#parse_datetime. |
@larsonreever That lib is pretty limited, in that it doesn't handle dates or deltas. Again: my lib that is linked above does and has comprehensive tests. |
I think that both the pyiso8601 and boxed/iso8601 implementations parse ISO 8601 strings incorrectly. The standard explicitly says that all truncated datetime strings are *reduced accuracy timestamps*. In other words, "2017-10" is *not* equal to "2017-10-01". Instead, "2017-10" represents the whole month of October 2017. Same thing with hours. Earlier versions of ISO 8601 even allowed dropping the year: "--10-01", which meant October 1st of _any year_. They dropped this from more recent revisions of the standard. The only place where the truncated representation means "default to zero" is the timezone offset, so "10:10:00+4" and "10:10:00+04:00" mean the same thing. |
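A small sketch of what "reduced accuracy" means in practice: instead of one instant, a string such as "2017-10" is mapped to the half-open interval covering the whole month. The helper `month_interval` is hypothetical and only handles the `YYYY-MM` form.

```python
from datetime import date

def month_interval(text):
    """Return (start, end) dates for a 'YYYY-MM' string; end is exclusive."""
    year_text, month_text = text.split("-")
    year, month = int(year_text), int(month_text)
    start = date(year, month, 1)
    # The first day of the following month closes the interval.
    end = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    return start, end

start, end = month_interval("2017-10")
print(start, end)
```

Under this reading, "2017-10-01" is merely one day inside the interval rather than an equivalent value.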
P-ganssle seems to be proposing to limit parsing to exactly what “datetime.isoformat” produces; i.e. whole number of seconds, milliseconds or microseconds. Personally I would prefer it without this limitation, like in Mathieu’s patches. But P-ganssle has done some documentation, so perhaps we can combine the work of each? |
The other difference is that Mathieu guarantees ValueError for invalid input strings, which I think is good.
The better is the enemy of the good here. Given the history of this issue, I would rather accept a well documented restrictive parser than wait for a more general code to be written. Note that we can always relax the parsing rules in the future. |
I'm available again to work on this issue. I'll submit a pull
On Dec 4, 2017 at 11:45 PM, "Alexander Belopolsky" <report@bugs.python.org> wrote:
This is in fact the exact reason why I wrote the isoformat parser like I did, because ISO 8601 is actually a quite expansive standard, and this is the least controversial subset of the features. In fact, I spent quite a bit of time on adapting the general purpose ISO8601 parser I wrote for dateutil *into* one that only accepts the output of isoformat() because it places a minimum burden on ongoing support, so it's not really a matter of waiting for a more general parser to be written. I suggest that for Python 3.7 we only support output of isoformat(). Many general iso8601 parsers exist, including the one I have already implemented for python-dateutil (which will be part of the dateutil 2.7.0 release). We can have further discussion later about what exactly should be supported in Python 3.8, but even in the pre-release discussions I'm already seeing pushback about some of the more unusual 8601 formats, and it's a lot easier to explain (in documentation) that |
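The restrictive contract proposed here, "accept what isoformat() emits", can be sketched as a round-trip check. This assumes Python 3.7 or later, where `datetime.fromisoformat` exists; the helper `round_trips` is a hypothetical name used only for this example.

```python
from datetime import datetime, timedelta, timezone

def round_trips(original):
    """True if both isoformat() and str() output parse back to the same value."""
    from_iso = datetime.fromisoformat(original.isoformat())
    from_str = datetime.fromisoformat(str(original))  # space instead of "T"
    return from_iso == original and from_str == original

samples = [
    datetime(2012, 9, 9, 18, 0, 0),                                   # naive
    datetime(2012, 9, 9, 16, 9, 46, 165886, tzinfo=timezone.utc),     # aware, UTC
    datetime(2012, 9, 9, 18, 0, 0, tzinfo=timezone(timedelta(hours=-7))),
]
print(all(round_trips(sample) for sample in samples))
```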
+1 on what Paul said. Mathieu, the goal for 3.7 will be to get Paul's PR merged. It will be great if you could help in reviewing it. We can return to the features in your PR during the 3.8 development cycle. |
I forgot to address this - but I don't think this is a difference in approaches. If you pass (I'll note that my patch does not accept bytes, though this is something of an artificial limitation, since the patch makes use of the fact that all valid isoformat() strings will contain at most exactly 1 non-ascii character in position 10, so we could easily work around this, but I think the trend for CPython is to avoid blurring the lines between bytes and str rather than encouraging their interchangeable use.) |
I finally released my work. It looks like Paul's work is more comprehensive, but if you want to pick one thing or two in mine, feel free. |
Regarding Mathieu’s RFC 3339 parser, Victor wanted to use the round-half-to-even rule to get a whole number of microseconds. But considering the “time” class cannot represent 24:00, how do you round up in the extreme case past 23:59, for example time.fromisoformat("23:59:59.9999995")? Perhaps it is better to always truncate towards zero, only support 6 digits (rejecting fractions of a microsecond), or add Anders’s truncate_microseconds=True option.
@martin.panter I don't see the problem here? Wouldn't 23:59.9999995 round up to 00:00? |
Not if the time is associated with a particular day. Imagine implementing datetime.fromisoformat by separately calling date.fromisoformat and time.fromisoformat. The date will be off by one day if you naively rounded 2017-12-18 23:59 “up” to 2017-12-18 00:00. |
Yes, I suppose this is a problem if you implement it that way. Seems like a somewhat moot point, but I think any decision about rounding should probably be driven by what people are expecting more than by how it is implemented. That said, I can see a good case for truncation *and* rounding up for something like '2016-12-31T23:59:59.999999999'. Rounding up to '2017-01-01' is certainly the closest whole millisecond to round to, *but* often people expressing a "23:59:59.9999999" are trying to actually express "the last possible moment *before* 00:00". |
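The two behaviours under discussion can be compared side by side. This is a sketch, not either patch: the helper `add_fraction` is hypothetical, and it uses `decimal` so that the half-microsecond case is exact. Adding the result as a `timedelta` to a full datetime lets a round-up carry into the next day, which a bare `time` object cannot do.

```python
from datetime import datetime, timedelta
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_EVEN

def add_fraction(base, fraction_digits, rounding):
    """Add a fractional second (given as digits) to a whole-second datetime."""
    fraction = Decimal("0." + fraction_digits)
    micros = (fraction * 1000000).quantize(Decimal(1), rounding=rounding)
    return base + timedelta(microseconds=int(micros))

base = datetime(2017, 12, 18, 23, 59, 59)
truncated = add_fraction(base, "9999995", ROUND_DOWN)
rounded = add_fraction(base, "9999995", ROUND_HALF_EVEN)
print(truncated)  # stays on 2017-12-18
print(rounded)    # carries over to 2017-12-19
```

Truncation keeps "the last moment before midnight" on the same day, while rounding moves it to the next day; which one users expect is the open question in this exchange.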
I wanted to note here... I've been trying to get strptime to work with the types of dates specified in this request and came across a documentation bug here: https://docs.python.org/3.5/library/time.html#time.strptime You can see that the examples given for the %z directive have colons in them, while the format specified is +HHMM rather than the +HH:MM that the examples allude to.
maybe it's worth adding an entry in python 3.7 "what's new" ? I think it was a very long awaited issue. |
Correct, a new feature should always get a what's new entry. You could submit a PR for it :) |