Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Timezone fixes #235

Merged
merged 7 commits into from
Sep 26, 2016
Merged

Timezone fixes #235

merged 7 commits into from
Sep 26, 2016

Conversation

waqasshabbir
Copy link
Contributor

fixes #212, #230

There was a general confusion with TIMEZONE setting. To rectify that, we've added another setting, TO_TIMEZONE to make settings self explanatory. It's a major change and is not backward compatible.

Now, dateparser consider timezone to be system's default unless specified. e.g.

In [2]: parse('10:04 am EST')
Out[2]: datetime.datetime(2016, 9, 26, 10, 4)  # date isn't converted to UTC.

No conversion takes place, and specified timezone is attached to the resultant date

In [5]: parse('10:04 am', settings={'RETURN_AS_TIMEZONE_AWARE': True, 'TIMEZONE': 'EST'})
Out[5]: datetime.datetime(2016, 9, 26, 10, 4, tzinfo=<StaticTzInfo 'EST'>)

Conversion is explicit now and must be done using TO_TIMEZONE setting.

In [7]: parse('10:04 am', settings={'TIMEZONE': 'EST', 'TO_TIMEZONE': 'Utc'})
Out[7]: datetime.datetime(2016, 9, 26, 15, 4)

@waqasshabbir waqasshabbir mentioned this pull request Sep 26, 2016
@codecov-io
Copy link

codecov-io commented Sep 26, 2016

Current coverage is 94.19% (diff: 88.23%)

Merging #235 into master will decrease coverage by 0.17%

@@             master       #235   diff @@
==========================================
  Files            32         32          
  Lines          2736       2826    +90   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits           2582       2662    +80   
- Misses          154        164    +10   
  Partials          0          0          

Powered by Codecov. Last update 73ba31c...7e0887a

@waqasshabbir waqasshabbir merged commit a7e96d3 into master Sep 26, 2016
@waqasshabbir waqasshabbir deleted the timezone-fixes branch September 26, 2016 18:57
@kmike
Copy link
Member

kmike commented Sep 26, 2016

Now, dateparser consider timezone to be system's default unless specified. e.g.

Hm, what's the reason for this change? Server timezone has nothing to do with a timezone of a site you're scraping. Using server timezone can produce hard-to-notice bugs - you can e.g. develop code locally, get it working, then deploy it to a server, and if a server happens to have a different timezone, you've got a wrong result in production. With UTC default there was no such a problem.

@pganssle
Copy link

pganssle commented Sep 26, 2016

@kmike @waqasshabbir I honestly don't understand why, from a design perspective, you're adding in this needless complexity to begin with. It seems to me that the right way to do it is to return time zone aware datetimes if the string you parsed has a time zone in it, and otherwise return a naive time zone. One option like, IGNORE_TIME_ZONE or something would allow people to specify that you want a naive object with time zone ignored, and then all other time zone logic is handled by the end user. Allowing user-specified lists of string <-> zone mappings is also probably a good idea (so users can specify that they want e.g. CST to map to China Standard time or Central Standard Time, as appropriate).

If you do something where you are handling all time zone applications as part of the parsing step, people will be confused as to how the time zones are being treated, and you'll end up with a bunch of complicated options for how to deal with different cases using different zones - all of which can easily be handled by the user just manipulating the results using python's (superior) datetime functions.

@kmike
Copy link
Member

kmike commented Sep 26, 2016

@pganssle there is one problem which can't be solved at post-processing: handling of relative dates like '2 hours ago'. On forums it is common to have relative and absolute dates mixed - older dates are absolute, recent posts/comments have relative dates. Because dateparser returns absolute dates, there must be a way to tell dateparser "this website uses this timezone" in order to parse such relative dates correctly. If you parse these dates as relative to server time or to UTC then you need to convert absolute dates to server time or UTC (otherwise returned dates will be incompatible) - and you need server timezone information for that.

I agree that all other timezone processing is less critical because it can be handled by an end user as a post-processing step.

@kmike
Copy link
Member

kmike commented Sep 26, 2016

Probably a better design would be the following:

  • return absolute dates with time zone written in them as time-zone aware;
  • return absolute dates without time zone as naive datetime objects;
  • return relative dates as time-zone aware datetime objects, either in UTC or in server timezone;
  • provide a function to normalize all dates to a single timezone; it should have 'default_timezone' argument with timezone for zone-less dates, and 'result_timezone' with a timezone date should be converted to.

@waqasshabbir
Copy link
Contributor Author

waqasshabbir commented Sep 27, 2016

@kmike reason for this change is conversion to UTC in either relative dates or incomplete dates like 2 pm was implicit and caused confusions for many users. For example, the site you are parsing dates from is in Eastern timezone, you get correct result normalized to UTC when you parse 2 hrs ago. But when you parse today at 2 pm, no conversion happens because dateparser by default assumed all dates to be in UTC, thus, a wrong recording.

You can also rephrase below as

Now, dateparser consider timezone to be system's default unless specified. e.g.

dateparser no longer make implicit conversion to a default timezone.

Having to specify settings will help user be more aware of timezone conversions and implications. When you specify TO_TIMEZONE you know, you're converting from your environment's timezone to the specified one.

re better design, not really sure, how useful is the idea of mixing naive and tzaware objects when you can do general post-processing/normalizing using settings to either return all tzaware objects or naive objects converted to a desired timezone.

@pganssle dateparser offers all of its primary functionality through one entry point dateparser.parse. That's why we introduced settings to allow customization of involved factors to tailor the end result. For example, while parsing incomplete date like Friday, we can specify PREFER_DATES_FROM to either get Friday's date from the last week or next. If day is missing from a date string like September 2016, there's a setting to specify use either first of the month or last day to populate the missing day. Timezone conversion settings share the same idea and that's why are consistent with library's design.

For timezones with similar abbreviations, some work was done early on to adopt only the most commonly used ones and avoid duplicates. See here

@pganssle
Copy link

@kmike I see how that is, but wouldn't it make more sense for relative dates to return timedelta objects, which completely obviates the need for this time zone thing?

@waqasshabbir The problem I have is that you are doing manipulations to the datetime that have no place in a parsing function. I think it's very reasonable to specify a default value for the time zone, which is replaced if a time zone is specified in the string. I also think it's very reasonable to say "even if a time zone is specified in the string, ignore it", but what you are doing is providing some facility for doing timezone math within the parsing function, which does not make sense, conceptually.

All of this makes the whole function quite confusing, because I would expect parse('2014-04-21 12:00:00 PST', settings={'TIMEZONE': 'US/Eastern'}) to return datetime(2014, 4, 21, 12, 0, tzinfo=PST), or, at a stretch, datetime(2014, 4, 21, 16, 0, tzinfo=EST), but definitely not datetime(2014, 4, 21, 16, 0). What setting do I use if I want to return something that's naive if the string is naive and tzaware if the string it tzaware? What settings would I use if I want to say, "If it's naive assume the timezone is EST, otherwise use whatever time zone setting is specified in the string?" Those would be the expected behavior of a parser, because they are actually settings about how a string is interpreted.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Timezone confusion (mine)
4 participants