faster pop_tz_offset_from_string, 2x faster dateparser.parse if date has no timezone info #569

lopuhin · 2019-09-20T15:58:04Z

When most strings don't have tz offset, this is massively faster, as we avoid a loop over all timezones (around 800 of them).
But it's possible to improve this.

TODO:

benchmark results
improve case when we do find a match

when most strings don't have tz offset, this is massively faster, as we avoid a loop over all timezones (around 800 of them). But it's possitble to improve this.

word_is_tz is supposed to be case sensetive, so don't modify it's behaviour

lopuhin · 2019-09-23T11:54:51Z

Tests run quite a bit faster after this change (e.g. tests/test_date_parser.py went from 1.7s to 1.5s, tests/test_search.py went from 2.2s to 1.9s), also I wrote a little benchmark specifically for this PR:

$ cat t.py 
import time
import random
import dateparser

def main():
    inputs_no_tz = []
    inputs_tz = []
    random.seed(42)
    for _ in range(1000):
        day = random.randint(10, 31)
        year = random.randint(1970, 2030)
        hour = random.randint(12, 23)
        date = f'{year}.12.{day} {hour}:45:00'
        inputs_no_tz.append(date)
        inputs_tz.append(date + ' EST')

    t0 = time.perf_counter()
    for x in inputs_tz:
        assert dateparser.parse(x)
    t_tz = (time.perf_counter() - t0) / len(inputs_tz)

    t0 = time.perf_counter()
    for x in inputs_no_tz:
        assert dateparser.parse(x)
    t_no_tz = (time.perf_counter() - t0) / len(inputs_no_tz)

    print(f'Time with tz: {t_tz:.6f} s, t_no_tz: {t_no_tz:.6f} s')


if __name__ == '__main__':
    main()

And here are results (on branch first):

$ python t.py
Time with tz: 0.001375 s, t_no_tz: 0.000579 s
$ python t.py
Time with tz: 0.001363 s, t_no_tz: 0.000578 s
$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ python t.py
Time with tz: 0.001353 s, t_no_tz: 0.001223 s
$ python t.py
Time with tz: 0.001344 s, t_no_tz: 0.001216 s

so parsing without TZ is more than 2x faster, while with TZ it's a tiny bit smaller. It should also be possible to bring the time of TZ parsing down.

codecov · 2019-09-23T11:57:49Z

Codecov Report

Merging #569 into master will increase coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master     #569      +/-   ##
==========================================
+ Coverage   95.11%   95.12%   +<.01%     
==========================================
  Files         302      302              
  Lines        2498     2500       +2     
==========================================
+ Hits         2376     2378       +2     
  Misses        122      122

Impacted Files	Coverage Δ
dateparser/timezone_parser.py	`98.21% <100%> (+0.06%)`	⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a08cecc...dc8e868. Read the comment docs.

lopuhin · 2019-09-23T12:01:50Z

@asadurski do you think it's fine to merge this PR as it is now (big speedup for dates without TZ, small slowdown for dates with TZ), or it's better to improve the case with TZ as well? It should be possible to eliminate the loop from pop_tz_offset_from_string entirely, but it would be a bit more changes than now.

lopuhin · 2019-09-23T12:24:20Z

I also checked import time, and it's affected a little as well:

# PR
$ for i in $(seq 10); do python -c "import time; t0 = time.perf_counter(); import dateparser; print(f'{(time.perf_counter() - t0):.3f} s')"; done
0.126 s
0.128 s
0.129 s
0.127 s
0.129 s
0.126 s
0.130 s
0.127 s
0.126 s
0.126 s
# master
$ for i in $(seq 10); do python -c "import time; t0 = time.perf_counter(); import dateparser; print(f'{(time.perf_counter() - t0):.3f} s')"; done
0.114 s
0.115 s
0.112 s
0.112 s
0.112 s
0.112 s
0.111 s
0.112 s
0.112 s
0.111 s

lopuhin · 2019-09-23T12:30:59Z

I also checked import time, and it's affected a little as well:

It should be possible to reverse this though if we lazy-init stuff used by timezone parser.

asadurski

Thank you! Huge gain with negligible trade-off. Let's merge this and focus on negative case in another PR.
And I'll look some more into performance benchmark options with Travis.

lopuhin · 2019-09-26T09:45:53Z

Thank you @asadurski 👍

And I'll look some more into performance benchmark options with Travis.

I think even a benchmark suite which can be run locally to compare with master would be very useful, and there are plenty of nice cases in the tests which could be used as benchmark input.

lopuhin added 2 commits September 20, 2019 18:56

[WIP] faster pop_tz_offset_from_string

4968bcb

when most strings don't have tz offset, this is massively faster, as we avoid a loop over all timezones (around 800 of them). But it's possitble to improve this.

fix fast path check: use re.IGNORECASE as in timezone regexps

dc8e868

word_is_tz is supposed to be case sensetive, so don't modify it's behaviour

lopuhin changed the title ~~[WIP] faster pop_tz_offset_from_string~~ faster pop_tz_offset_from_string, 2x faster dateparser.parse if date has no timezone info Sep 23, 2019

asadurski approved these changes Sep 26, 2019

View reviewed changes

asadurski merged commit f4b493f into master Sep 26, 2019

manycoding deleted the faster-pop-tz-offset branch October 28, 2019 19:41

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

faster pop_tz_offset_from_string, 2x faster dateparser.parse if date has no timezone info #569

faster pop_tz_offset_from_string, 2x faster dateparser.parse if date has no timezone info #569

lopuhin commented Sep 20, 2019 •

edited

Loading

lopuhin commented Sep 23, 2019

codecov bot commented Sep 23, 2019

lopuhin commented Sep 23, 2019

lopuhin commented Sep 23, 2019

lopuhin commented Sep 23, 2019

asadurski left a comment

lopuhin commented Sep 26, 2019

faster pop_tz_offset_from_string, 2x faster dateparser.parse if date has no timezone info #569

faster pop_tz_offset_from_string, 2x faster dateparser.parse if date has no timezone info #569

Conversation

lopuhin commented Sep 20, 2019 • edited Loading

lopuhin commented Sep 23, 2019

codecov bot commented Sep 23, 2019

Codecov Report

lopuhin commented Sep 23, 2019

lopuhin commented Sep 23, 2019

lopuhin commented Sep 23, 2019

asadurski left a comment

Choose a reason for hiding this comment

lopuhin commented Sep 26, 2019

lopuhin commented Sep 20, 2019 •

edited

Loading