Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

faster pop_tz_offset_from_string, 2x faster dateparser.parse if date has no timezone info #569

Merged
merged 2 commits into from
Sep 26, 2019

Conversation

lopuhin
Copy link
Member

@lopuhin lopuhin commented Sep 20, 2019

When most strings don't have tz offset, this is massively faster, as we avoid a loop over all timezones (around 800 of them).
But it's possible to improve this.

TODO:

  • benchmark results
  • improve case when we do find a match

when most strings don't have tz offset, this is massively faster, as we
avoid a loop over all timezones (around 800 of them).
But it's possitble to improve this.
word_is_tz is supposed to be case sensetive, so don't modify it's
behaviour
@lopuhin
Copy link
Member Author

lopuhin commented Sep 23, 2019

Tests run quite a bit faster after this change (e.g. tests/test_date_parser.py went from 1.7s to 1.5s, tests/test_search.py went from 2.2s to 1.9s), also I wrote a little benchmark specifically for this PR:

$ cat t.py 
import time
import random
import dateparser

def main():
    inputs_no_tz = []
    inputs_tz = []
    random.seed(42)
    for _ in range(1000):
        day = random.randint(10, 31)
        year = random.randint(1970, 2030)
        hour = random.randint(12, 23)
        date = f'{year}.12.{day} {hour}:45:00'
        inputs_no_tz.append(date)
        inputs_tz.append(date + ' EST')

    t0 = time.perf_counter()
    for x in inputs_tz:
        assert dateparser.parse(x)
    t_tz = (time.perf_counter() - t0) / len(inputs_tz)

    t0 = time.perf_counter()
    for x in inputs_no_tz:
        assert dateparser.parse(x)
    t_no_tz = (time.perf_counter() - t0) / len(inputs_no_tz)

    print(f'Time with tz: {t_tz:.6f} s, t_no_tz: {t_no_tz:.6f} s')


if __name__ == '__main__':
    main()

And here are results (on branch first):

$ python t.py
Time with tz: 0.001375 s, t_no_tz: 0.000579 s
$ python t.py
Time with tz: 0.001363 s, t_no_tz: 0.000578 s
$ git checkout master
Switched to branch 'master'
Your branch is up to date with 'origin/master'.
$ python t.py
Time with tz: 0.001353 s, t_no_tz: 0.001223 s
$ python t.py
Time with tz: 0.001344 s, t_no_tz: 0.001216 s

so parsing without TZ is more than 2x faster, while with TZ it's a tiny bit smaller. It should also be possible to bring the time of TZ parsing down.

@codecov
Copy link

codecov bot commented Sep 23, 2019

Codecov Report

Merging #569 into master will increase coverage by <.01%.
The diff coverage is 100%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #569      +/-   ##
==========================================
+ Coverage   95.11%   95.12%   +<.01%     
==========================================
  Files         302      302              
  Lines        2498     2500       +2     
==========================================
+ Hits         2376     2378       +2     
  Misses        122      122
Impacted Files Coverage Δ
dateparser/timezone_parser.py 98.21% <100%> (+0.06%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a08cecc...dc8e868. Read the comment docs.

@lopuhin lopuhin changed the title [WIP] faster pop_tz_offset_from_string faster pop_tz_offset_from_string, 2x faster dateparser.parse if date has no timezone info Sep 23, 2019
@lopuhin
Copy link
Member Author

lopuhin commented Sep 23, 2019

@asadurski do you think it's fine to merge this PR as it is now (big speedup for dates without TZ, small slowdown for dates with TZ), or it's better to improve the case with TZ as well? It should be possible to eliminate the loop from pop_tz_offset_from_string entirely, but it would be a bit more changes than now.

@lopuhin
Copy link
Member Author

lopuhin commented Sep 23, 2019

I also checked import time, and it's affected a little as well:

# PR
$ for i in $(seq 10); do python -c "import time; t0 = time.perf_counter(); import dateparser; print(f'{(time.perf_counter() - t0):.3f} s')"; done
0.126 s
0.128 s
0.129 s
0.127 s
0.129 s
0.126 s
0.130 s
0.127 s
0.126 s
0.126 s
# master
$ for i in $(seq 10); do python -c "import time; t0 = time.perf_counter(); import dateparser; print(f'{(time.perf_counter() - t0):.3f} s')"; done
0.114 s
0.115 s
0.112 s
0.112 s
0.112 s
0.112 s
0.111 s
0.112 s
0.112 s
0.111 s

@lopuhin
Copy link
Member Author

lopuhin commented Sep 23, 2019

I also checked import time, and it's affected a little as well:

It should be possible to reverse this though if we lazy-init stuff used by timezone parser.

Copy link
Member

@asadurski asadurski left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you! Huge gain with negligible trade-off. Let's merge this and focus on negative case in another PR.
And I'll look some more into performance benchmark options with Travis.

@asadurski asadurski merged commit f4b493f into master Sep 26, 2019
@asadurski asadurski added this to To release in Project board Sep 26, 2019
@lopuhin
Copy link
Member Author

lopuhin commented Sep 26, 2019

Thank you @asadurski 👍

And I'll look some more into performance benchmark options with Travis.

I think even a benchmark suite which can be run locally to compare with master would be very useful, and there are plenty of nice cases in the tests which could be used as benchmark input.

@manycoding manycoding deleted the faster-pop-tz-offset branch October 28, 2019 19:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
Project board
To release
Development

Successfully merging this pull request may close these issues.

None yet

2 participants