
Huge performance problem: DateDataParser stored multiple duplicate previous_locales #457

Closed
bzamecnik opened this issue Oct 16, 2018 · 14 comments

Comments

@bzamecnik

We were surprised to find that dateparser.parse() gets slower over time (in a long-running process, ~1 day): from hundreds of ms to 3 seconds per call! One possible cause is the following:

DateDataParser.get_date_data() stores some previously used locales:

if self.try_previous_locales:
    self.previous_locales.insert(0, locale)

However, we observed that dateparser._default_parser.previous_locales contained many duplicate instances of the same set of locales, e.g. 2189 instances of 63 unique locales.

The problem with the code above is that self.previous_locales should behave like a set (possibly an ordered one), but there's no check for whether locale is already in the list before inserting it.
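The missing guard could look like the following sketch. The class scaffolding and method name here are hypothetical; only the `try_previous_locales`/`previous_locales` names and the `insert(0, locale)` call come from the snippet above:

```python
# Hypothetical sketch of the missing duplicate check in
# DateDataParser.get_date_data(); only the attribute names and the
# insert call are taken from the real code.
class DateDataParser:
    def __init__(self, try_previous_locales=True):
        self.try_previous_locales = try_previous_locales
        self.previous_locales = []

    def _remember_locale(self, locale):
        # Only record the locale if it is not already tracked, so the
        # list stays bounded by the number of unique locales instead of
        # growing with every parse() call.
        if self.try_previous_locales and locale not in self.previous_locales:
            self.previous_locales.insert(0, locale)
```

With this guard, repeatedly parsing dates in the same locale no longer grows the list, so the linear scan over `previous_locales` stays cheap.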

Anyway, even a fresh run of dateparser.parse() is pretty slow (~400 ms / item). However, with only languages=['en'] it's really fast (~0.25 ms / item).

Package version: latest 0.7.0, Python 2.7.

@lopuhin
Member

lopuhin commented Oct 16, 2018

Hey @bzamecnik thanks for the bug report 👍 I think this is an important issue for us as well (although I didn't try to reproduce it yet). I wanted to clarify just one thing:

Anyway, even a fresh run of dateparser.parse() is pretty slow (~400 ms / item). However, with only languages=['en'] it's really fast (~0.25 ms / item).

Does this also happen only after some time, or right from the start? If it happens right away, then PR #428 has some performance improvements that are already in master but not yet released to PyPI.

@bzamecnik
Author

Thanks for the quick reply. In one of the smaller runs, with 500 strings, it happened right away. When we restricted languages to just ['en'] it was fast. The results of the mentioned PR look good. Should I make a small PR for this, or can you add the one-line check and run the measurements yourself within PR #428?

@lopuhin
Member

lopuhin commented Oct 16, 2018

@bzamecnik if you can make a PR, that would be awesome 👍

@bzamecnik
Author

OK, I can prepare that in the evening hopefully.

@htInEdin

Just to confirm that combining the #428 master with a minimal one-line patch per bzamecnik (attached) helped cut a 6.5-day run, involving millions of dates, down to 25 minutes!

date.txt

@lopuhin
Member

lopuhin commented Nov 23, 2018

Thanks, great stuff @htInEdin! I hope you and @bzamecnik don't mind if we take @bzamecnik's patch and open a PR with it? Or you can do it yourself if you wish (it could also be further improved to avoid searching a list and inserting at the beginning).
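The aside about avoiding the list search and front insertion could be addressed with a recency-ordered mapping. This is a hypothetical sketch (not the actual patch), using collections.OrderedDict for O(1) membership tests instead of an O(n) scan of a list:

```python
from collections import OrderedDict

class LocaleCache:
    """Recency-ordered locale cache: O(1) lookup, most recent first.

    Illustrative sketch only, not dateparser's actual implementation.
    """
    def __init__(self):
        self._locales = OrderedDict()

    def remember(self, locale):
        # Insert (or touch) the locale, then move it to the front;
        # move_to_end(last=False) is O(1), unlike list.insert(0, ...).
        self._locales[locale] = True
        self._locales.move_to_end(locale, last=False)

    def ordered(self):
        # Locales in most-recently-used order, with no duplicates.
        return list(self._locales)
```

A plain `set` would also fix the duplication, but the OrderedDict keeps the "try the most recently successful locale first" ordering that `previous_locales` exists for.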

@htInEdin

htInEdin commented Nov 23, 2018 via email

@jfelectron

What's the status of this? dateparser is great, but unusable on large datasets with mixed locales.

@lopuhin
Member

lopuhin commented Feb 11, 2019

@jfelectron that particular patch is not integrated and has no PR at the moment.

In general master has many other speed improvements, soon to be released: #494

@jfelectron

@lopuhin OK, thanks.

@iamtodor

iamtodor commented Feb 22, 2019

UPD: Found a hack.

date_field_datetime = dateparser.parse(date_field,
                                       date_formats=["%Y-%m-%dT%H:%M:%S.",
                                                     "%Y-%m-%dT%H:%M:%S.%f",
                                                     "%Y-%m-%dT%H:%M:%S.%fZ"])

Hello, gents. I have the same issue with the latest version. I checked the traces with https://console.cloud.google.com/traces and found a bad surprise.
Here is a screenshot: https://imgur.com/a/lsMHUzm. Somehow every 9th iteration of the loop takes a huge amount of time.
Can I do something to prevent this behavior?
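A similar fast path can also be built outside dateparser entirely: when most inputs share a known format, trying datetime.strptime first and only falling back to dateparser for the stragglers avoids the per-call language search altogether. A minimal sketch (the format list and function name are assumptions, not part of the workaround above):

```python
from datetime import datetime

# Formats we expect most inputs to match; extend as needed.
ISO_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S")

def fast_parse(value):
    """Try the known formats directly; return None if none match."""
    for fmt in ISO_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    # Caller could fall back to dateparser.parse(value) here for the
    # rare strings that don't match any known format.
    return None
```

This keeps the expensive dateparser path for genuinely ambiguous strings only.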

@Eveneko

Eveneko commented Feb 22, 2020

Hi, I'm interested in this idea. But the requirements and goals of this project on GSoC are a bit vague. Can I get more specific information? Or does it all depend on what I think and what I want to achieve?

@Gallaecio
Member

Yes, this is a very specific task that can be part of the http://gsoc2020.scrapinghub.com/ideas#dateparser-performance idea, but the GSoC proposal should include additional performance-driven changes, and probably thread-safety, as mentioned on the idea page.

If you want to work on a GSoC proposal for this, feel free to create a separate issue where we can list the different performance improvements that could be handled as part of GSoC.

@noviluni
Collaborator

noviluni commented Mar 6, 2020

Hi @Gallaecio, the original issue has been fixed in the latest release, 0.7.4 (PR: #625). I think we should close this issue and track the GSoC questions regarding performance in separate issues.

Let me know what you think, and feel free to reopen this issue.
