
Huge performance problem: DateDataParser stored multiple duplicate previous_locales #457

Closed
bzamecnik opened this issue Oct 16, 2018 · 14 comments

Comments

@bzamecnik

We were surprised to find that dateparser.parse() gets slower over time (in a long-running process, ~1 day): from hundreds of ms to 3 seconds per call! One possible cause is the following:

DateDataParser.get_date_data() stores some previously used locales:

if self.try_previous_locales:
    self.previous_locales.insert(0, locale)

However, we observed that dateparser._default_parser.previous_locales contained many duplicate instances of the same set of locales, e.g. 2189 instances of 63 unique locales.

The problem with the code above is that self.previous_locales should behave like a set (possibly an ordered one), but there's no check for whether locale is already in the list before inserting it.
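The missing guard could look like the following sketch. The class scaffolding and method name here are hypothetical; only the `try_previous_locales`/`previous_locales` names and the `insert(0, locale)` call come from the snippet above:

```python
# Hypothetical sketch of the missing duplicate check in
# DateDataParser.get_date_data(); only the attribute names and the
# insert call are taken from the real code.
class DateDataParser:
    def __init__(self, try_previous_locales=True):
        self.try_previous_locales = try_previous_locales
        self.previous_locales = []

    def _remember_locale(self, locale):
        # Only record the locale if it is not already tracked, so the
        # list stays bounded by the number of unique locales instead of
        # growing with every parse() call.
        if self.try_previous_locales and locale not in self.previous_locales:
            self.previous_locales.insert(0, locale)
```

With this guard, repeatedly parsing dates in the same locale no longer grows the list, so the linear scan over `previous_locales` stays cheap.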

Anyway, even a fresh run of dateparser.parse() is pretty slow (~400 ms / item). However, with only languages=['en'] it's really fast (~0.25 ms / item).

Package version: latest 0.7.0, Python 2.7.

@lopuhin
Member

lopuhin commented Oct 16, 2018

Hey @bzamecnik thanks for the bug report 👍 I think this is an important issue for us as well (although I didn't try to reproduce it yet). I wanted to clarify just one thing:

Anyway, even a fresh run of dateparser.parse() is pretty slow (~400 ms / item). However, with only languages=['en'] it's really fast (~0.25 ms / item).

Does this also happen only after some time, or right from the start? If it happens right away, then PR #428 has some performance improvements that are already in master but not yet released to PyPI.

@bzamecnik
Author

Thanks for the quick reply. In one of the smaller runs, with 500 strings, it happened right away. When we restricted languages to just ['en'] it was fast. The results of the mentioned PR look good. Should I make a small PR for this, or can you add the one-line check and run the measurements yourself within PR #428?

@lopuhin
Member

lopuhin commented Oct 16, 2018

@bzamecnik if you can make a PR, that would be awesome 👍

@bzamecnik
Author

OK, I can prepare that in the evening hopefully.

@htInEdin

Just to confirm that combining the #428 master with a minimal one-line patch per bzamecnik (attached) helped cut a 6.5-day run, involving millions of dates, down to 25 minutes!

date.txt

@lopuhin
Member

lopuhin commented Nov 23, 2018

Thanks, great stuff @htInEdin! I hope you and @bzamecnik don't mind if we take @bzamecnik's patch and open a PR with it? Or you can do it yourself if you wish (it could also be further improved to avoid searching a list and inserting at the beginning).
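The aside about avoiding the list search and front insertion could be addressed with a recency-ordered mapping. This is a hypothetical sketch (not the actual patch), using collections.OrderedDict for O(1) membership tests instead of an O(n) scan of a list:

```python
from collections import OrderedDict

class LocaleCache:
    """Recency-ordered locale cache: O(1) lookup, most recent first.

    Illustrative sketch only, not dateparser's actual implementation.
    """
    def __init__(self):
        self._locales = OrderedDict()

    def remember(self, locale):
        # Insert (or touch) the locale, then move it to the front;
        # move_to_end(last=False) is O(1), unlike list.insert(0, ...).
        self._locales[locale] = True
        self._locales.move_to_end(locale, last=False)

    def ordered(self):
        # Locales in most-recently-used order, with no duplicates.
        return list(self._locales)
```

A plain `set` would also fix the duplication, but the OrderedDict keeps the "try the most recently successful locale first" ordering that `previous_locales` exists for.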

@htInEdin

htInEdin commented Nov 23, 2018 via email

@jfelectron

What's the status of this? dateparser is great, but unusable on large datasets with mixed locales.

@lopuhin
Member

lopuhin commented Feb 11, 2019

@jfelectron that particular patch is not integrated and has no PR at the moment.

In general master has many other speed improvements, soon to be released: #494

@jfelectron

@lopuhin OK, thanks.

@iamtodor

iamtodor commented Feb 22, 2019

UPD: Found a hack.

date_field_datetime = dateparser.parse(date_field,
                                       date_formats=["%Y-%m-%dT%H:%M:%S.",
                                                     "%Y-%m-%dT%H:%M:%S.%f",
                                                     "%Y-%m-%dT%H:%M:%S.%fZ"])

Hello, gents. I have the same issue with the latest version. I checked the traces with https://console.cloud.google.com/traces and found a bad surprise.
Here is a screenshot: https://imgur.com/a/lsMHUzm. Somehow every 9th iteration of the loop takes a huge amount of time.
Can I do something to prevent this behavior?
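A similar fast path can also be built outside dateparser entirely: when most inputs share a known format, trying datetime.strptime first and only falling back to dateparser for the stragglers avoids the per-call language search altogether. A minimal sketch (the format list and function name are assumptions, not part of the workaround above):

```python
from datetime import datetime

# Formats we expect most inputs to match; extend as needed.
ISO_FORMATS = ("%Y-%m-%dT%H:%M:%S.%f", "%Y-%m-%dT%H:%M:%S")

def fast_parse(value):
    """Try the known formats directly; return None if none match."""
    for fmt in ISO_FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            pass
    # Caller could fall back to dateparser.parse(value) here for the
    # rare strings that don't match any known format.
    return None
```

This keeps the expensive dateparser path for genuinely ambiguous strings only.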

@Eveneko

Eveneko commented Feb 22, 2020

Hi, I'm interested in this idea. But the requirements and goals of this project on GSoC are a bit vague. Can I get more specific information? Or does it all depend on what I think and what I want to achieve?

@Gallaecio
Member

Yes, this is a very specific task that can be part of the http://gsoc2020.scrapinghub.com/ideas#dateparser-performance idea, but the GSoC proposal should include additional performance-driven changes, and probably thread-safety, as mentioned on the idea page.

If you want to work on a GSoC proposal for this, feel free to create a separate issue where we can list the different performance improvements that could be handled as part of GSoC.

@noviluni
Collaborator

noviluni commented Mar 6, 2020

Hi @Gallaecio, the original issue has been fixed in the latest release, 0.7.4 (PR: #625). I think we should close this issue and track the GSoC questions regarding performance in separate issues.

Let me know what you think, and feel free to reopen this issue.
