Possible performance improvement in email parsing #106628

Closed
cfbolz opened this issue Jul 11, 2023 · 0 comments
Labels
performance (Performance or resource usage), stdlib (Python modules in the Lib dir), topic-email, type-bug (An unexpected behavior, bug, or error)

Comments

@cfbolz
Contributor

cfbolz commented Jul 11, 2023

PyPy received the following performance bug today: https://foss.heptapod.net/pypy/pypy/-/issues/3961

Somebody who was trying to process a lot of emails from an mbox file was complaining about terrible performance on PyPy. The problem turned out to be the fact that email.feedparser.FeedParser._parsegen compiles a new regular expression for every multipart message in the mbox file. On PyPy this is particularly bad, because those regular expressions are jitted, which costs even more time. However, even on CPython, compiling these regular expressions takes a noticeable portion of the benchmark.

I fixed this problem in PyPy by simply using str.startswith to check for the multipart separator, followed by a generic regular expression that can be used for arbitrary boundaries. In PyPy this helps massively, but in CPython it's still a 20% performance improvement. I will open a PR for it.
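To make the idea concrete, here is a minimal sketch of the approach, not the actual patch. The names `_BOUNDARY_TAIL`, `match_boundary` and `match_boundary_per_message` are made up for illustration, and the exact shape of the tail pattern (optional closing `--`, trailing blanks, optional line ending) is an assumption about what a boundary line looks like rather than something taken from the feedparser source.

```python
import re

# Compiled once at import time. It does not mention the boundary string,
# so it can be reused for every multipart message (assumed pattern shape).
_BOUNDARY_TAIL = re.compile(
    r'(?P<end>--)?(?P<ws>[ \t]*)(?P<linesep>\r\n|\r|\n)?$')


def match_boundary(line, separator):
    """Return a match object if `line` is a boundary line for `separator`.

    The boundary-specific part is checked with str.startswith, so no
    per-message regular expression has to be compiled.
    """
    if not line.startswith(separator):
        return None
    # Match only the remainder of the line, starting right after the separator.
    return _BOUNDARY_TAIL.match(line, len(separator))


def match_boundary_per_message(line, separator):
    """For comparison: builds a pattern that embeds the boundary string,
    which means one new pattern for every distinct boundary."""
    pattern = re.compile(
        '(?P<sep>' + re.escape(separator) + ')'
        r'(?P<end>--)?(?P<ws>[ \t]*)(?P<linesep>\r\n|\r|\n)?$')
    return pattern.match(line)
```

Both helpers should accept and reject the same lines; the difference is that the first one never compiles anything per message, which is where the time was going.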

@cfbolz cfbolz added the type-bug An unexpected behavior, bug, or error label Jul 11, 2023
cfbolz added a commit to cfbolz/cpython that referenced this issue Jul 11, 2023
Don't compile a new regular expression for every single email that is
being parsed. Instead, use str.startswith and a generic regular
expression.
@AlexWaygood AlexWaygood added performance Performance or resource usage stdlib Python modules in the Lib dir topic-email labels Jul 11, 2023