-
Notifications
You must be signed in to change notification settings - Fork 886
Improve RawHtmlProcessor to have linear iso quadratic performance #454
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
It would be interesting to see some benchmarks before and after this change. Mostly, I'm just curious how much of a difference it makes. However, before it will be accepted, all the tests need to pass. Currently, 3 tests are failing related to syntax highlighting with the CodeHilite extension. That extensions actually stores its output in the HtmlStash so I'm assuming this change is causing the highlighted code blocks to not be swapped in for their placeholder's properly. Unfortunately, due to inconsistencies in output from different versions of Pygments, our tests are a little strange (we use However, it occurs to me that if your patch is relying on the placeholders being in a specific order (both in the HTMLStash and in the document), with the CodeHilite extension that order is not necessarily preserved between document and stash. Any such patch needs to support that possibility. |
Is there some error in the |
I've done some informal benchmarking. The big1.md file is a little under 10k lines big, big2 is twice that file, big3 three times, and so on. It's not entirely linear because the size of the regexp is also going to count when things get larger. Before: After: I don't know why the tests are failing. They are all successful on my machine with python 2.7 and 3.4. I am unable to reproduce the failure seen for pypy/2.7, even manually and the errors for 3.3/3.4 seem to come from inside nose. Are you referring to the use of an OrderedDict when you say that the patch relies on a specific order? The reason for the OrderedDict is to make sure that the placeholder with |
Wow, I missed that. Not sure what is going on there. Haven't seen that before. |
The failure printed when self.assertTrue was used with str.startswith did not show the string that was being searched.
I added an assertStartsWith function to the tests, so it now shows exactly why the assertions fail (at least for 2.7 and pypy):
There's an additional |
Hmm, I'd guess a different version of Pygments. Although, if you are running tox locally, then each test run should be using the most up-to-date versions. If that's not it, I don't know. |
OK, it passed for me with Pygments 2.1 but now I upgraded to 2.1.1 and it fails. Looks like this commit: https://bitbucket.org/birkenfeld/pygments-main/commits/164574c13533. This also happens without this pull request, so it's not a regression. |
Groan. Yep, that would be it. Thanks for finding it @mitya57. |
I've checked why the failure caused by pygments is not reported properly for python3 and I think it's this issue. Unfortunately that got closed with "upgrade to nose2". I do have a work-around (similar to what they are doing here):
If you'd like me to commit this, I'll add it to this pull request. Is there anything else I can do? |
@waylan: ping? I can take care of this pull request myself if you prefer (i.e. make sure the tests pass and merge it). |
@mitya57 go for it. I've been a little short on time lately. |
Merged. The tests pass everywhere according to Travis. Thanks @Griffon26 :) |
I noticed a significant slowdown as my input documents became larger.
It turned out to be caused by the RawHtmlProcessor doing a full text replace for every placeholder.
My patch changes this code to construct a regular expression that matches all items to be replaced and then replace each with the appropriate replacement text, requiring only one pass over the input text.