About the performance of bs4 + html5lib #31
Hey @su27, performance is definitely a concern. I do not know much about performance testing and profiling Python code, so thank you very much for the above! For optimisations, I think the first and most significant step would be to make the parser configurable, so anyone can switch to lxml or anything else that BeautifulSoup supports. The second step would be to allow further changes to the internals of the DOM handling. For now, do you mind if I integrate your profiling code directly within the project?
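For anyone following along, profiling like su27's can be done with nothing but the standard library. A minimal sketch, where `render` is a trivial stand-in for the exporter's real render call (not the actual API):

```python
import cProfile
import io
import pstats

def render(markup):
    # Stand-in for the real call, e.g. HTML(config).render(content_state).
    return markup.upper()

profiler = cProfile.Profile()
profiler.enable()
for _ in range(1000):
    render('<p>hello</p>')
profiler.disable()

# Report the five most expensive calls, sorted by cumulative time.
out = io.StringIO()
pstats.Stats(profiler, stream=out).sort_stats('cumulative').print_stats(5)
report = out.getvalue()
print(report.splitlines()[0].strip())
```

The "N function calls in X seconds" summary is what the timings quoted later in this thread come from.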
Hi @thibaudcolas, yes, I think making the parser configurable would be simple but very useful. I have been using lxml for a few days, and found it won't cause much trouble as long as I choose the right HTML tags. Of course I don't mind if you integrate my code, glad to help.
Good morning @thibaudcolas! Yesterday I tried switching to lxml alone, instead of BS4 + lxml, and found it's 8-10 times faster! It takes just 0.017s to run the benchmark above. In fact, I knew it would be much faster, but 10 times? That's amazing. What I've tested was basically your previous version of the code.
Glad to know the change was simple enough for you to make! I've been thinking about this a lot, and the best solution I could think of is the following.

The thing I have trouble solving is that right now every file imports the DOM wrapper directly. One option would be a factory method:

```python
DOM = HTML.get_dom('bs-lxml')
```

An alternative would be import aliases:

```python
from draftjs_exporter.dom import DOM_BS_HTML5LIB as DOM
from draftjs_exporter.dom import DOM_BS_LXML as DOM
from draftjs_exporter.dom import DOM_LXML as DOM
# Or implement your own DOM

# And then
exporter = HTML(config, DOM)
```
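For illustration, the string-based factory could be as little as a dict lookup. This is just a sketch of the idea, not the exporter's actual API; the engine names and classes are placeholders:

```python
class DOM_BS_HTML5LIB:  # placeholder engine classes
    pass

class DOM_LXML:
    pass

ENGINES = {
    'bs-html5lib': DOM_BS_HTML5LIB,
    'lxml': DOM_LXML,
}

def get_dom(name_or_class):
    # Accept either a known engine name or a user-provided class,
    # so "implement your own DOM" keeps working.
    if isinstance(name_or_class, str):
        try:
            return ENGINES[name_or_class]
        except KeyError:
            raise ValueError('Unknown engine: %s' % name_or_class)
    return name_or_class
```

Passing a class straight through keeps the door open for custom engines without any registration step.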
Also, as an aside, I have started refactoring the exporter a bit, and was able to make it 2x faster quite easily. It's definitely not as big of a change as using lxml, but I will explore this further too.
I think either the factory method or the import-as-alias way would be nice enough to use: simple and clear. After my previous comment, I encountered some problems; it seems it's not that simple to switch to lxml. But I got them done (I hope I really did).

So, it seems there is still plenty of work to do. On the other hand, glad you've refactored it! I did a little refactoring too: su27@14d866d. I don't know if it's similar to what you did.
I've just made changes that make the exporter perform significantly faster (±20-30x) regardless of which engine is used. Here are the numbers for the profiled test cases:
Here are details of what I did, with timings along the way (each label loosely matches a commit):

- **Start**: Example: 115913 function calls (115360 primitive calls) in 0.191 seconds
- **Stop creating empty wrappers**: Example: 113933 function calls (113396 primitive calls) in 0.279 seconds
- **Stop creating empty fragments when there are no composite decorators**: Example: 97056 function calls (96599 primitive calls) in 0.174 seconds
- **With decorators in the perf tests, but no decorations**: Tests: 137648 function calls (137048 primitive calls) in 0.284 seconds (ran 146 tests in 1.150s)
- **With decorators, and a decoration test**: Tests: 145860 function calls (145222 primitive calls) in 0.303 seconds
- **Cache empty soup**: Example: 24447 function calls (24007 primitive calls) in 0.292 seconds
- **Stop extra parsing**: Example: 8570 function calls (8342 primitive calls) in 0.010 seconds
- **Stop generating unused command groups**: Example: 7356 function calls (7172 primitive calls) in 0.007 seconds
- **Remove useless command sorting**: Example: 7243 function calls (7059 primitive calls) in 0.007 seconds

Finally, I tried using lxml directly (I did not find a significant performance benefit when using lxml with BeautifulSoup, not sure why). I ran into some issues and will have to stop there for now.

- **lxml directly**: Example: 2947 function calls (2927 primitive calls) in 0.003 seconds

Here is the implementation:

```python
from __future__ import absolute_import, unicode_literals

import inspect
import re

from lxml import etree, html

# See http://stackoverflow.com/questions/7703018/how-to-write-namespaced-element-attributes-with-lxml
XLINK = 'http://www.w3.org/1999/xlink'


class DOM(object):
    """
    Wrapper around our HTML building library to facilitate changes.
    """
    @staticmethod
    def create_tag(type_, attributes=None):
        return etree.Element(type_, attrib=attributes or {})

    @staticmethod
    def create_element(type_=None, props=None, *children):
        """
        Signature inspired by React.createElement.
        createElement(
            string/ReactClass type_,
            [object props],
            [children ...]
        )
        https://facebook.github.io/react/docs/top-level-api.html#react.createelement
        """
        if props is None:
            props = {}

        if not type_:
            elt = DOM.create_document_fragment()
        else:
            attributes = {}

            # Map props from React/Draft.js to lxml lingo.
            if 'className' in props:
                props['class'] = props.pop('className')

            # One-off fix ATM, even though the problem is everywhere.
            if 'xlink:href' in props:
                props['{%s}href' % XLINK] = props.get('xlink:href')
                props.pop('xlink:href', None)

            for key in props:
                prop = props[key]
                # Filter null values and cast to string for lxml.
                if prop is not None:
                    attributes[key] = str(prop)

            if inspect.isclass(type_):
                elt = type_().render(attributes)
            elif callable(getattr(type_, 'render', None)):
                elt = type_.render(attributes)
            elif callable(type_):
                elt = type_(attributes)
            else:
                elt = DOM.create_tag(type_, attributes)

        for child in children:
            if hasattr(child, 'tag'):
                DOM.append_child(elt, child)
            else:
                DOM.set_text_content(elt, DOM.get_text_content(elt) + child if DOM.get_text_content(elt) else child)

        return elt

    @staticmethod
    def create_document_fragment():
        return DOM.create_tag('fragment')

    @staticmethod
    def create_text_node(text):
        elt = DOM.create_tag('textnode')
        DOM.set_text_content(elt, text)
        return elt

    @staticmethod
    def parse_html(markup):
        return html.fromstring(markup)

    @staticmethod
    def append_child(elt, child):
        elt.append(child)

    @staticmethod
    def set_attribute(elt, attr, value):
        elt.set(attr, value)

    @staticmethod
    def get_tag_name(elt):
        return elt.tag

    @staticmethod
    def get_class_list(elt):
        return [elt.get('class')]

    @staticmethod
    def get_text_content(elt):
        return elt.text

    @staticmethod
    def set_text_content(elt, text):
        elt.text = text

    @staticmethod
    def get_children(elt):
        return elt.getchildren()

    @staticmethod
    def render(elt):
        """
        Removes the fragments that should not have HTML tags. Caveat of lxml.
        Dirty, but quite easy to understand.
        """
        return re.sub(r'</?(fragment|textnode)>', '', etree.tostring(elt, method='html').decode('utf-8'))

    @staticmethod
    def pretty_print(markup):
        """
        Convenience method.
        Pretty print the element, removing the top-level node that lxml needs.
        """
        return re.sub(r'</?doc>', '', etree.tostring(html.fromstring('<doc>%s</doc>' % markup), encoding='unicode', pretty_print=True))
```
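To see why the `fragment`/`textnode` stripping in `render` works, here is a tiny demo using the stdlib `xml.etree.ElementTree` as a stand-in for `lxml.etree` (the APIs are close enough for this purpose):

```python
import re
import xml.etree.ElementTree as etree

# Build a fake document fragment wrapping a real tag, as create_element does
# when it is given no tag type.
frag = etree.Element('fragment')
link = etree.SubElement(frag, 'a', {'class': 'link', 'href': 'http://example.com'})
link.text = 'hi'

markup = etree.tostring(frag, encoding='unicode')
# Strip the wrapper tags that only exist because lxml has no real fragments.
rendered = re.sub(r'</?(fragment|textnode)>', '', markup)
print(rendered)  # <a class="link" href="http://example.com">hi</a>
```

The placeholder tags never collide with real HTML element names, which is what makes the regex safe here despite being "dirty".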
Awesome work! This improvement is significant. One thing about the lxml engine: maybe you should take care of that too.
#56 is merged so I will consider this fixed for now. @su27, I would be keen to know how this goes for you. Hopefully the "custom backing engines" feature will still allow you to tweak your own engine.
Glad to know! I tried it, and there were a few problems for me in adopting this version. I'll fix them and check that everything is okay in the production environment after May 3rd, because I'm on vacation until then.
@thibaudcolas Done! I've modified my code to fit v1.0.0 and now it works well; I'm going to deploy it to production today. My customised engine now looks like this:

```python
# coding: utf-8
import re

from lxml import etree

from draftjs_exporter.engines.lxml import DOM_LXML

NSMAP = {
    'xlink': 'http://www.w3.org/1999/xlink',
}

try:
    UNICODE_EXISTS = bool(type(unicode))
except NameError:
    def unicode(s):
        return str(s)


def clean_str(s):
    if not isinstance(s, basestring):
        s = unicode(s)
    elif not isinstance(s, unicode):
        s = s.decode('utf8')
    # See http://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python
    return re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\u10000-\u10FFFF]+', '', s)


class LXMLEngine(DOM_LXML):
    @staticmethod
    def create_tag(type_, attr=None):
        nsmap = None
        if attr:
            if 'xlink:href' in attr:
                attr['{%s}href' % NSMAP['xlink']] = attr.pop('xlink:href')
                nsmap = NSMAP
            attr = {k: clean_str(v) for k, v in attr.iteritems()}
        return etree.Element(type_, attrib=attr, nsmap=nsmap)

    @staticmethod
    def append_child(elt, child):
        if hasattr(child, 'tag'):
            elt.append(child)
        else:
            c = etree.Element('fragment')
            c.text = clean_str(child)
            elt.append(c)
```

This is pretty straightforward. The tricky part was finding out which parts of the API had changed since v0.6 and fixing the affected entities and composite decorators. Finally I don't need to maintain my own version of draftjs_exporter, hooray!
Cool! There have been a lot of breaking changes, so you will likely have plenty of updating to do. I've tried my best to describe the changes in the CHANGELOG, more recently completed with "How to upgrade" instructions.
Oh my fault, I should have read that first! I've just fixed a bug caused by one of those removals.
Hi Thibaud Colas,
Here I've got another problem. In my project, when I wrote a real-world note which is not too long but has a lot of entities, I found it takes more than 5 seconds to render. Of course that's unacceptable for an online service, so I tried to reduce the number of temporary wrapper elements to optimise the speed. I made it a little better, down to a bit over 3 seconds, but that's all I could do.
But when I tried to use lxml instead of html5lib, the rendering time decreased to less than 1 second!
WTH? Then I found someone's benchmark, which explains the huge difference (with Python 2).
And here's my simple test case with a few images and "subjects". With lxml, the rendering takes 0.17 seconds:
But with html5lib it takes about 0.65 seconds.
So, any suggestions for optimizing? And I don't know if html5lib is good enough for us to ignore the performance issue; what do you think? Thank you~
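For reference, a minimal stdlib-only timing harness of the kind used for the numbers above. `parse` here is a trivial `html.parser` stand-in for the workload; in practice you would time `exporter.render(content_state)` instead:

```python
import timeit
from html.parser import HTMLParser  # stdlib parser, as a stand-in workload

class TagCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)

def parse(markup):
    collector = TagCollector()
    collector.feed(markup)
    return collector.tags

markup = '<p><a href="#">link</a> text</p>' * 50
elapsed = timeit.timeit(lambda: parse(markup), number=100)
print('%d tags, %.4fs for 100 runs' % (len(parse(markup)), elapsed))
```

Swapping the body of `parse` between backends while keeping the harness fixed is enough to reproduce comparisons like the 0.17s vs 0.65s above.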