New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new string-based dependency-free DOM backing engine #77
Conversation
Good idea, it's quite efficient and straightforward. One thing, sometimes I need to insert an HTML string into the DOM structure, for example, in order to add a 3rd-party video into my article, I have to get the embedded HTML from the video website, parse it to DOM objects, and append them to the draftjs DOM structure. So, if the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good job, Thibaud :)
I pointed out a few significant performance improvements.
I’m also wondering why is the class defining static methods only. I know it has to do with the way other HTML generators work, but it seems irrelevant in this case. Instead of .create_tag()
, there could just be an .__init__()
method, and the rest of the code would be simpler.
draftjs_exporter/engines/string.py
Outdated
|
||
@staticmethod | ||
def render_attrs(attr): | ||
attrs = [' {0}="{1}"'.format(a, escape(attr[a])) for a in attr] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Always use the %
operator for string formatting, it is 10 times faster than str.format()
.
I know Python’s team has pushed a lot str.format()
in the past few years, but this has changed and %
is still very relevant, mostly because of its performance as well as similarity to other languages.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, avoid adding a space before the attribute name. Instead, join using a space, then in the tag pattern in DOM_STRING.render()
, add a space between tag name and attrs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's very interesting, I didn't know that %
was faster & still recommended.
For the spaces between attributes, it's a bit more complicated. The current implementation outputs the correct format when there are no attributes (eg. <br/>
without any space within the tag), whereas what you are suggesting means there would need to be a check for whether to add the space between tag name and attribute list if there are attributes. While it's more semantically correct, it's also an extra operation.
draftjs_exporter/engines/string.py
Outdated
else: | ||
rendered = '<{0}{1}>{2}</{0}>'.format(elt['type'], attr, children) | ||
|
||
return rendered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Faster operations:
- remove useless
rendered
variable - avoid fetching multiple times
'type'
withinelt
- use
%
format
tag_name = elt['type']
if tag_name in VOID_ELEMENTS:
return '<%s%s/>' % (tag_name, attr)
if tag_name == 'fragment':
return children
return '<%s%s>%s</%s>' % (tag_name, attr, children, tag_name)
draftjs_exporter/engines/string.py
Outdated
|
||
# http://w3c.github.io/html/single-page.html#void-elements | ||
# https://github.com/html5lib/html5lib-python/blob/0cae52b2073e3f2220db93a7650901f2200f2a13/html5lib/constants.py#L560 | ||
VOID_ELEMENTS = [ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use a set
instead, it’s way faster for in
operations.
draftjs_exporter/engines/string.py
Outdated
] | ||
|
||
|
||
class DOM_STRING(DOMEngine): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Any particular reason for this class name instead of DOMString?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, no, that's just me being silly. I probably tried to replicate the somewhat questionable nomenclature of other engines (DOM_HTML5LIB
and DOM_LXML
). Going for DOMString
instead.
draftjs_exporter/engines/string.py
Outdated
attr = DOM_STRING.render_attrs(elt['attr']) | ||
|
||
children = '' | ||
if len(elt['children']) != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replace with the faster & shorter:
if elt['children']:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for all the feedback! I've implemented the perf-related suggestions, getting about a 5-10% speedup in my quick trial.
Here is the "best of 5" time, before & after:
3000933 function calls (2899267 primitive calls) in 2.682 seconds
2908579 function calls (2806913 primitive calls) in 2.588 seconds
(those times come from a new content benchmark I'm working on that's not in the repo yet)
draftjs_exporter/engines/string.py
Outdated
|
||
@staticmethod | ||
def render_attrs(attr): | ||
attrs = [' {0}="{1}"'.format(a, escape(attr[a])) for a in attr] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's very interesting, I didn't know that %
was faster & still recommended.
For the spaces between attributes, it's a bit more complicated. The current implementation outputs the correct format when there are no attributes (eg. <br/>
without any space within the tag), whereas what you are suggesting means there would need to be a check for whether to add the space between tag name and attribute list if there are attributes. While it's more semantically correct, it's also an extra operation.
draftjs_exporter/engines/string.py
Outdated
] | ||
|
||
|
||
class DOM_STRING(DOMEngine): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, no, that's just me being silly. I probably tried to replicate the somewhat questionable nomenclature of other engines (DOM_HTML5LIB
and DOM_LXML
). Going for DOMString
instead.
@BertrandBordage good question about the static methods. I don't remember my reasoning then. It doesn't feel worth the refactoring because of how seldom-used this part of the exporter's API is, but in this case yep it does seem completely over-engineered. @su27 I think it would be possible to add |
draftjs_exporter/engines/string.py
Outdated
|
||
# http://w3c.github.io/html/single-page.html#void-elements | ||
# https://github.com/html5lib/html5lib-python/blob/0cae52b2073e3f2220db93a7650901f2200f2a13/html5lib/constants.py#L560 | ||
VOID_ELEMENTS = set([ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be replaced with a set literal, i.e. {a, b, c, …}
draftjs_exporter/engines/string.py
Outdated
else: | ||
rendered += escape(c) | ||
|
||
return rendered |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
May be faster this way:
@staticmethod
def render_children(children):
render_child = DOMString.render
return ''.join([render_child(c) if isinstance(c, dict) else escape(c)
for c in children])
To be tested, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to https://waymoot.org/home/python_string/, it's worth doing!
draftjs_exporter/engines/string.py
Outdated
if type_ in VOID_ELEMENTS: | ||
return '<%s%s/>' % (type_, attr) | ||
|
||
if type_ == 'fragment': |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the chances of type_
being 'fragment'
, you may want to put it before the in VOID_ELEMENTS
check.
If type_
generally more often in VOID_ELEMENTS
than == 'fragment'
, then keep it this way. Otherwise, swap them.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think fragment
would be more frequent so will move it around.
draftjs_exporter/engines/string.py
Outdated
type_ = elt['type'] | ||
attr = '' | ||
if elt['attr']: | ||
attr = DOMString.render_attrs(elt['attr']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Faster rewrite of this, avoid defining attr
twice when elt['attr']
:
attr = DOMString.render_attrs(elt['attr']) if elt['attr'] else ''
Depending on how often we have attributes in tags, this may be faster to do:
attr = elt['attr']
attr = DOMString.render_attrs(attr) if attr else ''
draftjs_exporter/engines/string.py
Outdated
@staticmethod | ||
def render_attrs(attr): | ||
attrs = [' %s="%s"' % (a, escape(attr[a])) for a in attr] | ||
return ''.join(sorted(attrs)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should be faster, especially because we get rid of the useless sorting. In Python ≤ 3.5, this may generate different HTML from a run to another one, so it may affect the tests if you’re not using assertHTMLEqual
.
@staticmethod
def render_attrs(attr):
return ''.join([' %s="%s"' % (k, escape(v)) for k, v in attr.items()])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's quite valuable to have the same order of attributes across the board, so I'll keep the sorting for now. BeautifulSoup / html5lib and lxml also output attributes sorted.
I'm happy for this to be revisited later, but yep there'll also need to be some rework in the tests because there is no such thing as assertHTMLEqual
in vanilla Python AFAIK.
draftjs_exporter/engines/string.py
Outdated
|
||
children = '' | ||
if elt['children']: | ||
children = DOMString.render_children(elt['children']) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same comment as for attr
.
@thibaudcolas That sounds good to me. Although it may cause problems if the user doesn't process the HTML string correctly, but I think the validating responsibility should belong to the user, not the engine. If one decides to use the string-based engine, he/she should make sure the input is legal, including tag names, attribute names, and the "raw HTML". |
544259e
to
039bd59
Compare
This is now merged, with the performance improvements suggested by @BertrandBordage and the addition of @loicteixeira and I also discussed the potential impacts of this work on memory consumption. From my testing, I'm happy to report that this engine uses up "a bit less memory" than the html5lib one. I'm still learning how to do this meaningfully in Python though, but I'll soon make a new PR with some tooling. For those interested, here are some performance measurements. They were all made on the same benchmark of 792 ContentState objects, representative of the content of one site the size of kiwibank.co.nz (the content itself is machine-generated from public domain texts), available over at https://github.com/thibaudcolas/markov_draftjs. Comparison of all engines1000 runs each, all in seconds. html5lib:
lxml:
string:
Then, here are how the numbers while I went through Bertrand's suggestions, one test run building up from the previous ("baseline" is the state of that engine 20 days ago): Baseline: Min. :0.5110 List comprehension join: Min. :0.5240 Fragment check first: Min. :0.5280 Attr definition rewrite: Min. :0.5200 Unsorted attrs: Min. :0.5190 Sorted attrs: Min. :0.5140 Children definition rewrite: Min. :0.5180 |
Here as well, thanks to @BertrandBordage for the suggestion!
Fixes #62. This PR adds a new engine based on string concatenation only. With the current engine API, the HTML/DOM manipulation work that has to be done is rather straightforward so there isn't actually a need for a parser & advanced tree builder.
The implementation:
html.escape
works)Does not support theparse_html
API available on other engines.In terms of performance, this is ±3x faster than the current BeautifulSoup + html5lib implementation and ±1.5x faster than the lxml one:
The output is the same as BeautifulSoup's except for the HTML-encoded single quotes.
I think this should be released as part of 1.0.1, and then be set as the default engine in a 2.0.0 release. The lack of dependencies and performance boost are likely to be a net win for everyone, but it's still a breaking change
, especially with the lack of.parse_html
in this (eg. it's not possible to properly interface this with templating engines)Edit: this needs a bit more documentation but I thought I would do this only after we agree on a plan for 1.0.1 / 2.0.0.