Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new string-based dependency-free DOM backing engine #77

Merged
merged 25 commits into from Nov 14, 2017

Conversation

thibaudcolas
Copy link
Collaborator

@thibaudcolas thibaudcolas commented Oct 23, 2017

Here as well, thanks to @BertrandBordage for the suggestion!

Fixes #62. This PR adds a new engine based on string concatenation only. With the current engine API, the HTML/DOM manipulation work that has to be done is rather straightforward so there isn't actually a need for a parser & advanced tree builder.

The implementation:

  • Processes normal and self-closing HTML elements differently.
  • Escapes text to HTML entities (single quote escapes being specific to this engine, due to how html.escape works)
  • Does not support the parse_html API available on other engines.

In terms of performance, this is ±3x faster than the current BeautifulSoup + html5lib implementation and ±1.5x faster than the lxml one:

html5lib 27228 function calls (26794 primitive calls) in 0.031 seconds
lxml 11497 function calls (11474 primitive calls) in 0.020 seconds
string 14321 function calls (13889 primitive calls) in 0.011 seconds

html5lib 27228 function calls (26794 primitive calls) in 0.025 seconds
lxml 11497 function calls (11474 primitive calls) in 0.016 seconds
string 14321 function calls (13889 primitive calls) in 0.014 seconds

The output is the same as BeautifulSoup's except for the HTML-encoded single quotes.

I think this should be released as part of 1.0.1, and then be set as the default engine in a 2.0.0 release. The lack of dependencies and performance boost are likely to be a net win for everyone, but it's still a breaking change , especially with the lack of parse_html in this (eg. it's not possible to properly interface this with templating engines).

Edit: this needs a bit more documentation but I thought I would do this only after we agree on a plan for 1.0.1 / 2.0.0.

@thibaudcolas thibaudcolas added the enhancement New feature or request label Oct 23, 2017
@thibaudcolas thibaudcolas added this to the Nice to have milestone Oct 23, 2017
@thibaudcolas thibaudcolas changed the title [WIP] Add new string-based dependency-free DOM backing engine Add new string-based dependency-free DOM backing engine Oct 23, 2017
@su27
Copy link
Contributor

su27 commented Oct 24, 2017

Good idea, it's quite efficient and straightforward.

One thing, sometimes I need to insert an HTML string into the DOM structure, for example, in order to add a 3rd-party video into my article, I have to get the embedded HTML from the video website, parse it to DOM objects, and append them to the draftjs DOM structure. So, if the parse_html method is removed, I still need a way to insert raw HTML into the article.

Copy link

@BertrandBordage BertrandBordage left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good job, Thibaud :)
I pointed out a few significant performance improvements.

I’m also wondering why is the class defining static methods only. I know it has to do with the way other HTML generators work, but it seems irrelevant in this case. Instead of .create_tag(), there could just be an .__init__() method, and the rest of the code would be simpler.


@staticmethod
def render_attrs(attr):
attrs = [' {0}="{1}"'.format(a, escape(attr[a])) for a in attr]

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always use the % operator for string formatting, it is 10 times faster than str.format().
I know Python’s team has pushed a lot str.format() in the past few years, but this has changed and % is still very relevant, mostly because of its performance as well as similarity to other languages.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For clarity, avoid adding a space before the attribute name. Instead, join using a space, then in the tag pattern in DOM_STRING.render(), add a space between tag name and attrs.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very interesting, I didn't know that % was faster & still recommended.

For the spaces between attributes, it's a bit more complicated. The current implementation outputs the correct format when there are no attributes (eg. <br/> without any space within the tag), whereas what you are suggesting means there would need to be a check for whether to add the space between tag name and attribute list if there are attributes. While it's more semantically correct, it's also an extra operation.

else:
rendered = '<{0}{1}>{2}</{0}>'.format(elt['type'], attr, children)

return rendered

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Faster operations:

  • remove useless rendered variable
  • avoid fetching multiple times 'type' within elt
  • use % format
tag_name = elt['type']
if tag_name in VOID_ELEMENTS:
    return '<%s%s/>' % (tag_name, attr)
if tag_name == 'fragment':
    return children
return '<%s%s>%s</%s>' % (tag_name, attr, children, tag_name)


# http://w3c.github.io/html/single-page.html#void-elements
# https://github.com/html5lib/html5lib-python/blob/0cae52b2073e3f2220db93a7650901f2200f2a13/html5lib/constants.py#L560
VOID_ELEMENTS = [

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use a set instead, it’s way faster for in operations.

]


class DOM_STRING(DOMEngine):

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any particular reason for this class name instead of DOMString?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no, that's just me being silly. I probably tried to replicate the somewhat questionable nomenclature of other engines (DOM_HTML5LIB and DOM_LXML). Going for DOMString instead.

attr = DOM_STRING.render_attrs(elt['attr'])

children = ''
if len(elt['children']) != 0:

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Replace with the faster & shorter:

if elt['children']:

Copy link
Collaborator Author

@thibaudcolas thibaudcolas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for all the feedback! I've implemented the perf-related suggestions, getting about a 5-10% speedup in my quick trial.

Here is the "best of 5" time, before & after:

3000933 function calls (2899267 primitive calls) in 2.682 seconds
2908579 function calls (2806913 primitive calls) in 2.588 seconds

(those times come from a new content benchmark I'm working on that's not in the repo yet)


@staticmethod
def render_attrs(attr):
attrs = [' {0}="{1}"'.format(a, escape(attr[a])) for a in attr]
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's very interesting, I didn't know that % was faster & still recommended.

For the spaces between attributes, it's a bit more complicated. The current implementation outputs the correct format when there are no attributes (eg. <br/> without any space within the tag), whereas what you are suggesting means there would need to be a check for whether to add the space between tag name and attribute list if there are attributes. While it's more semantically correct, it's also an extra operation.

]


class DOM_STRING(DOMEngine):
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, no, that's just me being silly. I probably tried to replicate the somewhat questionable nomenclature of other engines (DOM_HTML5LIB and DOM_LXML). Going for DOMString instead.

@thibaudcolas
Copy link
Collaborator Author

@BertrandBordage good question about the static methods. I don't remember my reasoning then. It doesn't feel worth the refactoring because of how seldom-used this part of the exporter's API is, but in this case yep it does seem completely over-engineered.

@su27 I think it would be possible to add parse_html onto this, all that's needed would be to circumvent the escaping of the HTML string in the render method. The big caveat is that contrary to the other BeautifulSoup and lxml engines, the HTML would be treated as a dumb string without any processing at all (checking its validity, escaping the parts that need escaping, etc). Does that sound reasonable / useful?


# http://w3c.github.io/html/single-page.html#void-elements
# https://github.com/html5lib/html5lib-python/blob/0cae52b2073e3f2220db93a7650901f2200f2a13/html5lib/constants.py#L560
VOID_ELEMENTS = set([

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This can be replaced with a set literal, i.e. {a, b, c, …}

else:
rendered += escape(c)

return rendered

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

May be faster this way:

@staticmethod
def render_children(children):
    render_child = DOMString.render
    return ''.join([render_child(c) if isinstance(c, dict) else escape(c)
                    for c in children])

To be tested, though.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to https://waymoot.org/home/python_string/, it's worth doing!

if type_ in VOID_ELEMENTS:
return '<%s%s/>' % (type_, attr)

if type_ == 'fragment':

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depending on the chances of type_ being 'fragment', you may want to put it before the in VOID_ELEMENTS check.
If type_ generally more often in VOID_ELEMENTS than == 'fragment', then keep it this way. Otherwise, swap them.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think fragment would be more frequent so will move it around.

type_ = elt['type']
attr = ''
if elt['attr']:
attr = DOMString.render_attrs(elt['attr'])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Faster rewrite of this, avoid defining attr twice when elt['attr']:

attr = DOMString.render_attrs(elt['attr']) if elt['attr'] else ''

Depending on how often we have attributes in tags, this may be faster to do:

attr = elt['attr']
attr = DOMString.render_attrs(attr) if attr else ''

@staticmethod
def render_attrs(attr):
attrs = [' %s="%s"' % (a, escape(attr[a])) for a in attr]
return ''.join(sorted(attrs))
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be faster, especially because we get rid of the useless sorting. In Python ≤ 3.5, this may generate different HTML from a run to another one, so it may affect the tests if you’re not using assertHTMLEqual.

@staticmethod
def render_attrs(attr):
    return ''.join([' %s="%s"' % (k, escape(v)) for k, v in attr.items()])

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's quite valuable to have the same order of attributes across the board, so I'll keep the sorting for now. BeautifulSoup / html5lib and lxml also output attributes sorted.

I'm happy for this to be revisited later, but yep there'll also need to be some rework in the tests because there is no such thing as assertHTMLEqual in vanilla Python AFAIK.


children = ''
if elt['children']:
children = DOMString.render_children(elt['children'])

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as for attr.

@su27
Copy link
Contributor

su27 commented Oct 26, 2017

@thibaudcolas That sounds good to me. Although it may cause problems if the user doesn't process the HTML string correctly, but I think the validating responsibility should belong to the user, not the engine. If one decides to use the string-based engine, he/she should make sure the input is legal, including tag names, attribute names, and the "raw HTML".

@thibaudcolas
Copy link
Collaborator Author

thibaudcolas commented Nov 14, 2017

This is now merged, with the performance improvements suggested by @BertrandBordage and the addition of parse_html (thank you for the feedback @su27), with the caveats discussed above documented in the README.

@loicteixeira and I also discussed the potential impacts of this work on memory consumption. From my testing, I'm happy to report that this engine uses up "a bit less memory" than the html5lib one. I'm still learning how to do this meaningfully in Python though, but I'll soon make a new PR with some tooling.

For those interested, here are some performance measurements. They were all made on the same benchmark of 792 ContentState objects, representative of the content of one site the size of kiwibank.co.nz (the content itself is machine-generated from public domain texts), available over at https://github.com/thibaudcolas/markov_draftjs.

Comparison of all engines

1000 runs each, all in seconds.

html5lib:

 Min.   :1.271
 1st Qu.:1.333
 Median :1.347
 Mean   :1.351
 3rd Qu.:1.361
 Max.   :1.756

lxml:

 Min.   :0.6150
 1st Qu.:0.6300
 Median :0.6350
 Mean   :0.6366
 3rd Qu.:0.6410
 Max.   :0.7960

string:

 Min.   :0.5190
 1st Qu.:0.5310
 Median :0.5360
 Mean   :0.5384
 3rd Qu.:0.5410
 Max.   :1.0250

Then, here are how the numbers while I went through Bertrand's suggestions, one test run building up from the previous ("baseline" is the state of that engine 20 days ago):

Baseline:

Min. :0.5110
1st Qu.:0.5250
Median :0.5410
Mean :0.5947
3rd Qu.:0.6450
Max. :1.0040

List comprehension join:

Min. :0.5240
1st Qu.:0.5340
Median :0.5400
Mean :0.5538
3rd Qu.:0.5490
Max. :0.9890

Fragment check first:

Min. :0.5280
1st Qu.:0.5360
Median :0.5430
Mean :0.5488
3rd Qu.:0.5540
Max. :0.7190

Attr definition rewrite:

Min. :0.5200
1st Qu.:0.5290
Median :0.5340
Mean :0.5367
3rd Qu.:0.5390
Max. :0.6190

Unsorted attrs:

Min. :0.5190
1st Qu.:0.5330
Median :0.5410
Mean :0.5433
3rd Qu.:0.5470
Max. :0.5930

Sorted attrs:

Min. :0.5140
1st Qu.:0.5240
Median :0.5270
Mean :0.5303
3rd Qu.:0.5370
Max. :0.5820

Children definition rewrite:

Min. :0.5180
1st Qu.:0.5270
Median :0.5310
Mean :0.5315
3rd Qu.:0.5340
Max. :0.5600

@thibaudcolas thibaudcolas mentioned this pull request Nov 21, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants