Add new string-based dependency-free DOM backing engine #77

Merged
merged 25 commits into from Nov 14, 2017

@thibaudcolas
Collaborator

thibaudcolas commented Oct 23, 2017

Here as well, thanks to @BertrandBordage for the suggestion!

Fixes #62. This PR adds a new engine based on string concatenation only. With the current engine API, the HTML/DOM manipulation work that has to be done is rather straightforward, so there isn't actually a need for a parser and an advanced tree builder.

The implementation:

  • Processes normal and self-closing HTML elements differently.
  • Escapes text to HTML entities (single quote escapes being specific to this engine, due to how html.escape works)
  • Does not support the parse_html API available on other engines.
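To make the first two points concrete, here is a minimal sketch of string-based rendering. It is not the code from this PR: the `VOID_ELEMENTS` set, the `(tag, attrs, children)` tuple layout, and the `render` / `render_attrs` names are assumptions chosen for illustration. Only `html.escape` is the real standard-library call, and it is the reason single quotes come out as `&#x27;`.

```python
from html import escape

# Hypothetical subset of self-closing (void) elements, for illustration only.
VOID_ELEMENTS = frozenset(["br", "hr", "img", "input", "link", "meta"])


def render_attrs(attrs):
    """Serialize attributes as ' key="value"' pairs, sorted for stable output."""
    return "".join(
        [' {}="{}"'.format(key, escape(str(value))) for key, value in sorted(attrs.items())]
    )


def render(node):
    """Render a text node (str) or an element given as (tag, attrs, children)."""
    if isinstance(node, str):
        # html.escape also escapes quotes by default, including ' as &#x27;.
        return escape(node)

    tag, attrs, children = node
    opening = "<{}{}".format(tag, render_attrs(attrs))

    if tag in VOID_ELEMENTS:
        # Self-closing elements have no children and no closing tag.
        return opening + "/>"

    inner = "".join([render(child) for child in children])
    return "{}>{}</{}>".format(opening, inner, tag)


example = render(("p", {"class": "intro"}, ["It's ", ("br", {}, []), "<ok>"]))
print(example)
```

Because everything is plain string concatenation, no third-party parser or tree builder is involved at any point.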

In terms of performance, this is roughly 3x faster than the current BeautifulSoup + html5lib implementation and roughly 1.5x faster than the lxml one:

html5lib 27228 function calls (26794 primitive calls) in 0.031 seconds
lxml 11497 function calls (11474 primitive calls) in 0.020 seconds
string 14321 function calls (13889 primitive calls) in 0.011 seconds

html5lib 27228 function calls (26794 primitive calls) in 0.025 seconds
lxml 11497 function calls (11474 primitive calls) in 0.016 seconds
string 14321 function calls (13889 primitive calls) in 0.014 seconds
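The lines above have the shape of `cProfile` summaries. As a rough sketch of how such a line can be produced (the `profile_call` helper and the `build_markup` workload are hypothetical stand-ins, not the benchmark used for this PR), one can read the totals that `pstats.Stats` computes:

```python
import cProfile
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile and return (total calls, primitive calls, seconds)."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args, **kwargs)
    finally:
        profiler.disable()

    stats = pstats.Stats(profiler)
    # These attributes back the header line that print_stats() writes.
    return stats.total_calls, stats.prim_calls, stats.total_tt


def build_markup(n):
    """Stand-in workload (assumption): concatenate n small paragraphs."""
    return "".join(["<p>{}</p>".format(i) for i in range(n)])


total, primitive, seconds = profile_call(build_markup, 1000)
print("{} function calls ({} primitive calls) in {:.3f} seconds".format(total, primitive, seconds))
```

Single profiler runs this short are noisy, which is consistent with the two sets of timings above differing from each other.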

The output is the same as BeautifulSoup's except for the HTML-encoded single quotes.

I think this should be released as part of 1.0.1, and then be set as the default engine in a 2.0.0 release. The lack of dependencies and the performance boost are likely to be a net win for everyone, but it's still a breaking change, especially with the lack of parse_html in this engine (e.g. it's not possible to properly interface this with templating engines).

Edit: this needs a bit more documentation but I thought I would do this only after we agree on a plan for 1.0.1 / 2.0.0.

@thibaudcolas thibaudcolas added this to the Nice to have milestone Oct 23, 2017

@thibaudcolas thibaudcolas changed the title from [WIP] Add new string-based dependency-free DOM backing engine to Add new string-based dependency-free DOM backing engine Oct 23, 2017

su27 Oct 24, 2017

Contributor

Good idea, it's quite efficient and straightforward.

One thing, sometimes I need to insert an HTML string into the DOM structure, for example, in order to add a 3rd-party video into my article, I have to get the embedded HTML from the video website, parse it to DOM objects, and append them to the draftjs DOM structure. So, if the parse_html method is removed, I still need a way to insert raw HTML into the article.

@BertrandBordage

Good job, Thibaud :)
I pointed out a few significant performance improvements.

I’m also wondering why the class defines only static methods. I know it has to do with the way other HTML generators work, but it seems irrelevant in this case. Instead of .create_tag(), there could just be an .__init__() method, and the rest of the code would be simpler.

@thibaudcolas

Thanks for all the feedback! I've implemented the perf-related suggestions, getting about a 5-10% speedup in my quick trial.

Here is the "best of 5" time, before & after:

3000933 function calls (2899267 primitive calls) in 2.682 seconds
2908579 function calls (2806913 primitive calls) in 2.588 seconds

(those times come from a new content benchmark I'm working on that's not in the repo yet)

thibaudcolas Oct 24, 2017

Collaborator

@BertrandBordage good question about the static methods. I don't remember my reasoning then. It doesn't feel worth the refactoring because of how seldom-used this part of the exporter's API is, but in this case yep it does seem completely over-engineered.

@su27 I think it would be possible to add parse_html onto this. All that's needed would be to circumvent the escaping of the HTML string in the render method. The big caveat is that, contrary to the BeautifulSoup and lxml engines, the HTML would be treated as a dumb string without any processing at all (checking its validity, escaping the parts that need escaping, etc). Does that sound reasonable / useful?
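A minimal sketch of that idea, under stated assumptions: the `RawHTML` marker class and the `render_text` / `render_children` helpers are hypothetical names, and the merged engine may implement this differently. The point is only that "parsing" here means skipping the escaping step, with no validation of the markup.

```python
from html import escape


class RawHTML(str):
    """Hypothetical marker type: a string that is already HTML and must not be escaped."""


def parse_html(markup):
    # No parsing or validation happens: the markup is trusted as-is,
    # so well-formedness is entirely the caller's responsibility.
    return RawHTML(markup)


def render_text(value):
    """Escape plain text, but pass raw HTML through untouched."""
    return value if isinstance(value, RawHTML) else escape(value)


def render_children(children):
    return "".join([render_text(child) for child in children])


embed = parse_html('<iframe src="https://example.com/embed"></iframe>')
output = render_children(["a < b ", embed])
print(output)
```

With this approach, an unclosed tag or an unescaped ampersand inside the embedded markup ends up in the output unchanged.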

su27 Oct 26, 2017

Contributor

@thibaudcolas That sounds good to me. It may cause problems if the user doesn't process the HTML string correctly, but I think the responsibility for validation should belong to the user, not the engine. Anyone who decides to use the string-based engine should make sure the input is legal, including tag names, attribute names, and the "raw HTML".

@thibaudcolas thibaudcolas merged commit 8a9837b into master Nov 14, 2017

3 checks passed

continuous-integration/travis-ci/pr: The Travis CI build passed
continuous-integration/travis-ci/push: The Travis CI build passed
coverage/coveralls: Coverage remained the same at 100.0%

@thibaudcolas thibaudcolas deleted the feature/string-engine branch Nov 14, 2017

thibaudcolas Nov 14, 2017

Collaborator

This is now merged, with the performance improvements suggested by @BertrandBordage and the addition of parse_html (thank you for the feedback @su27), with the caveats discussed above documented in the README.

@loicteixeira and I also discussed the potential impacts of this work on memory consumption. From my testing, I'm happy to report that this engine uses up "a bit less memory" than the html5lib one. I'm still learning how to measure this meaningfully in Python, but I'll soon make a new PR with some tooling.

For those interested, here are some performance measurements. They were all made on the same benchmark of 792 ContentState objects, representative of the content of one site the size of kiwibank.co.nz (the content itself is machine-generated from public domain texts), available over at https://github.com/thibaudcolas/markov_draftjs.

Comparison of all engines

1000 runs each, all in seconds.

html5lib:

 Min.   :1.271
 1st Qu.:1.333
 Median :1.347
 Mean   :1.351
 3rd Qu.:1.361
 Max.   :1.756

lxml:

 Min.   :0.6150
 1st Qu.:0.6300
 Median :0.6350
 Mean   :0.6366
 3rd Qu.:0.6410
 Max.   :0.7960

string:

 Min.   :0.5190
 1st Qu.:0.5310
 Median :0.5360
 Mean   :0.5384
 3rd Qu.:0.5410
 Max.   :1.0250
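The six-number summaries above look like the output of R's `summary()`. For readers who want the same kind of summary from Python, here is a rough sketch. The `summarize` and `time_runs` helpers are hypothetical names, not part of the exporter or of the benchmark used here; `statistics.quantiles` with `method="inclusive"` requires Python 3.8+ and should correspond to R's default quantile definition.

```python
import statistics
import timeit


def summarize(samples):
    """Return min, quartiles, mean and max for a list of timings."""
    q1, median, q3 = statistics.quantiles(samples, n=4, method="inclusive")
    return {
        "min": min(samples),
        "q1": q1,
        "median": median,
        "mean": statistics.mean(samples),
        "q3": q3,
        "max": max(samples),
    }


def time_runs(func, runs):
    """Time `runs` independent executions of func, one call per measurement."""
    return timeit.repeat(func, number=1, repeat=runs)


def workload():
    # Stand-in for exporting the benchmark content (assumption).
    return "".join(["<p>{}</p>".format(i) for i in range(1000)])


summary = summarize(time_runs(workload, 50))
for name, value in summary.items():
    print("{:<7}: {:.4f}".format(name, value))
```

Comparing medians and quartiles rather than single runs helps, since the maximum is dominated by outliers (see the string engine's 1.0250 above).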

Then, here is how the numbers changed while I went through Bertrand's suggestions, each test run building on the previous one ("baseline" is the state of that engine 20 days ago):

Baseline:

 Min.   :0.5110
 1st Qu.:0.5250
 Median :0.5410
 Mean   :0.5947
 3rd Qu.:0.6450
 Max.   :1.0040

List comprehension join:

 Min.   :0.5240
 1st Qu.:0.5340
 Median :0.5400
 Mean   :0.5538
 3rd Qu.:0.5490
 Max.   :0.9890

Fragment check first:

 Min.   :0.5280
 1st Qu.:0.5360
 Median :0.5430
 Mean   :0.5488
 3rd Qu.:0.5540
 Max.   :0.7190

Attr definition rewrite:

 Min.   :0.5200
 1st Qu.:0.5290
 Median :0.5340
 Mean   :0.5367
 3rd Qu.:0.5390
 Max.   :0.6190

Unsorted attrs:

 Min.   :0.5190
 1st Qu.:0.5330
 Median :0.5410
 Mean   :0.5433
 3rd Qu.:0.5470
 Max.   :0.5930

Sorted attrs:

 Min.   :0.5140
 1st Qu.:0.5240
 Median :0.5270
 Mean   :0.5303
 3rd Qu.:0.5370
 Max.   :0.5820

Children definition rewrite:

 Min.   :0.5180
 1st Qu.:0.5270
 Median :0.5310
 Mean   :0.5315
 3rd Qu.:0.5340
 Max.   :0.5600

@thibaudcolas thibaudcolas referenced this pull request Nov 21, 2017

Merged

Release v1.1.0 #82
