About the performance of bs4 + html5lib #31

Closed
su27 opened this Issue Dec 9, 2016 · 13 comments

@su27
Contributor

su27 commented Dec 9, 2016

Hi Thibaud Colas,

Here I've got another problem. In my project, when I rendered a real-world note that isn't too long but has a lot of entities, it took more than 5 seconds. Of course that's unacceptable for an online service, so I tried to reduce the number of temporary wrapper elements to optimize the speed. That made it a little better (a bit over 3 seconds), but that's all I could do.

But when I tried to use lxml instead of html5lib, the rendering time decreased to less than 1 second!

WTH? Then I found someone's benchmark, which explains the huge difference (with Python 2).

And here's my simple test case with a few images and "subjects". With lxml, the rendering takes 0.17 seconds:

def Soup(raw_str):
    """
    Wrapper around BeautifulSoup to keep the code DRY.
    """
    return BeautifulSoup(raw_str, 'lxml')

         47459 function calls (46117 primitive calls) in 0.164 seconds

   Ordered by: cumulative time
   List reduced from 257 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.165    0.165 html.py:24(render)
       16    0.001    0.000    0.115    0.007 html.py:39(render_block)
      133    0.002    0.000    0.091    0.001 dom.py:35(create_element)
       38    0.000    0.000    0.091    0.002 entity_state.py:55(render_entitities)
        6    0.001    0.000    0.073    0.012 benchmark.py:173(render)
      174    0.001    0.000    0.072    0.000 dom.py:17(Soup)
      174    0.008    0.000    0.071    0.000 __init__.py:87(__init__)
      172    0.001    0.000    0.059    0.000 dom.py:28(create_tag)
        1    0.000    0.000    0.049    0.049 wrapper_state.py:111(to_string)
        1    0.000    0.000    0.049    0.049 dom.py:124(render)
      174    0.001    0.000    0.042    0.000 __init__.py:285(_feed)
      174    0.001    0.000    0.041    0.000 _lxml.py:246(feed)
 1166/515    0.002    0.000    0.037    0.000 {hasattr}
      107    0.001    0.000    0.036    0.000 element.py:1029(__getattr__)
      107    0.000    0.000    0.034    0.000 element.py:1273(find)
      107    0.001    0.000    0.034    0.000 element.py:1284(find_all)
      174    0.007    0.000    0.034    0.000 {method 'feed' of 'lxml.etree._FeedParser' objects}
      107    0.003    0.000    0.033    0.000 element.py:518(_find_all)
        2    0.000    0.000    0.028    0.014 element.py:1077(__unicode__)
    336/2    0.008    0.000    0.028    0.014 element.py:1105(decode)
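
Profiles like the ones in this thread can be produced with the standard library's cProfile module; here is a minimal sketch, where `render_note` is a hypothetical stand-in for the actual rendering call:

```python
# Minimal profiling sketch; render_note is a hypothetical stand-in
# for the real rendering call being measured.
import cProfile
import io
import pstats

def render_note():
    # Stand-in workload.
    return ''.join('<p>%d</p>' % i for i in range(100))

profiler = cProfile.Profile()
profiler.enable()
render_note()
profiler.disable()

buf = io.StringIO()
stats = pstats.Stats(profiler, stream=buf)
# Same presentation as the output above: top 20 entries by cumulative time.
stats.sort_stats('cumulative').print_stats(20)
report = buf.getvalue()
```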

But with html5lib it takes about 0.65 seconds.

         178504 function calls (177142 primitive calls) in 0.663 seconds

   Ordered by: cumulative time
   List reduced from 455 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.663    0.663 html.py:24(render)
      174    0.001    0.000    0.572    0.003 dom.py:17(Soup)
      174    0.008    0.000    0.571    0.003 __init__.py:87(__init__)
       16    0.001    0.000    0.551    0.034 html.py:39(render_block)
      174    0.001    0.000    0.543    0.003 __init__.py:285(_feed)
      174    0.003    0.000    0.542    0.003 _html5lib.py:57(feed)
      172    0.001    0.000    0.488    0.003 dom.py:28(create_tag)
      133    0.002    0.000    0.412    0.003 dom.py:35(create_element)
       38    0.000    0.000    0.382    0.010 entity_state.py:55(render_entitities)
      174    0.011    0.000    0.350    0.002 html5parser.py:55(__init__)
        6    0.001    0.000    0.256    0.043 benchmark.py:173(render)
      174    0.001    0.000    0.188    0.001 html5parser.py:225(parse)
      174    0.002    0.000    0.187    0.001 html5parser.py:81(_parse)
     5916    0.104    0.000    0.176    0.000 utils.py:49(__init__)
      174    0.007    0.000    0.145    0.001 html5parser.py:157(mainLoop)
        1    0.000    0.000    0.104    0.104 wrapper_state.py:111(to_string)
        1    0.000    0.000    0.104    0.104 dom.py:124(render)
      174    0.019    0.000    0.090    0.001 html5parser.py:874(__init__)
       16    0.000    0.000    0.089    0.006 wrapper_state.py:125(element_for)
      174    0.054    0.000    0.086    0.000 html5parser.py:422(getPhases)

So, any suggestions for optimizing? And I don't know whether html5lib is good enough for us to ignore the performance issue. What do you think? Thank you~

@thibaudcolas
Collaborator

thibaudcolas commented Dec 11, 2016

Hey @su27, performance is definitely a concern. I don't know much about performance-testing and profiling Python code, so thank you very much for the above!

For optimisations, I think the first and most significant step would be to make the parser configurable, so anyone can switch to lxml or anything else that BeautifulSoup supports. We should try to keep draftjs_exporter as agnostic to the parsing as possible. That step is very simple; we just have to make this Soup function configurable.

The second step would be to allow further changes to what is going on in dom.py. We actually don't need to use BeautifulSoup at all; it's only there because there is no documentation on using html5lib directly. The DOM API used lxml previously, and is meant as a wrapper around whichever tree builder we want to use, so it shouldn't be too complicated to change it.

For now, do you mind if I integrate your profiling code directly within the project?
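
As a rough illustration of that first step (not the real implementation; `set_parser` is a made-up name, and a trivial tuple-returning function stands in for BeautifulSoup so the snippet has no third-party dependencies):

```python
# Sketch of a configurable Soup function. In the real project this would
# call BeautifulSoup(raw_str, _parser); the stand-in below just records
# which parser would have been used.
_parser = 'html5lib'  # current default

def set_parser(name):
    """Let users switch to 'lxml' or any other parser BeautifulSoup supports."""
    global _parser
    _parser = name

def Soup(raw_str):
    # Stand-in for: return BeautifulSoup(raw_str, _parser)
    return (_parser, raw_str)

set_parser('lxml')
engine, markup = Soup('<p>hello</p>')
```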


@su27
Contributor

su27 commented Dec 12, 2016

Hi @thibaudcolas, yes, I think making the parser configurable would be simple but very useful. I have been using lxml for a few days, and found it doesn't cause much trouble as long as I choose the right HTML tags (img rather than image, for example) and restrict the attributes carefully.

Of course I don't mind if you integrate my code, glad to help.


@su27
Contributor

su27 commented Dec 23, 2016

Good morning @thibaudcolas!

Well, yesterday I tried switching back to lxml alone, instead of BS4 + lxml, and found it's 8-10 times faster! It takes just 0.017s to run the benchmark above. In fact, I knew it would be much faster, but 10 times? That's amazing. What I tested was basically your previous version of dom.py from v0.3.3.


@thibaudcolas
Collaborator

thibaudcolas commented Jan 18, 2017

Glad to know the change was simple enough for you to make! I've been thinking about this a lot, and the best solution I could come up with is the following:

  • Add an engine parameter somewhere
  • Value bs-html5lib uses the current dom.py with the BS4 parser set to html5lib.
  • Value bs-lxml uses the current dom.py with the BS4 parser set to lxml.
  • Value lxml uses a new dom-lxml.py with direct usage of lxml.
  • If a Python module is given, it will be used as the DOM.

The thing I have trouble solving is that right now, in every file, you can just do from draftjs_exporter.dom import DOM to get hold of all of the DOM methods. If we add the ability to swap the engine, you need to get a reference to the same DOM flavour in every file. The most likely solution I see would be a factory method, something like:

DOM = HTML.get_dom('bs-lxml')

An alternative would be:

from draftjs_exporter.dom import DOM_BS_HTML5LIB as DOM
from draftjs_exporter.dom import DOM_BS_LXML as DOM
from draftjs_exporter.dom import DOM_LXML as DOM
# Or implement your own DOM

# And then
exporter = HTML(config, DOM)
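
To make the proposal concrete, here is a rough sketch of what such a factory could look like (the class names and `get_dom` signature are assumptions based on the bullets above, not the real API):

```python
# Hypothetical engine registry backing an HTML.get_dom-style factory.
class DOM_BS_HTML5LIB:
    """dom.py with the BS4 parser set to html5lib."""

class DOM_BS_LXML:
    """dom.py with the BS4 parser set to lxml."""

class DOM_LXML:
    """dom-lxml.py with direct usage of lxml."""

_ENGINES = {
    'bs-html5lib': DOM_BS_HTML5LIB,
    'bs-lxml': DOM_BS_LXML,
    'lxml': DOM_LXML,
}

def get_dom(engine='bs-html5lib'):
    # A known string selects a built-in engine; anything else (e.g. a
    # user-provided module or class) is used as the DOM directly.
    if isinstance(engine, str):
        return _ENGINES[engine]
    return engine

DOM = get_dom('bs-lxml')
```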
@thibaudcolas
Collaborator

thibaudcolas commented Jan 18, 2017

Also, as an aside, I have started refactoring the exporter a bit, and was able to make it 2x faster quite easily. It's definitely not as big a change as using lxml, but I will explore this further too.


@su27
Contributor

su27 commented Jan 20, 2017

I think either the factory method or the import-as-alias way would be nice to use: simple and clear.

Well, after my previous comment I ran into some problems; it seems it's not that simple to switch to lxml... But I got them done (I hope I really did):

  • su27@7202454

    • Some DOM methods had to be changed, since the API is different.
    • XHTML standards can't be strictly supported anymore. But that's not a bad thing: in recent years the front-end world seems to have moved on to HTML4/5 rather than XHTML, and more and more people think the XML-like standard is too strict and unnecessary. I wasn't sure about this at first, so I asked some senior front-end engineers; they told me XHTML is not important now, and leaving the br and hr tags unclosed is totally fine. So that's what I did.
  • su27@26c5261

    • For the encoding problem and the illegal XML character problem.

So, it seems there is still plenty of work to do.

On the other side, glad you've refactored it! I did a little refactoring too: su27@14d866d. I don't know if it's similar to what you did.


@thibaudcolas
Collaborator

thibaudcolas commented Feb 16, 2017

I've just made changes that make the exporter perform significantly faster (±20-30x) regardless of which engine is used. Here are the numbers for the profiled test cases:

  • Before: 144519 function calls (143847 primitive calls) in 0.335 seconds
  • After: 9854 function calls (9624 primitive calls) in 0.012 seconds
  • 15x fewer function calls
  • 28x faster

Here are details of what I did and timings along the way (each label loosely matches a commit):

Start

Example: 115913 function calls (115360 primitive calls) in 0.191 seconds
Tests: 143968 function calls (143296 primitive calls) in 0.479 seconds Ran 146 tests in 1.731s

Stop creating empty wrappers

Example: 113933 function calls (113396 primitive calls) in 0.279 seconds
Tests: 137899 function calls (137299 primitive calls) in 0.339 seconds Ran 146 tests in 1.170s

Stop creating empty fragments when there are no composite decorators

Example: 97056 function calls (96599 primitive calls) in 0.174 seconds
Tests: 137899 function calls (137299 primitive calls) in 0.288 seconds Ran 146 tests in 1.042s

With decorators in the perf tests, but no decorations

Tests: 137648 function calls (137048 primitive calls) in 0.284 seconds Ran 146 tests in 1.150s

With decorators, and a decoration test

Tests: 145860 function calls (145222 primitive calls) in 0.303 seconds

Cache empty soup

Example: 24447 function calls (24007 primitive calls) in 0.292 seconds
Tests: 38984 function calls (38346 primitive calls) in 0.150 seconds Ran 147 tests in 0.711s

Stop extra parsing

Example: 8570 function calls (8342 primitive calls) in 0.010 seconds
Tests: 11647 function calls (11361 primitive calls) in 0.010 seconds Ran 147 tests in 0.050s

Stop generating unused command groups

Example: 7356 function calls (7172 primitive calls) in 0.007 seconds
Tests: 10018 function calls (9788 primitive calls) in 0.012 seconds

Remove useless command sorting

Example: 7243 function calls (7059 primitive calls) in 0.007 seconds
Tests: 9819 function calls (9589 primitive calls) in 0.011 seconds Ran 147 tests in 0.052s

Finally, I tried using lxml directly (I did not find a significant performance benefit when using lxml with BeautifulSoup, not sure why). I ran into some issues and will have to stop there for now.

Example: 2947 function calls (2927 primitive calls) in 0.003 seconds

Here is the dom.py file with lxml:

from __future__ import absolute_import, unicode_literals

import re
import inspect

from lxml import etree, html

# See http://stackoverflow.com/questions/7703018/how-to-write-namespaced-element-attributes-with-lxml
XLINK = 'http://www.w3.org/1999/xlink'


class DOM(object):
    """
    Wrapper around our HTML building library to facilitate changes.
    """
    @staticmethod
    def create_tag(type_, attributes={}):
        return etree.Element(type_, attrib=attributes)

    @staticmethod
    def create_element(type_=None, props=None, *children):
        """
        Signature inspired by React.createElement.
        createElement(
          string/ReactClass type_,
          [object props],
          [children ...]
        )
        https://facebook.github.io/react/docs/top-level-api.html#react.createelement
        """
        if props is None:
            props = {}

        if not type_:
            elt = DOM.create_document_fragment()
        else:
            attributes = {}

            # Map props from React/Draft.js to lxml lingo.
            if 'className' in props:
                props['class'] = props.pop('className')

            # One-off fix ATM, even though the problem is everywhere.
            if 'xlink:href' in props:
                props['{%s}href' % XLINK] = props.get('xlink:href')
                props.pop('xlink:href', None)

            for key in props:
                prop = props[key]
                # Filter null values and cast to string for lxml
                if prop is not None:
                    attributes[key] = str(prop)

            if inspect.isclass(type_):
                elt = type_().render(attributes)
            elif callable(getattr(type_, 'render', None)):
                elt = type_.render(attributes)
            elif callable(type_):
                elt = type_(attributes)
            else:
                elt = DOM.create_tag(type_, attributes)

        for child in children:
            if hasattr(child, 'tag'):
                DOM.append_child(elt, child)
            else:
                DOM.set_text_content(elt, DOM.get_text_content(elt) + child if DOM.get_text_content(elt) else child)

        return elt

    @staticmethod
    def create_document_fragment():
        return DOM.create_tag('fragment')

    @staticmethod
    def create_text_node(text):
        elt = DOM.create_tag('textnode')
        DOM.set_text_content(elt, text)
        return elt

    @staticmethod
    def parse_html(markup):
        return html.fromstring(markup)

    @staticmethod
    def append_child(elt, child):
        elt.append(child)

    @staticmethod
    def set_attribute(elt, attr, value):
        elt.set(attr, value)

    @staticmethod
    def get_tag_name(elt):
        return elt.tag

    @staticmethod
    def get_class_list(elt):
        return [elt.get('class')]

    @staticmethod
    def get_text_content(elt):
        return elt.text

    @staticmethod
    def set_text_content(elt, text):
        elt.text = text

    @staticmethod
    def get_children(elt):
        return elt.getchildren()

    @staticmethod
    def render(elt):
        """
        Removes the fragments that should not have HTML tags. Caveat of lxml.
        Dirty, but quite easy to understand.
        """
        return re.sub(r'</?(fragment|textnode)>', '', etree.tostring(elt, method='html').decode('utf-8'))

    @staticmethod
    def pretty_print(markup):
        """
        Convenience method.
        Pretty print the element, removing the top-level node that lxml needs.
        """
        return re.sub(r'</?doc>', '', etree.tostring(html.fromstring('<doc>%s</doc>' % markup), encoding='unicode', pretty_print=True))

@thibaudcolas thibaudcolas added this to the v1.0.0 milestone Feb 16, 2017

@su27
Contributor

su27 commented Feb 17, 2017

Awesome work! This improvement is significant.

One thing about the dom.py with lxml: I made this change (su27@26c5261) because there are two problems:

  1. When there are UTF-8 strings in the props values, lxml raises an error;
  2. When there are characters that are invalid in XML, it raises an error too. (I solved it following this answer.)

Maybe you should take care of those too.
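
For reference, stripping characters that XML 1.0 forbids can be done with a regular expression along these lines (a sketch of the general technique; the exact pattern in the linked change may differ):

```python
import re

# XML 1.0 allows: #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD]
# | [#x10000-#x10FFFF]. Everything outside those ranges is removed.
# Note the astral range must be written with \U (8 hex digits), not \u.
_INVALID_XML = re.compile(
    u'[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+'
)

def strip_invalid_xml(text):
    return _INVALID_XML.sub('', text)
```

Tabs, newlines, and astral-plane characters such as emoji are preserved; control characters like NUL are dropped.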


@thibaudcolas
Collaborator

thibaudcolas commented Apr 25, 2017

#56 is merged, so I will consider this fixed for now. @su27, I would be keen to know how this goes for you. Hopefully the "custom backing engines" feature will still allow you to tweak your own dom.py if you want to, while still benefiting from the other improvements in the lib.

Some other notes on performance:

  • At the moment lxml seems to be roughly 2x as fast as html5lib + bs4. This is now measured in every build. See #65 for the next step.
  • I have no idea what the memory footprint of this library is like. @su27 do you have experience measuring this? See #66 for the next step.
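
On the memory question, one standard-library option is tracemalloc; a minimal sketch, where `render_note` is a hypothetical stand-in for an exporter call:

```python
# Rough memory footprint measurement with the stdlib tracemalloc module.
import tracemalloc

def render_note():
    # Stand-in workload for the exporter call being measured.
    return ['<p>%d</p>' % i for i in range(1000)]

tracemalloc.start()
result = render_note()
current, peak = tracemalloc.get_traced_memory()  # bytes since start()
tracemalloc.stop()
```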
@su27
Contributor

su27 commented Apr 27, 2017

Glad to know! I tried it, and there were a few problems in adopting this version. I'll fix them and check that everything is okay in the production environment after May 3rd, because I'm going on vacation until then.


@su27
Contributor

su27 commented May 10, 2017

@thibaudcolas Done! I've modified my code to fit v1.0.0 and now it works well. I'm going to deploy it to production today. My customized engine now looks like this:

# coding: utf-8

import re
from lxml import etree
from draftjs_exporter.engines.lxml import DOM_LXML


NSMAP = {
    'xlink': 'http://www.w3.org/1999/xlink',
}

try:
    basestring          # Python 2
except NameError:       # Python 3: map the 2.x names onto str
    basestring = str
    unicode = str


def clean_str(s):
    if not isinstance(s, basestring):
        s = unicode(s)
    elif not isinstance(s, unicode):
        s = s.decode('utf8')
    # Strip characters that are invalid in XML. The astral range must be
    # written \U00010000-\U0010FFFF; \u10000-\u10FFFF matches the wrong range.
    # See http://stackoverflow.com/questions/8733233/filtering-out-certain-bytes-in-python
    return re.sub(u'[^\u0020-\uD7FF\u0009\u000A\u000D\uE000-\uFFFD\U00010000-\U0010FFFF]+', '', s)


class LXMLEngine(DOM_LXML):

    @staticmethod
    def create_tag(type_, attr=None):
        nsmap = None

        if attr:
            if 'xlink:href' in attr:
                attr['{%s}href' % NSMAP['xlink']] = attr.pop('xlink:href')
                nsmap = NSMAP
            attr = {k: clean_str(v) for k, v in attr.items()}  # items() works on Python 2 and 3

        return etree.Element(type_, attrib=attr, nsmap=nsmap)

    @staticmethod
    def append_child(elt, child):
        if hasattr(child, 'tag'):
            elt.append(child)
        else:
            c = etree.Element('fragment')
            c.text = clean_str(child)
            elt.append(c)

This is pretty straightforward. The tricky part was finding out which parts of the API have changed since v0.6 and fixing the affected entities and composite decorators.

Finally I don't need to maintain my own version of draftjs_exporter, hooray!


@thibaudcolas
Collaborator

thibaudcolas commented May 10, 2017

Cool! There have been a lot of breaking changes, so you will likely have plenty of updating to do. I've tried my best to describe the changes in the CHANGELOG, more recently completed with "How to upgrade" instructions.


@su27
Contributor

su27 commented May 11, 2017

Oh my fault, I should have read that first! I've just fixed a bug caused by the removal of DOM.create_text_node.

