New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add DOM engine configuration and official lxml support #56

Merged
merged 23 commits into from Apr 25, 2017

Conversation

Projects
None yet
4 participants
@thibaudcolas
Collaborator

thibaudcolas commented Mar 22, 2017

Allows any DOM backing engine to be used, with two built-in: html5lib (via BeautifulSoup) and lxml. lxml still is and will likely remain faster.

Importantly, it is now quite easy to create custom implementations. I hope this will make it easier to add a dependency-free implementation based on xml.etree.ElementTree or xml.etree.cElementTree, and an html5lib implementation that does not rely on BeautifulSoup.


A dent towards #31.

Right now this is just the implementation of DOM backed by lxml, but there is a bigger question of how this is meant to be integrated once it's in a working state. It's not clear to me how best to do this.

@su27

This comment has been minimized.

Show comment
Hide comment
@su27

su27 Mar 22, 2017

Contributor

This problem is not clear to me either, so I'm just using lxml-only dom.py, and waiting for your configurable solution. 😆

Contributor

su27 commented Mar 22, 2017

This problem is not clear to me either, so I'm just using lxml-only dom.py, and waiting for your configurable solution. 😆

@thibaudcolas

This comment has been minimized.

Show comment
Hide comment
@thibaudcolas

thibaudcolas Mar 26, 2017

Collaborator

@su27 those latest changes make lxml and BS4 + html5lib 99% compatible. There are two things that are still issues:

# The output format is slightly different

# BS4 + html5lib
- <img src="/example.png"/>
# lxml
+ <img src="/example.png">

and lxml does not want to parse invalid HTML attributes like *ngFor.

I feel like no matter what I do there will always be small differences in their behaviour, so I might just say "here are the differences we are aware of" and stop there.

Collaborator

thibaudcolas commented Mar 26, 2017

@su27 those latest changes make lxml and BS4 + html5lib 99% compatible. There are two things that are still issues:

# The output format is slightly different

# BS4 + html5lib
- <img src="/example.png"/>
# lxml
+ <img src="/example.png">

and lxml does not want to parse invalid HTML attributes like *ngFor.

I feel like no matter what I do there will always be small differences in their behaviour, so I might just say "here are the differences we are aware of" and stop there.

@su27

This comment has been minimized.

Show comment
Hide comment
@su27

su27 Mar 27, 2017

Contributor

@thibaudcolas You're right, they behave in slightly different ways which the wrapper can't exactly control, and both ways are correct. It's reasonable to ignore the difference.

Contributor

su27 commented Mar 27, 2017

@thibaudcolas You're right, they behave in slightly different ways which the wrapper can't exactly control, and both ways are correct. It's reasonable to ignore the difference.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Apr 18, 2017

Coverage Status

Coverage remained the same at 100.0% when pulling 77a1925 on feature/lxml-compat into 41090f3 on master.

coveralls commented Apr 18, 2017

Coverage Status

Coverage remained the same at 100.0% when pulling 77a1925 on feature/lxml-compat into 41090f3 on master.

@thibaudcolas thibaudcolas changed the title from [WIP] lxml compatibility to Add DOM engine configuration and official lxml support Apr 18, 2017

@thibaudcolas thibaudcolas requested a review from loicteixeira Apr 18, 2017

@thibaudcolas

This comment has been minimized.

Show comment
Hide comment
@thibaudcolas

thibaudcolas Apr 18, 2017

Collaborator

Now allows any DOM backing engine to be used, with two built-in: html5lib (via BeautifulSoup) and lxml. lxml still is and will likely remain faster.

Importantly, it is now quite easy to create custom implementations. I hope this will make it easier to add a dependency-free implementation based on xml.etree.ElementTree or xml.etree.cElementTree, and an html5lib implementation that does not rely on BeautifulSoup.

NB: for reference, here is my tentative implementation of an engine using ElementTree. For some reason it outputs wrappers twice, I would suspect a bug in wrapper_state that this particular implementation surfaces.

class DOM_ETREE(DOMEngine):
    """
    lxml implementation of the DOM API.
    """
    @staticmethod
    def create_tag(type_, attr=None):
        if not attr:
            attr = {}

        return etree.Element(type_, attrib=attr)

    @staticmethod
    def parse_html(markup):
        pass

    @staticmethod
    def append_child(elt, child):
        if hasattr(child, 'tag'):
            elt.append(child)
        else:
            c = etree.Element('fragment')
            c.text = child
            elt.append(c)

    @staticmethod
    def render(elt):
        return re.sub(r'(</?fragment>)', '', etree.tostring(elt, method='html'))

    @staticmethod
    def render_debug(elt):
        return etree.tostring(elt, method='html')
Collaborator

thibaudcolas commented Apr 18, 2017

Now allows any DOM backing engine to be used, with two built-in: html5lib (via BeautifulSoup) and lxml. lxml still is and will likely remain faster.

Importantly, it is now quite easy to create custom implementations. I hope this will make it easier to add a dependency-free implementation based on xml.etree.ElementTree or xml.etree.cElementTree, and an html5lib implementation that does not rely on BeautifulSoup.

NB: for reference, here is my tentative implementation of an engine using ElementTree. For some reason it outputs wrappers twice, I would suspect a bug in wrapper_state that this particular implementation surfaces.

class DOM_ETREE(DOMEngine):
    """
    lxml implementation of the DOM API.
    """
    @staticmethod
    def create_tag(type_, attr=None):
        if not attr:
            attr = {}

        return etree.Element(type_, attrib=attr)

    @staticmethod
    def parse_html(markup):
        pass

    @staticmethod
    def append_child(elt, child):
        if hasattr(child, 'tag'):
            elt.append(child)
        else:
            c = etree.Element('fragment')
            c.text = child
            elt.append(c)

    @staticmethod
    def render(elt):
        return re.sub(r'(</?fragment>)', '', etree.tostring(elt, method='html'))

    @staticmethod
    def render_debug(elt):
        return etree.tostring(elt, method='html')
@loicteixeira

I've focused on reviewing the swappable engine part, not the rendering logic (stuff like self.stack.stack[-1].last_child = elt). It would be nice to do the latter at some point but I would need much more time to really dig into it and understand the whole rendering. Not sure I'll be able to do this before 1.0 😞

Edit: I should mention that this is some great work overall! The swappable engines and the previous PR regarding fallback components are very exciting features 😀

"""
Choose which DOM implementation to use.
"""
if engine:

This comment has been minimized.

@loicteixeira

loicteixeira Apr 21, 2017

Collaborator

What should happen if there are no engine?
I understand it has a default of DOM_HTML5LIB so it just have no effect, but it sounds a bit silly to call use with no engine.

@loicteixeira

loicteixeira Apr 21, 2017

Collaborator

What should happen if there are no engine?
I understand it has a default of DOM_HTML5LIB so it just have no effect, but it sounds a bit silly to call use with no engine.

This comment has been minimized.

@loicteixeira

loicteixeira Apr 21, 2017

Collaborator

Ah I see that in HTML it does DOM.use(config.get('engine')) which would potentially return None.

Given that the default engine has some dependencies which do not come with a basic installation of the package, I think there shouldn't be a default engine and if none is provided, it should raise a ConfigException. This way the error would come in two stages, no engine selected and then missing dependencies; rather than a unique error which could be confusing if say, you install the dependencies for lxml but forget to declare the engine and you would get some BeautifulSoup errors..

@loicteixeira

loicteixeira Apr 21, 2017

Collaborator

Ah I see that in HTML it does DOM.use(config.get('engine')) which would potentially return None.

Given that the default engine has some dependencies which do not come with a basic installation of the package, I think there shouldn't be a default engine and if none is provided, it should raise a ConfigException. This way the error would come in two stages, no engine selected and then missing dependencies; rather than a unique error which could be confusing if say, you install the dependencies for lxml but forget to declare the engine and you would get some BeautifulSoup errors..

This comment has been minimized.

@thibaudcolas

thibaudcolas Apr 25, 2017

Collaborator

I have updated the documentation and install procedure so the default is html5lib, and there isn't any mention of this (and of the engine config key, and of DOM.use) until people want to switch to lxml.

Let's see what happens with #62 next.

@thibaudcolas

thibaudcolas Apr 25, 2017

Collaborator

I have updated the documentation and install procedure so the default is html5lib, and there isn't any mention of this (and of the engine config key, and of DOM.use) until people want to switch to lxml.

Let's see what happens with #62 next.

Show outdated Hide outdated draftjs_exporter/dom.py
Show outdated Hide outdated draftjs_exporter/dom_engine.py
Show outdated Hide outdated draftjs_exporter/dom_engine.py
Show outdated Hide outdated README.rst
@thibaudcolas

This comment has been minimized.

Show comment
Hide comment
@thibaudcolas

thibaudcolas Apr 24, 2017

Collaborator

Thanks for the great review @loicteixeira! I'm learning a lot :shityeah:

Collaborator

thibaudcolas commented Apr 24, 2017

Thanks for the great review @loicteixeira! I'm learning a lot :shityeah:

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Apr 25, 2017

Coverage Status

Coverage remained the same at 100.0% when pulling 32a149e on feature/lxml-compat into 6f4446a on master.

coveralls commented Apr 25, 2017

Coverage Status

Coverage remained the same at 100.0% when pulling 32a149e on feature/lxml-compat into 6f4446a on master.

All valid comments but will have to wait for future work.

@coveralls

This comment has been minimized.

Show comment
Hide comment
@coveralls

coveralls Apr 25, 2017

Coverage Status

Coverage remained the same at 100.0% when pulling 860ff7f on feature/lxml-compat into 6f4446a on master.

coveralls commented Apr 25, 2017

Coverage Status

Coverage remained the same at 100.0% when pulling 860ff7f on feature/lxml-compat into 6f4446a on master.

@thibaudcolas thibaudcolas merged commit 85c1f68 into master Apr 25, 2017

3 checks passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
continuous-integration/travis-ci/push The Travis CI build passed
Details
coverage/coveralls Coverage remained the same at 100.0%
Details

@thibaudcolas thibaudcolas deleted the feature/lxml-compat branch Apr 25, 2017

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment