Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LXML for html_to_vdom #795

Merged
merged 54 commits into from Aug 14, 2022
Merged

Conversation

Archmonger
Copy link
Contributor

@Archmonger Archmonger commented Aug 4, 2022

fix #777

Technical Details

  • Generate a DOM tree using lxml
  • Recursively converts lxml tree to VDOM
  • HTML is encapsulated within a React fragment if needed
  • Style tags are handled separately
  • Utilize idom.html constructors for VDOM when possible
  • Unneeded VDOM fields are pruned after user transformations are applied
  • Uses list comprehension where possible for maximum performance

Checklist

Please update this checklist as you complete each item:

  • Tests have been included for all bug fixes or added functionality.
  • The changelog.rst has been updated with any significant changes.
  • GitHub Issues which may be closed by this Pull Request have been linked.

@Archmonger
Copy link
Contributor Author

Archmonger commented Aug 4, 2022

@rmorshea

This seems to work great based on my informal testing on Django-IDOM, but I'm still searching for any edge cases that may exist.

  1. Off the top of your head, are there any VDOM related oddities (such as how Scripts should automatically get keys) that I should look out for?
  2. On that note, is it still safe to use the index as the primary key for html_to_vdom components?

I will fix the tests in IDOM core soon.


Since script tags on the page might toy around with the DOM, we might want to consider being more lax in the future about missing/modified DOM elements. Currently, it can result in a client side exception if a DOM element goes poof, then IDOM attempts to modify it.

This might bring up a valid enough reason to consider using morphdom instead of react-dom.

  1. MorphDOM does not use virtual DOM abstractions, thus won't suffer from mismatch issues
  2. This whole topic of morphdom might relate to the questions as to whether we even need to do JSON patches anymore.
  3. Not doing JSON patches would effectively be identical in performance to AJAX webpages.
  4. morphdom usage is semi-related to IDOM's re-rendering behavior on websocket reconnect, as it would make rehydration easier.

@Archmonger
Copy link
Contributor Author

Seems like our Flask support is broken on Python 3.10 according to the tests.

src/idom/utils.py Outdated Show resolved Hide resolved
@Archmonger
Copy link
Contributor Author

Archmonger commented Aug 5, 2022

Questions that need to be answered:

  1. Should we add a strict boolean parameter to html_to_vdom for our HTML parser?
  2. Should we automatically set some sort of key-value on the root node to help transformations identify the root easier? Maybe node["root"] = True?
  3. Are there any other built-in VDOM mutations that need to take place, similar to style attribute formatting or script tag keys?
  4. Is it expected behavior that script nodes don't truly exist on the page, and are replaced with an empty div?
  5. Is safe to use the index as the primary key for html_to_vdom components, or should we automatically default to a uuid?
  6. Should I create a discussion thread to talk about morphdom vs react-dom?
  7. Should we handle the Python 3.x and docs build test failures in this PR, or a separate PR?

src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
@rmorshea
Copy link
Collaborator

rmorshea commented Aug 6, 2022

  1. I'm not really sure what "strict" implies. Is that specific to the lxml parsing backend? If so, then probably not.
  2. If there's a root node in the document then just use that. If there's more than one root node I think we can just make a tagName="" node since they're not displayed anyway.
  3. If there were a sane way to camelCase other HTML attributes we could do that. I can't think of a way though since they aren't nicely dash-separated like CSS. Regardless, I think the style mutation should be implemented as a transform function that we apply by default if no others are given. While the React client expects camelCase, that might not be true of all clients.
  4. The script sanitization was from before we had a script element.
  5. Key's should not be the responsibility of this method. Leaving the key absent (implies key is index) is safe unless the HTML was generated from a template and might change, in which case the template, or the thing rendering it is responsible for managing keys.
  6. Sure. I don't recall talking about morphdom. Could you include some background on that?
  7. Let's resolve that in a separate PR

@Archmonger
Copy link
Contributor Author

Archmonger commented Aug 6, 2022

  1. Strict implies that malformatted HTML will generate an exception. The default behavior of lxml is to loosely parse broken HTML, which may result in malformatted tags, such as a <div> without an end </div>, being interpreted as raw text.
    • I committed defaulting "recovering the HTML" to False, since I think it's best the user be aware of syntactically broken HTML.
  2. N/A
  3. I disagree that it should be done via the transforms iterable. This is likely to result in a lot of user headache if they attempt to use our transforms without realizing they'd immediately lose the style transforms. Even if we document it, most people don't RTM.
    • If you want this to be customizable, I'd recommend we use some other customization variable
    • This behavior isn't really customizable anywhere else within IDOM though, so I don't know why we would do it here
    • In my opinion, at this point in IDOM's lifecycle I don't think it's reasonable to expect anyone to use any other front-end besides React, given our lack of documentation & official support around this.
  4. Is this a bug that needs to be fixed then?
  5. N/A
  6. See the comment above
  7. N/A

@Archmonger
Copy link
Contributor Author

Archmonger commented Aug 8, 2022

_html_to_vdom was renamed to etree_to_vdom after moving out the str handling from it. The function technically can be documented, but I don't know if we want to. If we do, we're indirectly telling people it's normal to create DOM trees in lxml, which doesn't feel right.

I will keep it as an undocumented utility unless you feel otherwise.

@rmorshea
Copy link
Collaborator

rmorshea commented Aug 8, 2022

It seems like it should be private then. Can you rename it to _etree_to_vdom?

@rmorshea
Copy link
Collaborator

rmorshea commented Aug 8, 2022

  1. I think we can probably leave out the recover flag you've added and parse the HTML strictly. If someone needs it to be loosely parsed we can add in the flag later.
  2. Seems reasonable.
  3. Not a bug, just an old design decision. Now that IDOM's client supports script elements we can stop stripping them.
  4. I think we'd write an entirely new client if we were to use morphdom. Very little of the React-based client code could be re-used. In order to justify spending time on that we'd need to do a bunch of benchmarking.

src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
@Archmonger Archmonger marked this pull request as ready for review August 10, 2022 02:20
@rmorshea
Copy link
Collaborator

There's a problem with the editable install on Python 3.10.6 that should be fixed in my PR.

src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
@Archmonger
Copy link
Contributor Author

@rmorshea force pushing on a remote fork typically isn't a good idea, since it causes the owner of the fork to have to delete and repull his local branches.

GitHub automatically hides all commits that are within the parent branch of a PR (idom-team:main), so would've been better to do a simple merge here.

Archmonger and others added 2 commits August 13, 2022 14:31
Co-authored-by: Ryan Morshead <ryan.morshead@gmail.com>
src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
@rmorshea
Copy link
Collaborator

If you want to add the strict flag back in, feel free. If not, I think we could merge as-is.

src/idom/utils.py Outdated Show resolved Hide resolved
src/idom/utils.py Outdated Show resolved Hide resolved
Archmonger and others added 2 commits August 13, 2022 20:12
Co-authored-by: Ryan Morshead <ryan.morshead@gmail.com>
@rmorshea
Copy link
Collaborator

rmorshea commented Aug 14, 2022

You should have write access now. Let me know if you're unable to merge.

@rmorshea
Copy link
Collaborator

Will cut a release once this gets in.

@Archmonger Archmonger merged commit b06977a into reactive-python:main Aug 14, 2022
@Archmonger Archmonger deleted the fix-html-to-vdom branch August 14, 2022 03:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

html_to_vdom breaks on whitespaces and null HTML tags
2 participants