Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML Parser Rewrite #283

Merged
merged 1 commit into from Nov 22, 2022
Merged

HTML Parser Rewrite #283

merged 1 commit into from Nov 22, 2022

Conversation

tbranyen
Copy link
Owner

@tbranyen tbranyen commented Oct 30, 2022

Status: In Progress

Goals:

The primary motivations of the rewrite are to fix known compatibility bugs, support comments completely (instead of skipping them), improve support for middleware, and greatly improve memory usage and performance. The way the new parser works is by breaking down the HTML using global regexes that have their lastIndex reset on each loop. No copies or references of the input HTML are made. The VTree hierarchy is built up as the markup is crawled. This iteration of the parser is to help ship 1.0. Using regex's like this is prone to errors and not ideal to parse real HTML, but using any known alternatives will incur significant filesize impact, not work in Node, or be overkill for what is needed, which is basically HTML/XML/SVG -> JSON.

Tasks:

  • Extract each regex into a separate module and perform isolated tests on each using partial fragments of markup
  • Remove/deprecate strict mode
  • Get all existing unit tests passing

Future tasks:

  • Implement any changes needed to fully support the HTMLHint plugin
  • Code cleanup
  • Add more tests
  • Write documentation

@tbranyen tbranyen marked this pull request as draft October 30, 2022 03:37
@tbranyen
Copy link
Owner Author

I've been cracking away at this locally and now have nearly all unit tests passing. Since I've migrated a lot of the logic out of the HTML parser into the tagged template, the responsibilties between the two are more defined and divided. The parser simply handles flat strings (so wasm works great with this), and then pieces together the dynamic parts in the tagged template layer.

@tbranyen
Copy link
Owner Author

tbranyen commented Nov 18, 2022

Did some performance testing and almost had a heart attack. On my Core i9, the parser was taking 40seconds on a 22MB HTML file. The old parser does this is in ~80ms. I used Chrome and x0 profilers and got nowhere. It only said the parse function was slow, nothing internal. Didn't surface any of the regexes. So I did my own instrumentation and found this:

tim in ~/git/diffhtml/packages/diffhtml-rust-parser on fix-parser-components (home) node test
took 76921 iterations in 39.68s
regexes took {
  comment: '39.524s',
  open: '0.031s',
  attribute: '0.028s',
  close: '0.004s',
  text: '0.001s'
}

Phew! Just looks like the Comment regex I wrote is awful. Will fix this and the future runs will hopefully not be in seconds lol

@tbranyen
Copy link
Owner Author

tbranyen commented Nov 19, 2022

Figured out the problem, was due to the fact that there were no comments in the markup so it was doing unnecessary scanning every loop. Easy fix by tracking matches, and if none are found, never checking again. This makes it go down to 50-60ms. Huge improvement!

Rust/WASM took 161 ms
JS took 58 ms

@CetinSert
Copy link
Contributor

@tbranyen – Following development on this repo with great excitement every day!
Why do we need regexes to parse HTML here? Is it to introduce a degree of permissiveness?

I have been using https://deno.land/x/deno_dom with great success in Deno Deploy.
Leaving it here in case you have not looked into it.

@tbranyen
Copy link
Owner Author

tbranyen commented Nov 19, 2022

@CetinSert diffHTML core is currently dependency free and works in all ESM environments and is sub 10kb min+gzip. I suspect deno dom would add a considerable overhead to the bundle size and make it less portable. It uses https://github.com/servo/html5ever under-the-hood. Maybe we could make this an option in diffhtml-rust-parser? Allowing you to pick which backend to use (tl or html5ever).

I'll make sure this works before the next publish:

import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';
import { html, Internals } from 'https://diffhtml.org/core';

Internals.parse = input => new DOMParser().parseFromString(input, 'text/html');

console.log(html`
  <div>From Deno DOM</div>
`);

Edit: unfortunately doesn't seem like it will be possible until that module supports text/xml type.

@tbranyen
Copy link
Owner Author

Few more gotchas, the DOMParser implementation for text/html is to return a full HTML document. Including HTML, HEAD and BODY tags.

You can see this here:
image

While text/xml produces a fragment:
image

Unfortunately in that deno module we get:
image

Which means it's not really suitable for this use case, since we parse markup at every level. For instance how would we know the difference between the two of these:

<head></head>
<marquee>Some markup</marquee>

and:

<marquee></marquee>

If you wrote something like:

const fragment = html`
  <head></head>
  <marquee>Some markup</marquee>
`;

How would we know what to return?

@tbranyen tbranyen marked this pull request as ready for review November 19, 2022 23:30
@tbranyen tbranyen force-pushed the fix-parser-components branch 3 times, most recently from a6c690d to 6eb141f Compare November 20, 2022 01:15
@CetinSert
Copy link
Contributor

CetinSert commented Nov 20, 2022

@tbranyen – I use partial results too actually. Like this:

import { Document } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";

const document = new Document();

const parse = (h: string): HTMLTemplateElement => {
  const  t = document.createElement('template');
  /**/   t.innerHTML = h;
  return t; //.content.constructor === DocumentFragment
};

//                                                               _______ : DocumentFragment
console.warn(parse('<script type=worker route="/sse"></script>').content.querySelector('script[type=worker]').getAttribute('route'));

DOMParser.parseFromString() would leak memory in some browsers, so I came up with the above workaround.
It works like a charm in browsers and in Deno.

@CetinSert
Copy link
Contributor

CetinSert commented Nov 20, 2022

@tbranyen – also text/xml follows an entirely different spec.

As an example, try this in browsers:

const parse = h => {
  const  t = document.createElement('template');
  /**/   t.innerHTML = h;
  return t;
};

parse('<br>').content.children[0].constructor // ✔️ HTMLBRElement
parse('<br>').content.children[0].outerHTML   // ✔️ <br>

deno-dom-wasm (v0.1.36-alpha) has parse('<br>').content.children[0].constructor // Element but outerHTML works the same.

vs

const domParser = new DOMParser();
domParser.parseFromString('<br>', 'text/xml').children[0].constructor // ❌ Element
domParser.parseFromString('<br>', 'text/xml').children[0].outerHTML   // ❌ (see below)
<br><parsererror xmlns="http://www.w3.org/1999/xhtml" style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black"><h3>This page contains the following errors:</h3><div style="font-family:monospace;font-size:12px">error on line 1 at column 5: Extra content at the end of the document\n</div><h3>Below is a rendering of the page up to the first error.</h3></parsererror></br>

@CetinSert
Copy link
Contributor

CetinSert commented Nov 20, 2022

Maybe we could make this an option in diffhtml-rust-parser? Allowing you to pick which backend to use (tl or html5ever).

I was only surprised about the need to use regexes. I am not necessarily suggesting/asking for an option to use deno-dom-wasm actually.

@tbranyen
Copy link
Owner Author

tbranyen commented Nov 20, 2022

@CetinSert Awesome,

Maybe we could make this an option in diffhtml-rust-parser? Allowing you to pick which backend to use (tl or html5ever).

I was only surprised about the need to use regexes. I am not necessarily suggesting/asking for an option to use deno-dom-wasm actually.

Maybe there is a better way to handle parsing, it's worth thinking about. diffHTML needs a sensible parser default since it is critical to the framework. It should be fast, require minimal memory overhead, work everywhere, and is nice if it converts directly to the internal VDOM structure instead of an intermediary (HTMLElement/Node). I was inspired by and forked node fast-html-parser, since it was small and fast, but had too many issues as we see in the issue tracker here. That's why I'm rewriting using similar principles, but implemented better. SAX, DOMParser, and others I looked at years ago had too many cons around portability, GC, and performance to be a sensible default. I'd like to see a solution that works as well as regex-based under 60fps requestAnimationFrame without introducing GC pausing.

Now that Internals.parse is a public mutable API, what core uses should be less relevant if you have a particular parser you like. Maybe you don't care about performance for your use case, and you're always using Deno as the runtime so you don't need the portability. This should be a good compromise.

I do find it amusing I'm using regex given https://stackoverflow.com/a/1732454/282175, but IMO the response is overblown and specific to parsing with a single regex, not multiple. I would pay someone a lot of money if they were able to write an alternative to regexes that runs everywhere, is as spec-compliant or more, the same file size or less, operate at the same performance or faster, and same memory consumption or less. I'm not sure if it can be done, and if it can it's worth being paid for ha!


@tbranyen – I use partial results too actually. Like this:

import { Document } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";

const document = new Document();

const parse = (h: string): HTMLTemplateElement => {
  const  t = document.createElement('template');
  /**/   t.innerHTML = h;
  return t; //.content.constructor : DocumentFragment
};

//                                                               _______ : DocumentFragment
console.warn(parse('<script type=worker route="/sse"></script>').content.querySelector('script[type=worker]').getAttribute('route'));

DOMParser.parseFromString() would leak memory in some browsers, so I came up with the above workaround. It works like a charm in browsers and in Deno.

Interesting idea! One issue is that this does not seem to work correctly with full HTML documents:

image

Deprecations:

- Removes strict mode.

New features:

- HTML Comments support
- Extremely fast
- Lots of bug fixes
- Smaller footprint
- Fully supports any HTML parser
@tbranyen
Copy link
Owner Author

To avoid this PR getting messier, I'm going to merge as-is, and continue the remaining tasks in subsequent pulls.

@tbranyen tbranyen merged commit 415a6d1 into master Nov 22, 2022
@tbranyen tbranyen deleted the fix-parser-components branch November 28, 2022 15:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants