New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HTML Parser Rewrite #283
HTML Parser Rewrite #283
Conversation
c00d833
to
f52f2ac
Compare
I've been cracking away at this locally and now have nearly all unit tests passing. Since I've migrated a lot of the logic out of the HTML parser into the tagged template, the responsibilties between the two are more defined and divided. The parser simply handles flat strings (so wasm works great with this), and then pieces together the dynamic parts in the tagged template layer. |
632f867
to
95f574d
Compare
Did some performance testing and almost had a heart attack. On my Core i9, the parser was taking 40seconds on a 22MB HTML file. The old parser does this is in ~80ms. I used Chrome and x0 profilers and got nowhere. It only said the parse function was slow, nothing internal. Didn't surface any of the regexes. So I did my own instrumentation and found this:
Phew! Just looks like the Comment regex I wrote is awful. Will fix this and the future runs will hopefully not be in seconds lol |
Figured out the problem, was due to the fact that there were no comments in the markup so it was doing unnecessary scanning every loop. Easy fix by tracking matches, and if none are found, never checking again. This makes it go down to 50-60ms. Huge improvement!
|
@tbranyen – Following development on this repo with great excitement every day! I have been using https://deno.land/x/deno_dom with great success in Deno Deploy. |
@CetinSert diffHTML core is currently dependency free and works in all ESM environments and is sub 10kb min+gzip. I suspect deno dom would add a considerable overhead to the bundle size and make it less portable. It uses https://github.com/servo/html5ever under-the-hood. Maybe we could make this an option in diffhtml-rust-parser? Allowing you to pick which backend to use (tl or html5ever).
import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';
import { html, Internals } from 'https://diffhtml.org/core';
Internals.parse = input => new DOMParser().parseFromString(input, 'text/html');
console.log(html`
<div>From Deno DOM</div>
`); Edit: unfortunately doesn't seem like it will be possible until that module supports |
a6c690d
to
6eb141f
Compare
@tbranyen – I use partial results too actually. Like this: import { Document } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";
const document = new Document();
const parse = (h: string): HTMLTemplateElement => {
const t = document.createElement('template');
/**/ t.innerHTML = h;
return t; //.content.constructor === DocumentFragment
};
// _______ : DocumentFragment
console.warn(parse('<script type=worker route="/sse"></script>').content.querySelector('script[type=worker]').getAttribute('route'));
|
@tbranyen – also As an example, try this in browsers: const parse = h => {
const t = document.createElement('template');
/**/ t.innerHTML = h;
return t;
};
parse('<br>').content.children[0].constructor // ✔️ HTMLBRElement
parse('<br>').content.children[0].outerHTML // ✔️ <br>
vs const domParser = new DOMParser();
domParser.parseFromString('<br>', 'text/xml').children[0].constructor // ❌ Element
domParser.parseFromString('<br>', 'text/xml').children[0].outerHTML // ❌ (see below) <br><parsererror xmlns="http://www.w3.org/1999/xhtml" style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black"><h3>This page contains the following errors:</h3><div style="font-family:monospace;font-size:12px">error on line 1 at column 5: Extra content at the end of the document\n</div><h3>Below is a rendering of the page up to the first error.</h3></parsererror></br> |
I was only surprised about the need to use regexes. I am not necessarily suggesting/asking for an option to use |
@CetinSert Awesome,
Maybe there is a better way to handle parsing, it's worth thinking about. diffHTML needs a sensible parser default since it is critical to the framework. It should be fast, require minimal memory overhead, work everywhere, and is nice if it converts directly to the internal VDOM structure instead of an intermediary (HTMLElement/Node). I was inspired by and forked node fast-html-parser, since it was small and fast, but had too many issues as we see in the issue tracker here. That's why I'm rewriting using similar principles, but implemented better. SAX, DOMParser, and others I looked at years ago had too many cons around portability, GC, and performance to be a sensible default. I'd like to see a solution that works as well as regex-based under 60fps requestAnimationFrame without introducing GC pausing. Now that I do find it amusing I'm using regex given https://stackoverflow.com/a/1732454/282175, but IMO the response is overblown and specific to parsing with a single regex, not multiple. I would pay someone a lot of money if they were able to write an alternative to regexes that runs everywhere, is as spec-compliant or more, the same file size or less, operate at the same performance or faster, and same memory consumption or less. I'm not sure if it can be done, and if it can it's worth being paid for ha!
Interesting idea! One issue is that this does not seem to work correctly with full HTML documents: |
Deprecations: - Removes strict mode. New features: - HTML Comments support - Extremely fast - Lots of bug fixes - Smaller footprint - Fully supports any HTML parser
76143e3
to
f984611
Compare
To avoid this PR getting messier, I'm going to merge as-is, and continue the remaining tasks in subsequent pulls. |
Status: In Progress
Goals:
The primary motivations of the rewrite are to fix known compatibility bugs, support comments completely (instead of skipping them), improve support for middleware, and greatly improve memory usage and performance. The way the new parser works is by breaking down the HTML using global regexes that have their lastIndex reset on each loop. No copies or references of the input HTML are made. The VTree hierarchy is built up as the markup is crawled. This iteration of the parser is to help ship 1.0. Using regex's like this is prone to errors and not ideal to parse real HTML, but using any known alternatives will incur significant filesize impact, not work in Node, or be overkill for what is needed, which is basically HTML/XML/SVG -> JSON.
Tasks:
Future tasks: