HTML Parser Rewrite #283

tbranyen · 2022-10-30T03:37:07Z

Status: In Progress

Goals:

The primary motivations of the rewrite are to fix known compatibility bugs, support comments completely (instead of skipping them), improve support for middleware, and greatly improve memory usage and performance. The way the new parser works is by breaking down the HTML using global regexes that have their lastIndex reset on each loop. No copies or references of the input HTML are made. The VTree hierarchy is built up as the markup is crawled. This iteration of the parser is to help ship 1.0. Using regex's like this is prone to errors and not ideal to parse real HTML, but using any known alternatives will incur significant filesize impact, not work in Node, or be overkill for what is needed, which is basically HTML/XML/SVG -> JSON.

Tasks:

Extract each regex into a separate module and perform isolated tests on each using partial fragments of markup
Remove/deprecate strict mode
Get all existing unit tests passing

Future tasks:

Implement any changes needed to fully support the HTMLHint plugin
Code cleanup
Add more tests
Write documentation

packages/diffhtml/lib/util/parse.js

tbranyen · 2022-11-15T22:17:18Z

I've been cracking away at this locally and now have nearly all unit tests passing. Since I've migrated a lot of the logic out of the HTML parser into the tagged template, the responsibilties between the two are more defined and divided. The parser simply handles flat strings (so wasm works great with this), and then pieces together the dynamic parts in the tagged template layer.

tbranyen · 2022-11-18T20:37:42Z

Did some performance testing and almost had a heart attack. On my Core i9, the parser was taking 40seconds on a 22MB HTML file. The old parser does this is in ~80ms. I used Chrome and x0 profilers and got nowhere. It only said the parse function was slow, nothing internal. Didn't surface any of the regexes. So I did my own instrumentation and found this:

tim in ~/git/diffhtml/packages/diffhtml-rust-parser on fix-parser-components (home) node test
took 76921 iterations in 39.68s
regexes took {
  comment: '39.524s',
  open: '0.031s',
  attribute: '0.028s',
  close: '0.004s',
  text: '0.001s'
}

Phew! Just looks like the Comment regex I wrote is awful. Will fix this and the future runs will hopefully not be in seconds lol

tbranyen · 2022-11-19T01:18:35Z

Figured out the problem, was due to the fact that there were no comments in the markup so it was doing unnecessary scanning every loop. Easy fix by tracking matches, and if none are found, never checking again. This makes it go down to 50-60ms. Huge improvement!

Rust/WASM took 161 ms
JS took 58 ms

CetinSert · 2022-11-19T11:37:47Z

@tbranyen – Following development on this repo with great excitement every day!
Why do we need regexes to parse HTML here? Is it to introduce a degree of permissiveness?

I have been using https://deno.land/x/deno_dom with great success in Deno Deploy.
Leaving it here in case you have not looked into it.

tbranyen · 2022-11-19T15:38:51Z

@CetinSert diffHTML core is currently dependency free and works in all ESM environments and is sub 10kb min+gzip. I suspect deno dom would add a considerable overhead to the bundle size and make it less portable. It uses https://github.com/servo/html5ever under-the-hood. Maybe we could make this an option in diffhtml-rust-parser? Allowing you to pick which backend to use (tl or html5ever).

~~I'll make sure this works before the next publish:~~

import { DOMParser } from 'https://deno.land/x/deno_dom/deno-dom-wasm.ts';
import { html, Internals } from 'https://diffhtml.org/core';

Internals.parse = input => new DOMParser().parseFromString(input, 'text/html');

console.log(html`
  <div>From Deno DOM</div>
`);

Edit: unfortunately doesn't seem like it will be possible until that module supports text/xml type.

tbranyen · 2022-11-19T18:13:40Z

Few more gotchas, the DOMParser implementation for text/html is to return a full HTML document. Including HTML, HEAD and BODY tags.

You can see this here:

While text/xml produces a fragment:

Unfortunately in that deno module we get:

Which means it's not really suitable for this use case, since we parse markup at every level. For instance how would we know the difference between the two of these:

<head></head>
<marquee>Some markup</marquee>

and:

<marquee></marquee>

If you wrote something like:

const fragment = html`
  <head></head>
  <marquee>Some markup</marquee>
`;

How would we know what to return?

CetinSert · 2022-11-20T13:19:54Z

@tbranyen – I use partial results too actually. Like this:

import { Document } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";

const document = new Document();

const parse = (h: string): HTMLTemplateElement => {
  const  t = document.createElement('template');
  /**/   t.innerHTML = h;
  return t; //.content.constructor === DocumentFragment
};

//                                                               _______ : DocumentFragment
console.warn(parse('<script type=worker route="/sse"></script>').content.querySelector('script[type=worker]').getAttribute('route'));

DOMParser.parseFromString() would leak memory in some browsers, so I came up with the above workaround.
It works like a charm in browsers and in Deno.

CetinSert · 2022-11-20T13:42:35Z

@tbranyen – also text/xml follows an entirely different spec.

As an example, try this in browsers:

const parse = h => {
  const  t = document.createElement('template');
  /**/   t.innerHTML = h;
  return t;
};

parse('<br>').content.children[0].constructor // ✔️ HTMLBRElement
parse('<br>').content.children[0].outerHTML   // ✔️ <br>

deno-dom-wasm (v0.1.36-alpha) has parse('<br>').content.children[0].constructor // Element but outerHTML works the same.

vs

const domParser = new DOMParser();
domParser.parseFromString('<br>', 'text/xml').children[0].constructor // ❌ Element
domParser.parseFromString('<br>', 'text/xml').children[0].outerHTML   // ❌ (see below)

<br><parsererror xmlns="http://www.w3.org/1999/xhtml" style="display: block; white-space: pre; border: 2px solid #c77; padding: 0 1em 0 1em; margin: 1em; background-color: #fdd; color: black"><h3>This page contains the following errors:</h3><div style="font-family:monospace;font-size:12px">error on line 1 at column 5: Extra content at the end of the document\n</div><h3>Below is a rendering of the page up to the first error.</h3></parsererror></br>

CetinSert · 2022-11-20T14:10:53Z

Maybe we could make this an option in diffhtml-rust-parser? Allowing you to pick which backend to use (tl or html5ever).

I was only surprised about the need to use regexes. I am not necessarily suggesting/asking for an option to use deno-dom-wasm actually.

tbranyen · 2022-11-20T14:45:01Z

@CetinSert Awesome,

Maybe we could make this an option in diffhtml-rust-parser? Allowing you to pick which backend to use (tl or html5ever).

I was only surprised about the need to use regexes. I am not necessarily suggesting/asking for an option to use deno-dom-wasm actually.

Maybe there is a better way to handle parsing, it's worth thinking about. diffHTML needs a sensible parser default since it is critical to the framework. It should be fast, require minimal memory overhead, work everywhere, and is nice if it converts directly to the internal VDOM structure instead of an intermediary (HTMLElement/Node). I was inspired by and forked node fast-html-parser, since it was small and fast, but had too many issues as we see in the issue tracker here. That's why I'm rewriting using similar principles, but implemented better. SAX, DOMParser, and others I looked at years ago had too many cons around portability, GC, and performance to be a sensible default. I'd like to see a solution that works as well as regex-based under 60fps requestAnimationFrame without introducing GC pausing.

Now that Internals.parse is a public mutable API, what core uses should be less relevant if you have a particular parser you like. Maybe you don't care about performance for your use case, and you're always using Deno as the runtime so you don't need the portability. This should be a good compromise.

I do find it amusing I'm using regex given https://stackoverflow.com/a/1732454/282175, but IMO the response is overblown and specific to parsing with a single regex, not multiple. I would pay someone a lot of money if they were able to write an alternative to regexes that runs everywhere, is as spec-compliant or more, the same file size or less, operate at the same performance or faster, and same memory consumption or less. I'm not sure if it can be done, and if it can it's worth being paid for ha!

@tbranyen – I use partial results too actually. Like this:

import { Document } from "https://deno.land/x/deno_dom/deno-dom-wasm.ts";

const document = new Document();

const parse = (h: string): HTMLTemplateElement => {
  const  t = document.createElement('template');
  /**/   t.innerHTML = h;
  return t; //.content.constructor : DocumentFragment
};

//                                                               _______ : DocumentFragment
console.warn(parse('<script type=worker route="/sse"></script>').content.querySelector('script[type=worker]').getAttribute('route'));

DOMParser.parseFromString() would leak memory in some browsers, so I came up with the above workaround. It works like a charm in browsers and in Deno.

Interesting idea! One issue is that this does not seem to work correctly with full HTML documents:

Deprecations: - Removes strict mode. New features: - HTML Comments support - Extremely fast - Lots of bug fixes - Smaller footprint - Fully supports any HTML parser

tbranyen · 2022-11-22T19:32:06Z

To avoid this PR getting messier, I'm going to merge as-is, and continue the remaining tasks in subsequent pulls.

tbranyen marked this pull request as draft October 30, 2022 03:37

tbranyen force-pushed the fix-parser-components branch from c00d833 to f52f2ac Compare November 2, 2022 00:52

tbranyen commented Nov 12, 2022

View reviewed changes

packages/diffhtml/lib/util/parse.js Outdated Show resolved Hide resolved

tbranyen force-pushed the fix-parser-components branch from 632f867 to 95f574d Compare November 18, 2022 16:25

tbranyen marked this pull request as ready for review November 19, 2022 23:30

tbranyen force-pushed the fix-parser-components branch 3 times, most recently from a6c690d to 6eb141f Compare November 20, 2022 01:15

New HTML Parser

f984611

Deprecations: - Removes strict mode. New features: - HTML Comments support - Extremely fast - Lots of bug fixes - Smaller footprint - Fully supports any HTML parser

tbranyen force-pushed the fix-parser-components branch from 76143e3 to f984611 Compare November 21, 2022 13:59

tbranyen merged commit 415a6d1 into master Nov 22, 2022

tbranyen deleted the fix-parser-components branch November 28, 2022 15:37

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

HTML Parser Rewrite #283

HTML Parser Rewrite #283

tbranyen commented Oct 30, 2022 •

edited

tbranyen commented Nov 15, 2022

tbranyen commented Nov 18, 2022 •

edited

tbranyen commented Nov 19, 2022 •

edited

CetinSert commented Nov 19, 2022

tbranyen commented Nov 19, 2022 •

edited

tbranyen commented Nov 19, 2022

CetinSert commented Nov 20, 2022 •

edited

CetinSert commented Nov 20, 2022 •

edited

CetinSert commented Nov 20, 2022 •

edited

tbranyen commented Nov 20, 2022 •

edited

tbranyen commented Nov 22, 2022

HTML Parser Rewrite #283

HTML Parser Rewrite #283

Conversation

tbranyen commented Oct 30, 2022 • edited

tbranyen commented Nov 15, 2022

tbranyen commented Nov 18, 2022 • edited

tbranyen commented Nov 19, 2022 • edited

CetinSert commented Nov 19, 2022

tbranyen commented Nov 19, 2022 • edited

tbranyen commented Nov 19, 2022

CetinSert commented Nov 20, 2022 • edited

CetinSert commented Nov 20, 2022 • edited

CetinSert commented Nov 20, 2022 • edited

tbranyen commented Nov 20, 2022 • edited

tbranyen commented Nov 22, 2022

tbranyen commented Oct 30, 2022 •

edited

tbranyen commented Nov 18, 2022 •

edited

tbranyen commented Nov 19, 2022 •

edited

tbranyen commented Nov 19, 2022 •

edited

CetinSert commented Nov 20, 2022 •

edited

CetinSert commented Nov 20, 2022 •

edited

CetinSert commented Nov 20, 2022 •

edited

tbranyen commented Nov 20, 2022 •

edited