Add a "modern" parsing API #2993

Open · dominiccooney opened this Issue Sep 1, 2017 · 101 comments

@dominiccooney (Collaborator) commented Sep 1, 2017

TL;DR HTML should provide an API for parsing. Why? "Textual" HTML is a widely used syntax. HTML parsing is complex enough to want to use the browser's parser, plus browsers can do implementation tricks with how they create elements, etc.

Unfortunately the way the HTML parser is exposed in the web platform is a hodge-podge. Streaming parsing is only available for main document loads; other things rely on strings which put pressure on memory. innerHTML is synchronous and could cause jank for large documents (although I would like to see data on this because it is pretty fast.)

Here are some strawman requirements:

  • Should work with streams, and probably strings.
  • It should be asynchronous. HTML parsing is fast, but if you wanted to handle megabytes of data on phones while animating something, you probably can't do it synchronously.
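The "streams, and probably strings" requirement can be illustrated with plain web streams today: a string is just a one-chunk (or many-chunk) stream, so an API that accepts a ReadableStream covers both. A toy sketch of that plumbing only — no HTML parsing happens here, and the chunk size is arbitrary:

```javascript
// Feed a string to a stream-accepting sink in fixed-size chunks, so a
// large input never has to be handed over in one synchronous go.
function stringToStream(str, chunkSize = 3) {
  let i = 0;
  return new ReadableStream({
    pull(controller) {
      if (i >= str.length) return controller.close();
      controller.enqueue(str.slice(i, i + chunkSize));
      i += chunkSize;
    },
  });
}

// A stand-in sink that records what a streaming parser would receive.
const received = [];
const sink = new WritableStream({
  write(chunk) { received.push(chunk); },
});

await stringToStream('<p>hello</p>').pipeTo(sink);
// `received` now holds the input split into small chunks.
```

A streaming parser fed this way could yield between chunks, which is what makes the "megabytes of data while animating" case tractable.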

Commentary:

One big question is when this API exposes the tree it is operating on. Main document parsing does expose the tree and handles mutations to it pretty happily; innerHTML parsing does not until the nodes are adopted into the target node's document (which may start running custom element stuff.)

One minor question is what to do with errors.

Being asynchronous has implications for documents and/or custom elements. If you allow creating stuff in the main document, then you have to run the custom element constructors at some point, and to avoid jank you probably can't run them all together. This is probably a feature worth addressing.

See also:

Issue 2827

@jakearchibald (Collaborator) commented Sep 1, 2017

One big question is when this API exposes the tree it is operating on.

I'd like this API to support progressive rendering, so I guess my preference is "as soon as possible".

const streamingFragment = document.createStreamingFragment();

const response = await fetch(url);
response.body
  .pipeThrough(new TextDecoderStream())
  .pipeTo(streamingFragment.writable);

document.body.append(streamingFragment);

I'd like the above to progressively render. The parsing would follow the "in template" insertion mode, although we may want options to handle other cases, like SVG.

One minor question is what to do with errors

What kinds of errors?

@jakearchibald (Collaborator) commented Sep 1, 2017

There are a few libraries that use tagged template literals to build HTML, I think their code would be simpler if they knew what state the parser was in at a given point. This might be an opportunity.

Eg:

const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;

These libraries allow someContent to be text, an element, a promise for text/element. someImgSrc would be text in this case, but may be a function if it's assigning to an event listener. Right now these libraries insert a UID, then crawl the created elements for those UIDs so they can perform the interpolation.
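The "insert a UID, then crawl" round trip can be shown without any DOM at all. A stripped-down sketch with hypothetical helper names (`markup`, `tag` are illustrations, not any library's real API) — real libraries set `innerHTML` with the marker string and then walk the created nodes, whereas here we just record the marker offsets in the string:

```javascript
// Join the template's static parts with a unique marker where each
// interpolation goes, then locate those markers afterwards.
const UID = `uid-${Math.random().toString(36).slice(2)}`;

function markup(statics) {
  // `statics` has one more entry than there are interpolated values.
  return statics.join(`<!--${UID}-->`);
}

function tag(statics, ...values) {
  const html = markup(statics);
  // A real library would walk childNodes looking for comments whose
  // text is UID; here we only record where the markers ended up.
  const holes = [...html.matchAll(new RegExp(`<!--${UID}-->`, 'g'))]
    .map(m => m.index);
  return { html, holes, values };
}

const out = tag`<p>${'b'}</p><img src=${'x.png'}>`;
// one hole per interpolated value
```

A parser-state API would make the second half of this (the crawl) unnecessary, since the library would already know which node or attribute each interpolation landed in.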

I wonder if something like streamingFragment could provide enough details to avoid the UID hack.

const streamingFragment = document.createStreamingFragment();
const writer = streamingFragment.writable.getWriter();

await writer.write('<p>');
let parserState = await streamingFragment.getParserState();
parserState.currentNode; // paragraph

await writer.write('</p><img src=');
parserState = await streamingFragment.getParserState();

…I guess this last bit is more complicated, but ideally it should know it's in the "before attribute value" state for "src" within tag "img". Ideally there should be a way to get the resulting attribute & element as a promise.

+@justinfagnani @WebReflection

@inikulin (Member) commented Sep 1, 2017

@dominiccooney HTML can have conformance errors, but there are recovery mechanisms for all of them and user agents don't bail out on errors. So any input can be consumed by the HTML parser without a problem.

I like @jakearchibald's API. However, I wonder whether we need to support a full-document streaming parser and what the API would look like for it. Also, with the streaming-fragment approach, would it be possible to perform consecutive writes to the fragment (e.g. pipe one response to the fragment and afterwards another one)? If so, how would it behave: overwrite the fragment's content, or append to the end of it?

@inikulin (Member) commented Sep 1, 2017

@jakearchibald

I think their code would be simpler if they knew what state the parser was in at a given point.

What do you mean by state here? The parser insertion mode, the tokeniser state, or something else?

@jakearchibald (Collaborator) commented Sep 1, 2017

@inikulin

I wonder if we need to support full document streaming parser

Hmm yeah. I'm not sure what the best pattern is to use for that.

will it be possible to perform consecutive writes to the fragment (e.g. pipe one response to the fragment and afterwards another one)? If so, how would it behave: overwrite the fragment's content, or append to the end of it?

Yeah, you can do this with streams. Either with individual writes, or piping with {preventClose: true}. This will follow the same rules as if you mess with elements' content during initial page load.

As in, if the parser eats:

<p>Hello

…then you:

document.querySelector('p').append(', how are you today?');

…you get:

<p>Hello, how are you today?

…if the parser then receives " everyone", I believe you get:

<p>Hello everyone, how are you today?

…as the parser has a pointer to the first text node of the paragraph.

@inikulin (Member) commented Sep 1, 2017

@jakearchibald There is a problem with this approach. Consider we have two streams: one writes <div>Hey and the other one ya. Usually, when the parser encounters the end of the stream it finalises the AST; therefore, the result of feeding the first stream to the parser will be <div>Hey</div> (the parser will emit an implied end tag here). So, when the second stream writes ya, you'll get <div>Hey</div>ya as a result, which is pretty much the same as creating a second fragment and appending it to the first one. On the other hand, we could have an API that explicitly tells the parser to treat the second stream as a continuation of the first one.
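The end-of-stream finalisation described above can be mimicked with a deliberately naive toy "parser" that only tracks unclosed tags and appends implied end tags when an input finishes (the real parser does vastly more; this is just to show why two separate parses differ from one continued parse):

```javascript
// Toy model: finalising a parse closes any still-open elements, so two
// separately-finalised streams don't concatenate the way one continued
// stream would. Ignores void/self-closing elements on purpose.
function toyParse(html) {
  const open = [];
  for (const [, close, name] of html.matchAll(/<(\/?)([a-z]+)[^>]*>/g)) {
    if (close) open.pop();
    else open.push(name);
  }
  // Implied end tags at end of input.
  return html + open.reverse().map(n => `</${n}>`).join('');
}

const separate = toyParse('<div>Hey') + toyParse('ya');
const continued = toyParse('<div>Hey' + 'ya');
// separate:  '<div>Hey</div>ya'
// continued: '<div>Heyya</div>'
```

A "continuation" API would effectively let the caller choose the second behaviour across stream boundaries.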

@WebReflection commented Sep 1, 2017

Thanks @jakearchibald for thinking of us.

I can speak from my 6+ months on the template literals vs DOM pattern, so that maybe you can have as much info as possible about implementations/proposals/APIs etc.

I'll try to split this post in topics.


Not just a UID

I am not using just a UID, I'm using a comment that contains some UID.

// dumb example
function tag(statics, ...interpolations) {
  const out = [statics[0]];
  for (let i = 1; i < statics.length; i++)
    out.push('<!-- MY UID -->', statics[i]);
  return out.join('');
}

tag`<p>a ${'b'} c</p>`;

This gives me the ability to let the HTML parser split text content into chunks for me, and verify that if the nodeType of the <p> childNodes[x] is Node.COMMENT_NODE and its textContent is my UID, I'm fine.

The reason I'm using comments, besides letting the browser do the splitting job for me, is that browsers that don't support HTMLTemplateElement natively will discard partial tables, cols, or options layout, but they wouldn't with comments.

var brokenWorkAround = document.createElement('div');
brokenWorkAround.innerHTML = '<td>goodbye TD</td>';
brokenWorkAround.childNodes; // [#text]
brokenWorkAround.outerHTML;
// <div>goodbye TD</div>

You can read about this issue in the webcomponents template polyfill issues:
https://github.com/webcomponents/template/issues

In summary, if every browser were natively compatible with the template element (and the fact it doesn't ignore any kind of node), the only thing parsers like mine would need is a way to understand when the HTML engine encounters a "special node", in my case represented by a comment with special content.

Right now we all need to traverse the whole tree after creating it, in search of special placeholders.

This is fast enough as a one-off operation, and thank gosh template literals are unique so it's easy to perform the traversal only once, but it wouldn't scale on huge documents, especially now that I've learned that for browsers, due to legacy, simply checking nodeType is a hell of a performance nightmare!


Attributes are "doomed"

Now that I've explained the basics for the content, let's talk about attributes.

If you inject a comment as an attribute and there are no quotes around it, the layout is destroyed.

<nope nopity=<!-- nope -->>nayh</nope>

So, for attributes, having a similar mechanism to define a unique entity/value to be notified about would be ACE!!!! Right now the content is injected sanitized upfront. It works darn well but it's not ideal as a solution.

More on Attributes

If you put a placeholder in attributes you have the following possible issues:

  • IE / Edge might throw random errors and break if the attribute is, for example, style, and the content does not contain colons (even if it's invalid). some: uid; works, shena-nigans wouldn't.
  • some not-so-smart browsers throw errors with invalid attributes. As an example, <img src=uid> would throw an error about the resource without even bothering the network (which has a smarter layer). This is Firefox.
  • some nodes will throw errors at first parse, without failing though (thank gosh). These are SVG nodes. If you have <rect x=uid y=uid />, before you set the right values it will show an error that x or y were not valid.

HTML is very forgiving in many parts, attributes are quite the opposite for various scenarios.

In summary, if some mechanism would tell the browser that any attribute with such special content should be ignored, all these problems would disappear.


Backward compatibility

As much as I'd love to have help from the platform itself regarding the template literals pattern, I'm afraid it won't ever land in production until all browsers out there support it (or there is a reliable polyfill for it).

That means that exposing the internal HTML parser through a new API can surely benefit projects in the future, but it wouldn't land in all browsers for 5+ years.

This last point is just my consideration about effort / results ratio.

Thanks again for helping out regardless.

@jakearchibald (Collaborator) commented Sep 1, 2017

@inikulin

There is a problem with this approach

I don't think it's a problem. If you use {preventClose: true}, it doesn't encounter "end of stream". So:

await textStream1.pipeTo(streamingFragment.writable, { preventClose: true });
await textStream2.pipeTo(streamingFragment.writable);

The streaming fragment would consume the streams as if there were a single stream concatenated.

await textStream3.pipeTo(streamingFragment.writable);

The above would fail, as the writable has now closed.
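This piping behaviour is plain web streams and can be checked without any fragment API. With a stand-in accumulating sink in place of the hypothetical streamingFragment.writable, {preventClose: true} keeps the sink open for the next pipe, and a pipe after a normal close rejects:

```javascript
// Accumulating sink standing in for streamingFragment.writable.
const chunks = [];
const sink = new WritableStream({
  write(chunk) { chunks.push(chunk); },
});

const streamOf = (...parts) => new ReadableStream({
  start(c) { parts.forEach(p => c.enqueue(p)); c.close(); },
});

await streamOf('<div>Hey').pipeTo(sink, { preventClose: true });
await streamOf('ya').pipeTo(sink); // this pipe closes the sink

let failed = false;
await streamOf('nope').pipeTo(sink).catch(() => { failed = true; });
// chunks is ['<div>Hey', 'ya']; the third pipe rejected because the
// destination was already closed.
```

From the sink's point of view the first two pipes are indistinguishable from one concatenated stream, which is exactly the continuation semantics being proposed.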

@WebReflection commented Sep 1, 2017

P.S. just in case my wishes come true ... what both me and (most likely) Justin would love to have natively exposed is a document.queryRawContent(UID) that would return, in linear order, attributes with such a value, or comment nodes with such a value.

<html lang=UID>
<body> Hello <!--UID-->! <p class=UID></p></body>

The JS counterpart would be:

const result = document.queryRawContent(UID);
[
  the html lang attribute,
  the comment childNodes[1] of the body,
  the p class attribute
]

Now that, in core, would make my parser a no-brainer (beside the issue with comments and attributes, but RegExp upfront are very good at that and blazing fast).

[edit] even while streaming it would work, actually it'd be even better so it's one pass for the browser

@WebReflection commented Sep 1, 2017

Also since I know for many code is better than thousand words, this is the TL;DR version of what hyperHTML does.

function tag(statics, ...interpolations) {
  if (this.statics !== statics) {
    this.statics = statics;
    this.updates = parse.call(this, statics, '<!--WUT-->');
  }
  this.updates(interpolations);
}

function parse(statics, lookFor) {
  const updates = [];
  this.innerHTML = statics.join(lookFor);
  traverse(this, updates, lookFor);
  const update = (value, i) => updates[i](value);
  return interpolations => interpolations.forEach(update);
}

function traverse(node, updates, lookFor) {
  switch (node.nodeType) {
    case Node.ELEMENT_NODE:
      Array.prototype.forEach.call(node.attributes, attr => {
        if (attr.value === lookFor)
          updates.push(v => attr.value = v);
      });
      Array.prototype.forEach.call(node.childNodes,
        node => traverse(node, updates, lookFor));
      break;
    case Node.COMMENT_NODE:
      if (`<!--${node.textContent}-->` === lookFor) {
        const text = node.ownerDocument.createTextNode('');
        node.parentNode.replaceChild(text, node);
        updates.push(value => text.textContent = value);
      }
  }
}

const body = tag.bind(document.body);

setInterval(() => {
  body`
  <div class="${'my-class'}">
    <p> It's ${(new Date).toLocaleTimeString()} </p>
  </div>`;
}, 1000);

The slow path is the traverse function, the not-so-cool part is the innerHTML injection (as regular node, template or whatever it is) without having the ability to intercept, while parsing the string, all placeholders / attributes and act addressing them accordingly.

OK, I'll let you discuss the rest now 😄

@jakearchibald (Collaborator) commented Sep 1, 2017

@WebReflection

I think the UID scanner you're talking about might not be necessary. Consider:

const fragment = whatever`
  <p>${someContent}</p>
  <img src=${someImgSrc}>
`;

Where whatever could do something like this:

async function whatever(strings, ...values) {
  const streamingFragment = document.createStreamingFragment();
  const writer = streamingFragment.writable.getWriter();

  for (const str of strings) {
    // str is:
    // <p>
    // </p> <img src=
    // >
    // (with extra whitespace of course)
    await writer.write(str);
    let parserState = await streamingFragment.getParserState();

    if (parserState.tokenState == 'data') {
      // This is the case for <p>, and >
      await writer.write('<!-- -->');
      parserState.currentTarget.lastChild; // this is the comment you just created.
      // Swap it out for the interpolated value
    }
    else if (parserState.tokenState.includes('attr-value')) {
      // await the creation of this attr node
      parserState.attrNode.then(attr => {
        // Add the interpolated value, or remove it and add an event listener instead etc etc.
      });
    }
  }
}

@WebReflection commented Sep 1, 2017

Yes, that might work. As long as these scenarios are allowed:

const fragment = whatever`
  <ul>${...}</ul>
  ${...}
  <p data-a=${....} onclick=${....}>also ${...} and</p>
  <img a=${...} b=${...} src=${someImgSrc}>
  <table><tr>${...}</tr></table>
`;

which looks like it'd be the case.

@jakearchibald (Collaborator) commented Sep 1, 2017

@WebReflection Interpolation should be allowed anywhere.

whatever`
  <${'img'} src="hi">
`;

In the above case tokenState would be "tag-open" or similar. At this point you could either throw a helpful error, or just pass the interpolated value through.

@inikulin (Member) commented Sep 1, 2017

@jakearchibald Do you expect tokenState to be one of the tokeniser states defined in https://html.spec.whatwg.org/multipage/parsing.html#tokenization? If so, I'm afraid we can't do that: they are part of the parser's intrinsics and are subject to change. Moreover, some of them can be meaningless for a user.

@jakearchibald (Collaborator) commented Sep 1, 2017

@inikulin yeah, that's what I was hoping to expose, or something equivalent. Why can't we expose it?

@WebReflection commented Sep 1, 2017

@jakearchibald

what about the following ?

whatever`
  <${'button'} ${'disabled'}>
`;

I actually don't mind having that possible, because boolean attributes need boolean values, so ${obj.disabled ? 'disabled' : ''} doesn't look like a great option to me, but I'd be curious to know if "attribute-name" would be exposed too.

Anyway, having my example covered would be already awesome.

@jakearchibald (Collaborator) commented Sep 1, 2017

@WebReflection The tokeniser calls that the "Before attribute name state", so if we could expose that, it'd be possible.

@WebReflection commented Sep 1, 2017

Not sure if this is just extra noise or something valuable, but if it can simplify anything: viperHTML uses a similar mechanism to parse once on the Node.js side.

The parser is the pretty awesome htmlparser2.

Probably inspiring as an API? I use the comment trick there too, but since there is a .write mechanism, I believe it could be possible to make it incremental.

@inikulin (Member) commented Sep 1, 2017

@jakearchibald These states are part of the intrinsic parser mechanism and are subject to change; we've even removed/introduced a few recently just to fix a conformance-error-related bug in the parser. So, exposing them to end users would require us to freeze the current list of states, which would significantly complicate further development of the parser spec. Moreover, I believe some of them would be quite confusing for end users, e.g. the "Comment less-than sign bang dash dash" state.

@WebReflection commented Sep 1, 2017

@inikulin would a subset be reasonable? As an example, data and attr-value would already cover 100% of hyperHTML's use cases, and I believe those two will never change in the history of HTML ... right?

@jakearchibald (Collaborator) commented Sep 1, 2017

I'm keen on exposing some parser state to help libraries, but I'm happy for us to add it later rather than block streaming parsing on it.

@inikulin (Member) commented Sep 1, 2017

@WebReflection Yes, that could be a solution. But I have some use cases in mind that could be confusing for the end user. Consider <div data-foo="bar". We'll emit the attr-value state in that case, however this markup will not produce an attribute in the AST (it will not even produce a tag, since unclosed tags at the end of the input stream are dropped).

@WebReflection commented Sep 1, 2017

@inikulin if someone writes broken HTML I don't expect anything different than throwing errors and breaking everything right away (when using a new parser API).

Template literals are static; there's no way one of them would suddenly start failing the parser ... it either works or fails forever, since these are also frozen arrays.

Accordingly, I understand this API is not necessarily for template literals only, but if the streamer goes bananas due to wrong output it's the developer's fault.

Today it's the developer's fault regardless, but she'll never notice due to the silent failure.

@inikulin (Member) commented Sep 1, 2017

if someone writes broken html I don't expect anything different than throwing errors and break everything right away.

You will be surprised looking at real-world markup around the web. Also, there is no such thing as "broken markup" anymore. There is non-conforming markup, but a modern HTML parser can swallow anything. So, to conclude, you suggest bailing out with an error in case of invalid markup in this new streaming API?

@WebReflection commented Sep 1, 2017

You will be surprised looking at the real world markup around the web.

you missed the edit: when using a new parser API

So, to conclude, you suggest to bail out with an error in case of invalid markup in this new streaming API?

If the alternative is to not have it, yes please.

I'm tired of missed opportunities due to lazy developers that need to be coddled by standards for their mistakes.

@inikulin (Member) commented Sep 1, 2017

If the alternative is to not have it, yes please.

I'm tired of missed opportunities due to lazy developers that need to be coddled by standards for their mistakes.

I'm not keen on this approach, to be honest; it brings us back to the times of XHTML. One of the advantages of HTML5 was its flexibility regarding parse errors and, hence, document authoring.

@WebReflection commented Sep 1, 2017

This API's goal is different, and developers want to know if they wrote a broken template.

Not knowing hurts them, and since there is no HTML highlighting by default inside strings, it's also a safety belt for them.

So throw like any failed asynchronous operation would throw, and let them decide whether they want to fall back to innerHTML or fix that template literal instead, and forever.

@jakearchibald (Collaborator) commented Sep 7, 2017

@domenic

from what I can tell you're still thinking of the parser states as a "real" thing, and as a low-level primitive. But as @inikulin pointed out, they're not really primitives, they're just implementation details and spec devices we use to navigate through the algorithm.

If browsers don't implement it and don't intend to, what's the point of having it in a spec? I realise that browsers may use different terms internally, but unless they're implementing something wildly different from the spec, and intend to continue doing so, those states could be mapped to something standard.

I also don't think we should expose them.

Why?

@dominiccooney

What if we exposed a smaller set of states?

Agreed. We could even start by exposing nothing, but design the parser in a way that allows this in future.

@inikulin (Member) commented Sep 7, 2017

@jakearchibald

but unless they're implementing something wildly different to the spec

Sometimes they do, e.g. Blink doesn't use the states from the spec that are dedicated to entity parsing and uses a custom state machine for that: https://chromium.googlesource.com/chromium/blink/+/master/Source/core/html/parser/HTMLEntityParser.cpp

@annevk (Member) commented Sep 7, 2017

Right, generally specifications define some kind of process that brings you from A to B. The details of that process are not important and implementations are encouraged to compete in that area. The moment you want to expose more details of that process to the outside world it starts mattering a whole lot more what those details are and how they function, as the moment you expose them you prevent all kinds of optimizations and code refactoring that could otherwise take place.

@jakearchibald (Collaborator) commented Sep 7, 2017

Fair enough. It'd be good to expose these states at some point, but it doesn't need to be v1.

@RReverser (Member) commented Sep 7, 2017

@WebReflection I agree having events for separate pieces of the HTML as it goes through would be quite nice, but I'd say it's already a bit more advanced than "as small as possible", more like version 2. For version 1, it would be nice at least to be able to insert streaming content into the DOM, even without hooks for separate parts of it.

@WebReflection commented Sep 7, 2017

events are just attributes ... what I've written intercepts/pauses at dom chunks and / or attributes, no matter which attribute it is or what it does ... attributes 😄

@RReverser (Member) commented Sep 7, 2017

@WebReflection Sure, but as I said, it's a bit more advanced because it requires providing hooks from inside of the parser. I want to start with something that will be definitely possible to get implemented by vendors with pretty much no changes or hooks that are not already there, and then iterate on top of that.

@dvoytenko commented Sep 7, 2017

@dominiccooney

Thanks for those details. Roughly how much content are we talking about here?

These are really full-size docs. Anywhere between 10K and 200K. I don't know what the averages are, tbh.

@jakearchibald (Collaborator) commented Sep 11, 2017

#2142 – previous issue where a streaming parsing API was discussed

@inikulin (Member) commented Sep 14, 2017

Another important question: do we want it to behave like a streaming innerHTML? If so, such functionality can't be achieved with the fragment approach, since we don't know the context of parsing ahead of time. Consider we have a <textarea> element. With the innerHTML setter the parser knows that content will be parsed in the context of a <textarea> element and switches the tokeniser to text parsing mode, so e.g. <div></div> will be parsed as text content, whereas with a fragment we'll parse it as a div tag. If we use the same machinery for the fragment parsing approach as we use for <template> parsing, we can work around some of the cases, such as parsing table content (though e.g. foster parenting will not work), but everything that involves adjustment of the tokeniser state will be a problem.

@jakearchibald (Collaborator) commented Sep 14, 2017

@inikulin The fragment could buffer text until it's appended, at which point it knows its context. Although I guess it's a bit weird that you wouldn't be able to look at stuff in the fragment.

The API could take an option that would give it context ahead of time, so nodes could be created before insertion.

@inikulin (Member) commented Sep 14, 2017

@jakearchibald What if we modify the API a bit? We'll introduce a new entity, let's call it StreamingParser for now:

// If we provide a context element, then content is streamed directly to it.
let parser = new StreamingParser(contentElement);

let response = await fetch(url);
response.body
  .pipeTo(parser.stream);

// You can examine parsed content at any moment using the `parser.fragment`
// property, which is a fragment mapped to the parsed content in the context element
console.log(parser.fragment.childNodes.length);

// If a context element is not provided, we don't stream content anywhere;
// however, you can still use `parser.fragment` to examine content or attach it to some node
parser = new StreamingParser();

// ...

@jakearchibald (Collaborator) commented Sep 14, 2017

If you don't provide the content element, how is the content parsed?

@inikulin (Member) commented Sep 14, 2017

In that case parser.fragment (or, even better, call it parser.target) will be a DocumentFragment implicitly created by the parser.

@jakearchibald (Collaborator) commented Sep 14, 2017

Is that a valid context for a parser?

@jakearchibald (Collaborator) commented Sep 14, 2017

As in, if I push <path/> to the parser, what ends up in parser.fragment?

@inikulin (Member) commented Sep 14, 2017

A DocumentFragment itself is not a valid context for the parser. I forgot to elaborate here: in case we don't provide a content element for the parser, it creates a <template> element under the hood and pipes content into it; parser.target will be template.content in this case.

@jakearchibald (Collaborator) commented Sep 14, 2017

It'd still be nice to have the nodes created before insertion into the target. A "context" option could do this. The option could take a Range, an Element (treated like a range that starts within the element), or a DOMString, which would be treated as an element created by document.createElement(string).

@inikulin (Member) commented Sep 14, 2017

How will it behave if we pass a Range as a context?

@inikulin (Member) commented Sep 27, 2017

@jakearchibald Seems like I got it: in the case of a Range, we'll stream to all elements in the Range? If so, we'll need a separate instance of the parser for each element in the Range.

@jakearchibald (Collaborator) commented Sep 27, 2017

@inikulin whoa, I really thought I'd replied to this, sorry. Range would simply be used to figure out the context, like https://w3c.github.io/DOM-Parsing/#idl-def-range-createcontextualfragment(fragment). There'd only be one parser instance.

@inikulin (Member) commented Sep 27, 2017

@jakearchibald Thanks for the clarification. We've just discussed possible behaviours with @RReverser, and we were wondering if parsing should affect the context element's ambient context: e.g. if we stream inside a <table> and the provided markup contains text outside a table cell, should we move this text above the context <table> element (foster-parent it) as is done in full document parsing? Or should we behave exactly like innerHTML and keep the text inside the <table>?

@jakearchibald (Collaborator) commented Sep 27, 2017

Hmm, that's a tough one. It'd be difficult to do what the parser does while giving access to the nodes before they're inserted. As in:

const streamingFragment = document.createStreamingFragment({context: 'table'});
const writer = streamingFragment.writable.getWriter();
await writer.write('hello');

// Is 'hello' anywhere in streamingFragment.childNodes?

In cases where the node would be moved outside of the context, we could do the innerHTML thing, or discard the node (it's been moved outside of the fragment, to nowhere).

I'd want to avoid as many of the innerHTML behaviours as possible, but I guess it isn't possible here.

@RReverser (Member) commented Sep 27, 2017

Another concern we discussed with @inikulin (also related to the discussion in the last few comments) is that content being parsed might contain closing tags and so leave the parent context. In that regard, the behaviour of innerHTML or createContextualFragment seems better in that it keeps the content isolated, although we're still not sure how stable the machinery for the latter API is (given that it does more than innerHTML, e.g. executing scripts is allowed).

@domenic (Member) commented Nov 14, 2018

In an offline discussion, @sebmarkbage brought up the helpful point that if we added Response-accepting srcObject to iframe (see #3972), this would also serve as a streaming parsing API, albeit only in iframes.

@RReverser (Member) commented Nov 14, 2018

@domenic Hmm, I'm not sure how it would help with streaming parsing? Seems to mostly help with streaming generation of content?

@domenic (Member) commented Dec 19, 2018

@RReverser The parsing would also be done in a streaming fashion, just like it is currently done for iframes loaded from network-derived lowercase-"r" responses.

@RReverser (Member) commented Dec 20, 2018

What I mean is, I don't see how this helps with actually parsing HTML from JS side (and getting tokens etc.), it rather seems to help with generating and delivering HTML to the renderer.

@RReverser (Member) commented Dec 20, 2018

Actually, never mind: I realised that half of this old thread was already about the "delivery to the renderer" problem and not actual parsing. Which is useful too, but it seems confusing to mix both in the same discussion.
