Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Browser executable-less preprocessor #1469

Open
saschanaz opened this issue Feb 2, 2018 · 27 comments
Open

Browser executable-less preprocessor #1469

saschanaz opened this issue Feb 2, 2018 · 27 comments
Labels
Low priority Suggestions with no good driving force

Comments

@saschanaz
Copy link
Member

ReSpec shouldn't require installing real browsers to use respec2html.js as jsdom supports browser-like DOM manipulation in Node.js. Dropping Nightmare (and so Electron) dependency will allow easier build environment settings on GPU-less CLI environments.

@marcoscaceres
Copy link
Member

Definitely. I've hit this on Dreamhost too.

The problem in the past (years ago!) was that JSDOM was super-duper slow for whatever reason - specially on big specs - so it wasn't feasible to use it. That's probably been addressed since then, but someone would need to try it out.

@tidoust
Copy link
Member

tidoust commented May 12, 2018

I thought I'd report here because I've been investigating how to make Respec run in JSDOM (for Reffy). Good news is it's mostly doable. Bad news is it's not exactly straightforward...

The tweaks that I needed to make to JSDOM and Respec in Reffy are parked in a drop-respec branch in Reffy for now. See code and comments in util.js. Here is a summary of problems I encountered and workarounds I found so far:

  1. JSDOM does not support URL.createObjectURL:
    Respec uses that to create a separate worker for the core/highlight module. Dropping that module solves the problem, but obviously means that the module does not run.

  2. JSDOM does not support document.createRange:
    Respec uses that in the core/list-sorter module. Dropping that module solves the problem, with the obvious same consequence as above.
    Also note Fix temp ID syntax in list sorter #1665 related to that module: JSDOM throws an error because the generated CSS selector is invalid.

  3. JSDOM's CSS parser is not necessarily up-to-date:
    JSDOM does not like the @keyframes and @supports directives in particular and throws an error when parsing the CSS Respec injects. Not sure this really has any real impact, but since I don't need the CSS in Reffy, replacing them with @media directives does the trick.

  4. JSDOM does not support innerText:
    Given the use in Respec, replacing occurrences of innerText in Respec by textContent solves the problem.

  5. marked behaves strangely in the presence of <pre> blocks when there are no blank lines after the block.
    For instance, if you look at this markdown demo, you'll note that the generated HTML for the <pre> tag includes unexpected (and broken) <p> tags.
    Now, the problem is that Respec drops blank lines before it calls that library. An example of spec where this happens is Netinfo. Interestingly, in "real" browsers, the weird tags magically disappear afterwards and the generated HTML looks clean (but the output of markdownToHtml is wrong). I do not know why. In JSDOM, the generated HTML stays ugly, and extracting the IDL afterwards is problematic.
    Replacing \n with \n\n in normalizePadding to preserve a blank line fixes the issue.

  6. JSDOM does not support cloning of attributes:
    This is needed by HyperHTML. In theory, providing a polyfill is easy but constructors and prototypes are shared in JSDOM, so there is no good way to get back to the underlying document when given an attribute to clone that does not belong to one, or at least I could not find a good solution here. This requires tweaking the code of HyperHTML to call createAttributeNS itself when the attribute it wants to clone does not belong to a document.

  7. JSDOM does not support inserAdjacentElement:
    It's easy to come up with a polyfill.

  8. JSDOM does not support closest:
    Same thing, it's easy to come up with a polyfill.

  9. JSDOM does not support matchMedia:
    Given that Reffy does not need that logic, it's easy to mock the interface

  10. JSDOM does not support fetch:
    Actually, that's fine for Reffy, because I want to control network requests as much as possible and force usage of a local cache when possible.

  11. JSDOM does not support a few visual methods on window:
    In particular, blur, focus, moveBy, moveTo, resizeBy, resizeTo, scroll, scrollBy, scrollTo are not supported (and a couple of specs call scrollBy). Reffy does not need that logic, so it's easy to mock these methods.

  12. JSDOM does not support IndexedDB:
    Respec degrades gracefully when IndexexDB is not supported. It still outputs an obscure error message to say that it cannot call open on undefined (where undefined should have been IndexedDB). Detecting whether IndexedDB exists before trying to access it in Respec would allow to report a more useful error message.
    Also worth noting here: the readyPromise gets rejected but, depending on the spec, the promise may be checked in another tick. When that happens, Node.js reports an UnhandledPromiseRejectionWarning, which is, as far I can tell, a false positive.

  13. JSDOM does not seem to support the whatToShow filter in TreeWalker:
    As a result, HyperHTML fails to remove the <!-- _hyper: xxxx --> comments it adds all over the place when it runs. That does not have any consequence.

  14. As far as I can tell, there is no easy way to run code in JSDOM once Respec has started to run, and in particular when the respecIsReady promise is set. As a result, I need to do a bit of polling to wait for the promise to appear so that I can attach to it and detect the end of the generation. Or is there a better way?

With the workarounds and limitations mentioned above, I manage to run Reffy on all W3C specs that Reffy knows about, and the outcome seems good!

@saschanaz
Copy link
Member Author

saschanaz commented May 12, 2018

Thanks for sharing your experience!

Yeah, I also tried JSDOM and faced those limits. Maybe we should change the whole logic. Some personal thoughts:

  1. We now have access to ES modules, ideally we should be able to use the same modules in both browsers and Node.js without JSDOM page load or a separate browser executable.
  2. We currently heavily depends on browser native global DOM methods, I think we can still use JSDOM there if we make sure that we pass JSDOM document object rather than using global variables. Or viperHTML?
  3. Workers. I hope we can load it as a module in Node.js, if worker scripts can be modules... (Update: Yes, a worker script can be a module.)

@marcoscaceres
Copy link
Member

@tidoust, I thought Reffy ran against "static" documents? If we had a Reffy web service, ReSpec could simply post a static snapshot directly from the browser without needing to run any transformations on the server.

To rephrase: what is Reffy's input?

@tidoust
Copy link
Member

tidoust commented May 14, 2018

To rephrase: what is Reffy's input?

Reffy's goal is to generate a knowledge graph of specs and an analysis of that graph to create a list of potential anomalies. To generate the knowledge graph, Reffy:

  1. starts from a list of spec URLs (multiple lists, actually: specs-common.json, specs-cg.json, specs-whawg.json.
  2. retrieves information about each spec from the W3C API and Specref, including the URL of the Editor's Draft
  3. fetches the Editor's Draft of the spec
  4. parses the Editor's Draft to extract IDL, references and links. For Editor's Drafts that use ReSpec, this is where Reffy needs to get the generated document, which is currently done using RespecDocWriter.

The input of the Reffy web service that we talked about is slightly different. It would take as input:

  • the URL of a spec, or a static snapshot of the generated draft
  • the latest version of the knowledge graph (typically from reffy-reports).

From that, it would simply have to parse the provided spec, and analyses the result against the knowledge graph. No transformation is needed on the server for that (unless the input is the raw source, obviously).

I started to develop such a service (see reffy-service). I haven't had time to test it with POST content, so it's most probably broken for now. It also lacks a README for now. Note the service needs the HTML and the URL of the spec to test. The URL is needed when the spec already exists in the knowledge graph, so that Reffy can remove it from there first before, otherwise the resulting graph would contain two specs that e.g. define the same IDL terms, and analysis would report that as anomalies.

I'm now looking at ways to make the service more robust before we deploy it on one of our servers. The investigation reported here is part of that, as an attempt to reduce the number of dependencies that would have to run on the server.

@marcoscaceres
Copy link
Member

@tidoust thanks for clarifying. @saschanaz, do you know if puppeteer runs without needing a virtual screen? If yes, then that addresses GPU-less CLI environments problem in one sense... still a fat dependency, but at least we don't need to refactor code significantly to back port ReSpec to JSDOM.

@saschanaz
Copy link
Member Author

do you know if puppeteer runs without needing a virtual screen?

puppeteer runs in headless mode by default, so yes.

@marcoscaceres
Copy link
Member

@tidoust, given that Puppeteer runs without needing the virtual display, is it a viable alternative to JSDOM for Reffy?

@tidoust
Copy link
Member

tidoust commented May 14, 2018

@tidoust, given that Puppeteer runs without needing the virtual display, is it a viable alternative to JSDOM for Reffy?

Sure! I was not reporting these issues as requests to change ReSpec. The problem here is not how ReSpec is written, it is JSDOM not yet supporting enough features for ReSpec. I just wanted to document the current limitations somewhere.

Actually, we still have a virtual screen for now on the server, but maybe that's no longer needed with Puppeteer (it was for nightmare). We just haven't tried without. But that was not the reason why I wanted to get rid of RespecDocWriter, it was mostly to simplify dependencies and avoid having to run a full chromeless browser on the server.

@marcoscaceres
Copy link
Member

I just wanted to document the current limitations somewhere.

Ok cool. Sorry I was confused. I appreciate the documentation above also, so thanks for that!

We just haven't tried without. But that was not the reason why I wanted to get rid of RespecDocWriter, it was mostly to simplify dependencies and avoid having to run a full chromeless browser on the server.

Understood. Ok, if there is anything else we can do, happy to try to help. Reffy looks super cool and if we can integrate somehow directly (via POST or whatever), happy to help with that however we can!

tidoust added a commit to w3c/reffy that referenced this issue May 15, 2018
Reffy needs to run ReSpec on Editor's Drafts. It relied on RespecDocWriter
until now, which uses Puppeeter, which in turn relies on a headless version
of Chrome. That requires downloading 100s of MB of dependencies and is a tad
heavy for the task at hands, especially as we consider running Reffy as an
HTTP service.

With this update, RespecDocWriter is no longer being used. Instead, Reffy makes
ReSpec run in JSDOM, which is much more lightweight (and JS-based).

Problem is that JSDOM does not yet support enough features to run ReSpec.
Concrete issues are documented in:
w3c/respec#1469 (comment)

This update monkey-patches JSDOM's code to add good-enough implementations of
missing features, as well as to regain control over network requests so that
all network requests go through our cache. It also monkey-patches ReSpec to:

1. drop a couple of modules that cannot run because they use non-implemented
features of JSDOM that cannot be easily patched
2. update the code of a couple of ReSpec dependencies for the same reason

The whole thing remains fragile: updates to ReSpec or to its dependencies may
well break the regular expressions used to monkey patch the code. Most of the
monkey-patching should resist minor changes. A couple of them may not though.
tidoust added a commit to w3c/reffy that referenced this issue May 15, 2018
Reffy needs to run ReSpec on Editor's Drafts. It relied on RespecDocWriter
until now, which uses Puppeeter, which in turn relies on a headless version
of Chrome. That requires downloading 100s of MB of dependencies and is a tad
heavy for the task at hands, especially as we consider running Reffy as an
HTTP service.

With this update, RespecDocWriter is no longer being used. Instead, Reffy makes
ReSpec run in JSDOM, which is much more lightweight (and JS-based).

Problem is that JSDOM does not yet support enough features to run ReSpec.
Concrete issues are documented in:
w3c/respec#1469 (comment)

This update monkey-patches JSDOM's code to add good-enough implementations of
missing features, as well as to regain control over network requests so that
all network requests go through our cache. It also monkey-patches ReSpec to:

1. drop a couple of modules that cannot run because they use non-implemented
features of JSDOM that cannot be easily patched
2. update the code of a couple of ReSpec dependencies for the same reason

The whole thing remains fragile: updates to ReSpec or to its dependencies may
well break the regular expressions used to monkey patch the code. Most of the
monkey-patching should resist minor changes. A couple of them may not though.
@saschanaz saschanaz changed the title Use/support jsdom to preprocess Browser executable-less preprocessor Mar 22, 2019
@saschanaz
Copy link
Member Author

saschanaz commented Mar 22, 2019

@marcoscaceres This has always been my ultimate purpose, for binary-less lighter dependencies, a scriptable API, easier test without karma, and easier debug. Could we bake an experimental build with minimal features to check what should be done, although I'm currently not sure which feature to start?

Edit: My vague thought:

  1. We create a structure that represents ReSpec document, which includes Document object, user configuration, definition maps, biblio maps, etc.
  2. Expose an API function that receives Document and user configuration, creates the structure, and passes it to modules. For the current in-HTML script injection use, ReSpec calls it internally with the current document and respecConfig value.
  3. Modules will receive the structure, replacing the current use of global document and the conf argument. Normalizing configuration shouldn't modify the input configuration to prevent unexpected side effect.
  4. Modules that store its state in module level should instead store it in the structure.
  5. Modules that add event listeners, or do anything cannot be stored statically, should instead insert an inline script to do it. (Separate static and dynamic parts #1668)

@marcoscaceres
Copy link
Member

Excited to see where this ends up!

@saschanaz
Copy link
Member Author

saschanaz commented Mar 27, 2019

I got a blocker from hyperHTML, which cannot be imported on node.js. Maybe we have to fork viperhtml (which is a lot lighter than hyperhtml but only for server) to make it work.

@marcoscaceres
Copy link
Member

Pinged WebReflection for guidance.

@WebReflection
Copy link

I got a blocker from hyperHTML, which cannot be imported on node.js

hyperHTML is 100% code covered via node, all you need to do is to write require('basichtml').init() before requiring hyperhtml.

I've used basichtml to serve output even in cloudflare so everything you need should be there.

However, viperHTML is always an alternative if you need just plain SSR 👋

@saschanaz
Copy link
Member Author

So it looks like require('basichtml').init() defines document globally? That somehow clashes with jsdom... But thanks, TIL basichtml.

@WebReflection
Copy link

WebReflection commented Mar 28, 2019

@saschanaz if you have jsdom, and it pollutes the global scope, hyperHTML should work already.

If you don't you can use an empty .init() or specify various things, including the object to pollute.

i.e.

const window = {};
require('basichtml').init({window});

const {customElements, document, HTMLElement} = window;
customElements.define('hello-world', class extends HTMLElement {
  connectedCallback() {
    this.textContent = 'Hello world!';
  }
});

// workaround to execute with a temporary global document
const $ = fn => {
  const oldDocument = global.document;
  global.document = document;
  const result = fn();
  if (oldDocument)
    global.document = oldDocument;
  else
    delete global.document;
  return result;
};

const {bind} = $(() => require('hyperhtml'));

$(() => {
  bind(document.documentElement)`
  <head>
    <title>This is basicHTML</title>
  </head>
  <body><hello-world/></body>`;
});

console.log(document.toString());
/*
<!DOCTYPE html><html><head>
    <title>This is basicHTML</title>
  </head><body><hello-world>Hello world!</hello-world></body></html>
*/

but then again, if you have already a document everything should just work, you just need to require hyperhtml after 👋

@saschanaz
Copy link
Member Author

I mean require("basichtml").init(); require("jsdom"); fails. Maybe jsdom detects browser environment by checking global existence of document or something...

Currently I decided to use viperhtml for template string and parse the result with jsdom again, as I'm currently trying to remove any global or module level states. It's still a draft, so maybe I could migrate to basichtml later if it looks great enough.

@WebReflection
Copy link

I mean require("basichtml").init(); require("jsdom"); fails.

that is not a real-world use case, but even if you need to do that, you can pass an object instead, so nothing gets polluted, and jsdom will work.

Currently I decided to use viperhtml for template string and parse the result with jsdom again, as I'm currently trying to remove any global or module level states

You are contradicting yourself, since jsdom pollute the global scope, right?

It's still a draft, so maybe I could migrate to basichtml later if it looks great enough.

Sure, whatever works best. I don't even know what is your actual goal (had no time to read this whole thread, I just jumped in at "I got stuck due node" when all I do is using node for everything I develop on the web 😁)

@saschanaz
Copy link
Member Author

saschanaz commented Mar 28, 2019

since jsdom pollute the global scope

AFAIK it doesn't, does it?

@WebReflection
Copy link

actually my bad, it doesn't. I've seen tests written with global.window = new JSDOM but apparently it doesn't indeed pollute anything by its own.

Then again, neither does basichtml, it's just the handy default that never, personally, hurts.

In regards of hyperHTML it needs to do some feature detection requiring a document in a way or another.

@saschanaz
Copy link
Member Author

@tidoust Hi! Unfortunately we found that #2187 introduces too many changes and creates additional maintenance burden to us. (Even when it might give some hypothetical wins 👀)

I want to hear your current need again to decide the future of #2187. Could you use respec2html now that we depend on headless Chromium? If you can't, what would be the blockers?

@tidoust
Copy link
Member

tidoust commented Oct 28, 2019

Hello @saschanaz,

I want to hear your current need again to decide the future of #2187. Could you use respec2html now that we depend on headless Chromium? If you can't, what would be the blockers?

I don't think that there is anything blocking us from using respec2html in practice. It was more a matter of reducing dependencies to large binaries (also not to have to run a virtual screen on the server, but that's no longer needed with a headless Chromium).

The one thing that I do now, and which I don't know if I can easily do if we switch to respec2html, is to hook my own implementation of "fetch" to force the use of cached versions of resources so that I can run crawls offline (for testing purpose). That is not a requirement though, and I'm sure I'll find a way to do that.

I totally understand the maintenance burden argument. I actually considered using a forked version of Respec with #2187 baked in... and decided against it for the very same reason.

+@dontcallmedom who was the one advocating for the switch to JSDOM in the first place.

@dontcallmedom
Copy link
Member

I agree it'd a nice to have; if it's too costly to maintain, then we'll absorb the cost in running with headless chromium.

@saschanaz
Copy link
Member Author

Okay, that means this is now officially low-priority unless we find an easier way.

@marcoscaceres I still want to keep #2187 open to 1) extract some good thing from there 2) maybe just for me to experiment things. Sounds good?

@saschanaz saschanaz added the Low priority Suggestions with no good driving force label Oct 28, 2019
@marcoscaceres
Copy link
Member

Sounds good. Yes, let’s grab whatever goodies we can from #2187.

tidoust added a commit to w3c/reffy that referenced this issue Nov 21, 2019
Making latest version of Respec run under JSDOM is too much of a maintenance
burden in the end:
w3c/respec#1469 (comment)

This update makes Reffy use respecDocWriter (which uses the headless browser
Puppeteer under the hoods) again. This allows to run the latest version of
Respec.

JSDOM is still used to load specifications that do not use Respec and to detect
when a specification uses Respec. Note the code no longer runs external scripts
when it loads specifications: that seems unneeded in practice.

The code still contains minimal monkey patching for JSDOM so because a number
of specifications do have inline scripts that call `window.matchMedia`, which
JSDOM does not support.

This will fix #134.
tidoust added a commit to w3c/reffy that referenced this issue Nov 21, 2019
Making latest version of Respec run under JSDOM is too much of a maintenance
burden in the end:
w3c/respec#1469 (comment)

This update makes Reffy use respecDocWriter (which uses the headless browser
Puppeteer under the hoods) again. This allows to run the latest version of
Respec.

JSDOM is still used to load specifications that do not use Respec and to detect
when a specification uses Respec. Note the code no longer runs external scripts
when it loads specifications: that seems unneeded in practice.

The code still contains minimal monkey patching for JSDOM so because a number
of specifications do have inline scripts that call `window.matchMedia`, which
JSDOM does not support.

This will fix #134.
@saschanaz
Copy link
Member Author

A new possibility: someone made a JS-to-binary compiler that also supports WASM.

Not sure it's any valid, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Low priority Suggestions with no good driving force
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants
@WebReflection @dontcallmedom @marcoscaceres @tidoust @saschanaz and others