Add IDL Names generator tool #489

tidoust · 2021-01-29T11:13:52Z

The IDL names generator takes a crawl report as input and creates one report per referenceable IDL name. Each report details the complete parsed IDL structure that defines the name across all specs.

The parsed IDL structure is a wrapped version of the structure that appears in idlparsed extracts. Here is an example:

{
  "defined": {
    "spec": {
      "title": "Media Capture and Streams",
      "url": "https://www.w3.org/TR/mediacapture-streams/"
    },
    "type": "dictionary",
    "name": "ConstraintSet",
    "inheritance": null,
    "members": [],
    "extAttrs": [],
    "partial": false,
    "href": "https://w3c.github.io/mediacapture-main/#dom-constraintset"
  },
  "extended": [],
  "inheritance": null,
  "includes": []
}

The meaning of the properties is:

- defined contains the base IDL definition of the name and includes a spec property that describes where the name is defined. Note that the URL that appears there is the spec identifier, equivalent to the url field in browser-specs, and not necessarily the crawled URL. The rest of the structure is the idlparsed one (with the exception of the href property, see below).
- extended contains the list of partial definitions that extend the base definition, each of them following the same structure as the one presented here. The order of the list follows the order of appearance in the crawl results, where specs are sorted by URL.
- inheritance contains the inherited interface when there is one, again following the same structure. The whole inheritance chain appears, meaning that one can follow inheritance properties to get from HTMLVideoElement all the way down to EventTarget.
- includes contains the list of mixins that the name includes, each of them following the same structure as the one presented here. The order of the list follows the order of appearance in the crawl results, where specs are sorted by URL.
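
To illustrate how the inheritance property nests, here is a minimal sketch that walks the chain and lists the names it encounters. The inheritanceChain helper and the abbreviated sample object are hypothetical and only meant for illustration; only the defined, name and inheritance properties come from the structure described above.

```javascript
// Hypothetical helper: follow `inheritance` properties from a wrapped
// structure and collect the IDL names found along the way.
function inheritanceChain(desc) {
  const names = [];
  let node = desc;
  while (node) {
    if (node.defined && node.defined.name) {
      names.push(node.defined.name);
    }
    // `inheritance` is null (or absent) at the end of the chain
    node = node.inheritance;
  }
  return names;
}

// Abbreviated sample: real exports carry many more properties
const videoElement = {
  defined: { name: 'HTMLVideoElement' },
  inheritance: {
    defined: { name: 'HTMLMediaElement' },
    inheritance: {
      defined: { name: 'HTMLElement' },
      inheritance: {
        defined: { name: 'Element' },
        inheritance: {
          defined: { name: 'Node' },
          inheritance: {
            defined: { name: 'EventTarget' },
            inheritance: null
          }
        }
      }
    }
  }
};

console.log(inheritanceChain(videoElement).join(' > '));
```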

Whenever possible, all IDL terms get linked to their definition in the spec through an href property (which uses the crawled URL). That property is computed from the dfns extracts. The property appears at the interface level and also for individual IDL property names, as in:

{
  "defined": {
    "spec": {
      "title": "CSS Spatial Navigation Level 1",
      "url": "https://www.w3.org/TR/css-nav-1/"
    },
    "type": "enum",
    "name": "SpatialNavigationDirection",
    "values": [
      {
        "type": "enum-value",
        "value": "up",
        "href": "https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-up"
      },
      {
        "type": "enum-value",
        "value": "down",
        "href": "https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-down"
      },
      {
        "type": "enum-value",
        "value": "left",
        "href": "https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-left"
      },
      {
        "type": "enum-value",
        "value": "right",
        "href": "https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-right"
      }
    ],
    "extAttrs": [],
    "href": "https://drafts.csswg.org/css-nav-1/#enumdef-spatialnavigationdirection"
  },
  "extended": [],
  "includes": []
}
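
As a rough sketch of how the href properties in the structure above could be consumed, the following code lists the links found at the top level and on individual members. The listHrefs helper is hypothetical, and the assumption that members expose either a name or a value property is based on the examples in this description only.

```javascript
// Hypothetical helper: list the links found in a `defined` structure,
// at the top level and for individual members or enum values.
function listHrefs(desc) {
  const root = desc.defined;
  const links = [];
  if (root.href) {
    links.push({ term: root.name, href: root.href });
  }
  // Assumption: enums list their items in `values`, other types in `members`
  for (const member of root.values || root.members || []) {
    if (member.href) {
      links.push({ term: member.name || member.value, href: member.href });
    }
  }
  return links;
}

// Abbreviated version of the enum example above
const direction = {
  defined: {
    type: 'enum',
    name: 'SpatialNavigationDirection',
    values: [
      {
        type: 'enum-value',
        value: 'up',
        href: 'https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-up'
      },
      {
        type: 'enum-value',
        value: 'down',
        href: 'https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-down'
      }
    ],
    href: 'https://drafts.csswg.org/css-nav-1/#enumdef-spatialnavigationdirection'
  },
  extended: [],
  includes: []
};

for (const link of listHrefs(direction)) {
  console.log(`${link.term}: ${link.href}`);
}
```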

The crawler calls the IDL names generator to create individual exports per IDL name in idlnamesparsed. Individual exports remain relatively small in size (max is 775KB for the WebGL2RenderingContext interface, average size is 25KB). Total folder size is a bit more than 50MB though.

Partially addresses #472 (this does not create the textual definition).


dontcallmedom

LGTM with a minor stylistic question

src/cli/generate-idlnames.js

tidoust · 2021-02-01T13:56:47Z

Thinking about this a bit more, I believe that this is not a fantastically useful approach for the scenarios that we have in mind, so I'd like to leave this open for the time being.

The initial goal was to give a ready-to-use extract per IDL name that would, in particular, allow people to serialize the IDL in whatever format they might want.

That would be straightforward if one could use the serializer in the webidl2.js library. That is not directly possible though, because the serializer actually operates on the "hidden" tokens in the AST ("hidden" in the sense that they disappear when toJSON is called), and not on the "visible" AST (meaning the one that appears in idlparsed extracts).

As such, there would be no easy way to use these extracts directly in WebIDLPedia or Respec.

Off the top of my head, several possibilities:

1. Amend the WebIDL serializer so that it can operate on the "visible" AST (or create a separate IDL serializer if that turns out to be too difficult). In most cases, the tokens can actually be determined automatically (e.g. an open ( token for function parameters) or imposed (e.g. indentation rules). However, one problem is that we would lose comments in IDL blocks (recorded in trivia tokens), and that seems like a useful thing to preserve.
2. Simplify the created structure so that it contains the textual representation of the IDL block instead of the parsed structure. This would allow people to use the parser in webidl2.js before they call the serializer. We would expose the logic in this PR to associate terms with definitions somehow. However, the parsed structures are also useful to run analyses, and the detailed IDL structure would no longer be included with this solution.
3. Preserve the tokens in the export. That may be very verbose, though.
4. A middle ground between 1. and 3.: preserve the comment tokens but discard the rest, and update the WebIDL serializer accordingly.
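
To give an idea of what the first option involves, here is a minimal sketch of a serializer that works on the "visible" AST only, restricted to the simplest case of an enum. The serializeEnum function is hypothetical and is not part of webidl2.js. Punctuation and indentation are imposed by the code, and any comment in the original IDL block would be lost, as noted above.

```javascript
// Hypothetical serializer for the "visible" AST of an enum. Tokens such as
// braces, commas and the final semicolon are re-created from the structure.
function serializeEnum(node) {
  if (node.type !== 'enum') {
    throw new Error(`Unsupported IDL type: ${node.type}`);
  }
  const values = node.values
    // JSON.stringify adds the surrounding double quotes
    .map(item => `  ${JSON.stringify(item.value)}`)
    .join(',\n');
  return `enum ${node.name} {\n${values}\n};`;
}

// Abbreviated "visible" AST, as it would appear in an idlparsed extract
const direction = {
  type: 'enum',
  name: 'SpatialNavigationDirection',
  values: [
    { type: 'enum-value', value: 'up' },
    { type: 'enum-value', value: 'down' }
  ],
  extAttrs: []
};

console.log(serializeEnum(direction));
```

A real implementation would need one such function per IDL construct (interfaces, dictionaries, operations, extended attributes and so on), which is where most of the effort would go.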

dontcallmedom · 2021-02-01T14:00:43Z

I had imagined option (2) when considering that issue while reviewing the pull request. More specifically, I thought we could add the IDL fragments to the JSON output in a later iteration.

The AST is only half useful because it is not suitable for serialization purposes. Adding the raw IDL fragment is better from that perspective. For the export to be readily usable, it needs to include definitions. Without the AST, these definitions need to be at the root level. In turn, this means that additional logic is needed when one wants to re-serialize the IDL fragment, to associate the definitions back with the appropriate IDL terms. That logic is not fantastically straightforward. Definitions are not included by default, and not included in the crawl. It would be good to have clear feedback on whether they are going to be useful.

The crawler now also exports one text file per IDL name in an "idlnames" folder. Each text file contains the full interface (without the fragments that define the inherited classes).

This update adds an `href` property to `defined` structures that links to the definition of the underlying IDL name in the spec, when known. Note the full definition would also appear in the `dfns` array if the generator is told to generate definitions.

tidoust · 2021-02-09T11:06:15Z

I made several updates:

- The raw IDL fragment is now reported in a fragment property. This is also done in idlparsed exports.
- The AST structure is no longer serialized in idlnamesparsed extracts. It does not seem useful to create 50MB of data that cannot directly be used to re-serialize the IDL fragments.
- The href property that links back to the definition of an IDL name in the spec now only appears at the defined level in the exported structure, and not for individual members (since these individual members no longer appear in the exported structure).
- The generator can be told to also add all relevant definitions (for all individual members) in a dfns array property at the root level. However, this is not done by default. I suggest waiting until we get practical feedback on whether that's needed. For instance, WebIDLPedia does not currently need this, as far as I can tell. The definitions would add ~50MB of data to the total size of the idlnamesparsed folder (total size is currently <7MB), and one needs specific logic to process them anyway (see below), so I propose not to include that data for now.
- The generator now also creates textual versions in an idlnames folder.
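
As a rough illustration of how the optional dfns property could be looked up, here is a simplified sketch. It assumes that dfns groups definitions per spec URL and that each definition carries linkingText, type and href properties; this layout is inferred from how the generator's output is consumed in this thread and should be checked against actual output. The findDfn helper is hypothetical, and the real matching logic in Reffy (matchIdlDfn) is more involved.

```javascript
// Hypothetical lookup: find a definition by linking text and type in the
// `dfns` property (assumed to map spec URLs to lists of definitions).
function findDfn(desc, linkingText, type) {
  for (const list of Object.values(desc.dfns || {})) {
    const dfn = list.find(candidate =>
      candidate.type === type &&
      candidate.linkingText.includes(linkingText));
    if (dfn) {
      return dfn;
    }
  }
  return null;
}

// Abbreviated sample based on the ConstraintSet example
const constraintSet = {
  defined: { type: 'dictionary', name: 'ConstraintSet' },
  dfns: {
    'https://www.w3.org/TR/mediacapture-streams/': [
      {
        linkingText: ['ConstraintSet'],
        type: 'dictionary',
        for: [],
        href: 'https://w3c.github.io/mediacapture-main/#dom-constraintset'
      }
    ]
  }
};

const dfn = findDfn(constraintSet, 'ConstraintSet', 'dictionary');
console.log(dfn ? dfn.href : 'No definition found');
```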

I note that the cleanup job running on webref will have to be extended to detect files that need to be deleted in the idlnames and idlnamesparsed folders.
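
The detection step could look like the following minimal sketch. It is not the actual webref cleanup job: the findStaleFiles helper is hypothetical, and it assumes that each file in these folders is named after the IDL name it describes, followed by an extension.

```javascript
const path = require('path');

// Hypothetical helper: given the files found in an IDL names folder and the
// IDL names produced by the latest crawl, return the files to delete.
// Assumption: files are named "<IDL name>.<extension>".
function findStaleFiles(files, currentNames) {
  const known = new Set(currentNames);
  return files.filter(file => {
    const name = path.basename(file, path.extname(file));
    return !known.has(name);
  });
}

console.log(findStaleFiles(
  ['Document.idl', 'ConstraintSet.idl', 'NoLongerDefined.idl'],
  ['Document', 'ConstraintSet']
));
```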

Custom logic to serialize an IDL name with links

Here is some possible logic to link the members of an IDL fragment with their definitions, using the IDL names generator and the WebIDL writer. Relative paths are from the root of the Reffy package and need to be updated if you create that script elsewhere.

const path = require('path');
const { parse, write } = require('webidl2');
const { requireFromWorkingDirectory, expandCrawlResult } = require('./src/lib/util');
const { generateIdlNames } = require('./src/cli/generate-idlnames.js');
const { matchIdlDfn, getExpectedDfnFromIdlDesc } = require('./src/cli/check-missing-dfns');

function templates(idlName, dfns) {
  function getExpectedDfn(name, context) {
    if (context && context.data) {
      const expected = getExpectedDfnFromIdlDesc(context.data, context.parent);
      if (expected) {
        const dfn = dfns.find(dfn => matchIdlDfn(expected, dfn));
        if (dfn) {
          return dfn;
        }
      }
      if (!expected || (expected.type === 'interface')) {
        return getInterfaceDfn(name);
      }
    }
    return null;
  }

  function getInterfaceDfn(name) {
    const expected = { linkingText: [name], type: 'interface', 'for': [] };
    let dfn = null;
    if (idlName.dfns) {
      for (const list of Object.values(idlName.dfns)) {
        dfn = list.find(dfn => matchIdlDfn(expected, dfn));
        if (dfn) {
          break;
        }
      }
    }
    return dfn;
  }

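  // Wrap a lookup function so that names with a known definition get
  // rendered as Markdown links, and other names are returned as-is.
  function getWrappingFunction(lookupFunction) {
    return function (name, context) {
      const dfn = lookupFunction(name, context);
      if (dfn) {
        return `[${name}](${dfn.href})`;
      }
      return name;
    };
  }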

  return {
    name: getWrappingFunction(getExpectedDfn),
    nameless: getWrappingFunction(getExpectedDfn),
    reference: getWrappingFunction(getInterfaceDfn)
  };
}


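// Serialize the base definition of an IDL name, followed by the definitions
// it inherits from, its partial definitions and the mixins it includes.
function serialize(idlName) {
  const res = [];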

  function serializeNode(node) {
    const root = node.defined ? node.defined : node;
    const spec = root.spec ? root.spec : null;
    let dfns = [];
    if (spec && idlName.dfns && idlName.dfns[spec.url]) {
      dfns = idlName.dfns[spec.url];
    }
    const writeParams = { templates: templates(idlName, dfns) };
    const idlTree = parse(node.defined ? node.defined.fragment : node.fragment);
    const idl = write(idlTree, writeParams);
    res.push(idl);

    if (node.inheritance) {
      serializeNode(node.inheritance);
    }
    if (node.extended) {
      node.extended.forEach(child => serializeNode(child));
    }
    if (node.includes) {
      node.includes.forEach(child => serializeNode(child));
    }
  }

  serializeNode(idlName);
  return res;
}

async function linkify(idlName, crawlPath) {
  const crawlIndex = requireFromWorkingDirectory(path.join(crawlPath, 'index.json'));
  const crawlResults = await expandCrawlResult(crawlIndex, crawlPath);
  const names = generateIdlNames(crawlResults.results, { dfns: true });
  const desc = names[idlName];
  if (!desc) {
    // Fail with a clear message instead of a TypeError in serialize()
    throw new Error(`Unknown IDL name: ${idlName}`);
  }
  const res = serialize(desc);
  return res.join('\n\n');
}

const idlName = process.argv[2] || 'Document';
const crawlPath = process.argv[3] || 'reports/ed';
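linkify(idlName, crawlPath).then(res => {
  console.log('==========');
  console.log(res);
  console.log('==========');
}).catch(err => {
  // Report failures (e.g. missing crawl report) and exit with an error code
  console.error(err);
  process.exit(1);
});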

dontcallmedom

I haven't reviewed everything in detail, but the direction and the design choices LGTM; once the webref build is out, I'll try to adapt webidlpedia to use this

The "idlnames" and "idlnamesparsed" folders have recently been added, see: w3c/reffy#489 (comment). They contain one file per IDL name, which needs to be dropped when the IDL name no longer appears in any of the crawled specs. Also adjust job execution schedules so that the cleanup job runs after the weekly TR crawl, see: #86 (review).

tidoust requested a review from dontcallmedom January 29, 2021 11:13

dontcallmedom approved these changes Jan 29, 2021


tidoust mentioned this pull request Jan 29, 2021

Detect cycles in inheritance chains #490

Open

Make the Earth flat per feedback

f7ea46e

tidoust mentioned this pull request Feb 3, 2021

Extract parsed CSS property value definitions #494

Closed

tidoust added 3 commits February 8, 2021 14:58

Save IDL fragments in idlnames folders

2f08d44


tidoust requested a review from dontcallmedom February 9, 2021 11:06

dontcallmedom approved these changes Feb 9, 2021


tidoust merged commit f051482 into master Feb 9, 2021

tidoust mentioned this pull request Feb 10, 2021

Adjust cleanup job to also look at IDL names folders w3c/webref#88

Merged

tidoust deleted the generate-idlnames branch July 9, 2021 12:04
