Add IDL Names generator tool #489

Merged: 5 commits, Feb 9, 2021

Conversation

tidoust (Member) commented Jan 29, 2021

The IDL names generator takes a crawl report as input and creates one report per referenceable IDL name. Each report details the complete parsed IDL structure that defines the name across all specs.

The parsed IDL structure is a wrapped version of the structure that appears in idlparsed extracts. Here is an example:

{
  "defined": {
    "spec": {
      "title": "Media Capture and Streams",
      "url": "https://www.w3.org/TR/mediacapture-streams/"
    },
    "type": "dictionary",
    "name": "ConstraintSet",
    "inheritance": null,
    "members": [],
    "extAttrs": [],
    "partial": false,
    "href": "https://w3c.github.io/mediacapture-main/#dom-constraintset"
  },
  "extended": [],
  "inheritance": null,
  "includes": []
}

The meaning of the properties is:

  • defined contains the base IDL definition of the name and includes a spec property that describes where the name is defined. Note the URL that appears is the spec identifier, equivalent to the url field in browser-specs, and not necessarily the crawled URL. The rest of the structure is the idlparsed one (with the exception of the href property, see below).
  • extended contains the list of partial definitions that extend the base definition, each of them following the same structure as the one presented here. The order of the list follows the order of appearance in the crawl results, where specs are sorted by URL.
  • inheritance contains the inherited interface when there is one, again following the same structure. The whole inheritance chain appears, meaning that one can follow inheritance properties to get from HTMLVideoElement all the way down to EventTarget.
  • includes contains the list of mixins that the name includes, each of them following the same structure as the one presented here. The order of the list follows the order of appearance in the crawl results, where specs are sorted by URL.
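As an illustration of how the nested `inheritance` properties can be followed, here is a minimal sketch. The `desc` object is hand-written and hypothetical: it only mimics the structure described above, with a shortened chain and without the members, `extAttrs` or `href` properties that real exports carry. The `getInheritanceChain` helper is not part of Reffy.

```javascript
// Hypothetical, hand-written description that mimics the structure above.
// The chain is shortened for the example; real reports list every ancestor.
const desc = {
  defined: { type: 'interface', name: 'HTMLVideoElement' },
  extended: [],
  inheritance: {
    defined: { type: 'interface', name: 'HTMLMediaElement' },
    extended: [],
    inheritance: {
      defined: { type: 'interface', name: 'EventTarget' },
      extended: [],
      inheritance: null,
      includes: []
    },
    includes: []
  },
  includes: []
};

// Follow `inheritance` links from the given description down to the root
// and return the names encountered along the way.
function getInheritanceChain(description) {
  const chain = [];
  for (let node = description; node; node = node.inheritance) {
    chain.push(node.defined.name);
  }
  return chain;
}

console.log(getInheritanceChain(desc).join(' > '));
```

The loop stops as soon as `inheritance` is `null` or missing, which also covers types such as enums that have no `inheritance` property at all.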

Whenever possible, all IDL terms get linked to their definition in the spec through an href property (which uses the crawled URL). That property is computed from the dfns extracts. The property appears at the interface level and also for individual IDL property names, as in:

{
  "defined": {
    "spec": {
      "title": "CSS Spatial Navigation Level 1",
      "url": "https://www.w3.org/TR/css-nav-1/"
    },
    "type": "enum",
    "name": "SpatialNavigationDirection",
    "values": [
      {
        "type": "enum-value",
        "value": "up",
        "href": "https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-up"
      },
      {
        "type": "enum-value",
        "value": "down",
        "href": "https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-down"
      },
      {
        "type": "enum-value",
        "value": "left",
        "href": "https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-left"
      },
      {
        "type": "enum-value",
        "value": "right",
        "href": "https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-right"
      }
    ],
    "extAttrs": [],
    "href": "https://drafts.csswg.org/css-nav-1/#enumdef-spatialnavigationdirection"
  },
  "extended": [],
  "includes": []
}
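A consumer could use these `href` properties to build a lookup table from IDL terms to spec anchors. The sketch below assumes the enum structure shown above (abridged to two values); the `listEnumLinks` helper is hypothetical and not part of Reffy.

```javascript
// Abridged copy of the SpatialNavigationDirection example above
const desc = {
  defined: {
    type: 'enum',
    name: 'SpatialNavigationDirection',
    values: [
      {
        type: 'enum-value',
        value: 'up',
        href: 'https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-up'
      },
      {
        type: 'enum-value',
        value: 'down',
        href: 'https://drafts.csswg.org/css-nav-1/#dom-spatialnavigationdirection-down'
      }
    ],
    href: 'https://drafts.csswg.org/css-nav-1/#enumdef-spatialnavigationdirection'
  },
  extended: [],
  includes: []
};

// Map each enum value to the URL of its definition in the spec.
function listEnumLinks(description) {
  const defined = description.defined;
  if (!defined || defined.type !== 'enum') {
    return {};
  }
  const links = {};
  for (const item of defined.values || []) {
    // `href` is only set "whenever possible": skip values without one
    if (item.href) {
      links[item.value] = item.href;
    }
  }
  return links;
}

console.log(listEnumLinks(desc));
```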

The crawler calls the IDL names generator to create individual exports per IDL name in idlnamesparsed. Individual exports remain relatively small in size (max is 775KB for the WebGL2RenderingContext interface, average size is 25KB). Total folder size is a bit more than 50MB though.

Partially addresses #472 (this does not create the textual definition).

dontcallmedom (Member) left a comment

LGTM with a minor stylistic question

tidoust (Member, Author) commented Feb 1, 2021

Thinking about this a bit more, I believe that this is not a fantastically useful approach for the scenarios that we have in mind, so I'd like to leave this open for the time being.

The initial goal was to give a ready-to-use extract per IDL name that would in particular allow people to serialize the IDL under whatever format that they might want.

That would be straightforward if one could use the serializer in the webidl2.js library. That is not directly possible though, because the serializer actually operates on the "hidden" tokens in the AST ("hidden" in the sense that they disappear when toJSON is called), and not on the "visible" AST (meaning the one that appears in idlparsed extracts).

As such, there would be no easy way to use these extracts directly in WebIDLPedia or Respec.

Off the top of my head, several possibilities:

  1. Amend the WebIDL serializer so that it can operate on the "visible" AST (or create a separate IDL serializer if that turns out to be too difficult). In most cases, the tokens can actually be automatically determined (e.g. an open ( token for function parameters) or imposed (e.g. indentation rules). However, one problem is that we would lose comments in IDL blocks (recorded in trivia tokens), and that seems like a useful thing to preserve.
  2. Simplify the created structure so that it contains the textual representation of the IDL block instead of the parsed structure. This would allow people to use the parser in webidl2.js before they call the serializer. We would expose the logic in this PR to associate terms with definitions somehow. One drawback is that the parsed structures are also useful to run analyses, and the detailed IDL structure would no longer be included in this solution.
  3. Preserve the tokens in the export. That may be very verbose, though.
  4. A middle ground between 1. and 3.: preserve the comment tokens but discard the rest, and update the WebIDL serializer accordingly.

dontcallmedom (Member) commented

I had imagined option (2) when considering that issue when reviewing the pull request - or more specifically, I thought we could add the IDL fragments to the JSON output in a later iteration.

The AST is only half useful because it is not suitable for serialization
purposes. Adding the raw IDL fragment is better from that perspective.

For the export to be readily usable, it needs to include definitions. Without
the AST, these definitions need to be at the root level. In turn, this means
that additional logic is needed when one wants to re-serialize the IDL fragment
to associate the definitions back to the appropriate IDL terms. That logic is
not fantastically straightforward.

Definitions are not included by default, and not included in the crawl. It would
be good to have clear feedback on whether they are going to be useful.

The crawler now also exports one text file per IDL name in an "idlnames" folder.
Each text file contains the full interface (without the fragments that define
the inherited classes).

This update adds an `href` property to `defined` structures that links to the
definition of the underlying IDL name in the spec, when known.

Note the full definition would also appear in the `dfns` array if the generator
is told to generate definitions.
tidoust (Member, Author) commented Feb 9, 2021

I made several updates:

  1. The raw IDL fragment is now reported in a fragment property. This is also done in idlparsed exports.
  2. The AST structure is no longer serialized in idlnamesparsed extracts. It does not seem useful to create 50MB of data that cannot directly be used to re-serialize the IDL fragments.
  3. The href property that links back to the definition of an IDL name in the spec now only appears at the defined level in the exported structure, and not for individual members (since these individual members no longer appear in the exported structure).
  4. The generator can be told to also add all relevant definitions (for all individual members) in a dfns array property at the root level. However, this is not done by default. I suggest waiting until we get practical feedback on whether that's needed. For instance, WebIDLPedia does not currently need this, as far as I can tell. The definitions would add ~50MB of data to the total size of the idlnamesparsed folder (total size is currently <7MB), and one needs specific logic to process it anyway (see below) so I propose not to include that data for now.
  5. The generator now also creates textual versions in an idlnames folder
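To illustrate how the `fragment` property could be consumed without any parsing, here is a minimal sketch that gathers the raw IDL fragments attached to a name. The shape of `desc` is an assumption inferred from this thread (base definition under `defined`, partial definitions carrying `fragment` directly, mixins and inherited interfaces wrapped like the root); the spec URLs, IDL content and the `collectFragments` helper are all hypothetical.

```javascript
// Hypothetical description of an IDL name after the updates listed above
const desc = {
  defined: {
    spec: { title: 'Example Spec', url: 'https://example.org/spec/' },
    fragment: 'interface Foo : Bar {};',
    href: 'https://example.org/spec/#foo'
  },
  extended: [
    {
      spec: { title: 'Example Extension', url: 'https://example.org/ext/' },
      fragment: 'partial interface Foo { undefined baz(); };'
    }
  ],
  inheritance: {
    defined: { fragment: 'interface Bar {};' },
    extended: [],
    inheritance: null,
    includes: []
  },
  includes: [
    { defined: { fragment: 'interface mixin Qux {};' }, extended: [], includes: [] }
  ]
};

// Collect raw fragments: base definition first, then partials, then mixins.
// Inherited interfaces are only added when `withInheritance` is set.
function collectFragments(node, options = {}) {
  const root = node.defined ? node.defined : node;
  const fragments = root.fragment ? [root.fragment] : [];
  for (const partial of node.extended || []) {
    fragments.push(...collectFragments(partial, options));
  }
  for (const mixin of node.includes || []) {
    fragments.push(...collectFragments(mixin, options));
  }
  if (options.withInheritance && node.inheritance) {
    fragments.push(...collectFragments(node.inheritance, options));
  }
  return fragments;
}

console.log(collectFragments(desc).join('\n\n'));
```

Leaving inherited interfaces out by default mirrors what the commit message says about the textual exports, which do not contain the fragments that define the inherited classes.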

I note that the cleanup job running on webref will have to be completed to detect files that need to be deleted in the idlnames and idlnamesparsed folders.

Custom logic to serialize an IDL name with links

Some possible logic to link the members of an IDL fragment with their definitions, using the IDL names generator and the WebIDL writer (relative paths are from the root of the Reffy package and need to be updated if you create that script elsewhere).

const path = require('path');
const { parse, write } = require('webidl2');
const { requireFromWorkingDirectory, expandCrawlResult } = require('./src/lib/util');
const { generateIdlNames } = require('./src/cli/generate-idlnames.js');
const { matchIdlDfn, getExpectedDfnFromIdlDesc } = require('./src/cli/check-missing-dfns');

function templates(idlName, dfns) {
  function getExpectedDfn(name, context) {
    if (context && context.data) {
      const expected = getExpectedDfnFromIdlDesc(context.data, context.parent);
      if (expected) {
        const dfn = dfns.find(dfn => matchIdlDfn(expected, dfn));
        if (dfn) {
          return dfn;
        }
      }
      if (!expected || (expected.type === 'interface')) {
        return getInterfaceDfn(name);
      }
    }
    return null;
  }

  function getInterfaceDfn(name) {
    const expected = { linkingText: [name], type: 'interface', 'for': [] };
    let dfn = null;
    if (idlName.dfns) {
      for (const list of Object.values(idlName.dfns)) {
        dfn = list.find(dfn => matchIdlDfn(expected, dfn));
        if (dfn) {
          break;
        }
      }
    }
    return dfn;
  }

  function getWrappingFunction(lookupFunction) {
    return function (name, context) {
      const dfn = lookupFunction(name, context);
      if (dfn) {
        return `[${name}](${dfn.href})`;
      }
      return name;
    }
  }

  return {
    name: getWrappingFunction(getExpectedDfn),
    nameless: getWrappingFunction(getExpectedDfn),
    reference: getWrappingFunction(getInterfaceDfn)
  };
}


function serialize(idlName) {
  let res = [];

  function serializeNode(node) {
    const root = node.defined ? node.defined : node;
    const spec = root.spec ? root.spec : null;
    let dfns = [];
    if (spec && idlName.dfns && idlName.dfns[spec.url]) {
      dfns = idlName.dfns[spec.url];
    }
    const writeParams = { templates: templates(idlName, dfns) };
    const idlTree = parse(node.defined ? node.defined.fragment : node.fragment);
    const idl = write(idlTree, writeParams);
    res.push(idl);

    if (node.inheritance) {
      serializeNode(node.inheritance);
    }
    if (node.extended) {
      node.extended.forEach(node => serializeNode(node));
    }
    if (node.includes) {
      node.includes.forEach(node => serializeNode(node));
    }
  }

  serializeNode(idlName);
  return res;
}

async function linkify(idlName, crawlPath) {
  const crawlIndex = requireFromWorkingDirectory(path.join(crawlPath, 'index.json'));
  const crawlResults = await expandCrawlResult(crawlIndex, crawlPath);
  const names = generateIdlNames(crawlResults.results, { dfns: true });
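  const desc = names[idlName];
  if (!desc) {
    // Fail with a clear message instead of crashing inside serialize()
    throw new Error(`IDL name "${idlName}" not found in crawl results`);
  }
  const res = serialize(desc);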
  return res.join('\n\n');
}

const idlName = process.argv[2] || 'Document';
const crawlPath = process.argv[3] || 'reports/ed';
linkify(idlName, crawlPath).then(res => {
  console.log('==========');
  console.log(res);
  console.log('==========');
});
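The script above produces Markdown links. A tool that renders HTML could swap in a different wrapping function; the sketch below shows one possible variant. It is only an illustration: `escapeHtml`, `getHtmlWrappingFunction` and the `known` lookup table are hypothetical names, the URL is a placeholder, and the rest of the writer output would still need escaping before being injected in a page.

```javascript
// Escape characters that are special in HTML text and attribute values
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Same shape as getWrappingFunction in the script above, but emits an
// HTML anchor instead of a Markdown link.
function getHtmlWrappingFunction(lookupFunction) {
  return function (name, context) {
    const dfn = lookupFunction(name, context);
    if (dfn && dfn.href) {
      return `<a href="${escapeHtml(dfn.href)}">${escapeHtml(name)}</a>`;
    }
    return escapeHtml(name);
  };
}

// Hypothetical lookup table standing in for the definitions of a spec
const known = {
  Document: { href: 'https://example.org/spec/#document' }
};
const wrap = getHtmlWrappingFunction(name => known[name] || null);

console.log(wrap('Document'));
console.log(wrap('Unknown'));
```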

dontcallmedom (Member) left a comment

I haven't reviewed everything in detail, but the direction and the design choices LGTM; once the webref build is out, I'll try to adapt webidlpedia to use this

@tidoust tidoust merged commit f051482 into master Feb 9, 2021
tidoust added a commit to w3c/webref that referenced this pull request Feb 10, 2021
The "idlnames" and "idlnamesparsed" folders have recently been added, see:
w3c/reffy#489 (comment)

They contain files per IDL Name, which need to be dropped when the IDL names no
longer appear in any of the crawled specs.

Also adjust jobs execution schedules to have the cleanup job run after the
weekly tr crawl, see:
#86 (review)
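The cleanup described in this commit message boils down to comparing the files present in a folder with the IDL names found in the latest crawl. Here is a minimal sketch of that comparison. It is not the actual webref job: the `findStaleFiles` helper is hypothetical, and it assumes that each file is named after its IDL name followed by an extension.

```javascript
const path = require('path');

// Return the files whose base name (without extension) is not in the list
// of IDL names that the latest crawl produced.
function findStaleFiles(files, currentNames) {
  const known = new Set(currentNames);
  return files.filter(file => !known.has(path.parse(file).name));
}

// Hypothetical folder content and crawl result
const files = ['Document.json', 'EventTarget.json', 'RemovedInterface.json'];
const currentNames = ['Document', 'EventTarget'];

console.log(findStaleFiles(files, currentNames));
```

In a real job, the list of files could come from `fs.readdirSync` on each of the two folders, and the returned entries would then be deleted.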
@tidoust tidoust deleted the generate-idlnames branch July 9, 2021 12:04