Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Export WebIDL fragments #3

Closed
dontcallmedom opened this issue Mar 18, 2018 · 31 comments
Closed

Export WebIDL fragments #3

dontcallmedom opened this issue Mar 18, 2018 · 31 comments

Comments

@dontcallmedom
Copy link
Member

To help with https://github.com/GoogleChromeLabs/webidl-diff/blob/dataCollection/data/FetchSpecRunner-WebSpec-journal.js and imports into web-platform-tests

Ideally: one file per spec, whose name match the spec shortname

dontcallmedom added a commit to w3c/reffy that referenced this issue Mar 19, 2018
dontcallmedom added a commit to w3c/reffy that referenced this issue Mar 21, 2018
dontcallmedom added a commit to w3c/reffy that referenced this issue Mar 21, 2018
tidoust pushed a commit to w3c/reffy that referenced this issue Mar 22, 2018
Add support for exporting crawled IDL fragments, for:
w3c/webref#3

The crawler exports IDL fragments for each spec in separate `idl` files on top of the crawl report. It did not make sense to request the user to provide the name of the crawl report and to expect the user to have created an `idl` folder, so the crawler now expects the second parameter to be the name of an existing folder into which it will write the crawl report and create an `idl` folder that contains the IDL exports.
@mdittmer
Copy link

mdittmer commented Apr 2, 2018

@dontcallmedom looks like there's great progress here! I tried running the crawler on one spec to get a feel for things, and saw that there is now an ouput_dir/idl directory. Is this ready to be included in updates to reffy-reports?

@mdittmer
Copy link

mdittmer commented Apr 2, 2018

@lukebjerring FYI: Once this issue is resolved, we should be able to pull per-spec Web IDL fragments from reffy-reports in WPT (and anywhere else that needs them).

@dontcallmedom
Copy link
Member Author

the idl fragments are now published at https://github.com/tidoust/reffy-reports/tree/master/whatwg/idl and will get updated on a daily basis; there are probably still some gaps and additional QA needed.

@lukebjerring
Copy link
Collaborator

Very cool - we can script a PR for replacing the web-platform-tests/interfaces with these to get a quick snapshot of the differences.

@lukebjerring
Copy link
Collaborator

lukebjerring commented Apr 9, 2018

@lukebjerring
Copy link
Collaborator

@foolip - you're one of the best-equipped to summarize and major red flags in the diff. Would you be able to take a look?

@foolip
Copy link
Member

foolip commented May 4, 2018

Overall this looks great, and clearly the overlap is so big that it doesn't really make sense to maintain two pipelines that both achieve roughly the same outcome, we should pick one and fix the bugs :)

From a quick glance it looks like most of the differences should be whitespace or because of spec changes that have happened in between the scraping of the two different tools. Interesting this that aren't that:

2dcontext.idl, eventsource.idl, webmessaging.idl, webstorage.idl and workers.idl: @dontcallmedom where were these extracted from, and how were they split into these files? Any naive approach would results in a single html.idl file, which is actually what I think we should have.

DOM-Parsing.idl fixed the bug with serializeAsCDATA appearing in the webidl-diff tooling.

appmanifest.idl: @dontcallmedom how was the file name chosen? It should be extracted from https://w3c.github.io/manifest/, so do you look for a TR version and use just the name from that? Same question about csp.idl, generic-sensor.idl and a few others.

cssom.idl: reffy correctly deletes the _camel_cased_attribute bits. @dontcallmedom, how was that achieved?

custom-elements.idl and shadow-dom.idl are IDL file I think we don't want to include in WPT, as the specs have been folded into WHATWG DOM+HTML.

dom.idl: rather many changes, more than seem like they could be explained by the time between scraping. @dontcallmedom, what version of the spec was scraped for this?

selectors-api.idl seems to be from the very dated https://www.w3.org/TR/selectors-api/ and shouldn't be included in WPT.

spec.idl: @dontcallmedom seems like a bug in the file naming logic?

@tidoust
Copy link
Member

tidoust commented May 5, 2018

@foolip @lukebjerring, I believe you're running the diff against the wrong report...

As explained in https://tidoust.github.io/reffy-reports/, there are 3 different reports in the repository:

  1. a W3C-centric report based on Editor's Drafts
  2. a W3C-centric report based on latest published versions in /TR/
  3. a WHATWG-centric report

The report you probably want to use is 3, and that's the one @dontcallmedom suggested in #3 (comment) (reffy-reports/whatwg/idl), but you seem to have used 1 (reffy-reports/w3c/idl) in lukebjerring/wpt@33985ec.

This explains most of the diffs you're wondering about:

  • 2dcontext.idl, eventsource.idl, webmessaging.idl, webstorage.idl, workers.idl, custom-elements.idl, shadow-dom.idl only exist in the W3C-centric report. All of these interfaces are in whatwg/idl/html.idl or in whatwg/idl/dom.idl in the WHATWG-centric report.
  • dom.idl is generated from DOM 4.1 in the W3C-centric report, not from DOM LS. Again, whatwg/idl/dom.idl should be correct.

FWIW, the list of specs common to all reports is defined in specs-common.json. The WHATWG-centric report contains these specs, plus those defined in specs-whatwg.json. The W3C-centric report contains the same common specs, plus those defined in specs-w3c.json.

The starting point is indeed the URL in /TR/, and the shortname is used to name the IDL file, which explains why we end up with appmanifest.idl and generic-sensor.idl. The version number is dropped from the shortname, so csp.idl and not csp3.idl. The getShortname function holds that logic. It can certainly be modified and it needs to be fixed in any case, since the spec.idl file you mention shows it does not really generate the name one would expect on non /TR/ URLs...

For cssom.idl, Reffy extracts the IDL from the IDL Index section when it exists. That section does not have the _camel_cased_attribute bits. I do not know how that was achieved in the spec itself, but that's where the magic is...

selectors-api.idl could indeed be removed from specs-common.json.

@foolip
Copy link
Member

foolip commented May 7, 2018

@lukebjerring, could you create a new diff with the WHATWG-centric report? There may still be some odd things remaining, but hopefullt that'll resolve most of the issues.

@tidoust, I've sent w3c/reffy#99, but am also curious about how the list of specs is maintained? I maintain a list for https://foolip.github.io/day-to-day/ using a combination of specref and manual additions, and it would be nice to converge on a single list of "specs that matter" for purposes like these.

tidoust pushed a commit to w3c/reffy that referenced this issue May 7, 2018
@tidoust
Copy link
Member

tidoust commented May 7, 2018

@foolip, the list is maintained manually. We have a small tool that reports on specs we may have missed, but that's all for now.

Also, the list mostly contains specs that define some IDL content for now. For instance, many CSS specs are missing. There is a priori no problem adding them, it's just that Reffy does not really know how to analyse them and report useful info (right now, it will report that these specs may have a problem because it cannot extract any IDL...).

@foolip
Copy link
Member

foolip commented May 7, 2018

Thanks @tidoust! @tabatkins @jstenback FYI that this is another list of specs that's being maintained by hand. Since the needs are slightly different for each list maybe fixing the duplication "problem" isn't warranted, dunno.

@lukebjerring
Copy link
Collaborator

Updated the diff link(s) above. Might have to re-run the up-to-date pipeline to reduce noise, but PTAL

@foolip
Copy link
Member

foolip commented May 7, 2018

lukebjerring/wpt@14b4e10 looks good. @lukebjerring, does that also delete any files that aren't in reffy, so we can see if we've found any specs reffy has not?

@lukebjerring
Copy link
Collaborator

No; it only copies from whatwg/idl into that dir, so you would have to take the inverse of affected files vs existing files to see what we have that they don't.

@foolip
Copy link
Member

foolip commented May 8, 2018

With lukebjerring/wpt@14b4e10 checked out:

$ comm -23 <(find interfaces -name '*.idl' | sort) <(git diff --name-only HEAD~~ | sort)
interfaces/animation-worklet.idl
interfaces/aom.idl
interfaces/background-fetch.idl
interfaces/BackgroundSync.idl
interfaces/battery.idl
interfaces/budget-api.idl
interfaces/compat.idl
interfaces/cookie-store.idl
interfaces/cors-rfc1918.idl
interfaces/css-animations.idl
interfaces/css-conditional.idl
interfaces/css-device-adapt.idl
interfaces/css-fonts.idl
interfaces/css-layout-api.idl
interfaces/css-masking.idl
interfaces/css-transitions.idl
interfaces/dedicated-workers.idl
interfaces/deviceorientation.idl
interfaces/feature-policy.idl
interfaces/filter-effects.idl
interfaces/geolocation-sensor.idl
interfaces/InputDeviceCapabilities.idl
interfaces/keyboard-lock.idl
interfaces/media-capabilities.idl
interfaces/mediacapture-image.idl
interfaces/mediacapture-main.idl
interfaces/mediacapture-output.idl
interfaces/mediacapture-record.idl
interfaces/netinfo.idl
interfaces/picture-in-picture.idl
interfaces/remoteplayback.idl
interfaces/ResizeObserver.idl
interfaces/scroll-animations.idl
interfaces/sensors.idl
interfaces/ServiceWorker.idl
interfaces/shape-detection-api.idl
interfaces/touchevents.idl
interfaces/webappsec-credential-management.idl
interfaces/webappsec-csp-embedded.idl
interfaces/webappsec-csp.idl
interfaces/webappsec-referrer-policy.idl
interfaces/webappsec-secure-contexts.idl
interfaces/webappsec-subresource-integrity.idl
interfaces/web-audio-api.idl
interfaces/web-bluetooth.idl
interfaces/webidl.idl
interfaces/web-midi-api.idl
interfaces/web-nfc.idl
interfaces/webrtc-pc.idl
interfaces/web-share.idl
interfaces/webusb.idl
interfaces/webxr.idl

Seems like there are quite many files missing in reffy, or some other mistake. @tidoust, do you know why interfaces/css-fonts.idl isn't included, for example?

tidoust added a commit to w3c/reffy that referenced this issue May 8, 2018
Goal is to converge on a common list of specs with Web Platform Tests to
generate the right IDL files. See:
w3c/webref#3 (comment)

This update creates a new `specs-cg.json` file that contains the list of specs
that define some IDL content incubated in the WICG or in other Community Groups.
That file is referenced from the W3C / WHATWG crawls.

This update also adds specs that were still missing.
@tidoust
Copy link
Member

tidoust commented May 8, 2018

@foolip, thanks for the diff. I had a look at it. Answer about css-fonts inline...

Specs I'm not clear what to do with:

  • cookie-store: I see https://github.com/WICG/cookie-store but where is the IDL defined?
  • dedicated-workers: am I right to think that this is defined in HTML?
  • deviceorientation: no longer maintained, so not included in Reffy, but I guess we could add it nevertheless.

Specs we have under a different name in Reffy, because IDL filenames are named after the /TR/ shortname:

  • battery => battery-status
  • mediacapture-image => image-capture
  • mediacapture-main => mediacapture-streams
  • mediacapture-output => audio-output
  • mediacapture-record => mediastream-recording
  • remoteplayback => remote-playback
  • sensors => generic-sensor
  • serviceworkers => service-workers
  • touchevents => touch-events
  • webappsec-credential-management => credential-management
  • webappsec-csp => CSP
  • webappsec-referrer-policy => referrer-policy
  • webappsec-secure-contexts => secure-contexts
  • web-audio-api => webaudio
  • webidl => WebIDL
  • webrtc-pc => webrtc

Then there are specs we have and for which Reffy failed to generate IDL files. Problem is Reffy only generates these files when the IDL is valid, and the IDL parser reports some errors for:

  • css-fonts: partial interface CSSRule: Missing semicolon after interface
  • css-layout: interface BreakToken: Attributes cannot accept sequence types

We should probably rather save the IDL file no matter what. I created w3c/reffy#103 to track this.

I added all other missing specs to Reffy (adding specs developed in the WICG and other CGs in particular). Local tests suggest things work fine, but note that some specs will appear under a different name for the same reason as above:

  • webappsec-csp-embedded => will be csp-embedded-enforcement
  • webappsec-subresource-integrity => will be SRI
  • web-midi-api => will be webmidi
  • webshare => will be web-share

Also, note some of the specs I added have invalid IDL, so IDL files won't appear for now either:

  • css-conditional: partial interface CSSRule: Missing semicolon after interface
  • ResizeObserver: callback ResizeObserverCallback: Unterminated callback
  • web-bluetooth: dictionary BluetoothPermissionData: Required member must not have a default

NB: The new IDL files may not appear immediately because the server on which Reffy runs daily to update this repo does not seem to be willing to pick up the latest version of the code. That's in the hands of @dontcallmedom...

@foolip
Copy link
Member

foolip commented May 9, 2018

cookie-store: I see https://github.com/WICG/cookie-store but where is the IDL defined?

Ah, far from all of the stuff in interfaces/ is created from the webidl-diff tooling, it's a mix still. This one doesn't have a spec yet, the IDL is used in a tentative test added by @costan in web-platform-tests/wpt#9455

dedicated-workers: am I right to think that this is defined in HTML?

Yep, https://html.spec.whatwg.org/multipage/workers.html#the-workerglobalscope-common-interface and other bits.

deviceorientation: no longer maintained, so not included in Reffy, but I guess we could add it nevertheless.

Yep. https://w3c.github.io/deviceorientation/spec-source-orientation.html is the only thing that defines this at all, and I think it should be tested in WPT. If we don't get it from Reffy, then manually maintained.

@Honry listed a bunch of the name mismatch issues in web-platform-tests/wpt#9937 and it's not clear what to do about it. Is the idea that these mismatches will just remain forever, or should the next TR publication use a new name of the github.io repos be renamed?

Then there are specs we have and for which Reffy failed to generate IDL files. Problem is Reffy only generates these files when the IDL is valid

Sent PR and create issue:
w3c/csswg-drafts#2661
w3c/csswg-drafts#2662

css-conditional: partial interface CSSRule: Missing semicolon after interface

Tried to fix in source but it looks OK already? https://github.com/w3c/csswg-drafts/blob/0f376507e8aa827d4dbb65a76ade819cfbb73657/css-conditional-3/Overview.bs#L819

ResizeObserver: callback ResizeObserverCallback: Unterminated callback

Also seems OK already:
https://github.com/WICG/ResizeObserver/blob/351d5d40e7f30e26fcefdb31db5d1bc34474c1de/index.bs#L176

web-bluetooth: dictionary BluetoothPermissionData: Required member must not have a default

Sent WebBluetoothCG/web-bluetooth#395.

Thanks for all of the investigation @tidoust! I think we should be getting pretty close to parity with what we already have in interfaces/. In addition, I wonder what process needs to put in place so that Reffy discovers new specs as they appear, without requiring people to know that Reffy exists and needs to be updated. I have a setup in https://github.com/foolip/day-to-day which while fragile does discover new specs without my intervention. Example. Would you be interested in something similar for Reffy, or what's your ideal state of affairs on spec list maintenance?

@tidoust
Copy link
Member

tidoust commented May 11, 2018

Thanks for following-up with PRs to fix the WebIDL issues!

For css-conditional, that's a bug on our side. For some reason, the W3C API does not manage to extract the URL of the Editor's Draft, so Reffy parses the spec published in /TR/ instead, which has the problem. At a minimum, Reffy should warn when that happens (I created w3c/reffy#104 to track this). I'll also investigate why the API does not know anything about the ED.

For ResizeObserver, that was fixed in index.bs 3 weeks ago but index.html hasn't been re-generated since early 2017: https://github.com/WICG/ResizeObserver/blob/master/index.html

@Honry listed a bunch of the name mismatch issues in web-platform-tests/wpt#9937 and it's not clear what to do about it. Is the idea that these mismatches will just remain forever, or should the next TR publication use a new name of the github.io repos be renamed?

I'm not entirely clear what you mean with your last sentence. There are multiple places where naming rules can change. In Reffy, we could perhaps be persuaded to switch to the github.io repos when that's possible. We used /TR/ shortnames because these are stable once created and there are cases where it's not necessarily clear how to extract a useful name from the GitHub repo (e.g. https://webassembly.github.io/spec/js-api/index.html). If the IDL files are useful for WPT, we're definitely interested to converge on something that works for you in any case.

It seems valuable to align the repo names with the /TR/ shortnames. Once a /TR/ shortname has been chosen, it mostly cannot be changed. But it should still be easy to rename GitHub repos as needed. And we should encourage the reuse of the repo name for the /TR/ shortname when possible.

In addition, I wonder what process needs to put in place so that Reffy discovers new specs as they appear, without requiring people to know that Reffy exists and needs to be updated. I have a setup in https://github.com/foolip/day-to-day which while fragile does discover new specs without my intervention. Example. Would you be interested in something similar for Reffy, or what's your ideal state of affairs on spec list maintenance?

Give us a bit of time to look into it, but I'm both interested by mechanisms to automate the discovery of new specs, as well as by discussions to reduce/align the sources that contain info about specs (Specref, Reffy, your day-to-day repository, we also maintain a list with links to CanIUse and platform status platforms in our roadmap framework, etc.).

@foolip
Copy link
Member

foolip commented May 16, 2018

It seems valuable to align the repo names with the /TR/ shortnames. Once a /TR/ shortname has been chosen, it mostly cannot be changed. But it should still be easy to rename GitHub repos as needed. And we should encourage the reuse of the repo name for the /TR/ shortname when possible.

Perhaps a good case to try this with is webrtc vs. webrtc-pc. There's now also https://www.w3.org/TR/webrtc-stats/, and I don't think the WG would be too keen on renaming the main spec back to just webrtc.

I wonder how the scraping works. If you start at a /TR/ link, do you discover the latest version only if it's linked from the TR, and do you pick a name based on the original TR link, or a TR link in the final spec from which IDL is extracted?

@tidoust
Copy link
Member

tidoust commented May 16, 2018

Perhaps a good case to try this with is webrtc vs. webrtc-pc. There's now also https://www.w3.org/TR/webrtc-stats/, and I don't think the WG would be too keen on renaming the main spec back to just webrtc.

Right. To clarify, I'm not saying that /TR/ shortnames are better names. I'm saying they're more stable. Process-wise, the WG cannot easily change the shortname of the published spec: it will remain webrtc. Whereas the WG is entirely free to rename the GitHub repository whenever it wants. Whether webrtc-pc is a better repo name than webrtc is a different matter :)

I wonder how the scraping works. If you start at a /TR/ link, do you discover the latest version only if it's linked from the TR

Mostly. From the /TR/ link, Reffy gets the link to the Editor's Draft from the W3C API. The data returned by that API is created/updated when a spec is published by scraping the links it contains. In a few cases, the data is curated or added by hand afterwards. Typically, the link to the Editor's Draft was not extracted a few years ago, which explains why the URL of the ED was not returned for the CSS Conditional Rules spec (this has now been fixed, and Reffy correctly extracts the IDL from that spec). The URL of the ED may also be added to the database used by the W3C API afterwards even when the /TR/ document does not link to it.

and do you pick a name based on the original TR link, or a TR link in the final spec from which IDL is extracted?

The name is based on the original TR link. But would that make any difference?

@foolip
Copy link
Member

foolip commented May 16, 2018

The name is based on the original TR link. But would that make any difference?

If the ED doesn't link to TR at all is the case I was considering. Maybe this doesn't happen.

@tabatkins
Copy link
Member

It does happen before FPWD publication.

@tidoust
Copy link
Member

tidoust commented May 17, 2018

Right, all specs start as Editor's Drafts which don't link to a document published on TR (or worse, which link to a document published on TR that does not exist yet).

Summarizing, there are two open questions here for W3C specs:

  1. Whether to use TR or ED URLs as input to Reffy. Reffy gathers additional info about each spec from the W3C API before it fetches the spec, and the W3C API only knows about the TR shortname, so I need to stick to the TR URL for now. I'm happy to consider another approach later on.
  2. Whether to base the name of the IDL files generated by Reffy on the TR shortname, or on the GitHub repo name. Here, I'm fine using the name of the GitHub repo if that seems better from a WPT perspective. Is that what you'd prefer to have, @foolip?

@foolip
Copy link
Member

foolip commented May 17, 2018

I suppose that I'd prefer whichever rule matches most of the existing files in wpt/interfaces/ and wpt/* dirnames. Is #3 (comment) still the complete list of mismatched names? In that first long list the bits on the right seem to be the names used in WPT more often, and I think those are the TR names? If so, maybe we should try that and see if we get mismatches that cause trouble? (The webrtc one is particularly annoying, but I guess one-off fixes would also be annoying.)

@dontcallmedom
Copy link
Member Author

My understanding had been that root directories in wpt are supposed to match TR names (although the actual rule is bit fuzzier than that).

A possible additional issue with using github repo names is that some groups (notably css) many specs in a single repository.

Which ever approach we settle on needs to take into account:

  • shortname changes
  • versioning
  • possible naming conflicts?

On the W3C side, we could try to get more oversight on the naming of github repos in the w3c github organization (which right now is mostly a free for all) if that helps keeping better sync across these various projects.

@foolip
Copy link
Member

foolip commented May 18, 2018

Let's just try your current scheme for naming IDL files.

How often does the pipeline run, and would it be possible to save enough data about where the snippets came from to add boilerplate at the top of IDL files in wpt/interfaces/ like Luke has been doing?

tidoust added a commit to w3c/reffy that referenced this issue May 19, 2018
See: w3c/webref#3 (comment)

Boilerplate is:

// GENERATED CONTENT - DO NOT EDIT
// Content of this file was automatically extracted from the
// "{spec title}" spec.
// See: {crawled url}

Note that we could also easily add the spec date to the boilerplate, but doing
so would mean that IDL files will be updated in GitHub whenever the spec
changes, even when the changes do not touch on the IDL defined in the spec.
@tidoust
Copy link
Member

tidoust commented May 23, 2018

How often does the pipeline run

Once per day for now.

would it be possible to save enough data about where the snippets came from to add boilerplate at the top of IDL files in wpt/interfaces/ like Luke has been doing?

Now done in Reffy but the auto-update script we have on the server seems to be broken (@dontcallmedom?), so note the boilerplate does not yet appear in the report.

@tidoust
Copy link
Member

tidoust commented Jun 1, 2018

@foolip, @lukebjerring Note IDL files published in the report now contain the boilerplate. Let me know if you need anything else.

@foolip
Copy link
Member

foolip commented Jun 1, 2018

Awesome, thanks @tidoust!

@tidoust
Copy link
Member

tidoust commented Jun 6, 2018

Closing this issue now. Feel free to raise more specific issues as needed.

@tidoust tidoust closed this as completed Jun 6, 2018
@foolip
Copy link
Member

foolip commented Jun 11, 2018

Thank you, @tidoust! @dontcallmedom, is the updating run as a daily Travis cron job? The cadence of @dontcallmedom-bot makes it look like that, just checking.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants