Add JSON schemas for extracts and export validation function #1075

Merged · 3 commits from `schemas` into `main` · Sep 26, 2022
Conversation

@tidoust (Member) commented Sep 25, 2022:

This creates JSON schemas for all extracts and files generated by Reffy. A new `getSchemaValidationFunction` function is now exported by Reffy to validate data against the schemas. It is typically intended for use in Webref to validate curated data before publication.
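A minimal sketch of the intended Webref-side usage follows. The extract name and the shape of the validator's return value are assumptions, not part of this PR's text; `curatedData` is a hypothetical placeholder.

```js
const { getSchemaValidationFunction } = require('reffy');

// Hypothetical curated extract to validate before publication.
const curatedData = { spec: { title: 'Some spec' }, dfns: [] };

// Assumption: the function takes the name of an extract type and returns
// a validation function, or null when no schema exists for that name
// (see the review discussion below).
const validate = getSchemaValidationFunction('dfns');
if (validate) {
  const errors = validate(curatedData);
  if (errors) {
    console.error('Curated data does not match the schema:', errors);
  }
}
```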

Some minor adjustments were made along the way to make the data more consistent and keep the schemas relatively simple:

  • The dfns extractor now always returns a `heading` property, defaulting to the top of the page.
  • The events extractor now gets rid of null properties altogether (see the sketch after this list).
  • The events post-processor no longer outputs a null `href`.
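As an illustration of the second adjustment, dropping null-valued properties from an extracted event object can be done generically. This is a sketch under assumed data shapes, not the actual extractor code:

```js
// Sketch only: the argument stands for one entry in an events extract.
function dropNullProperties(event) {
  return Object.fromEntries(
    Object.entries(event).filter(([, value]) => value !== null));
}

dropNullProperties({ type: 'click', bubbles: true, href: null });
// → { type: 'click', bubbles: true }
```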

Existing tests on extractors were updated to also check the schema of the extracted data. More tests could be added to check post-processors and the crawler as a whole.

Note that this update also includes a minor bug fix in the sorting code of the events post-processor so that it can run on not-yet-curated events extracts.
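The failure was typical of comparators that assume a property is always present. A hedged sketch of a null-tolerant comparison (hypothetical property name, not the actual fix in this PR):

```js
// Not-yet-curated extracts may still contain entries without a `type`.
const events = [{ type: 'click' }, {}, { type: 'abort' }];
events.sort((a, b) => (a.type ?? '').localeCompare(b.type ?? ''));
// → [{}, { type: 'abort' }, { type: 'click' }]
```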

This is based on (and intends to replace the current form of) w3c/webref#731.

tidoust added a commit to w3c/webref that referenced this pull request Sep 26, 2022
Linked to w3c/reffy#1075

This will only work once a version of Reffy has been released that exposes the
appropriate schema validation function.
@dontcallmedom (Member) left a comment:


gosh, another amazing PR :)

a few suggestions, but feel free to merge as is and consider them as input for later :)

try {
  schema = require(path.join(schemasFolder, schemaFile));
}
catch (err) {
@dontcallmedom (Member):
shouldn't this throw?

@tidoust (Member Author):

It threw in the first version I had, but then it seemed a bit rude to throw when the caller requests an unknown schema. Typically, returning null makes it possible to avoid a try...catch in Webref:
https://github.com/w3c/webref/pull/731/files#diff-128accda0c348bbd52246995174cc7409be8cf4308f3a1389b5d9e47a556cb55R15-R18
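For comparison, the two calling patterns would look roughly like this. This is a sketch: `schemaName` and `data` are placeholders, and the throwing variant is the hypothetical alternative under discussion:

```js
const { getSchemaValidationFunction } = require('reffy');

const schemaName = 'dfns';  // placeholder
const data = {};            // placeholder for curated data

// Hypothetical throwing variant: callers that legitimately skip
// validation would need a try...catch.
let validate = null;
try {
  validate = getSchemaValidationFunction(schemaName);
}
catch (err) {
  // No schema for this name; skip validation.
}

// Null-returning variant: the same intent is a simple check.
const maybeValidate = getSchemaValidationFunction(schemaName);
if (maybeValidate) {
  maybeValidate(data);
}
```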

@dontcallmedom (Member):

maybe this should distinguish between situations where we know validation doesn't make sense (as in the webref example) and situations where the requested schema name doesn't make sense (e.g. because there was a typo in the name)

@tidoust (Member Author):

I'm not sure I understand how the code is supposed to distinguish between these two situations (in both cases, all the code could tell is that the schema file does not exist). Why does it matter in practice?

@dontcallmedom (Member):

why it matters: to avoid a situation where you think you're all good because you didn't get a complaint, when in fact you made a typo in your schema name.

how: I assume we know the patterns and the names that can be applied to it; anything that doesn't match that should throw (but my assumption may very well be wrong :)

@tidoust (Member Author):

I was trying not to hardcode the list of names so that we don't have to update that piece of code when we add more extraction tools and post-processing modules. The code can tell which names are correct for extraction modules by looking at a property in src/browserlib/reffy.json. There is no automatic way to tell what files post-processing modules that operate at the crawl level may generate, though.
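A hypothetical sketch of what such a check could look like for extraction modules. The structure of src/browserlib/reffy.json is assumed here (a list of module descriptors exposing a `property` name, per the comment above); the actual file may differ:

```js
// Assumed structure of src/browserlib/reffy.json; purely illustrative.
const modules = require('./src/browserlib/reffy.json');
const knownNames = new Set(modules.map(mod => mod.property));

function assertKnownSchemaName(schemaName) {
  // Does not cover post-processors that operate at the crawl level,
  // which is the limitation mentioned above.
  if (!knownNames.has(schemaName)) {
    throw new Error(`Unknown schema name "${schemaName}" (typo?)`);
  }
}
```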

I'll take the liberty to merge as-is. If you think that deserves an update, the floor is yours :)

Resolved review threads on: src/lib/util.js, schemas/common.json (×2), schemas/browserlib/extract-dfns.json, schemas/files/index.json, schemas/postprocessing/events.json, schemas/postprocessing/idlparsed.json
- Drop now useless `nullableurl` construct
- Fix RegExp for interface names
- Add new common `interfacetype` and `extensiontype` definitions, used in
idlparsed and idlnames-parsed (a hypothetical fragment is sketched after this list)
- Adjust postprocessing/events to use common `interfaces` definition
- Make list of allowed dfns types explicit
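For illustration, a shared definition like the `interfacetype` one mentioned in the list above might look roughly as follows in a common schema file. This is a hypothetical fragment: the `definitions` layout, the enum values, and the name pattern are all illustrative, not copied from the actual Reffy schemas:

```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "definitions": {
    "interfacetype": {
      "type": "string",
      "enum": ["interface", "interface mixin", "callback interface"]
    },
    "interface": {
      "type": "string",
      "pattern": "^[A-Z][A-Za-z0-9]*$"
    }
  }
}
```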
@@ -1026,14 +1026,31 @@ function getInterfaceTreeInfo(iface, interfaces) {
 * if the requested schema does not exist.
 */
function getSchemaValidationFunction(schemaName) {
  // Helper function that selects the right schema file from the given
@dontcallmedom (Member):

you may question-mark as much as you want; it actually does improve readability :)

@tidoust merged commit 5991f32 into main Sep 26, 2022
@tidoust deleted the schemas branch September 26, 2022 17:48
tidoust added a commit that referenced this pull request Sep 26, 2022
New feature:
- Add JSON schemas for extracts and export validation function (#1075)

Bug fix:
- [CLI] Allow running post-processors without "--output" (#1074)

Dependencies bumped:
- Bump web-specs from 2.25.0 to 2.26.0 (#1076)
tidoust added a commit to w3c/webref that referenced this pull request Sep 27, 2022
This makes use of the new schema validation function in Reffy to make sure that
the curated data Webref produces follows the expected schemas, see:
  w3c/reffy#1075

This replaces #731 and fixes #657.

Schemas, notably those that deal with parsed IDL structures, could go deeper
into details. To be improved over time.

Tests are run against the curated version of the data. That is not necessary for
extracts that aren't actually curated (dfns, headings, ids, links, refs); it is
just more convenient not to have branching logic in the test code.