Add more N-Quads escaping tests. #81

davidlehn · 2023-03-01T05:00:44Z

Add 4 and 8 hex char unicode tests.
Add IRI escaping tests.
Add graph name escaping tests.

Notes:

This should improve testing coverage of implementations N-Quads I/O.
Test are based on my reading of rules at the N-Quads level. Help needed to check this is all valid.
This was related to JSON-LD/jsonld.js work, and similar evil tests are needed there in part to ensure escaping, serialization, and parsing works between potential differences in both systems. Help needed to find other appropriate tests.
URI naming got a bit more elaborate to ensure ordering when escaped and unicode segments were added.
There may be related N-Quads spec work for normalization or other issues?
This poked a number of holes in the JavaScript rdf-canonize code.
The need to ensure everything is escaped can be a performance issue. Implementations could optimize for known input, but untrusted input likely needs that step.

- Add 4 and 8 hex char unicode tests. - Add IRI escaping tests. - Add graph name escaping tests.

iherman · 2023-03-01T09:51:53Z

This is actually related to my comments in https://lists.w3.org/Archives/Public/public-rch-wg/2023Mar/0000.html:

[…] What I would propose this WG could do is to look at all the tests which could be affected by any of those proposed changes and either remove them or make them optional tests. These tests would not affect or main goal of the CR testing, namely to prove that the URDNA algorithm, and its textual specification, is correct and interoperable (which is the real goal of the CR phase); as a consequence, these tests should not stand in a way of passing to Proposed Rec when the time comes.

The test suite currently plays two roles:

Helps to prove that the specification is correct, consistent, and interoperable (by, implicitly, relying on the fact that the underlying RDF environment is a conformant implementation of the RDF data model).
Helps to prove that a specific implementation is consistent with the URDNA algorithm and the RDF Specification.

Clearly, (2) is important, and essential for the deployment of any URDNA implementation. However, we should realize that only (1) is required to pass the CR phase per W3C Process. Put it another way, this test suite is not required to test the underlying RDF environment.

Very specifically: say my implementation fails on some nasty Unicode or IRI escaping issue; what should I do? Re-implement the underlying RDF environment? Obviously not. The only thing I could do is to hope that that environment will, eventually, take care of the issue. However, in the meantime, my implementation will not pass the full CR test suite for this Working Group specification: we shoot into our own foot.

I would therefore propose to separate the URDNA necessary tests from the others. The non-necessary tests should be marked as such and should be explicitly flagged as non-required for the CR exit criteria for this specification. Furthermore, these tests should be submitted to the WG that maintains the RDF tests suites, because they are of a general value.

Cc @pchampin

davidlehn · 2023-03-01T15:53:38Z

Understood and agree. Making tests optional as needed or moving them to other projects sounds great.

From an implementor and interop view, it would be good to make it easy to find and use related test suites.

gkellogg

Also note that some of this may be affected by potential changes to N-Quads canonicalization.

tests/urdna2015/test060-in.nq

gkellogg · 2023-03-01T23:28:09Z

This is actually related to my comments in https://lists.w3.org/Archives/Public/public-rch-wg/2023Mar/0000.html:

[…] What I would propose this WG could do is to look at all the tests which could be affected by any of those proposed changes and either remove them or make them optional tests. These tests would not affect or main goal of the CR testing, namely to prove that the URDNA algorithm, and its textual specification, is correct and interoperable (which is the real goal of the CR phase); as a consequence, these tests should not stand in a way of passing to Proposed Rec when the time comes.

When I ran the current test suite by an experimental update, I found that this (the original version) was the only test that was affected.

...
I would therefore propose to separate the URDNA necessary tests from the others. The non-necessary tests should be marked as such and should be explicitly flagged as non-required for the CR exit criteria for this specification. Furthermore, these tests should be submitted to the WG that maintains the RDF tests suites, because they are of a general value.

+1 I'd also break down the test into smaller bits testing different aspects of escaping. Otherwise, finding a specific issue can be a pain. For instance, tests in IRI representation, datatyped literal representation, and the representation of escaped and non-escaped characters. Such tests should also go in to the N-Quads test suite eventually.

Cc @pchampin

gkellogg

I suggest we not modify test0060, other than the anticipated results considering upcoming changes in canonicalization. Instead, we should create new tests. That said, N-Quads parsing and serialization tests probably belong with N-Quads tests (currently here.

As for what should be tested, note that use of UCHAR escapes aside, the IRI grammar still has a narrow range of ascii characters it that are part of isegment/ipchar:

isegment       = *ipchar
ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"
 iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

In the ASCII range, this is really just ALPHA / DIGIT / "-" / "." / "_" / "~" /. Testing UCHAR versions of those characters, or higher Unicode characters is fine, but we can't through in characters (other than explicitly %-encoded) that are not allowed. A parser should replace UCHAR escapes with the unescaped Unicode characters. When canonicalizing, I don't think that any characters in uschar would be represented by either ECHAR or UCHAR when canonicalizing.

Add generate.yml for GH Action to build and updatea PR for tests and reports when changed. * Use git-auto-commit-action. * Add permissions.

the previous phrasing gave the wrong impression of *defining* what it meant to be isomorphic

https://github.com/w3c/rdf-canon/pull/83/files#r1134676034

… .html output file.

* Adds serialization section Fixes #86. * Clarify that blank nodes are serialized using their canonical label. --------- Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com> Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

* Mark introduction as informative. * Mark Security Considerations and Privacy Considerations as informative. * No separator when concatenating; better punc & language --------- Co-authored-by: Gregg Kellogg <gregg@greggkellogg.net> Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

…ers and escapes.

gkellogg · 2023-04-12T21:58:13Z

@davidlehn This should make the test use legal ranges.

gkellogg · 2023-04-15T18:22:05Z

@yamdan and @davidlehn I've added comprehensive characters in literals from U+0000 through U+008F (the last don't really display). See if you think these are reasonable and we can merge this.

yamdan · 2023-04-17T07:10:00Z

@gkellogg Thank you for the comprehensive cases with tests for capitalizing HEX. It seems reasonable to me.
By the way, I just started wondering how we can justify that we do not need to escape C1 control characters (from U+0080 to U+009f), whereas we already escaped C0 control characters (from U+0000 to U+001F).
There might be some reason I should know since RFC 8785 JSON Canonicalization Scheme (JCS) takes the same approach (i.e., escaping only C0).

gkellogg · 2023-04-17T16:12:01Z

Although C1 control characters could benefit from escaping, as you note, this is no commonly done, and I don't think there's quite the same problem with control characters making a visual representation of a string behave unreasonably, just not necessarily display properly.

dlongley

This LGTM, but, of course, by it's nature it's hard to know for sure until implementations have been updated to run against it. Seems ok to merge to me at this time and we can sort out any trouble if we find it later. Thanks!

gkellogg · 2023-04-18T19:05:16Z

This LGTM, but, of course, by it's nature it's hard to know for sure until implementations have been updated to run against it. Seems ok to merge to me at this time and we can sort out any trouble if we find it later. Thanks!

It would be great if DB would update their implementation and submit an implementation report. For now, it's just Ruby and @iherman's TypeScript implementations, both of which will need to update after this PR is merged.

Add more N-Quads escaping tests.

6d9c7fd

- Add 4 and 8 hex char unicode tests. - Add IRI escaping tests. - Add graph name escaping tests.

davidlehn requested review from gkellogg, dlongley and yamdan as code owners March 1, 2023 05:00

davidlehn mentioned this pull request Mar 1, 2023

Fix escaping. digitalbazaar/rdf-canonize#58

Merged

gkellogg reviewed Mar 1, 2023

View reviewed changes

tests/urdna2015/test060-in.nq Outdated Show resolved Hide resolved

tests/urdna2015/test060-in.nq Outdated Show resolved Hide resolved

gkellogg requested changes Mar 15, 2023

View reviewed changes

gkellogg and others added 14 commits April 12, 2023 14:52

Automated test report generation (#82)

5317905

Add generate.yml for GH Action to build and updatea PR for tests and reports when changed. * Use git-auto-commit-action. * Add permissions.

Automated report generation

4b042af

rephrase the relationship btw. URDNA2015 and isomorphism

1f719f9

the previous phrasing gave the wrong impression of *defining* what it meant to be isomorphic

remove redundant sentence

8ce6982

applying @TallTed's suggestion

db897cf

https://github.com/w3c/rdf-canon/pull/83/files#r1134676034

Creates beautified HTML output for teset manifest and corrects errant…

f5d2496

… .html output file.

Fix Ruby version used for generating output to Ruby 3.1.3.

8d77560

Automated report generation

d74cc25

Automated report generation

e540c1c

Adds serialization section (#90)

f0a1eef

* Adds serialization section Fixes #86. * Clarify that blank nodes are serialized using their canonical label. --------- Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com> Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>

Fix typos.

336ccc5

Updates test060 with legal IRI paths probing extended unicode charact…

80c4cb7

…ers and escapes.

Automated report generation

90bf813

gkellogg approved these changes Apr 12, 2023

View reviewed changes

Merge branch 'main' into improve-n-quads-escaping-tests

56a6648

gkellogg mentioned this pull request Apr 14, 2023

Local definitions to avoid RDF 1.2 dependencies #96

Merged

gkellogg and others added 2 commits April 15, 2023 11:20

Comprehensive cover for characters from U+0000 through U+008F.

c7a0090

Automated report generation

88c65d6

dlongley approved these changes Apr 18, 2023

View reviewed changes

gkellogg merged commit df0d1dd into main Apr 18, 2023

gkellogg deleted the improve-n-quads-escaping-tests branch April 18, 2023 19:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add more N-Quads escaping tests. #81

Add more N-Quads escaping tests. #81

davidlehn commented Mar 1, 2023 •

edited by pr-preview bot

Loading

iherman commented Mar 1, 2023

davidlehn commented Mar 1, 2023

gkellogg left a comment

gkellogg commented Mar 1, 2023

gkellogg left a comment

gkellogg commented Apr 12, 2023

gkellogg commented Apr 15, 2023

yamdan commented Apr 17, 2023

gkellogg commented Apr 17, 2023

dlongley left a comment

gkellogg commented Apr 18, 2023

Add more N-Quads escaping tests. #81

Add more N-Quads escaping tests. #81

Conversation

davidlehn commented Mar 1, 2023 • edited by pr-preview bot Loading

iherman commented Mar 1, 2023

davidlehn commented Mar 1, 2023

gkellogg left a comment

Choose a reason for hiding this comment

gkellogg commented Mar 1, 2023

gkellogg left a comment

Choose a reason for hiding this comment

gkellogg commented Apr 12, 2023

gkellogg commented Apr 15, 2023

yamdan commented Apr 17, 2023

gkellogg commented Apr 17, 2023

dlongley left a comment

Choose a reason for hiding this comment

gkellogg commented Apr 18, 2023

davidlehn commented Mar 1, 2023 •

edited by pr-preview bot

Loading