Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add more N-Quads escaping tests. #81

Merged
merged 18 commits into from
Apr 18, 2023
Merged

Conversation

davidlehn
Copy link
Contributor

@davidlehn davidlehn commented Mar 1, 2023

  • Add 4 and 8 hex char unicode tests.
  • Add IRI escaping tests.
  • Add graph name escaping tests.

Notes:

  • This should improve testing coverage of implementations N-Quads I/O.
  • Test are based on my reading of rules at the N-Quads level. Help needed to check this is all valid.
  • This was related to JSON-LD/jsonld.js work, and similar evil tests are needed there in part to ensure escaping, serialization, and parsing works between potential differences in both systems. Help needed to find other appropriate tests.
  • URI naming got a bit more elaborate to ensure ordering when escaped and unicode segments were added.
  • There may be related N-Quads spec work for normalization or other issues?
  • This poked a number of holes in the JavaScript rdf-canonize code.
  • The need to ensure everything is escaped can be a performance issue. Implementations could optimize for known input, but untrusted input likely needs that step.

Preview | Diff

- Add 4 and 8 hex char unicode tests.
- Add IRI escaping tests.
- Add graph name escaping tests.
@iherman
Copy link
Member

iherman commented Mar 1, 2023

This is actually related to my comments in https://lists.w3.org/Archives/Public/public-rch-wg/2023Mar/0000.html:

[…] What I would propose this WG could do is to look at all the tests which could be affected by any of those proposed changes and either remove them or make them optional tests. These tests would not affect or main goal of the CR testing, namely to prove that the URDNA algorithm, and its textual specification, is correct and interoperable (which is the real goal of the CR phase); as a consequence, these tests should not stand in a way of passing to Proposed Rec when the time comes.

The test suite currently plays two roles:

  1. Helps to prove that the specification is correct, consistent, and interoperable (by, implicitly, relying on the fact that the underlying RDF environment is a conformant implementation of the RDF data model).
  2. Helps to prove that a specific implementation is consistent with the URDNA algorithm and the RDF Specification.

Clearly, (2) is important, and essential for the deployment of any URDNA implementation. However, we should realize that only (1) is required to pass the CR phase per W3C Process. Put it another way, this test suite is not required to test the underlying RDF environment.

Very specifically: say my implementation fails on some nasty Unicode or IRI escaping issue; what should I do? Re-implement the underlying RDF environment? Obviously not. The only thing I could do is to hope that that environment will, eventually, take care of the issue. However, in the meantime, my implementation will not pass the full CR test suite for this Working Group specification: we shoot into our own foot.

I would therefore propose to separate the URDNA necessary tests from the others. The non-necessary tests should be marked as such and should be explicitly flagged as non-required for the CR exit criteria for this specification. Furthermore, these tests should be submitted to the WG that maintains the RDF tests suites, because they are of a general value.

Cc @pchampin

@davidlehn
Copy link
Contributor Author

Understood and agree. Making tests optional as needed or moving them to other projects sounds great.

From an implementor and interop view, it would be good to make it easy to find and use related test suites.

Copy link
Member

@gkellogg gkellogg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also note that some of this may be affected by potential changes to N-Quads canonicalization.

tests/urdna2015/test060-in.nq Outdated Show resolved Hide resolved
tests/urdna2015/test060-in.nq Outdated Show resolved Hide resolved
@gkellogg
Copy link
Member

gkellogg commented Mar 1, 2023

This is actually related to my comments in https://lists.w3.org/Archives/Public/public-rch-wg/2023Mar/0000.html:

[…] What I would propose this WG could do is to look at all the tests which could be affected by any of those proposed changes and either remove them or make them optional tests. These tests would not affect or main goal of the CR testing, namely to prove that the URDNA algorithm, and its textual specification, is correct and interoperable (which is the real goal of the CR phase); as a consequence, these tests should not stand in a way of passing to Proposed Rec when the time comes.

When I ran the current test suite by an experimental update, I found that this (the original version) was the only test that was affected.

...
I would therefore propose to separate the URDNA necessary tests from the others. The non-necessary tests should be marked as such and should be explicitly flagged as non-required for the CR exit criteria for this specification. Furthermore, these tests should be submitted to the WG that maintains the RDF tests suites, because they are of a general value.

+1 I'd also break down the test into smaller bits testing different aspects of escaping. Otherwise, finding a specific issue can be a pain. For instance, tests in IRI representation, datatyped literal representation, and the representation of escaped and non-escaped characters. Such tests should also go in to the N-Quads test suite eventually.

Cc @pchampin

Copy link
Member

@gkellogg gkellogg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suggest we not modify test0060, other than the anticipated results considering upcoming changes in canonicalization. Instead, we should create new tests. That said, N-Quads parsing and serialization tests probably belong with N-Quads tests (currently here.

As for what should be tested, note that use of UCHAR escapes aside, the IRI grammar still has a narrow range of ascii characters it that are part of isegment/ipchar:

isegment       = *ipchar
ipchar         = iunreserved / pct-encoded / sub-delims / ":"
                  / "@"
 iunreserved    = ALPHA / DIGIT / "-" / "." / "_" / "~" / ucschar
ucschar        = %xA0-D7FF / %xF900-FDCF / %xFDF0-FFEF
                  / %x10000-1FFFD / %x20000-2FFFD / %x30000-3FFFD
                  / %x40000-4FFFD / %x50000-5FFFD / %x60000-6FFFD
                  / %x70000-7FFFD / %x80000-8FFFD / %x90000-9FFFD
                  / %xA0000-AFFFD / %xB0000-BFFFD / %xC0000-CFFFD
                  / %xD0000-DFFFD / %xE1000-EFFFD

In the ASCII range, this is really just ALPHA / DIGIT / "-" / "." / "_" / "~" /. Testing UCHAR versions of those characters, or higher Unicode characters is fine, but we can't through in characters (other than explicitly %-encoded) that are not allowed. A parser should replace UCHAR escapes with the unescaped Unicode characters. When canonicalizing, I don't think that any characters in uschar would be represented by either ECHAR or UCHAR when canonicalizing.

gkellogg and others added 14 commits April 12, 2023 14:52
Add generate.yml for GH Action to build and updatea PR for tests and reports when changed.
* Use git-auto-commit-action.
* Add permissions.
the previous phrasing gave the wrong impression of *defining*
what it meant to be isomorphic
* Adds serialization section Fixes #86.
* Clarify that blank nodes are serialized using their canonical label.

---------

Co-authored-by: Ted Thibodeau Jr <tthibodeau@openlinksw.com>
Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>
* Mark introduction as informative.
* Mark Security Considerations and Privacy Considerations as informative.
* No separator when concatenating; better punc & language

---------

Co-authored-by: Gregg Kellogg <gregg@greggkellogg.net>
Co-authored-by: Dan Yamamoto <dan@iij.ad.jp>
@gkellogg
Copy link
Member

@davidlehn This should make the test use legal ranges.

@gkellogg
Copy link
Member

@yamdan and @davidlehn I've added comprehensive characters in literals from U+0000 through U+008F (the last don't really display). See if you think these are reasonable and we can merge this.

@yamdan
Copy link
Contributor

yamdan commented Apr 17, 2023

@gkellogg Thank you for the comprehensive cases with tests for capitalizing HEX. It seems reasonable to me.
By the way, I just started wondering how we can justify that we do not need to escape C1 control characters (from U+0080 to U+009f), whereas we already escaped C0 control characters (from U+0000 to U+001F).
There might be some reason I should know since RFC 8785 JSON Canonicalization Scheme (JCS) takes the same approach (i.e., escaping only C0).

@gkellogg
Copy link
Member

Although C1 control characters could benefit from escaping, as you note, this is no commonly done, and I don't think there's quite the same problem with control characters making a visual representation of a string behave unreasonably, just not necessarily display properly.

Copy link
Contributor

@dlongley dlongley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM, but, of course, by it's nature it's hard to know for sure until implementations have been updated to run against it. Seems ok to merge to me at this time and we can sort out any trouble if we find it later. Thanks!

@gkellogg
Copy link
Member

This LGTM, but, of course, by it's nature it's hard to know for sure until implementations have been updated to run against it. Seems ok to merge to me at this time and we can sort out any trouble if we find it later. Thanks!

It would be great if DB would update their implementation and submit an implementation report. For now, it's just Ruby and @iherman's TypeScript implementations, both of which will need to update after this PR is merged.

@gkellogg gkellogg merged commit df0d1dd into main Apr 18, 2023
@gkellogg gkellogg deleted the improve-n-quads-escaping-tests branch April 18, 2023 19:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants