Add Turtle, TriG, RDF/XML and SPARQL tests to better validate relative IRI resolution #6

Open
gkellogg opened this issue Sep 9, 2015 · 30 comments

@gkellogg
Member

gkellogg commented Sep 9, 2015

This was suggested by @RubenVerborgh and discussed on public-rdf-comments. The gist has been updated to include evaluation tests for Turtle, JSON-LD, and RDF/XML, which can readily be translated for TriG and SPARQL.

@RubenVerborgh
Member

Thanks @gkellogg, feel free to assign this issue to me.

@gkellogg
Member Author

gkellogg commented Sep 9, 2015

I don't have the permission to add you quite yet (which is a bit odd). Once you're added, you should be able to assign yourself. Thanks!

BTW, I think I owe you a response relating to urn:ex:s276, which I haven't looked into.

@gkellogg
Member Author

See PR #13 for Turtle and TriG tests.

@RubenVerborgh
Member

About RDF/XML, I wasn't sure where to put those. The RDF/XML tests have a pretty specific structure.

@gkellogg
Member Author

I've created RDF/XML versions of these tests for my own use, which can be added to the RDF/XML test suite (I also have JSON-LD tests). If you don't get to it, @RubenVerborgh, I'll handle it later this week.

SPARQL's another matter.

@RubenVerborgh
Member

Only thing that held me back on RDF/XML is the weird folder structure; wasn't sure where to put things. So I'd leave it up to you if that's okay.

@gkellogg
Member Author

Also RDFa tests.

@gkellogg
Member Author

I added RDF/XML tests to PR #13.

@afs
Contributor

afs commented Oct 20, 2015

I was pointed at the following text in RFC 3987 5.3.2.4. Path Segment Normalization

The complete path segments "." and ".." are intended only for use
within relative references

says dot-segments are only for the beginning of relative IRIs. But is it a prohibition for use elsewhere?

RFC 3986: section 3.3 Paths

The path segments "." and "..", also known as dot-segments, are
defined for relative reference within the path name hierarchy. They
are intended for use at the beginning of a relative-path reference
(Section 4.2) to indicate relative position within the hierarchical
tree of names.
...
Aside from dot-segments in hierarchical paths, a path segment is
considered opaque by the generic syntax.

Also RFC 3986: section 1.2.3 Hierarchical Identifiers

All URI references are parsed by generic syntax parsers when used.
However, because hierarchical processing has no effect on an absolute
URI used in a reference unless it contains one or more dot-segments
(complete path segments of "." or "..", as described in Section 3.3),
URI scheme specifications can define opaque identifiers by
disallowing use of slash characters, question mark characters, and
the URIs "scheme:." and "scheme:..".

seems to say dot-segments processing applies to absolute URIs as it is called out specially.

All of which leaves me confused about the use of dot-segments in absolute IRIs and URIs and, to some extent, in relative URIs when not at the beginning. Multiple readings seem possible, depending on whether "intended only for" and "are defined" carry "SHOULD NOT" or "MUST NOT" force.

Opinions?

@RubenVerborgh
Member

says dot-segments are only for the beginning of relative IRIs.

Not necessarily “beginning” though. The algorithm accounts for ../ not at the beginning.

The complete path segments "." and ".." are intended only for use
within relative references

The problem is “intended”. What does that even mean?

@afs
Contributor

afs commented Oct 20, 2015

The algorithm works more generally, and also works for sorting out absolute URIs by operating on their path. My sense is that the case of base URI having .. was either not given much weight or it was considered to be sorted out as part of making the base URI in the first place, i.e. .. does not appear. "intended" == "design space". Most of the ways to determine a base URI, and usages like browser document base URLs, don't have such segments.

The algorithm accounts for .. everywhere except at the end of the base URI. There, unless the reference is <>, the .. is simply lost at the merge step and has no effect.

Where .. has some action

<http://host/> <a/b/../c/d> => <http://host/a/c/d> 
<http://host/a/b/../c/> <d> => <http://host/a/c/d> 
<http://host/a/b/> <../c/d> => <http://host/a/c/d>
<http://host/x/y/> <z/..> => <http://host/x/y/>

Where .. has no effect

<http://host/a/b/..> <c/d> => <http://host/a/b/c/d>
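
The five cases above can be reproduced with a literal transcription of RFC 3986 sections 5.2.3 (merge) and 5.2.4 (remove_dot_segments). The following is only a sketch: the function names `remove_dot_segments` and `resolve` are mine, it handles path-only references (no scheme, authority, query or fragment in the reference), and it deliberately does not normalize the base first.

```python
from urllib.parse import urlsplit, urlunsplit


def remove_dot_segments(path: str) -> str:
    """Follow the steps of RFC 3986 section 5.2.4 on an input path."""
    output = []
    while path:
        if path.startswith("../"):        # step 2A
            path = path[3:]
        elif path.startswith("./"):       # step 2A
            path = path[2:]
        elif path.startswith("/./"):      # step 2B
            path = path[2:]
        elif path == "/.":                # step 2B
            path = "/"
        elif path.startswith("/../"):     # step 2C
            path = path[3:]
            if output:
                output.pop()
        elif path == "/..":               # step 2C
            path = "/"
            if output:
                output.pop()
        elif path in (".", ".."):         # step 2D
            path = ""
        else:                             # step 2E: move one segment to the output
            cut = path.find("/", 1)
            if cut == -1:
                cut = len(path)
            output.append(path[:cut])
            path = path[cut:]
    return "".join(output)


def resolve(base: str, ref: str) -> str:
    """Resolve a path-only reference against an absolute base (section 5.2.2)."""
    b, r = urlsplit(base), urlsplit(ref)
    if r.scheme or r.netloc or r.query or r.fragment:
        raise ValueError("this sketch only handles path-only references")
    if r.path == "":
        # <>: the base path is reused as-is, with no dot-segment removal
        return urlunsplit((b.scheme, b.netloc, b.path, b.query, ""))
    if r.path.startswith("/"):
        path = remove_dot_segments(r.path)
    else:
        # Merge (section 5.2.3): drop everything after the last "/" of the base path
        if b.netloc and not b.path:
            merged = "/" + r.path
        else:
            merged = b.path[: b.path.rfind("/") + 1] + r.path
        path = remove_dot_segments(merged)
    return urlunsplit((b.scheme, b.netloc, path, "", ""))


for base, ref in [
    ("http://host/", "a/b/../c/d"),
    ("http://host/a/b/../c/", "d"),
    ("http://host/a/b/", "../c/d"),
    ("http://host/x/y/", "z/.."),
    ("http://host/a/b/..", "c/d"),
]:
    print(f"<{base}> <{ref}> => <{resolve(base, ref)}>")
```

In the last case the trailing .. is the final segment of the base path, so the merge step discards it before remove_dot_segments ever sees it.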

The difference is between .. and ../, but it isn't an arbitrary split anyway: consider </c/d>, for example.

When, except by explicit choice, does .. appear in a base URI?

And it's different if the base URI is sorted out first which is what leads to oddities.

BASE <http://host/>
BASE </a/b/..>
<urn:ex:s> <urn:ex:p> <c/d> .
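
To make the oddity concrete, here is a small sketch comparing the two routes. It assumes Python's `urllib.parse.urljoin` as a stand-in for an RFC 3986 section 5.2 resolver; a Turtle parser may of course behave differently, which is the point under discussion.

```python
from urllib.parse import urljoin

# Route 1: process the two BASE directives in order. The second BASE is itself
# a relative reference, so it goes through dot-segment removal.
base = urljoin("http://host/", "/a/b/..")
chained = urljoin(base, "c/d")

# Route 2: take <http://host/a/b/..> as a ready-made absolute base, untouched.
direct = urljoin("http://host/a/b/..", "c/d")

print("base after second BASE:", base)
print("chained:", chained)
print("direct: ", direct)
```

Under the RFC algorithm, route 1 yields a base of <http://host/a/> and an object of <http://host/a/c/d>, while route 2 yields <http://host/a/b/c/d>.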

@RubenVerborgh
Member

My sense is that the case of base URI having .. was either not given much weight or it was considered to be sorted out as part of making the base URI in the first place

The thing is, RFC3986 says that “Normalization of the base URI, as described in Sections 6.2.2 and 6.2.3, is optional” but the rest of the algorithm seems to silently assume that this normalization has been performed (as there would be no obvious reason not to). That's also what many libraries that perform resolution just assume.

However, the Turtle spec says that we should “[use] only the basic algorithm in section 5.2”. While I insist that this wording is ambiguous, I think that the intention of “only the basic algorithm” was to say that we should not do anything optional, which thus means not normalizing the base URI. Perhaps @cygri can help us clarify the intention.

@afs
Contributor

afs commented Oct 21, 2015

Yes - sorting out the absolute base before even getting to the relative URI resolution seems highly likely. The text leading up to the quote from the Turtle spec makes it clear the text applies to relative IRIs so we're exposed to RFC processing on the base earlier (e.g. at the point of @base) so this isn't about the relative URI step.

Absolute URIs have other issues; this is why this came up for me. <file:data.ttl> is strictly illegal, as is <file:/path> (which is what Java's URL.toString() produces). Jena normalizes those to (legal) <file:///fullpath>.

And there is the URI scheme C: found on some operating systems :-).

So the problematic cases for http are:

  • A trailing /.. in an absolute base URI, which leads to its being ignored. Any other final component works, including a single dot. A base of @base <..> . works.
  • The special case of <> in 5.2.2, which bypasses remove_dot_segments and exposes the raw base URI, which might elsewhere be clean or raw. This is an important case and should be tested. At least the assumptions of the test data need to be captured.
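
A quick way to see the <> special case is to compare it with <.> against the same base. This sketch again assumes Python's `urllib.parse.urljoin` as a stand-in for RFC 3986 section 5.2; the base IRI is made up for illustration.

```python
from urllib.parse import urljoin

# Hypothetical base with a dot-segment in the middle of its path.
base = "http://host/a/b/../c"

same_document = urljoin(base, "")   # <>  : base is returned without dot-segment removal
current_dir = urljoin(base, ".")    # <.> : merged with the base path, then dot-segments removed

print("<>  =>", same_document)
print("<.> =>", current_dir)
```

With <> the .. of the base survives in the result, whereas with <.> it is consumed, so a test suite has to state whether it expects the base to be raw or already cleaned.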

Asking again: When does ".." appear in an absolute base URI in practice? We could avoid testing this one situation if it is a test-case corner case.

The text is the same in SPARQL 1.1 (@ericprud ? can you remember the history?).

@RubenVerborgh
Member

When does ".." appear in an absolute base URI in practice?

In practice, I can't imagine why anyone would want to do that. Just like I hope nobody ever mints other very ugly URIs. In theory, however, it is possible, and the RDF spec allows it:

IRI normalization: Interoperability problems can be avoided by minting only IRIs that are normalized according to Section 5 of RFC3987. Non-normalized forms that are best avoided include:

  • […]
  • “/./” or “/../” in the path component of an IRI
  • […]

So we should avoid them, but at the same time, the spec acknowledges that they exist and are valid.

For me, however, the following part of the spec is just a disaster:

IRI equality: Two IRIs are equal if and only if they are equivalent under Simple String Comparison according to section 5.1 of RFC3987. Further normalization must not be performed when comparing IRIs for equality.

Because this means that dereferencing is utterly broken, since intermediaries are allowed to do such normalization. We should just always have to normalize. But that's another discussion altogether 😉

@afs
Contributor

afs commented Oct 21, 2015

Good point. If the spec calls out "/../" (trailing /), which is the case where "everything just works", then maybe the tests should mostly use that, not "/.." (no trailing /), and have just one or two tests of "/..".

@RubenVerborgh
Member

I think it's safer to test that behavior as well. If /../ can occur, then /.. can occur as well.

@gkellogg
Member Author

gkellogg commented Nov 4, 2015

@RubenVerborgh what's the next step for this? Have we reached consensus on what tests to include and what to do with the .. segments in the base? It would be good to complete this issue.

@RubenVerborgh
Member

I think this is mostly up to @afs. I could live with removing the /.. tests, but I'm not convinced this is necessary. However, if this speeds up the issue, we can indeed remove those and commit what we have already (and add the rest later if needed).

@afs
Contributor

afs commented Nov 4, 2015

Focusing on the area where there is no disagreement seems like the way forward.

Is that the test sets with no dot-segments in absolute IRIs, namely 01, 02, 07 and 08?

@RubenVerborgh
Member

I think we have a larger subset we agree on. I would propose everything except 5 and 6. Would that work?

@afs
Contributor

afs commented Nov 5, 2015

03 and 04 have dot-segments in @base URIs and lead to different results.

See the external feedback which quotes RFC3987 that reference resolution is necessary when the reference is already an IRI and my testing with Redland.

@RubenVerborgh
Member

Ah, I thought only trailing dot segments were an issue.

So what do we do with the other cases? We just accept that different parsers have different outcomes there, and thus we don't include them? Or we ask the spec authors what they intended?

@afs
Contributor

afs commented Nov 5, 2015

The case of <> makes them different. <> bypasses the dot-segment removal step in relative URI resolution.

One Turtle editor has responded that, to him, describing them as "absolute IRIs" meant that RFC 3987 applies, not that absolute IRIs are left untouched.

@RubenVerborgh
Member

Any other opinions on this? Are 01, 02, 07 and 08 the only ones we agree on?

@gkellogg
Member Author

gkellogg commented Jan 4, 2016

@RubenVerborgh can you drive this to conclusion and simply propose a change where we can ask for objections? I'd like to get these tests integrated so we can move on.

@RubenVerborgh
Member

Yes, asked for confirmation on the mailing list.

@gkellogg
Member Author

gkellogg commented Jan 7, 2016

Turtle and TriG tests completed via PR #30. Still need similar tests for SPARQL and RDF/XML. JSON-LD and RDFa can be done elsewhere.

@leipert

leipert commented Mar 21, 2017

I just stumbled over this issue. I have two questions:

  1. The Turtle and TriG tests have been merged in Add IRI resolution tests (subset) #30. The tests currently have the rdft:Proposed approval status. Is there any resource describing how this approval process works?
  2. What would be the best way to add SPARQL tests for this? CONSTRUCT/SELECT/INSERT tests with a copy of the turtle tests as data?

@gkellogg
Member Author

@leipert Until a new Working Group is chartered to update RDF and/or SPARQL, there is really no way to move from rdft:Proposed to rdft:Approved. All the CG can really do is propose new tests to be considered at a future date. But, for all practical purposes, if they have been merged into this repo, the community has had a chance to vet the tests, and they are fairly stable.

There is, of course, a whole sub-tree for SPARQL tests, but the suites are independent, so new tests would need to be created referencing local queries and data files. Ideally, both queries and data files are the minimum necessary to test the particular feature. (chair hat off) I don't think it's appropriate to create a SPARQL test for each Turtle test in general, but you can certainly propose some specific tests for the community to consider. (chair hat back on)

The way this has been done in the past is to propose a set of tests, on the rdf-tests mailing list along with either or both of public-rdf-comments or public-sparql-dev to gain consensus for the need for such tests; create a PR with changes necessary for those tests, after which we (me usually) send out a call for consensus to merge these tests into the main (gh-pages) branch of this repo.

I expect, at some point, that there will be new tests to check for a consensus change to EXISTS that's been in the works for a while.

@afs
Contributor

afs commented Jan 22, 2023

This is done by #30.

#87 has just tweaked the tests of #30.

In #30, the "subset" is the subset of #6 that we discussed (although reading old issues is never a case of being completely sure!)

@gkellogg gkellogg removed the Turtle label Nov 1, 2023