Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Turtle quickcheck -- DO NOT MERGE! #103

Open
wants to merge 6 commits into
base: master
Choose a base branch
from

Conversation

tmciver
Copy link
Contributor

@tmciver tmciver commented Aug 30, 2020

@robstewart57 This work is intended to use QuickCheck to ensure that subject grouping is happening correctly in the Turtle serialization as was discussed in #99. I've run into an issue and wanted to get some feedback on the technique used. Basically the approach is naive: serialize a graph to Turtle format and check that the file has only one instance of each subject. This works for the most part but the QuickCheck tests occasionally fail and it turns out that they fail because the same URI is sometimes used for both subject and predicate, or subject and object. I think the solution is to update the RDF TList Arbitrary instance to not use the same URI for more than a given type of node. Thoughts?

… serialization.

Theses tests reveal a possible problem with the `Arbitrary` instance for a `RDF
TList`: the same URI is occassionally used for more than one node type. I.e., a
URI might be used for a subject and predicate or subject and object. The
instance should be updated to not used duplicate URIs for different node types.
BNodes are now unique across subject, predicate and objects.
@tmciver
Copy link
Contributor Author

tmciver commented Sep 1, 2020

I've pushed some more commits that ensure that the same URI is not reused across subjects, predicates and objects. This is the case for both UNodes and BNodes. I ran the QuickCheck tests for the Turtle serializer about ten times without failure and verified that these changes have not caused any regressions in other tests.

@tmciver
Copy link
Contributor Author

tmciver commented Sep 9, 2020

@robstewart57 Do you have any thoughts on this work? Let me know if you think changes are needed.

Comment on lines +121 to +122
-- Assert that the graph serialization (first parameter) contains
-- exactly one instance of the given subject (second parameter).
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmciver that seems to be subtly different from what we discussed in #99 (comment)

Your countMatches function looks like the needle (the subject) must not appear anywhere in the bytestring haystack (all subject/predicate/object data).

That would mean the prop_SingleSubject property would not allow:

<http://www.example.com/a> <http://www.example.com/b> <http://www.example.com/c>
<http://www.example.com/d> <http://www.example.com/e> <http://www.example.com/a>
<http://www.example.com/f> <http://www.example.com/g> <http://www.example.com/a>

Since, when calling assertSingleSubject with the subject <http://www.example.com/a> as its second argument, the needle <http://www.example.com/a> is part of the haystack in the 2nd and 3rd triples.

Have I misunderstood the intention here? Or indeed, your code?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robstewart57 You did not misunderstand the code: you're right, this test does not account for a subject node in object position. Good catch! Considering that, this property seems much more difficult to test than I thought.

As I think about this, one possible solution might be to use the Turtle parser on the input graph, count the number of times a subject node occurs in non-subject position and then add 1. That should give the number of times that URI should occur in the ttl file.

One issue that this approach suffers from is if that URI appears multiple times in predicate position and then predicates are grouped, then that calculation will be wrong. I think this situation is rare but I think it's possible.

Another option would be to create a special-purpose parser that counts the number of times a given URI appears in subject position. But this seems like a lot of work and could therefore have its own bugs.

Thoughts?

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tmciver Hi,

An alternative option is to implement an Arbitrary instance of a special kind of RDF graph whereby:

  1. A URI that appears in a subject position may appear in multiple subject positions.
  2. A URI that appears in a subject position must not appear in any predicate or object positions in the graph.

The QuickCheck API should permit these constraints, and the prop_SingleSubject property would then be valid.

We could do with other property tests which tests for various identity functions, e.g. #99 (comment) and parsing, serialising then parsing again the same RDF data content for each of the given RDF formats that we so far support.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rob,
A great idea! I've been pretty busy of late but will try to work on that this week.

@robstewart57
Copy link
Owner

@tmciver Hi Tim, sorry for the delayed response, work very busy just now.

I've added a comment, above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants