N3/Turtle reader does not allow non-lowercase language tags #2

bhuga · 2010-11-10T03:38:59Z

Despite the fact that "abc"@EN is invalid Turtle according to both the N3 and Turtle grammars, uppercase language tags are in widespread use, including in the W3C's own SPARQL tests (what is the emoticon for 'irony'?):

http://www.w3.org/2001/sw/DataAccess/tests/r2#lang-case-insensitive-eq

The rdf-n3 gem currently, correctly, fails to parse the data file given above, dying at the @EN. Could this be made more accepting, allowing non-lowercase language tags?

artob · 2010-11-10T04:05:26Z

For what it's worth, I made a similar improvement to RDF.rb's N-Triples parser recently, given how frequently mixed-case language tags are encountered in the wild. RDF.rb 0.3.0's parser will now accept language tags in any case, but the serializer emits them in lowercase only.

gkellogg · 2010-11-10T22:22:51Z

The rule is currently [a-z]+ ( "-" [a-z0-9]+ )*; I can easily change this to allow upper case too. I note that RDF::Literal#canonicalize should fix this. The rule is based on RFC 3066, which has actually been replaced by RFC 4646. I suspect that work out of RDF Next will update this, but for the time being, I'll just relax the parsing to allow both upper- and lower-case language tags.
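
For illustration, the difference between the strict rule quoted above and a relaxed one can be sketched in plain Ruby. This is only a sketch: the constant and method names are hypothetical, and the relaxed pattern is an assumption about what "allow both cases" could look like, not the actual change made in rdf-n3.

```ruby
# Strict rule as quoted in the comment: lowercase letters only.
STRICT_LANG = /\A[a-z]+(-[a-z0-9]+)*\z/

# Relaxed rule (assumption): same shape, but either case is accepted.
RELAXED_LANG = /\A[a-zA-Z]+(-[a-zA-Z0-9]+)*\z/

# Hypothetical helper: true if the tag satisfies the lowercase-only rule.
def strict_lang?(tag)
  STRICT_LANG.match?(tag)
end

# Hypothetical helper: true if the tag satisfies the case-insensitive rule.
def relaxed_lang?(tag)
  RELAXED_LANG.match?(tag)
end

%w[en EN en-GB en_GB].each do |tag|
  puts "#{tag}: strict=#{strict_lang?(tag)} relaxed=#{relaxed_lang?(tag)}"
end
```

Only the character classes change; the overall structure of the tag (a primary subtag followed by hyphen-separated subtags) stays the same.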

artob · 2010-11-11T01:25:58Z

Yes, I just had a look at RDF::Literal#canonicalize in RDF.rb HEAD and we do indeed downcase the language tag when canonicalized. Of course, parsers could choose to do that even earlier, when first reading in the language tag; the N-Triples parser doesn't at present, but we could change that if you think it's appropriate.

In any case, we should probably have both RDF.rb's bundled N-Triples parser and the N3 parser handle language tag case-sensitivity questions consistently if possible, so let me know if something needs changing in the N-Triples reader.

bhuga · 2010-11-11T03:04:44Z

The particular SPARQL test at issue checks whether lowercase and uppercase language tags are treated as the same by the endpoint. I have no idea why this decision was made, but it would seem parsers should not canonicalize this.
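
The trade-off being discussed here can be sketched in plain Ruby. The LangLiteral struct and its methods are hypothetical stand-ins, not RDF.rb's RDF::Literal: the sketch only shows that a reader can keep the tag as written and leave case-insensitive comparison (or canonicalization) to a later step.

```ruby
# Hypothetical stand-in for a language-tagged literal; not RDF.rb's class.
LangLiteral = Struct.new(:value, :language) do
  # Returns a new literal with a lowercased tag; the receiver is unchanged.
  def canonicalize
    self.class.new(value, language.downcase)
  end

  # True when the values match and the tags differ at most by case.
  def equal_ignoring_tag_case?(other)
    value == other.value && language.casecmp(other.language).zero?
  end
end

upper = LangLiteral.new("abc", "EN")
lower = LangLiteral.new("abc", "en")

puts upper == lower                         # strict comparison of value and tag
puts upper.equal_ignoring_tag_case?(lower)  # comparison that ignores tag case
puts upper.canonicalize == lower            # comparison after canonicalization
puts upper.language                         # the tag as originally read
```

If the reader downcases tags on input, the first comparison can no longer distinguish the two literals, which is exactly the information a case-sensitivity test needs.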

gkellogg · 2010-11-11T07:48:22Z

Fixed in d9726a6

gkellogg · 2010-11-12T19:49:15Z

Note that the fix did include the c14n. The reader canonicalizes all input, which seems to be required to pass other W3C tests. We could add an option to the reader that controls whether c14n is performed; I would enable it for tests, while still allowing your usage.

Let me know how it goes after you re-check with the updated gem.

artob · 2010-11-14T03:04:38Z

Thanks for implementing this, Gregg.

We should probably define some standard options for all RDF.rb-compatible readers, such as indeed :canonicalize => true || false, but also e.g. :intern => true || false to control whether or not the reader instance will return interned URIs. The former could be false by default and the latter true by default, which (mostly) reflects the current default situation. What do you think?
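
As a rough sketch of how such options might behave, assuming the defaults proposed above (:canonicalize off, :intern on): the SketchReader class and its methods are hypothetical and are not RDF::Reader's actual implementation.

```ruby
# Hypothetical reader illustrating the two proposed options.
class SketchReader
  # Defaults as proposed in the comment above.
  DEFAULTS = { :canonicalize => false, :intern => true }.freeze

  def initialize(options = {})
    @options = DEFAULTS.merge(options)
    @uri_cache = {}
  end

  def canonicalize?
    @options[:canonicalize]
  end

  def intern?
    @options[:intern]
  end

  # Language tag as this reader would report it.
  def language_tag(raw)
    canonicalize? ? raw.downcase : raw
  end

  # URI string as this reader would report it. With :intern enabled,
  # equal URIs are returned as one shared frozen object.
  def uri(raw)
    intern? ? (@uri_cache[raw] ||= raw.dup.freeze) : raw.dup
  end
end

puts SketchReader.new.language_tag("EN")
puts SketchReader.new(:canonicalize => true).language_tag("EN")
```

Merging caller options over a frozen defaults hash keeps the default behavior in one place, so every reader that adopts the convention can document the same defaults.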

gkellogg · 2010-11-14T04:11:18Z

Yes, I think this is the right set. I'll add support for this in my readers. If you add it to RDF::Reader, that would be great.

artob · 2010-11-14T04:41:11Z

OK, I'll add and document them shortly.

gkellogg · 2010-11-14T07:33:28Z

I implemented this, plus a :prefixes option, in rdf-n3, which is pushed to GitHub. I'll wait until 0.3.0 issues are resolved across other gems before releasing to RubyGems.

artob · 2010-11-15T05:29:09Z

I've now defined and documented five standard options for RDF::Reader.new in:

https://github.com/bendiken/rdf/commit/e7b325b9ffd445781a0390f4aab51d2625f7cd4d

See http://rdf.rubyforge.org/RDF/Reader.html#initialize-instance_method for a readable summary. Reader implementations don't necessarily need to implement all of the options, but I'll work on implementing these (except for the prefixes, obviously) in RDF.rb's bundled N-Triples parser.

artob · 2010-11-15T05:35:32Z

Also, I should mention that I will cut a 0.3.0.pre release today or tomorrow, so that all gems depending on RDF.rb have a chance to get updated before the official 0.3.0 release.

This issue was closed.
