TSV: state how to handle special characters in strings #10

Open
Tpt opened this issue Mar 6, 2023 · 9 comments

Comments

@Tpt

Tpt commented Mar 6, 2023

The specification does not explicitly state how quotes and ASCII control characters (\0...) should be escaped. It might be nice to add some sentences about it.

A note to state that the " quote should be preferred to the ' quote might also be nice to get some kind of "canonical" TSV serialization.

Tpt mentioned this issue Mar 6, 2023
afs changed the title from "TSV: state who to handle special characters in strings" to "TSV: state how to handle special characters in strings" Mar 6, 2023
@TallTed
Member

TallTed commented Mar 13, 2023

A note to state that the " quote should be preferred to the ' quote might also be nice to get some kind of "canonical" TSV serialization.

This preference is commonly dictated by the data. If my data has lots of " characters and few or no ', I'd prefer to use the ' to quote each field, minimizing the need for inline escapes.

Putting some guidance like yours into a distinct "notes on canonicalization" section would probably be OK.
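To make the data-driven preference concrete, here is a minimal sketch. The helper names `pick_quote` and `quote_literal` are hypothetical (they are not from the spec or any library). The sketch picks whichever quote character occurs less often in the lexical form and escapes only that quote and the backslash. Control characters are out of scope here.

```python
def pick_quote(lexical: str) -> str:
    """Return the quote character that needs fewer escapes (ties go to '"')."""
    return "'" if lexical.count('"') > lexical.count("'") else '"'


def quote_literal(lexical: str) -> str:
    """Wrap a lexical form in the chosen quote, escaping backslash and that quote."""
    q = pick_quote(lexical)
    # Escape backslashes first so the escapes added next are not doubled.
    body = lexical.replace("\\", "\\\\").replace(q, "\\" + q)
    return q + body + q


print(quote_literal('say "hi"'))  # single-quoted, no inline escapes needed
print(quote_literal("it's"))      # double-quoted, no inline escapes needed
```

A canonicalization note could instead fix the choice to " and accept the extra escapes; this sketch only illustrates the trade-off.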

@domel

domel commented Mar 13, 2023

I think that it is a bad idea to change or override the underlying spec. TSV has no official spec, but CSV does, and that spec says nothing about '. It recommends using double quotes (or nothing).

@afs
Contributor

afs commented Mar 14, 2023

TSV does have an official spec!
https://www.iana.org/assignments/media-types/text/tab-separated-values

@afs
Contributor

afs commented Mar 14, 2023

In TSV, the quoting and escaping rules come from how RDF terms are written.

https://w3c.github.io/sparql-results-csv-tsv/spec/index.html#tsv-terms
"by using the syntax that SPARQL and Turtle use."

From what I see on the web, in Turtle, " is more common.

Either quote character needs escape checking (a ' in names seems to catch data-writing systems out).

Some advisory text would be useful: something less than a formal, single-choice canonicalization.

@domel

domel commented Mar 14, 2023

TSV does have an official spec! https://www.iana.org/assignments/media-types/text/tab-separated-values

Yes and no. It is documentation for a media type rather than an official spec (that is, an RFC or STD). Regardless of the naming, it says nothing about '.

@afs
Contributor

afs commented Mar 15, 2023

And in that spec there is no information about '. It recommends to use double quotes (or nothing).

This issue (#10) is specific to TSV. For CSV, we should, of course, use ".

domel added the "needs discussion" label (Proposed for discussion in an upcoming meeting) Apr 16, 2023
@afs
Contributor

afs commented Nov 30, 2023

This "needs discussion" issue was discussed during the telecon of 2023-11-30.

From the issue thread above, are we agreed that:

  1. TSV does not treat " or ' specially because fields are separated by a raw TAB.
  2. Turtle serializers more commonly use ".
  3. The current spec text covers quoting and control characters "by using the syntax that SPARQL and Turtle use." (section 5.1). Hence, no raw TABs in RDF term text.
  4. The text would benefit from expanding, such as having inline examples.

Anything else?

@afs
Contributor

afs commented Nov 30, 2023

Related to handling characters: the TSV Media Type does not specify the character set. Nowadays, the "default" for "text/" is UTF-8, a change from the original ASCII.

We can mention this and suggest ("SHOULD") that a missing character set parameter be treated as UTF-8.
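A minimal sketch of what that suggestion could look like on the consumer side, assuming the response body arrives as bytes alongside a Content-Type header value. The function name `decode_tsv_body` is hypothetical. It uses the standard library's `email.message.Message` only to parse the `charset` parameter.

```python
from email.message import Message


def decode_tsv_body(body: bytes, content_type: str) -> str:
    """Decode a TSV response body, treating a missing charset as UTF-8."""
    msg = Message()
    msg["Content-Type"] = content_type
    # get_content_charset() returns None when no charset parameter is present.
    charset = msg.get_content_charset() or "utf-8"
    return body.decode(charset)


text = decode_tsv_body("?name\n\"Zoë\"\n".encode("utf-8"),
                       "text/tab-separated-values")
print(text)
```

An explicit `charset` parameter still wins in this sketch; whether the spec should allow anything other than UTF-8 is a separate question.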

@kasei

kasei commented Dec 1, 2023

The current spec text covers quoting and control characters "by using the syntax that SPARQL and Turtle use." (section 5.1). Hence, no raw TABs in RDF term text.

I think this one could use just a bit of nuance. There's no need for raw TABs in RDF term text, but SPARQL and Turtle do allow raw tabs in their literal syntax. The SPARQL TSV spec already has language about this, though:

A TSV format SPARQL result set must use the single quoted literal forms, together with any necessary escapes such as \t, \n and \r.

That seems clear enough to me.

Agree that inline examples would be an improvement.
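As one possible shape for such an inline example, here is a rough sketch of writing a literal so that no raw TAB, newline or carriage return ends up inside a field. The names `tsv_literal` and `tsv_row` are hypothetical, the sketch always uses the " quote, and it does not handle other control characters, language tags or datatypes.

```python
# Characters replaced by their backslash escapes inside a "..." literal.
_ESCAPES = {"\\": "\\\\", "\t": "\\t", "\n": "\\n", "\r": "\\r", '"': '\\"'}


def tsv_literal(lexical: str) -> str:
    """Write a plain literal in the "..." form with the escapes applied."""
    return '"' + "".join(_ESCAPES.get(ch, ch) for ch in lexical) + '"'


def tsv_row(terms: list[str]) -> str:
    """Join already-serialized terms with a raw TAB, the only field separator."""
    return "\t".join(terms)


row = tsv_row(["<http://example.org/s>", tsv_literal("line one\nline\ttwo")])
print(row)
```

Because the tab and newline inside the literal are written as escapes, the row contains exactly one raw TAB and stays on a single line.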

gkellogg removed the "needs discussion" label (Proposed for discussion in an upcoming meeting) Dec 14, 2023
rdfguy added the "needs discussion" label (Proposed for discussion in an upcoming meeting) Dec 14, 2023
gkellogg removed the "needs discussion" label (Proposed for discussion in an upcoming meeting) Dec 14, 2023

7 participants