VCF ambiguities #89

d-cameron · 2015-06-08T14:28:11Z

The following syntax ambiguities/parsing issues remain in the draft VCF 4.3 specs:

Non-printable ASCII characters:

U+0000 causes causes paring issue for C implementations
Allowing alternate line separators (such as RS) within text breaks text file format conversion (round-trip LF->RS->LF not possible if RS is an allowable string character)

Recommended solution:

U+0000-U+0008, U+000B-U+000C, U+000E-U+001F disallowed anywhere
Line separators MUST be CR LF or LF
CR, LF allowed only as line separators at end of line

Meta-information lines

Meta-information line format not fully defined
Example in 1.1 contains undefined headers
Allowable characters in header key not defined
contig line in Example 1.1 contains taxonomy key that is not double-quote escaped

Recommended solution:

Allow arbitrary meta-information lines
keys must be alphabetic [a-zA-Z]+
- Optional: user-defined keys start with lowercase [a-z][a-zA-Z]*
field definition optional keys (eg ID, Description, md5, url, etc) must conform to [a-zA-Z0-9_]+
-Double-quote escaping MAY be used in custom headers
-Double-quote escaping MUST be used in any field definition header whose value contains any of '= > " \ :'

Percent encoding

TAB, CR, LF and other troublesome characters are not encoded
Literal % is written to an INFO field by at least one existing SV caller

Recommended solution:

Reference RFC1738 when referring to "URL encoding"
- the following characters MUST be encoded CR LF TAB % : ; (and U+0000-U+001F if disallowed)
- Optional: require = also be encoded (slightly simpler parsing since INFO can then be split on =)
- U+0100 and above MUST NOT be encoded (since they cannot encoded in 2 hex digits).
Implementations SHOULD treat a % not followed by 2 hex digits as a literal %

contig name

<DEL> is a valid contig name
DEL is a valid ID String (which causes issues for SVs such as "A[<DEL>[" )

Recommended solution:

Valid characters should use the SAM regex of "[!-)+-<>-~][!-~]*" but restricted the following additional characters "<>[]:"
contigs names MUST NOT use a reserved symbolic alternate allele name

end of record/file

should there be a line separator at end of file?
should there be a TAB after the last record in each line?

Recommended solution:

A TAB SHOULD NOT be written after after the last record in each line.
Implementations SHOULD write a line separator after each line
Implementation MUST consider a line separator at end of file optional (including for VCFs with 0 records)

"String"

Characters allowed in a "String" field should be defined / disallowed list is not exhaustive

Recommended solution:

Regex for each field and global list of reserved characters

Float parsing

Is scientific notation allowed?
What locale should float/int <-> string parsing be performed in

Recommended solution:

All integer values are of the form "\-?[0-9]+?"
All Numeric/Floating point values are of the form "\-?[0-9]+(\.[0-9]+)?"

Misc

Section 1.4.1.8 (Data lines, Fixed fields, INFO) does not list reserved SV fields -> add reference to
Official 'Description' text for sepc-defined fields not set -> add official section/example containing all headers defined by the specs (inc INFO SV headers)

The text was updated successfully, but these errors were encountered:

pd3 · 2015-06-10T16:22:55Z

This is excellent Daniel, thank you.
+1 from me.

jmarshall · 2015-06-11T17:09:11Z

should there be a TAB after the last record in each line?

No, and this is already mostly-adequately described: a final TAB in a line would introduce a final empty field, and those are disallowed by §1. This could be clarified by describing the lines as tab-separated rather than tab-delimited. So in §1.3 and §1.4.1:

-The header line is tab-delimited.
+The fields of the header line are separated by TAB characters.
@@ ...
-All data lines are tab-delimited.
+The fields of data lines are separated by TAB characters.

ghost · 2015-06-12T08:04:50Z

Float parsing

Is scientific notation allowed?

All Numeric/Floating point values are of the form "-?[0-9]+(.[0-9]+)?"

To me it doesn't seem necessary to limit a float to the regex above. AFAIK parsing scientific notation isn't a problem so either notation should be valid. Personally I'm thinking of VCF files generated by algorithms that may need to output several very small numbers (admittedly the preceding 0's would compress). What is the reasoning behind limiting floating point numbers to full notation?

pd3 · 2015-06-12T08:28:28Z

@drjsanger This escaped my attention, sorry. I also agree that it must be possible to express floating point numbers using the scientific notation, for example 1e-10 should be valid.
This regular expression should allow that '-?([0-9]+)?.?[0-9]*([eE][+-][0-9]+)?'. Note that by this we also disallow NaN, inf and similar.

jmarshall · 2015-06-12T08:50:28Z

Re float parsing, the regexp used in the SAM spec is [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?, which better represents real-world desirable +/- usage. It's unclear to me why we would disallow NaNs and infinities though — they are explicitly valid in BCF.

An alternative approach to saying what text floats should look like is to refer to the C or Java definitions of the relevant functions. So we could say something to the effect of: float values contain no white space but otherwise are as parsed by C's atof() in the default "C" locale or equivalently by Java's Float.valueOf() without any FloatTypeSuffix, perhaps except that binary/hex representations are disallowed. Admittedly there's a bit much "without this bit" in this approach, but it has the advantage over the regexp of saying what the digit string is supposed to mean…

bhandsaker · 2015-06-12T12:34:50Z

I think NaN and +/-Inf should definitely be allowed as float values.

I'm a little unclear about the issue for contig names, but I think contig names should not be allowed to look like symbolic alleles (starting with "<" and ending with ">") to avoid ambiguities. This seems like a reasonable restriction to me on contig names. I don't think it is reasonable to think that we can pre-define all important/reasonable/useful symbolic alleles. Symbolic alleles give us a way (without going outside the specification) to experiment with how to represent complex structural haplotypes.

ghost · 2015-06-12T12:47:59Z

I have to agree, I think NaN and +/-Inf should be valid as float values.

pd3 · 2015-06-12T13:00:21Z

@drjsanger @bhandsaker I remember there was an informal discussion long time ago that +/-inf can be replaced by reasonably big floats, but I am happy with +/-inf as well.
However, what is the use case for NaN? We already have a missing value ".", parsers would have to check for both NaN and "."

bhandsaker · 2015-06-12T13:10:52Z

NaN is a valid float value. Depending on the API, a missing value "." or just no attribute present may be represented differently than returning a float. Also in some cases there may be difference between missing/absent (i.e. test was not performed) or NaN (test was performed but did not produce a valid value). Note that BCF distinguishes between NaN and missing (signaling NaN is used to store missing, quiet NaN to store a true NaN value).

d-cameron · 2015-06-15T01:11:55Z

As long as locale is explicitly specified, referring to another language specification seems to be the best way to fully define support. atof() might not be the best example as we'd want to disallow both leading whitespace, trailing garbage, alternate locales. I presume base 10 encoding is preferable for human readability purposes. Does it make sense to define a minimum round-trip fp precision? I've had errors introduced by tools truncating to 2dp, but that might be outside the scope of the specs.

d-cameron · 2015-06-15T01:32:55Z

@bhandsaker I was under the impression that complex structural variants were intended to be represented as any number of breakend variants all sharing matching EVENT identifier, with the predefined symbolic alleles used for common local SVs.

bhandsaker · 2015-06-15T12:27:13Z

I believe breakends are better for describing the kinds of structural rearrangements one finds in cancer genomes. Our lab focuses more on complex common SVs, such as 17q21.31 http://www.ncbi.nlm.nih.gov/pubmed/22751096 and similar sorts of situations where we might want to represent shared haplotypes in the population but may not have the level of specificity implicit in or required by the breakend notation.

I'm not suggesting any specific changes to the spec, just that we don't add language that would make a VCF invalid if it contains a new symbolic allele (although all symbolic alleles SHOULD be defined in the header).

d-cameron · 2015-06-15T15:59:12Z

@bhandsaker that in itself is a change to the v4.3 spec as the draft currently has a pre-defined fixed set of top-level SV allele as opposed to the open-ended style of v4.2. It seems like what you want to achieve can be resolved by reverting to allowing symbolic allele of any name, forcing interpretation of any reference matching a symbolic allele name as a reference to the symbolic allele and not a named contig with the same name.

It's looks like more than just the ALT header keys that would need changing for properly modelling complex SVs. SVTYPE is currently restricted to DEL, INS, DUP, INV, CNV, BND, and complex SVs don't fit neatly into any of those, although you could probably hack 17q21.31 into a named CNV allele. Does your use case involve representing such common variants as single-line named SVs at a nominal haplotype position, or is it something else?

On a related note, the specs still seem unclear as to whether alleles such as TTTA<DEL>, or AA<namedcontig1>TAAA<namedcontig2>NNN are allowed valid allele.

bhandsaker · 2015-06-15T16:12:58Z

@d-cameron @bhandsaker that in itself is a change to the v4.3 spec as the draft currently has a pre-defined fixed set of top-level SV allele as opposed to the open-ended style of v4.2.
Can you point me to where this is changed so that symbolic alleles have to come from a pre-defined fixed set?

d-cameron · 2015-06-16T01:42:12Z

Line 146-153 of the initial VCF 4.3 branch change set af41ed9

bhandsaker · 2015-06-16T01:55:51Z

Oh. heh heh. It's possible that I wrote that actually as part of the original SV specification.
Well, I don't think it's worth loosening the specification. Sorry for the confusion.

pd3 · 2015-06-22T14:12:40Z

@d-cameron The RFC1738 is primarily about URLs and most of the unsafe characters relevant to URLs do not apply to VCF. Wouldn't it be better to give an explicit list of characters that can be encoded?

%3a  ..  :  (colon)  
%3b  ..  ;  (semicolon)
%3d  ..  = (equal sign)
%25  ..  % (percent sign)
%2c  ..  ,  (comma)
%0d  .. CR
%0a  .. LF
%09  .. TAB

The percent sign should be always written as %25, I think.

EDIT: added TAB to the list

d-cameron · 2015-06-23T03:49:27Z

I'd be happy with an explicit list of characters, although your list is
missing TAB.

I've used RFC-style verbs MUST and SHOULD to indicate things that
implementations need to implement, and things it would be nice to have. A
percent sign MUST be encoded, but if a VCF is encountered in which it is
not, then it would be nice for the implementation to fall back to literal
%. The idea behind this is that it would allow an implementation to use the
4.3 decoding algorithm on earlier file versions which allow a literal % in
the INFO field without dying. Happy to have omit that section to prevent
confusion.

On Tue, Jun 23, 2015 at 12:12 AM, pd3 notifications@github.com wrote:

@d-cameron https://github.com/d-cameron The RFC1738 is primarily about
URLs and most of the unsafe characters relevant to URLs do not apply to
VCF. Wouldn't it be better to give an explicit list of characters that can
be encoded?

%3a .. : (colon)
%3b .. ; (semicolon)
%3d .. = (equal sign)
%25 .. % (percent sign)
%2c .. , (comma)
%0d .. CR
%0a .. LF

The percent sign should be always written as %25, I think.

—
Reply to this email directly or view it on GitHub
#89 (comment).

pd3 · 2015-06-23T08:11:06Z

Ah, sorry, added TAB to the list. What others think about the literal % sign?

bhandsaker · 2015-06-23T16:15:25Z

I guess you mean the proposal to fall back to treating % literally
(presumably if it is not followed by a digit).
I think it's OK, but another possibility is to just trust the file
header. If you want the old behavior, set your version to 4.2.

On 6/23/15 4:11 AM, pd3 wrote:

Ah, sorry, added TAB to the list. What others think about the literal
% sign?

—
Reply to this email directly or view it on GitHub
#89 (comment).

cwhelan · 2016-05-24T15:13:55Z

+1 to the section on Meta-information lines and quote-escaping

jmarshall · 2019-01-18T10:41:41Z

FYI re #89 (comment) and the following comments on NaNs and infinities: the 4.3 spec was modified to allow them in db372bc later that June, but that was never mentioned in this discussion.

The text added is that Float values are formatted “to match the regular expression ^…numeric…$, NaN, or +/-Inf”. It is not explicit whether NaN and Inf are intended to be case sensitive, but as they are adjacent to a regexp saying …[eE]… the reader's best guess is probably that they are intended to specify that exact capitalisation.

C's printf can output nan/NAN and [-]inf/[-]INF or [-]infinity/[-]INFINITY (there is no programmer control over whether infinity is 3 or 8 letters), while its scanf understands all of those completely case-insensitively. Meanwhile Java's Double.valueOf accepts only [+-]?NaN and [+-]?Infinity spelt and capitalised exactly thus.

Thus the VCF specification's NaN and +/-Inf are difficult to output in C and its +/-Inf is difficult to parse or output in Java. See also samtools/bcftools#755.

IMHO the right way to fix this in the specification is to apply Postel's law and change that text to match what scanf accepts (i.e., {+,-,}{NAN,INF,INFINITY} case-insensitively).

d-cameron mentioned this issue Jun 8, 2015

VCFv4.3 - first batch of changes #88

Merged

pd3 added the vcf label Jun 11, 2015

pd3 mentioned this issue Jun 11, 2015

Questions raised by the proposed VCFv4.3 encoding changes #65

Closed

dmyersturnbull mentioned this issue Jan 21, 2016

Character escaping in VCF #125

Open

cyenyxe mentioned this issue Mar 29, 2016

SVTYPE description for nested values and multi-allelic variants #142

Open

cwhelan mentioned this issue May 24, 2016

Update VCF header line parser to handle escaping quotes and backslashes samtools/htsjdk#489

Merged

tobiasrausch mentioned this issue Oct 21, 2016

Deviation from vcf spec dellytools/delly#59

Closed

d-cameron mentioned this issue Sep 18, 2017

Made significant updates to Section 3, INFO keys used for structural … #231

Closed

jmarshall mentioned this issue Jan 18, 2019

bcftools norm NaN > nan samtools/bcftools#755

Closed

This was referenced May 8, 2019

Tolerate lower-case nans in QUAL samtools/htsjdk#1364

Merged

Allow C and Java native text spellings of NaN and infinities #409

Merged

jmarshall mentioned this issue Mar 27, 2023

vcf: Contig name rules contradiction #711

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

VCF ambiguities #89

VCF ambiguities #89

d-cameron commented Jun 8, 2015

pd3 commented Jun 10, 2015

jmarshall commented Jun 11, 2015

ghost commented Jun 12, 2015

pd3 commented Jun 12, 2015

jmarshall commented Jun 12, 2015

bhandsaker commented Jun 12, 2015

ghost commented Jun 12, 2015

pd3 commented Jun 12, 2015

bhandsaker commented Jun 12, 2015

d-cameron commented Jun 15, 2015

d-cameron commented Jun 15, 2015

bhandsaker commented Jun 15, 2015

d-cameron commented Jun 15, 2015

bhandsaker commented Jun 15, 2015

d-cameron commented Jun 16, 2015

bhandsaker commented Jun 16, 2015

pd3 commented Jun 22, 2015

d-cameron commented Jun 23, 2015

pd3 commented Jun 23, 2015

bhandsaker commented Jun 23, 2015

cwhelan commented May 24, 2016

jmarshall commented Jan 18, 2019

VCF ambiguities #89

VCF ambiguities #89

Comments

d-cameron commented Jun 8, 2015

Non-printable ASCII characters:

Meta-information lines

Percent encoding

contig name

end of record/file

"String"

Float parsing

Misc

pd3 commented Jun 10, 2015

jmarshall commented Jun 11, 2015

ghost commented Jun 12, 2015

pd3 commented Jun 12, 2015

jmarshall commented Jun 12, 2015

bhandsaker commented Jun 12, 2015

ghost commented Jun 12, 2015

pd3 commented Jun 12, 2015

bhandsaker commented Jun 12, 2015

d-cameron commented Jun 15, 2015

d-cameron commented Jun 15, 2015

bhandsaker commented Jun 15, 2015

d-cameron commented Jun 15, 2015

bhandsaker commented Jun 15, 2015

d-cameron commented Jun 16, 2015

bhandsaker commented Jun 16, 2015

pd3 commented Jun 22, 2015

d-cameron commented Jun 23, 2015

pd3 commented Jun 23, 2015

bhandsaker commented Jun 23, 2015

cwhelan commented May 24, 2016

jmarshall commented Jan 18, 2019