Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF ambiguities #89

Open
d-cameron opened this issue Jun 8, 2015 · 22 comments
Open

VCF ambiguities #89

d-cameron opened this issue Jun 8, 2015 · 22 comments
Labels

Comments

@d-cameron
Copy link
Contributor

The following syntax ambiguities/parsing issues remain in the draft VCF 4.3 specs:

Non-printable ASCII characters:

  • U+0000 causes causes paring issue for C implementations
  • Allowing alternate line separators (such as RS) within text breaks text file format conversion (round-trip LF->RS->LF not possible if RS is an allowable string character)

Recommended solution:

  • U+0000-U+0008, U+000B-U+000C, U+000E-U+001F disallowed anywhere
  • Line separators MUST be CR LF or LF
  • CR, LF allowed only as line separators at end of line

Meta-information lines

  • Meta-information line format not fully defined
  • Example in 1.1 contains undefined headers
  • Allowable characters in header key not defined
  • contig line in Example 1.1 contains taxonomy key that is not double-quote escaped

Recommended solution:

  • Allow arbitrary meta-information lines
  • keys must be alphabetic [a-zA-Z]+
    • Optional: user-defined keys start with lowercase [a-z][a-zA-Z]*
  • field definition optional keys (eg ID, Description, md5, url, etc) must conform to [a-zA-Z0-9_]+
    -Double-quote escaping MAY be used in custom headers
    -Double-quote escaping MUST be used in any field definition header whose value contains any of '= > " \ :'

Percent encoding

  • TAB, CR, LF and other troublesome characters are not encoded
  • Literal % is written to an INFO field by at least one existing SV caller

Recommended solution:

  • Reference RFC1738 when referring to "URL encoding"
    • the following characters MUST be encoded CR LF TAB % : ; (and U+0000-U+001F if disallowed)
    • Optional: require = also be encoded (slightly simpler parsing since INFO can then be split on =)
    • U+0100 and above MUST NOT be encoded (since they cannot encoded in 2 hex digits).
  • Implementations SHOULD treat a % not followed by 2 hex digits as a literal %

contig name

  • <DEL> is a valid contig name
  • DEL is a valid ID String (which causes issues for SVs such as "A[<DEL>[" )

Recommended solution:

  • Valid characters should use the SAM regex of "[!-)+-<>-~][!-~]*" but restricted the following additional characters "<>[]:"
  • contigs names MUST NOT use a reserved symbolic alternate allele name

end of record/file

  • should there be a line separator at end of file?
  • should there be a TAB after the last record in each line?

Recommended solution:

  • A TAB SHOULD NOT be written after after the last record in each line.
  • Implementations SHOULD write a line separator after each line
  • Implementation MUST consider a line separator at end of file optional (including for VCFs with 0 records)

"String"

  • Characters allowed in a "String" field should be defined / disallowed list is not exhaustive

Recommended solution:

  • Regex for each field and global list of reserved characters

Float parsing

  • Is scientific notation allowed?
  • What locale should float/int <-> string parsing be performed in

Recommended solution:

  • All integer values are of the form "\-?[0-9]+?"
  • All Numeric/Floating point values are of the form "\-?[0-9]+(\.[0-9]+)?"

Misc

  • Section 1.4.1.8 (Data lines, Fixed fields, INFO) does not list reserved SV fields -> add reference to
  • Official 'Description' text for sepc-defined fields not set -> add official section/example containing all headers defined by the specs (inc INFO SV headers)
@pd3
Copy link
Member

pd3 commented Jun 10, 2015

This is excellent Daniel, thank you.
+1 from me.

@jmarshall
Copy link
Member

should there be a TAB after the last record in each line?

No, and this is already mostly-adequately described: a final TAB in a line would introduce a final empty field, and those are disallowed by §1. This could be clarified by describing the lines as tab-separated rather than tab-delimited. So in §1.3 and §1.4.1:

-The header line is tab-delimited.
+The fields of the header line are separated by TAB characters.
@@ ...
-All data lines are tab-delimited.
+The fields of data lines are separated by TAB characters.

@ghost
Copy link

ghost commented Jun 12, 2015

Float parsing

Is scientific notation allowed?

All Numeric/Floating point values are of the form "-?[0-9]+(.[0-9]+)?"

To me it doesn't seem necessary to limit a float to the regex above. AFAIK parsing scientific notation isn't a problem so either notation should be valid. Personally I'm thinking of VCF files generated by algorithms that may need to output several very small numbers (admittedly the preceding 0's would compress). What is the reasoning behind limiting floating point numbers to full notation?

@pd3
Copy link
Member

pd3 commented Jun 12, 2015

@drjsanger This escaped my attention, sorry. I also agree that it must be possible to express floating point numbers using the scientific notation, for example 1e-10 should be valid.
This regular expression should allow that '-?([0-9]+)?.?[0-9]*([eE][+-][0-9]+)?'. Note that by this we also disallow NaN, inf and similar.

@jmarshall
Copy link
Member

Re float parsing, the regexp used in the SAM spec is [-+]?[0-9]*\.?[0-9]+([eE][-+]?[0-9]+)?, which better represents real-world desirable +/- usage. It's unclear to me why we would disallow NaNs and infinities though — they are explicitly valid in BCF.

An alternative approach to saying what text floats should look like is to refer to the C or Java definitions of the relevant functions. So we could say something to the effect of: float values contain no white space but otherwise are as parsed by C's atof() in the default "C" locale or equivalently by Java's Float.valueOf() without any FloatTypeSuffix, perhaps except that binary/hex representations are disallowed. Admittedly there's a bit much "without this bit" in this approach, but it has the advantage over the regexp of saying what the digit string is supposed to mean…

@bhandsaker
Copy link

I think NaN and +/-Inf should definitely be allowed as float values.

I'm a little unclear about the issue for contig names, but I think contig names should not be allowed to look like symbolic alleles (starting with "<" and ending with ">") to avoid ambiguities. This seems like a reasonable restriction to me on contig names. I don't think it is reasonable to think that we can pre-define all important/reasonable/useful symbolic alleles. Symbolic alleles give us a way (without going outside the specification) to experiment with how to represent complex structural haplotypes.

@ghost
Copy link

ghost commented Jun 12, 2015

I have to agree, I think NaN and +/-Inf should be valid as float values.

@pd3
Copy link
Member

pd3 commented Jun 12, 2015

@drjsanger @bhandsaker I remember there was an informal discussion long time ago that +/-inf can be replaced by reasonably big floats, but I am happy with +/-inf as well.
However, what is the use case for NaN? We already have a missing value ".", parsers would have to check for both NaN and "."

@bhandsaker
Copy link

NaN is a valid float value. Depending on the API, a missing value "." or just no attribute present may be represented differently than returning a float. Also in some cases there may be difference between missing/absent (i.e. test was not performed) or NaN (test was performed but did not produce a valid value). Note that BCF distinguishes between NaN and missing (signaling NaN is used to store missing, quiet NaN to store a true NaN value).

@d-cameron
Copy link
Contributor Author

As long as locale is explicitly specified, referring to another language specification seems to be the best way to fully define support. atof() might not be the best example as we'd want to disallow both leading whitespace, trailing garbage, alternate locales. I presume base 10 encoding is preferable for human readability purposes. Does it make sense to define a minimum round-trip fp precision? I've had errors introduced by tools truncating to 2dp, but that might be outside the scope of the specs.

@d-cameron
Copy link
Contributor Author

@bhandsaker I was under the impression that complex structural variants were intended to be represented as any number of breakend variants all sharing matching EVENT identifier, with the predefined symbolic alleles used for common local SVs.

@bhandsaker
Copy link

I believe breakends are better for describing the kinds of structural rearrangements one finds in cancer genomes. Our lab focuses more on complex common SVs, such as 17q21.31 http://www.ncbi.nlm.nih.gov/pubmed/22751096 and similar sorts of situations where we might want to represent shared haplotypes in the population but may not have the level of specificity implicit in or required by the breakend notation.

I'm not suggesting any specific changes to the spec, just that we don't add language that would make a VCF invalid if it contains a new symbolic allele (although all symbolic alleles SHOULD be defined in the header).

@d-cameron
Copy link
Contributor Author

@bhandsaker that in itself is a change to the v4.3 spec as the draft currently has a pre-defined fixed set of top-level SV allele as opposed to the open-ended style of v4.2. It seems like what you want to achieve can be resolved by reverting to allowing symbolic allele of any name, forcing interpretation of any reference matching a symbolic allele name as a reference to the symbolic allele and not a named contig with the same name.

It's looks like more than just the ALT header keys that would need changing for properly modelling complex SVs. SVTYPE is currently restricted to DEL, INS, DUP, INV, CNV, BND, and complex SVs don't fit neatly into any of those, although you could probably hack 17q21.31 into a named CNV allele. Does your use case involve representing such common variants as single-line named SVs at a nominal haplotype position, or is it something else?

On a related note, the specs still seem unclear as to whether alleles such as TTTA<DEL>, or AA<namedcontig1>TAAA<namedcontig2>NNN are allowed valid allele.

@bhandsaker
Copy link

@d-cameron @bhandsaker that in itself is a change to the v4.3 spec as the draft currently has a pre-defined fixed set of top-level SV allele as opposed to the open-ended style of v4.2.
Can you point me to where this is changed so that symbolic alleles have to come from a pre-defined fixed set?

@d-cameron
Copy link
Contributor Author

Line 146-153 of the initial VCF 4.3 branch change set af41ed9

@bhandsaker
Copy link

Oh. heh heh. It's possible that I wrote that actually as part of the original SV specification.
Well, I don't think it's worth loosening the specification. Sorry for the confusion.

@pd3
Copy link
Member

pd3 commented Jun 22, 2015

@d-cameron The RFC1738 is primarily about URLs and most of the unsafe characters relevant to URLs do not apply to VCF. Wouldn't it be better to give an explicit list of characters that can be encoded?

%3a  ..  :  (colon)  
%3b  ..  ;  (semicolon)
%3d  ..  = (equal sign)
%25  ..  % (percent sign)
%2c  ..  ,  (comma)
%0d  .. CR
%0a  .. LF
%09  .. TAB

The percent sign should be always written as %25, I think.

EDIT: added TAB to the list

@d-cameron
Copy link
Contributor Author

I'd be happy with an explicit list of characters, although your list is
missing TAB.

I've used RFC-style verbs MUST and SHOULD to indicate things that
implementations need to implement, and things it would be nice to have. A
percent sign MUST be encoded, but if a VCF is encountered in which it is
not, then it would be nice for the implementation to fall back to literal
%. The idea behind this is that it would allow an implementation to use the
4.3 decoding algorithm on earlier file versions which allow a literal % in
the INFO field without dying. Happy to have omit that section to prevent
confusion.

On Tue, Jun 23, 2015 at 12:12 AM, pd3 notifications@github.com wrote:

@d-cameron https://github.com/d-cameron The RFC1738 is primarily about
URLs and most of the unsafe characters relevant to URLs do not apply to
VCF. Wouldn't it be better to give an explicit list of characters that can
be encoded?

%3a .. : (colon)
%3b .. ; (semicolon)
%3d .. = (equal sign)
%25 .. % (percent sign)
%2c .. , (comma)
%0d .. CR
%0a .. LF

The percent sign should be always written as %25, I think.


Reply to this email directly or view it on GitHub
#89 (comment).

@pd3
Copy link
Member

pd3 commented Jun 23, 2015

Ah, sorry, added TAB to the list. What others think about the literal % sign?

@bhandsaker
Copy link

I guess you mean the proposal to fall back to treating % literally
(presumably if it is not followed by a digit).
I think it's OK, but another possibility is to just trust the file
header. If you want the old behavior, set your version to 4.2.

On 6/23/15 4:11 AM, pd3 wrote:

Ah, sorry, added TAB to the list. What others think about the literal
% sign?


Reply to this email directly or view it on GitHub
#89 (comment).

@cwhelan
Copy link

cwhelan commented May 24, 2016

+1 to the section on Meta-information lines and quote-escaping

@jmarshall
Copy link
Member

FYI re #89 (comment) and the following comments on NaNs and infinities: the 4.3 spec was modified to allow them in db372bc later that June, but that was never mentioned in this discussion.

The text added is that Float values are formatted “to match the regular expression ^…numeric…$, NaN, or +/-Inf”. It is not explicit whether NaN and Inf are intended to be case sensitive, but as they are adjacent to a regexp saying …[eE]… the reader's best guess is probably that they are intended to specify that exact capitalisation.

C's printf can output nan/NAN and [-]inf/[-]INF or [-]infinity/[-]INFINITY (there is no programmer control over whether infinity is 3 or 8 letters), while its scanf understands all of those completely case-insensitively. Meanwhile Java's Double.valueOf accepts only [+-]?NaN and [+-]?Infinity spelt and capitalised exactly thus.

Thus the VCF specification's NaN and +/-Inf are difficult to output in C and its +/-Inf is difficult to parse or output in Java. See also samtools/bcftools#755.

IMHO the right way to fix this in the specification is to apply Postel's law and change that text to match what scanf accepts (i.e., {+,-,}{NAN,INF,INFINITY} case-insensitively).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants