VCF ambiguities #89
Comments
This is excellent Daniel, thank you. |
No, and this is already mostly-adequately described: a final TAB in a line would introduce a final empty field, and those are disallowed by §1. This could be clarified by describing the lines as tab-separated rather than tab-delimited. So in §1.3 and §1.4.1:
-The header line is tab-delimited.
+The fields of the header line are separated by TAB characters.
@@ ...
-All data lines are tab-delimited.
+The fields of data lines are separated by TAB characters. |
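To illustrate the tab-separated reading above, here is a minimal Python sketch showing that a trailing TAB yields a final empty field. The helper names `split_fields` and `has_empty_field` are hypothetical and not part of any VCF library.

```python
def split_fields(line: str) -> list[str]:
    """Split one header or data line into fields on TAB characters."""
    # Strip only the line terminator, never TABs, so empty fields stay visible.
    return line.rstrip("\r\n").split("\t")


def has_empty_field(line: str) -> bool:
    """True if any field is empty, e.g. after a leading, doubled or trailing TAB."""
    return any(field == "" for field in split_fields(line))
```

Under this reading a line ending in TAB is rejected for the same reason as any other empty field, so no separate rule for trailing TABs is needed.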
To me it doesn't seem necessary to limit a float to the regex above. AFAIK parsing scientific notation isn't a problem so either notation should be valid. Personally I'm thinking of VCF files generated by algorithms that may need to output several very small numbers (admittedly the leading zeros would compress). What is the reasoning behind limiting floating point numbers to full notation? |
@drjsanger This escaped my attention, sorry. I also agree that it must be possible to express floating point numbers using the scientific notation, for example |
Re float parsing, the regexp used in the SAM spec is An alternative approach to saying what text floats should look like is to refer to the C or Java definitions of the relevant functions. So we could say something to the effect of: float values contain no white space but otherwise are as parsed by C's atof() in the default "C" locale or equivalently by Java's Float.valueOf() without any FloatTypeSuffix, perhaps except that binary/hex representations are disallowed. Admittedly there's a bit much "without this bit" in this approach, but it has the advantage over the regexp of saying what the digit string is supposed to mean… |
I think NaN and +/-Inf should definitely be allowed as float values. I'm a little unclear about the issue for contig names, but I think contig names should not be allowed to look like symbolic alleles (starting with "<" and ending with ">") to avoid ambiguities. This seems like a reasonable restriction to me on contig names. I don't think it is reasonable to think that we can pre-define all important/reasonable/useful symbolic alleles. Symbolic alleles give us a way (without going outside the specification) to experiment with how to represent complex structural haplotypes. |
I have to agree, I think NaN and +/-Inf should be valid as float values. |
@drjsanger @bhandsaker I remember there was an informal discussion a long time ago that +/-inf can be replaced by reasonably big floats, but I am happy with +/-inf as well. |
NaN is a valid float value. Depending on the API, a missing value "." or just no attribute present may be represented differently than returning a float. Also in some cases there may be a difference between missing/absent (i.e. test was not performed) and NaN (test was performed but did not produce a valid value). Note that BCF distinguishes between NaN and missing (signaling NaN is used to store missing, quiet NaN to store a true NaN value). |
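The BCF distinction between missing and NaN can be sketched at the bit level. The two constants below are the 32-bit patterns that, to my knowledge, the BCF2 encoding reserves for "missing" and "end of vector"; treat them as assumptions and check them against the BCF specification. The function name `classify_bcf_float` is hypothetical.

```python
import struct

# Assumed reserved bit patterns (signaling NaNs) in the BCF2 float encoding.
BCF_FLOAT_MISSING = 0x7F800001
BCF_FLOAT_VECTOR_END = 0x7F800002


def classify_bcf_float(raw: bytes) -> str:
    """Classify a 4-byte little-endian float by its bit pattern.

    The comparison is done on the raw integer bits because converting a
    signaling NaN to a native float may silently turn it into a quiet NaN.
    """
    (bits,) = struct.unpack("<I", raw)
    if bits == BCF_FLOAT_MISSING:
        return "missing"
    if bits == BCF_FLOAT_VECTOR_END:
        return "end-of-vector"
    exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    if exponent == 0xFF and mantissa != 0:
        return "nan"
    return "value"
```

A reader that decodes floats first and tests for NaN afterwards would lose this distinction, which is why the sketch never calls a float conversion.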
As long as locale is explicitly specified, referring to another language specification seems to be the best way to fully define support. atof() might not be the best example as we'd want to disallow leading whitespace, trailing garbage, and alternate locales. I presume base 10 encoding is preferable for human readability purposes. Does it make sense to define a minimum round-trip fp precision? I've had errors introduced by tools truncating to 2dp, but that might be outside the scope of the specs. |
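One way to picture the constraints raised so far (scientific notation, NaN and infinities allowed; whitespace, trailing garbage and hex forms rejected) is a strict wrapper around a language's own float parser. The following Python sketch is only an illustration: the grammar in `_FLOAT_RE` and the name `parse_vcf_float` are assumptions, not text from any version of the specification.

```python
import re
from typing import Optional

# Assumed grammar: optional sign, then either base-10 digits with an optional
# fraction and exponent, or one of the tokens inf / infinity / nan (any case).
_FLOAT_RE = re.compile(
    r"[-+]?(?:(?:[0-9]+\.?[0-9]*|\.[0-9]+)(?:e[-+]?[0-9]+)?|inf(?:inity)?|nan)",
    re.IGNORECASE,
)


def parse_vcf_float(text: str) -> Optional[float]:
    """Parse a float field strictly; '.' is returned as None (missing)."""
    if text == ".":
        return None
    # fullmatch rejects leading/trailing whitespace, hex forms, digit
    # separators and any trailing garbage before float() ever sees the text.
    if _FLOAT_RE.fullmatch(text) is None:
        raise ValueError(f"not a valid float field: {text!r}")
    return float(text)
```

Validating with an explicit pattern first keeps the accepted syntax independent of locale and of whatever extras the host language's parser happens to tolerate.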
@bhandsaker I was under the impression that complex structural variants were intended to be represented as any number of breakend variants all sharing a matching EVENT identifier, with the predefined symbolic alleles used for common local SVs. |
I believe breakends are better for describing the kinds of structural rearrangements one finds in cancer genomes. Our lab focuses more on complex common SVs, such as 17q21.31 http://www.ncbi.nlm.nih.gov/pubmed/22751096 and similar sorts of situations where we might want to represent shared haplotypes in the population but may not have the level of specificity implicit in or required by the breakend notation. I'm not suggesting any specific changes to the spec, just that we don't add language that would make a VCF invalid if it contains a new symbolic allele (although all symbolic alleles SHOULD be defined in the header). |
@bhandsaker that in itself is a change to the v4.3 spec as the draft currently has a pre-defined fixed set of top-level SV alleles as opposed to the open-ended style of v4.2. It seems like what you want to achieve can be resolved by reverting to allowing symbolic alleles of any name, forcing interpretation of any reference matching a symbolic allele name as a reference to the symbolic allele and not a named contig with the same name. It looks like more than just the ALT header keys would need changing to properly model complex SVs. SVTYPE is currently restricted to DEL, INS, DUP, INV, CNV, BND, and complex SVs don't fit neatly into any of those, although you could probably hack 17q21.31 into a named CNV allele. Does your use case involve representing such common variants as single-line named SVs at a nominal haplotype position, or is it something else? On a related note, the specs still seem unclear as to whether alleles such as |
@d-cameron |
Lines 146-153 of the initial VCF 4.3 branch change set af41ed9 |
Oh. heh heh. It's possible that I wrote that actually as part of the original SV specification. |
@d-cameron The RFC1738 is primarily about URLs and most of the unsafe characters relevant to URLs do not apply to VCF. Wouldn't it be better to give an explicit list of characters that can be encoded?
The percent sign should be always written as %25, I think. EDIT: added TAB to the list |
I'd be happy with an explicit list of characters, although your list is I've used RFC-style verbs MUST and SHOULD to indicate things that On Tue, Jun 23, 2015 at 12:12 AM, pd3 notifications@github.com wrote:
|
Ah, sorry, added TAB to the list. What do others think about the literal % sign? |
I guess you mean the proposal to fall back to treating % literally On 6/23/15 4:11 AM, pd3 wrote:
|
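The percent-encoding proposal in the comments above can be sketched as follows. The set of characters to encode is an assumption pieced together from this thread (CR, LF, TAB, %, :, ; and =), and the function names are hypothetical. The decoder follows the fallback idea of treating a % that is not followed by two hex digits as a literal %.

```python
import re

# Assumed set of characters that are always written in encoded form.
_ENCODE = {c: f"%{ord(c):02X}" for c in "%:;=\r\n\t"}
_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def percent_encode(value: str) -> str:
    """Encode only the reserved characters; everything else is left untouched."""
    return "".join(_ENCODE.get(c, c) for c in value)


def percent_decode(value: str) -> str:
    """Decode %XX sequences in a single pass.

    A '%' not followed by two hex digits does not match the pattern and is
    therefore kept as a literal '%'.
    """
    return _PCT.sub(lambda m: chr(int(m.group(1), 16)), value)
```

Because decoding is a single pass, an encoded percent sign such as `%2541` decodes to `%41` rather than being decoded a second time.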
+1 to the section on Meta-information lines and quote-escaping |
FYI re #89 (comment) and the following comments on NaNs and infinities: the 4.3 spec was modified to allow them in db372bc later that June, but that was never mentioned in this discussion. The text added is that Float values are formatted “to match the regular expression C's Thus the VCF specification's IMHO the right way to fix this in the specification is to apply Postel's law and change that text to match what |
The following syntax ambiguities/parsing issues remain in the draft VCF 4.3 specs:
Non-printable ASCII characters:
Recommended solution:
Meta-information lines
Recommended solution:
-Double-quote escaping MAY be used in custom headers
-Double-quote escaping MUST be used in any field definition header whose value contains any of '= > " \ :'
Percent encoding
Recommended solution:
CR LF TAB % : ;
(and U+0000-U+001F if disallowed)=
)%
not followed by 2 hex digits as a literal%
contig name
<DEL>
is a valid contig name"A[<DEL>["
Recommended solution:
"[!-)+-<>-~][!-~]*"
but restricted the following additional characters "<>[]:"
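As a rough illustration of the contig-name rule above, the Python sketch below combines the quoted regular expression with the extra restricted characters. It assumes that "<>[]:" are forbidden anywhere in the name, which is one reading of the recommendation rather than confirmed spec text. The name `is_valid_contig_name` is hypothetical.

```python
import re

# Base pattern quoted in the issue: printable ASCII, first character not '*' or '='.
_BASE = re.compile(r"[!-)+-<>-~][!-~]*")
# Assumption: these characters are additionally disallowed at any position,
# so a contig name can never be confused with a symbolic allele or breakend.
_EXTRA_FORBIDDEN = set("<>[]:")


def is_valid_contig_name(name: str) -> bool:
    """Check a contig name against the base pattern and the extra restrictions."""
    if _BASE.fullmatch(name) is None:
        return False
    return not (_EXTRA_FORBIDDEN & set(name))
```

With this check, `<DEL>` is rejected as a contig name, so a breakend such as `A[<DEL>[` can only refer to a symbolic allele.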
end of record/file
Recommended solution:
"String"
Recommended solution:
Float parsing
Recommended solution:
"\-?[0-9]+?"
"\-?[0-9]+(\.[0-9]+)?"
Misc