Proposed updates to VCF and BCF specification

pdpriv edited this page Jul 17, 2014 · 18 revisions

Updates to VCFv4.2 / VCFv4.3

Proposed

  • Meta-information header lines can include an optional "Units" field to declare the unit convention used for the tag. The supported units are Phred, Log and Plain (not discussed yet).

##INFO=<ID=PL,Number=G,Type=Integer,Description="...",Units=Phred>

  • Explicitly define the range of allowable characters for all fields. Many fields are defined only by the characters that are not allowed, without explicitly stating the range of allowed characters. This makes some fields extremely difficult to implement in a compliant manner. For example, C/C++ programs cannot use standard string-handling libraries because ASCII character 0 is a valid character for most fields.

  • Refine the VCF format using a formal grammar, see also issue 28 and this post.

Accepted

  • The ##SAMPLE field can contain an optional DOI URL for the source data file:

##SAMPLE=<ID=sample,Genomes=G1_ID;G2_ID; ...;GK_ID,Mixture=N1;N2; ...;NK,Description=S1;S2;...;SK,DOI=url>

  • A VCF-compliant implementation must support both the LF ("\n") and CR+LF ("\r\n") newline conventions. (See discussion.)

  • INFO and FORMAT tag names must match the regular expression ^[A-Za-z_][0-9A-Za-z_.]*$

  • Allow space characters in INFO field values and provide a way to escape reserved characters. See the original proposal followed by a long discussion here and here and here. The proposed text is:

Space characters are allowed in INFO field values. Characters with special meaning (e.g. ';' in INFO, ':' in FORMAT, and '%' in both) can be encoded using URL encoding
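The escaping rule in the proposed text can be sketched in Python as follows. This is only an illustration: the function names are hypothetical, and the reserved set is an assumption (the proposal names ';' and '%' for INFO as examples; '=' and ',' are added here because they separate keys and list values in INFO). Spaces are deliberately left unencoded, as the proposal allows them.

```python
from urllib.parse import unquote

# Assumed reserved set for INFO values; only ';' and '%' are named in the proposal.
INFO_RESERVED = ";=,%"

def encode_info_value(value: str, reserved: str = INFO_RESERVED) -> str:
    """Percent-encode only the reserved characters, leaving spaces untouched."""
    return "".join(f"%{ord(ch):02X}" if ch in reserved else ch for ch in value)

def decode_info_value(value: str) -> str:
    """Reverse the encoding; unquote() turns every %XX escape back into a character."""
    return unquote(value)

encoded = encode_info_value("50% of reads; low quality")
print(encoded)
print(decode_info_value(encoded))
```

Because '%' is itself in the reserved set, a literal percent sign in the data cannot be mistaken for the start of an escape when the value is decoded.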

  • Character encoding is declared by the ##fileencoding=charset header line. Only encodings that are a superset of US-ASCII are allowed; the default encoding is UTF-8 (see proposal). Sample names can also be in UTF-8.

  • Add a new reserved tag "CNP" analogous to "GP". The discussion raised a question about the use of phred-scaled probabilities: "GP" is phred-scaled but it was suggested that unscaled probabilities should be used with "CNP". It was proposed that both CNP and GP use 0 to 1 encoding, which is a change from previous phred-scaled GP. The proposed text (the exact wording to be clarified):

GP: Genotype posterior probabilities in the range 0 to 1 using the same ordering as the GL field; one use can be to store imputed genotype probabilities (Float)

CNP (analogous to GP): 0 to 1-scaled copy number posterior probabilities (and otherwise defined precisely as the CNL field); intended to store imputed genotype probabilities (Float).

In general, the genotype tags use the following conventions:

  • The "L" suffix means "likelihood": the log-likelihood in the sampling distribution, log10 Pr(Data | Model). Likelihoods are on the log10 scale, so the values cannot be positive (e.g. GL, CNL). In some cases the likelihood is also given on the phred scale in a separate tag (e.g. PL).

  • The "P" suffix means "probability": a linear-scale probability in the posterior distribution, Pr(Model | Data). Examples are GP, CNP.

  • The "Q" suffix means "quality": the phred-scaled complement of the posterior probability, -10 * log10 (1 - Pr(Model | Data)), where the model is the most likely genotype that appears in the GT field. Examples are GQ, CNQ. The fixed site-level QUAL field follows the same convention (represented as a phred-scaled number).
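A small Python sketch may help relate the three suffixes. It is not part of the proposal and rests on stated assumptions: the posterior is derived from the likelihoods under a flat prior, "Q" is read as the phred-scaled probability that the most likely genotype is wrong, and the cap of 99 is an illustrative choice. All function names are hypothetical.

```python
import math

def gl_to_pl(gl):
    """log10 likelihoods -> phred-scaled integers, shifted so the best genotype is 0."""
    raw = [-10.0 * x for x in gl]
    best = min(raw)
    return [int(round(x - best)) for x in raw]

def gl_to_gp(gl):
    """log10 likelihoods -> linear posteriors, ASSUMING a flat prior over genotypes."""
    top = max(gl)  # subtract the maximum first to avoid underflow
    weights = [10.0 ** (x - top) for x in gl]
    total = sum(weights)
    return [w / total for w in weights]

def gp_to_gq(gp, cap=99):
    """Phred-scaled probability that the most likely genotype is wrong (assumed reading)."""
    p_wrong = 1.0 - max(gp)
    if p_wrong <= 0.0:
        return cap  # illustrative cap, not taken from the proposal
    return min(cap, int(round(-10.0 * math.log10(p_wrong))))

gl = [0.0, -1.0, -3.0]   # example GL values for genotypes 0/0, 0/1, 1/1
gp = gl_to_gp(gl)
print(gl_to_pl(gl), [round(p, 4) for p in gp], gp_to_gq(gp))
```

The "L" values stay on the log10 scale, the "P" values sum to 1 on the linear scale, and the "Q" value is a single phred-scaled integer for the called genotype.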

  • In order for VCF and BCF to have the same expressive power, state explicitly that Integers and Floats are 32-bit numbers. Integers are signed. See the discussion and a summary.

  • State explicitly that zero-length strings are not allowed; this includes the CHROM and ID columns, INFO IDs, FILTER IDs and FORMAT IDs. Meta-information lines can be in any order, with the exception of ##fileformat, which must come first. State explicitly that INFO, FILTER and FORMAT IDs must be unique within that type.
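Several of the accepted header rules above (##fileformat first, no zero-length IDs, unique IDs within each type, and the tag-name regular expression for INFO and FORMAT) can be checked mechanically. The following Python sketch is a simplified illustration, not a full header parser: the check_header function is hypothetical, and it assumes that ID is the first key inside the angle brackets and that lines have already been stripped of their newline.

```python
import re
from collections import defaultdict

# Tag-name rule for INFO and FORMAT, as given in the list above.
TAG_NAME = re.compile(r"^[A-Za-z_][0-9A-Za-z_.]*$")
# Simplified: assumes ID is the first key of the structured line.
STRUCTURED = re.compile(r"^##(INFO|FILTER|FORMAT)=<ID=([^,>]*)")

def check_header(lines):
    """Return a list of problems found in the meta-information lines."""
    problems = []
    if not lines or not lines[0].startswith("##fileformat="):
        problems.append("##fileformat must be the first line")
    seen = defaultdict(set)  # IDs already seen, per line type
    for line in lines:
        m = STRUCTURED.match(line)
        if m is None:
            continue
        kind, tag = m.groups()
        if tag == "":
            problems.append(f"{kind}: zero-length ID")
        elif kind in ("INFO", "FORMAT") and not TAG_NAME.match(tag):
            problems.append(f"{kind}: invalid tag name {tag!r}")
        if tag in seen[kind]:
            problems.append(f"{kind}: duplicate ID {tag!r}")
        seen[kind].add(tag)
    return problems

header = [
    "##fileformat=VCFv4.3",
    '##INFO=<ID=DP,Number=1,Type=Integer,Description="Depth">',
    '##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Depth">',
]
print(check_header(header))
```

Uniqueness is tracked per type, so the same ID may appear once under INFO and once under FORMAT without being reported.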

Updates to BCFv2.1 / BCFv2.2

Accepted

  • Silent IDX header tag, see the original proposal here. The text of the proposed addition to section 6.2.1 is:

Defined this way, the dictionary of strings depends on the order and presence of all preceding header lines. If an existing tag needs to be removed from a BCF, all subsequent tags throughout the whole BCF would also have to be recoded. To avoid this costly operation, a new IDX field can be used to define the position explicitly; the field is dropped on BCF-to-VCF conversion. If it is not present, the implicit relative position is assumed. If the IDX field is present in one record, it must also be present in all other dictionary-defining records.

  • Allow missing values in variable-length vectors and variable-length FORMAT fields by recognizing a special end-of-vector byte, see the original proposal here. The text of the proposed addition to section 6.3.3 is:

For integer types, the values 0x80, 0x8000, 0x80000000 are interpreted as missing values and 0x81, 0x8001, 0x80000001 as end-of-vector indicators. Similarly for floats, the value of 0x7F800001 is interpreted as a missing value and 0x7F800002 as the end-of-vector indicator. Note that the end-of-vector byte is not part of the vector itself and only end-of-vector bytes can follow.

  • In total eight values are reserved for future use: 0x80-0x87, 0x8000-0x8007, etc.

  • String vectors in BCF do not need to start with a comma, as the number of values is already indicated in the definition of the tag in the header.
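The missing and end-of-vector values proposed above for section 6.3.3 can be illustrated with a short Python decoder. This is a sketch under assumptions: the function names are hypothetical, values are taken to be little-endian, and the remaining reserved values (0x82-0x87 and so on) are not treated specially.

```python
import struct

# Unsigned struct code, missing value and end-of-vector value per integer width.
INT_TYPES = {
    1: ("B", 0x80, 0x81),
    2: ("H", 0x8000, 0x8001),
    4: ("I", 0x80000000, 0x80000001),
}
FLOAT_MISSING, FLOAT_EOV = 0x7F800001, 0x7F800002

def decode_int_vector(raw: bytes, width: int):
    """Decode integers; None marks a missing value, end-of-vector padding is dropped."""
    code, missing, eov = INT_TYPES[width]
    bits = 8 * width
    values = []
    # Read as unsigned so the reserved bit patterns compare directly.
    for u in struct.unpack(f"<{len(raw) // width}{code}", raw):
        if u == eov:
            break  # only end-of-vector values can follow
        if u == missing:
            values.append(None)
        else:
            # Convert the unsigned pattern back to a signed integer.
            values.append(u - (1 << bits) if u & (1 << (bits - 1)) else u)
    return values

def decode_float_vector(raw: bytes):
    """Same idea for 32-bit floats, comparing raw bit patterns instead of values."""
    values = []
    for u in struct.unpack(f"<{len(raw) // 4}I", raw):
        if u == FLOAT_EOV:
            break
        if u == FLOAT_MISSING:
            values.append(None)
        else:
            values.append(struct.unpack("<f", struct.pack("<I", u))[0])
    return values

print(decode_int_vector(bytes([0x01, 0x80, 0xFF, 0x81, 0x81]), 1))
```

The float markers are compared as bit patterns because both are NaN encodings, and NaN values never compare equal to each other as floats.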