Define Local Alleles in VCF to allow for sparser format #434

yfarjoun · 2019-07-24T18:51:39Z

This is the local-alleles part of #420

hts-specs-bot · 2019-07-24T18:53:47Z

Changed PDFs as of 970a87b: VCFv4.3 (diff).

lbergelson · 2019-07-24T20:12:56Z

VCFv4.3.tex

@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
      GQ		& 1		& Integer	& Conditional genotype quality \\
      GT		& 1		& String	& Genotype \\
      HQ		& 2		& Integer	& Haplotype quality \\
+      LAA		& . 		& Integer	& Local Alternate Alleles\footnotemark[1]\\ 
+      LPL		& . 		& Integer	& Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only.\footnotemark[1]\\ 


Shouldn't LPL be sorted after LAD?

Both LPL and LAD should be sorted after REF+LAA, the same way as REF+ALT define the order of PL and AD.

hts-specs-bot · 2019-07-24T20:53:09Z

Changed PDFs as of 4bc4299: VCFv4.3 (diff).

mlin · 2019-07-27T15:20:46Z

VCFv4.3.tex

-\end{itemize}
+  \item LAA (and LAD and LPL (*):
+  For callsets with a large number of samples, it is often the case that the majority of sites are not called and sites end up involving many alleles for which all the samples need to provide PL and AD. 
+  This can cause the file-sizes to grow super-linearly with the number of samples. 


Different take on the previous two sentences; feel free to use/recombine

In callsets with many samples, sites may grow to include numerous alternate alleles at the same POS. Usually, few of these alleles are actually observed in any one sample, but each sample must supply fields like PL and AD for all of the alleles -- a very inefficient representation as PL's size is quadratic in the allele count.

mlin · 2019-07-28T21:40:57Z

VCFv4.3.tex

@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
      GQ		& 1		& Integer	& Conditional genotype quality \\
      GT		& 1		& String	& Genotype \\
      HQ		& 2		& Integer	& Haplotype quality \\
+      LAA		& . 		& Integer	& Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\


Suggested change

LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\

LAA & . & Integer & Local Alternate Alleles the strictly-increasing, 1-based indices into the alternate alleles indicating which are relevant for the current sample.\\

Or maybe:

Suggested change

LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\

LAA & . & Integer & Strictly increasing, 1-based indices indicating which alternate alleles are relevant (local) for the current sample\\

SGTM but let's say the indices have to be strictly increasing; obviously that's a pedantic constraint but without it, the ordering of LPL could be unclear as well.

Suggestion updated.

Dwelling on this further -- @yfarjoun @lbergelson -- I'm worried just slightly about confusion over the LPL ordering. Lets say the joint VCF has REF=A ALT=G,C,T and the sample GVCF has ALT=T,G. so LAA=1,3 if we require the indices to be in order, but this order is flipped with respect to the GVCF.

The merging algo then has to be really careful to permute the GVCF PL values in just the right way to match that. In the wild, I'm nervous this is going to lead to difficult-to-detect bugs with subtle swapping of probability values. Did you have any experience with this in developing the idea? Should we permit the merger to write an out-of-order LAA (this might just shift the potential confusion to readers)? I don't have a better idea right now, it just seems to me like a significant pitfall at least warranting explication where LPL is discussed.

I agree that this is a concern. The only actual implementation so far of a format with local alleles decided to store the allele string itself instead of the index because they didn't want to deal with the reordering problem. So unordered [T, G] instead of [1, 3]. Move the difficulties from the merge step to the reading step in that now if you want to look up non-local values associated with an allele you have to do a search in the INFO allele values to find the corresponding index. I

My thinking was that it was better to require the indexes as numbers and do the reordering, but I think there's a strong argument the other way.

Thanks, I'll spawn a background thread on this. BTW, if anyone knows a better formula than the following to go from PL index gt to the indices (i,j) of the constituent alleles, I would like to know. This is seriously the best I could come up with, holy sh**.

unsigned j = (unsigned)((sqrt(8*gt+1)-1.0)/2.0); unsigned i = gt - j*(j+1)/2;

I like @cyenyxe's suggestion, it solves a possible confusion. It also allows to determine the LPL ordering per allele-set, samples with the same set of local alleles are guaranteed to use the same order.

@mlin Note that the proposed formula is for the special case of diploid genotypes. In my experience, in practice it is sufficient to pre-compute the indexes using nested/recursive loops as suggested in the description of GL, section 1.6.2.

why do the indexes need to be strictly increasing?

One of the main points is to lessen the burden on the creation of these pVCFs...so that when the alleles change the only change that needs to happen in the format field is the LAA. if we require strictly increasing order, that goes out the window.

If LAA are strictly increasing, LPL's are guaranteed to have the same order as 1) the original PLs and 2) samples with the same set of alleles.

Editing of VCFs is much less frequent operation than read-only processing, therefore it is more important to optimize the speed and convenience of the latter.

cyenyxe · 2019-07-29T10:42:16Z

VCFv4.3.tex

@@ -503,7 +505,15 @@ \subsubsection{Genotype fields}
  All phased genotypes that do not contain a PS subfield are assumed to belong to the same phased set.
  If the genotype in the GT field is unphased, the corresponding PS field is ignored.
  The recommended convention is to use the position of the first variant in the set as the PS identifier (although this is not required).
-\end{itemize}
+  \item LAA (and LAD and LPL (*):


I would follow the existing of one item per field, otherwise it may be confusing to readers searching for LAD and LPL.

should I leave the paragraph as is, and add entries for LAD and LPL pointing to LAA?

The AD, ADF, ADR paragraph also groups related items into a single paragraph. Fortuitously in both cases the three entries are adjacent alphabetically.

(Though the whole LAA/LAD/LPL paragraph wants to move up to between HQ and MQ, and this label (LAA (and LAD and LPL (*)) wants less ‘and’ and better balanced parentheses!)

cyenyxe · 2019-07-29T10:43:41Z

VCFv4.3.tex

@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
      GQ		& 1		& Integer	& Conditional genotype quality \\
      GT		& 1		& String	& Genotype \\
      HQ		& 2		& Integer	& Haplotype quality \\
+      LAA		& . 		& Integer	& Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\


Or maybe:

Suggested change

LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\

LAA & . & Integer & Strictly increasing, 1-based indices indicating which alternate alleles are relevant (local) for the current sample\\

cyenyxe · 2019-07-29T10:44:49Z

VCFv4.3.tex

@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
      GQ		& 1		& Integer	& Conditional genotype quality \\
      GT		& 1		& String	& Genotype \\
      HQ		& 2		& Integer	& Haplotype quality \\
+      LAA		& . 		& Integer	& Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
+      LAD		& . 		& Integer	& Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\


Suggested change

LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\

LAD & . & Integer & Read depth for the reference and each of the local alternate alleles listed in LAA\\

cyenyxe · 2019-07-29T10:45:45Z

VCFv4.3.tex

@@ -406,12 +406,14 @@ \subsubsection{Genotype fields}
      GQ		& 1		& Integer	& Conditional genotype quality \\
      GT		& 1		& String	& Genotype \\
      HQ		& 2		& Integer	& Haplotype quality \\
+      LAA		& . 		& Integer	& Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
+      LAD		& . 		& Integer	& Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\
+      LPL		& . 		& Integer	& Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\


Suggested change

LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\

LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternate alleles listed in LAA\\

Suggested change

LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\

LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternate alleles listed in LAA, in the genotype order implied by LAA such that LPL would be a subsequence of PL were the latter present.\\

@lbergelson @pd3 following the previous discussion, I feel it's worth leaving no possible question about the LPL order -- here's my attempt, better suggestions welcome.

For GVCF merging, we're already responsible for carefully permuting the GVCF PL values into the target joint vector, the new wrinkle will be that each sample can have a different target. I think this is okay, it was just tempting to want to copy in the GVCF PL verbatim.

I think it's better to speak about ordering only and not require a strict subsequence: it should be possible to rescale the LPL likelihoods.

mlin · 2019-09-08T08:02:31Z

Hi all, I was motivated to come back here after another collaborator just run aground on the htslib int overflow when it tries to allocate a 2GB+ PL vector 😄 😭

It seems like we are not far; the open items are

Must the LAA allele indices for each sample be strictly increasing? I advocate yes, to make reading LPL less tricky, but the trickiness burden is shifted to the writer.
Precise language for specifying the order of LPL values, if anyone agrees with me it's worthwhile to leave no doubt.
Move the item describing LAA, LAD, and LPL into alphabetical position within the itemized list. (+ misc wording and punctuation edits)

jmarshall · 2019-09-09T15:55:01Z

I think this is a nice new feature, but it took me many minutes of very close reading to figure out just what the feature is! If I understand it correctly, it could perhaps be better described as:

The table of reserved keys is just a brief taster / memory jogger of what each field is about, so: LAA List of relevant Local Alternate Alleles / LAD Read depth for each local allele / LPL Phred-scaled local genotype likelihoods rounded to the closest integer
The text in the bulleted paragraph motivating this doesn't mention that the point of all this is to only list “interesting” alleles on a per-sample basis because each sample has a different small subset of alt alleles that are present for that sample. The text suggested in Define Local Alleles in VCF to allow for sparser format #434 (comment) is rather better.
It needs a complete example so you can see what is happening. Something like

… REF  ALT        …  FORMAT          SAMPLE1                       SAMPLE2
… G    A,C,T,<*>  …  GT:LAA:LAD:LPL  1/1:2,4:20,30,10:1,2,3,4,5,6  1/1:3:15,25:1,2,3

thus demonstrating the different subsets per sample.

The numbers are all represented as ., but the paragraph text should explain the relationships between them. So:

LAA is a list of n integers, where 1 <= n <= the number of ALT alleles, giving the (1-based) indices within ALT of the alleles that are observed in the sample. [And a sentence about the ordering if so decided.] (Uh — can n be 0?)

LAD is a list of n+1 integers giving read depths (as per AD) for the REF allele and each of the local alleles as listed in LAA.

LPL is a list of n+1 choose P (or whatever the combinatoric calculation is!) integers giving phred-scaled genotype likelihoods (rounded to the closest integer; as per PL) for all possible genotypes given the set of alleles defined in the REF and LAA local alleles. [And refer to the precise ordering defined in the GL paragraph.]

(Writing out all that for each field might mean they would be better as separate paragraphs (as per #434 (comment)) or it might be better to keep them together to collaborate on the same example better…)

hts-specs-bot · 2019-09-16T15:26:47Z

Changed PDFs as of 4f131de: VCFv4.3 (diff).

hts-specs-bot · 2019-10-07T16:00:21Z

Changed PDFs as of 0c65b48: VCFv4.3 (diff).

mlin · 2019-10-07T16:53:22Z

Similarly, in rare sites, which can be the bulk of the sites, the vast majority of the samples are reference.

This is a good point; we should probably clarify how one should write LAA & LPL for samples reporting no evidence for any particular alternate allele.

hts-specs-bot · 2019-10-07T17:57:55Z

Changed PDFs as of 605e8ec: VCFv4.3 (diff).

hts-specs-bot · 2019-10-07T18:04:16Z

Changed PDFs as of fbf87d3: VCFv4.3 (diff).

VCFv4.3.tex

lbergelson · 2019-12-02T20:16:58Z

Some misc things I've been thinking about while implementing this:

./. calls are tricky to convert to local alleles in a joined gvcf because we don't know which subset of alleles is relevant, in worst case you may have to include every allele in LAA or analyze the PL's to see which ones are not actually meaningful.
Requiring LAA to be sorted is annoying when merging files.
Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site? Benefits: potentially space saving, makes single sample files in local allele format cleaner. Cons: easier to make mistakes when merging files.

lbergelson · 2019-12-03T21:25:54Z

VCFv4.3.tex

+2&A&C,G,T,\textless*\textgreater& GT:AD:PL&0/3:15,.,.,25,.:40,.,.,.,.,.,0,.,.,80,.,.,.,.\\
+3&C&G,T,\textless*\textgreater& LAA:LGT:LAD:LPL& 4:0/0:30,1:0,30,80\\
+3&C&G,T,\textless*\textgreater& GT:AD:PL& 0/0:30,.,.1:0,.,.,.,.,.,.,.,.,.,30,.,.,.,80\\
+4&G&A,T,\textless*\textgreater& LAA:LGT:LAD:LPL& :0/0:30:0\\


Typo here: we either need to remove the LAA field and indicate that it can be left out if there are no non-ref local alleles, or we need to add a leading .

I don't think this is a typo: the intention is that LAA is a 0-length list of integers (indicating that this record is homozygous ref for this sample and that none of the ALT alleles are in play), not an unknown MISSING value.

This may be a problem, as it's not clear that 0-length lists are valid syntax in VCF or BCF. Omitting LAA is probably not the right workaround, as presence of LAA is the simple indicator that local alleles are being used in a VCF record. Setting it to . (unknown) is somewhat misleading, as it is known…

This example contradicts the current text, which says “LAA is a sorted list of n distinct integers, where n ≥ 1 […]”. So the text will need adjustment if this example record is deemed valid.

bcftools converts the zero-length list into a missing value. This comes from bcftools treating everything as BCF internally, and BCF says zero-length lists are missing.

Possible fixes would be to say missing LAA means REF; to explicitly put 0 in LAA for REF-only; or use GT in these rows, but allow use of LPL, LAD etc. in the specific case that GT is REF-only.

It makes sense to use the missing value . and state it explicitly that it means REF. I can't think of a case where this could cause problems.

I agree with Petr here. While the BCF specification states that a zero length array is defined to be a MISSING array, hence ., we could add an explicit statement to LAA saying that missing LAAs means that it matches REF. What we gain is a solution to this problem. What we lose is the ability to store an unknown LAA list. I'm unsure what the latter practically means. Why would you need to explicitly state it's missing instead of just omitting it. If it's missing then we're not using LAA (and if the PL, AD etc are missing we're not defining it that way either).

PS. Technically even if it's . this still contradicts "LAA is a sorted list of n distinct integers" as . isn't an integer. However that's nit-picking and maybe somewhere else VCF already defines . as being an integer of unknown value? I don't know what we do elsewhere when we have this technical disparty. Is it just assumed that "." is a polymophic type that matches whatever the item wants it to be?

What the BCF spec says is that a type descriptor byte with ‘size’=0 denotes MISSING. It does not say what a type descriptor byte with ‘size’=15 followed by a typed Integer with value 0 denotes, but that would be a natural representation of a 0-length list. (The spec does say that ‘size’=15 implies the list length is ≥15, but the typed Integer is not biased so no meanings are described for typed Integer values <15.) Thus BCF does not provide a binding answer on whether 0-length lists are valid or representable in VCF/BCF.

But I'm not going to advocate blessing 0-length lists in VCF or BCF, especially not just for this, even though SAM found them useful for regularity (cf #135).

IMHO the clearest way to represent this corner case would be

As a special case, the homozygous REF case in which no ALT alleles are observed in a particular sample is represented as LAA=[0].

(And we'd need to rework n in the length maths: e.g. by dropping n from the description of LAA, and adding “where n is the number of ALT alleles listed in LAA” to the others.)

I could get behind using LAA=0 for the special case of no alternates. It wouldn't need much wording alteration. It mentions 0 is implicit and that LAA goes from 1 to <= n, but we could add an explicit caveat for the empty list and an explanation of why it's necessary. I don't like the mix of LAA=2,4 and LAA=[2,4] in the text though. I find it confusing to have two different syntaxes mixed together. I get it's trying to imply an array in a textual sense, but it would be clearer IMO if the textual description used the actual text encoding.

However I'm not really sure I understand the necessity for a distinction between empty and missing. I appreciate they are infact different things, however pragmatically are we expecting the tools to treat them differently? If so how? Is it a confidence thing, where we want to be able to specify a likelihood for REF-only?

I don't know enough about the internals of bcftools and htsjdk though to know whether an LAA of 0 would cause breakage. Likewise the type / size shenanigans. My instinct tells me htslib would explode if the size is incorrect and I'd rather not be relying on robustness of weird undiscovered corner cases in BCF. It's a confusing enough format as it is.

To expand on my previous comment: the logic of LAA is to define a list of alleles that can be referred to in subsequent localized fields. If only the reference allele is relevant, then it makes sense to provide the missing value ., in this sense it behaves in exactly the same way as what we write in REF and ALT columns. Note also the naming of the tag stands for "Local Alternate Alleles". By using . we are not introducing a new concept.

Parachuting in here a bit, so my apologies if I'm missing context, but defining LAA as an R-length array whose first element is always the number zero both avoids these nasal demons and avoids a branch [1] when connecting a local index into a global index.

Is it too late to consider an R-length LAA instead of an A-length one?

[1] Text processing cost probably dominates, but if you convert a VCF into some binary format without adjusting LAA, maybe this matters a bit.

jmarshall · 2019-12-05T10:39:52Z

Re (3) Would it be a good idea to allow leaving out LAA in when the local alleles include all of the alleles at the site?

Wouldn't sites where all samples include all the alleles just use GT/AD/PL rather than LAA/LGT/LAD/LPL? (And this would obviously be allowed as there is no text saying a file has to have all or none of its variant lines using LAA, and the default assumption would be that a mixture is fine.)

Or do you mean: when some samples include all the alleles and others don't, instead of writing 1,2,3,…,N for LAA for the all-alleles samples perhaps allow some special value to be an abbreviation for that? e.g. . but frankly that would be an unexpected meaning for .…

hts-specs-bot · 2020-08-11T17:37:14Z

Changed PDFs as of a5d41dc: VCFv4.3 (diff).

This is a draft implementation of samtools/hts-specs#434 and haploid Number=G tags are not handled yet.

h-2 · 2023-01-13T12:03:23Z

I was just reading through all of this, since I am implementing it for one of our tools, and I wondered: what's the point of LGT ? It doesn't save any space and just adds the indirection of parsing through LAA...

In our applications, the only point of doing this is to transform fields with number G (PL-values), although an argument could be made for transforming any/all fields with number A and R, as well. Is there a reason for only suggesting AD and not other A/R-fields?

d-cameron · 2023-07-10T21:14:18Z

My preference is to expand the scope of this and make it generic over all Type=R/A/G fields with the local notation implictly defined by the presence of LAA^. That is, PL, GT, etc. are redefined from indexes w.r.t ALT to indexes w.r.t ALT[LAA] with a missing LAA being interpreted as the identity function.

In this manner, converting between the short and long-hand notations can be done not with L-prefixed fields, but with the fields themselves based on the presence/absence of LAA. This would actually simplify VCF merging as LAA would act as an indirection/remapping marker with the only FORMAT change required when merging VCFs is the addition of the LAA field.

Another benefit of this approach is that is allows fields such as PL to preserve their original meaning.

^ AI (alt index) is probably a better name than LAA as no L-prefixed fields would be required

pd3 · 2023-07-11T08:17:46Z

@d-cameron That's a terrible idea, it would break existing tools.

Also, having implemented this in bcftools merge, I don't see how it would help with merging VCFs at all, or what do you mean by your comment that it would help preserve PL its original meaning?

Perhaps better is to apply this idea to the LXX tags, rather than changing the semantics of existing tags. (That is, allow other LXX tags even when LAA is not present.)

olest · 2023-10-06T11:08:30Z

Hi,

we are using local alleles in DRAGEN gVCF Genotyper. More specifically, we write FORMAT/LPL, LAD, LGT and LAA in the multi-sample VCF output.

In particular FORMAT/LPL gives a significant space saving compared to FORMAT/PL which scales quadratically with the number of alleles at a variant site.

This becomes important for an msVCF representing a large cohort because multi-allelic sites with hundreds or even thousand of alleles will become more frequent.

In these cases, the msVCF line can exceed the size limit of the htslib parser as @mlin mentions above.

I would love to see this change to become part of the VCF spec. If there is anything needed to support this, please let me know.

VCFv4.3.tex

jmarshall · 2023-11-13T11:05:07Z

Since this PR was started, v4.4 has become the canonical VCF version. We may wish to adjust this PR to add the new fields to the VCFv4.4.tex document instead (or perhaps to both, but primarily to the current one, v4.4).

I have drafted PR #745 applying the local alleles changes to the VCFv4.4 document on top of current master and fixing some formatting etc. It is not intended to split the conversation from this PR, but to provide a preview PDF and demonstration of how these changes applies to current v4.4.

If @yfarjoun and the VCF maintainers agree, and wish to apply the local alleles additions to v4.4 instead of v4.3 (or perhaps to both), then we can force push the new PR's branch to yfarjoun:yf_local_alleles, close #745, and continue the conversation on this original PR.

jkbonfield · 2023-11-28T14:51:23Z

Background into this, that came up in Future of VCF. All Of Us use VDS natively (Hail Variant Dataset). I think this is likely the origin of this PR and #435.

https://support.researchallofus.org/hc/en-us/articles/14927774297620-The-new-VariantDataset-VDS-format-for-All-of-Us-short-read-WGS-data
https://hail.is/docs/0.2/vds/index.html

One difference there is VDS uses LA and not LAA,and LA includes 0 for REF.

@pg

Andrew Whitwham (12): Uodate copyright for winter release. January 2023 NEWS update. Keeping the NEWS file up-to-date. More additions and improvements. Added htscodecs update to v1.4.0 More space. Switched back to openssl for Alpine. Amalgamate multiple CIGAR ops into single entry. (#1607) Stop the overwriting of the end value. Summer 2023 copyright update. Speed up removal of lines in large headers. Winter News 2023 (PR #1703) Bergur Ragnarsson (3): draft fix fix memory leak exit early on error David Seifert (1): Use POSIX `grep` Fabian Klötzl (1): improve parsing performance Fangrui Song (1): Apply the packed attribute to uint*_u types for Clang James Bonfield (86): Update man SEE ALSO sections from .BR to .IR so the website uses URLs bgzip text compression mode Make the bgzip -g option less opaque. Make tabix support CSI indices with large positions. Prevent crash when only FASTA entry has no sequence. Add an fai_line_length function. Check for invalid BC tags in fastq output. Warn if ref file is given but it doesn't contain the refs we need. Fix buffer read-overrun in bam_plp_insertion_mod. Fix ref fix from c91804c Make it easier to modify shared library permissions during install Add CRAM SQ/M5 header checking when specifying a fasta file. (PR #1522) Speed up load_ref_portion. Expand CRAM API a bit to cope with new samtools cram_size command. (PR #1546) Merges neighbouring I and D ops into one op within pileup. (PR #1552) Improve API docs for bgzf_mt FIx a bug in the codec learning algorithm for TOKA Fix a bug with multi-threading and embed_ref=2 on name sorted data Use non-ref mode when all else fails for CRAM encoding Add some documentation on cram encoder code structure Fix Cram compression container substitution matrix generation. Tweak the CRAM_SUBST_MATRIX table. Prevent spurious and random system errors from test_bgzf.c Fix cram_index_query_last function Avoid deeply nested containment list on old CRAM indices. Permit fastq output to create empty FASTQ records for seq "*". Fix a couple small VCF auto-indexing bugs. Backport attractivechaos/klib#78 to htslib. Slightly speed up various cram decoding functions (#1580) Remove CRAM 3.1 warning. Trivial fix to expr, removing "^". Add MZ:i tag as a check for base modification validity. (#1590) Fix typo in kh_int_hash_func2 macro. Rename aux tag MZ to MN. Protect against overly large containers. Don't create overly large CRAM blocks. Add a missing break statement in cram_codec_to_id. (#1614) Fix fd_seek on pipes on modern MinGW releases. Change bounds checking in probaln_glocal Fix a containment bug in cram_index_last. Migrate base modification code out of sam.c Correct base modification implicit / explicit status when mixed together. Add a bam_mods_queryi interface. Add bam_parse_basemod2 API with additional flags argument. Add more internal sam_mods.c documentation Update bam_next_basemod too to cope with HTS_MOD_REPORT_UNCHECKED. Fix decompress_peek_gz to cope with files starting on empty gzip blocks. Fix to 2e672f33 decompress_peek_gz change. Add fai_thread_pool interface. NEWS updates for pending release Makes bam_parse_cigar able to modify existing BAM records rather than Fix cut and paste errors in bam_aux2f documentation The first stage of vcf_parse_format speed improvements. Further VCF reading speeds optimisations. Revert most of the vcf_parse_info improvements. Add an hclen SAM filter function. Fix a minor memory leak in malformed CRAM EXTERNAL blocks. (#1671) Enable auto-vectorisation in CRAM 3.1 codecs. Cache key header lengths. Allow vcf_format to work on packed data, plus bcf_fmt_array improvements. Speed up kputd by approx 130%. Minor tweaks to bcf_fmt_array and bcf_str_missing usage. Fix a cram decode hang from block_resize. Always do the CRAM mutex lock/destroys. Add C++ casts for external headers. Enable optimisation level -O3 for SAM QUAL+33 formatting. Avoid a NULL pointer dereference while building CRAM embedded ref. Avoid a NULL pointer deref when erroring writing CRAM to stdout Make CRAM internal data structures use hts_pos_t. Prevent CRAM 3 from attempting to write out out-of-bounds ref positions Remove memory leak when cram_encode_container fails during a close. Fix out by one error on extend_ref memory allocation. Don't call cram_ref_decr on consensus-based references. Check for 64-bit values in BETA codec initialiser. Protect against CRAM slice end going beyond end of reference. Prevent extend_ref from making huge mallocs on very sparse data. Improve the fuzzer to write BAM/CRAM and BCF too. Fix memory leaks on failed CRAM encode. Permit embed_ref=2 mode to be reenabled after using no_ref. Tighten memory constraints for cram_decode. Further rewrite of the fuzz test harness Reduce maxmimum BCF header len in fuzzer. Fix buffer read overrun in cram_encode_aux. Disable hts_set_fai_filename call in hts_open_fuzzer. Avoid undefined behaviour integer overflow in extend_ref Fix integer overflow in cram_compress_block2 John Marshall (19): Add bam_aux_first()/bam_aux_next() tagged aux field iterator API Document that bam_aux_del()'s `s` parameter must be non-NULL (& reformat) Add symbol versioning to the ELF shared-object file Mention in INSTALL that using plugins may need -rdynamic Set _XOPEN_SOURCE in configure if it's not already set Add missing Makefile dependencies [minor] Make last_in a pointer to const [minor] Add "uncompressed" in hts_format_description() where appropriate Add hclose()-doesn't-close-fd option and use it for hopen("-") Take advantage of shared hopen("-") in htsfile.c Explicitly fclose(stdout) in test/test_view.c too Fix hfile_libcurl small seek bug Apply memchr optimisation in the general delimiter case too Add test case compiling the public headers as C++ Remove remnants of HTS_HAVE_NEON, unused since PR #1587 Remove NUMERIC_VERSION, unused since PR #1226 Document primarily MM/ML and fix base-mod-related typos Compare as off_t (not size_t) and fix printf specifiers Install annot-tsv.1 man page and alphabetise annot-tsv rules Lilian Janin (2): Fix error code 0 returned by bcftools after error Make bcftools return an error code != 0 after [E::bgzf_read_block] Invalid BGZF header at offset xxx Petr Danecek (15): Remove variable redeclaration warnings from perl test script Make bcf_hdr_seqnames() work with gapped chromosome ids Make bcf_hdr_idinfo_exists() more robust Check if VCF POS column could be fully parsed Allow repeated calls of bcf_sr_set_regions (PR #1624) An attempt to parse malformatted region such as {1:1}-2 should fail Add new annot-tsv program Output full diff on failing tests Output full diff on failing tests Make qsort in regidx order-reproducible across platforms Provide a nill (dot) value when the field is empty Use EXIT_SUCCESS with -h and EXIT_FAILURE on errors Address various comments VCF parsing fix Allow renaming of the default -a annotations (#1709) Rob Davies (65): Fix n-squared complexity in sample line with many adjacent tabs Switch to building libdeflate with cmake Add faidx_seq_len64() and fai_adjust_region() interfaces Rework / add new faidx tests Fix build on ancient versions of gcc Ensure strings in config_vars.h are escaped correctly Switch MacOS CI tests to an ARM-based image Cut down the number of embed_ref=2 tests that get run Cap bgzf_getline return value to INT_MAX Make tbx_parse1 work for lines longer than 2Gbytes Use correct type for ret in vcf_write() Don't error when making an iterator on a tid not in the index Happy New Year Catch errors from bgzf_getline() in hts_readlist, hts_readlines Add configure (enable|disable)-versioned-symbols options and tests Add Makefile rule to update the symbol version file Strip out symbol versions from shlib-exports-so.txt Update to htscodecs v1.4.0 Minor NEWS adjustment and additonal item Switch to CURLINFO_CONTENT_LENGTH_DOWNLOAD_T for newer libcurl Fix crypt4gh redirection Remove use of sprintf() from HTSlib source Make SIMD tests work when building multiarch binaries Make MacOS tests build a multiarch version of the library Fix bug where bin number could overflow when looking for max_off Make reg2bins faster on whole-chromosome queries Make reg2intervals() faster on whole-chromosome queries Update to latest htscodecs Don't set _POSIX_C_SOURCE for htscodecs tests Fix trailing space in config.h made by configure Ignore generated config_vars.h file in copyright check Switch to `/usr/bin/env perl` for all perl scripts Adjust comments in probaln_glocal() Expand test-bcf-sr.c capabilities Add synced reader region tests, and move no-index tests Fix possible double frees in bcf_hdr_add_hrec() error handling Prevent dangling hrec pointer after bcf_hdr_add_hrec() failure Remove items from hdict in bcf_hdr_remove() Ensure number of modifications is always set in bam_parse_basemod2() Ensure simple_test_driver.sh cleans up its temporary files Ensure base mod test result is noticed by the Makefile Improve test/test_mod.c Switch to htscodecs 1.5.1 Skip CRC checks when fuzzing Prevent out-of-memory reports when fuzzing Add missing dependency on libhts.a for hts_open_fuzzer Fix virtual offset adjustment when indexing on-the-fly Fix BCF/VCF on-the-fly indexing issues Fix crypt4gh redirection Update htscodecs to v1.5.2 (2aca18b3) Simplify run_test function Fix test/test.pl -F dependencies Simplify test/test.pl Move all annot_tsv tests into their own function Fix up some annot-tsv error checks / reports Reformulate man page for house style Make compiler flag detection work with zig cc Fix a couple of unused value warnings when built with NDEBUG Fix @pg linking when records make a loop Fix possible shift of negative value in cram_encode_aux() Move typecast to the right place Ensure no-stored-sequence reads are counted in container size Make output on unrecognised options go to stderr Update to latest htscodecs Update htscodecs to v1.6.0 kojix2 (2): Fix a typo in sam.h documentation Fix example in docs for sam_hdr_add_line pd3 (4): Remove a bottleneck in VCF header processing Add support for non-standard chromosome names containing [:-] characters Clarify usage, include output example Temporary workaround when excessive memory is required by FORMAT fields vasudeva8 (10): Adds bcf_strerror method (PR #1510) Ensure NUL termination of Z/H data in sam_format_aux1; fix base mod state reuse Changes to avoid segfault with uncompressed bam (PR #1632) Demonstration of htslib/sam api usage. formatting update bgzf_useek fails when offset is above block limits Add support for multiple files to bgzip Limited usage text to 80 chars per line man update for -a option (PR #1716) Fix example column names and explain core column renaming Étienne Mollier (2): cram/cram_external.c: fix external htscodecs include htslib-s3-plugin.7: fix whatis entry Noteworthy changes in release 1.19 (12th December 2023) ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ Updates ------- * A temporary work-around has been put in the VCF parser so that it is less likely to fail on rows with a large number of ALT alleles, where Number=G tags like PL can expand beyond the 2Gb limit enforced by HTSlib. For now, where this happens the offending tag will be dropped so the data can be processed, albeit without the likelihood data. In future work, the library will instead convert such tags into their local alternatives (see samtools/hts-specs#434). (NEWS truncated at 15 lines)

github-actions · 2024-01-02T22:53:12Z

Changed PDFs as of 4e4bae4: VCFv4.4 (diff — see pp10–13).

jmarshall · 2024-01-02T23:01:43Z

As discussed in today's meeting, this PR has been updated to apply to the v4.4 document instead (and some formatting is improved).

Some further work needed was discussed in the meeting:

We may wish to apply this to an upcoming v4.5 draft instead;
The new genotype fields are listed as Number=., which may wish to be updated to Number=P (for some of them at least) once we have settled on a suitable definition of P (cf vcf: Define Number value P #705).

danking · 2024-01-31T00:24:49Z

@h-2

I was just reading through all of this, since I am implementing it for one of our tools, and I wondered: what's the point of LGT ? It doesn't save any space and just adds the indirection of parsing through LAA...
In our applications, the only point of doing this is to transform fields with number G (PL-values), although an argument could be made for transforming any/all fields with number A and R, as well. Is there a reason for only suggesting AD and not other A/R-fields?

We (Hail) also questioned the value of LGT until we discovered a footgun. Suppose:

You have a VCF that uses local alleles.
At locus chr1:1, the alternate allele array is A,T,G and there are two samples whose genotypes are: 0/3 and 1/2.
You decide to remove the second sample.

It is not unreasonable to also remove the now unused alternate alleles from the alternate allele array, but now your GT is wrong. So, the options when filtering samples are:

Always keep the "dead" alternate alleles.
Remove the "dead" alternate alleles and also update any affected GTs.
Use LGT.

For Hail's purposes, using LGT seems best because it allows sample subsetting without O(N_SAMPLES_KEPT) edits. It also avoids the footgun of "whoopsie I changed the meaning of all my GTs!".

danking · 2024-01-31T00:49:41Z

@jmarshall

The new genotype fields are listed as Number=., which may wish to be updated to Number=P (for some of them at least) once we have settled on a suitable definition of P (cf #705).

Do I understand correctly that

one for each allele specified in the GT

means that P is the ploidy of the GT? Concretely, 1/1, 1/2, and 2/2 all have P equal to 2?

I'd prefer three new numbers (all of which are distinct from the ploidy):

LOCAL-A. One element for each allele present in LA minus one (to exclude the reference). For example: LEC.
LOCAL-R. One element for each allele present in LA (including the reference). For example: LAD.
LOCAL-G. One element for each possible genotype. If $N_{LA}$ is the number of local alleles, LOCAL-G is the triangle number of $N_{LA}$. Concretely: if diploid, $\frac{1}{2}N_{LA} (N_{LA}+1)$, and, if haploid, $N_{LA}$. For example: LPL.

I think P is sometimes equal to LOCAL-A, sometimes equal to LOCAL-R, and sometimes neither:

GT	P	LOCAL-A	LOCAL-R
0/0	2	0	1
0/1	2	1	2
1/2	2	2	3

LeeTL1220 · 2024-01-31T12:59:43Z

All Of Us use VDS natively (Hail Variant Dataset).

Apologies, if I am being nitpicky...

@jkbonfield FYI: I don't believe that "natively" is quite accurate, since it implies that we do our joint calling, conversions, and search functionality using Hail/VDS. We use Genomic Variant Store (GVS) for all of that, except for some conversions. We use GVS to render a VDS (also Hail for this one), VCF, and pgen. We do use the Hail/VDS to render bgen and plinkbed, because GVS does not support those. Note that GVS is a variant of SVCR (no pun intended).

d-cameron · 2024-02-27T17:38:12Z

Based onto VCF 4.5. Please continue discussion on #758.

lbergelson reviewed Jul 24, 2019

View reviewed changes

yfarjoun mentioned this pull request Jul 24, 2019

Changes to VCFv4.3 to codify local-allele parlance for sparse VCF #420

Closed

mlin reviewed Jul 27, 2019

View reviewed changes

mlin reviewed Jul 28, 2019

View reviewed changes

cyenyxe suggested changes Jul 29, 2019

View reviewed changes

mlin mentioned this pull request Jul 29, 2019

Missing PL fields for 0/0 calls dnanexus-rnd/GLnexus#173

Closed

jkbonfield added the vcf label Aug 20, 2019

lbergelson reviewed Oct 7, 2019

View reviewed changes

VCFv4.3.tex Outdated Show resolved Hide resolved

lbergelson reviewed Dec 3, 2019

View reviewed changes

yfarjoun added this to the VCF v4.4 milestone May 18, 2020

pd3 added a commit to samtools/bcftools that referenced this pull request Aug 14, 2020

Local alleles merging, produce LAA and LPL when requested.

e645749

This is a draft implementation of samtools/hts-specs#434 and haploid Number=G tags are not handled yet.

mlin mentioned this pull request Sep 7, 2020

document convention for "QC squeezing" in population VCF #527

Open

whitwham reviewed Oct 12, 2023

View reviewed changes

VCFv4.3.tex Outdated Show resolved Hide resolved

Local_allele stuff

1bf2db9

Yossi Farjoun added 5 commits November 13, 2023 22:14

responding to review comments

9f04587

responding to review comments

4f2bc72

added example

692e4cd

added example

3dca8b8

added example

4e4bae4

jmarshall mentioned this pull request Nov 13, 2023

Local Alleles (PR #434) applied to VCFv4.4 #745

Closed

jmarshall force-pushed the yf_local_alleles branch from a5d41dc to 4e4bae4 Compare January 2, 2024 22:51

jmarshall mentioned this pull request Jan 2, 2024

vcf: Define Number value P #705

Open

danking mentioned this pull request Feb 1, 2024

NegativeArraySizeException when converting plink bed to hailmatrix on large chromosome hail-is/hail#14168

Open

d-cameron pushed a commit that referenced this pull request Feb 27, 2024

Rebased #434 onto VCFv4.5

9d72212

d-cameron mentioned this pull request Feb 27, 2024

VCF 4.5 Future of VCF scaling improvements #758

Merged

d-cameron closed this Feb 27, 2024

d-cameron pushed a commit that referenced this pull request Apr 20, 2024

Rebased #434 onto VCFv4.5

0743318

iris-garden mentioned this pull request May 1, 2024

Performing SampleQC using Hail on ~500k WES samples iris-garden/test-process#3391

Open

	LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
	LAA & . & Integer & Local Alternate Alleles the strictly-increasing, 1-based indices into the alternate alleles indicating which are relevant for the current sample.\\

	LAA & . & Integer & Local Alternate Alleles the 1-based index into the alternate alleles indicating which are relevant for the current sample.\\
	LAA & . & Integer & Strictly increasing, 1-based indices indicating which alternate alleles are relevant (local) for the current sample\\

	LAD & . & Integer & Local Allele Depth for the reference and each of the local alternate alleles (see: LAA).\\
	LAD & . & Integer & Read depth for the reference and each of the local alternate alleles listed in LAA\\

	LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the Reference and the LAA alleles only (see LAA).\\
	LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the reference and the local alternate alleles listed in LAA\\

Define Local Alleles in VCF to allow for sparser format #434

Define Local Alleles in VCF to allow for sparser format #434

Conversation

yfarjoun commented Jul 24, 2019

hts-specs-bot commented Jul 24, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

hts-specs-bot commented Jul 24, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cyenyxe Jul 29, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mlin Jul 29, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

cyenyxe Jul 29, 2019 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

mlin commented Sep 8, 2019

jmarshall commented Sep 9, 2019 • edited Loading

hts-specs-bot commented Sep 16, 2019

hts-specs-bot commented Oct 7, 2019

mlin commented Oct 7, 2019

hts-specs-bot commented Oct 7, 2019

hts-specs-bot commented Oct 7, 2019

lbergelson commented Dec 2, 2019

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jmarshall commented Dec 5, 2019

hts-specs-bot commented Aug 11, 2020

h-2 commented Jan 13, 2023 • edited Loading

d-cameron commented Jul 10, 2023 • edited Loading

pd3 commented Jul 11, 2023 • edited Loading

olest commented Oct 6, 2023

jmarshall commented Nov 13, 2023

jkbonfield commented Nov 28, 2023

github-actions bot commented Jan 2, 2024 • edited by jmarshall Loading

jmarshall commented Jan 2, 2024

danking commented Jan 31, 2024 • edited Loading

danking commented Jan 31, 2024

LeeTL1220 commented Jan 31, 2024 • edited Loading

d-cameron commented Feb 27, 2024

cyenyxe Jul 29, 2019 •

edited

Loading

mlin Jul 29, 2019 •

edited

Loading

cyenyxe Jul 29, 2019 •

edited

Loading

jmarshall commented Sep 9, 2019 •

edited

Loading

h-2 commented Jan 13, 2023 •

edited

Loading

d-cameron commented Jul 10, 2023 •

edited

Loading

pd3 commented Jul 11, 2023 •

edited

Loading

github-actions bot commented Jan 2, 2024 •

edited by jmarshall

Loading

danking commented Jan 31, 2024 •

edited

Loading

LeeTL1220 commented Jan 31, 2024 •

edited

Loading