Improvements to indel calling. #1679

jkbonfield · 2022-03-15T17:22:03Z

The bcf_call_glfgen code had a heuristic to change indel type calls to 0 (REF) where the read doesn't have an indel at that location, under the assumption our assessment was bad.

This may have been helpful in the past, but with better consensus calculation our type calls are now more robust. Instead we change such data to be unclassified types instead so they don't count to AD and PL.

This is still a fudge, but overall it's beneficial. We get a 7% rise in FP value, but a 2% drop in FN and huge 16% drop in GT assignment errors.
Improved bcf_cgp_find_types, which determines the number of candidate indel sizes present, to consider per_sample_flt and min_fract parameters. Previously these were only filtered out after every read has been realigned, but filtering earlier reduces the number of incorrect indel assignments and improves accuracy.

It likely also the cause of PacBio calling speeding up, as we have a low-level of indel errors in the data.
Replaced the old bcf_cgp_ref_sample code with a new bcf_cgp_consensus algorithm.

The original logic starting with the real reference and by walking through read CIGAR strings amending any SNPs. It did this once per sample. Then per hypothetical indel type, this "sample reference" was amended at the indel site itself by inserting or deleting bases. Each read was then aligned against this modified ref to evaluate how well it matched.

Unfortunately this is simply incorrect when there are neighbouring indels present, as only the central under-investigation indel ever made it into the modified reference.

The new code produces up to two consensus sequences per sample, based on all of the SEQ+CIGAR elements (including indels). This copes with evaluating e.g. a homozygous insertion flanked by a heterozygous deletion (hence two consensuses). A small dose of real reference is applied to, to cope with low coverage artefacts.

This is the heart of these indel calling changes and the primary reason for fixing some obvious looking indel calls that were
previously missed. However it's still not a replacement for a full denovo reassembly and alignment errors can still lead to incorrect results, but by and large we now fail on messy data instead of occasionally on good.
bcf_cgp_align_score can now be given two references, and we return the alignment score that's best. The band sizing is also a little different, now operating on a band related to all observed indel types and not just the specific indel type being assessed and also taking into account the total insertions and deletions in nearby positions gathered during consensus construction.
Improved est_seqQ. This is a naive sequence quality assignment based on the size of indel vs the size of homopolymer at that position. This is reasonable for a deletion, but for an insertion it didn't check that the inserted sequence actually matched the homopolymer base type.

TODO: homopolymer is too naive. It should be looking at indel in STR repeat unit terms.
Improved the indel window-size adjustments for long reads. This worked because long reads rarely have alignment end effects where the aligner has produced reference bias by soft-clipping a few trailing bases. However it's now resized more accurately based on the total proportion of bases covered by STRs within the window. On average this is smaller than the old "indel_win_size/2" method so further speeds up alignment of long read data.
Lots of additional commenting. Some explaining how things work, and a few with "TODO" ideas for potential improvements to explore.
The various bits of commented out debugging code are now more cleanly wrapped up in a CONS_DEBUG define, to ease enabling and disabling.
Fixed a few minor memory leaks on failure. These likely never occurred, or if they did were immediately followed by program exit anyway.
Reduced the default indel_win_size from 110 to 80. With better consensus construction it was found that a smaller value worked better. Also there is no longer a hard lower limit of 110. It's now possible to override this down to 20.
Improved the find_STR function to be able to spot repeat units up to 14 bases. It was previously 8.

Please advise on whether you wish it squashed. It'll be a huge change then, but equally so it's not so useful to look at all the commits as they contain ideas which were subsequently removed again. Squashing just some of it, and in particular reordering to group things together, will be problematic as it'll lead to a myriad of conflicts due to adding and removing exploratory code.

Replaces #1648

Some stats on GIAB HG002 chr1 truth set. Showing percentages for all data, QUAL>=30 and QUAL>=30 with additional DP filtering.

15x dev  (only 1-169MB as lacking in dev)
InDel TP   94.07 /   90.28 /   88.21
InDel FP    1.47 /    1.04 /    0.73
InDel GT    2.39 /    2.20 /    2.11
InDel FN    5.93 /    9.72 /   11.79

15x new (1-169MB)
InDel TP   94.48 /   91.35 /   89.18
InDel FP    1.29 /    0.89 /    0.76 --/+
InDel GT    2.45 /    2.25 /    2.18  +/~
InDel FN    5.52 /    8.65 /   10.82 --/--

15x new (all chr)
InDel TP   94.73 /   91.64 /   89.56
InDel FP    1.26 /    0.91 /    0.79
InDel GT    2.38 /    2.19 /    2.12
InDel FN    5.27 /    8.36 /   10.44

--------------------
30x dev
InDel TP   97.94 /   96.99 /   96.80
InDel FP    1.08 /    0.83 /    0.69
InDel GT    1.04 /    0.99 /    0.98
InDel FN    2.06 /    3.01 /    3.20

30x new
InDel TP   98.11 /   97.36 /   97.18
InDel FP    0.95 /    0.77 /    0.68  -/~
InDel GT    0.97 /    0.90 /    0.89  -/-
InDel FN    1.89 /    2.64 /    2.82  -/--

--------------------
60x dev
InDel TP   98.59 /   98.35 /   98.32
InDel FP    0.84 /    0.68 /    0.62
InDel GT    0.75 /    0.74 /    0.74
InDel FN    1.41 /    1.65 /    1.68

60x new
InDel TP   98.74 /   98.58 /   98.54
InDel FP    0.80 /    0.67 /    0.61  -/~
InDel GT    0.53 /    0.51 /    0.50 --/--
InDel FN    1.26 /    1.42 /    1.46 --/--

--------------------
150x dev
InDel TP   98.80 /   98.72 /   98.70
InDel FP    0.81 /    0.72 /    0.61
InDel GT    0.58 /    0.56 /    0.56
InDel FN    1.20 /    1.28 /    1.30

150x new
InDel TP   98.88 /   98.84 /   98.80
InDel FP    0.68 /    0.62 /    0.55 --/-
InDel GT    0.35 /    0.35 /    0.35 --/--
InDel FN    1.12 /    1.16 /    1.20  -/-

30x ROC curve (FP/FN percentage rates)

Edit: culling 15x plot as it erroneously was for all of chr1 despite missing a chunk in my run of develop branch. Reproducing this VCF now.

jkbonfield · 2022-03-16T09:45:02Z

Some performance figures on GIAB HG002.GRCh38.PacBio_CCS_15Kb.bam (approx 30x coverage) data:

develop:
SNP          Q>0 /   Q>=30 / Filtered
SNP   TP   99.97 /   99.93 /   99.93
SNP   FP    1.25 /    0.88 /    0.88
SNP   GT    0.19 /    0.19 /    0.19
SNP   FN    0.03 /    0.07 /    0.07

InDel TP   97.84 /   96.09 /   96.09
InDel FP   49.14 /   16.55 /   16.48
InDel GT   13.91 /   13.88 /   13.88
InDel FN    2.16 /    3.91 /    3.91


new:
SNP          Q>0 /   Q>=30 / Filtered
...
InDel TP   98.07 /   95.73 /   95.73
InDel FP   41.06 /   12.42 /   12.36
InDel GT   12.58 /   12.53 /   12.53
InDel FN    1.93 /    4.27 /    4.27

There is a slight shift in FN vs FP figures here at Q30 cutoff, but the new code at a slightly reduced Q25 shows it beats develop on all 3 of FP, GT and FN figures.

new at Q25:
InDel TP   98.07 /   96.43 /   96.43
InDel FP   41.06 /   15.59 /   15.53
InDel GT   12.58 /   12.56 /   12.56
InDel FN    1.93 /    3.57 /    3.57

In both develop and this PR, it's still very much favouring recall over precision for the CCS data. We get a much more balanced starting point with -h 200, but overall the ROC curve is poorer so we're better off taking the default of -h 200 and just setting a slightly higher QUAL threshold to filter instead (eg Q40 is a more even recall vs preicison tradeoff).

jkbonfield · 2022-03-16T15:45:59Z

I see my tweak to filter bcf_cgp_find_types candidates by the -m and -F parameters was incorrect and it was passing if -m or -F were accepted, whereas the actual filtering applied post removal is -m and -F conditions must be met. It worked initially, but at some point I incorrectly changed it. Given the defaults of -m 2 -F 0.05 on low coverage data this basically means anything is accepted still (initially) due to the low -F value.

Tweaking this (still to be pushed) gives these plots (tweak in "PR-b"):

[EDIT: fixed scaling as I had the wrong total number of indels in my gnuplot script.]

The effect of this filtering is we lose some recall, pretty much as expected. I am guessing it's the case where only a single alignment has an indel according to the CIGAR pileup, but post realignment the number of candidates reported matching that indel type is >= 2. Or possibly it's for GT cases 1/2 where -m and -F are the combined AD values for all types combined and not applied per type?

Regardless, given these are command line parameters that can be reduced if needed, the default behaviour looks very strong as a false positive removal mechanism. I'll do a bit more testing and then push an update (plus rebase to bring in the CI testing fix).

The fixed +/- indel_win_size is still used in construction of the type[] array, but then we reduce that size down before doing the alignments and evaluating them. This means, where appropriate (nice unique data without lots of STRs) we can assess over a smaller window and avoid the negative impact of other nearby indels. It cures the example covid19 problem, but also reduces recall elsewhere as if we *do* still get other nearby indels (eg due to low complexity data between our candidate indel and a neighbouring one) then we're now paying a much larger normalised per-base hit as the length is smaller.

First 30MB of SynDip chr1: SNP Q>0 / Q>=30 / Filtered InDel TP 4915 / 4817 / 4796 InDel FP 334 / 266 / 226 InDel GT 265 / 249 / 245 InDel FN 1528 / 1626 / 1647 vs devel: InDel TP 4910 / 4828 / 4807 InDel FP 266 / 192 / 169 InDel GT 240 / 230 / 226 InDel FN 1533 / 1615 / 1636 So not a win apparently. However it's a starting point, and many of the FP/FN are related to differing calls at correct locations.

Multiple del types need to use ref_cons based on the biggest del not the current one, so we're not incorporating other bits of deletion in the current "run". Bug fix to seq-pos ("y") when doing cons_base vs ref_base on INS.

The define INS_PLUS_BASE is enabled, which is how the old code worked so no change. This includes the next base matching call in the last base of the insertion, so it's *either* cons_base *or* cons_ins. However we instead have an easier to understand alternative which is an easier to understand with cons_ins being the inserted bases and cons_base always being used for the matching bases. (This corrects some defects for reads than end on an insertion with no flanking M, although that is rare.) Unfortunately, for reasons unknown, it is very slightly higher on FP/GT. (Hence we still use the old method)

We have the most common consensus plus a sub-consensus for when neighbouring heterozygous indels are present. This is acting as an alternative to using the old ref_sample method.

Also tweak ins vs N merge threshold.

- We previously started our consensus with 1 depth copy of ref_sample. This brings in data from the overall consensus for where one specific indel 'type' doesn't have data spanning the entire region. We now ensure a minimum of 1-fold depth is copied from the ref_base data instead. It's still not quite as good results, but good enough and paves the way for removal of a lot of existing code. - The bcf_call_glfgen code had a heuristic to change type calls to 0 (REF) where the read doesn't have an indel at that location, under the assumption our assessment was bad. This may have been helpful in the past, but with better consensus calculation our type calls are now more robust. Instead we change such data to be unclassified types instead so they don't count to AD and PL. This is still a fudge. It may be preferable to leave them as-is, and/or filter by some other means such as whether they span STRs. We get a 7% rise in FP value, but a 2% drop in FN and 16% drop in GT assignment errors. TODO: study those new FPs in more detail.

Also a small correction to consensus rfract so it's a minimum rather than additive amendment.

Ie cnum==1 is always opposite of cnum==0 for hets.

The INS_PLUS_BASE define is complicated code, and validation now shows the differences are very marginal. So disabled for now and will cull soon so we only have the simpler variant left.

- local_band is computed only over left..right part of alignment instead of from the entire CIGAR string. - Fix for p->indel == type matching. This isn't sufficient to filter out when we're in the deletion being studied if it starts eg 1 base earlier, where p->indel is now zero, but p->is_del is true. - Use band size computed from all observed indels and STRs, not "type". - Replace bca->indel_win_size/2 for long reads with an indel/STR aware limit. - Replace the tbeg/tend calculations with a indel/STR aware setting.

- Fixed (?) out by one error in candidate del caller: Was "i >= pos-left" and now "i > pos-left". This was incorrectly not marking the last base in a het del as used when immediately followed by an insertion (or maybe it was vice versa). - The same line as above decided to add to either cons_base vs ref_base using "biggest_del", but this fails when we have nested deletions present. Eg -10 at pos 105 and -1 at pos 107. We now track containing deletions and set a skip_to coordinate instead of relying on "biggest_del". - Alter the fraction at which we label a heterozygous indel as heterozygous from .2 to .3. (Shift10h7j in notes)

Also bug fix "len==type" check in deletion. It should be "len==-type".

Also fixed a minor memory leak (approx 36 bytes per 1MB of genome).

Similarly for insertion. These aren't exact duplicates, as it's overall and per-sample, but close enough and the doubling up of work was accidental.

If the sequence being inserted has differing base-calls to the flanking sequence, then est_seqQ's analysis of homopolymer runs and how they impact the score is irrelevant as the insertion isn't ambiguous anyway. This has a small reduction to FN on PacBio CCS with minimal change to FP.

- We erroneously had the -m and -F pre-filter options applies as needing either to pass instead of both to pass in bcf_cgp_find_types. Fixing this is a significant reduction to FP rates. - The computation of which STRs spanned the indel being assessed was sometimes a little bit out, due to using "qpos" instead of "pos". This removes a few false negatives. - Improve the calculation of indel cost ("iscore") by counting the number of additional repeat units beyond the end of the sequence. The old penalty for STRs at the end of reads has been removed, meaning we regain some FNs while still penalising (more) the cases that lead to FPs. Overall there is some small FP / FN adjustments, but the biggest improvement here is a significant drop in GT errors (~10% fewer).

jkbonfield · 2022-03-17T16:16:46Z

Updated with some tweaks to improve things further (as mentioned above):

We erroneously had the -m and -F pre-filter options applies as needing either to pass instead of both to pass in bcf_cgp_find_types. Fixing this is a significant reduction to FP rates.
The computation of which STRs spanned the indel being assessed was sometimes a little bit out, due to using "qpos" instead of "pos". This removes a few false negatives.
Improve the calculation of indel cost ("iscore") by counting the number of additional repeat units beyond the end of the sequence. The old penalty for STRs at the end of reads has been removed, meaning we regain some FNs while still penalising (more) the cases that lead to FPs.

Overall there is some small FP / FN adjustments, but the biggest improvement here is a significant drop in GT errors (~10% fewer).

PacBioCCS plot is being generated atm.

jkbonfield · 2022-03-17T16:53:32Z

Some raw numbers, as percentages of the truth set. Columns are all (qual>0), QUAL>=30 and QUAL>=30 plus extra filters (IDV, IMF, DP).

15x dev
InDel TP   94.35 /   90.56 /   88.57
InDel FP    1.42 /    1.05 /    0.74
InDel GT    2.25 /    2.08 /    1.99
InDel FN    5.65 /    9.44 /   11.43

15x 9f10
InDel TP   93.95 /   91.10 /   89.13
InDel FP    0.95 /    0.63 /    0.52 -30%
InDel GT    2.45 /    2.29 /    2.21 +11%
InDel FN    6.05 /    8.90 /   10.87 -5%

30x dev
InDel TP   97.94 /   96.99 /   96.80
InDel FP    1.08 /    0.83 /    0.69
InDel GT    1.04 /    0.99 /    0.98
InDel FN    2.06 /    3.01 /    3.20

30x 9f10
InDel TP   98.01 /   97.31 /   97.13
InDel FP    0.87 /    0.69 /    0.61 -12%
InDel GT    0.85 /    0.80 /    0.79 -19%
InDel FN    1.99 /    2.69 /    2.87 -10%

60x dev
InDel TP   98.59 /   98.35 /   98.32
InDel FP    0.84 /    0.68 /    0.62
InDel GT    0.75 /    0.74 /    0.74
InDel FN    1.41 /    1.65 /    1.68

60x 9f10
InDel TP   98.72 /   98.57 /   98.54
InDel FP    0.76 /    0.63 /    0.58 -10%
InDel GT    0.43 /    0.42 /    0.41 -45%
InDel FN    1.28 /    1.43 /    1.46 -13%

The percentage change vs the develop branch is significant for FP and FN at all depths. GT assignment errors are up for shallow data, although in absolute numbers it's still fewer new GT errors than we've recovered in FN, so this is largely just an issue of low coverage true variants having a significant portion with the wrong genotype assignment. At higher depths the GT error really plummets.

We already had this elsewhere, but forgot this case. It trips up on GIAB HG002.GRCh38.PacBio_CCS_15Kb.bam at chr1:112149167 which has a 13KB deletion.

jkbonfield · 2022-03-17T17:29:15Z

PacBio CCS stats:

This broke at 112MB into chr1 as it found a 12KB deletion (now fixed). Rerunning, but the shape of the graph is unlikely to change. There are very high false positive and false negative rates on this data. We can use e.g. -h 200 to remove many of the low quality false positives, but it's generally best to simply filter harder instead. Most of the errors (both FN and FP) are homopolymer issues, which is simply the nature of the technology.

SynDip chr1 stats:

At 15x the recent additional commit has removed some true variants, so elevated FN rate, but it's the only change which is consistently better on FN vs FP tradeoff. Frankly all of them have terrible FN miss rates.

At 45x the new PR diverges and starts to become poorer than current develop due to higher FP rates. However this only occurs at QUAL >=40 or so and the FN rates are still so enormous as to not be practically usable. I think this is likely just the nature of the truth set, and I'm being overly pedantic in my accuracy assessment. If I recall there is a caveat in SynDip about not trusting small indels due to the nature of the underlying PacBio assembly and homopolymers, so it may simply be plain wrong on them.

jkbonfield · 2022-03-17T17:32:11Z

Finally, assuming my revised pacbio test runs to completion, I think I'm done with this PR.

What I haven't tested is multi-sample calling, simply because I don't know of appropriate truth sets to use (maybe a trio though?) and we don't have a written best practice guide for it so I don't know the normal method of doing this. Do we just do it in one command, or produce gVCFs and merge? We really need to document this pipeline, but that's a job for someone who knows what they're doing to do. :-)

So I'll leave such validation up to you.

This fixes a read buffer-overrun, although I'm baffled how we didn't hit this with the original code as it's never apparently known where the reference ends.

This is a combination of PR samtools#1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history. PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here. The newer edlib changes are designed to dramtically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then the aligning against the real allele should be a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights do not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.

The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR samtools#1679, but has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false-negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller.

This is a combination of PR samtools#1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history. PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here. The newer edlib changes are designed to dramtically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then the aligning against the real allele should be a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights do not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.

The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR samtools#1679, but has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false-negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller.

This is a combination of PR samtools#1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history. PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here. The newer edlib changes are designed to dramtically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then the aligning against the real allele should be a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights do not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.

The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR samtools#1679, but has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false-negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller. Also tidied up the usage statement and added an explicit "-X list" to print the profile parameters. - Add extra debugging defines. GLF_DEBUG reports GLF calculation in bam2bcf.c. ALIGN_DEBUG uses edlib EDLIB_TASK_PATH to report sequence alignment. NB to use this you need to link against edlib itself rather than the cutdown version in this repository. Also fix the edlib heuristics used in bcf_call_glfgen. We don't want to change the call (b=5) as this affects AD. Instead we change the quality so poor calls get filtered by QUAL rather than simply being removed. - Tweak edlib tuning for SeqQ/qual. Add quality value assessment into soft-clip recovery. Use /500 instead of /111 in indelQ assignment, and skew indel-bias accordingly. This gives better separation of FP/GT/FN generally. - Added --seqq-offset parameter so we can use it in tunables per profile. This is used as a limit on the seqQ reduction in the "VAL-5*MIN(20,depth)" formula, used for favouring data over seqQ scores when depth is sufficient. Experimentation showed no single value that worked for all platforms, but the default is in the middle. - Tidy up to cull ifdefed and commented out code. - Add test for indels-cns. It's minimal, but the whole indel calling has minimal testing. I think we have under 10 indels in total with develop (all short read and mostly duplications of each other), and no testing of indels-2.0. This tests 4 indels with indels-cns. - Added documentation for the new --indels-2.0 options - Cull more unused parts of edlib.c. This avoids clang warnings (which become errors with -Werror). We're only including the bits we need here for mpileup. If you want the whole thing, link against the upstream source library instead.

This is a combination of PR #1679 from 2022 coupled with more recent changes to replace BAQ with edlib for evaluating the alignment of reads against candidate alleles. Due to the complexity of rebasing many commits with many conflicts, these have been squashed together for simplicity. Please consult the original commit log in the PR for more detailed history. PR 1679 is a major restructuring of the indel caller that produces proper consensus sequences. TODO: paste collected commit messages here. The newer edlib changes are designed to dramtically increase the speed of indel calling on long read data by use of a modern algorithm. Note edlib's alignments are very crude, with just +1 for all differences including substitutions and each indel base. However if we've computed our candidate allele consensus sequences correctly then the aligning against the real allele should be a minimal score regardless of scoring regime, so the cost of an HMM or even affine weights do not help much. The lack of quality values is compensated for by detection of minimum quality within STRs.

The main benefit of the new indel caller is the use of alignment against diploid consensus generation instead of simply patching the reference with candidate indel types. This greatly reduces false positives and incorrect allele alignment (leading to wrong genotype calls). This was the earlier PR #1679, but has since acquired edlib as an alternative to BAQ for the indel alignment. However this is primarily a speed benefit (with some minor removal of false-negatives due to quality smashing), and isn't the main thing users should think about when choosing an indel caller. Also tidied up the usage statement and added an explicit "-X list" to print the profile parameters. - Add extra debugging defines. GLF_DEBUG reports GLF calculation in bam2bcf.c. ALIGN_DEBUG uses edlib EDLIB_TASK_PATH to report sequence alignment. NB to use this you need to link against edlib itself rather than the cutdown version in this repository. Also fix the edlib heuristics used in bcf_call_glfgen. We don't want to change the call (b=5) as this affects AD. Instead we change the quality so poor calls get filtered by QUAL rather than simply being removed. - Tweak edlib tuning for SeqQ/qual. Add quality value assessment into soft-clip recovery. Use /500 instead of /111 in indelQ assignment, and skew indel-bias accordingly. This gives better separation of FP/GT/FN generally. - Added --seqq-offset parameter so we can use it in tunables per profile. This is used as a limit on the seqQ reduction in the "VAL-5*MIN(20,depth)" formula, used for favouring data over seqQ scores when depth is sufficient. Experimentation showed no single value that worked for all platforms, but the default is in the middle. - Tidy up to cull ifdefed and commented out code. - Add test for indels-cns. It's minimal, but the whole indel calling has minimal testing. I think we have under 10 indels in total with develop (all short read and mostly duplications of each other), and no testing of indels-2.0. This tests 4 indels with indels-cns. - Added documentation for the new --indels-2.0 options - Cull more unused parts of edlib.c. This avoids clang warnings (which become errors with -Werror). We're only including the bits we need here for mpileup. If you want the whole thing, link against the upstream source library instead.

jkbonfield · 2024-09-17T08:48:44Z

Closing as this was superceded by #2045 which has was merged in as --indels-cns.

jkbonfield added 28 commits March 17, 2022 16:03

Fixes to consensus.

0e050ce

Multiple del types need to use ref_cons based on the biggest del not the current one, so we're not incorporating other bits of deletion in the current "run". Bug fix to seq-pos ("y") when doing cons_base vs ref_base on INS.

Fix display of type/TYPE cons. Debug change only

d826490

Apply IDV per type, not just across all indels as a whole.

7db37bb

First stab at returning dual consensus.

c50d25b

We have the most common consensus plus a sub-consensus for when neighbouring heterozygous indels are present. This is acting as an alternative to using the old ref_sample method.

Improvements to 2nd consensus generation

bfc951f

Correct consensus to match type, even if data disagrees.

72b88f7

Also tweak ins vs N merge threshold.

Change default indel-size and lower limit

76b542b

Remove debugging

1295ec1

Improvements to consensus generation

b1a2a10

Adjustments to the ref vs query region for alignments.

98f4b74

Also a small correction to consensus rfract so it's a minimum rather than additive amendment.

Code tidyup

36b972b

Give deletions the same treatment as insertions for consensus gen.

68485fb

Ie cnum==1 is always opposite of cnum==0 for hets.

Minor tidying plus tweak STR iscore code

333f203

Disable INS_PLUS_BASE, plus a start on better band calc.

c450882

The INS_PLUS_BASE define is complicated code, and validation now shows the differences are very marginal. So disabled for now and will cull soon so we only have the simpler variant left.

Improve STR finder to cope with longer repeat units

bc9cad0

Lots of tidying up of consensus code.

bbfadc1

Also bug fix "len==type" check in deletion. It should be "len==-type".

Fix a small buffer overrun introduced by changing tbeg/tend usage.

1173c62

Also fixed a minor memory leak (approx 36 bytes per 1MB of genome).

Tidy the consensus code

ac4d267

Tidy up biggest_del / max_deletion.

5271e9c

Similarly for insertion. These aren't exact duplicates, as it's overall and per-sample, but close enough and the doubling up of work was accidental.

Tidy the indel changes up a bit.

ed5458b

Fix test mpileup files

9de001b

jkbonfield force-pushed the indel-tweak-jkb1 branch from dfc1861 to 16834c7 Compare March 17, 2022 16:16

Add bounds checking to heti/hetd arrays.

155e36e

We already had this elsewhere, but forgot this case. It trips up on GIAB HG002.GRCh38.PacBio_CCS_15Kb.bam at chr1:112149167 which has a 13KB deletion.

Add a check for end of reference when indel calling.

72c0f76

This fixes a read buffer-overrun, although I'm baffled how we didn't hit this with the original code as it's never apparently known where the reference ends.

pd3 mentioned this pull request May 2, 2022

bcftools fails to genotype MNP/indels from targeted-sequencing reads #1706

Closed

jkbonfield mentioned this pull request Jan 31, 2023

failed to detect one mutation #1854

Closed

pd3 force-pushed the develop branch from ab5de54 to 2191405 Compare February 13, 2023 10:34

jkbonfield mentioned this pull request Mar 2, 2023

[E::sam_parse1] CIGAR and query sequence are of different length ksahlin/strobealign#245

Closed

jkbonfield mentioned this pull request Oct 13, 2023

bcftools run over 18 hours only calling 11Mb sequence of one chromosome #2000

Open

jkbonfield mentioned this pull request Nov 23, 2023

revision of 1679 plus switch to edlib over BAQ #2045

Merged

jkbonfield mentioned this pull request Sep 17, 2024

Document / Improve / Polish / Fix --indels-2.0 #2280

Open

jkbonfield closed this Sep 17, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Improvements to indel calling. #1679

Improvements to indel calling. #1679

jkbonfield commented Mar 15, 2022 •

edited

Loading

jkbonfield commented Mar 16, 2022

jkbonfield commented Mar 16, 2022 •

edited

Loading

jkbonfield commented Mar 17, 2022

jkbonfield commented Mar 17, 2022

jkbonfield commented Mar 17, 2022 •

edited

Loading

jkbonfield commented Mar 17, 2022

jkbonfield commented Sep 17, 2024

Improvements to indel calling. #1679

Improvements to indel calling. #1679

Conversation

jkbonfield commented Mar 15, 2022 • edited Loading

jkbonfield commented Mar 16, 2022

jkbonfield commented Mar 16, 2022 • edited Loading

jkbonfield commented Mar 17, 2022

jkbonfield commented Mar 17, 2022

jkbonfield commented Mar 17, 2022 • edited Loading

jkbonfield commented Mar 17, 2022

jkbonfield commented Sep 17, 2024

jkbonfield commented Mar 15, 2022 •

edited

Loading

jkbonfield commented Mar 16, 2022 •

edited

Loading

jkbonfield commented Mar 17, 2022 •

edited

Loading