Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improvements to indel calling. #1679

Closed
wants to merge 31 commits into from

Conversation

jkbonfield
Copy link
Contributor

@jkbonfield jkbonfield commented Mar 15, 2022

  • The bcf_call_glfgen code had a heuristic to change indel type calls to 0 (REF) where the read doesn't have an indel at that location, under the assumption our assessment was bad.

    This may have been helpful in the past, but with better consensus calculation our type calls are now more robust. Instead we change such data to be unclassified types instead so they don't count to AD and PL.

    This is still a fudge, but overall it's beneficial. We get a 7% rise in FP value, but a 2% drop in FN and huge 16% drop in GT assignment errors.

  • Improved bcf_cgp_find_types, which determines the number of candidate indel sizes present, to consider per_sample_flt and min_fract parameters. Previously these were only filtered out after every read has been realigned, but filtering earlier reduces the number of incorrect indel assignments and improves accuracy.

    It likely also the cause of PacBio calling speeding up, as we have a low-level of indel errors in the data.

  • Replaced the old bcf_cgp_ref_sample code with a new bcf_cgp_consensus algorithm.

    The original logic starting with the real reference and by walking through read CIGAR strings amending any SNPs. It did this once per sample. Then per hypothetical indel type, this "sample reference" was amended at the indel site itself by inserting or deleting bases. Each read was then aligned against this modified ref to evaluate how well it matched.

    Unfortunately this is simply incorrect when there are neighbouring indels present, as only the central under-investigation indel ever made it into the modified reference.

    The new code produces up to two consensus sequences per sample, based on all of the SEQ+CIGAR elements (including indels). This copes with evaluating e.g. a homozygous insertion flanked by a heterozygous deletion (hence two consensuses). A small dose of real reference is applied to, to cope with low coverage artefacts.

    This is the heart of these indel calling changes and the primary reason for fixing some obvious looking indel calls that were
    previously missed. However it's still not a replacement for a full denovo reassembly and alignment errors can still lead to incorrect results, but by and large we now fail on messy data instead of occasionally on good.

  • bcf_cgp_align_score can now be given two references, and we return the alignment score that's best. The band sizing is also a little different, now operating on a band related to all observed indel types and not just the specific indel type being assessed and also taking into account the total insertions and deletions in nearby positions gathered during consensus construction.

  • Improved est_seqQ. This is a naive sequence quality assignment based on the size of indel vs the size of homopolymer at that position. This is reasonable for a deletion, but for an insertion it didn't check that the inserted sequence actually matched the homopolymer base type.

    TODO: homopolymer is too naive. It should be looking at indel in STR repeat unit terms.

  • Improved the indel window-size adjustments for long reads. This worked because long reads rarely have alignment end effects where the aligner has produced reference bias by soft-clipping a few trailing bases. However it's now resized more accurately based on the total proportion of bases covered by STRs within the window. On average this is smaller than the old "indel_win_size/2" method so further speeds up alignment of long read data.

  • Lots of additional commenting. Some explaining how things work, and a few with "TODO" ideas for potential improvements to explore.

  • The various bits of commented out debugging code are now more cleanly wrapped up in a CONS_DEBUG define, to ease enabling and disabling.

  • Fixed a few minor memory leaks on failure. These likely never occurred, or if they did were immediately followed by program exit anyway.

  • Reduced the default indel_win_size from 110 to 80. With better consensus construction it was found that a smaller value worked better. Also there is no longer a hard lower limit of 110. It's now possible to override this down to 20.

  • Improved the find_STR function to be able to spot repeat units up to 14 bases. It was previously 8.

Please advise on whether you wish it squashed. It'll be a huge change then, but equally so it's not so useful to look at all the commits as they contain ideas which were subsequently removed again. Squashing just some of it, and in particular reordering to group things together, will be problematic as it'll lead to a myriad of conflicts due to adding and removing exploratory code.

Replaces #1648

Some stats on GIAB HG002 chr1 truth set. Showing percentages for all data, QUAL>=30 and QUAL>=30 with additional DP filtering.

15x dev  (only 1-169MB as lacking in dev)
InDel TP   94.07 /   90.28 /   88.21
InDel FP    1.47 /    1.04 /    0.73
InDel GT    2.39 /    2.20 /    2.11
InDel FN    5.93 /    9.72 /   11.79

15x new (1-169MB)
InDel TP   94.48 /   91.35 /   89.18
InDel FP    1.29 /    0.89 /    0.76 --/+
InDel GT    2.45 /    2.25 /    2.18  +/~
InDel FN    5.52 /    8.65 /   10.82 --/--

15x new (all chr)
InDel TP   94.73 /   91.64 /   89.56
InDel FP    1.26 /    0.91 /    0.79
InDel GT    2.38 /    2.19 /    2.12
InDel FN    5.27 /    8.36 /   10.44

--------------------
30x dev
InDel TP   97.94 /   96.99 /   96.80
InDel FP    1.08 /    0.83 /    0.69
InDel GT    1.04 /    0.99 /    0.98
InDel FN    2.06 /    3.01 /    3.20

30x new
InDel TP   98.11 /   97.36 /   97.18
InDel FP    0.95 /    0.77 /    0.68  -/~
InDel GT    0.97 /    0.90 /    0.89  -/-
InDel FN    1.89 /    2.64 /    2.82  -/--

--------------------
60x dev
InDel TP   98.59 /   98.35 /   98.32
InDel FP    0.84 /    0.68 /    0.62
InDel GT    0.75 /    0.74 /    0.74
InDel FN    1.41 /    1.65 /    1.68

60x new
InDel TP   98.74 /   98.58 /   98.54
InDel FP    0.80 /    0.67 /    0.61  -/~
InDel GT    0.53 /    0.51 /    0.50 --/--
InDel FN    1.26 /    1.42 /    1.46 --/--

--------------------
150x dev
InDel TP   98.80 /   98.72 /   98.70
InDel FP    0.81 /    0.72 /    0.61
InDel GT    0.58 /    0.56 /    0.56
InDel FN    1.20 /    1.28 /    1.30

150x new
InDel TP   98.88 /   98.84 /   98.80
InDel FP    0.68 /    0.62 /    0.55 --/-
InDel GT    0.35 /    0.35 /    0.35 --/--
InDel FN    1.12 /    1.16 /    1.20  -/-

30x ROC curve (FP/FN percentage rates)

image

Edit: culling 15x plot as it erroneously was for all of chr1 despite missing a chunk in my run of develop branch. Reproducing this VCF now.

@jkbonfield
Copy link
Contributor Author

Some performance figures on GIAB HG002.GRCh38.PacBio_CCS_15Kb.bam (approx 30x coverage) data:

develop:
SNP          Q>0 /   Q>=30 / Filtered
SNP   TP   99.97 /   99.93 /   99.93
SNP   FP    1.25 /    0.88 /    0.88
SNP   GT    0.19 /    0.19 /    0.19
SNP   FN    0.03 /    0.07 /    0.07

InDel TP   97.84 /   96.09 /   96.09
InDel FP   49.14 /   16.55 /   16.48
InDel GT   13.91 /   13.88 /   13.88
InDel FN    2.16 /    3.91 /    3.91


new:
SNP          Q>0 /   Q>=30 / Filtered
...
InDel TP   98.07 /   95.73 /   95.73
InDel FP   41.06 /   12.42 /   12.36
InDel GT   12.58 /   12.53 /   12.53
InDel FN    1.93 /    4.27 /    4.27

There is a slight shift in FN vs FP figures here at Q30 cutoff, but the new code at a slightly reduced Q25 shows it beats develop on all 3 of FP, GT and FN figures.

new at Q25:
InDel TP   98.07 /   96.43 /   96.43
InDel FP   41.06 /   15.59 /   15.53
InDel GT   12.58 /   12.56 /   12.56
InDel FN    1.93 /    3.57 /    3.57

In both develop and this PR, it's still very much favouring recall over precision for the CCS data. We get a much more balanced starting point with -h 200, but overall the ROC curve is poorer so we're better off taking the default of -h 200 and just setting a slightly higher QUAL threshold to filter instead (eg Q40 is a more even recall vs preicison tradeoff).

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Mar 16, 2022

I see my tweak to filter bcf_cgp_find_types candidates by the -m and -F parameters was incorrect and it was passing if -m or -F were accepted, whereas the actual filtering applied post removal is -m and -F conditions must be met. It worked initially, but at some point I incorrectly changed it. Given the defaults of -m 2 -F 0.05 on low coverage data this basically means anything is accepted still (initially) due to the low -F value.

Tweaking this (still to be pushed) gives these plots (tweak in "PR-b"):

[EDIT: fixed scaling as I had the wrong total number of indels in my gnuplot script.]

image

image

The effect of this filtering is we lose some recall, pretty much as expected. I am guessing it's the case where only a single alignment has an indel according to the CIGAR pileup, but post realignment the number of candidates reported matching that indel type is >= 2. Or possibly it's for GT cases 1/2 where -m and -F are the combined AD values for all types combined and not applied per type?

Regardless, given these are command line parameters that can be reduced if needed, the default behaviour looks very strong as a false positive removal mechanism. I'll do a bit more testing and then push an update (plus rebase to bring in the CI testing fix).

The fixed +/- indel_win_size is still used in construction of the
type[] array, but then we reduce that size down before doing the
alignments and evaluating them.

This means, where appropriate (nice unique data without lots of STRs)
we can assess over a smaller window and avoid the negative impact of
other nearby indels.

It cures the example covid19 problem,  but also reduces recall
elsewhere as if we *do* still get other nearby indels (eg due to low
complexity data between our candidate indel and a neighbouring one)
then we're now paying a much larger normalised per-base hit as the
length is smaller.
First 30MB of SynDip chr1:

SNP          Q>0 /   Q>=30 / Filtered
InDel TP    4915 /    4817 /    4796
InDel FP     334 /     266 /     226
InDel GT     265 /     249 /     245
InDel FN    1528 /    1626 /    1647

vs devel:

InDel TP    4910 /    4828 /    4807
InDel FP     266 /     192 /     169
InDel GT     240 /     230 /     226
InDel FN    1533 /    1615 /    1636

So not a win apparently.

However it's a starting point, and many of the FP/FN are related to
differing calls at correct locations.
Multiple del types need to use ref_cons based on the biggest del not
the current one, so we're not incorporating other bits of deletion in
the current "run".

Bug fix to seq-pos ("y") when doing cons_base vs ref_base on INS.
The define INS_PLUS_BASE is enabled, which is how the old code worked
so no change.

This includes the next base matching call in the last base of the
insertion, so it's *either* cons_base *or* cons_ins.

However we instead have an easier to understand alternative which is
an easier to understand with cons_ins being the inserted bases and
cons_base always being used for the matching bases.  (This corrects
some defects for reads than end on an insertion with no flanking M,
although that is rare.)

Unfortunately, for reasons unknown, it is very slightly higher on
FP/GT. (Hence we still use the old method)
We have the most common consensus plus a sub-consensus for when
neighbouring heterozygous indels are present.  This is acting as an
alternative to using the old ref_sample method.
Also tweak ins vs N merge threshold.
- We previously started our consensus with 1 depth copy of ref_sample.
  This brings in data from the overall consensus for where one
  specific indel 'type' doesn't have data spanning the entire region.

  We now ensure a minimum of 1-fold depth is copied from the ref_base
  data instead.  It's still not quite as good results, but good enough
  and paves the way for removal of a lot of existing code.

- The bcf_call_glfgen code had a heuristic to change type calls to 0
  (REF) where the read doesn't have an indel at that location, under
  the assumption our assessment was bad.

  This may have been helpful in the past, but with better consensus
  calculation our type calls are now more robust.  Instead we change
  such data to be unclassified types instead so they don't count to AD
  and PL.

  This is still a fudge.  It may be preferable to leave them as-is,
  and/or filter by some other means such as whether they span STRs.

  We get a 7% rise in FP value, but a 2% drop in FN and 16% drop in GT
  assignment errors.

  TODO: study those new FPs in more detail.
Also a small correction to consensus rfract so it's a minimum rather
than additive amendment.
Ie cnum==1 is always opposite of cnum==0 for hets.
The INS_PLUS_BASE define is complicated code, and validation now shows
the differences are very marginal.  So disabled for now and will cull
soon so we only have the simpler variant left.
- local_band is computed only over left..right part of alignment
  instead of from the entire CIGAR string.

- Fix for p->indel == type matching.  This isn't sufficient to filter
  out when we're in the deletion being studied if it starts eg 1 base
  earlier, where p->indel is now zero, but p->is_del is true.

- Use band size computed from all observed indels and STRs, not "type".

- Replace bca->indel_win_size/2 for long reads with an indel/STR aware
  limit.

- Replace the tbeg/tend calculations with a indel/STR aware setting.
- Fixed (?) out by one error in candidate del caller:
  Was "i >= pos-left" and now "i > pos-left".  This was incorrectly
  not marking the last base in a het del as used when immediately
  followed by an insertion (or maybe it was vice versa).

- The same line as above decided to add to either cons_base vs
  ref_base using "biggest_del", but this fails when we have nested
  deletions present.  Eg -10 at pos 105 and -1 at pos 107.  We now
  track containing deletions and set a skip_to coordinate instead of
  relying on "biggest_del".

- Alter the fraction at which we label a heterozygous indel as
  heterozygous from .2 to .3.

(Shift10h7j in notes)
Also bug fix "len==type" check in deletion.  It should be "len==-type".
Also fixed a minor memory leak (approx 36 bytes per 1MB of genome).
Similarly for insertion.  These aren't exact duplicates, as it's
overall and per-sample, but close enough and the doubling up of work
was accidental.
If the sequence being inserted has differing base-calls to the
flanking sequence, then est_seqQ's analysis of homopolymer runs and
how they impact the score is irrelevant as the insertion isn't
ambiguous anyway.

This has a small reduction to FN on PacBio CCS with minimal change to
FP.
- We erroneously had the -m and -F pre-filter options applies as needing
  either to pass instead of both to pass in bcf_cgp_find_types.  Fixing
  this is a significant reduction to FP rates.

- The computation of which STRs spanned the indel being assessed was
  sometimes a little bit out, due to using "qpos" instead of "pos".
  This removes a few false negatives.

- Improve the calculation of indel cost ("iscore") by counting the
  number of additional repeat units beyond the end of the sequence.
  The old penalty for STRs at the end of reads has been removed,
  meaning we regain some FNs while still penalising (more) the cases
  that lead to FPs.

  Overall there is some small FP / FN adjustments, but the biggest
  improvement here is a significant drop in GT errors (~10% fewer).
@jkbonfield
Copy link
Contributor Author

Updated with some tweaks to improve things further (as mentioned above):

  • We erroneously had the -m and -F pre-filter options applies as needing either to pass instead of both to pass in bcf_cgp_find_types. Fixing this is a significant reduction to FP rates.

  • The computation of which STRs spanned the indel being assessed was sometimes a little bit out, due to using "qpos" instead of "pos". This removes a few false negatives.

  • Improve the calculation of indel cost ("iscore") by counting the number of additional repeat units beyond the end of the sequence. The old penalty for STRs at the end of reads has been removed, meaning we regain some FNs while still penalising (more) the cases that lead to FPs.

    Overall there is some small FP / FN adjustments, but the biggest improvement here is a significant drop in GT errors (~10% fewer).

image

image

image

PacBioCCS plot is being generated atm.

@jkbonfield
Copy link
Contributor Author

Some raw numbers, as percentages of the truth set. Columns are all (qual>0), QUAL>=30 and QUAL>=30 plus extra filters (IDV, IMF, DP).

15x dev
InDel TP   94.35 /   90.56 /   88.57
InDel FP    1.42 /    1.05 /    0.74
InDel GT    2.25 /    2.08 /    1.99
InDel FN    5.65 /    9.44 /   11.43

15x 9f10
InDel TP   93.95 /   91.10 /   89.13
InDel FP    0.95 /    0.63 /    0.52 -30%
InDel GT    2.45 /    2.29 /    2.21 +11%
InDel FN    6.05 /    8.90 /   10.87 -5%
30x dev
InDel TP   97.94 /   96.99 /   96.80
InDel FP    1.08 /    0.83 /    0.69
InDel GT    1.04 /    0.99 /    0.98
InDel FN    2.06 /    3.01 /    3.20

30x 9f10
InDel TP   98.01 /   97.31 /   97.13
InDel FP    0.87 /    0.69 /    0.61 -12%
InDel GT    0.85 /    0.80 /    0.79 -19%
InDel FN    1.99 /    2.69 /    2.87 -10%
60x dev
InDel TP   98.59 /   98.35 /   98.32
InDel FP    0.84 /    0.68 /    0.62
InDel GT    0.75 /    0.74 /    0.74
InDel FN    1.41 /    1.65 /    1.68

60x 9f10
InDel TP   98.72 /   98.57 /   98.54
InDel FP    0.76 /    0.63 /    0.58 -10%
InDel GT    0.43 /    0.42 /    0.41 -45%
InDel FN    1.28 /    1.43 /    1.46 -13%

The percentage change vs the develop branch is significant for FP and FN at all depths. GT assignment errors are up for shallow data, although in absolute numbers it's still fewer new GT errors than we've recovered in FN, so this is largely just an issue of low coverage true variants having a significant portion with the wrong genotype assignment. At higher depths the GT error really plummets.

We already had this elsewhere, but forgot this case.
It trips up on GIAB HG002.GRCh38.PacBio_CCS_15Kb.bam at chr1:112149167
which has a 13KB deletion.
@jkbonfield
Copy link
Contributor Author

jkbonfield commented Mar 17, 2022

PacBio CCS stats:

image

This broke at 112MB into chr1 as it found a 12KB deletion (now fixed). Rerunning, but the shape of the graph is unlikely to change. There are very high false positive and false negative rates on this data. We can use e.g. -h 200 to remove many of the low quality false positives, but it's generally best to simply filter harder instead. Most of the errors (both FN and FP) are homopolymer issues, which is simply the nature of the technology.

SynDip chr1 stats:

image

At 15x the recent additional commit has removed some true variants, so elevated FN rate, but it's the only change which is consistently better on FN vs FP tradeoff. Frankly all of them have terrible FN miss rates.

image

At 45x the new PR diverges and starts to become poorer than current develop due to higher FP rates. However this only occurs at QUAL >=40 or so and the FN rates are still so enormous as to not be practically usable. I think this is likely just the nature of the truth set, and I'm being overly pedantic in my accuracy assessment. If I recall there is a caveat in SynDip about not trusting small indels due to the nature of the underlying PacBio assembly and homopolymers, so it may simply be plain wrong on them.

@jkbonfield
Copy link
Contributor Author

Finally, assuming my revised pacbio test runs to completion, I think I'm done with this PR.

What I haven't tested is multi-sample calling, simply because I don't know of appropriate truth sets to use (maybe a trio though?) and we don't have a written best practice guide for it so I don't know the normal method of doing this. Do we just do it in one command, or produce gVCFs and merge? We really need to document this pipeline, but that's a job for someone who knows what they're doing to do. :-)

So I'll leave such validation up to you.

This fixes a read buffer-overrun, although I'm baffled how we didn't
hit this with the original code as it's never apparently known where
the reference ends.
jkbonfield added a commit to jkbonfield/bcftools that referenced this pull request Nov 22, 2023
This is a combination of PR samtools#1679 from 2022 coupled with more recent
changes to replace BAQ with edlib for evaluating the alignment of
reads against candidate alleles.  Due to the complexity of rebasing
many commits with many conflicts, these have been squashed together
for simplicity.  Please consult the original commit log in the PR for
more detailed history.

PR 1679 is a major restructuring of the indel caller that produces
proper consensus sequences.  TODO: paste collected commit messages
here.

The newer edlib changes are designed to dramtically increase the speed
of indel calling on long read data by use of a modern algorithm.  Note
edlib's alignments are very crude, with just +1 for all differences
including substitutions and each indel base.  However if we've
computed our candidate allele consensus sequences correctly then the
aligning against the real allele should be a minimal score regardless
of scoring regime, so the cost of an HMM or even affine weights do not
help much.

The lack of quality values is compensated for by detection of minimum
quality within STRs.
jkbonfield added a commit to jkbonfield/bcftools that referenced this pull request Dec 12, 2023
The main benefit of the new indel caller is the use of alignment
against diploid consensus generation instead of simply patching the
reference with candidate indel types.

This greatly reduces false positives and incorrect allele alignment
(leading to wrong genotype calls).  This was the earlier PR samtools#1679, but
has since acquired edlib as an alternative to BAQ for the indel
alignment.  However this is primarily a speed benefit (with some
minor removal of false-negatives due to quality smashing), and
isn't the main thing users should think about when choosing an indel
caller.
jkbonfield added a commit to jkbonfield/bcftools that referenced this pull request Jan 26, 2024
This is a combination of PR samtools#1679 from 2022 coupled with more recent
changes to replace BAQ with edlib for evaluating the alignment of
reads against candidate alleles.  Due to the complexity of rebasing
many commits with many conflicts, these have been squashed together
for simplicity.  Please consult the original commit log in the PR for
more detailed history.

PR 1679 is a major restructuring of the indel caller that produces
proper consensus sequences.  TODO: paste collected commit messages
here.

The newer edlib changes are designed to dramtically increase the speed
of indel calling on long read data by use of a modern algorithm.  Note
edlib's alignments are very crude, with just +1 for all differences
including substitutions and each indel base.  However if we've
computed our candidate allele consensus sequences correctly then the
aligning against the real allele should be a minimal score regardless
of scoring regime, so the cost of an HMM or even affine weights do not
help much.

The lack of quality values is compensated for by detection of minimum
quality within STRs.
jkbonfield added a commit to jkbonfield/bcftools that referenced this pull request Jan 26, 2024
The main benefit of the new indel caller is the use of alignment
against diploid consensus generation instead of simply patching the
reference with candidate indel types.

This greatly reduces false positives and incorrect allele alignment
(leading to wrong genotype calls).  This was the earlier PR samtools#1679, but
has since acquired edlib as an alternative to BAQ for the indel
alignment.  However this is primarily a speed benefit (with some
minor removal of false-negatives due to quality smashing), and
isn't the main thing users should think about when choosing an indel
caller.
jkbonfield added a commit to jkbonfield/bcftools that referenced this pull request Feb 9, 2024
This is a combination of PR samtools#1679 from 2022 coupled with more recent
changes to replace BAQ with edlib for evaluating the alignment of
reads against candidate alleles.  Due to the complexity of rebasing
many commits with many conflicts, these have been squashed together
for simplicity.  Please consult the original commit log in the PR for
more detailed history.

PR 1679 is a major restructuring of the indel caller that produces
proper consensus sequences.  TODO: paste collected commit messages
here.

The newer edlib changes are designed to dramtically increase the speed
of indel calling on long read data by use of a modern algorithm.  Note
edlib's alignments are very crude, with just +1 for all differences
including substitutions and each indel base.  However if we've
computed our candidate allele consensus sequences correctly then the
aligning against the real allele should be a minimal score regardless
of scoring regime, so the cost of an HMM or even affine weights do not
help much.

The lack of quality values is compensated for by detection of minimum
quality within STRs.
jkbonfield added a commit to jkbonfield/bcftools that referenced this pull request Feb 9, 2024
The main benefit of the new indel caller is the use of alignment
against diploid consensus generation instead of simply patching the
reference with candidate indel types.

This greatly reduces false positives and incorrect allele alignment
(leading to wrong genotype calls).  This was the earlier PR samtools#1679, but
has since acquired edlib as an alternative to BAQ for the indel
alignment.  However this is primarily a speed benefit (with some
minor removal of false-negatives due to quality smashing), and
isn't the main thing users should think about when choosing an indel
caller.

Also tidied up the usage statement and added an explicit "-X list" to
print the profile parameters.

- Add extra debugging defines.

GLF_DEBUG reports GLF calculation in bam2bcf.c.

ALIGN_DEBUG uses edlib EDLIB_TASK_PATH to report sequence alignment.
NB to use this you need to link against edlib itself rather than the
cutdown version in this repository.

Also fix the edlib heuristics used in bcf_call_glfgen.  We don't want
to change the call (b=5) as this affects AD.  Instead we change the
quality so poor calls get filtered by QUAL rather than simply being
removed.

- Tweak edlib tuning for SeqQ/qual.
Add quality value assessment into soft-clip recovery.
Use /500 instead of /111 in indelQ assignment, and skew indel-bias
accordingly.  This gives better separation of FP/GT/FN generally.

- Added --seqq-offset parameter so we can use it in tunables per
profile.  This is used as a limit on the seqQ reduction in the
"VAL-5*MIN(20,depth)" formula, used for favouring data over seqQ
scores when depth is sufficient.  Experimentation showed no single
value that worked for all platforms, but the default is in the
middle.

- Tidy up to cull ifdefed and commented out code.

- Add test for indels-cns.

It's minimal, but the whole indel calling has minimal testing.  I
think we have under 10 indels in total with develop (all short read
and mostly duplications of each other), and no testing of indels-2.0.

This tests 4 indels with indels-cns.

- Added documentation for the new --indels-2.0 options

- Cull more unused parts of edlib.c.

This avoids clang warnings (which become errors with -Werror).
We're only including the bits we need here for mpileup.  If you want
the whole thing, link against the upstream source library instead.
jkbonfield added a commit that referenced this pull request Feb 9, 2024
This is a combination of PR #1679 from 2022 coupled with more recent
changes to replace BAQ with edlib for evaluating the alignment of
reads against candidate alleles.  Due to the complexity of rebasing
many commits with many conflicts, these have been squashed together
for simplicity.  Please consult the original commit log in the PR for
more detailed history.

PR 1679 is a major restructuring of the indel caller that produces
proper consensus sequences.  TODO: paste collected commit messages
here.

The newer edlib changes are designed to dramtically increase the speed
of indel calling on long read data by use of a modern algorithm.  Note
edlib's alignments are very crude, with just +1 for all differences
including substitutions and each indel base.  However if we've
computed our candidate allele consensus sequences correctly then the
aligning against the real allele should be a minimal score regardless
of scoring regime, so the cost of an HMM or even affine weights do not
help much.

The lack of quality values is compensated for by detection of minimum
quality within STRs.
jkbonfield added a commit that referenced this pull request Feb 9, 2024
The main benefit of the new indel caller is the use of alignment
against diploid consensus generation instead of simply patching the
reference with candidate indel types.

This greatly reduces false positives and incorrect allele alignment
(leading to wrong genotype calls).  This was the earlier PR #1679, but
has since acquired edlib as an alternative to BAQ for the indel
alignment.  However this is primarily a speed benefit (with some
minor removal of false-negatives due to quality smashing), and
isn't the main thing users should think about when choosing an indel
caller.

Also tidied up the usage statement and added an explicit "-X list" to
print the profile parameters.

- Add extra debugging defines.

GLF_DEBUG reports GLF calculation in bam2bcf.c.

ALIGN_DEBUG uses edlib EDLIB_TASK_PATH to report sequence alignment.
NB to use this you need to link against edlib itself rather than the
cutdown version in this repository.

Also fix the edlib heuristics used in bcf_call_glfgen.  We don't want
to change the call (b=5) as this affects AD.  Instead we change the
quality so poor calls get filtered by QUAL rather than simply being
removed.

- Tweak edlib tuning for SeqQ/qual.
Add quality value assessment into soft-clip recovery.
Use /500 instead of /111 in indelQ assignment, and skew indel-bias
accordingly.  This gives better separation of FP/GT/FN generally.

- Added --seqq-offset parameter so we can use it in tunables per
profile.  This is used as a limit on the seqQ reduction in the
"VAL-5*MIN(20,depth)" formula, used for favouring data over seqQ
scores when depth is sufficient.  Experimentation showed no single
value that worked for all platforms, but the default is in the
middle.

- Tidy up to cull ifdefed and commented out code.

- Add test for indels-cns.

It's minimal, but the whole indel calling has minimal testing.  I
think we have under 10 indels in total with develop (all short read
and mostly duplications of each other), and no testing of indels-2.0.

This tests 4 indels with indels-cns.

- Added documentation for the new --indels-2.0 options

- Cull more unused parts of edlib.c.

This avoids clang warnings (which become errors with -Werror).
We're only including the bits we need here for mpileup.  If you want
the whole thing, link against the upstream source library instead.
@jkbonfield
Copy link
Contributor Author

Closing as this was superceded by #2045 which has was merged in as --indels-cns.

@jkbonfield jkbonfield closed this Sep 17, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant