Error rate vs. genotyping error rate #232

Open · mbhall88 opened this issue Sep 21, 2020 · 11 comments

@mbhall88 (Member)

I have a slight concern/misunderstanding regarding error rate.

We have two separate error rates in pandora.

  1. -e,--error_rate which defaults to 0.11 for nanopore and 0.001 for Illumina.
  2. --genotyping-error-rate (which was hidden prior to my CLI PR) which is set to 0.01.

The genotyping error rate is used for computing likelihood in the VCF whereas the first error rate is used for alignment-related tasks, estimating parameters for the kmer graph model, and de novo discovery.

Why are these error rates different? From Rachel's thesis (p65, 5.2.2 Genotyping):

We model the mean k-mer coverage on the true allele as Poisson distributed with rate equal to the expected depth of coverage d. To account for noise, we assume that coverage on alternative alleles arises as a result of an error process with rate e.

Shouldn't this error rate also be a function of the sequencing technology used?
E.g., from p45, 4.3.2 Quasi-mapping to the Index (Setting the minimum cluster size):

To account for sequencing errors, we estimate that if errors occur independently as a Poisson distribution with rate e along a sequence, the probability that a k-mer has been sequenced with no errors is exp(−ke).
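
To make the technology dependence concrete, here is a minimal Python sketch of that calculation, using the two `-e` defaults mentioned above and a hypothetical k-mer size of 15 (pandora's actual k is not stated in this thread):

```python
import math

k = 15  # hypothetical k-mer size; substitute whatever pandora actually uses

for e in (0.11, 0.001):  # the nanopore and Illumina defaults for -e
    p_error_free = math.exp(-k * e)  # P(k-mer sequenced with no errors)
    print(f"e = {e}: P(error-free {k}-mer) = {p_error_free:.3f}")

# e = 0.11  -> ~0.192 (nanopore)
# e = 0.001 -> ~0.985 (Illumina)
```

Under the nanopore default, roughly four out of five k-mers carry at least one error, which is presumably why the rate matters so much for the cluster-size threshold; it is hard to see why genotyping should be insensitive to the same difference.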

@leoisl (Collaborator) commented Sep 22, 2020

I remember being confused by these two different error rates when refactoring some code, but I did not investigate as deeply as you did, as I was focused on ensuring the refactoring would produce the same results.

To complement this, here is how we compute the log-likelihood:

[image: log-likelihood formula]
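
Since the formula image may not survive here, a hedged reconstruction from the descriptions in this thread (Poisson-distributed k-mer coverage on the true allele with rate $d$, plus an error term with rate $e$ weighting the total mean coverage $c_b$ on all other alleles) would look something like:

$$\ln \mathcal{L} \;=\; -d + c \ln d - \ln(c!) + c_b \ln e$$

where $c$ is the mean k-mer coverage on the allele being scored. This is a sketch of the form under discussion, not necessarily the exact expression in the code.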

As Michael previously stated, this e should be the error rate, which is currently always fixed at 0.01. From my understanding, this is a bug, and it is easy to fix.

If we assume the ONT error rate is around 0.11, ln(0.11) ≈ -2.21. But we have been using the fixed ln(0.01) ≈ -4.61. So we have been giving more weight than we should to c_b (the total mean coverage of all other alleles), effectively pushing likelihoods down. But this has a larger effect on the likelihood of lowly covered alleles than on highly covered alleles (which are possibly the ones chosen), so I think for ONT data this leads to overestimated genotype confidences. I think for Illumina the effect is the opposite: we are underestimating genotype confidences. I could be wrong on these...
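
A quick numeric sketch of that effect, assuming the likelihood really does contain a c_b · ln(e) term as described above:

```python
import math

# Per-unit contribution of off-allele coverage (c_b) to the log-likelihood,
# under the hard-coded genotyping error rate vs. the nanopore -e default.
ln_e_fixed = math.log(0.01)  # ≈ -4.61
ln_e_ont = math.log(0.11)    # ≈ -2.21

extra_penalty = ln_e_fixed - ln_e_ont  # ≈ -2.40 per unit of c_b
for c_b in (1, 5, 20):  # illustrative off-allele coverages
    print(f"c_b = {c_b}: log-likelihood shifted by {c_b * extra_penalty:.1f}")
```

The penalty scales linearly with c_b, which is consistent with the asymmetry described above between lowly and highly covered alleles.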

Anyway, it is hard to predict the impact of this fix on the paper results. My blind guess is that it won't change much, but it really is just a blind guess. It is indeed easy to fix, but it would require us to reproduce the results for the paper once again. The fix itself should take no more than 10 minutes (I guess), and setting everything running should take 5 minutes, but then we have to wait for the results to be produced.

Maybe we should confirm this is indeed a bug first...

@iqbal-lab (Collaborator)

We definitely shouldn't touch this lightly. This is exactly where the difference between our mental models and reality is most clear. Definitely do not fix this, Leandro, until you have weeks available for debugging. I propose we add this (Leandro's bug) to the list of things to return to.
Returning to Michael's issue:
I'd like to understand the intention, the current implementation, and how it has been evaluated. But I want to be free of the current super-heavyweight pipelines first and evaluate on something small and controlled post-paper. I've chatted a bit with Rachel this week; maybe we can have a brief 4-way call next week.

@mbhall88 (Member, Author)

There is always the possibility of evaluating this with my TB dataset instead? Much faster turnaround time, and I am doing the evaluation anyway...

@rmcolq (Collaborator) commented Sep 24, 2020

Interesting. I think this must be a bug. I made the commit 062e4eb#diff-23632e736a3562808e4b2d5bf5dc59c7 to add the genotyping_error_rate parameter; it was just a constant number before this. However, I did not carry it through any further. I have no idea why the error rate there should be different from the error rate used for alignment things (although I would hesitate to say that they should be the same without more brain power to think it through).

@mbhall88 (Member, Author)

I have no idea why the error rate there should be different from the error rate used for alignment things (although I would hesitate to say that they should be the same without more brain power to think it through)

Yeah, speaking with Zam we came to this conclusion too. Requires some careful thinking...

@mbhall88 (Member, Author)

I have been playing around with these two error rates today.

The data I am parameter sweeping with is Nanopore Mtb data. They have "truth" assemblies from PacBio CCS and the evaluation is done with varifier.

The x-axis labels indicate the filters/parameters used. The COMPASS (left-most) box shows SNP calls from the samples' Illumina data using the pipeline of that name, and the bcftools box shows SNP calls made from the nanopore data with that tool.

Filters/parameters key:

  • e = error rate
  • E = genotype error rate
  • g = minimum genotype confidence
  • G = maximum gaps fraction
  • d = minimum k-mer coverage
  • s = minimum strand bias percentage
  • I = maximum indel length

For a more comprehensive examination of a filter sweep, see here.

[plot: Recall - SNPs only]

[plot: Precision - SNPs only]

[plot: Recall - all variants]

[plot: Precision - all variants]


I've been staring at these sorts of plots all day, so it would be nice to have some fresh eyes to help interpret them.

@iqbal-lab (Collaborator)

what are the default values for e and E?

@iqbal-lab (Collaborator)

Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?

@iqbal-lab (Collaborator)

Just deleted a comment.
So: we can achieve the same precision as COMPASS for SNPs if we lose 15% recall (some of this will be due to low-coverage samples and to not having GCP implemented).
Instead of SNPs/all, could we see SNPs/indels?

@mbhall88 (Member, Author)

what are the default values for e and E?

e = 0.11 (nanopore) and E = 0.01 (always)

I'll put together those other plots today

@mbhall88 (Member, Author) commented Nov 25, 2020

Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?

Here are the plots for the samples with coverage over 40x (5/7 samples). Note: there are a few extra filters in there now, as I was concurrently working on mbhall88/head_to_head_pipeline#48

[plot: Recall - SNPs only]

[plot: Precision - SNPs only]

[plot: Recall - all variants]

[plot: Precision - all variants]


Instead of SNPs/all, could we see SNPs/indels?

This is a lot harder and will take a bit longer. I will have to create indel-only VCFs and then run varifier on them.
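
For what it's worth, a minimal sketch of the indel-only VCF step, assuming pysam is available and that a REF/ALT length difference is an adequate indel criterion (file names are hypothetical):

```python
import pysam

vcf_in = pysam.VariantFile("pandora.vcf")  # hypothetical input path
vcf_out = pysam.VariantFile("pandora.indels.vcf", "w", header=vcf_in.header)

for rec in vcf_in:
    # keep records where any ALT allele differs in length from REF
    # (naive criterion: drops SNPs/MNPs; symbolic ALTs would need care)
    if rec.alts and any(len(alt) != len(rec.ref) for alt in rec.alts):
        vcf_out.write(rec)

vcf_out.close()
vcf_in.close()
```

The varifier run would then just point at the indel-only files instead of the full call sets.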
