Error rate vs. genotyping error rate #232

Open · mbhall88 opened this issue Sep 21, 2020 · 11 comments

@mbhall88 (Member)

I have a slight concern/misunderstanding regarding error rate.

We have two separate error rates in pandora.

  1. -e,--error_rate which defaults to 0.11 for nanopore and 0.001 for Illumina.
  2. --genotyping-error-rate (which was hidden prior to my CLI PR) which is set to 0.01.

The genotyping error rate is used for computing likelihood in the VCF whereas the first error rate is used for alignment-related tasks, estimating parameters for the kmer graph model, and de novo discovery.

Why are these error rates different? From Rachel's thesis (p65, 5.2.2 Genotyping):

We model the mean k-mer coverage on the true allele as Poisson distributed with rate equal to the expected depth of coverage d. To account for noise, we assume that coverage on alternative alleles arises as a result of an error process with rate e.

Shouldn't this error rate also be a function of the sequencing technology used?
E.g., from p45, 4.3.2 Quasi-mapping to the Index (Setting the minimum cluster size):

To account for sequencing errors, we estimate that if errors occur independently as a Poisson distribution with rate e along a sequence, the probability that a k-mer has been sequenced with no errors is exp(−ke).
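
To make the technology dependence concrete, here is a minimal Python sketch of that calculation, using the two `-e` defaults mentioned above and a hypothetical k-mer size of 15 (pandora's actual k is not stated in this thread):

```python
import math

k = 15  # hypothetical k-mer size; substitute whatever pandora actually uses

for e in (0.11, 0.001):  # the nanopore and Illumina defaults for -e
    p_error_free = math.exp(-k * e)  # P(k-mer sequenced with no errors)
    print(f"e = {e}: P(error-free {k}-mer) = {p_error_free:.3f}")

# e = 0.11  -> ~0.192 (nanopore)
# e = 0.001 -> ~0.985 (Illumina)
```

Under the nanopore default, roughly four out of five k-mers carry at least one error, which is presumably why the rate matters so much for the cluster-size threshold; it is hard to see why genotyping should be insensitive to the same difference.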

@leoisl (Collaborator) commented Sep 22, 2020

I remember being confused by these two different error rates when refactoring some code, but I did not investigate as deeply as you did, as I was focused on ensuring the refactoring would produce the same results.

To complement this, here is how we compute the log-likelihood:

[image: log-likelihood formula]
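
Since the formula image may not survive here, a hedged reconstruction from the descriptions in this thread (Poisson-distributed k-mer coverage on the true allele with rate $d$, plus an error term with rate $e$ weighting the total mean coverage $c_b$ on all other alleles) would look something like:

$$\ln \mathcal{L} \;=\; -d + c \ln d - \ln(c!) + c_b \ln e$$

where $c$ is the mean k-mer coverage on the allele being scored. This is a sketch of the form under discussion, not necessarily the exact expression in the code.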

As Michael previously stated, this e should be the error rate, which is currently always fixed at 0.01. From my understanding, this is a bug, and it is easy to fix.

If we assume the ONT error rate is around 0.11, ln(0.11) ≈ -2.21. But we have been using the fixed ln(0.01) ≈ -4.61. So we have been giving more weight than we should to c_b (the total mean coverage of all other alleles), effectively pushing likelihoods down. But this has a larger effect on the likelihood of lowly covered alleles than on highly covered alleles (which are possibly the ones chosen), so I think for ONT data this leads to overestimated genotype confidences. I think for Illumina the effect is the opposite: we are underestimating genotype confidences. I could be wrong on these...
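
A quick numeric sketch of that effect, assuming the likelihood really does contain a c_b · ln(e) term as described above:

```python
import math

# Per-unit contribution of off-allele coverage (c_b) to the log-likelihood,
# under the hard-coded genotyping error rate vs. the nanopore -e default.
ln_e_fixed = math.log(0.01)  # ≈ -4.61
ln_e_ont = math.log(0.11)    # ≈ -2.21

extra_penalty = ln_e_fixed - ln_e_ont  # ≈ -2.40 per unit of c_b
for c_b in (1, 5, 20):  # illustrative off-allele coverages
    print(f"c_b = {c_b}: log-likelihood shifted by {c_b * extra_penalty:.1f}")
```

The penalty scales linearly with c_b, which is consistent with the asymmetry described above between lowly and highly covered alleles.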

Anyway, it is hard to predict the impact of this fix on the paper results. My blind guess is that it won't change much, but it really is just a blind guess. It is indeed easy to fix, but it would require us to reproduce the results for the paper once again. The fix itself should take no more than 10 minutes (I guess), and setting everything running should take 5 minutes, but then we have to wait for the results to be produced.

Maybe we should confirm this is indeed a bug first...

@iqbal-lab (Collaborator)

We definitely shouldn't touch this lightly. This is exactly where the difference between our mental models and reality is most clear. Definitely do not fix this, Leandro, until you have weeks available for debugging. I propose we add this (Leandro's bug) to the list of things to return to.
Returning to Michael's issue:
I'd like to understand the intention, the current implementation, and how it has been evaluated. But I want to be free of the current super-heavyweight pipelines first and evaluate on something small and controlled post-paper. I've chatted a bit with Rachel this week; maybe we can have a brief 4-way call next week.

@mbhall88 (Member, Author)

There is always the possibility of evaluating this with my TB dataset instead? Much faster turnaround time, and I am doing the evaluation anyway...

@rmcolq (Collaborator) commented Sep 24, 2020

Interesting. I think this must be a bug. I made the commit 062e4eb#diff-23632e736a3562808e4b2d5bf5dc59c7 to add the genotyping_error_rate parameter; it was just a constant number before this. However, I did not carry it through any further. I have no idea why the error rate there should be different from the error rate used for alignment things (although I would hesitate to say that they should be the same without more brain power to think it through).

@mbhall88 (Member, Author)

I have no idea why the error rate there should be different from the error rate used for alignment things (although I would hesitate to say that they should be the same without more brain power to think it through)

Yeah, speaking with Zam we came to this conclusion too. Requires some careful thinking...

@mbhall88 (Member, Author)

I have been playing around with these two error rates today.

The data I am parameter sweeping with is Nanopore Mtb data. They have "truth" assemblies from PacBio CCS and the evaluation is done with varifier.

The x-axis labels indicate the filters/parameters used. The COMPASS (left-most) box shows SNP calls from the samples' Illumina data using the pipeline of that name, and the bcftools box shows SNP calls made from the nanopore data with that tool.

Filters/parameters key:

  • e = error rate
  • E = genotype error rate
  • g = minimum genotype confidence
  • G = maximum gaps fraction
  • d = minimum k-mer coverage
  • s = minimum strand bias percentage
  • I = maximum indel length

For a more comprehensive examination of a filter sweep, see here.

[plot: Recall - SNPs only]

[plot: Precision - SNPs only]

[plot: Recall - all variants]

[plot: Precision - all variants]


I've been staring at these sorts of plots all day, so it would be nice to have some fresh eyes to help interpret them.

@iqbal-lab (Collaborator)

what are the default values for e and E?

@iqbal-lab (Collaborator)

Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?

@iqbal-lab (Collaborator)

Just deleted a comment.
So: we can achieve the same precision as COMPASS for SNPs if we lose 15% recall (some of this will be due to low-coverage samples and to not having GCP implemented).
Instead of SNPs/all, could we see SNPs/indels?

@mbhall88 (Member, Author)

what are the default values for e and E?

e = 0.11 (nanopore) and E = 0.01 (always)

I'll put together those other plots today

@mbhall88 (Member, Author) commented Nov 25, 2020

Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?

Here are the plots for the samples with coverage over 40x (5/7 samples). Note: there are a few extra filters in there now, as I was concurrently working on mbhall88/head_to_head_pipeline#48

[plot: Recall - SNPs only]

[plot: Precision - SNPs only]

[plot: Recall - all variants]

[plot: Precision - all variants]


Instead of SNPs/all, could we see SNPs/indels?

This is a lot harder and will take a bit longer. I will have to create indel-only VCFs and then run varifier on them.
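
For what it's worth, a minimal sketch of the indel-only VCF step, assuming pysam is available and that a REF/ALT length difference is an adequate indel criterion (file names are hypothetical):

```python
import pysam

vcf_in = pysam.VariantFile("pandora.vcf")  # hypothetical input path
vcf_out = pysam.VariantFile("pandora.indels.vcf", "w", header=vcf_in.header)

for rec in vcf_in:
    # keep records where any ALT allele differs in length from REF
    # (naive criterion: drops SNPs/MNPs; symbolic ALTs would need care)
    if rec.alts and any(len(alt) != len(rec.ref) for alt in rec.alts):
        vcf_out.write(rec)

vcf_out.close()
vcf_in.close()
```

The varifier run would then just point at the indel-only files instead of the full call sets.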
