Error rate vs. genotyping error rate #232
Comments
I remember being confused by these two different error rates when refactoring, but I did not investigate as deeply as you did, because I was focused on making sure the refactoring produced the same results. To complement your findings, this is how we compute the log likelihood: As Michael previously stated, this e should be the error rate, which is currently always fixed at 0.01. From my understanding, this is a bug, and it is easy to fix. If we assume ONT error rate is around 0.11, Anyway, it is hard to predict the impact of this fix on the paper results. My blind guess is that it won't change much, but it really is just a blind guess. The fix itself is easy, but it would require us to reproduce the paper results once again. The change should take no more than 10 minutes (I guess), and launching the runs about 5 minutes, but then we have to wait for the results to be produced. Maybe we should first confirm this is indeed a bug...
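To get a feel for why the value of e matters, here is a toy sketch. It is not pandora's likelihood model: it assumes a simple binomial-style model in which every read either supports the called allele or is an error with probability e. The function names and the read counts are hypothetical and only serve the illustration.

```python
import math


def toy_log_likelihood(supporting_reads: int, error_reads: int, e: float) -> float:
    """Toy binomial-style log likelihood (constant binomial coefficient dropped).

    ASSUMPTION: this is NOT pandora's model, just an illustration of how an
    error rate e enters a genotype log likelihood.
    """
    if not 0.0 < e < 1.0:
        raise ValueError("e must be strictly between 0 and 1")
    return supporting_reads * math.log(1.0 - e) + error_reads * math.log(e)


def toy_confidence(ref_reads: int, alt_reads: int, e: float) -> float:
    """Log likelihood difference between calling ref and calling alt."""
    ll_ref = toy_log_likelihood(ref_reads, alt_reads, e)  # alt reads are errors
    ll_alt = toy_log_likelihood(alt_reads, ref_reads, e)  # ref reads are errors
    return ll_ref - ll_alt


# Hypothetical site: 30 reads support ref, 5 support alt.
for rate in (0.01, 0.11):
    print(rate, round(toy_confidence(30, 5, rate), 2))
```

In this toy model the same read counts give a noticeably smaller confidence at e = 0.11 than at e = 0.01, which is why any confidence-based filter could behave differently after the fix. Whether pandora's real model reacts the same way is exactly what would need testing.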
We definitely shouldn't touch this lightly. This is exactly where the difference between our mental models and reality is most clear. Leandro, definitely do not fix this until you have weeks available for debugging. I propose we add this (Leandro's bug) to the list of things to return to.
There is always the possibility of evaluating this with my TB dataset instead? Much faster turnaround time, and I am doing the evaluation anyway...
Interesting. I think this must be a bug. I made the commit here 062e4eb#diff-23632e736a3562808e4b2d5bf5dc59c7 to add the
Yeah, speaking with Zam we came to this conclusion too. Requires some careful thinking...
I have been playing around with these two error rates today. The data I am parameter sweeping with is Nanopore Mtb data. The samples have "truth" assemblies from PacBio CCS and the evaluation is done with
The x-axis labels indicate the filters/parameters used. The COMPASS (left-most) box shows SNP calls from the sample's Illumina data, made with the pipeline of the same name, and the bcftools box shows SNP calls made with that tool on the nanopore data. Filters/parameters key:
For a more comprehensive examination of a filter sweep, see here.
Plots: Recall - SNPs only; Precision - SNPs only; Recall - all variants; Precision - all variants
I've been staring at these sorts of plots all day, so it would be nice to have some fresh eyes to interpret them.
What are the default values for e and E?
Is it possible to replot these only for samples with >40x depth, so we can disentangle coverage?
Just deleted a comment.
e = 0.11 (nanopore) and E = 0.01 (always). I'll put together those other plots today.
Here are the plots with samples that have coverage over 40x (which is 5/7). Note: there are a few extra filters in there now as I was concurrently working on mbhall88/head_to_head_pipeline#48
Plots: Recall - SNPs only; Precision - SNPs only; Recall - all variants; Precision - all variants
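For anyone wanting to redo this kind of restriction on their own results, a minimal sketch of the coverage filter plus the precision and recall calculation follows. The `SampleResult` record, its field names, and the example counts are all hypothetical; they are not the format produced by the evaluation pipeline used for these plots.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SampleResult:
    # Hypothetical per-sample summary; field names are illustrative only.
    name: str
    depth: float
    true_positives: int
    false_positives: int
    false_negatives: int


def precision(r: SampleResult) -> float:
    """Fraction of called variants that are correct (0.0 if nothing was called)."""
    called = r.true_positives + r.false_positives
    return r.true_positives / called if called else 0.0


def recall(r: SampleResult) -> float:
    """Fraction of truth variants that were called (0.0 if the truth set is empty)."""
    truth = r.true_positives + r.false_negatives
    return r.true_positives / truth if truth else 0.0


def filter_by_depth(results: List[SampleResult], min_depth: float = 40.0) -> List[SampleResult]:
    """Keep only samples whose depth is strictly above min_depth."""
    return [r for r in results if r.depth > min_depth]


# Made-up numbers, purely to show the calls.
samples = [
    SampleResult("sample_a", 35.0, 90, 10, 10),
    SampleResult("sample_b", 60.0, 80, 20, 0),
]
for r in filter_by_depth(samples):
    print(r.name, precision(r), recall(r))
```

Filtering before computing the metrics keeps low-coverage samples from dominating the spread of each box, which is the point of the >40x request.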
This is a lot harder and will take a bit longer. I will have to create indel-only VCFs and then run varifier on them.
I have a slight concern/misunderstanding regarding error rate.
We have two separate error rates in pandora:
1. -e,--error_rate, which defaults to 0.11 for nanopore and 0.001 for Illumina.
2. --genotyping-error-rate (which was hidden prior to my CLI PR), which is set to 0.01.
The genotyping error rate is used for computing likelihood in the VCF, whereas the first error rate is used for alignment-related tasks, estimating parameters for the kmer graph model, and de novo discovery.
Why are these error rates different? From Rachel's thesis (p65 - 5.2.2 Genotyping)
Shouldn't this error rate also be a function of the sequencing technology used?
I.e. from p45 4.3.2 Quasi-mapping to the Index (Setting the minimum cluster size)
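To make the question concrete, here is a small sketch of the two behaviours. The default values are the ones quoted above; the function names and the dictionary are hypothetical and do not mirror pandora's code. The second function is only one possible reading of "a function of the sequencing technology", namely reusing the technology's error rate for genotyping.

```python
from typing import Tuple

# Defaults quoted in this issue.
DEFAULT_ERROR_RATE = {"nanopore": 0.11, "illumina": 0.001}
FIXED_GENOTYPING_ERROR_RATE = 0.01


def current_rates(technology: str) -> Tuple[float, float]:
    """(error_rate, genotyping_error_rate) as described in this issue."""
    return DEFAULT_ERROR_RATE[technology], FIXED_GENOTYPING_ERROR_RATE


def technology_aware_rates(technology: str) -> Tuple[float, float]:
    """ASSUMPTION: one possible alternative, where genotyping reuses the
    technology-specific error rate instead of a fixed 0.01."""
    e = DEFAULT_ERROR_RATE[technology]
    return e, e


for tech in ("nanopore", "illumina"):
    print(tech, current_rates(tech), technology_aware_rates(tech))
```

Under the current defaults the genotyping error rate is lower than the nanopore error rate by a factor of 11 and higher than the Illumina one by a factor of 10, which is the mismatch this issue is asking about.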