-
Notifications
You must be signed in to change notification settings - Fork 14
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
How to deal with ambiguous bases in reads #294
Comments
Cortex effectively cuts a read at Ns. So here it would basically trim the N off the read, but still work on the rest. Pandora ought to be able to do this really |
One option is the one Zam mentioned: we simply trim the
We could translate it into the 8 possible other minimisers, and map into the graph, generating one true hit and 7 other possible FP hits (FP only if they hit a PRG, otherwise no hit at all). With very high likelihood only the true minimizer will be kept after filtering the hits, as I think the spurious hits will be by random chance and very likely isolated hits. I propose we limit the number of So I think we have two options we could choose from, or maybe a third if you propose. Implementation-wise, I think both options are easy to implement. The second one has this translation step, but it is not hard to implement. We actually do it in |
I don't think the translate into all possible bases is a good idea, as this might confuse counts on "fake" alternate alleles. And if you have a file like this one where nearly all reads have N's you might get a weird scenario where you get an equal number of hits across all bases? Maybe the minimap2 approach is best? We just skip minimizers with N in them. Rather than skipping the whole read? |
True, agree with skipping minimizers with |
yes, skip minimizers with Ns. |
Closed via #295 |
I have a fastq that gets zero coverage on all PRGs - which is very weird as there are lots of reads.
Looking at the log, pretty much every read has this warning
Looking at the fastq, indeed, nearly every read has at least one
N
in it.Is there a way we can deal with these?
For context, I came across this while testing
drprg
. mykrobe and tbprofiler make the correct predictions for this sample, but pandora doesn't even produce a genotyped VCF as it filters out all PRGs as they all fail filtering.How does cortex deal with this sort of thing @iqbal-lab ?
The fastq is at
/hps/nobackup/iqbal/mbhall/drprg/paper/results/filtered/illumina/PRJNA268900/SAMN03246262/SRR1723506/SRR1723506.filtered.fq.gz
The text was updated successfully, but these errors were encountered: