How to deal with ambiguous bases in reads #294

mbhall88 · 2022-09-06T06:16:38Z

I have a fastq that gets zero coverage on all PRGs - which is very weird as there are lots of reads.

Looking at the log, pretty much every read has this warning

bad letter - found a non AGCT base in read so skipping read SRR1723506.87

Looking at the fastq, indeed, nearly every read has at least one N in it.

Is there a way we can deal with these?

For context, I came across this while testing drprg. mykrobe and tbprofiler make the correct predictions for this sample, but pandora doesn't even produce a genotyped VCF as it filters out all PRGs as they all fail filtering.

How does cortex deal with this sort of thing @iqbal-lab ?

The fastq is at /hps/nobackup/iqbal/mbhall/drprg/paper/results/filtered/illumina/PRJNA268900/SAMN03246262/SRR1723506/SRR1723506.filtered.fq.gz

The text was updated successfully, but these errors were encountered:

iqbal-lab · 2022-09-06T06:43:48Z

Cortex effectively cuts a read at Ns. So here it would basically trim the N off the read, but still work on the rest. Pandora ought to be able to do this really

leoisl · 2022-09-06T08:43:10Z

One option is the one Zam mentioned: we simply trim the N off. It is likely the easiest solution to implement: we always process all reads, but we skip minimisers containing N. Another option is to translate the minimisers until a certain number of Ns, let's say 2. If the minimiser looks like this:

ACCCGNGTTTCNA

We could translate it into the 8 possible other minimisers, and map into the graph, generating one true hit and 7 other possible FP hits (FP only if they hit a PRG, otherwise no hit at all). With very high likelihood only the true minimizer will be kept after filtering the hits, as I think the spurious hits will be by random chance and very likely isolated hits. I propose we limit the number of Ns we translate to a small amount (e.g. 2 or 3) because a minimiser with lots of Ns, e.g.: NNNNNAANNNNNN does not bring any information really, just noise, and we can safely discard it, and it also improves performance.

So I think we have two options we could choose from, or maybe a third if you propose. Implementation-wise, I think both options are easy to implement. The second one has this translation step, but it is not hard to implement. We actually do it in make_prg already, but we translate only RYKMSW.

mbhall88 · 2022-09-06T22:56:52Z

I don't think the translate into all possible bases is a good idea, as this might confuse counts on "fake" alternate alleles. And if you have a file like this one where nearly all reads have N's you might get a weird scenario where you get an equal number of hits across all bases?

Maybe the minimap2 approach is best? We just skip minimizers with N in them. Rather than skipping the whole read?

leoisl · 2022-09-07T09:35:32Z

True, agree with skipping minimizers with N in them as the best strategy!

iqbal-lab · 2022-09-07T09:43:31Z

yes, skip minimizers with Ns.

leoisl · 2022-09-27T14:09:59Z

Closed via #295

mbhall88 mentioned this issue Sep 6, 2022

Run drprg mbhall88/drprg#12

Closed

mbhall88 mentioned this issue Sep 12, 2022

Handle ambiguous bases #295

Merged

mbhall88 linked a pull request Sep 12, 2022 that will close this issue

Handle ambiguous bases #295

Merged

leoisl closed this as completed Sep 27, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

How to deal with ambiguous bases in reads #294

How to deal with ambiguous bases in reads #294

mbhall88 commented Sep 6, 2022 •

edited

Loading

iqbal-lab commented Sep 6, 2022

leoisl commented Sep 6, 2022

mbhall88 commented Sep 6, 2022 •

edited

Loading

leoisl commented Sep 7, 2022

iqbal-lab commented Sep 7, 2022

leoisl commented Sep 27, 2022

How to deal with ambiguous bases in reads #294

How to deal with ambiguous bases in reads #294

Comments

mbhall88 commented Sep 6, 2022 • edited Loading

iqbal-lab commented Sep 6, 2022

leoisl commented Sep 6, 2022

mbhall88 commented Sep 6, 2022 • edited Loading

leoisl commented Sep 7, 2022

iqbal-lab commented Sep 7, 2022

leoisl commented Sep 27, 2022

mbhall88 commented Sep 6, 2022 •

edited

Loading

mbhall88 commented Sep 6, 2022 •

edited

Loading