Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove low coverage bases from consensus sequence? #67

Open
jiaan-yu opened this issue Apr 16, 2024 · 7 comments
Open

Remove low coverage bases from consensus sequence? #67

jiaan-yu opened this issue Apr 16, 2024 · 7 comments

Comments

@jiaan-yu
Copy link

Hi Yan,

I use abPOA to generate a consensus from four sequences. Here is the distribtuion of n_seq / n_cov in each base in the consensus sequence. My question is, is there a way to remove low coverage bases? For this consensus sequence, the coverage should be 0.25/0.5/0.75/1; however, the low coverage bases at 0.25 are clearly errors from the input sequences.

FDE4BB4C-7937-4d47-A04A-D5C5C64982E7

Cheers,
Jiaan

@yangao07
Copy link
Owner

I guess you mean the consesus called by abPOA may not be as accurate as you expected?
Can you share your input sequences here so that I can have more ideas about that?

@jiaan-yu
Copy link
Author

Hi Yan,

Here is a section of the MSA output from abPOA (everything at default). I outlined the expected behaviour in green box and unexpected in red.
1508C185-64B2-4202-AECE-31478121347C
Also attached is the input fastq file. input.fastq.txt

Cheers

@yangao07
Copy link
Owner

yangao07 commented Apr 17, 2024

Thanks for the clarification!
abPOA did have an option to output consensus sequence with the most common/frequent base at each position, but was removed now.
Right now, the consensus sequence is a travsal with max edge weight on the graph.

I can re-add this option.
Will post the update here.

@yangao07
Copy link
Owner

The latest abPOA has an option to chose a different consensus calling method: -a1, which selects the most frequent base at each position. The default is still heaviest bundling.

abpoa -a1 input.fastq.txt -r2

image

@jiaan-yu
Copy link
Author

jiaan-yu commented Apr 25, 2024

Hi Yan,

Thanks so much for your swift response! I don't know if you have benchmarked the performance of two consensus algo, here is my findings: Sequences are generated with fixed and independent errors
The performance of default algo (-a0) is superior when the number of input sequences is >= 7; on the contrary, the performance of frequency algo (-a1) is much better when the number of input sequences is low...
It is good to know when to use which cons method.

image

You may close this issue.

Cheers,
Jiaan

@yangao07
Copy link
Owner

Hi Jiaan,

Thanks for the information. Actually I wasn't aware of this "input-number-dependent" performance difference.
This is interesting to me.
I guess you have a template sequence as the ground truth, and simulated read sets with different read counts, right?
Do you mind providing more details about your simulation, or maybe simply sharing your scripts so that I can look more into this?

@jiaan-yu
Copy link
Author

jiaan-yu commented Apr 26, 2024

Hi Yan,
Here is how I tested abPOA:

(1) I randomly subset sequences from a reference fasta (E.coli etc.) as templates. The length of the templates can be 1K to 5K; but the consensus accuracy is independent of the length anyway.

(2) Simlord to insert errors into the templates https://bitbucket.org/genomeinformatics/simlord/src/master/.

simlord \
    --read-reference $template_fasta \
    -n 10 -ps 0.2 -pi 0.2 -pd 0.2 -t 0.18 \
    --no-sam $error_fasta

-n 10 for 10 repeats; -t 0.18 for overall error rate

(3) Generate consensus sequence with abPOA:
abpoa -Q -r5 $error_fasta > $ccs_fastq

(4) I use edit distance (a python package) to calculate read accuracy. https://pypi.org/project/editdistance/

import editdistance
accuracy = 1-editdistance.eval(ccs_fastq_seq, template_fasta_seq) / float(len(template_fasta_seq))

Cheers

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants