Profiling output table interpretation #22

Closed
lborcard opened this issue Nov 7, 2022 · 3 comments

@lborcard

lborcard commented Nov 7, 2022

Dear Shenwei,

Thank you very much for this very nice tool. We are trying to understand how to interpret the output table in KMCP format.

  • If the output table contains more than one reference per species, which parameter should we use to choose the best hit?

  • According to your manual, the percentage column refers to the relative abundance of the reference. However, we are not sure how this value is calculated. Could you give us more details about this metric?

Thank you very much,

Best,

Loïc

@shenwei356
Owner

Thanks for using KMCP.

If the output table contains more than one reference per species, which parameter should we use to choose the best hit?

The real genome in a sample may match more than one reference, and we can't tell which one is the true source. However, the similarity score (column score, the 90th percentile of the k-mer coverage of all uniquely matched reads) can serve as an indicator of which reference is most similar to the real genome.

According to your manual, the percentage column refers to the relative abundance of the reference. However, we are not sure how this value is calculated. Could you give us more details about this metric?

First, the coverage (column coverage) of each matched reference genome is computed by dividing the total number of bases in the matched reads by the genome size (the total bases of either a complete genome or an unfinished genome such as a MAG, with plasmid sequences filtered out). Then the relative abundance of a species is computed by dividing the sum of the genome coverages of that species by the sum of the genome coverages of all genomes. Finally, the relative abundance of a taxon at each rank is the sum of the percentages of all its child taxa.
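The first two steps can be sketched in a few lines of Python. This is only an illustration of the arithmetic described above, not KMCP's actual code: the function name and the dictionary keys (species, matched_bases, genome_size) are hypothetical, and the input numbers are made up.

```python
from collections import defaultdict


def relative_abundance(refs):
    """Return {species: percentage} from per-reference matched bases.

    `refs` is a list of dicts with the hypothetical keys
    'species', 'matched_bases' and 'genome_size'.
    """
    coverage_by_species = defaultdict(float)
    for r in refs:
        # Step 1: coverage of one reference = matched bases / genome size.
        coverage = r["matched_bases"] / r["genome_size"]
        # Step 2a: sum the coverages of all references of the same species.
        coverage_by_species[r["species"]] += coverage

    total = sum(coverage_by_species.values())
    if total == 0:
        return {sp: 0.0 for sp in coverage_by_species}
    # Step 2b: divide by the summed coverage of all genomes.
    return {sp: 100.0 * cov / total for sp, cov in coverage_by_species.items()}


# Made-up example: species A has two matched references, species B has one.
refs = [
    {"species": "A", "matched_bases": 10_000_000, "genome_size": 5_000_000},
    {"species": "A", "matched_bases": 4_000_000, "genome_size": 4_000_000},
    {"species": "B", "matched_bases": 3_000_000, "genome_size": 3_000_000},
]
abundance = relative_abundance(refs)
```

Here species A has a summed coverage of 3 and species B of 1, so A gets three quarters of the total. Percentages at higher ranks would then be obtained by summing the values of the child taxa.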

@lborcard
Author

lborcard commented Nov 7, 2022

Thank you for the swift reply. If several references have a score of 100, what would be the second metric to use to filter them? Would coverage be a good one?

@shenwei356
Owner

I think so.
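
For readers who want to apply this tie-break, a minimal sketch follows. It assumes the profile rows have already been loaded as dictionaries; score and coverage are the columns discussed in this thread, while the ref and species keys, the function name, and the example values are hypothetical. It is one possible way to pick a hit, not a rule built into KMCP.

```python
def best_hit_per_species(rows):
    """Keep one reference per species: highest score, then highest coverage."""
    best = {}
    for row in rows:
        key = (row["score"], row["coverage"])
        current = best.get(row["species"])
        # Tuples compare element by element, so coverage only matters
        # when the scores are equal.
        if current is None or key > (current["score"], current["coverage"]):
            best[row["species"]] = row
    return best


# Made-up rows: two references of species A tie at a score of 100.
rows = [
    {"ref": "ref1", "species": "A", "score": 100.0, "coverage": 1.2},
    {"ref": "ref2", "species": "A", "score": 100.0, "coverage": 3.4},
    {"ref": "ref3", "species": "A", "score": 95.0, "coverage": 9.9},
    {"ref": "ref4", "species": "B", "score": 88.0, "coverage": 0.5},
]
best = best_hit_per_species(rows)
```

In this example ref2 is kept for species A: it ties with ref1 on score and wins on coverage, while ref3 is ignored despite its higher coverage because its score is lower.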
