Interpreting abundance #549

mytluo · 2018-09-26T16:21:52Z

I ran --track-abundance with sourmash gather, and wasn't sure how to interpret the values. Should I be looking at average_abund?

On another note, for --output-unassigned, is it correct to interpret that file as the list of hashes that were unclassified for the "mins" array? If I took the length of that array and added it to the sum of average_abund (or the correct column), could I calculate the % abundance from that sum? Apologies if this is completely off the mark!

The text was updated successfully, but these errors were encountered:

mytluo · 2018-10-10T00:25:26Z

Hi @luizirber, I was wondering if you could help me with this problem as well? I was pointed to f_unique_weighted as a possible equivalence to relative abundance -- the summed fraction seems to match the query coverage, so that seems right, but I just wanted to confirm before moving forward :)

additionally, may I also ask if I could interpret the counts from lca summarize to abundance? it was suggested the counts might be number of hashes; I tried looking for the total hashes in the signature file, but only got num: 0 after the list of mins hashes. apologies for the bother!

luizirber · 2018-10-10T17:56:50Z

so, num is a bit misleading because it's a value we use internally in sourmash to control between regular minhashes (like mash) or scaled minhashes. In regular minhashes len(mins) == num, but because you calculated your signatures with --scaled you will need to check len(mins) to get how many hashes were collected.

mytluo · 2018-10-11T01:11:44Z

@luizirber, thank you! that makes sense... I got 107323 from len(mins), but when I sum up the entire count column from lca summarize, it only adds up to 67072... I would also think this is the wrong sum in general as some of the counts are redundant -- counts from E. coli would also be in Escherichia, up to the Bacteria superkingdom level. But that would lower the total even more, which is strange, given 93.7% of the query was hit. Would you be able to point me in the right direction? I apologize for all the questions, I know I've asked a lot ^^;;

mytluo · 2018-10-11T18:41:33Z

I realized I forgot the --scaled ^^;; last question (hopefully) -- in the LCA counts, would the count with no superkingdom assigned be considered unclassified reads?

ctb · 2018-11-01T16:39:05Z

While I try to find time to write more docs, please also see my STAMPS 2018 presentation at https://osf.io/8d56n/ -- thx!

Alok-Shekhar · 2019-01-11T16:03:20Z

@mytluo , @ctb
I am not getting how to interpret souurmash gather output especially f_unique_weighted, average_abund, median_abund, std_abund columns values, how to calculate %abundance value from the run, will you please explain.

how to interpret --scaled option although its explained but am not getting correctly, will you please elaborate it for better understanding. Thanks A lot!

ctb · 2019-01-13T15:02:43Z

hi @Alok-Shekhar thank you for your questions!

scaled

re scaled, have you seen the explanations in Using sourmash: a practical guide and modulo hash details?

re abundance, what specific question(s) do you want to answer?

here's an example (from the test code) that might help:

In the tests/test-data/ directory, I run:

% sourmash gather gather-abund/reads-s10x10-s11.sig gather-abund/genome-s{10,11,12}.fa.gz.sig
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=21 automatically.
loaded query: 1-1... (k=21, DNA)
loaded 3 signatures.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
0.5 Mbp       91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz
376.0 kbp      9.0%   80.0%       1.9    tests/test-data/genome-s11.fa.gz

found 2 matches total;
the recovered matches hit 100.0% of the query

here, the query reads were constructed to have 10 times as many reads from s10 as s11. And so that's what you see with p_query (91% of the query is genome s10, 9% of the query is genome s11). In terms of the average abundance (avg_abund), what you see is approximately the same relative relationship (10x as much in s10 as in s11).

For this example, there is no difference between p_query and avg_abund, because the s10 and s11 genomes are the same size. However, were s10 twice as big as s11, then the p_query should change while avg_abund should not.

Alok-Shekhar · 2019-01-14T14:50:25Z

Thanks! @ctb for the explanation
Am I right, please let me know.
f_unique_weighted columns values are considered as %abundance value?
Thanks a lot!

Alok-Shekhar · 2019-01-16T11:25:39Z

@ctb Abundance info just like we get from kraken run from "% of fragments covered " column

Thanks!

ctb · 2020-04-04T13:44:15Z

@luizirber is doing a bunch of work on this for his thesis, so we can finally lay this to rest with benchmarks and (lots of) details. soon!

cesarcardonau · 2020-06-19T21:16:00Z

Hi @ctb @luizirber, since it has been a few months, I was wondering if you guys have any more details/documentation for interpreting abundances as mentioned above.

I am struggling a bit myself, picking up these differences. For instance,

How is the abundance weighing for f_unique_weighted work, if it seems that doesn't require that the signature is pre-computed with the --track-abundance flag. So how is it able to do it? or what is different from this abundance?
If I have two metagenomics signature files (with abundances), query1 and query2, and run sourmash gather for both using the same reference database, producing gather results, gather1 and gather2, respectively. Which is the better metric to use if
(i) we want to compare relative-abundance of strain X in the gather1 and gather2 results
(ii) we want to add the abundance of strain X in gather1 + gather2 results, and
(iii) we want to add relative-abundance of strain X & Y from the same gather result file.

Thanks!

ctb · 2020-06-23T15:03:00Z

hi @cesarcardonau et al., let me update you all a bit on some documentation and testing updates!

first, our classifying signatures documentation has been improved. In particular the abundance weighting section is better, and the gather algorithm discussion at the bottom of the page should be informative.

Second, we've updated lca summarize to respect abundance info and elaborated our documentation.

I don't think either of these answers the questions above fully, but they might provide context.

On to the meat of the question --

ctb · 2020-06-23T15:22:44Z

There's a lot of questions in this issue, so let me try to answer some of them first and then iterate as people ask for clarification!

The gather-abund tests added in #886 are key. See this diff, but I interpret the important bits below:

sourmash gather and abundances

tl;dr Below is a discussion of a synthetic set of test cases using three randomly generated (fake) genomes of the same size, with two even read data set abundances of 2x each, and a third read data set with 20x.

Data set prep

First, we make some synthetic data sets:

r1.fa with 2x coverage of genome s10
r2.fa with 10x coverage of genome s10.
r3.fa with 2x coverage of genome s11.

then we make signature s10-s11 with r1 and r3, i.e. 1:1 abundance, and make signature s10x10-s11 with r2 and r3, i.e. 10:1 abundance.

test_gather_abund_1_1 in test_sourmash.py

When we project r1+r3, 1:1 abundance, onto s10, s11, and s12 genomes with gather:

sourmash gather r1+r3 genome-s10.sig genome-s11.sig genome-s12.sig

we get:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
394.0 kbp     49.6%   78.5%       1.8    genome-s10.fa.gz
376.0 kbp     50.4%   80.0%       1.9    genome-s11.fa.gz

approximately 50% of each query matching (first column, p_query)
approximately 80% of subject genome's contents being matched (this is due to the low coverage of 2 used to build queries)
approximately 2.0 abundance (third column, avg_abund)
no match to genome s12.

test_gather_abund_10_1

When we project r2+r3, 10:1 abundance, onto s10, s11, and s12 genomes with gather:

sourmash gather r2+r3 genome-s10.sig genome-s11.sig genome-s12.sig

we get:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
0.5 Mbp       91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz
376.0 kbp      9.0%   80.0%       1.9    tests/test-data/genome-s11.fa.gz

approximately 91% of s10 matching
approximately 9% of s11 matching
approximately 100% of the high coverage genome being matched, with only 80% of the low coverage genome
approximately 2.0 abundance (third column, avg_abund) for s11, and (very) approximately 20x abundance for genome s10.

The cause of the poor approximation for genome s10 is unclear; it could be due to low coverage, or small genome size, or something else. More investigation needed.

ctb · 2020-06-23T15:28:53Z

I think the above should answer @mytluo first question.

answering @mytluo second question,

for --output-unassigned, is it correct to interpret that file as the list of hashes that were unclassified for the "mins" array? If I took the length of that array and added it to the sum of average_abund (or the correct column), could I calculate the % abundance from that sum? Apologies if this is completely off the mark!

Yes, that file is the list of hashes that were not classified / assigned to a genome. The percentages and so on referred to elsewhere in this issue should take that all into account, however, so you don't need to explicitly account for them yourself!

ctb · 2020-06-23T15:34:13Z

more @mytluo answers -

I was pointed to f_unique_weighted as a possible equivalence to relative abundance -- the summed fraction seems to match the query coverage, so that seems right, but I just wanted to confirm before moving forward :)

Yeah, in partial response to this issue, I put in some tests to make sure I understood my own code :).

Per this line, f_unique_weighted is the abundance-weighted fraction of k-mers uniquely assigned to that gather match. See f_unique_to_query discussion here, which we then adjust to be weighted by the abundances.

ctb · 2020-06-23T15:36:20Z

the answer to @Alok-Shekhar question,

[how can we get] Abundance info just like we get from kraken run from "% of fragments covered " column?

is in #1022 and in the latest docs for lca summarize

ctb · 2020-06-23T15:36:35Z

ok I think I'm done for now, thanks for all the questions and please ask more!

cesarcardonau · 2020-06-23T19:08:36Z

This is very helpful, thanks so much @ctb

ctb · 2020-06-23T19:13:53Z

thank you for sticking with & please do ask for clarification :). we have an increasing number of good synthetic and real data sets that we can use to explore and demonstrate this stuff.

DivyaRavichandar · 2020-06-23T19:38:42Z

@ctb This is very helpful in understanding f_unique_weighted. Would it be reasonable to go to from f_unique_weighted to an approximation of reads annotated to a strain by f_unique_weighted*read_depth_in_sample?

ctb · 2020-06-23T19:42:36Z

my naive thinking is that f_unique_weighted * number of reads should be what you're looking for. but I haven't thought that through :)

ctb · 2020-07-03T18:37:04Z

closing for now - moved final unanswered question to #1079 for visibility.

Updates the requirements on [pytest-cov](https://github.com/pytest-dev/pytest-cov) to permit the latest version. <details> <summary>Changelog</summary> Sourced from <a href="https://github.com/pytest-dev/pytest-cov/blob/master/CHANGELOG.rst">pytest-cov's changelog</a>. <blockquote> <h2>4.0.0 (2022-09-28)</h2> Note that this release drops support for multiprocessing. <ul> <li> <code>--cov-fail-under</code> no longer causes <code>pytest --collect-only</code> to fail Contributed by Zac Hatfield-Dodds in <code>[#511](pytest-dev/pytest-cov#511) <https://github.com/pytest-dev/pytest-cov/pull/511></code>_. </li> <li> Dropped support for multiprocessing (mostly because <code>issue 82408 <https://github.com/python/cpython/issues/82408></code>_). This feature was mostly working but very broken in certain scenarios and made the test suite very flaky and slow. There is builtin multiprocessing support in coverage and you can migrate to that. All you need is this in your <code>.coveragerc</code>:: [run] concurrency = multiprocessing parallel = true sigterm = true </li> <li> Fixed deprecation in <code>setup.py</code> by trying to import setuptools before distutils. Contributed by Ben Greiner in <code>[#545](pytest-dev/pytest-cov#545) <https://github.com/pytest-dev/pytest-cov/pull/545></code>_. </li> <li> Removed undesirable new lines that were displayed while reporting was disabled. Contributed by Delgan in <code>[#540](pytest-dev/pytest-cov#540) <https://github.com/pytest-dev/pytest-cov/pull/540></code>_. </li> <li> Documentation fixes. Contributed by Andre Brisco in <code>[#543](pytest-dev/pytest-cov#543) <https://github.com/pytest-dev/pytest-cov/pull/543></code>_ and Colin O'Dell in <code>[#525](pytest-dev/pytest-cov#525) <https://github.com/pytest-dev/pytest-cov/pull/525></code>_. </li> <li> Added support for LCOV output format via <code>--cov-report=lcov</code>. Only works with coverage 6.3+. Contributed by Christian Fetzer in <code>[#536](pytest-dev/pytest-cov#536) <https://github.com/pytest-dev/pytest-cov/issues/536></code>_. </li> <li> Modernized pytest hook implementation. Contributed by Bruno Oliveira in <code>[#549](pytest-dev/pytest-cov#549) <https://github.com/pytest-dev/pytest-cov/pull/549></code>_ and Ronny Pfannschmidt in <code>[#550](pytest-dev/pytest-cov#550) <https://github.com/pytest-dev/pytest-cov/pull/550></code>_. </li> </ul> <h2>3.0.0 (2021-10-04)</h2> Note that this release drops support for Python 2.7 and Python 3.5. <ul> <li>Added support for Python 3.10 and updated various test dependencies. Contributed by Hugo van Kemenade in <code>[#500](pytest-dev/pytest-cov#500) <https://github.com/pytest-dev/pytest-cov/pull/500></code>_.</li> <li>Switched from Travis CI to GitHub Actions. Contributed by Hugo van Kemenade in <code>[#494](pytest-dev/pytest-cov#494) <https://github.com/pytest-dev/pytest-cov/pull/494></code>_ and <code>[#495](pytest-dev/pytest-cov#495) <https://github.com/pytest-dev/pytest-cov/pull/495></code>_.</li> <li>Add a <code>--cov-reset</code> CLI option. Contributed by Danilo Šegan in <code>[#459](pytest-dev/pytest-cov#459) <https://github.com/pytest-dev/pytest-cov/pull/459></code>_.</li> <li>Improved validation of <code>--cov-fail-under</code> CLI option. Contributed by ... Ronny Pfannschmidt's desire for skark in <code>[#480](pytest-dev/pytest-cov#480) <https://github.com/pytest-dev/pytest-cov/pull/480></code>_.</li> <li>Dropped Python 2.7 support.</li> </ul>  </blockquote> ... (truncated) </details> <details> <summary>Commits</summary> <ul> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/28db055bebbf3ee016a2144c8b69dd7b80b48cc5"><code>28db055</code></a> Bump version: 3.0.0 → 4.0.0</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/57e9354a86f658556fe6f15f07625c4b9a9ddf53"><code>57e9354</code></a> Really update the changelog.</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/56b810b91c9ae15d1462633c6a8a1b522ebf8e65"><code>56b810b</code></a> Update chagelog.</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/f7fced579e36b72b57e14768026467e4c4511a40"><code>f7fced5</code></a> Add support for LCOV output</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/1211d3134bb74abb7b00c3c2209091aaab440417"><code>1211d31</code></a> Fix flake8 error</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/b077753f5d9d200815fe500d0ef23e306784e65b"><code>b077753</code></a> Use modern approach to specify hook options</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/00713b3fec90cb8c98a9e4bfb3212e574c08e67b"><code>00713b3</code></a> removed incorrect docs on <code>data_file</code>.</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/b3dda36fddd3ca75689bb3645cd320aa8392aaf3"><code>b3dda36</code></a> Improve workflow with a collecting status check. (<a href="https://github-redirect.dependabot.com/pytest-dev/pytest-cov/issues/548">#548</a>)</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/218419f665229d61356f1eea3ddc8e18aa21f87c"><code>218419f</code></a> Prevent undesirable new lines to be displayed when report is disabled</li> <li><a href="https://github.com/pytest-dev/pytest-cov/commit/60b73ec673c60942a3cf052ee8a1fdc442840558"><code>60b73ec</code></a> migrate build command from distutils to setuptools</li> <li>Additional commits viewable in <a href="https://github.com/pytest-dev/pytest-cov/compare/v2.12.0...v4.0.0">compare view</a></li> </ul> </details> Dependabot will resolve any conflicts with this PR as long as you don't alter it yourself. You can also trigger a rebase manually by commenting `@dependabot rebase`. [//]: # (dependabot-automerge-start) [//]: # (dependabot-automerge-end) --- <details> <summary>Dependabot commands and options</summary> You can trigger Dependabot actions by commenting on this PR: - `@dependabot rebase` will rebase this PR - `@dependabot recreate` will recreate this PR, overwriting any edits that have been made to it - `@dependabot merge` will merge this PR after your CI passes on it - `@dependabot squash and merge` will squash and merge this PR after your CI passes on it - `@dependabot cancel merge` will cancel a previously requested merge and block automerging - `@dependabot reopen` will reopen this PR if it is closed - `@dependabot close` will close this PR and stop Dependabot recreating it. You can achieve the same result by closing it manually - `@dependabot ignore this major version` will close this PR and stop Dependabot creating any more for this major version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this minor version` will close this PR and stop Dependabot creating any more for this minor version (unless you reopen the PR or upgrade to it yourself) - `@dependabot ignore this dependency` will close this PR and stop Dependabot creating any more for this dependency (unless you reopen the PR or upgrade to it yourself) </details> Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

ctb mentioned this issue Jun 23, 2020

fix comments in test_sourmash.py around abundance testing #1041

Closed

ctb mentioned this issue Jul 3, 2020

how do we approximate number of reads annotated to strain? #1079

Closed

ctb closed this as completed Jul 3, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Interpreting abundance #549

Interpreting abundance #549

mytluo commented Sep 26, 2018

mytluo commented Oct 10, 2018

luizirber commented Oct 10, 2018

mytluo commented Oct 11, 2018

mytluo commented Oct 11, 2018

ctb commented Nov 1, 2018

Alok-Shekhar commented Jan 11, 2019

ctb commented Jan 13, 2019

Alok-Shekhar commented Jan 14, 2019

Alok-Shekhar commented Jan 16, 2019

ctb commented Apr 4, 2020

cesarcardonau commented Jun 19, 2020 •

edited

ctb commented Jun 23, 2020

ctb commented Jun 23, 2020 •

edited

ctb commented Jun 23, 2020

ctb commented Jun 23, 2020

ctb commented Jun 23, 2020

ctb commented Jun 23, 2020

cesarcardonau commented Jun 23, 2020

ctb commented Jun 23, 2020 via email

DivyaRavichandar commented Jun 23, 2020

ctb commented Jun 23, 2020

ctb commented Jul 3, 2020

Interpreting abundance #549

Interpreting abundance #549

Comments

mytluo commented Sep 26, 2018

mytluo commented Oct 10, 2018

luizirber commented Oct 10, 2018

mytluo commented Oct 11, 2018

mytluo commented Oct 11, 2018

ctb commented Nov 1, 2018

Alok-Shekhar commented Jan 11, 2019

ctb commented Jan 13, 2019

scaled

Alok-Shekhar commented Jan 14, 2019

Alok-Shekhar commented Jan 16, 2019

ctb commented Apr 4, 2020

cesarcardonau commented Jun 19, 2020 • edited

ctb commented Jun 23, 2020

ctb commented Jun 23, 2020 • edited

sourmash gather and abundances

Data set prep

test_gather_abund_1_1 in test_sourmash.py

test_gather_abund_10_1

ctb commented Jun 23, 2020

ctb commented Jun 23, 2020

ctb commented Jun 23, 2020

ctb commented Jun 23, 2020

cesarcardonau commented Jun 23, 2020

ctb commented Jun 23, 2020 via email

DivyaRavichandar commented Jun 23, 2020

ctb commented Jun 23, 2020

ctb commented Jul 3, 2020

cesarcardonau commented Jun 19, 2020 •

edited

ctb commented Jun 23, 2020 •

edited