Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpreting abundance #549

Closed
mytluo opened this issue Sep 26, 2018 · 22 comments
Closed

Interpreting abundance #549

mytluo opened this issue Sep 26, 2018 · 22 comments

Comments

@mytluo
Copy link

mytluo commented Sep 26, 2018

Hi @ctb,

I ran --track-abundance with sourmash gather, and wasn't sure how to interpret the values. Should I be looking at average_abund?

On another note, for --output-unassigned, is it correct to interpret that file as the list of hashes that were unclassified for the "mins" array? If I took the length of that array and added it to the sum of average_abund (or the correct column), could I calculate the % abundance from that sum? Apologies if this is completely off the mark!

@mytluo
Copy link
Author

mytluo commented Oct 10, 2018

Hi @luizirber, I was wondering if you could help me with this problem as well? I was pointed to f_unique_weighted as a possible equivalence to relative abundance -- the summed fraction seems to match the query coverage, so that seems right, but I just wanted to confirm before moving forward :)

additionally, may I also ask if I could interpret the counts from lca summarize to abundance? it was suggested the counts might be number of hashes; I tried looking for the total hashes in the signature file, but only got num: 0 after the list of mins hashes. apologies for the bother!

@luizirber
Copy link
Member

so, num is a bit misleading because it's a value we use internally in sourmash to control between regular minhashes (like mash) or scaled minhashes. In regular minhashes len(mins) == num, but because you calculated your signatures with --scaled you will need to check len(mins) to get how many hashes were collected.

@mytluo
Copy link
Author

mytluo commented Oct 11, 2018

@luizirber, thank you! that makes sense... I got 107323 from len(mins), but when I sum up the entire count column from lca summarize, it only adds up to 67072... I would also think this is the wrong sum in general as some of the counts are redundant -- counts from E. coli would also be in Escherichia, up to the Bacteria superkingdom level. But that would lower the total even more, which is strange, given 93.7% of the query was hit. Would you be able to point me in the right direction? I apologize for all the questions, I know I've asked a lot ^^;;

@mytluo
Copy link
Author

mytluo commented Oct 11, 2018

I realized I forgot the --scaled ^^;; last question (hopefully) -- in the LCA counts, would the count with no superkingdom assigned be considered unclassified reads?

@ctb
Copy link
Contributor

ctb commented Nov 1, 2018

While I try to find time to write more docs, please also see my STAMPS 2018 presentation at https://osf.io/8d56n/ -- thx!

@Alok-Shekhar
Copy link

@mytluo , @ctb
I am not getting how to interpret souurmash gather output especially f_unique_weighted, average_abund, median_abund, std_abund columns values, how to calculate %abundance value from the run, will you please explain.

  1. how to interpret --scaled option although its explained but am not getting correctly, will you please elaborate it for better understanding. Thanks A lot!

@ctb
Copy link
Contributor

ctb commented Jan 13, 2019

hi @Alok-Shekhar thank you for your questions!

scaled

re scaled, have you seen the explanations in Using sourmash: a practical guide and modulo hash details?

re abundance, what specific question(s) do you want to answer?

here's an example (from the test code) that might help:

In the tests/test-data/ directory, I run:

% sourmash gather gather-abund/reads-s10x10-s11.sig gather-abund/genome-s{10,11,12}.fa.gz.sig
== This is sourmash version 2.0.0a12.dev48+ga92289b. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

select query k=21 automatically.
loaded query: 1-1... (k=21, DNA)
loaded 3 signatures.


overlap     p_query p_match avg_abund
---------   ------- ------- ---------
0.5 Mbp       91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz
376.0 kbp      9.0%   80.0%       1.9    tests/test-data/genome-s11.fa.gz

found 2 matches total;
the recovered matches hit 100.0% of the query

here, the query reads were constructed to have 10 times as many reads from s10 as s11. And so that's what you see with p_query (91% of the query is genome s10, 9% of the query is genome s11). In terms of the average abundance (avg_abund), what you see is approximately the same relative relationship (10x as much in s10 as in s11).

For this example, there is no difference between p_query and avg_abund, because the s10 and s11 genomes are the same size. However, were s10 twice as big as s11, then the p_query should change while avg_abund should not.

@Alok-Shekhar
Copy link

Thanks! @ctb for the explanation
Am I right, please let me know.
f_unique_weighted columns values are considered as %abundance value?
Thanks a lot!

@Alok-Shekhar
Copy link

@ctb Abundance info just like we get from kraken run from "% of fragments covered " column

Thanks!

@ctb
Copy link
Contributor

ctb commented Apr 4, 2020

@luizirber is doing a bunch of work on this for his thesis, so we can finally lay this to rest with benchmarks and (lots of) details. soon!

@cesarcardonau
Copy link

cesarcardonau commented Jun 19, 2020

Hi @ctb @luizirber, since it has been a few months, I was wondering if you guys have any more details/documentation for interpreting abundances as mentioned above.

I am struggling a bit myself, picking up these differences. For instance,

  1. How is the abundance weighing for f_unique_weighted work, if it seems that doesn't require that the signature is pre-computed with the --track-abundance flag. So how is it able to do it? or what is different from this abundance?
  2. If I have two metagenomics signature files (with abundances), query1 and query2, and run sourmash gather for both using the same reference database, producing gather results, gather1 and gather2, respectively. Which is the better metric to use if
    (i) we want to compare relative-abundance of strain X in the gather1 and gather2 results
    (ii) we want to add the abundance of strain X in gather1 + gather2 results, and
    (iii) we want to add relative-abundance of strain X & Y from the same gather result file.

Thanks!

@ctb
Copy link
Contributor

ctb commented Jun 23, 2020

hi @cesarcardonau et al., let me update you all a bit on some documentation and testing updates!

first, our classifying signatures documentation has been improved. In particular the abundance weighting section is better, and the gather algorithm discussion at the bottom of the page should be informative.

Second, we've updated lca summarize to respect abundance info and elaborated our documentation.

I don't think either of these answers the questions above fully, but they might provide context.

On to the meat of the question --

@ctb
Copy link
Contributor

ctb commented Jun 23, 2020

There's a lot of questions in this issue, so let me try to answer some of them first and then iterate as people ask for clarification!

The gather-abund tests added in #886 are key. See this diff, but I interpret the important bits below:

sourmash gather and abundances

tl;dr Below is a discussion of a synthetic set of test cases using three randomly generated (fake) genomes of the same size, with two even read data set abundances of 2x each, and a third read data set with 20x.

Data set prep

First, we make some synthetic data sets:

  • r1.fa with 2x coverage of genome s10
  • r2.fa with 10x coverage of genome s10.
  • r3.fa with 2x coverage of genome s11.

then we make signature s10-s11 with r1 and r3, i.e. 1:1 abundance, and make signature s10x10-s11 with r2 and r3, i.e. 10:1 abundance.

test_gather_abund_1_1 in test_sourmash.py

When we project r1+r3, 1:1 abundance, onto s10, s11, and s12 genomes with gather:

sourmash gather r1+r3 genome-s10.sig genome-s11.sig genome-s12.sig

we get:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
394.0 kbp     49.6%   78.5%       1.8    genome-s10.fa.gz
376.0 kbp     50.4%   80.0%       1.9    genome-s11.fa.gz
  • approximately 50% of each query matching (first column, p_query)
  • approximately 80% of subject genome's contents being matched (this is due to the low coverage of 2 used to build queries)
  • approximately 2.0 abundance (third column, avg_abund)
  • no match to genome s12.

test_gather_abund_10_1

When we project r2+r3, 10:1 abundance, onto s10, s11, and s12 genomes with gather:

sourmash gather r2+r3 genome-s10.sig genome-s11.sig genome-s12.sig

we get:

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
0.5 Mbp       91.0%  100.0%      14.5    tests/test-data/genome-s10.fa.gz
376.0 kbp      9.0%   80.0%       1.9    tests/test-data/genome-s11.fa.gz
  • approximately 91% of s10 matching
  • approximately 9% of s11 matching
  • approximately 100% of the high coverage genome being matched, with only 80% of the low coverage genome
  • approximately 2.0 abundance (third column, avg_abund) for s11, and (very) approximately 20x abundance for genome s10.

The cause of the poor approximation for genome s10 is unclear; it could be due to low coverage, or small genome size, or something else. More investigation needed.

@ctb
Copy link
Contributor

ctb commented Jun 23, 2020

I think the above should answer @mytluo first question.

answering @mytluo second question,

for --output-unassigned, is it correct to interpret that file as the list of hashes that were unclassified for the "mins" array? If I took the length of that array and added it to the sum of average_abund (or the correct column), could I calculate the % abundance from that sum? Apologies if this is completely off the mark!

Yes, that file is the list of hashes that were not classified / assigned to a genome. The percentages and so on referred to elsewhere in this issue should take that all into account, however, so you don't need to explicitly account for them yourself!

@ctb
Copy link
Contributor

ctb commented Jun 23, 2020

more @mytluo answers -

I was pointed to f_unique_weighted as a possible equivalence to relative abundance -- the summed fraction seems to match the query coverage, so that seems right, but I just wanted to confirm before moving forward :)

Yeah, in partial response to this issue, I put in some tests to make sure I understood my own code :).

Per this line, f_unique_weighted is the abundance-weighted fraction of k-mers uniquely assigned to that gather match. See f_unique_to_query discussion here, which we then adjust to be weighted by the abundances.

@ctb
Copy link
Contributor

ctb commented Jun 23, 2020

the answer to @Alok-Shekhar question,

[how can we get] Abundance info just like we get from kraken run from "% of fragments covered " column?

is in #1022 and in the latest docs for lca summarize

@ctb
Copy link
Contributor

ctb commented Jun 23, 2020

ok I think I'm done for now, thanks for all the questions and please ask more!

@cesarcardonau
Copy link

This is very helpful, thanks so much @ctb

@ctb
Copy link
Contributor

ctb commented Jun 23, 2020 via email

@DivyaRavichandar
Copy link

@ctb This is very helpful in understanding f_unique_weighted. Would it be reasonable to go to from f_unique_weighted to an approximation of reads annotated to a strain by f_unique_weighted*read_depth_in_sample?

@ctb
Copy link
Contributor

ctb commented Jun 23, 2020

my naive thinking is that f_unique_weighted * number of reads should be what you're looking for. but I haven't thought that through :)

@ctb
Copy link
Contributor

ctb commented Jul 3, 2020

closing for now - moved final unanswered question to #1079 for visibility.

@ctb ctb closed this as completed Jul 3, 2020
ctb pushed a commit that referenced this issue Oct 28, 2022
Updates the requirements on
[pytest-cov](https://github.com/pytest-dev/pytest-cov) to permit the
latest version.
<details>
<summary>Changelog</summary>
<p><em>Sourced from <a
href="https://github.com/pytest-dev/pytest-cov/blob/master/CHANGELOG.rst">pytest-cov's
changelog</a>.</em></p>
<blockquote>
<h2>4.0.0 (2022-09-28)</h2>
<p><strong>Note that this release drops support for
multiprocessing.</strong></p>
<ul>
<li>
<p><code>--cov-fail-under</code> no longer causes <code>pytest
--collect-only</code> to fail
Contributed by Zac Hatfield-Dodds in
<code>[#511](pytest-dev/pytest-cov#511)
&lt;https://github.com/pytest-dev/pytest-cov/pull/511&gt;</code>_.</p>
</li>
<li>
<p>Dropped support for multiprocessing (mostly because <code>issue 82408
&lt;https://github.com/python/cpython/issues/82408&gt;</code>_). This
feature was
mostly working but very broken in certain scenarios and made the test
suite very flaky and slow.</p>
<p>There is builtin multiprocessing support in coverage and you can
migrate to that. All you need is this in your
<code>.coveragerc</code>::</p>
<p>[run]
concurrency = multiprocessing
parallel = true
sigterm = true</p>
</li>
<li>
<p>Fixed deprecation in <code>setup.py</code> by trying to import
setuptools before distutils.
Contributed by Ben Greiner in
<code>[#545](pytest-dev/pytest-cov#545)
&lt;https://github.com/pytest-dev/pytest-cov/pull/545&gt;</code>_.</p>
</li>
<li>
<p>Removed undesirable new lines that were displayed while reporting was
disabled.
Contributed by Delgan in
<code>[#540](pytest-dev/pytest-cov#540)
&lt;https://github.com/pytest-dev/pytest-cov/pull/540&gt;</code>_.</p>
</li>
<li>
<p>Documentation fixes.
Contributed by Andre Brisco in
<code>[#543](pytest-dev/pytest-cov#543)
&lt;https://github.com/pytest-dev/pytest-cov/pull/543&gt;</code>_
and Colin O'Dell in
<code>[#525](pytest-dev/pytest-cov#525)
&lt;https://github.com/pytest-dev/pytest-cov/pull/525&gt;</code>_.</p>
</li>
<li>
<p>Added support for LCOV output format via
<code>--cov-report=lcov</code>. Only works with coverage 6.3+.
Contributed by Christian Fetzer in
<code>[#536](pytest-dev/pytest-cov#536)
&lt;https://github.com/pytest-dev/pytest-cov/issues/536&gt;</code>_.</p>
</li>
<li>
<p>Modernized pytest hook implementation.
Contributed by Bruno Oliveira in
<code>[#549](pytest-dev/pytest-cov#549)
&lt;https://github.com/pytest-dev/pytest-cov/pull/549&gt;</code>_
and Ronny Pfannschmidt in
<code>[#550](pytest-dev/pytest-cov#550)
&lt;https://github.com/pytest-dev/pytest-cov/pull/550&gt;</code>_.</p>
</li>
</ul>
<h2>3.0.0 (2021-10-04)</h2>
<p><strong>Note that this release drops support for Python 2.7 and
Python 3.5.</strong></p>
<ul>
<li>Added support for Python 3.10 and updated various test dependencies.
Contributed by Hugo van Kemenade in
<code>[#500](pytest-dev/pytest-cov#500)
&lt;https://github.com/pytest-dev/pytest-cov/pull/500&gt;</code>_.</li>
<li>Switched from Travis CI to GitHub Actions. Contributed by Hugo van
Kemenade in
<code>[#494](pytest-dev/pytest-cov#494)
&lt;https://github.com/pytest-dev/pytest-cov/pull/494&gt;</code>_ and
<code>[#495](pytest-dev/pytest-cov#495)
&lt;https://github.com/pytest-dev/pytest-cov/pull/495&gt;</code>_.</li>
<li>Add a <code>--cov-reset</code> CLI option.
Contributed by Danilo Šegan in
<code>[#459](pytest-dev/pytest-cov#459)
&lt;https://github.com/pytest-dev/pytest-cov/pull/459&gt;</code>_.</li>
<li>Improved validation of <code>--cov-fail-under</code> CLI option.
Contributed by ... Ronny Pfannschmidt's desire for skark in
<code>[#480](pytest-dev/pytest-cov#480)
&lt;https://github.com/pytest-dev/pytest-cov/pull/480&gt;</code>_.</li>
<li>Dropped Python 2.7 support.</li>
</ul>
<!-- raw HTML omitted -->
</blockquote>
<p>... (truncated)</p>
</details>
<details>
<summary>Commits</summary>
<ul>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/28db055bebbf3ee016a2144c8b69dd7b80b48cc5"><code>28db055</code></a>
Bump version: 3.0.0 → 4.0.0</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/57e9354a86f658556fe6f15f07625c4b9a9ddf53"><code>57e9354</code></a>
Really update the changelog.</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/56b810b91c9ae15d1462633c6a8a1b522ebf8e65"><code>56b810b</code></a>
Update chagelog.</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/f7fced579e36b72b57e14768026467e4c4511a40"><code>f7fced5</code></a>
Add support for LCOV output</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/1211d3134bb74abb7b00c3c2209091aaab440417"><code>1211d31</code></a>
Fix flake8 error</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/b077753f5d9d200815fe500d0ef23e306784e65b"><code>b077753</code></a>
Use modern approach to specify hook options</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/00713b3fec90cb8c98a9e4bfb3212e574c08e67b"><code>00713b3</code></a>
removed incorrect docs on <code>data_file</code>.</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/b3dda36fddd3ca75689bb3645cd320aa8392aaf3"><code>b3dda36</code></a>
Improve workflow with a collecting status check. (<a
href="https://github-redirect.dependabot.com/pytest-dev/pytest-cov/issues/548">#548</a>)</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/218419f665229d61356f1eea3ddc8e18aa21f87c"><code>218419f</code></a>
Prevent undesirable new lines to be displayed when report is
disabled</li>
<li><a
href="https://github.com/pytest-dev/pytest-cov/commit/60b73ec673c60942a3cf052ee8a1fdc442840558"><code>60b73ec</code></a>
migrate build command from distutils to setuptools</li>
<li>Additional commits viewable in <a
href="https://github.com/pytest-dev/pytest-cov/compare/v2.12.0...v4.0.0">compare
view</a></li>
</ul>
</details>
<br />


Dependabot will resolve any conflicts with this PR as long as you don't
alter it yourself. You can also trigger a rebase manually by commenting
`@dependabot rebase`.

[//]: # (dependabot-automerge-start)
[//]: # (dependabot-automerge-end)

---

<details>
<summary>Dependabot commands and options</summary>
<br />

You can trigger Dependabot actions by commenting on this PR:
- `@dependabot rebase` will rebase this PR
- `@dependabot recreate` will recreate this PR, overwriting any edits
that have been made to it
- `@dependabot merge` will merge this PR after your CI passes on it
- `@dependabot squash and merge` will squash and merge this PR after
your CI passes on it
- `@dependabot cancel merge` will cancel a previously requested merge
and block automerging
- `@dependabot reopen` will reopen this PR if it is closed
- `@dependabot close` will close this PR and stop Dependabot recreating
it. You can achieve the same result by closing it manually
- `@dependabot ignore this major version` will close this PR and stop
Dependabot creating any more for this major version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this minor version` will close this PR and stop
Dependabot creating any more for this minor version (unless you reopen
the PR or upgrade to it yourself)
- `@dependabot ignore this dependency` will close this PR and stop
Dependabot creating any more for this dependency (unless you reopen the
PR or upgrade to it yourself)


</details>

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

6 participants