Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[MRG] Improve documentation around abundance projection #1073

Merged
merged 4 commits into from
Jul 13, 2020
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Jump to
Jump to file
Failed to load files.
Diff view
Diff view
73 changes: 71 additions & 2 deletions doc/classifying-signatures.md
Expand Up @@ -58,7 +58,7 @@ metagenome and searches the database for the most highly contained
genome; it then subtracts that match from the metagenome, and repeats.
At the end it reports how much of the metagenome remains unknown. The
[basic sourmash tutorial](tutorial-basic.md#what-s-in-my-metagenome)
has some sample output from using gather with GenBank. See the appendix at
has some sample output from using gather with GenBank. See Appendix A at
the bottom of this page for more technical details.

Some benchmarking on CAMI suggests that `gather` is a very accurate
Expand Down Expand Up @@ -138,6 +138,8 @@ You can also get count-like information from the CSV output of `sourmash
gather`; the column `median_abund` contains the median abundance of the k-mers
in the match to the given genome.

Please see Appendix B, below, for some actual numbers and output.

**Buyer beware:** There are substantial challenges in doing this kind
of analysis on real metagenomic samples, relating to genome representation
and strain overlap; see [this issue](https://github.com/dib-lab/sourmash/issues/461) for a discussion.
Expand Down Expand Up @@ -182,7 +184,7 @@ This helps us figure out what people are actually interested in doing, and
any help we provide via the issue tracker will eventually be added into the
documentation.

## Appendix: how `sourmash gather` works.
## Appendix A: how `sourmash gather` works.

The sourmash gather algorithm works as follows:

Expand Down Expand Up @@ -229,3 +231,70 @@ A few quick notes for the algorithmic folk out there --
and in fact should only become more sensitive and specific as we
increase database size. (Although of course it may get a lot
slower...)

## Appendix B: sourmash gather and `--track-abundance`

Below is a discussion of a synthetic set of test cases using three
randomly generated (fake) genomes of the same size, with two even read
data set abundances of 2x each, and a third read data set with 20x.

### Data set prep

First, we make some synthetic data sets:

* r1.fa with 2x coverage of genome s10
* r2.fa with 10x coverage of genome s10.
* r3.fa with 2x coverage of genome s11.

then we make signature s10-s11 with r1 and r3, i.e. 1:1 abundance, and
make signature s10x10-s11 with r2 and r3, i.e. 10:1 abundance.

### A first experiment: 1:1 abundance.

When we project r1+r3, 1:1 abundance, onto s10, s11, and s12 genomes
with gather:

```
sourmash gather r1+r3 genome-s10.sig genome-s11.sig genome-s12.sig
```

we get:

```
overlap p_query p_match avg_abund
--------- ------- ------- ---------
394.0 kbp 49.6% 78.5% 1.8 genome-s10.fa.gz
376.0 kbp 50.4% 80.0% 1.9 genome-s11.fa.gz
```

* approximately 50% of each query matching (first column, `p_query`)
* approximately 80% of subject genome's contents being matched (this is due to the low coverage of 2 used to build queries; `p_match`)
* approximately 2.0 abundance (third column, `avg_abund`)
* no match to genome s12.

### A second experiment: 10:1 abundance.

When we project r2+r3, 10:1 abundance, onto s10, s11, and s12 genomes
with gather:

```
sourmash gather r2+r3 genome-s10.sig genome-s11.sig genome-s12.sig
```

we get:

```
overlap p_query p_match avg_abund
--------- ------- ------- ---------
0.5 Mbp 91.0% 100.0% 14.5 tests/test-data/genome-s10.fa.gz
376.0 kbp 9.0% 80.0% 1.9 tests/test-data/genome-s11.fa.gz
```

* approximately 91% of s10 matching
* approximately 9% of s11 matching
* approximately 100% of the high coverage genome being matched, with only 80% of the low coverage genome
* approximately 2.0 abundance (third column, avg_abund) for s11, and (very) approximately 20x abundance for genome s10.

The cause of the poor approximation for genome s10 is unclear; it
could be due to low coverage, or small genome size, or something
else. More investigation needed.
4 changes: 2 additions & 2 deletions tests/test_sourmash.py
Expand Up @@ -3173,9 +3173,9 @@ def test_gather_deduce_moltype():
@utils.in_thisdir
def test_gather_abund_1_1(c):
#
# make r1.fa with 1x coverage of genome s10
# make r1.fa with 2x coverage of genome s10
# make r2.fa with 10x coverage of genome s10.
# make r3.fa with 1x coverage of genome s11.
# make r3.fa with 2x coverage of genome s11.
#
# nullgraph/make-reads.py -S 1 -r 200 -C 2 tests/test-data/genome-s10.fa.gz > r1.fa
# nullgraph/make-reads.py -S 1 -r 200 -C 20 tests/test-data/genome-s10.fa.gz > r2.fa
Expand Down