explain p_match and p_query in sourmash documentation #1289

ctb · 2021-01-21T16:09:27Z

it's not easy to find!

ctb · 2021-01-25T16:16:10Z

from @bluegenes in an e-mail response -

Yes the p_match (percent of reference genome matched) does depend on both the relative abundance of that genome in the metagenome (because this can affect the # k-mers from this genome found in the metagenome signature) and the similarity with the reference genome (# of k-mers that will match). However, what you're interested in is ANI, or % DNA identity, right? We can get a decent approximation of ANI by using the "Mash Distance" equations (see implement pairwise evolutionary distance/ANI output from sourmash #1242 for explanation and a python function for conversion). Note that these were developed for jaccard, not containment. The revisions we're working on for Scaled Minhash --> ANI will improve the accuracy for Scaled MinHash containment and put confidence intervals on our estimates.
There are a few use cases where folks want to go from hashes/k-mer sequences back to reads, but it's still a work in progress. If you're planning on doing alignments, Titus has been actively developing a workflow for this ("genome-grist"), which uses sourmash to identify genomes of interest from genbank/other references, download them, and then map the metagenome reads to those reference genomes. You can find it here: https://github.com/dib-lab/genome-grist and I'm happy to help you try it out!

These sourmash issues are directly related to your #2 and might also be useful:
#1237
#483

drtamermansour · 2022-01-18T02:27:16Z

@ctb
I would extend this request to add a clear explanation of "overlap"
and if "overlap", "p_query", or "p_match" are calculated/interrupted differently in the context of abund sketches

This question is motived by unexpected values of overlap as seen below:

overlap     p_query p_match
---------   ------- -------
7.2 Mbp        6.1%    0.5%    GCA_006912095.1 Massospora platypedia...
5.8 Mbp        3.1%    0.8%    GCA_004523865.1 Nephromyces sp. ex Mo...
4.2 Mbp        2.2%    0.4%    GCA_900893395.1 Euglena gracilis stra...
4.8 Mbp        1.9%    0.6%    GCA_009809945.1 Gigaspora margarita s...
4.1 Mbp        1.6%    0.4%    GCA_003724095.1 Austropuccinia psidii...
3.2 Mbp        1.0%    0.5%    GCA_009731375.1 Paulinella micropora ...
3.8 Mbp        0.9%    0.4%    GCA_003550325.1 Gigaspora rosea strai...
2.4 Mbp        0.8%    0.2%    GCA_000978595.1 Saccharina japonica c...
2.5 Mbp        0.7%    0.3%    GCA_000711775.1 Oxytricha trifallax s...
2.2 Mbp        0.7%    0.1%    GCA_009767595.1 Symbiodinium kawaguti...
4.1 Mbp        0.6%    0.7%    GCA_002104975.1 Neocallimastix califo...
1.9 Mbp        0.6%    0.1%    GCA_000507305.1 Breviolum minutum Mf ...

The last 2 entries have the same p_query but the overlap seems to be proportional to the p_match which does not make sense to me

ctb · 2022-01-18T16:05:38Z

The documentation here (appendix A, on sourmash gather) and here (on abundance weighting) would probably be the place to look.
See also appendix B, with detailed info on abundance calculations.

Specific suggestions for changes would be very welcome!

The most likely explanation for the above situation is that the results are from query signatures computed with --track-abundance. Overlap does not use abundance weighting, while p_match does. If you could run this same search with --ignore-abundance and report back that would be lovely :)

code references

The (more optimized, hence ugly/unreadable) code for this is in the src/sourmash/search.py, class Gather Databases, method __next__. For example,

        f_match = len(intersect_mh) / len(found_mh)
        f_orig_query = len(intersect_orig_mh) / orig_query_len

The code in tests/test_sourmash.py::test_gather_f_match_orig checks that the calculations match against the straight up sourmash signature / sketch code -- e.g.

            # f_orig_query is the containment of the query by the match.        
            # (note, this only works because containment is 100% in combined).  
            assert approx_equal(combined_sig.contained_by(match), f_orig_query)

but it's only tested for non-abund signature at the moment. It would be a great addition if someone were to do something similar with abund signatures.

ctb · 2022-01-18T16:12:20Z

Oh, and the code for what's actually output by the sourmash CLI for p_match when you run sourmash gather is in src/sourmash/commands.py, function gather(...), and it looks like this:

        pct_query = '{:.1f}%'.format(result.f_unique_weighted*100)

so it does in fact use the f_unique_weighted which is abundance-weighted for track-abund signatures.

drtamermansour · 2022-01-20T20:59:19Z

Regarding the documentation structure, I have some specific suggestions:

I saw the page of "Using sourmash from the command line" as the entry point to the detailed tutorial. While the page "Classifying signatures: ..." is more like a blog post explaining the technique. Therefore, as a I user, I am expecting the detailed tutorial page to explain the command options, how to use, and the format of each command's inputs/outputs. As an alternative, we might have a separate page for inputs/outputs formats that we can cross linked everywhere.
The section of "appendix A, on sourmash gather" has a clear explanation of the csv output but does not link this anyhow to the standard output table of gather.

drtamermansour · 2022-01-20T21:51:52Z

I think --track-abundance affects p_query not p_match as you are saying above but the example in the appendix is confusing because you are changing the query. So I did some testing for the abundance effect:

I am using 2 samples (sampleA.fq and sampleB.fq). Each one has 25k reads.
You can find the samples in /home/tahmed/test_sourmash_abund
Then I created a campsite metagenome of (10 * sampleA) + (sampleB)

>  data/metagenome1_10.fq
for x in {1..10};do echo $x;
cat  data/sampleA.fq >> data/metagenome1_10.fq
done
cat  data/sampleB.fq >> data/metagenome1_10.fq

Then queried the metagenome against the 2 samples with and without --track-abundance

mkdir sig1000
sourmash sketch dna -p k=21,scaled=1000,abund  data/sampleA.fq -o sig1000/sampleA.sig --name sampleA
sourmash sketch dna -p k=21,scaled=1000,abund  data/sampleB.fq -o sig1000/sampleB.sig --name sampleB
sourmash sketch dna -p k=21,scaled=1000,abund data/metagenome1_10.fq -o sig1000/metagenome1_10.sig --name metagenome1_10

sourmash index database sig1000/sampleA.sig sig1000/sampleB.sig

sourmash gather sig1000/metagenome1_10.sig database.sbt.zip
sourmash gather --ignore-abundance sig1000/metagenome1_10.sig database.sbt.zip

The output with abund tracking

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
2.3 Mbp       17.6%  100.0%       2.6    sampleB
1.7 Mbp       82.4%   99.1%      17.0    sampleA

The output with --ignore-abundance

overlap     p_query p_match
---------   ------- -------
2.3 Mbp       58.0%  100.0%    sampleB
1.7 Mbp       42.0%   99.1%    sampleA

drtamermansour · 2022-01-20T21:59:48Z

So far, I can not find a clear definition to p_query. But here is my guessing:
if p_match has the same definition of f_match, then it means "how much of the match is in the remaining query , after all of the previous matches have been removed".
Similarly p_query means "how much of the remaining query is in the match" which consider the abundance (f_unique_weighted) by default unless --ignore-abundance is specified

ctb · 2022-08-03T10:43:28Z

ref #1227

ctb · 2023-09-30T01:03:39Z

I just wrote the following up for #2184. Comments welcome!

Appendix C: sourmash gather output examples

Below we show two real gather analyses done with a mock metagenome,
SRR606249 (from
Shakya et al., 2014) and
three of the known genomes contained within it - two Shewanella baltica
strains and one Akkermansia muciniphila genome

sourmash gather with a query containing abundance information

% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: SRR606249... (k=31, DNA)
--
loaded 9 total signatures from 3 locations.
after selecting signatures compatible with search, 3 remain.

Starting prefetch sweep across databases.
Prefetch found 3 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap     p_query p_match avg_abund
---------   ------- ------- ---------
5.2 Mbp        0.8%   99.0%      11.7    NC_011663.1 Shewanella baltica OS223...
2.7 Mbp        0.9%  100.0%      24.5    CP001071.1 Akkermansia muciniphila A...
5.2 Mbp        0.3%   51.0%       8.1    NC_009665.1 Shewanella baltica OS185...
found less than 50.0 kbp in common. => exiting

found 3 matches total;
the recovered matches hit 2.0% of the abundance-weighted query.
the recovered matches hit 2.5% of the query k-mers (unweighted).

sourmash gather with the same query, ignoring abundances

% sourmash gather -k 31 SRR606249.trim.sig.zip podar-ref/2.fa.sig podar-ref/47.fa.sig podar-ref/63.fa.sig --ignore-abundance

== This is sourmash version 4.8.5.dev0. ==
== Please cite Brown and Irber (2016), doi:10.21105/joss.00027. ==

selecting specified query k=31
loaded query: SRR606249... (k=31, DNA)
--
loaded 9 total signatures from 3 locations.
after selecting signatures compatible with search, 3 remain.

Starting prefetch sweep across databases.
Prefetch found 3 signatures with overlap >= 50.0 kbp.
Doing gather to generate minimum metagenome cover.

overlap     p_query p_match
---------   ------- -------
5.2 Mbp        1.2%   99.0%    NC_011663.1 Shewanella baltica OS223, complete...
2.7 Mbp        0.6%  100.0%    CP001071.1 Akkermansia muciniphila ATCC BAA-83...
5.2 Mbp        0.6%   51.0%    NC_009665.1 Shewanella baltica OS185, complete...
found less than 50.0 kbp in common. => exiting

found 3 matches total;
the recovered matches hit 2.5% of the query k-mers (unweighted).

Notes and comparisons

There are a few interesting things to point out about the above output:

p_match is the same whether or not abundance information is used.
This is because it is the fraction of the matching genome detected in
the metagenome, which is inherently "flat". It is also reported
progressively: only the portions of the metagenome that have not
matched to any previous matches are used in p_match; read on for
details :).
p_query is different when abundance information is used. For
queries with abundance information, p_query provides a weighted
estimate that approximates the number of metagenome reads that would
map to this genome after mapping reads to all previously reported
matches, i.e. all matches above this match.
When abundance information is not available or
not used, p_query is an estimate of what fraction of the remaining k-mers
in the metagenome match to this genome, after all previous matches
have been removed.
The avg_abund column only shows up when abundance information is
supplied. This is the k-mer coverage of the detected portion of the
match; it is a lower bound on the expected mapping-based coverage
for metagenome reads mapped to the detected portion of the match.
The percent of recovered matches for the abundance-weighted query
is the sum of the p_query column and estimates the total fraction
of metagenome reads that will map across all of the matching references.
The percent of recovered matches when ignoring abundances is likewise
the sum of the (unweighted) p_query column and is not particularly
informative - it will always be low for real metagenomes, because sourmash
cannot match erroneous k-mers created by sequencing errors.
The overlap column is the estimated size of the overlap between the
(unweighted) original query metagenome and the match. It does not take
into account previous matches.

Last but not least, something interesting is going on here with strains.
While it is not highlighted in the text output of gather, there is
substantial overlap between the two Shewanella baltica genomes. And,
in fact, both of them are entirely (well, 99%) present in the metagenome
if measured individually with sourmash search --containment.

Consider a few more details:

p_match for the first Shewanella match, NC_011663.1, is 99%!
p_match for the second Shewanella match, NC_009665.1, is only 50%!
and, confusingly, the overlap for both matches is 5.2 Mbp!

What's up?!

What's happening here is that sourmash gather is subtracting the match
to the first Shewanella genome from the metagenome before moving on to
the next result, and p_match reports only the amount of the match
detected in the remaining metagenome after that subtraction.
However, overlap is reported as the amount of overlap with the
original metagenome, which is essentially the entire genome in all
three cases.

The main things to keep in mind for gather are this:

p_query and p_match do not double-count k-mers or matches; in particular, you can sum across p_query for a metagenome without
counting anything more than once.
overlap does count matches redundantly.
the percent of recovered matches is a useful summary of the whole
metagenome!

We know it's confusing but it's the best output we've been able to
figure out across all of the different use cases for gather. Perhaps
in the future we'll find a better way to represent all of these
numbers in a more clear, concise, and interpretable way; in the
meantime, we welcome your questions and comments!

rpetit3 mentioned this issue Mar 9, 2021

[bug]Error in setting up datasets for a single species bactopia/bactopia#137

Closed

ctb added the faq things to add to an FAQ or docs label Apr 23, 2021

ctb added the doc documentation content or issues label Jun 26, 2021

ctb mentioned this issue Jan 20, 2022

be clearer about when abundance weighting is used in gather output #1805

Closed

ctb mentioned this issue Jul 12, 2022

update documentation re k-mer trimming and gather vs search #2122

Open

ctb mentioned this issue Aug 2, 2022

use 'detection' terminology for fraction-of-genome-kmers-found #2170

Closed

ctb mentioned this issue Sep 30, 2023

MRG: revise documentation structure; add internals page. #2184

Merged

ctb closed this as completed in #2184 Oct 15, 2023

ctb closed this as completed in 0db31c8 Oct 15, 2023

sapuizait mentioned this issue Jul 5, 2024

question about the output when using the sourmash gather #3242

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

explain p_match and p_query in sourmash documentation #1289

explain p_match and p_query in sourmash documentation #1289

ctb commented Jan 21, 2021

ctb commented Jan 25, 2021

drtamermansour commented Jan 18, 2022 •

edited

Loading

ctb commented Jan 18, 2022 •

edited

Loading

ctb commented Jan 18, 2022

drtamermansour commented Jan 20, 2022

drtamermansour commented Jan 20, 2022

drtamermansour commented Jan 20, 2022 •

edited

Loading

ctb commented Aug 3, 2022

ctb commented Sep 30, 2023

explain p_match and p_query in sourmash documentation #1289

explain p_match and p_query in sourmash documentation #1289

Comments

ctb commented Jan 21, 2021

ctb commented Jan 25, 2021

drtamermansour commented Jan 18, 2022 • edited Loading

ctb commented Jan 18, 2022 • edited Loading

code references

ctb commented Jan 18, 2022

drtamermansour commented Jan 20, 2022

drtamermansour commented Jan 20, 2022

drtamermansour commented Jan 20, 2022 • edited Loading

ctb commented Aug 3, 2022

ctb commented Sep 30, 2023

Appendix C: sourmash gather output examples

sourmash gather with a query containing abundance information

sourmash gather with the same query, ignoring abundances

Notes and comparisons

drtamermansour commented Jan 18, 2022 •

edited

Loading

ctb commented Jan 18, 2022 •

edited

Loading

drtamermansour commented Jan 20, 2022 •

edited

Loading