added documentation about mapping uniqueness #131

qiyunzhu · 2021-08-02T18:05:32Z

@antgonza I have incorporated the text you provided regarding the uniqueness of mapping files and how that impacts the downstream analyses, with some modifications, plus a few other edits of the documentation. Please kindly review. Thank you!

After this PR is merged, the most appropriate link to the explanation will be:

https://github.com/qiyunzhu/woltka/blob/master/doc/wol.md#comparison

Also cc'ing @droush

antgonza

Looks good. Apart from the minor comments already added: should the #comparison section be a new section, or go in the collapse or classify pages? I'm not sure it should live only within wol, because (1) this applies to wol and other references, and (2) if we move to a new wol version this info could be lost. What do you think?

antgonza · 2021-08-03T12:09:33Z

doc/collapse.md

+
+
+## Use cases
+
 With this tool one can achieve the following goals:

 1. Translate feature IDs into names or descriptions.


Should we add an extra point here (within 1): Translate genome IDs to functional proteins (get functional profiles) ?

Perhaps worth adding also a reference here to the Considerations below; something like: Please read the Considerations below.

> Should we add an extra point here (within 1): Translate genome IDs to functional proteins (get functional profiles) ?

I think I might be missing something here, Antonio. To do the functional analysis, one would run the classify command and not collapse. One could collapse a genome ID + ORF ID into a protein ID, but a genome ID to protein would correspond to all the ORFs of that genome.

Excellent question. I guess this means that the difference between collapse and classify is not clear, and having more details would be useful. Perhaps a simple page with this information plus the comparison (see my other comment) would be useful. @qiyunzhu, what do you think?

@antgonza thanks for the review!

The comparison section is further elaborated in collapse.md (the former links to the latter). Here is the elaboration: https://github.com/qiyunzhu/woltka/blob/upgrade/doc/collapse.md#considerations

The suggestion "Translate genome IDs to functional proteins" may not apply here. This is because one cannot collapse genomes into functional proteins. (Technically one can but that will be no different from PICRUSt...)

Good suggestion! Let me add the reference.

I thought about it a bit, and currently I tend to refrain from making a separate page. My rationales are:

1. collapse is a simple utility whereas classify is the main module, so the comparison is only relevant when discussing collapse.

2. Several operations can affect the statistical properties of the feature table, and one-to-many collapsing is but one of them. It may be worth having a page dedicated to these questions.

3. Currently, classify cannot handle non-tree structures (especially directed acyclic graphs, which cover most functional classification systems), so we have to over-emphasize collapse as a complement to classify. But implementing this new function is in our plan.

Sorry for the confusion. I meant "Translate per-gene IDs to functional proteins", for example, from Woltka genes to UniRef proteins.

Makes sense to me! I have updated the PR.

droush · 2021-08-03T16:57:20Z

doc/collapse.md

+It is important to highlight that one-to-many mapping may change some of the
+underlying statistical assumptions of downstream analyses.
+
+In the default mode, because one source may be collapsed into multiple targets, the total feature count per sample may be inflated, and the relative abundance of each feature may no longer correspond to that of the sequences assigned to it. In another word, this breaks the [compositionality](https://en.wikipedia.org/wiki/Compositional_data) of the data.


In another word > In other words,
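To make the inflation described in this hunk concrete, here is a minimal sketch in plain Python. It is not Woltka's implementation: the `collapse` helper, the feature IDs, and the counts are all hypothetical, and it assumes that in the default mode a source with several targets contributes its full count to each of them.

```python
from collections import defaultdict


def collapse(counts, mapping):
    """Hypothetical helper: sum source counts into their targets.

    A source mapped to k targets adds its full count to each target,
    which is the assumed behavior of the default (non-division) mode.
    """
    out = defaultdict(int)
    for source, count in counts.items():
        for target in mapping.get(source, ()):
            out[target] += count
    return dict(out)


# Toy sample with made-up reaction IDs and counts.
counts = {"R1": 10, "R2": 5, "R3": 5}
# R2 maps to two targets, so the mapping is one-to-many.
mapping = {"R1": ["EC:1"], "R2": ["EC:1", "EC:2"], "R3": ["EC:3"]}

collapsed = collapse(counts, mapping)
total_before = sum(counts.values())
total_after = sum(collapsed.values())
print(total_before, total_after, collapsed)
```

In this toy case the per-sample total grows from 20 to 25 because the 5 counts of `R2` are counted twice, so the collapsed proportions no longer sum to the original sequence total.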

droush · 2021-08-03T16:59:01Z

doc/collapse.md

+
+For example, in the `reaction-to-ec.txt` file under [MetaCyc](metacyc.md), 80 out of 3618 (2.2%) reactions have more than one corresponding EC number. Whether such a translation may be considered as unique (and whether the resulting table is still compositional) is a call of the user.
+
+A solution to this is to turn on the [division](#division) flag (`-d`). This guarantees that the sum of feature counts remains the same after collapsing. But one should consider the biological implication before making a decision (see [above](#division)).
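As an illustration of why division keeps the total unchanged, here is a small sketch under stated assumptions. It is not Woltka's code: `collapse_divided` and the toy data are hypothetical, and it assumes that division means a source with k targets gives count/k to each target.

```python
from collections import defaultdict
from fractions import Fraction


def collapse_divided(counts, mapping):
    """Hypothetical helper: split each source count evenly across its targets."""
    out = defaultdict(Fraction)
    for source, count in counts.items():
        targets = mapping.get(source, ())
        for target in targets:
            # Each of the k targets receives count / k (exact arithmetic).
            out[target] += Fraction(count, len(targets))
    return dict(out)


# Toy sample with made-up reaction IDs and counts.
counts = {"R1": 10, "R2": 5, "R3": 5}
mapping = {"R1": ["EC:1"], "R2": ["EC:1", "EC:2"], "R3": ["EC:3"]}

divided = collapse_divided(counts, mapping)
total_before = sum(counts.values())
total_after = sum(divided.values())
print(total_before, total_after, divided)
```

Note that the total is preserved here only because every source has at least one target; in this sketch, sources missing from the mapping would simply be dropped.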


Could we report the number of multiple mappings during analysis? HUMAnN3 does this when you regroup the table from gene families to other classification systems. I'm not sure how relevant it would be, but it is worth considering.

Good point! I didn't know that HUMAnN3 does that. Can you point to their documentation? Maybe it's worth opening an issue and considering implementation in the future.

Link to the collapsing as discussed here
Source code here

When you regroup the table, you get a final report on screen stating the # of times a feature was grouped more than once, and more than twice.

Thank you! This seems neat, and shouldn't be hard to implement.
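For discussion, a report of this kind could be sketched as follows. This is only an illustration, not HUMAnN3's or Woltka's code: `multi_mapping_report`, its output keys, and the toy mapping are hypothetical.

```python
def multi_mapping_report(mapping):
    """Hypothetical helper: count sources with more than one / more than two targets."""
    # Deduplicate targets so a repeated target is not counted twice.
    sizes = [len(set(targets)) for targets in mapping.values()]
    total = len(sizes)
    more_than_one = sum(n > 1 for n in sizes)
    more_than_two = sum(n > 2 for n in sizes)
    percent = round(100 * more_than_one / total, 1) if total else 0.0
    return {
        "total": total,
        "more_than_one": more_than_one,
        "more_than_two": more_than_two,
        "percent_multi": percent,
    }


# Toy mapping with made-up IDs.
mapping = {
    "R1": ["EC:1"],
    "R2": ["EC:1", "EC:2"],
    "R3": ["EC:3"],
    "R4": ["EC:1", "EC:2", "EC:4"],
}

report = multi_mapping_report(mapping)
print(report)
```

The same percentage calculation applied to the numbers quoted in the diff (80 of 3618 reactions) gives roughly 2.2%.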

droush · 2021-08-03T17:05:52Z

@qiyunzhu Some comments for review.

qiyunzhu · 2021-08-04T16:35:25Z

Let me merge. @antgonza @droush Thank you again!

added documentation about mapping uniqueness

7dfb4ec

qiyunzhu requested a review from droush August 2, 2021 18:05

fixed typo

94d9667

antgonza reviewed Aug 3, 2021


droush reviewed Aug 3, 2021


qiyunzhu added 3 commits August 3, 2021 13:36

Merge branch 'master' of github.com:qiyunzhu/woltka into upgrade

7771670

incorporated Antonio and Daniel's suggestions

634535d

added collapse use cases

347301f

qiyunzhu merged commit b9ea13a into master Aug 4, 2021
