added documentation about mapping uniqueness #131

Merged
merged 5 commits into from
Aug 4, 2021
Conversation

qiyunzhu
Owner

@qiyunzhu qiyunzhu commented Aug 2, 2021

@antgonza I have incorporated the text you provided regarding the uniqueness of mapping files and how that impacts the downstream analyses, with some modifications, plus a few other edits of the documentation. Please kindly review. Thank you!

After this PR is merged, the most appropriate link to the explanation will be:

https://github.com/qiyunzhu/woltka/blob/master/doc/wol.md#comparison

Also cc'ing @droush

@qiyunzhu qiyunzhu requested a review from droush August 2, 2021 18:05

@antgonza antgonza left a comment

Looks good. Apart from the minor comments already added: should the #comparison section be a new section, or go in the collapse or classify pages? I'm not sure it should live only within wol, because (1) this applies to wol and other references, and (2) if we move to a new wol version this info could be lost. What do you think?

doc/collapse.md Outdated


## Use cases

With this tool one can achieve the following goals:

1. Translate feature IDs into names or descriptions.

Should we add an extra point here (within 1): Translate genome IDs to functional proteins (get functional profiles) ?

Perhaps also worth adding a reference here to the Considerations below; something like: "Please read the Considerations below."

Collaborator

> Should we add an extra point here (within 1): Translate genome IDs to functional proteins (get functional profiles) ?

I think I might be missing something here, Antonio. To do the functional analysis, one would run the classify command, not collapse. One could collapse a genome ID + ORF ID into a protein ID, but collapsing a genome ID to proteins would correspond to all the ORFs of that genome.

Excellent question. I guess this means that the difference between collapse and classify is not clear, and more details would be useful. Perhaps a simple page with this information plus the comparison (see my other comment) would help. @qiyunzhu, what do you think?

Owner Author

@antgonza thanks for the review!

  1. The comparison section is further elaborated in collapse.md (The former has a link to the latter). Here is the elaboration: https://github.com/qiyunzhu/woltka/blob/upgrade/doc/collapse.md#considerations.
  2. The suggestion "Translate genome IDs to functional proteins" may not apply here. This is because one cannot collapse genomes into functional proteins. (Technically one can but that will be no different from PICRUSt...)
  3. Good suggestion! Let me add the reference.
  4. I thought about it a bit, and currently I tend to refrain from making a separate page. My rationales are:
     • collapse is a simple utility whereas classify is the main module. So the comparison is only relevant when discussing collapse.
     • Several operations can affect the statistical properties of the feature table, and one-to-many collapsing is but one of them. It may be worth having a page dedicated to these questions.
     • Currently, classify cannot handle non-tree structures (especially directed acyclic graphs, which cover most functional classification systems), so we have to over-emphasize collapse as a complement to classify. But implementing this new function is in our plan.

Sorry for the confusion. I meant "Translate per-gene IDs to functional proteins", for example: from woltka genes to uniref proteins.

Owner Author

Makes sense to me! I have updated the PR.

doc/collapse.md Outdated
It is important to highlight that one-to-many mapping may change some of the
underlying statistical assumptions of downstream analyses.

In the default mode, because one source may be collapsed into multiple targets, the total feature count per sample may be inflated, and the relative abundance of each feature may no longer correspond to that of the sequences assigned to it. In another word, this breaks the [compositionality](https://en.wikipedia.org/wiki/Compositional_data) of the data.
Collaborator

"In another word" → "In other words,"


For example, in the `reaction-to-ec.txt` file under [MetaCyc](metacyc.md), 80 out of 3618 (2.2%) reactions have more than one corresponding EC number. Whether such a translation may be considered as unique (and whether the resulting table is still compositional) is a call of the user.

A solution to this is to turn on the [division](#division) flag (`-d`). This guarantees that the sum of feature counts remains the same after collapsing. But one should consider the biological implication before making a decision (see [above](#division)).
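
The effect described above can be illustrated with a minimal Python sketch. It is not Woltka's implementation: the `collapse` function, the toy reaction IDs, and the EC numbers are hypothetical, and it assumes that division splits a source's count evenly among its targets and that unmapped sources are dropped.

```python
from collections import defaultdict

def collapse(counts, mapping, divide=False):
    """Collapse source feature counts into target features (illustrative only).

    counts:  {source_id: count} for one sample
    mapping: {source_id: [target_id, ...]}, one-to-many allowed
    divide:  if True, split each count evenly among its targets (assumed behavior)
    """
    result = defaultdict(float)
    for source, count in counts.items():
        targets = mapping.get(source, [])
        if not targets:
            continue  # unmapped sources are dropped in this sketch
        share = count / len(targets) if divide else count
        for target in targets:
            result[target] += share
    return dict(result)

# Hypothetical toy data: reaction R2 maps to two EC numbers.
counts = {"R1": 10, "R2": 6}
mapping = {"R1": ["EC:1.1.1.1"], "R2": ["EC:1.1.1.1", "EC:2.7.1.1"]}

default = collapse(counts, mapping)               # R2's count is added to both targets
divided = collapse(counts, mapping, divide=True)  # R2's count is split 3 + 3

print(sum(counts.values()), sum(default.values()), sum(divided.values()))
```

In this toy sample the total rises from 16 to 22 without division and stays at 16 with it. The total is preserved only because every source has at least one target here.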
Collaborator

Could we report the number of multiple mappings during analysis? HUMAnN3 does this when you regroup the table from gene families to other classification systems. I'm not sure how relevant it would be, but it is a consideration in itself.

Owner Author

Good point! I didn't know that HUMAnN3 does that. Can you point to their documentation? Maybe it's worth opening an issue and considering implementation in the future.

Collaborator

Link to the collapsing as discussed here
Source code here

When you regroup the table, you get a final report on screen stating the # of times a feature was grouped more than once, and more than twice.
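
A report of this kind could be sketched roughly as follows. This is neither HUMAnN's nor Woltka's code: the function name `report_multiple_mappings` and the toy mapping are hypothetical, and the mapping is assumed to be already loaded as a dictionary from source ID to a list of target IDs.

```python
def report_multiple_mappings(mapping):
    """Count sources mapped to more than one and more than two distinct targets.

    mapping: {source_id: [target_id, ...]} (assumed to be already parsed)
    Returns (total sources, sources with >1 target, sources with >2 targets).
    """
    n_targets = [len(set(targets)) for targets in mapping.values()]
    more_than_once = sum(1 for n in n_targets if n > 1)
    more_than_twice = sum(1 for n in n_targets if n > 2)
    return len(n_targets), more_than_once, more_than_twice

# Hypothetical toy mapping with one, two, and three targets per source.
toy_mapping = {
    "R1": ["EC:1.1.1.1"],
    "R2": ["EC:1.1.1.1", "EC:2.7.1.1"],
    "R3": ["EC:1.1.1.1", "EC:2.7.1.1", "EC:3.1.1.1"],
}

total, gt1, gt2 = report_multiple_mappings(toy_mapping)
print(f"{gt1} of {total} sources map to more than one target; {gt2} to more than two")
```

Counting distinct targets (via `set`) is a design choice in this sketch, so that a target listed twice for the same source is not reported as a multiple mapping.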

Owner Author

Thank you! This seems neat, and shouldn't be hard to implement.

Collaborator

droush commented Aug 3, 2021

@qiyunzhu Some comments for review.

Owner Author

qiyunzhu commented Aug 4, 2021

Let me merge. @antgonza @droush Thank you again!

@qiyunzhu qiyunzhu merged commit b9ea13a into master Aug 4, 2021
3 participants