added documentation about mapping uniqueness

qiyunzhu · Aug 2, 2021 · 7dfb4ec · 7dfb4ec
1 parent ac0823f
commit 7dfb4ec
Show file tree

Hide file tree

Showing 4 changed files with 74 additions and 14 deletions.
diff --git a/doc/classify.md b/doc/classify.md
@@ -6,8 +6,9 @@ Woltka grants users the flexibility to control the classification criteria, in a
 ## Contents
 
 - [Target rank (or no rank)](#target-rank-or-no-rank)
+- [Rank specification](#rank-specification)
 - [Ambiguous assignment](#ambiguous-assignment)
-- ["Unassigned" sequences](#"unassigned"-sequences)
+- ["Unassigned" sequences](#unassigned-sequences)
 
 
 ## Target rank (or no rank)
@@ -16,7 +17,7 @@ Woltka features the following modes of classification, as controlled by the `--r
 
 ### 1. **No** classification (`--rank none`)
 
-Simply report subject IDs. A classification system is not required in this analysis. This mode has the highest granularity and is useful in e.g. the [OGU](ogu.md) analysis.
+Simply report subject IDs. A classification system is not required in this analysis. This mode has the highest granularity and is useful in e.g. the [OGU analysis](ogu.md).
 
 ### 2. **Free-rank** classification (`--rank free`)
 
@@ -36,6 +37,9 @@ Attempt to classify at the given rank, and when this is not possible, go up in h
 
 The difference from free-rank classification (2) is that, it discards all units below the given rank, even if some of them may describe the query sequence.
 
+
+## Rank specification
+
 ### Multiple ranks
 
 Multiple ranks can be specified simultaneously, delimited by comma (e.g., `--rank none,free,phylum,genus,species`), in which case Woltka will generate one [profile](output.md) for each rank. This is significantly faster than running Woltka multiple times on individual ranks.
@@ -47,13 +51,13 @@ One can omit the `--rank` parameter. In such case, Woltka automatically performs
 
 ## Ambiguous assignment
 
-In many cases a query sequence has matches in multiple reference sequences, and those sequences may have have different assignments at the rank you instruct Woltka to classify at. Woltka deals with this situation using your choice of the following mechanisms.
+In many cases a query sequence has matches in multiple reference sequences, and those sequences may have different assignments at the target rank. Woltka deals with this situation using your choice of the following mechanisms.
 
-### 1. (Default mode) keep them all, and normalize
+### 1. Keep them all, and divide (default)
 
 In the resulting profile, each subject receives 1 / _k_ count, where _k_ is the total number of subjects of the current query.
 
-For example, sequence A was aligned to five genomes: two under genus [_Escherichia_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=561&lvl=3&lin=f&keep=1&srchmode=1&unlock), one under each of genera [_Salmonella_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=590&lvl=3&lin=f&keep=1&srchmode=1&unlock), [_Klebsiella_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=570&lvl=3&lin=f&keep=1&srchmode=1&unlock), and [_Enterobacter_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=547&lvl=3&lin=f&keep=1&srchmode=1&unlock). So at **genus** level, _Escherichia_ receives 2/5 count, and each of the other three receives 1/5 each.
+For example, sequence A was aligned to five genomes: two under genus [_Escherichia_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=561&lvl=3&lin=f&keep=1&srchmode=1&unlock), one under each of genera [_Salmonella_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=590&lvl=3&lin=f&keep=1&srchmode=1&unlock), [_Klebsiella_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=570&lvl=3&lin=f&keep=1&srchmode=1&unlock), and [_Enterobacter_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=547&lvl=3&lin=f&keep=1&srchmode=1&unlock). So at the **genus** level, _Escherichia_ receives 2/5 count, and each of the other three receives 1/5 each.
 
 In the read-to-feature maps (`--outmap`), a multi-assignment will be reported as:
 
@@ -86,18 +90,24 @@ Note: Majority rule overrides the `--above` flag. Currently, Woltka cannot combi
 
 ## "Unassigned" sequences
 
-With flag `--unassigned`, Woltka reports unassigned sequences in the profile and the feature map. They will be marked as "Unassigned".
+With flag `--unassigned`, Woltka reports unassigned sequences in the profile and the feature map. They will be marked as "**Unassigned**".
 
 A sequence is deemed unassigned because of one of the following reasons:
 
 1. The subject(s) is **not found** in the classification system. For flexibility, Woltka does NOT consider this as a conflict between data and database, and it does NOT halt the program and warn the user. Instead, it is treated as unassigned.
 
-2. The LCA (see above) of subjects is the **root**. A [tree-structured](hierarchy.md) classification hierarchy always have a root, and no matter how diverse subjects are, they always coalesce to the root eventually. But reporting "root" as an assignment is meaningless. So it will be considered as unassigned.
+2. The LCA (see above) of the subjects is the **root**. A [tree-structured](hierarchy.md) classification hierarchy always has a root. No matter how diverse the subjects are, they always coalesce to the root eventually. But reporting "root" as an assignment is meaningless. So it will be considered as unassigned.
 
 3. In unique-assignment mode, assignments of subjects are not unique (see above).
 
 4. In majority-rule mode, none of the candidate units reaches the threshold (see above).
 
+### Considerations
+
+With the unassigned part reported, the sum of feature counts in each sample of the resulting profile equals to the number of query sequences in the original alignment file. This property is potentially useful in certain downstream analyses.
+
+Even if this function was not used in generating the profile, one can still manually calculate the unassigned part by subtracting the sum of feature counts from the number of sequences.
+
 [**Important**] The "unassigned" part represent query sequences that were **aligned** to one or more **subjects**, but Woltka cannot find a suitable assignment based on those subjects. Therefore, assigned + unassigned is NOT the entire sample, but only the part of sample that are found in the alignment.
 
-In [ordinal mapping](ordinal.md), "subjects" are **genes** instead of genomes, therefore despite that some query sequences can be aligned to one or more genomes, they can still be excluded from the "unassigned" part if their coordinates do not match any gene.
+In [ordinal mapping](ordinal.md), subjects are **genes** instead of genomes, therefore despite that some query sequences can be aligned to one or more genomes, they can still be excluded from the unassigned part if their coordinates do not match any gene.
diff --git a/doc/collapse.md b/doc/collapse.md
@@ -6,6 +6,17 @@ The **profile collapsing** function is a lightweight and flexible addition to th
 woltka tools collapse -i input.biom -m mapping.txt -o output.biom
 ```
 
+
+## Contents
+
+- [Use cases](#use-cases)
+- [Mapping file format](#mapping-file-format)
+- [Parameters](#parameters)
+- [Considerations](#considerations)
+
+
+## Use cases
+
 With this tool one can achieve the following goals:
 
 1. Translate feature IDs into names or descriptions.
@@ -55,11 +66,11 @@ source4 <tab> target3
 
 ## Parameters
 
-### Normalization
+### Division
 
-By default, if one source feature is simultaneously mapped to _k_ targets, each target will be counted once. With the `--frac` or `-f` flag added to the command, each target will be counted 1 / _k_ times.
+By default, if one source feature is simultaneously mapped to _k_ targets, each target will be counted once. With the `--divide` or `-d` flag added to the command, each target will be counted 1 / _k_ times.
 
-Whether to enable normalization depends on the nature and aim of your analysis. For example, one gene is involved in two pathways (which isn't uncommon), should each pathway be counted once, or half time?
+Whether to enable division depends on the nature and aim of the analysis. For example, one gene is involved in two metabolic pathways (which isn't uncommon), should each pathway be counted once, or half time?
 
 ### Stratification
 
@@ -68,3 +79,17 @@ Woltka supports collapsing a [stratified](stratify.md) profile using one field i
 ### Feature names
 
 Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").
+
+
+## Considerations
+
+It is important to highlight that one-to-many mapping may change some of the
+underlying statistical assumptions of downstream analyses.
+
+In the default mode, because one source may be collapsed into multiple targets, the total feature count per sample may be inflated, and the relative abundance of each feature may no longer correspond to that of the sequences assigned to it. In another word, this breaks the [compositionality](https://en.wikipedia.org/wiki/Compositional_data) of the data.
+
+How significantly this may impact an analysis depends on the relative frequency of multiple mappings found in the data, the biological relevance of the affected features, and the statistical nature of the analysis.
+
+For example, in the `reaction-to-ec.txt` file under [MetaCyc](metacyc.md), 80 out of 3618 (2.2%) reactions have more than one corresponding EC number. Whether such a translation may be considered as unique (and whether the resulting table is still compositional) is a call of the user.
+
+A solution to this is to turn on the [division](#division) flag (`-d`). This guarantees that the sum of feature counts remains the same after collapsing. But one should consider the biological implication before making a decision (see [above](#division)).
diff --git a/doc/metacyc.md b/doc/metacyc.md
@@ -86,7 +86,7 @@ woltka tools collapse -i pathway.biom -m metacyc/pathway-to-super_pathway.txt -n
 woltka tools collapse -i super_pathway.biom -m metacyc/pathway_type.txt -n metacyc/all_class_name.txt -o pathway_type.biom
 ```
 
-The collapse command supports **many-to-many** mapping. For example, if one reaction is found in three pathways, each pathway will be counted **once**. In some instances (e.g., to retain compositionality of the profile), one may consider adding the `--frac` flag, which will instruct the program to count each pathway 1 / 3 times ([see details](collapse.md)).
+The collapse command supports **many-to-many** mapping. For example, if one reaction is found in three pathways, each pathway will be counted **once**. In some instances (e.g., to retain compositionality of the profile), one may consider adding the `--divide` flag, which will instruct the program to count each pathway 1 / 3 times ([see details](collapse.md)).
 
 
 ## Pathway coverage

diff --git a/doc/wol.md b/doc/wol.md
@@ -216,11 +216,28 @@ So on so forth. See [here](metacyc.md) for a graph of all available collapsing d
 
 ### Comparison
 
-**Important**: The differences between method 1 (`classify`) and method 2 (`collapse`) are:
+**Important**: The differences between [method 1](#method-1) (`classify`) and [method 2](#method-2) (`collapse`) are:
 
 `classify` only supports a tree structure, in which one child unit has exactly one parent unit. This is typical in taxonomic classification. If multiple parents are present, all but the first parent will be discarded. In contrast, `collapse` supports **one-to-multiple** mappings, therefore it is more suitable when this is the norm instead of exception, especially in functional classification (where one gene can be involved in multiple metabolic pathways).
 
-`classify` always ensures the **compositionality** of the feature table, in which the frequencies match the numbers of aligned sequences. `collapse` however does not by default. In a one-to-multiple mapping, all parents will be counted once. But one can add `--frac` to the `collapse` command to normalize the counts by the number of parents so that the compositionality is retained.
+`classify` always ensures the [**compositionality**](https://en.wikipedia.org/wiki/Compositional_data) of the feature table, in which the frequencies match the numbers of aligned sequences. `collapse` however does not by default: In a one-to-multiple mapping, all parents will be counted once. But one can add `-d` to the `collapse` command to divide the counts by the number of parents so that the compositionality is retained. See [here](#collapse.md#considerations) for a detailed discussion.
+
+However, if the mappings are **unique** (one-to-one), the two methods produce mutually identical results (minor differences may arise during number rounding), and the concern of compositionality is no longer relevant.
+
+Here is a list of currently released primary and secondary mappings and their uniqueness:
+
+Source | Target | Unique
+--- | --- | ---
+WoL protein | UniRef | Yes
+WoL protein | MetaCyc protein | Yes
+UniRef | KEGG ontology (KO) | No
+UniRef | Gene Ontology (GO) | No
+UniRef | OrthoDB | Yes
+UniRef | EggNOG | No
+UniRef | RefSeq | No
+
+Meanwhile, all name files (such as `gene_name.txt` under MetaCyc) are unique.
+
 
 ## Stratified taxonomic / functional classification
 
@@ -257,3 +274,11 @@ woltka classify \
 ```
 
 In the output table, features will be like `Escherichia|K00133`, `Salmonella|K00604`, etc. See [here](stratify.md) for mode details about stratification.
+
+With a stratified taxonomic/functional profile, one may still perform further [collapsing](#method-2) using one of the two systems. For example:
+
+```bash
+woltka tools collapse -f 2 -i ko_by_genus.biom -m $ke/ko-to-go.txt -o go_by_genus.biom
+```
+
+The parameter `-f 2` instructs the program to collapse by the second field of each feature (e.g., `K00133` of `Escherichia|K00133`). See [here](collapse.md#stratification) for details.