Skip to content

Commit

Permalink
added documentation about mapping uniqueness
Browse files Browse the repository at this point in the history
  • Loading branch information
qiyunzhu committed Aug 2, 2021
1 parent ac0823f commit 7dfb4ec
Show file tree
Hide file tree
Showing 4 changed files with 74 additions and 14 deletions.
26 changes: 18 additions & 8 deletions doc/classify.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,8 +6,9 @@ Woltka grants users the flexibility to control the classification criteria, in a
## Contents

- [Target rank (or no rank)](#target-rank-or-no-rank)
- [Rank specification](#rank-specification)
- [Ambiguous assignment](#ambiguous-assignment)
- ["Unassigned" sequences](#"unassigned"-sequences)
- ["Unassigned" sequences](#unassigned-sequences)


## Target rank (or no rank)
Expand All @@ -16,7 +17,7 @@ Woltka features the following modes of classification, as controlled by the `--r

### 1. **No** classification (`--rank none`)

Simply report subject IDs. A classification system is not required in this analysis. This mode has the highest granularity and is useful in e.g. the [OGU](ogu.md) analysis.
Simply report subject IDs. A classification system is not required in this analysis. This mode has the highest granularity and is useful in e.g. the [OGU analysis](ogu.md).

### 2. **Free-rank** classification (`--rank free`)

Expand All @@ -36,6 +37,9 @@ Attempt to classify at the given rank, and when this is not possible, go up in h

The difference from free-rank classification (2) is that, it discards all units below the given rank, even if some of them may describe the query sequence.


## Rank specification

### Multiple ranks

Multiple ranks can be specified simultaneously, delimited by comma (e.g., `--rank none,free,phylum,genus,species`), in which case Woltka will generate one [profile](output.md) for each rank. This is significantly faster than running Woltka multiple times on individual ranks.
Expand All @@ -47,13 +51,13 @@ One can omit the `--rank` parameter. In such case, Woltka automatically performs

## Ambiguous assignment

In many cases a query sequence has matches in multiple reference sequences, and those sequences may have have different assignments at the rank you instruct Woltka to classify at. Woltka deals with this situation using your choice of the following mechanisms.
In many cases a query sequence has matches in multiple reference sequences, and those sequences may have different assignments at the target rank. Woltka deals with this situation using your choice of the following mechanisms.

### 1. (Default mode) keep them all, and normalize
### 1. Keep them all, and divide (default)

In the resulting profile, each subject receives 1 / _k_ count, where _k_ is the total number of subjects of the current query.

For example, sequence A was aligned to five genomes: two under genus [_Escherichia_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=561&lvl=3&lin=f&keep=1&srchmode=1&unlock), one under each of genera [_Salmonella_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=590&lvl=3&lin=f&keep=1&srchmode=1&unlock), [_Klebsiella_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=570&lvl=3&lin=f&keep=1&srchmode=1&unlock), and [_Enterobacter_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=547&lvl=3&lin=f&keep=1&srchmode=1&unlock). So at **genus** level, _Escherichia_ receives 2/5 count, and each of the other three receives 1/5 each.
For example, sequence A was aligned to five genomes: two under genus [_Escherichia_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=561&lvl=3&lin=f&keep=1&srchmode=1&unlock), one under each of genera [_Salmonella_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=590&lvl=3&lin=f&keep=1&srchmode=1&unlock), [_Klebsiella_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=570&lvl=3&lin=f&keep=1&srchmode=1&unlock), and [_Enterobacter_](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi?mode=Info&id=547&lvl=3&lin=f&keep=1&srchmode=1&unlock). So at the **genus** level, _Escherichia_ receives 2/5 count, and each of the other three receives 1/5 each.

In the read-to-feature maps (`--outmap`), a multi-assignment will be reported as:

Expand Down Expand Up @@ -86,18 +90,24 @@ Note: Majority rule overrides the `--above` flag. Currently, Woltka cannot combi

## "Unassigned" sequences

With flag `--unassigned`, Woltka reports unassigned sequences in the profile and the feature map. They will be marked as "Unassigned".
With flag `--unassigned`, Woltka reports unassigned sequences in the profile and the feature map. They will be marked as "**Unassigned**".

A sequence is deemed unassigned because of one of the following reasons:

1. The subject(s) is **not found** in the classification system. For flexibility, Woltka does NOT consider this as a conflict between data and database, and it does NOT halt the program and warn the user. Instead, it is treated as unassigned.

2. The LCA (see above) of subjects is the **root**. A [tree-structured](hierarchy.md) classification hierarchy always have a root, and no matter how diverse subjects are, they always coalesce to the root eventually. But reporting "root" as an assignment is meaningless. So it will be considered as unassigned.
2. The LCA (see above) of the subjects is the **root**. A [tree-structured](hierarchy.md) classification hierarchy always has a root. No matter how diverse the subjects are, they always coalesce to the root eventually. But reporting "root" as an assignment is meaningless. So it will be considered as unassigned.

3. In unique-assignment mode, assignments of subjects are not unique (see above).

4. In majority-rule mode, none of the candidate units reaches the threshold (see above).

### Considerations

With the unassigned part reported, the sum of feature counts in each sample of the resulting profile equals to the number of query sequences in the original alignment file. This property is potentially useful in certain downstream analyses.

Even if this function was not used in generating the profile, one can still manually calculate the unassigned part by subtracting the sum of feature counts from the number of sequences.

[**Important**] The "unassigned" part represent query sequences that were **aligned** to one or more **subjects**, but Woltka cannot find a suitable assignment based on those subjects. Therefore, assigned + unassigned is NOT the entire sample, but only the part of sample that are found in the alignment.

In [ordinal mapping](ordinal.md), "subjects" are **genes** instead of genomes, therefore despite that some query sequences can be aligned to one or more genomes, they can still be excluded from the "unassigned" part if their coordinates do not match any gene.
In [ordinal mapping](ordinal.md), subjects are **genes** instead of genomes, therefore despite that some query sequences can be aligned to one or more genomes, they can still be excluded from the unassigned part if their coordinates do not match any gene.
31 changes: 28 additions & 3 deletions doc/collapse.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,6 +6,17 @@ The **profile collapsing** function is a lightweight and flexible addition to th
woltka tools collapse -i input.biom -m mapping.txt -o output.biom
```


## Contents

- [Use cases](#use-cases)
- [Mapping file format](#mapping-file-format)
- [Parameters](#parameters)
- [Considerations](#considerations)


## Use cases

With this tool one can achieve the following goals:

1. Translate feature IDs into names or descriptions.
Expand Down Expand Up @@ -55,11 +66,11 @@ source4 <tab> target3

## Parameters

### Normalization
### Division

By default, if one source feature is simultaneously mapped to _k_ targets, each target will be counted once. With the `--frac` or `-f` flag added to the command, each target will be counted 1 / _k_ times.
By default, if one source feature is simultaneously mapped to _k_ targets, each target will be counted once. With the `--divide` or `-d` flag added to the command, each target will be counted 1 / _k_ times.

Whether to enable normalization depends on the nature and aim of your analysis. For example, one gene is involved in two pathways (which isn't uncommon), should each pathway be counted once, or half time?
Whether to enable division depends on the nature and aim of the analysis. For example, one gene is involved in two metabolic pathways (which isn't uncommon), should each pathway be counted once, or half time?

### Stratification

Expand All @@ -68,3 +79,17 @@ Woltka supports collapsing a [stratified](stratify.md) profile using one field i
### Feature names

Once a profile is collapsed, the metadata of the source features ("Name", "Rank", and "Lineage") will be discarded. One may choose to supply a target feature name file by `--names` or `-n`, which will instruct the program to append names to the profile as a metadata column ("Name").


## Considerations

It is important to highlight that one-to-many mapping may change some of the
underlying statistical assumptions of downstream analyses.

In the default mode, because one source may be collapsed into multiple targets, the total feature count per sample may be inflated, and the relative abundance of each feature may no longer correspond to that of the sequences assigned to it. In another word, this breaks the [compositionality](https://en.wikipedia.org/wiki/Compositional_data) of the data.

How significantly this may impact an analysis depends on the relative frequency of multiple mappings found in the data, the biological relevance of the affected features, and the statistical nature of the analysis.

For example, in the `reaction-to-ec.txt` file under [MetaCyc](metacyc.md), 80 out of 3618 (2.2%) reactions have more than one corresponding EC number. Whether such a translation may be considered as unique (and whether the resulting table is still compositional) is a call of the user.

A solution to this is to turn on the [division](#division) flag (`-d`). This guarantees that the sum of feature counts remains the same after collapsing. But one should consider the biological implication before making a decision (see [above](#division)).
2 changes: 1 addition & 1 deletion doc/metacyc.md
Original file line number Diff line number Diff line change
Expand Up @@ -86,7 +86,7 @@ woltka tools collapse -i pathway.biom -m metacyc/pathway-to-super_pathway.txt -n
woltka tools collapse -i super_pathway.biom -m metacyc/pathway_type.txt -n metacyc/all_class_name.txt -o pathway_type.biom
```

The collapse command supports **many-to-many** mapping. For example, if one reaction is found in three pathways, each pathway will be counted **once**. In some instances (e.g., to retain compositionality of the profile), one may consider adding the `--frac` flag, which will instruct the program to count each pathway 1 / 3 times ([see details](collapse.md)).
The collapse command supports **many-to-many** mapping. For example, if one reaction is found in three pathways, each pathway will be counted **once**. In some instances (e.g., to retain compositionality of the profile), one may consider adding the `--divide` flag, which will instruct the program to count each pathway 1 / 3 times ([see details](collapse.md)).


## Pathway coverage
Expand Down
29 changes: 27 additions & 2 deletions doc/wol.md
Original file line number Diff line number Diff line change
Expand Up @@ -216,11 +216,28 @@ So on so forth. See [here](metacyc.md) for a graph of all available collapsing d

### Comparison

**Important**: The differences between method 1 (`classify`) and method 2 (`collapse`) are:
**Important**: The differences between [method 1](#method-1) (`classify`) and [method 2](#method-2) (`collapse`) are:

`classify` only supports a tree structure, in which one child unit has exactly one parent unit. This is typical in taxonomic classification. If multiple parents are present, all but the first parent will be discarded. In contrast, `collapse` supports **one-to-multiple** mappings, therefore it is more suitable when this is the norm instead of exception, especially in functional classification (where one gene can be involved in multiple metabolic pathways).

`classify` always ensures the **compositionality** of the feature table, in which the frequencies match the numbers of aligned sequences. `collapse` however does not by default. In a one-to-multiple mapping, all parents will be counted once. But one can add `--frac` to the `collapse` command to normalize the counts by the number of parents so that the compositionality is retained.
`classify` always ensures the [**compositionality**](https://en.wikipedia.org/wiki/Compositional_data) of the feature table, in which the frequencies match the numbers of aligned sequences. `collapse` however does not by default: In a one-to-multiple mapping, all parents will be counted once. But one can add `-d` to the `collapse` command to divide the counts by the number of parents so that the compositionality is retained. See [here](#collapse.md#considerations) for a detailed discussion.

However, if the mappings are **unique** (one-to-one), the two methods produce mutually identical results (minor differences may arise during number rounding), and the concern of compositionality is no longer relevant.

Here is a list of currently released primary and secondary mappings and their uniqueness:

Source | Target | Unique
--- | --- | ---
WoL protein | UniRef | Yes
WoL protein | MetaCyc protein | Yes
UniRef | KEGG ontology (KO) | No
UniRef | Gene Ontology (GO) | No
UniRef | OrthoDB | Yes
UniRef | EggNOG | No
UniRef | RefSeq | No

Meanwhile, all name files (such as `gene_name.txt` under MetaCyc) are unique.


## Stratified taxonomic / functional classification

Expand Down Expand Up @@ -257,3 +274,11 @@ woltka classify \
```

In the output table, features will be like `Escherichia|K00133`, `Salmonella|K00604`, etc. See [here](stratify.md) for mode details about stratification.

With a stratified taxonomic/functional profile, one may still perform further [collapsing](#method-2) using one of the two systems. For example:

```bash
woltka tools collapse -f 2 -i ko_by_genus.biom -m $ke/ko-to-go.txt -o go_by_genus.biom
```

The parameter `-f 2` instructs the program to collapse by the second field of each feature (e.g., `K00133` of `Escherichia|K00133`). See [here](collapse.md#stratification) for details.

0 comments on commit 7dfb4ec

Please sign in to comment.