Merge pull request #197 from qiyunzhu/filter

Updated documentation
qiyunzhu · Feb 22, 2024 · b65ccb6
2 parents e4adb4c + 0cf80d4
Showing 3 changed files with 52 additions and 17 deletions.
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -1,10 +1,11 @@
 # Change Log
 
-## Ongoing
+## Version 0.1.6 (2/22/2024)
 
 ### Changed
 - Improved performance moderately ([#192](https://github.com/qiyunzhu/woltka/pull/192)).
 - Parameter `--chunk` is now the number of unique query sequences instead of the number of lines ([#192](https://github.com/qiyunzhu/woltka/pull/192)).
+- Updated GitHub Actions workflow.
 
 ### Added
 - Added parameter `-x|--exclude`, which will exclude query sequences that are mapped to given reference sequences (such as host genome, spike-in, vector, etc.) ([#192](https://github.com/qiyunzhu/woltka/pull/192)).

diff --git a/doc/faq.md b/doc/faq.md
@@ -11,10 +11,6 @@ Woltka is **deterministic**. Given the same input files and parameters, it alway
 
 The former. Woltka **exhaustively** captures all valid matches from the alignment file(s).
 
-### Are Woltka results consistent across versions?
-
-To date, all Woltka versions (0.1.0 to 0.1.5) generate **identical** output files given the same setting. Later versions are more efficient and have more features, though.
-
 ### How many CPU cores does Woltka use?
 
 Woltka works the best with **two CPU cores**: one for file decompression and the other for classification. This happens automatically. See [here](perform.md#keep-external-decompressors-on) for details.
@@ -38,20 +34,12 @@ Not out-of-the-box. But you can use SAMtools to extract BAM/CRAM files and direc
 samtools view input.bam | woltka classify -i - -o output.biom
 ```
 
-### Does Woltka support [PAF](https://github.com/lh3/miniasm/blob/master/PAF.md) format?
-
-Not out-of-the-box. But you can use the following AWK trick to convert a PAF file into mock BLAST format and feed into Woltka. There will be no percent identity, e-value or bit score, but Woltka doesn't need them anyway.
-
-```bash
-cat input.paf | awk -v OFS="\t" '{print $1,$6,0,$11,0,0,$3+1,$4,$8+1,$9,0,$12}' | woltka classify -i - -o output.biom
-```
-
 ### I ran `woltka classify -i input.fastq ...`, and got an error saying it cannot determine alignment file format. Why?
 
 Woltka takes alignment files as input, NOT original sequencing data (FASTQ, FASTA, etc.). You need to perform alignment on the sequencing data by yourself, such as:
 
 ```bash
-bowtie2 -x db -f input.fastq -S output.sam
+bowtie2 -x db -U input.fastq -S output.sam
 ```
 
 Then feed the resulting alignment(s) into Woltka.

diff --git a/doc/input.md b/doc/input.md
@@ -11,6 +11,8 @@ Also check out this [guideline](align.md) for sequence alignment generation.
 - [Filename pattern](#filename-pattern)
 - [Sample list](#sample-list)
 - [Demultiplexing](#demultiplexing)
+- [Paired-end reads](#paired-end-reads)
+- [Subject exclusion](#subject-exclusion)
 - [Subject trimming](#subject-trimming)
 
 ## Input filepath
@@ -28,7 +30,6 @@ align/
 
 2\. A **mapping file** of sample ID \<tab\> alignment file path. The paths must point to existing files. They can either be full paths, or simply filenames under the same directory as the mapping file. For example, one can place a `map.txt` of the following content to where alignment files are located.
 
-
 ```
 S01 <tab> S01.sam.gz
 S02 <tab> S02.sam.gz
@@ -57,11 +58,11 @@ Woltka supports the following alignment formats (specified by parameter `--forma
 - `map`: A **simple map** in the format of query \<tab\> subject.
 - `sam`: [**SAM**](https://en.wikipedia.org/wiki/SAM_(file_format)) format. Supported by tools such as Bowtie2, BWA and Minimap2.
 - `paf`: [**PAF**](https://github.com/lh3/miniasm/blob/master/PAF.md) format. Supported by tools such as Miniasm and Minimap2.
-- `b6o`: [**BLAST**](https://www.metagenomics.wiki/tools/blast/blastn-output-format-6) tabular format (i.e., BLAST parameter `-outfmt 6`). Supported by tools such as BLAST, DIAMOND and BURST.
+- `b6o`: [**BLAST**](https://www.metagenomics.wiki/tools/blast/blastn-output-format-6) tabular format (i.e., BLAST parameter `-outfmt 6`). Supported by tools such as BLAST, DIAMOND, MMseqs2 and BURST.
 
 If not specified, Woltka will _automatically_ infer the format of input alignment files.
 
-Other formats may be converted into any of these three formats so that Woltka can parse them. Examples include **BAM**, **CRAM** and **PAF**. Here are example [commands](faq.md#input-files).
+Other formats may be converted into any of these supported formats so that Woltka can parse them. Examples include **BAM** and **CRAM**. Here are example [commands](faq.md#input-files).
 
 Woltka supports and automatically detects common file compression formats including `gzip`, `bzip2` and `xz`. Any input files, including alignment files and database files, can be supplied in any of these three formats. This saves disk space and compute.
 
@@ -120,6 +121,51 @@ woltka classify \
   ...
 ```
 
+## Paired-end reads
+
+If the input alignment files are in SAM format, Woltka automatically extracts the paired-end information, if any, from the SAM flags, and appends it to the query ID as suffix `/1` (forward) or `/2` (reverse). Alignments will be grouped by their paired-end status. Each status under the same query ID will be treated as one query.
+
+For example, the following section of a SAM file:
+
+QNAME | FLAG | RNAME | ...
+--- | --- | --- | ---
+Q1 |  99 | G1 | ...
+Q1 | 147 | G1 | ...
+Q1 | 355 | G2 | ...
+Q1 | 403 | G2 | ...
+
+Will be converted into a mapping of:
+
+QNAME | RNAME | ...
+--- | --- | ---
+Q1/1 | G1 | ...
+Q1/2 | G1 | ...
+Q1/1 | G2 | ...
+Q1/2 | G2 | ...
+
+And be considered as:
+
+- Query "Q1/1" is simultaneously mapped to subjects "G1" and "G2".
+- Query "Q1/2" is simultaneously mapped to subjects "G1" and "G2".
+
+Note: If the query IDs in an alignment file already have `/1` and `/2` suffixes, these suffixes will not be considered by Woltka as paired-end information. Woltka only respects paired-end information coded in the SAM flags.
+
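Reviewer note: the flag-to-suffix conversion described in the new "Paired-end reads" section can be illustrated with a short Python sketch. It is not Woltka's implementation. The function names `mate_suffix` and `group_by_query` are hypothetical, and the only facts relied on are the SAM flag bits `0x40` (first segment) and `0x80` (last segment).

```python
# Illustrative sketch only; not Woltka's actual code.
FIRST_IN_PAIR = 0x40   # SAM FLAG bit: first segment in the template
SECOND_IN_PAIR = 0x80  # SAM FLAG bit: last segment in the template


def mate_suffix(flag: int) -> str:
    """Return '/1', '/2' or '' (unpaired) for a SAM FLAG value."""
    if flag & FIRST_IN_PAIR:
        return "/1"
    if flag & SECOND_IN_PAIR:
        return "/2"
    return ""


def group_by_query(records):
    """Group (qname, flag, rname) records by mate-aware query ID."""
    groups = {}
    for qname, flag, rname in records:
        groups.setdefault(qname + mate_suffix(flag), []).append(rname)
    return groups


# The four alignments from the table in the section above.
records = [
    ("Q1", 99, "G1"),
    ("Q1", 147, "G1"),
    ("Q1", 355, "G2"),
    ("Q1", 403, "G2"),
]
print(group_by_query(records))
# {'Q1/1': ['G1', 'G2'], 'Q1/2': ['G1', 'G2']}
```

Flags 99 and 355 both carry bit `0x40`, and flags 147 and 403 both carry bit `0x80`, which is why each mate ends up mapped to both "G1" and "G2".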
+## Subject exclusion
+
+Parameter `--exclude` or `-x` lets you specify a set of subject IDs to exclude during alignment file parsing. The value can be a list of subject IDs separated by commas, or a text file containing one subject ID per line. As long as one alignment of a query matches a subject in this set, the entire query (and its paired mate, if any) will be discarded from the analysis, regardless of whether it simultaneously matches other subjects that are not excluded. This function is useful for removing reads mapped to certain negative filter sequences in the reference database.
+
+For example, you can include the [bacteriophage phiX174](https://en.wikipedia.org/wiki/Phi_X_174) genome ([NC_001422.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_001422.1)) (a common spike-in control for sequencing experiments) in the database, and filter out any reads that are mapped to it:
+
+```bash
+woltka classify ... -x NC_001422.1
+```
+
+As another example, you can include the human reference genome [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_009914755.1/) in the database, and create a list of its nucleotide accessions (`human.list`). Then you can remove putative human-derived sequences during the analysis with:
+
+```bash
+woltka classify ... -x human.list
+```
+
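Reviewer note: the exclusion rule in the new "Subject exclusion" section (one excluded hit discards the whole query and its mate) can be sketched in Python as follows. This is an assumption-laden illustration rather than Woltka's code: `strip_mate` and `exclude_queries` are hypothetical names, and query IDs are assumed to already carry the `/1` or `/2` suffix derived from SAM flags.

```python
# Illustrative sketch only; not Woltka's actual code.
def strip_mate(query: str) -> str:
    """Drop a trailing '/1' or '/2' mate suffix, if present."""
    return query[:-2] if query.endswith(("/1", "/2")) else query


def exclude_queries(mapping, excluded):
    """Drop every query (and its mate) that hits an excluded subject.

    mapping:  dict of query ID -> list of subject IDs
    excluded: set of subject IDs to filter against
    """
    # Templates (query IDs without mate suffix) with at least one excluded hit.
    tainted = {
        strip_mate(query)
        for query, subjects in mapping.items()
        if excluded.intersection(subjects)
    }
    return {
        query: subjects
        for query, subjects in mapping.items()
        if strip_mate(query) not in tainted
    }


mapping = {
    "Q1/1": ["G1", "NC_001422.1"],  # also hits phiX174, so Q1 is discarded
    "Q1/2": ["G1"],                 # mate of a discarded query
    "Q2/1": ["G2"],
}
print(exclude_queries(mapping, {"NC_001422.1"}))
# {'Q2/1': ['G2']}
```

Note that "Q1/2" is removed even though it never hit the excluded subject itself, and "Q1/1" is removed even though it also matched the non-excluded subject "G1".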
 ## Subject trimming
 
 The parameter `--trim-sub <delim>` lets Woltka trim subject IDs at the last occurrence of the given delimiter (default: "_"). For example: