Merge pull request #197 from qiyunzhu/filter
Updated documentation
qiyunzhu committed Feb 22, 2024
2 parents e4adb4c + 0cf80d4 commit b65ccb6
Showing 3 changed files with 52 additions and 17 deletions.
3 changes: 2 additions & 1 deletion CHANGELOG.md
@@ -1,10 +1,11 @@
# Change Log

## Ongoing
## Version 0.1.6 (2/22/2024)

### Changed
- Improved performance moderately ([#192](https://github.com/qiyunzhu/woltka/pull/192)).
- Parameter `--chunk` is now the number of unique query sequences instead of the number of lines ([#192](https://github.com/qiyunzhu/woltka/pull/192)).
- Updated GitHub Actions workflow.

### Added
- Added parameter `-x|--exclude`, which will exclude query sequences that are mapped to given reference sequences (such as host genome, spike-in, vector, etc.) ([#192](https://github.com/qiyunzhu/woltka/pull/192)).
14 changes: 1 addition & 13 deletions doc/faq.md
@@ -11,10 +11,6 @@ Woltka is **deterministic**. Given the same input files and parameters, it alway

The former. Woltka **exhaustively** captures all valid matches from the alignment file(s).

### Are Woltka results consistent across versions?

To date, all Woltka versions (0.1.0 to 0.1.5) generate **identical** output files given the same setting. Later versions are more efficient and have more features, though.

### How many CPU cores does Woltka use?

Woltka works best with **two CPU cores**: one for file decompression and the other for classification. This happens automatically. See [here](perform.md#keep-external-decompressors-on) for details.
@@ -38,20 +34,12 @@ Not out-of-the-box. But you can use SAMtools to extract BAM/CRAM files and direc
samtools view input.bam | woltka classify -i - -o output.biom
```

### Does Woltka support [PAF](https://github.com/lh3/miniasm/blob/master/PAF.md) format?

Not out-of-the-box. But you can use the following AWK trick to convert a PAF file into mock BLAST format and feed into Woltka. There will be no percent identity, e-value or bit score, but Woltka doesn't need them anyway.

```bash
cat input.paf | awk -v OFS="\t" '{print $1,$6,0,$11,0,0,$3+1,$4,$8+1,$9,0,$12}' | woltka classify -i - -o output.biom
```

### I ran `woltka classify -i input.fastq ...`, and got an error saying it cannot determine alignment file format. Why?

Woltka takes alignment files as input, NOT original sequencing data (FASTQ, FASTA, etc.). You need to perform alignment on the sequencing data by yourself, such as:

```bash
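# Superseded form: -f declares FASTA input, and the read file is not passed via -U.
bowtie2 -x db -f input.fastq -S output.sam
# Current form: -U passes the unpaired reads; FASTQ is Bowtie2's default read format.
bowtie2 -x db -U input.fastq -S output.sam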
```

Then feed the resulting alignment(s) into Woltka.
52 changes: 49 additions & 3 deletions doc/input.md
@@ -11,6 +11,8 @@ Also check out this [guideline](align.md) for sequence alignment generation.
- [Filename pattern](#filename-pattern)
- [Sample list](#sample-list)
- [Demultiplexing](#demultiplexing)
- [Paired-end reads](#paired-end-reads)
- [Subject exclusion](#subject-exclusion)
- [Subject trimming](#subject-trimming)

## Input filepath
@@ -28,7 +30,6 @@ align/

2\. A **mapping file** of sample ID \<tab\> alignment file path. The paths must point to existing files. They can either be full paths, or simply filenames under the same directory as the mapping file. For example, one can place a `map.txt` of the following content to where alignment files are located.


```
S01 <tab> S01.sam.gz
S02 <tab> S02.sam.gz
@@ -57,11 +58,11 @@ Woltka supports the following alignment formats (specified by parameter `--forma
- `map`: A **simple map** in the format of query \<tab\> subject.
- `sam`: [**SAM**](https://en.wikipedia.org/wiki/SAM_(file_format)) format. Supported by tools such as Bowtie2, BWA and Minimap2.
- `paf`: [**PAF**](https://github.com/lh3/miniasm/blob/master/PAF.md) format. Supported by tools such as Miniasm and Minimap2.
- `b6o`: [**BLAST**](https://www.metagenomics.wiki/tools/blast/blastn-output-format-6) tabular format (i.e., BLAST parameter `-outfmt 6`). Supported by tools such as BLAST, DIAMOND and BURST.
- `b6o`: [**BLAST**](https://www.metagenomics.wiki/tools/blast/blastn-output-format-6) tabular format (i.e., BLAST parameter `-outfmt 6`). Supported by tools such as BLAST, DIAMOND, MMseqs2 and BURST.

If not specified, Woltka will _automatically_ infer the format of input alignment files.

Other formats may be converted into any of these three formats so that Woltka can parse them. Examples include **BAM**, **CRAM** and **PAF**. Here are example [commands](faq.md#input-files).
Other formats may be converted into any of the supported formats so that Woltka can parse them. Examples include **BAM** and **CRAM**. Here are example [commands](faq.md#input-files).

Woltka supports and automatically detects common file compression formats including `gzip`, `bzip2` and `xz`. Any input files, including alignment files and database files, can be supplied in any of these three formats. This saves disk space and compute.

@@ -120,6 +121,51 @@ woltka classify \
...
```

## Paired-end reads

If the input alignment files are in SAM format, Woltka automatically extracts the paired-end information, if any, from the SAM flags, and appends it to the query ID as suffix `/1` (forward) or `/2` (reverse). Alignments will be grouped by their paired-end status. Each status under the same query ID will be treated as one query.

For example, the following section of a SAM file:

QNAME | FLAG | RNAME | ...
--- | --- | --- | ---
Q1 | 99 | G1 | ...
Q1 | 147 | G1 | ...
Q1 | 355 | G2 | ...
Q1 | 403 | G2 | ...

Will be converted into a mapping of:

QNAME | RNAME | ...
--- | --- | ---
Q1/1 | G1 | ...
Q1/2 | G1 | ...
Q1/1 | G2 | ...
Q1/2 | G2 | ...

And be considered as:

- Query "Q1/1" is simultaneously mapped to subjects "G1" and "G2".
- Query "Q1/2" is simultaneously mapped to subjects "G1" and "G2".

Note: If the query IDs in an alignment file already have `/1` and `/2` suffixes, these suffixes will not be considered by Woltka as paired-end information. Woltka only respects paired-end information coded in the SAM flags.
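
The flag-to-suffix mapping in the table above can be reproduced with a few lines of shell. This is only a sketch of the rule as described here, not Woltka's own code: the function name `mate_suffix` is hypothetical, and it relies on the SAM specification's FLAG bits `0x40` (first segment) and `0x80` (last segment). How Woltka treats a record with both bits or neither bit set is not stated in this document, so the sketch simply checks `0x40` first and prints an empty suffix when neither is set.

```bash
# Hypothetical helper (not part of Woltka): print the paired-end suffix
# implied by a SAM FLAG value.
mate_suffix() {
  local flag=$1
  if (( flag & 0x40 )); then      # bit 0x40: first segment in the template
    printf '/1\n'
  elif (( flag & 0x80 )); then    # bit 0x80: last segment in the template
    printf '/2\n'
  else
    printf '\n'                   # no paired-end information in the flag
  fi
}

# The four flags from the example table above.
for flag in 99 147 355 403; do
  printf '%s -> Q1%s\n' "$flag" "$(mate_suffix "$flag")"
done
# 99 -> Q1/1
# 147 -> Q1/2
# 355 -> Q1/1
# 403 -> Q1/2
```

Flags 355 and 403 differ from 99 and 147 only by bit `0x100` (secondary alignment), which is why they map to the same suffixes.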

## Subject exclusion

Parameter `--exclude` or `-x` lets you specify a set of subject IDs to exclude during alignment file parsing. The value can be a comma-separated list of subject IDs, or a text file containing one subject ID per line. As long as one alignment of a query matches a subject in this set, the entire query (and its paired mate, if any) will be discarded from the analysis, regardless of whether it simultaneously matches other subjects that are not excluded. This function is useful for removing reads mapped to certain negative filter sequences in the reference database.

For example, you can include the [bacteriophage phiX174](https://en.wikipedia.org/wiki/Phi_X_174) genome ([NC_001422.1](https://www.ncbi.nlm.nih.gov/nuccore/NC_001422.1)) (a common spike-in control for sequencing experiments) in the database, and filter out any reads that are mapped to it:

```bash
woltka classify ... -x NC_001422.1
```
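
The all-or-nothing behavior described in this section can be illustrated on a simple query \<tab\> subject map. The following is a minimal sketch under stated assumptions, not Woltka's implementation: `exclude_queries` is a hypothetical helper, the input is assumed to be a two-column map whose query IDs already carry the `/1` and `/2` suffixes, and mates are matched by stripping that suffix.

```bash
# Hypothetical helper (not part of Woltka).
# $1 = file with one excluded subject ID per line
# $2 = query<TAB>subject map
exclude_queries() {
  local list=$1 map=$2
  awk -F'\t' '
    # Pass 1: load the excluded subject IDs.
    pass == 1 { excluded[$1] = 1; next }
    # Pass 2: mark every query (mate suffix stripped) that hits an excluded subject.
    pass == 2 { q = $1; sub(/\/[12]$/, "", q); if ($2 in excluded) dropped[q] = 1; next }
    # Pass 3: print only alignments whose query was never marked.
    pass == 3 { q = $1; sub(/\/[12]$/, "", q); if (!(q in dropped)) print }
  ' pass=1 "$list" pass=2 "$map" pass=3 "$map"
}

# Toy data: Q1 hits both G1 and phiX, Q2 hits only G2, Q3 hits only phiX.
demo_dir=$(mktemp -d)
printf 'Q1/1\tG1\nQ1/2\tG1\nQ1/1\tNC_001422.1\nQ2/1\tG2\nQ3/1\tNC_001422.1\n' > "$demo_dir/map.tsv"
printf 'NC_001422.1\n' > "$demo_dir/phix.list"
exclude_queries "$demo_dir/phix.list" "$demo_dir/map.tsv"
# Q2/1	G2
rm -r "$demo_dir"
```

Note that Q1 is dropped entirely, including its mate `Q1/2` and its valid hits to G1, because one of its alignments matches the excluded subject.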

As another example, you can include the human reference genome [T2T-CHM13v2.0](https://www.ncbi.nlm.nih.gov/datasets/genome/GCF_009914755.1/) in the database, and create a list of its nucleotide accessions (`human.list`). Then you can remove putative human-derived sequences during the analysis with:

```bash
woltka classify ... -x human.list
```
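
One possible way to build such a list is to pull the sequence IDs out of the FASTA headers of the genome file. The sketch below assumes that the subject IDs in your alignments are the first whitespace-delimited token of each FASTA header, which depends on how the database was built. The function name `fasta_ids` and the file name `human.fna` are hypothetical, and the demo headers are made up.

```bash
# Hypothetical helper: read FASTA from stdin, print one sequence ID per line
# (the header text after ">" up to the first whitespace).
fasta_ids() {
  sed -n 's/^>\([^[:space:]]*\).*/\1/p'
}

# Demo with two made-up records.
printf '>seqA.1 example chromosome 1\nACGT\n>seqB.1 example chromosome 2\nTTGA\n' | fasta_ids
# seqA.1
# seqB.1

# With a real genome file (name assumed), the list would be created like this:
#   fasta_ids < human.fna > human.list
```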

## Subject trimming

The parameter `--trim-sub <delim>` lets Woltka trim subject IDs at the last occurrence of the given delimiter (default: "_"). For example:
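
As a rough shell illustration of "trim at the last occurrence of the delimiter" (a sketch of the rule as stated, not Woltka's code), the hypothetical helper `trim_sub` below uses Bash suffix removal. The IDs are made up, and leaving an ID without the delimiter unchanged is an assumption of this sketch.

```bash
# Hypothetical helper: drop everything from the last occurrence of the
# delimiter onward. $1 = subject ID, $2 = delimiter (defaults to "_").
trim_sub() {
  local id=$1 delim=${2:-_}
  # "%" removes the shortest matching suffix, i.e. the part that starts
  # at the LAST occurrence of the delimiter.
  printf '%s\n' "${id%"$delim"*}"
}

trim_sub G1_5        # G1
trim_sub A_B_C       # A_B
trim_sub G1.2 .      # G1
trim_sub G1          # G1 (no delimiter, left unchanged in this sketch)
```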