kmtricks pipeline

Téo Lemane edited this page Apr 9, 2023 · 11 revisions

kmtricks pipeline runs the kmtricks modules in sequence: repart -> superk -> count -> merge -> format.

The pipeline can be stopped after a specific step with --until <step>.
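As a rough illustration of a partial run, the sketch below assembles a kmtricks pipeline command line from Python without executing it. It uses only flags and value ranges listed in the usage text of this page; the helper name pipeline_argv and the paths samples.fof and run_dir are hypothetical placeholders, not part of kmtricks.

```python
import shlex
from typing import List

# Step names accepted by --until, as listed in the usage text of this page.
STEPS = ("all", "repart", "superk", "count", "merge", "format")


def pipeline_argv(fof: str, run_dir: str, until: str = "all",
                  kmer_size: int = 31) -> List[str]:
    """Build (but do not run) a `kmtricks pipeline` command line.

    Hypothetical helper: it only checks the values documented on this page.
    """
    if until not in STEPS:
        raise ValueError(f"--until must be one of {STEPS}, got {until!r}")
    if not 8 <= kmer_size <= 127:
        raise ValueError("--kmer-size must be in [8, 127]")
    return [
        "kmtricks", "pipeline",
        "--file", fof,
        "--run-dir", run_dir,
        "--kmer-size", str(kmer_size),
        "--until", until,
    ]


# Placeholder paths; stop after the counting step.
argv = pipeline_argv("samples.fof", "run_dir", until="count")
print(shlex.join(argv))
```

The resulting list can be passed to subprocess.run(argv, check=True) on a machine where kmtricks is installed.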

Overview

Examples are provided here.

Complete usage

kmtricks pipeline v1.4.0

DESCRIPTION
  kmtricks pipeline (run all the steps, repart -> superk -> count -> merge -> format)

USAGE
  kmtricks pipeline --file <FILE> --run-dir <DIR> [--kmer-size <INT>] [--hard-min <INT>] 
                    [--mode <MODE:FORMAT:OUT>] [--repart-from <STR>] 
                    [--soft-min <INT/STR/FLOAT>] [--recurrence-min <INT>] 
                    [--share-min <INT>] [--until <STR>] [--minimizer-size <INT>] 
                    [--minimizer-type <INT>] [--repartition-type <INT>] 
                    [--nb-partitions <INT>] [--restrict-to <FLOAT>] 
                    [--restrict-to-list <STR>] [--focus <FLOAT>] [--bloom-size <INT>] 
                    [--bf-format <STR>] [--bitw <INT>] [-t/--threads <INT>] 
                    [-v/--verbose <STR>] [--hist] [--kff-output] [--keep-tmp] [--skip-merge] 
                    [--cpr] [-h/--help] [--version] 

OPTIONS
  [global]
     --file        - kmtricks input file, see README.md. 
     --run-dir     - kmtricks runtime directory. 
     --kmer-size   - size of a k-mer. [8, 127]. {31}
     --hard-min    - min abundance to keep a k-mer. {2}
     --mode        - matrix mode <mode:format:out>, see README {kmer:count:bin}
     --hist        - compute k-mer histograms. [⚑]
     --kff-output  - output counted k-mers in kff format (only with --until count). [⚑]
     --keep-tmp    - keep tmp files. [⚑]
     --repart-from - use repartition from another kmtricks run. 

  [merge options]
     --soft-min       - during merge, min abundance to keep a k-mer, see README. {1}
     --recurrence-min - min recurrence to keep a k-mer. {1}
     --share-min      - save a non-solid k-mer if it is solid in N other samples. {0}

  [pipeline control]
     --until      - run until [all|repart|superk|count|merge|format] {all}
     --skip-merge - skip merge step, only with --mode hash:bft:bin. [⚑]

  [advanced performance tweaks]
     --minimizer-size   - size of minimizers. [4, 15] {10}
     --minimizer-type   - minimizer type (0=lexi, 1=freq). {0}
     --repartition-type - minimizer repartition (0=unordered, 1=ordered). {0}
     --nb-partitions    - number of partitions (0=auto). {0}
     --restrict-to      - Process only a fraction of partitions. [0.05, 1.0] {1.0}
     --restrict-to-list - Process only some partitions, comma separated. 
     --focus            - 0: focus on disk usage, 1: focus on speed. [0.0, 1.0] {0.5}
     --cpr              - compression for kmtricks's tmp files. [⚑]

  [hash mode configuration]
     --bloom-size - bloom filter size {10000000}
     --bf-format  - bloom filter format. [howdesbt|sdsl] {howdesbt}
     --bitw       - entry width of cbf, with --mode hash:bfc:bin {2}

  [common]
    -t --threads - number of threads. {12}
    -h --help    - show this message and exit. [⚑]
       --version - show version and exit. [⚑]
    -v --verbose - verbosity level [debug|info|warning|error]. {info}
  • --mode <mode:format:out>:

    • kmer:count:bin -> k-mer count matrix
    • kmer:count:text
    • kmer:pa:bin -> k-mer presence/absence matrix
    • kmer:pa:text
    • hash:count:bin -> hash count matrix
    • hash:count:text
    • hash:pa:bin -> hash presence/absence matrix
    • hash:pa:text
    • hash:bf:bin -> Bloom filter matrix (column-major)
    • hash:bft:bin -> Bloom filter matrix (row-major)
  • --soft-min <INT/STR/FLOAT>:

    • All k-mers with an abundance between hard-min and soft-min are considered rescue-able. See kmtricks rescue.
    • <STR>: path to a file containing one threshold per line, in the same order as the samples in the input fof.
    • <FLOAT>: a threshold T is computed for each sample such that the number of k-mers occurring T times is smaller than <FLOAT> x nb_kmers.
    • <INT>: same threshold for all samples.
  • --recurrence-min <INT>: All k-mers/hashes that do not occur in at least recurrence-min sample(s) are discarded.

  • --share-min <INT>: If a k-mer/hash is rescue-able, it is kept if it is solid (with an abundance greater than soft-min) in at least share-min other sample(s).

  • --kff-output: Supported only with --until count in k-mer mode.

  • --skip-merge: Skip the merge step when using --mode hash:bft:bin and rescue is not needed.
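The interaction between the merge thresholds described above can be sketched in a few lines of Python. This is an illustration of one reading of the option descriptions, not kmtricks code. It rests on several assumptions: a k-mer is solid in a sample when its abundance is at least soft-min, recurrence counts solid samples only, a share-min of 0 disables rescue, nb_kmers is the sum of the histogram, and the smallest qualifying T is chosen for a <FLOAT> soft-min. The function names are hypothetical.

```python
from typing import Dict, List, Optional


def soft_min_from_histogram(hist: Dict[int, int], fraction: float) -> int:
    """Derive a per-sample threshold T from an abundance histogram.

    hist maps an abundance to the number of k-mers with that abundance.
    Assumption: T is the smallest abundance whose k-mer count is
    smaller than fraction * nb_kmers.
    """
    nb_kmers = sum(hist.values())
    if fraction <= 0 or nb_kmers == 0:
        raise ValueError("need a positive fraction and a non-empty histogram")
    limit = fraction * nb_kmers
    t = 1
    while hist.get(t, 0) >= limit:
        t += 1
    return t


def merge_row(counts: List[int], soft_min: List[int], hard_min: int = 2,
              recurrence_min: int = 1, share_min: int = 0) -> Optional[List[int]]:
    """Filter one k-mer's abundances across samples.

    Returns the filtered row, or None when the k-mer is discarded.
    """
    # Abundances below hard-min are already dropped at counting time.
    counts = [c if c >= hard_min else 0 for c in counts]
    # Assumption: solid means abundance >= the sample's soft-min.
    solid = [c > 0 and c >= s for c, s in zip(counts, soft_min)]
    n_solid = sum(solid)
    if n_solid < recurrence_min:
        return None
    row = []
    for c, is_solid in zip(counts, solid):
        # A non-solid abundance is rescued when enough other samples are solid.
        rescued = c > 0 and share_min > 0 and n_solid >= share_min
        row.append(c if is_solid or rescued else 0)
    return row


# 100 k-mers in total; with a fraction of 0.2 the limit is 20 k-mers.
print(soft_min_from_histogram({1: 60, 2: 25, 3: 10, 4: 5}, 0.2))
# Four samples, same soft-min of 4 everywhere, rescue if solid in 2 others.
print(merge_row([1, 3, 5, 10], [4, 4, 4, 4], share_min=2))
```

In the second call the abundance 1 falls below hard-min and is dropped, while the abundance 3 is rescue-able and kept because the k-mer is solid in two other samples.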

Outputs

Depending on the parameters, kmtricks can output many different files; a complete description is provided here. To work with kmtricks's output files, see kmtricks dump, kmtricks aggregate and API.