Kssd is a command-line tool for large-scale sequence sketching and for resemblance and containment analysis. It sketches sequences by k-mer substring space sampling/shuffling; see the Methods section of our Genome Biology paper (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02303-4) for how it works. It handles DNA sequences in either fasta or fastq format, gzipped or not. Kssd runs on Linux; macOS and Windows are not currently supported.
- Installation
- Quick Tutorial
- For Advanced Users
- How to cite
git clone https://github.com/yhg926/public_kssd.git &&
cd public_kssd &&
make
cd test_fna;
#sketch & index references
../kssd dist -L ../shuf_file/L3K10.shuf -o reference ./seqs1
../kssd dist -o reference reference
#sketch queries
../kssd dist -L ../shuf_file/L3K10.shuf -o query ./seqs2
#Search queries against references db
../kssd dist -r reference -o distout query
# or you can compute the pairwise distance of references
../kssd dist -r reference -o distout reference
The columns of the output file "distance.out" are explained below (equation numbers refer to the paper listed under How to cite):
Column | Explanation |
---|---|
Qry | query |
Ref | reference |
Shared_k\|Ref_s\|Qry_s | number of shared k-mers between the reference and query sketches \| reference sketch size \| query sketch size |
Jaccard | Jaccard-coefficient (Eq. 2) |
MashD | mash distance (Eq. 4) |
ContainmentM | containment-measurement(Eq. 3) |
AafD | aaf-distance (Eq. 5) |
Jaccard_CI | 0.95 confidence interval for Jaccard-coefficient |
MashD_CI | 0.95 confidence interval for mash distance |
ContainmentM_CI | 0.95 confidence interval for containment-measurement |
AafD_CI | 0.95 confidence interval for aaf-distance |
P-value(J) | p-value for Jaccard-coefficient(Eq. 14) |
P-value(C) | p-value for containment-measurement(Eq. 14) |
FDR(J) | false discovery rate for Jaccard-coefficient |
FDR(C) | false discovery rate for containment-measurement |
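
As a rough illustration of how the four similarity columns relate to the three counts in Shared_k|Ref_s|Qry_s, the sketch below recomputes them with awk. It assumes the standard definitions: Jaccard as shared over union, containment as shared over the query sketch size, Mash distance as ln((1+J)/(2J))/k, and AAF distance as ln(1/C)/k, with k the full k-mer length. The function name sketch_metrics and the input numbers are made up for the example, and the direction of the containment ratio is an assumption; the paper's Eqs. 2-5 remain the authoritative definitions.

```shell
# Hypothetical helper: derive similarity metrics from sketch counts.
# Assumes non-empty sketches; formulas are the standard ones, not read from kssd's source.
sketch_metrics() {
    # $1 = shared k-mers, $2 = reference sketch size, $3 = query sketch size, $4 = full k-mer length
    awk -v s="$1" -v r="$2" -v q="$3" -v k="$4" 'BEGIN {
        j = s / (r + q - s)                              # Jaccard coefficient
        c = s / q                                        # containment (assumed: relative to the query)
        md = (j > 0) ? log((1 + j) / (2 * j)) / k : 1    # Mash distance, capped at 1 when nothing is shared
        ad = (c > 0) ? log(1 / c) / k : 1                # AAF distance, capped at 1 when nothing is shared
        printf "%.4f %.4f %.4f %.4f\n", j, c, md, ad
    }'
}

# 500 shared k-mers, two sketches of 1000 k-mers each, k = 20 (that is, -k 10)
sketch_metrics 500 1000 1000 20   # prints: 0.3333 0.5000 0.0347 0.0347
```

The values printed by kssd may differ slightly if its implementation applies corrections not modelled here.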
kssd shuffle -k <half_length_of_k-mer> -s <half_length_of_k-mer_substring> -l <dimensionality-reduction_level> -o <shuffled_k-mer_substring_space_file>
This step can be omitted: skip to step 2 if you wish to use the default setting of -s. Otherwise, read on.
This command generates a file with the '.shuf' suffix that stores the shuffled k-mer substring space; this file is then taken as input for sequence sketching or decomposition.
-k: Half-length of the k-mer; -k x means k-mers of length 2x are used. For bacteria, -k 8 is recommended; for mammals or metagenomics, -k 10 is recommended; for genome sizes in between, -k 9 is recommended.
-s: Half-length of the k-mer substring; -s x means the whole space is the collection of all 2x-mers. Make sure l < s < k. The default setting is -s 6; there is usually no need to change it.
-l: The level of dimensionality reduction; -l x means the expected rate of dimensionality reduction is $16^x$. For bacteria, -l 3 is recommended; for mammals, -l 4 or -l 5 is recommended. Make sure l < s.
-o: The output .shuf file.
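
The relationships between the three parameters can be checked before running kssd shuffle. The sketch below is a hypothetical helper (check_shuffle_params is not part of kssd) that verifies l < s < k and prints the k-mer length 2k, the substring length 2s, and the expected reduction rate 16^l described above.

```shell
# Hypothetical helper: sanity-check shuffle parameters (not a kssd command).
check_shuffle_params() {
    # $1 = -k value, $2 = -s value, $3 = -l value
    k=$1; s=$2; l=$3
    if [ "$l" -lt "$s" ] && [ "$s" -lt "$k" ]; then
        rate=$(awk -v l="$l" 'BEGIN { printf "%d", 16 ^ l }')   # expected reduction rate 16^l
        echo "k-mer length=$((2 * k)) substring length=$((2 * s)) expected reduction=${rate}x"
    else
        echo "invalid: need l < s < k" >&2
        return 1
    fi
}

# Bacterial setting: -k 8 -s 6 -l 3
check_shuffle_params 8 6 3   # prints: k-mer length=16 substring length=12 expected reduction=4096x
```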
kssd dist -r <.fasta/fastq_dir> -L <.shuf_file or dimensionality-reduction_level> [-k <half_k-mer_length>] -o <outdir>
-L: The .shuf file generated by kssd shuffle, or the level of dimensionality reduction.
If you pass -L a .shuf file, there is no need to specify -k again, since it is already set in the .shuf file.
If you instead pass -L the level of dimensionality reduction, a new .shuf file is generated and used. In fact, the command:
kssd dist -r <.fasta/fastq_dir> -L <dimensionality-reduction_level> -k <half_length_of_k-mer> -o <ref_outdir>
is equivalent to
kssd shuffle -k <half_length_of_k-mer> -s <half_length_of_k-mer_substring> -l <dimensionality-reduction_level> -o <ref_outdir/default.shuf> &&
kssd dist -r <.fasta/fastq_dir> -L <ref_outdir/default.shuf> -o <ref_outdir>
The expected rate of dimensionality reduction for -L x is $16^x$. For bacteria, -L 3 is recommended; for mammals, -L 4 or -L 5 is recommended.
-r: The sequences (fasta or fastq, gzipped or not) from which the reference database is built.
-k: Half-length of the k-mer; -k x means k-mers of length 2x are used. For bacteria, -k 8 is recommended; for mammals, -k 10 is recommended; for genome sizes in between, -k 9 is recommended.
-o: The output directory ref_outdir contains two folders, ref/ and qry/. In Step 3 (distance estimation), ref_outdir/ref is passed to -r as the references and ref_outdir/qry is passed as the queries.
To compare queries with references, the queries must be sketched using the same .shuf file as the references.
kssd dist -o <qry_outdir> -L <ref_outdir/default.shuf or the_.shuf_file_used_by_references> <queries_.fasta/fastq_dir>
-o: The output directory qry_outdir contains only one folder, qry/. In Step 3 (distance estimation), qry_outdir/qry is passed as the queries.
To sketch the Sequence Read Archive accession ERR000001, run:
kssd dist -L <your .shuf file> -n 2 -o <outdir> --pipecmd "fastq-dump --skip-technical --split-spot -Z" ERR000001
Or prefetch first:
prefetch ERR000001 && kssd dist -L <your .shuf file> -n 2 -o <outdir> --pipecmd "fastq-dump --skip-technical --split-spot -Z" <.sra dir>/ERR000001.sra
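
When several accessions need sketching, the same command can be generated in a loop. The sketch below is a dry run: it only prints one command per accession so they can be reviewed first. The helper name print_sra_cmds, the second accession ERR000002 (used purely as a placeholder), and the per-accession output directory name sketch_<accession> are assumptions made for this example, not kssd conventions.

```shell
# Dry run: print one kssd command per SRA accession instead of executing it.
# print_sra_cmds is a hypothetical helper; the flags are copied from the command above.
print_sra_cmds() {
    # $1 = .shuf file, remaining arguments = SRA accessions
    shuf_file=$1; shift
    for acc in "$@"; do
        printf 'kssd dist -L %s -n 2 -o sketch_%s --pipecmd "fastq-dump --skip-technical --split-spot -Z" %s\n' \
            "$shuf_file" "$acc" "$acc"
    done
}

# Review the printed commands; pipe them to sh once they look right.
print_sra_cmds shuf_file/L3K10.shuf ERR000001 ERR000002
```

Because every accession is sketched with the same .shuf file, the resulting query sketches remain comparable with each other.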
kssd set -u -o <union_outdir> <qry_outdir/qry>
This creates the union sketch in <union_outdir> from the combined queries sketch in <qry_outdir/qry>. Note that the combined queries sketch is simply a sketch combined from all query sketches; the union operation removes the integers that are duplicated across different queries.
kssd set -i <union_outdir> -o <intersect_outdir> <qry_outdir/qry>
This creates, in <intersect_outdir>, the intersection of the union sketch in <union_outdir> and the combined queries sketch in <qry_outdir/qry>.
kssd set -s <union_outdir> -o <subtract_outdir> <qry_outdir/qry>
This subtracts the union sketch in <union_outdir> from the combined queries sketch in <qry_outdir/qry> and writes the remainder sketch to <subtract_outdir>.
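
The semantics of the three set operations can be mimicked on plain text files with sort and comm. This is only a toy model: real kssd sketches are stored in kssd's own format, whereas here each "sketch" is assumed to be a text file with one integer per line, and the function names are made up for the example.

```shell
# Toy model of union / intersection / subtraction on text "sketches" (one integer per line).
export LC_ALL=C   # make sort and comm agree on byte-wise ordering

sketch_union()     { sort -u "$@"; }                     # pool all files and drop duplicated integers
sketch_intersect() { sort -u "$2" | comm -12 - "$1"; }   # integers of $2 that are also in union $1
sketch_subtract()  { sort -u "$2" | comm -23 - "$1"; }   # integers of $2 that are not in union $1

tmp=$(mktemp -d)
printf '%s\n' 3 7 12  > "$tmp/q1"    # toy query sketch 1
printf '%s\n' 7 12 20 > "$tmp/q2"    # toy query sketch 2
printf '%s\n' 12 20 31 > "$tmp/new"  # toy sketch to compare against the union

sketch_union "$tmp/q1" "$tmp/q2" > "$tmp/union"   # 12 20 3 7 (byte order, duplicates removed)
sketch_intersect "$tmp/union" "$tmp/new"          # prints 12 and 20
sketch_subtract "$tmp/union" "$tmp/new"           # prints 31
```

The files are ordered byte-wise rather than numerically because comm requires both inputs to be sorted the same way that sort produced them.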
If you only want to compute pairwise distances of all references, run:
kssd dist -r <ref_outdir/ref> -o <outdir> <ref_outdir/qry>
Or, if you want to search the queries against the references, run:
kssd dist -r <ref_outdir/ref> -o <outdir> <qry_outdir/qry>
The ref_outdir and qry_outdir directories hold the sketches created in step 2.
The distances are written to <outdir>/distance
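
Once the distances have been written, a common follow-up is to keep only the close query/reference pairs. The sketch below assumes the distance file is tab-separated with a header row that names the columns as in the table of the Quick Tutorial (Qry, Ref, MashD, ...); that layout is an assumption, so check your own output first. The helper name filter_by_mashd and the toy rows are made up for the example.

```shell
# Hypothetical helper: print Qry, Ref and MashD for pairs below a MashD cutoff.
# Columns are looked up by header name, so their order in the file does not matter.
filter_by_mashd() {
    # $1 = distance file (assumed tab-separated with a header row), $2 = MashD cutoff
    awk -F '\t' -v cutoff="$2" '
        NR == 1 { for (i = 1; i <= NF; i++) col[$i] = i; next }   # map header names to column indices
        $(col["MashD"]) + 0 < cutoff + 0 {
            print $(col["Qry"]) "\t" $(col["Ref"]) "\t" $(col["MashD"])
        }
    ' "$1"
}

# Toy input with made-up values, only to show the call
dist_tmp=$(mktemp)
printf 'Qry\tRef\tJaccard\tMashD\n'   >  "$dist_tmp"
printf 'q1\tr1\t0.3333\t0.0347\n'     >> "$dist_tmp"
printf 'q1\tr2\t0.0100\t0.1961\n'     >> "$dist_tmp"
filter_by_mashd "$dist_tmp" 0.05      # prints the q1/r1 row only
```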
If you have queries generated in different batches, you can combine them with:
kssd dist -o <outdir> <path_to_query_batch1> [<path_to_query_batch2> ...]
Make sure all query batches use the same .shuf file.
Yi, H., Lin, Y., Lin, C. et al. Kssd: sequence dimensionality reduction by k-mer substring space sampling enables real-time large-scale datasets analysis. Genome Biol 22, 84 (2021). https://doi.org/10.1186/s13059-021-02303-4