
SignatureSJ is a tool for estimating genetic relatedness between members of heterogeneous populations of closely related genomic variants. It takes a FASTA file as input and produces an edge list of all related sequence pairs as output.

It works for two metrics:

  • Edit Distance (Levenshtein Distance)
  • Hamming Distance
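For reference, the two metrics can be sketched as plain functions. This is a minimal textbook illustration, not the tool's implementation; the function names hamming_distance and edit_distance are chosen here for the example.

```python
def hamming_distance(s: str, q: str) -> int:
    """Number of positions at which two equal-length sequences differ."""
    if len(s) != len(q):
        raise ValueError("Hamming distance requires sequences of equal length")
    return sum(a != b for a, b in zip(s, q))


def edit_distance(s: str, q: str) -> int:
    """Levenshtein distance computed with the standard two-row dynamic program."""
    previous = list(range(len(q) + 1))  # distances from the empty prefix of s
    for i, a in enumerate(s, start=1):
        current = [i]  # distance from s[:i] to the empty prefix of q
        for j, b in enumerate(q, start=1):
            current.append(min(
                previous[j] + 1,             # delete a
                current[j - 1] + 1,          # insert b
                previous[j - 1] + (a != b),  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]
```

Hamming distance only counts substitutions and needs sequences of equal length, while edit distance also allows insertions and deletions.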

It has two input types:

  • Single sample - the tool finds all related sequence pairs within one sample
  • Multi-sample - the tool finds all related sequence pairs between the given samples, where one sequence comes from one sample and the other from a different sample (the tool goes through all possible sample pairs)
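The difference between the two input types can be shown with a brute-force reference that compares every candidate pair. This is only a sketch of what the result contains, under the assumption that sequences are identified by their position in the sample; it does not reflect how the tool avoids unnecessary comparisons. The names related_pairs_single, related_pairs_multi and hamming are hypothetical.

```python
from itertools import combinations, product


def hamming(s, q):
    """Hamming distance for equal-length sequences."""
    return sum(a != b for a, b in zip(s, q))


def related_pairs_single(sample, k, distance=hamming):
    """All index pairs (i, j), i < j, within one sample with distance <= k."""
    return [
        (i, j)
        for (i, s), (j, q) in combinations(enumerate(sample), 2)
        if distance(s, q) <= k
    ]


def related_pairs_multi(samples, k, distance=hamming):
    """For every pair of samples, all (i, j) with i from the first sample
    and j from the second such that distance <= k.

    `samples` maps a sample name to its list of sequences.
    """
    result = {}
    for (name_a, seqs_a), (name_b, seqs_b) in combinations(sorted(samples.items()), 2):
        result[(name_a, name_b)] = [
            (i, j)
            for (i, s), (j, q) in product(enumerate(seqs_a), enumerate(seqs_b))
            if distance(s, q) <= k
        ]
    return result
```

In the multi-sample case no pair is ever formed from two sequences of the same sample.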

How to run

  1. Clone the repository to your local machine

git clone

  2. Build the jar file with Maven: mvn clean install. This creates signature-sj.jar in the project root.


Alternatively, download the jar from here (I keep it updated with the latest version)


There are several available parameters:

  • -m mandatory parameter to specify a method that you want to run. There are four possible values:
    • 'edit-single' - find all related pairs of sequences in a single sample for edit distance
    • 'hamming-single' - find all related pairs of sequences in a single sample for Hamming distance
    • 'edit-multi' - find all related pairs of sequences between all pairs of given samples for edit distance
    • 'hamming-multi' - find all related pairs of sequences between all pairs of given samples for Hamming distance
  • -in the input path. If not specified, the default file cleaned_independent_264/AMC_P01_1b.fas is used. The path can be relative or absolute. For single-sample methods, it can be either a file or a folder of files; if it is a folder, the tool reads all files in it and concatenates them into one sample. For multi-sample methods, give a folder; the tool treats each file as a separate sample.
  • -k a threshold for related sequences: the output contains only sequence pairs (S, Q) such that d(S, Q) <= k. The default value is 10.
  • -l for edit distance, the length of the l-mers used to create the signature; the default value is 11. For Hamming distance, the number of chunks used to create the signature (the length of the chunks is calculated based on entropy). For Hamming distance, the number of chunks should depend on the length of the sequences: it should be roughly length/11, but can be reduced for highly conserved inputs to speed up the process.
  • -outDir the output directory. The default value is output/.
  • -threads the number of threads for parallel execution. By default, the number of available cores is used.
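The length/11 rule of thumb for the number of Hamming chunks can be worked out as a small calculation. This is a sketch only: the helper name suggest_chunks is hypothetical, and rounding to the nearest integer with a minimum of one chunk is an assumption made for the example, not something the tool does for you.

```python
def suggest_chunks(sequence_length: int, divisor: int = 11) -> int:
    """Starting value for the number of chunks with a Hamming method.

    Follows the rule of thumb 'somewhat close to length/11'. Rounding and
    the lower bound of 1 are assumptions of this sketch.
    """
    if sequence_length <= 0:
        raise ValueError("sequence_length must be positive")
    return max(1, round(sequence_length / divisor))


# Example: sequences of length 264 give 264 / 11 = 24 chunks.
print(suggest_chunks(264))
```

For highly conserved inputs, a smaller value than the one returned here may be worth trying to speed up the run.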

Usage examples


   java -jar signature-sj.jar -m hamming-single -in test_data\db1 -outDir test_out


Start Signature Hamming method for db1 k=10 entropy-based segments size
Input size = 1000
Running threads = 4
comparisons = 64573
related pairs found = 60420
Output is available at ...\test_out\db1-signature-hamming-output.txt
Total run time: 521, ms


java -jar signature-sj.jar -m edit-single -in test_data\db2\2000.fas -threads 2 -k 14


Start Signature method parallel for 2000 k= 14 l= 11
Input size = 2000
Running threads = 2
comparisons = 538142
passed hamming distance = 381663
edit distance comparisons = 156479
related pairs found = 381676
Output is available at ...\output\2000-signature-output.txt
Total run time: 2860, ms


As input the program takes simple fasta format as follows:


For single-sample, if a folder is specified as input, the program will concatenate all files from the folder in one sample.


The output is a simple text file named sample name (file name) + method + output.txt (e.g., db6-signature-hamming-output.txt). It contains a set of number pairs, where each number is the sequence number in the input file. Sequence numbers are used instead of sequence names because even in this form the output may be several GB in size, and including the names from the input file would increase it significantly. If a folder is specified as input for a single-sample method, the tool reads the files in alphabetical order to avoid ambiguity.
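Because the output stores sequence numbers rather than names, a small post-processing step can map the pairs back to FASTA headers. The sketch below rests on assumptions that should be checked against a real output file: one pair per line, two integers per pair separated by any non-digit characters, and numbering that starts at 0 (adjustable through first_index). All function names are hypothetical.

```python
import re


def read_fasta_names(lines):
    """Sequence names in input order (FASTA header lines start with '>')."""
    return [line[1:].strip() for line in lines if line.startswith(">")]


def parse_pairs(lines):
    """Extract the first two integers from every line that has at least two."""
    pairs = []
    for line in lines:
        numbers = re.findall(r"\d+", line)
        if len(numbers) >= 2:
            pairs.append((int(numbers[0]), int(numbers[1])))
    return pairs


def name_pairs(pairs, names, first_index=0):
    """Translate sequence numbers into names; first_index is an assumption."""
    return [(names[a - first_index], names[b - first_index]) for a, b in pairs]


fasta = [">seq_a", "ACGT", ">seq_b", "ACGA", ">seq_c", "TTTT"]
names = read_fasta_names(fasta)
print(name_pairs(parse_pairs(["0 1"]), names))
```

When a folder was used as single-sample input, the list of names would need to be built by reading the files in alphabetical order, so that the numbering matches the concatenated sample.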

Any questions

For any questions, please contact:


