GitHub - smsaladi/bioseq-js: A fast, lightweight implementation of Smith-Waterman

Use

Demo

Installation

bioseq is packaged as a module for Node environments, but automatically attaches itself to window if included in a browser. To get it in Node, from your project directory,

npm install bioseq --save

From there, you can require it wherever is convenient:

const bioseq = require('bioseq');

Alternately, if you just want it in the browser, download it and include it with a script tag, like so:

<script src="bioseq.js"></script>

Quick Start Example

var target = 'ATAGCTAGCTAGCATAAGC';
var query  = 'AGCTAcCGCAT';

var rst = bioseq.align(target, query);

console.log('Score: ', rst.score);
console.log('Starting position: ', rst.position);
console.log('CIGAR: ', bioseq.cigar2str(rst.CIGAR));

var fmt = bioseq.cigar2gaps(target, query, rst.position, rst.CIGAR);

console.log(fmt[0]);
console.log(fmt[1]);

For a more comprehensive example, view the source of demo/index.html

API

bioseq is an object with the following functions: align, cigar2gaps, cigar2str, bsg_enc_seq, gen_query_profile, gen_score_matrix, makeAlphabetMap, and makeIntArray.

In general, the only functions of general interest are align and cigar2*, so let's talk about those.

bioseq.align

align is the fundamental function of bioseq. It accepts seven parameters:

target - target string
query - query string or query profile

And, optionally:

is_local - perform local alignment
matrix - square score matrix or [match,mismatch] array
gapsc - [gap_open,gap_ext] array; k-length gap costs gap_open+gap_ext
w - bandwidth, disabled by default
table - encoding table. It defaults to bioseq.nt5.

Returns an object containing a score, position, and CIGAR. CIGAR is encoded in the BAM way, where higher 28 bits keeps the length and lower 4 bits the operation in order of 'MIDNSH'. See bioseq.cigar2str() for converting CIGAR to string.

bioseq.cigar2str

Before I get in to what this function does, let's talk about what it needs. CIGAR is an encoding scheme for sequence alignment instructions. Basically, it's a list of instructions to make two sequences line up. These instructions are assembled from a very simple set:

M for match
I for insertion
D for deletion
N for gap
S for substitution
H for hard clipping

Each of these can be preceded by an integer, indicating the number of base pairs for which this instruction is correct between the two sequences. However, CIGARs are stored as simpler machine instructions, rather than strings. So, if you want to show the CIGAR (for some reason), you're going to need this.

bioseq.cigar2gaps

Takes a pair of sequences and a CIGAR and returns an array of sequences which have been aligned according to the instructions in the CIGAR.

Theory

bioseq implements local and global pairwise alignment with affine gap penalties. There are two formulations: the Durbin formulation as is described in his book and the Green formulation as is implemented in phrap. The Durbin formulation is easier to understand, while the Green formulation is simpler to code and probably faster in practice.

The Durbin formulation is:

M(i,j) = max{M(i-1,j-1)+S(i,j), E(i-1,j-1), F(i-1,j-1)}

E(i,j) = max{M(i-1,j)-q-r, F(i-1,j)-q-r, E(i-1,j)-r}

F(i,j) = max{M(i,j-1)-q-r, F(i,j-1)-r, E(i,j-1)-q-r}

where q is the gap open penalty, r the gap extension penalty and S(i,j) is the score between the i-th residue in the row sequence and the j-th residue in the column sequence. Note that the original Durbin formulation disallows transitions between between E and F states, but we allow them here.

In the Green formulation, we introduce:

H(i,j) = max{M(i,j), E(i,j), F(i,j)}

The recursion becomes:

H(i,j) = max{H(i-1,j-1)+S(i,j), E(i,j), F(i,j)}

E(i,j) = max{H(i-1,j)-q, E(i-1,j)} - r

F(i,j) = max{H(i,j-1)-q, F(i,j-1)} - r

It is in fact equivalent to the Durbin formulation. In implementation, we calculate the scores in a different order:

H(i,j)   = max{H(i-1,j-1)+S(i,j), E(i,j), F(i,j)}

E(i+1,j) = max{H(i,j)-q, E(i,j)} - r

F(i,j+1) = max{H(i,j)-q, F(i,j)} - r

i.e. at cell (i,j), we compute E for the next row and F for the next column.

History

In an online conversation, Istvan Albert was complaining he could not find a good Smith-Waterman implementation in Javascript. Heng Li thought he could write one over night by porting ksw.c to javascript. It took longer than he planned because he found a couple of subtle bugs in ksw.c. And while he was porting the C code to javascript, he realized that it is not that difficult to merge local and banded global alignments in one function. Achieving that also took extra time.

The end product is a fast and lightweight javascript library for affine-gap local and banded global pairwise alignment. With a modern Javascript engine, it is not much slower than a non-SSE C implementation.

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
demo		demo
dist		dist
.gitignore		.gitignore
README.md		README.md
_config.yml		_config.yml
package-lock.json		package-lock.json
package.json		package.json

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Use

Demo

Installation

Quick Start Example

API

bioseq.align

bioseq.cigar2str

bioseq.cigar2gaps

Theory

History

About

Releases

Packages

smsaladi/bioseq-js

Folders and files

Latest commit

History

Repository files navigation

Use

Demo

Installation

Quick Start Example

API

bioseq.align

bioseq.cigar2str

bioseq.cigar2gaps

Theory

History

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Packages