sparseMEM - WORK IN PROGRESS

An implementation of an algorithm for finding maximal exact matches in large sequence datasets using sparse suffix arrays.
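
To make the term "maximal exact match" concrete, here is a brute-force sketch. It is not the sparse suffix array method this project implements, and it is far too slow for large datasets; it only illustrates what counts as a match. The names `Match` and `naive_mems` are hypothetical and do not come from this repository. A match is reported when it cannot be extended by one character to the left or to the right and is at least `min_len` characters long.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// One match: ref[ref_pos, ref_pos + len) == query[query_pos, query_pos + len).
struct Match {
    std::size_t ref_pos;
    std::size_t query_pos;
    std::size_t len;
};

// Brute-force illustration only (hypothetical helper, not the project's code).
std::vector<Match> naive_mems(const std::string& ref, const std::string& query,
                              std::size_t min_len) {
    assert(min_len > 0);
    std::vector<Match> out;
    for (std::size_t i = 0; i < ref.size(); ++i) {
        for (std::size_t j = 0; j < query.size(); ++j) {
            // Left-maximal: skip starts that could be extended to the left.
            if (i > 0 && j > 0 && ref[i - 1] == query[j - 1]) continue;
            // Right-maximal: extend as far as the characters agree.
            std::size_t len = 0;
            while (i + len < ref.size() && j + len < query.size() &&
                   ref[i + len] == query[j + len]) {
                ++len;
            }
            if (len >= min_len) out.push_back(Match{i, j, len});
        }
    }
    return out;
}
```

For example, calling `naive_mems("acgtacgt", "cgta", 3)` reports two matches, one of length 4 at reference position 1 and one of length 3 at reference position 5.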

This was a project for the Bioinformatics class at the Faculty of Electrical Engineering and Computing, University of Zagreb: http://www.fer.unizg.hr/predmet/bio

Prerequisites

Linux or OS X
gcc 4.8+ or clang 3.4+

Usage

make
./main <sequence.fasta> <query.fasta> <index level of sparse SA, K> <minimal match size>

Terminology

S - a string of n characters stored in an array

k - number of characters in the alphabet

$ - the termination character of the string: a unique character that is lexicographically smaller than every other character in the alphabet of S

suf(S, i) - the suffix of the string S that starts at the character S[i] and runs all the way to the termination character $

SA - suffix array, an array of the starting indices of all suffixes of a string, sorted by the lexicographical order of the suffixes
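
As an illustration of the definition, a suffix array can be built by sorting suffix start positions with a plain comparison sort. This is a minimal sketch for small inputs, not the construction used in this project; the function name `naive_suffix_array` is hypothetical. It assumes the input ends with `$` and that `$` compares smaller than every other character, which holds for ASCII letters.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Naive suffix array: sort all suffix start positions by comparing the
// suffixes directly. Roughly O(n^2 log n) in the worst case, so only
// suitable for short strings (hypothetical helper for illustration).
std::vector<std::size_t> naive_suffix_array(const std::string& s) {
    assert(!s.empty() && s.back() == '$');
    std::vector<std::size_t> sa(s.size());
    std::iota(sa.begin(), sa.end(), 0);  // 0, 1, ..., n-1
    std::sort(sa.begin(), sa.end(), [&s](std::size_t a, std::size_t b) {
        // Compare suf(S, a) with suf(S, b) character by character.
        return s.compare(a, std::string::npos, s, b, std::string::npos) < 0;
    });
    return sa;
}
```

Applied to the string from the graphical example in this README, the first entries come out as 28, 27, 5, matching the index column of that table.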

S-type and L-type

A suffix suf(S, i) is said to be S-type or L-type if suf(S, i) < suf(S, i+1) or suf(S, i) > suf(S, i+1), respectively.

The last suffix suf(S, n-1), consisting of only the termination character $, is defined as S-type.

A character S[i] is S-type or L-type if the suffix suf(S, i) is S-type or L-type, respectively.
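
The types can be computed in one right-to-left pass, because comparing suf(S, i) with suf(S, i+1) reduces to comparing S[i] with S[i+1], and when the two characters are equal the suffix inherits the type of its right neighbour. The sketch below assumes the string ends with `$`; the function name `classify_types` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns one flag per position: 'S' for S-type, 'L' for L-type.
// Hypothetical helper for illustration.
std::string classify_types(const std::string& s) {
    assert(!s.empty() && s.back() == '$');
    const std::size_t n = s.size();
    std::string type(n, 'S');  // the last suffix "$" is S-type by definition
    for (std::size_t i = n - 1; i-- > 0;) {  // i = n-2, n-3, ..., 0
        if (s[i] < s[i + 1]) {
            type[i] = 'S';
        } else if (s[i] > s[i + 1]) {
            type[i] = 'L';
        } else {
            type[i] = type[i + 1];  // equal characters: inherit the type
        }
    }
    return type;
}
```

For the example string used later in this README, the result agrees with the Type column of the table when read in index order.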

Buckets and sub-buckets

A bucket is a sub-array of the SA containing all the suffixes that start with the same character. It can be further split into two sub-buckets according to the types of the suffixes inside: the L-type and S-type sub-buckets, where the L-type sub-bucket lies to the left of the S-type sub-bucket.
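
Bucket boundaries follow from character counts alone, and the split point inside a bucket follows from the number of L-type suffixes that start with that character. The following sketch computes both; the `Bucket` struct and the `bucket_layout` function are hypothetical names, and the input is assumed to end with `$`.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Half-open SA ranges for one bucket (hypothetical layout for illustration).
struct Bucket {
    std::size_t begin;    // first SA slot of the bucket
    std::size_t s_begin;  // first slot of the S-type sub-bucket;
                          // the L-type sub-bucket is [begin, s_begin)
    std::size_t end;      // one past the last SA slot of the bucket
};

std::map<char, Bucket> bucket_layout(const std::string& s) {
    assert(!s.empty() && s.back() == '$');
    const std::size_t n = s.size();
    std::map<char, std::size_t> total, l_count;
    bool is_s = true;  // type of the current suffix, starting with "$"
    ++total[s[n - 1]];
    for (std::size_t i = n - 1; i-- > 0;) {
        // Differing characters decide the type; equal ones inherit it.
        if (s[i] != s[i + 1]) is_s = s[i] < s[i + 1];
        ++total[s[i]];
        if (!is_s) ++l_count[s[i]];
    }
    std::map<char, Bucket> layout;
    std::size_t pos = 0;
    for (const auto& entry : total) {  // std::map iterates in sorted order
        const char c = entry.first;
        layout[c] = Bucket{pos, pos + l_count[c], pos + entry.second};
        pos += entry.second;
    }
    return layout;
}
```

For the example string in this README, the `c` bucket covers SA slots 10 to 21, with the L-type sub-bucket in slots 10 to 16 and the S-type sub-bucket in slots 17 to 21.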

LMS

LMS - left-most S-type

A character S[i] is called LMS if S[i] is S-type and S[i-1] is L-type. A suffix suf(S, i) is called LMS if S[i] is an LMS character.

An LMS-substring is a substring S[i..j] with both S[i] and S[j] being LMS characters and no other LMS character in between. The termination character $ on its own is also an LMS-substring.

P1 is an array of pointers to all the LMS-substrings in S, kept in their original positional order.
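
The definitions above can be sketched as two small helpers: one that collects the LMS positions from left to right (the contents of P1), and one that cuts out the corresponding LMS-substrings. The names `lms_positions` and `lms_substrings` are hypothetical, and the input is assumed to end with `$`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Positions i where S[i] is S-type and S[i-1] is L-type, in text order.
std::vector<std::size_t> lms_positions(const std::string& s) {
    assert(!s.empty() && s.back() == '$');
    const std::size_t n = s.size();
    std::vector<char> is_s(n, 1);  // 1 = S-type, 0 = L-type; "$" is S-type
    for (std::size_t i = n - 1; i-- > 0;) {
        if (s[i] < s[i + 1]) is_s[i] = 1;
        else if (s[i] > s[i + 1]) is_s[i] = 0;
        else is_s[i] = is_s[i + 1];
    }
    std::vector<std::size_t> p1;
    for (std::size_t i = 1; i < n; ++i) {
        if (is_s[i] && !is_s[i - 1]) p1.push_back(i);
    }
    return p1;
}

// LMS-substring k spans from p1[k] to p1[k+1], both inclusive.
// The last one is taken to be the single character "$".
std::vector<std::string> lms_substrings(const std::string& s,
                                        const std::vector<std::size_t>& p1) {
    std::vector<std::string> out;
    for (std::size_t k = 0; k < p1.size(); ++k) {
        const std::size_t len =
            (k + 1 < p1.size()) ? p1[k + 1] - p1[k] + 1 : 1;
        out.push_back(s.substr(p1[k], len));
    }
    return out;
}
```

For the example string in this README, the LMS positions are 3, 5, 11, 14, 18, 24 and 28, and the first LMS-substring is `aga`.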

If we have all the LMS-substrings sorted into buckets in lexicographical order, where all the LMS-substrings in a bucket are identical, then we name each item of P1 by the index of its bucket to produce a new string S1.

SA1 - suffix array for S1
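
The naming step can be sketched as follows. As a simplifying assumption, this sketch orders the LMS-substrings by naively sorting the full suffixes that start at the positions in P1, which stands in for whatever sorting the real construction performs; identical LMS-substrings then sit next to each other and receive the same name. The function name `reduced_string` is hypothetical, and `p1` is expected to hold the LMS positions in text order, with the position of `$` last.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <string>
#include <vector>

// Builds S1: s1[k] is the name (bucket index) of the LMS-substring at p1[k].
std::vector<int> reduced_string(const std::string& s,
                                const std::vector<std::size_t>& p1) {
    assert(!s.empty() && s.back() == '$');
    const std::size_t m = p1.size();
    // LMS-substring k spans p1[k]..p1[k+1] inclusive; the last one is "$".
    auto lms_substring = [&](std::size_t k) -> std::string {
        const std::size_t len = (k + 1 < m) ? p1[k + 1] - p1[k] + 1 : 1;
        return s.substr(p1[k], len);
    };
    // Naive stand-in for the sorting step: order by the full LMS suffixes.
    std::vector<std::size_t> order(m);
    std::iota(order.begin(), order.end(), 0);
    std::sort(order.begin(), order.end(), [&](std::size_t a, std::size_t b) {
        return s.compare(p1[a], std::string::npos,
                         s, p1[b], std::string::npos) < 0;
    });
    // Walk the sorted order; a new name starts whenever the substring changes.
    std::vector<int> s1(m);
    int name = -1;
    std::string prev;
    for (std::size_t r = 0; r < m; ++r) {
        const std::string cur = lms_substring(order[r]);
        if (r == 0 || cur != prev) ++name;
        s1[order[r]] = name;
        prev = cur;
    }
    return s1;
}
```

For the example string in this README, with P1 = 3, 5, 11, 14, 18, 24, 28, every LMS-substring is distinct, so S1 is 4, 1, 2, 3, 6, 5, 0.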

Graphical example

S - cccagaaaactaccacctccggccagta$

rank	index	Type	suffix
[0]	28	S	$
[1]	27	L	a$
[2]	5	S	aaaactaccacctccggccagta$
[3]	6	S	aaactaccacctccggccagta$
[4]	7	S	aactaccacctccggccagta$
[5]	11	S	accacctccggccagta$
[6]	14	S	acctccggccagta$
[7]	8	S	actaccacctccggccagta$
[8]	3	S	agaaaactaccacctccggccagta$
[9]	24	S	agta$
[10]	13	L	cacctccggccagta$
[11]	2	L	cagaaaactaccacctccggccagta$
[12]	23	L	cagta$
[13]	12	L	ccacctccggccagta$
[14]	1	L	ccagaaaactaccacctccggccagta$
[15]	22	L	ccagta$
[16]	0	L	cccagaaaactaccacctccggccagta$
[17]	18	S	ccggccagta$
[18]	15	S	cctccggccagta$
[19]	19	S	cggccagta$
[20]	9	S	ctaccacctccggccagta$
[21]	16	S	ctccggccagta$
[22]	4	L	gaaaactaccacctccggccagta$
[23]	21	L	gccagta$
[24]	20	L	ggccagta$
[25]	25	S	gta$
[26]	26	L	ta$
[27]	10	L	taccacctccggccagta$
[28]	17	L	tccggccagta$
