Skip to content

SleekPanther/sequence-alignment

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

33 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Sequence Alignment

Implementation of the classic Dynamic Programming problem using the Needleman–Wunsch algorithm which requires quadratic space & time complexity.

Problem Statement

Given 2 sequences, find the minimum cost of aligning the 2 sequences (case insensitive).
Gaps can be inserted to 1 sequence or the other, but incur a penalty.

2 = Gap Penalty (δ)
If 2 characters are aligned with each other, there may be a mismatch penalty (αi j)

  • 0 = aligning identical letters
  • 1 = aligning a vowel with a vowel, or a consonant with a consonant
  • 3 = aligning a vowel with a consonant

Minimum cost = sum of mismatch & gap penalties (the optimal alignment)

Optimal Substructure

There may be multiple optimal paths, but this only finds 1 of them

Runtime

O(M*N)
Storage: also O(M*N)

Pseudocode

Usage

Requirements & Caveats

  • Alphanumeric characters only (all others including spaces are sanitized out)
  • Case insensitive
  • _ (Underscore character) represents gaps but only when displaying results (it will be removed & ignored if it's present in a sequence)
  • View results in a fixed-width font for the 2 sequences to be lined up
  • predecessorIndexes is calculated when creating memoTable but not used to find the actual alignment, only to show where the values in memoTable came from
    • For example: if predecessorIndexes[4][4] contains the array [4, 3], it means the value of memoTable[4][4] (which is 6) came from memoTable[4][3] (i.e. case3, a character in seq2 was aligned with a gap so it came from the left)
    • predecessorIndexes[0][0] contains the array [-1, -1] because the upper left corner has no predecessor

Setup

  • Provide 2 strings (edit testSequences 2D array) & run calculateAndPrintOptimalAlignment()
  • Optionally change the penalties by passing in arguments to the non-default constructor
    Currently vowelVowelMismatchPenalty, consonantConsonantMismatchPenalty & numberNumberMismatchPenalty are the same, but all are arbitrary

Example: aligning "mean" with "name"

Code Notes

  • seq1 & seq2 have a leading space added (after sanitizing illegal & whitespace characters)
    This is so that seq1.charAt(i) & seq2.charAt(j) work in the loops & the string indexes match up
    It causes some adjustments for the array sizes.
  • memoTable = new int[seq1.length()][seq2.length()] uses the exact value of length() which includes the space
  • Loops like for(int i=0; i < seq1.length(); i++) use i < seq1.length() to not go out of bounds of the string indexes
  • In findAlignment(), int i = seq1.length() - 1; & int j = seq2.length() - 1; since the 2 sequences have leading spaces & arrays indexes start from 0
  • findAlignment() retraces each calculation in the memoTable to see where it came from

References