# Module 3: Assignment B #

------

## 1. Dynamic Programming Implementation

Implement dynamic programming algorithms for global ([Needleman-Wunsch](https://pubmed.ncbi.nlm.nih.gov/5420325/)) and local ([Smith-Waterman](https://pubmed.ncbi.nlm.nih.gov/7265238)) alignment of protein sequences. The implementation should be a standalone, command-line application.


> There is a Jupyter notebook in the module that walks through a partial implementation of Needleman-Wunsch. Feel free to use this code as a starter, but do not feel obligated to. You may find it easier to start from scratch so that you completely understand every aspect of your code. Note that what is asked of you here differs in some respects from what the notebook implements (e.g., scoring matrix, protein only).

The application should allow the user to read in the sequences from FASTA-formatted files and select the type of alignment (local or global).
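
Reading the sequences could look something like the sketch below. It is a minimal example, not a required design: the function name `read_fasta` and the choice to key records by the full header line are assumptions made here for illustration.

```python
from typing import Dict, Iterable


def read_fasta(lines: Iterable[str]) -> Dict[str, str]:
    """Parse FASTA-formatted lines into a {header: sequence} dictionary.

    `read_fasta` is a hypothetical helper name, not part of the assignment.
    """
    records: Dict[str, str] = {}
    header = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        if line.startswith(">"):
            # A new record starts; keep the header text without the ">"
            header = line[1:].strip()
            records[header] = ""
        elif header is not None:
            # Sequence data may span several lines; join them, upper-cased
            records[header] += line.upper()
    return records


example = [">seq1 test\n", "PAW\n", "HEAE\n", "\n", ">seq2\n", "heag\n"]
print(read_fasta(example))
```

Because the function accepts any iterable of lines, an open file handle can be passed directly, for example `with open(path) as handle: records = read_fasta(handle)`.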

The application should take parameters from the command line to set the sequence files, the type of alignment, the scoring matrix, and a linear gap penalty. Include the ability to select either the BLOSUM62 or PAM250 scoring matrix. The matrices are included as Python dictionaries [here]().
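
One possible way to collect these parameters is Python's standard `argparse` module. The sketch below uses the flag names from the example command in this assignment; the default values and the `align_type` attribute name are assumptions, not requirements.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a command-line parser for the alignment options."""
    parser = argparse.ArgumentParser(
        description="Pairwise protein sequence alignment"
    )
    parser.add_argument("--seq1", required=True,
                        help="FASTA file containing the first sequence")
    parser.add_argument("--seq2", required=True,
                        help="FASTA file containing the second sequence")
    # Stored as `align_type` to avoid shadowing the built-in `type`
    parser.add_argument("--type", dest="align_type",
                        choices=["global", "local"], default="global",
                        help="alignment type (default is an assumption)")
    parser.add_argument("--matrix", choices=["blosum62", "pam250"],
                        default="blosum62",
                        help="scoring matrix (default is an assumption)")
    parser.add_argument("--gap_penalty", type=int, default=-2,
                        help="linear gap penalty (default is an assumption)")
    return parser


# Parse an explicit argument list; a real script would call parse_args()
args = build_parser().parse_args(
    ["--seq1", "a.fasta", "--seq2", "b.fasta",
     "--type", "local", "--matrix", "pam250", "--gap_penalty", "-2"]
)
print(args.align_type, args.matrix, args.gap_penalty)
```

Restricting `--type` and `--matrix` with `choices` lets `argparse` reject unsupported values with a usage message before any alignment work starts.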

An example of running the program is shown below:

```

align.py --seq1 sequence_file.fasta \
         --seq2 different_sequence_file.fasta \
         --type global \
         --matrix blosum62 \
         --gap_penalty -2

```

Please use judicious comments throughout your code. 

The program should report the alignment score, the sequence identity and a visual representation of the alignment. For example:

```

seq 1   GCTAGGATAGGCAATTGGCCTAG--T--G
seq 2   ------ATA-GTAATTGGCCT-GCTTGAG
Alignment Score:    3
Sequence Identity: 78%

```
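
Sequence identity can be defined in more than one way, so state the definition you use. The sketch below assumes identity is the number of columns with identical residues divided by the full alignment length (gap columns included); the helper names `percent_identity` and `format_report` are hypothetical.

```python
def percent_identity(aligned1: str, aligned2: str, gap: str = "-") -> float:
    """Percent of alignment columns holding the same residue in both rows.

    Assumption: the denominator is the alignment length, gaps included.
    """
    if len(aligned1) != len(aligned2):
        raise ValueError("aligned sequences must have equal length")
    if not aligned1:
        return 0.0
    matches = sum(
        1 for a, b in zip(aligned1, aligned2) if a == b and a != gap
    )
    return 100.0 * matches / len(aligned1)


def format_report(aligned1: str, aligned2: str, score: int) -> str:
    """Lay out the two aligned rows followed by the score and identity."""
    identity = percent_identity(aligned1, aligned2)
    return "\n".join([
        f"seq 1   {aligned1}",
        f"seq 2   {aligned2}",
        f"Alignment Score:    {score}",
        f"Sequence Identity: {identity:.0f}%",
    ])


print(format_report("AC-T", "ACGT", 1))
```

Other conventions divide by the length of the shorter sequence or by the number of ungapped columns, which can give noticeably different percentages for the same alignment.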

Implement your code in a separate module and provide an example of a command used to test it, along with the corresponding files.

In [None]:
# my example command (with files included)


Resources and tools to help test your implementation:

* Needleman-Wunsch
  - A general method applicable to the search for similarities in the amino acid sequence of two proteins. Needleman SB, Wunsch CD. [J Mol Biol 1970 Mar;48(3):443-53](http://www.ncbi.nlm.nih.gov/pubmed/5420325?ordinalpos=7&itool=EntrezSystem2.PEntrez.Pubmed.Pubmed_ResultsPanel.Pubmed_DefaultReportPanel.Pubmed_RVDocSum)
  - Online Needleman-Wunsch Alignment Tool: http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Needleman-Wunsch

* Smith-Waterman
  - Identification of common molecular subsequences. Smith TF, Waterman MS. [J Mol Biol 1981 Mar 25;147(1):195-7](https://pubmed.ncbi.nlm.nih.gov/7265238)
  - Online Smith Waterman Alignment Tool: http://rna.informatik.uni-freiburg.de/Teaching/index.jsp?toolName=Smith-Waterman

## 2.  Statistics of Pairwise Alignments
---------------------------------------

Perform a global sequence alignment of the following sequences:

```
PAWHEAE
HEAGAWGHEE
```
Assume a match score of `1`, a gap penalty of `3` and a substitution score of `-1`.  What is the alignment score?
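
For reference, these parameters plug into the standard Needleman-Wunsch recurrence as follows. This assumes the gap penalty of `3` is subtracted once per gap position (a linear gap cost). Here $x$ and $y$ are the two sequences, with lengths $m$ and $n$:

$$
F(i,0) = -3i, \qquad F(0,j) = -3j
$$

$$
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i, y_j) \\
F(i-1,j) - 3 \\
F(i,j-1) - 3
\end{cases}
\qquad
s(a,b) =
\begin{cases}
1 & \text{if } a = b \\
-1 & \text{if } a \neq b
\end{cases}
$$

The global alignment score is $F(m,n)$.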


Next, generate 100 random sequences with the same amino acid distribution as `PAWHEAE`. Perform an alignment of each sequence to `HEAGAWGHEE`.   Calculate the Z-score for each alignment and construct a plot using `matplotlib`. Place the Z-score on the x-axis. Include the plot inline below.
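
A minimal sketch of the randomization step follows. It makes two assumptions: "same amino acid distribution" is read as shuffling the residues of the original sequence, which preserves its composition exactly, and the Z-score is computed as $(s - \mu)/\sigma$ using the sample standard deviation. The names `shuffled_copies` and `z_score` are hypothetical, and your own alignment function would supply the scores.

```python
import random
import statistics
from typing import List, Sequence


def shuffled_copies(seq: str, n: int, rng: random.Random) -> List[str]:
    """Return n random permutations of seq (composition is preserved)."""
    copies = []
    for _ in range(n):
        residues = list(seq)
        rng.shuffle(residues)
        copies.append("".join(residues))
    return copies


def z_score(observed: float, null_scores: Sequence[float]) -> float:
    """Standardize a score against a set of random-sequence scores.

    Assumption: sample standard deviation (statistics.stdev) is used.
    Raises ZeroDivisionError if every null score is identical.
    """
    mean = statistics.mean(null_scores)
    spread = statistics.stdev(null_scores)
    return (observed - mean) / spread


rng = random.Random(0)  # fixed seed so a run can be reproduced
randoms = shuffled_copies("PAWHEAE", 100, rng)
print(len(randoms), randoms[:3])
```

An alternative reading is to sample each position independently from the residue frequencies, for example with `random.choices`; that matches the distribution on average rather than exactly.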

In [None]:
# Plot here

What does the Z-score suggest about the significance of the initial alignment you performed?

Are the alignment scores normally distributed?  _You can use SciPy or any other Python modules to help._

Repeat the process above, but now generate 1,000 and 10,000 random sequences and calculate the Z-score for each set.  Does the change in number of sequences alter your evaluation of the evolutionary relatedness of the sequences?

Considering the wall time for each run, comment on the feasibility of computing the Z-score significance while searching a large database of millions of sequences.
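
One way to ground this discussion is to time a single alignment and extrapolate. The sketch below is only illustrative: `seconds_per_call` and `projected_hours` are hypothetical helpers, the timed lambda is a stand-in for your alignment function, and the projection assumes the cost scales linearly with the number of alignments.

```python
import time
from typing import Callable


def seconds_per_call(fn: Callable[[], object], repeats: int = 100) -> float:
    """Average wall time of one call to fn, measured over `repeats` calls."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def projected_hours(per_alignment_s: float, shuffles_per_hit: int,
                    database_size: int) -> float:
    """Estimated hours to align every shuffle against every database entry."""
    return per_alignment_s * shuffles_per_hit * database_size / 3600.0


# Stand-in workload; replace the lambda with a call to your aligner
per_call = seconds_per_call(lambda: sum(range(1000)), repeats=50)
print(f"{per_call:.2e} s per call")
print(f"{projected_hours(per_call, 100, 1_000_000):.2f} projected hours")
```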

## 3.  Recent Approaches in Sequence Alignment
-----------------------------------------------------

Search [PubMed](https://pubmed.ncbi.nlm.nih.gov) and identify a paper on a different sequence alignment method.  Do not use BLAST or FASTA.  Briefly (at a high level), discuss what is different in this approach and how it improves on what we have seen with Needleman-Wunsch and/or Smith-Waterman.  Your answer does not have to be unnecessarily technical, but please provide an understandable example of the improvement it makes. Limit your response to one paragraph at most. Share your findings and a link to the paper in the `#showcase` channel.