# 2.

Speedup is a metric for the relative performance gained by executing a task one way versus another. In our case, we are comparing a serial execution (a single process) of searching a single sequence against a database with a task-parallel version of the same search.

Speedup is defined as 
$$ S \equiv \frac{T_{\text{Old}}}{T_{\text{New}}} $$ 

where $S$ is speedup, $T_{\text{Old}}$ is the time taken to execute the script without improvement (serial), and $T_{\text{New}}$ is the time taken to execute the script with the improvement (parallel).
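
As a quick illustration of the definition above, the ratio can be computed directly from two measured wall-clock times. This is only a sketch: the function name `speedup` and the timings are hypothetical and are not measurements from any actual run.

```python
def speedup(t_old: float, t_new: float) -> float:
    """Return S = T_old / T_new for two wall-clock times in the same unit."""
    if t_old <= 0 or t_new <= 0:
        raise ValueError("timings must be positive")
    return t_old / t_new


# Hypothetical timings in seconds, for illustration only.
serial_seconds = 1200.0
parallel_seconds = 150.0
print(speedup(serial_seconds, parallel_seconds))  # 8.0
```

A value above 1 means the new (parallel) run was faster; a value below 1 means it was slower than the serial run.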



What is the speedup of the best-performing `mpiBlast` runs compared to a serial BLAST run on a *single node using a single thread*?


# 3.

Update your code for local and global sequence alignment to compute all-vs-all comparisons of a database. The application will take advantage of parallel computing modes in a high-performance computing environment. Given a database of sequences, your application will compare every sequence against every other sequence in the database (i.e., all-vs-all). The scores will be recorded and output in two forms: a flat file and a matrix.

Your code should take input from command line to specify the following parameters:
 - database file (in FASTA format)
 - gap penalty
 - scoring matrix (options should be PAM250 or BLOSUM62)
 - output file name
 - the number of tasks to submit
 
 For example:
 ```
 all-vs-all.py --database protein.db --task 8 --scoring_matrix PAM250 --gap_penalty -5 --output_file protein_score.txt 
 ```
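
One way to read these parameters is Python's standard `argparse` module. The sketch below takes its flag names from the example command above; the types (an integer gap penalty, an integer task count) and the default of one task are assumptions, not requirements of the assignment.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser for the flags shown in the example command."""
    parser = argparse.ArgumentParser(
        prog="all-vs-all.py",
        description="All-vs-all sequence alignment of a FASTA database",
    )
    parser.add_argument("--database", required=True,
                        help="database file in FASTA format")
    # Assumption: the gap penalty is an integer such as -5.
    parser.add_argument("--gap_penalty", type=int, required=True,
                        help="gap penalty, e.g. -5")
    parser.add_argument("--scoring_matrix", required=True,
                        choices=["PAM250", "BLOSUM62"],
                        help="substitution matrix to use")
    parser.add_argument("--output_file", required=True,
                        help="output file name")
    # Assumption: default to a single task when --task is omitted.
    parser.add_argument("--task", type=int, default=1,
                        help="number of tasks to submit")
    return parser


# Parse the example command line from the assignment text.
example = ["--database", "protein.db", "--task", "8",
           "--scoring_matrix", "PAM250", "--gap_penalty", "-5",
           "--output_file", "protein_score.txt"]
args = build_parser().parse_args(example)
print(args.database, args.task, args.scoring_matrix, args.gap_penalty)
```

In a real script you would call `parse_args()` with no argument so that it reads `sys.argv`.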

The final output of your program should be two files. The first should contain, for each comparison, the description line for a sequence, the query, and the alignment score. For example:

```
>proteinA | >proteinB | Score
```

The second should contain the description line for each sequence and the alignment scores in a matrix. For example:

```
           >proteinA >proteinB >proteinC >proteinD
>proteinA       #         #         #         #
>proteinB       #         #         #         #
>proteinC       #         #         #         #
>proteinD       #         #         #         #
```

Please enclose each description line in double quotes. For example,
``` 
>XXX1223E4 | PROTEIN A | ECOLI
```
should be 
```
">XXX1223E4 | PROTEIN A | ECOLI"
```
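
The two output formats and the quoting rule above could be produced along the following lines. This is a sketch under stated assumptions: the helper names (`quote`, `pair_lines`, `matrix_lines`), the tab separator in the matrix, and the scores themselves are all hypothetical; only the ` | ` separator and the double quotes come from the examples above.

```python
def quote(description: str) -> str:
    """Wrap a FASTA description line in double quotes."""
    return f'"{description}"'


def pair_lines(scores):
    """Yield one 'sequence | query | score' line per scored pair."""
    for (first, second), score in scores.items():
        yield f"{quote(first)} | {quote(second)} | {score}"


def matrix_lines(descriptions, scores):
    """Yield a header row, then one row of scores per sequence."""
    # Assumption: cells are tab-separated; the assignment does not fix this.
    yield "\t".join([""] + [quote(d) for d in descriptions])
    for row in descriptions:
        cells = [str(scores[(row, col)]) for col in descriptions]
        yield "\t".join([quote(row)] + cells)


# Made-up scores for two sequences, for illustration only.
descriptions = [">proteinA", ">proteinB"]
scores = {
    (">proteinA", ">proteinA"): 10,
    (">proteinA", ">proteinB"): 3,
    (">proteinB", ">proteinA"): 3,
    (">proteinB", ">proteinB"): 12,
}
for line in pair_lines(scores):
    print(line)
for line in matrix_lines(descriptions, scores):
    print(line)
```

Quoting matters because a description line may itself contain ` | `, as in the `>XXX1223E4 | PROTEIN A | ECOLI` example, which would otherwise be ambiguous in the first file.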

The application should perform the following functions:
 * Split the database based on the number of tasks
 * Load balance the database so each bin is approximately equal
 * Perform the sequence alignments
 * Collate and sort the results
 * Write results to file
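
The split and load-balance steps above can be sketched as a greedy assignment of sequence pairs to bins. The assumptions here are loud ones: each unordered pair is scored once, the cost of a pair is estimated as the product of the two sequence lengths (dynamic-programming alignment fills a table of that size), and `balanced_bins` is a hypothetical helper, not a prescribed design.

```python
import heapq
from itertools import combinations


def pair_cost(sequences, pair):
    """Estimate alignment cost as the product of the two sequence lengths."""
    a, b = pair
    return len(sequences[a]) * len(sequences[b])


def balanced_bins(sequences, n_tasks):
    """Greedily assign all unordered sequence pairs to n_tasks bins.

    Pairs are handled from most to least expensive, and each one goes to
    the bin with the smallest estimated load so far.
    """
    pairs = sorted(combinations(sorted(sequences), 2),
                   key=lambda p: pair_cost(sequences, p), reverse=True)
    heap = [(0, i) for i in range(n_tasks)]  # (estimated load, bin index)
    bins = [[] for _ in range(n_tasks)]
    for pair in pairs:
        load, i = heapq.heappop(heap)
        bins[i].append(pair)
        heapq.heappush(heap, (load + pair_cost(sequences, pair), i))
    return bins


# Toy sequences keyed by a short ID, for illustration only.
sequences = {"a": "MKV", "b": "MKVLA", "c": "MK", "d": "MKVLAGG"}
for i, work in enumerate(balanced_bins(sequences, 2)):
    print(i, work, sum(pair_cost(sequences, p) for p in work))
```

Each bin can then be handed to one task (an MPI rank, a job-array element, or a worker process), and the per-bin results collated afterwards.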

The strategy you use to accomplish this is up to you. You may use any approach discussed in class (e.g., MPI, OpenMP, embarrassingly parallel, hybrid, etc.). The application should be launchable from a single command.

You can use the following [database](https://raw.githubusercontent.com/uchicago-bio/RCC-Utilities/master/database/pdb1000.fasta) to test your application. It is a selection of proteins from the Protein Data Bank.

# 1.

Identify a gene you are interested in studying for your gene presentation. Provide a link to NCBI.


# 7. Protein Structures

The goal of this question is to familiarize yourself with protein structures and to visualize 3D models of proteins using PyMOL.

* Download and install PyMOL (https://www.pymol.org/) on your computer. You can download the installer or use Anaconda: `conda install -c schrodinger pymol`. You will not need a license to run it.
* Consult the PyMOL wiki (https://pymolwiki.org) for answers to all your questions.

## Visualizing Proteins 
Download a protein from the PDB from within the PyMOL interactive console.
```
PyMOL>fetch 1mbn
```
- What is this protein? 
- What organism does it come from? 
- What is the URL of this protein's entry on the PDB website?
- Take a screenshot and include it below.

When you load a structure, the default visualization in PyMOL is a `cartoon` representation. You can highlight the secondary structure elements by coloring them: select `C -> by ss -> Helix Sheet Loop`.
- How many alpha-helices does the protein have? 
- How many beta sheets does it have?

While this visualization makes for great magazine-cover images, it does not provide the detail we may need for analysis. In particular, the `licorice sticks` view gives an all-atom representation that lets us see the side chains. Change the view in the GUI by selecting `H -> everything` and then `S -> sticks`.

If you prefer, you can also control the depiction from the command line.
```
PyMOL> hide everything, 1mbn
PyMOL> show sticks, 1mbn
```
Take a screenshot and include it below.

## Comparing Proteins

Download a different protein from the PDB using the following commands.
```
PyMOL> fetch 1ase
PyMOL> hide everything 
PyMOL> show cartoon
```

- What is this protein?
- How many alpha-helices does the protein have?
- How many beta sheets does it have?
- What are the ligands (small molecules not part of the protein) that are bound to it?

This protein was part of a functional study that included mutating residues. Load in another version of the protein with a mutation in residue 226.

```
PyMOL> fetch 1asf
PyMOL> align 1ase, 1asf
PyMOL> select mutant, resi 226
PyMOL> show sticks, mutant
```

- What is the mutation that was made in the protein? 

# 4. 

Complete the PyMOL tutorial from PyRosetta (https://graylab.jhu.edu/pyrosetta/downloads/documentation/pyrosetta4_online_format/PyRosetta4_Workshop1_PyMOL.pdf). PyRosetta is a powerful molecular modeling toolkit that we will be using later in the course.