# Problem 1. RCC Single Node Blast
Using the RCC version of BLAST, conduct a series of runs to determine the optimal number of threads to run on a single node.  Conduct a BLASTP search against the pdb, swissprot, refseq_protein, and nr databases using a single node configuration. 

Conduct multiple runs and vary the number of CPUs per node: 1, 2, 4, 8, and 16.  This can be accomplished by changing the `.sbatch` file and updating the Slurm variables. 

Record the overall run time for each iteration.

- The databases we will be using are in a shared class directory at `/project2/mpcs56420/db`. To change the database, you will need to alter your script (or create multiple scripts).

- Query sequence:
``` 
>YP_009724390.1 surface glycoprotein [Severe acute respiratory syndrome coronavirus 2]
MFVFLVLLPLVSSQCVNLTTRTQLPPAYTNSFTRGVYYPDKVFRSSVLHSTQDLFLPFFSNVTWFHAIHV
SGTNGTKRFDNPVLPFNDGVYFASTEKSNIIRGWIFGTTLDSKTQSLLIVNNATNVVIKVCEFQFCNDPF
LGVYYHKNNKSWMESEFRVYSSANNCTFEYVSQPFLMDLEGKQGNFKNLREFVFKNIDGYFKIYSKHTPI
NLVRDLPQGFSALEPLVDLPIGINITRFQTLLALHRSYLTPGDSSSGWTAGAAAYYVGYLQPRTFLLKYN
ENGTITDAVDCALDPLSETKCTLKSFTVEKGIYQTSNFRVQPTESIVRFPNITNLCPFGEVFNATRFASV
YAWNRKRISNCVADYSVLYNSASFSTFKCYGVSPTKLNDLCFTNVYADSFVIRGDEVRQIAPGQTGKIAD
YNYKLPDDFTGCVIAWNSNNLDSKVGGNYNYLYRLFRKSNLKPFERDISTEIYQAGSTPCNGVEGFNCYF
PLQSYGFQPTNGVGYQPYRVVVLSFELLHAPATVCGPKKSTNLVKNKCVNFNFNGLTGTGVLTESNKKFL
PFQQFGRDIADTTDAVRDPQTLEILDITPCSFGGVSVITPGTNTSNQVAVLYQDVNCTEVPVAIHADQLT
PTWRVYSTGSNVFQTRAGCLIGAEHVNNSYECDIPIGAGICASYQTQTNSPRRARSVASQSIIAYTMSLG
AENSVAYSNNSIAIPTNFTISVTTEILPVSMTKTSVDCTMYICGDSTECSNLLLQYGSFCTQLNRALTGI
AVEQDKNTQEVFAQVKQIYKTPPIKDFGGFNFSQILPDPSKPSKRSFIEDLLFNKVTLADAGFIKQYGDC
LGDIAARDLICAQKFNGLTVLPPLLTDEMIAQYTSALLAGTITSGWTFGAGAALQIPFAMQMAYRFNGIG
VTQNVLYENQKLIANQFNSAIGKIQDSLSSTASALGKLQDVVNQNAQALNTLVKQLSSNFGAISSVLNDI
LSRLDKVEAEVQIDRLITGRLQSLQTYVTQQLIRAAEIRASANLAATKMSECVLGQSKRVDFCGKGYHLM
SFPQSAPHGVVFLHVTYVPAQEKNFTTAPAICHDGKAHFPREGVFVSNGTHWFVTQRNFYEPQIITTDNT
FVSGNCDVVIGIVNNTVYDPLQPELDSFKEELDKYFKNHTSPDVDLGDISGINASVVNIQKEIDRLNEVA
KNLNESLIDLQELGKYEQYIKWPWYIWLGFIAGLIAIVMVTIMLCCMTSCCSCLKGCCSCGSCCKFDEDD
SEPVLKGVKLHYT
```

What do the results suggest is the optimal number of threads to use on a single node to minimize the run time? Did this differ from what you found when running locally on your own machine? 
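
To compare the recorded run times, it can help to convert them to speedup and parallel efficiency. The following is a minimal sketch, not part of the assignment: the function name `speedup_table` is hypothetical, and the numbers in the demo are placeholders, not measurements.

```python
def speedup_table(runtimes):
    """Return (threads, seconds, speedup, efficiency) rows sorted by thread count.

    `runtimes` maps a thread count to the measured wall-clock seconds.
    Speedup is relative to the smallest thread count present (normally 1).
    """
    base_threads = min(runtimes)
    base_time = runtimes[base_threads]
    rows = []
    for threads in sorted(runtimes):
        seconds = runtimes[threads]
        speedup = base_time / seconds
        # Efficiency is speedup divided by the relative increase in threads.
        efficiency = speedup * base_threads / threads
        rows.append((threads, seconds, speedup, efficiency))
    return rows


if __name__ == "__main__":
    # Placeholder numbers for illustration only; replace with your own timings.
    example = {1: 100.0, 2: 50.0, 4: 40.0}
    for threads, seconds, speedup, efficiency in speedup_table(example):
        print(f"{threads:>2} threads {seconds:8.1f} s "
              f"speedup {speedup:5.2f} efficiency {efficiency:5.2f}")
```

An efficiency well below 1 at higher thread counts suggests the extra threads are no longer paying for themselves.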

# Problem 2. SLURM Array Job
One of the simplest ways to parallelize a computing task is to break it into smaller parts.  Write a SLURM submission script that submits the above query against the fragmented *nr* database.  The fragmented *nr* database is located in the `db` directory in the format: `/project2/mpcs56420/db/refseq_protein.[00-21]`.  

You should create a single script that utilizes the SLURM *array job* functionality.  For this example, use 1 task per node.  Show your script below.  
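
Inside an array job, Slurm sets the `SLURM_ARRAY_TASK_ID` environment variable for each task, so every task can work out which fragment it owns. The sketch below shows one way to map that index to a fragment path; the helper name `fragment_for_task` is hypothetical, and it assumes the array indices run from 0 to 21 to match the fragment suffixes.

```python
import os

# Fragment prefix as given in the assignment text.
DB_PREFIX = "/project2/mpcs56420/db/refseq_protein"


def fragment_for_task(task_id, prefix=DB_PREFIX):
    """Return the fragment path for one array task, e.g. 3 -> '<prefix>.03'."""
    return f"{prefix}.{int(task_id):02d}"


if __name__ == "__main__":
    # Slurm sets SLURM_ARRAY_TASK_ID in each array task; fall back to 0 by hand.
    task_id = os.environ.get("SLURM_ARRAY_TASK_ID", "0")
    print(fragment_for_task(task_id))
```

The same zero-padding can be done directly in the submission script; the Python version is shown only to make the index-to-file mapping explicit.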

What is the runtime for the entire job?


What choice did you make for setting the number of tasks per node? Did you use the default of 1 task per node or a different value? Does it make a difference?

# Problem 3. 
Now that you are familiar with using SLURM on RCC, it's time to update your sequence alignment code to run on the distributed computing resources. 

Update your code for local and global sequence alignment to search against a database of protein sequences. Your program should now accept a database in FASTA format as an optional input. In this mode, read in the database file, then perform your sequence alignment against each member of the database. Save the score for each alignment. _You do not have to generate the Z scores_.
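
Reading the database amounts to splitting the file into (header, sequence) records. A minimal sketch of such a reader follows; the name `read_fasta` is hypothetical, and it assumes plain multi-line FASTA with `>` header lines.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)
```

Because it is a generator over lines, it can be fed an open file handle directly (`for header, seq in read_fasta(handle): ...`) without loading the whole database into memory.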

In addition, update your code to use either a PAM250 or BLOSUM62 scoring matrix instead of the simple match/mismatch score. You may continue to use a linear gap penalty.
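
Switching from match/mismatch to a scoring matrix only changes how the diagonal move is scored. The sketch below computes a global (Needleman-Wunsch) score with a substitution lookup and a linear gap penalty. The function name `global_score` is hypothetical, and `TOY_MATRIX` is a made-up two-letter table for illustration, not PAM250 or BLOSUM62 values.

```python
def global_score(a, b, sub, gap=-2):
    """Global alignment score using a substitution lookup and linear gap penalty.

    `sub` maps a pair of residues (x, y) to a substitution score.
    Only the previous row of the dynamic-programming table is kept.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + sub[(a[i - 1], b[j - 1])]
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Toy substitution table (NOT real PAM250/BLOSUM62 values).
TOY_MATRIX = {(x, y): (2 if x == y else -1) for x in "AC" for y in "AC"}
```

For local alignment, the same recurrence applies with each cell floored at zero and the maximum over the table reported instead of the final cell.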

Your code should take input from command line to specify the following parameters:

- database file (in FASTA format)
- gap penalty (default = -2)
- scoring matrix (options should be PAM250 or BLOSUM62)
- output file name

The input should be similar to the following:
```
align -query=filename -database=filename -gap_penalty=-2 -scoring_matrix=pam250 -out output.txt
```
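
One way to accept the single-dash options shown above is Python's standard `argparse` module, which handles both the `-name=value` and `-name value` forms. This is a sketch only: `build_parser` is a hypothetical helper, and making the scoring matrix a required option is an assumption, since the assignment does not name a default.

```python
import argparse


def build_parser():
    """Build a parser for the options listed in the assignment."""
    parser = argparse.ArgumentParser(prog="align")
    parser.add_argument("-query", required=True, help="query sequence file")
    parser.add_argument("-database", required=True,
                        help="database file in FASTA format")
    parser.add_argument("-gap_penalty", type=int, default=-2,
                        help="linear gap penalty (default: -2)")
    # str.lower lets users type PAM250 or pam250; required is an assumption.
    parser.add_argument("-scoring_matrix", type=str.lower, required=True,
                        choices=["pam250", "blosum62"])
    parser.add_argument("-out", required=True, help="output file name")
    return parser
```

Calling `build_parser().parse_args()` in the main script then gives attributes such as `args.database` and `args.gap_penalty`.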
The output should be similar to the following format:
```
Query Sequence: 
>query

Database: database name: 

Runtime: ####

> score | sequence header 1 
> score | sequence header 2
```

Develop a SLURM script to run your program on a single node against the PDB database.



## Problem 4. Visualizing Proteins 

The goal of this question is to familiarize yourself with protein structures and to visualize 3D models of proteins using PyMOL.

* Download and install PyMOL (https://www.pymol.org/) on your computer. You can download the installer or use Anaconda: `conda install -c schrodinger pymol`. You will not need a license to run it.
* Consult the PyMOL wiki (https://pymolwiki.org) for answers to all your questions.

Download a protein from the PDB within the PyMOL interactive console using the following command and answer the following questions:
```
PyMOL> fetch 1mbn
```
- What is this protein? 
- What organism does it come from? 
- Provide the URL for the protein on the PDB website.
- Take a screenshot and include it below.

When you load a structure, the default visualization in PyMOL is a `cartoon` representation. You can highlight the secondary structure elements by coloring them: select `C -> by ss -> Helix Sheet Loop`. Answer the following questions:

- How many alpha-helices does the protein have? 
- How many beta sheets does it have?

While this visualization makes for great images for magazine covers, it does not provide the detail we may need to analyze it. In particular, the `licorice sticks` view allows us to see an all-atom representation that will help us view the side chains. Change the view in the GUI by selecting `H -> everything` and then `S -> sticks`.

If you prefer, you can also control the depiction from the command line.
```
PyMOL> hide everything, 1mbn
PyMOL> show sticks, 1mbn
```

To prepare high-quality images in PyMOL, use the `ray` command, which applies a ray-tracing algorithm to compute the lighting on the molecule. The following commands set the background color to white and render the image at 800 × 800 pixels.


```
PyMOL> bg_color white
PyMOL> ray 800,800
```


Take a screenshot and include it below.

Insert image here.

## Problem 5. Comparing Proteins

Download a different protein from the PDB using the following commands.
```
PyMOL> fetch 1ase
PyMOL> hide everything 
PyMOL> show cartoon
```

Answer the following questions.

- What is this protein?
- How many alpha-helices does the protein have?
- How many beta sheets does it have?
- What are the ligands (small molecules not part of the protein) that are bound to it?

This protein was part of a functional study that included mutating residues across multiple protein models. Doing this allowed the researchers to identify precisely the amino acids that were functionally active. 

Load another version of the protein:

```
PyMOL> delete all
PyMOL> fetch 1ase
PyMOL> fetch 1asf
```

Now, superimpose the two structures:

```
PyMOL> align 1ase, 1asf
```

The structural match between the two molecules is measured by the root-mean-square (RMS) distance of the aligned atoms:

$$\mathrm{RMSD} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \lVert x_i - y_i \rVert^2}$$

where $x_i$ and $y_i$ are the vector coordinates (displacement vectors) of the $n$ atoms in the two structures. The `align` command automatically generates a sequence alignment to pick the right atoms to compare and then solves for and executes the coordinate transformation that yields the minimal RMS deviation between the structures.
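
The RMS distance described above can be computed directly once the atoms are paired. The following is a minimal sketch using only the standard library; the function name `rmsd` is hypothetical, and it assumes the two coordinate lists are already matched atom for atom and already superimposed (it does not search for the best rotation).

```python
import math


def rmsd(xs, ys):
    """Root-mean-square distance between two equal-length lists of 3D points."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty coordinate lists of equal length")
    # Sum of squared distances between paired atoms, averaged, then rooted.
    total = sum(math.dist(p, q) ** 2 for p, q in zip(xs, ys))
    return math.sqrt(total / len(xs))
```

For example, two atoms where one pair coincides and the other is 2 Å apart give an RMSD of the square root of 2, about 1.41 Å.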

Answer the following questions:
* What is the RMSD error calculated for this structural alignment (include units)? Over how many atoms?
* At what position was the mutation introduced?
* What was the mutation that was made to the protein?

# Problem 6. Structural Analysis 
Download a protein from the PDB within the PyMOL interactive console using the following command and answer the following questions:
```
PyMOL> fetch 1yy8
```

The structure you have downloaded is cetuximab, a therapeutic antibody in development for cancer treatment. Antibodies are composed of two heavy chains and two light chains; the particular construct of its upper part is known as a Fab fragment and contains one full light chain (chain A) and the N-terminal half of one heavy chain (chain B). At one end of the Fab fragment are six loops known as the “complementarity determining regions” (CDR), which bind a particular antigen. 

For the following problems, we will examine the N-terminal domain of chain A (the light chain). To make this easier, type the following command:

```
PyMOL> select L, chain A and resi 1-107
```

From the right panel controls, hide everything except selection L and click `Color→Spectrum→Rainbow`.

Answer the following questions:
* Looking down the direction of the first strand, which way does it twist?
* Do all strands twist the same direction?

Next, let us analyze a couple of strands in the N-terminal domain. Zoom in on strands 3 and 8 (which should be adjacent and colored cyan and marigold, respectively). What are the residue number ranges for these two strands? 

_Hint: Click on the strand ends and look in the console window for the residue numbers._

* Strand 3: ____–____ 
* Strand 8: ____–____

Create a new selection for these two strands with `select` and hide the rest of the molecule. Display the atoms of the amino acids and color them by element (Show→Sticks and Color→ByElement). 

Answer the following questions:
* What color are oxygens? _______ 
* What color are nitrogens? _______ 
* By looking at the side chains, identify the amino-acid sequence (using 1-letter abbreviations) of these two strands: strand 3: _______________ strand 8: _______________ 
* What is the pattern in these sequences, and why does it occur?

# 6. PyRosetta 
PyRosetta is a powerful molecular modeling toolkit that we will be using later on in the course. Visualization is a key component of interpreting and analyzing molecular models, so the PyRosetta group provides a very detailed tutorial as part of its instruction manual. 

Complete the following PyMOL Tutorial from the PyRosetta group: https://graylab.jhu.edu/pyrosetta/downloads/documentation/pyrosetta4_online_format/PyRosetta4_Workshop1_PyMOL.pdf 

There are 8 questions in the tutorial. Please answer them below:

# 1. 

Update your sequence alignment code to run as an array job in Slurm. Your program should partition the database based on the number of nodes you are using, then run your query against each partition. When all the jobs are completed, collate the results into a single file.

You should use a separate helper script to take care of the partitioning and submitting.
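
The partitioning step of such a helper can be quite small. The sketch below deals records out round-robin so the partitions differ in size by at most one; the function name `partition` is hypothetical, and writing the partitions to files and submitting the job are left out.

```python
def partition(records, n_parts):
    """Split a list of records into n_parts lists, dealt out round-robin."""
    if n_parts < 1:
        raise ValueError("n_parts must be at least 1")
    parts = [[] for _ in range(n_parts)]
    for index, record in enumerate(records):
        parts[index % n_parts].append(record)
    return parts
```

Round-robin assignment balances the number of sequences per partition; balancing by total residue count would be a reasonable alternative if sequence lengths vary widely.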

# 2.

We have seen firsthand how shared computing resources have both their good and bad points. Fortunately, everyone can now have their own private computing clusters (as long as you're willing to pay for it).

Follow this codelab from Google Cloud Platform to deploy your own HPC Cluster with Slurm: https://codelabs.developers.google.com/codelabs/hpc-slurm-on-gcp/#0

Once you have run the sample, get your sequence alignment code to run on it. Provide the sbatch script used to conduct a database search using 4 nodes.

# 3. Gene Study Lab Notebook

Update your lab notebook document to add information about the protein structure and function. Please add links where possible.

* Protein Databank Identification
* SCOP/CATH identification numbers
* Active sites/functional residue identification
* Functional classification
* Images of your protein structure
 - Full view
 - Zoom in of active site
 - Functional residue in stick; protein in cartoon
 
 **For your PyMOL images, keep careful notes on the commands you use so you can easily recreate your images.**
 
 
Link to your Google document:

# 7.

One of the most common tasks in structure analysis is performing structure alignments to identify the similarity between proteins. You will write a program to perform both global and local structure alignments.  

There are many approaches to performing structure alignments, but you will consult [this approach to structure alignment and RMSD](http://boscoh.com/protein/rmsd-root-mean-square-deviation.html) for the technical details on performing the alignments.

Despite the complexity of the underlying approach, you will find it boils down to a couple of matrix commands in `numpy`.  As is often the case in bioinformatics, you will be building on what others have contributed to the field.  Understanding other people's code is an important part of this. There are many complex bioinformatics packages in the wild, but often they require a substantial amount of tinkering to work properly.

You can (and should) use and modify the [code from the article](https://github.com/boscoh/pdbremix/blob/master/pdbremix/rmsd.py) to help parse PDB files and perform the alignments.  We are only interested in the core PDB parsing and RMSD calculations in this code.

For each question below, you only need to consider the alpha carbon coordinates to represent the center of mass of the amino acid.  That is, use only one atom to represent each amino acid. 
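
Pulling out the alpha carbons only requires the fixed-column layout of PDB `ATOM` records (atom name in columns 13-16, coordinates in columns 31-54). The sketch below is one possible reader, separate from the article's code; the name `ca_coordinates` is hypothetical, and it ignores alternate locations and multiple models for simplicity.

```python
def ca_coordinates(pdb_lines):
    """Return a list of (x, y, z) tuples for the CA atom of each residue.

    Only ATOM records are read, so HETATM calcium ions (also named CA)
    are not picked up by mistake.
    """
    coords = []
    for line in pdb_lines:
        if line.startswith("ATOM") and line[12:16].strip() == "CA":
            x = float(line[30:38])
            y = float(line[38:46])
            z = float(line[46:54])
            coords.append((x, y, z))
    return coords
```

The residue name, chain, and residue number sit in columns 18-20, 22, and 23-26 of the same record if they are needed for the `A1->B1` style residue listing.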

Your program should take input as follows:
```
align.py -reference file1.pdb -mobile file2.pdb [-local window] -out outfile.pdb
```

Each solution should print the following to standard output:

* RMSD calculation of the alignment
* The rotation matrix that gives the best RMSD
* A list of aligned residues in the following format (`A1->B1`,`A2->B2`, etc.)

The `outfile.pdb` should be a new file (in PDB format) of the aligned structure.  Note that in an alignment, one structure is kept as reference and one is mobile.  The file should represent the mobile structure.  _You should apply the rotation matrix to the entire PDB, not just the alpha carbon atoms._
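
The optimal rotation is commonly found with an SVD-based superposition (the Kabsch method). The following is a sketch of that technique in `numpy`, not the article's code: the function name `kabsch` is hypothetical, and it assumes both inputs are N x 3 arrays of already-paired alpha carbon coordinates.

```python
import numpy as np


def kabsch(mobile, reference):
    """Return (rotation, rmsd) superposing mobile onto reference.

    Coordinates are treated as row vectors, so centered mobile
    coordinates are rotated with `centered @ rotation`.
    """
    mobile = np.asarray(mobile, dtype=float)
    reference = np.asarray(reference, dtype=float)
    p = mobile - mobile.mean(axis=0)        # center both point sets
    q = reference - reference.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)       # SVD of the 3 x 3 covariance
    # Flip the last axis if needed so the result is a proper rotation,
    # not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt
    diff = p @ rotation - q
    return rotation, float(np.sqrt((diff ** 2).sum() / len(p)))
```

To move every atom of the mobile structure, subtract the mobile alpha carbon centroid, multiply by `rotation`, and add the reference alpha carbon centroid.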

# 6. 

Find the global RMSD between the following myoglobin structures: 1mbn, 1np4, 3qm9, 4nos. Include the output from your program below.

# 7. 

Find the best local RMSD alignment between the following myoglobin structures (1mbn, 1np4, 3qm9, 4nos) by comparing segments of length 10. You should apply the rotation matrix to the entire protein.

# 8. 

Find the best structural alignment between residues in the heme binding pockets. Use both global and local approaches to find the best alignment.

Use the [HEM binding pocket from 1mbn](https://github.com/uchicago-bio/mpcs56420-2020-spring/blob/master/assignment-5/1mbn.HEM_A_155.pdb) as the query binding pocket.  Compare the query against each of the following binding pockets. 

* [HEM binding pocket from 1np4](https://github.com/uchicago-bio/mpcs56420-2020-spring/blob/master/assignment-5/1mbn.HEM_A_155.pdb/1np4.HEM_A_185.pdb)

* [HEM binding pocket from 3qm9](https://github.com/uchicago-bio/mpcs56420-2020-spring/blob/master/assignment-5/1mbn.HEM_A_155.pdb/3qm9.HEM_A_201.pdb)

* [HEM binding pocket from 4nos](https://github.com/uchicago-bio/mpcs56420-2020-spring/blob/master/assignment-5/1mbn.HEM_A_155.pdb/4nos.HEM_A_510.pdb)

How does the similarity between the heme binding pockets compare to the overall sequence and structural similarity? Report the sequence similarity in addition to the structural similarity.