# PROVAV DOCUMENTATION
**Authors**: Sandra Castro Labrador,  Mª Rocío Valderrama Palacios 

## Contents Index
* [1. Introduction](#sec_1)
* [2. Reading PDB files](#sec_2)

    - [Reading PDB files](#sec_2_1)
    - [Reading CIF files](#sec_2_2)
    
* [3. Structure similarity measures](#sec_3)

    - [Root mean squared deviation](#sec_3_1)
    - [TM score](#sec_3_2)
    
* [4. 3D protein visualization](#sec_4)
* [5. Sources](#sec_5)


# 1. INTRODUCTION <a name="sec_1"/>

The goal of this project is to use Biopython to read different types of protein files, turn the data into Bio.PDB.Structure objects and, given two proteins, quantify the differences between them by measuring their structural similarity with the root mean squared deviation (RMSD) and the TM-score. Additionally, we visualize the 3D structure of the proteins with the nglview module.

# 2. READING PDB FILES <a name="sec_2"/>

## 2.1 Reading PDB files <a name="sec_2_1"/>

The PDB (Protein Data Bank) offers a variety of formats for downloading protein files, from FASTA to XML. We have chosen to work with .pdb and mmCIF (.cif) files because they are the most widely used formats for protein structure data. Even though the layout of the two formats differs, the information they contain is practically the same.

In [1]:
from Bio.PDB import PDBParser

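def pbdReader(name, file):
    """Parse a .pdb file into a Bio.PDB Structure; return None if it is missing."""
    try:
        # PERMISSIVE=1 tolerates minor format problems instead of raising.
        parser = PDBParser(PERMISSIVE=1)
        return parser.get_structure(name, file)
    except FileNotFoundError:
        print("Something went wrong: File not found")
        return None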

This function returns a Structure object built from the data in the PDB file.

## 2.2 Reading CIF files <a name="sec_2_2"/>

In [2]:
from Bio.PDB import MMCIFParser
def cifReader(name, file):
    """Parse an mmCIF file into a Bio.PDB Structure; return None if it is missing."""
    try:
        parser = MMCIFParser()
        return parser.get_structure(name, file)
    except FileNotFoundError:
        print("Something went wrong: File not found")
        return None

This function returns a Structure object built from the data in the mmCIF file, just as the previous one did with PDB files.
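Since both readers return the same kind of object, the parser can also be picked from the file extension. The sketch below is one possible way to do this and is not part of the project: `PARSERS`, `pick_parser` and `read_structure` are illustrative names, and the extension table is an assumption about how the files are named.

```python
from pathlib import Path

from Bio.PDB import MMCIFParser, PDBParser

# Hypothetical mapping from file extension to Biopython parser class.
PARSERS = {
    ".pdb": PDBParser,
    ".ent": PDBParser,
    ".cif": MMCIFParser,
    ".mmcif": MMCIFParser,
}


def pick_parser(file):
    """Return a parser instance suited to the extension of `file`."""
    suffix = Path(file).suffix.lower()
    if suffix not in PARSERS:
        raise ValueError(f"Unsupported structure file extension: {suffix!r}")
    # QUIET=True silences the construction warnings both parsers can emit.
    return PARSERS[suffix](QUIET=True)


def read_structure(name, file):
    """Parse `file` into a Bio.PDB Structure, whatever its format."""
    return pick_parser(file).get_structure(name, file)
```

With this helper, `read_structure("2kpb", "2kpb.pdb")` and a call on a `.cif` file would go through the same entry point.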


# 3. STRUCTURE SIMILARITY MEASURES <a name="sec_3"/>

We have chosen two similarity measures to get an idea of how the structures of the selected proteins relate to each other and of the possible changes between them.
The first objective is to obtain the root mean squared deviation (RMSD) with a **BioPython** module, based on the information read from **PDB** protein files (PDB module). We also want a second value for comparing structures, the TM-score, which we compute with a separate Python package that we have installed. To check our results against reference values, we downloaded **PyMol** to obtain RMSD values, and we used the online **TM-score** tool to obtain TM-scores for specific protein structures.

## 3.1 Root Mean Squared Deviation <a name="sec_3_1"/>

**What is RMSD?**
It is a commonly used measure of the average distance between the atoms of two superimposed structures. The equation for calculating RMSD is:

\begin{align}
\mathrm{RMSD} & = \sqrt{{1\over N} \sum_{i=1}^N \delta_i^2}
\end{align} 

N is the number of atoms to be aligned and $\delta_i$ is the distance between the i-th pair of equivalent atoms, computed from their coordinates. This measure is useful for comparing two conformations of the same molecule, or of parts of it. It requires the same number of atoms in both structures, so for different proteins the value depends on which atoms are selected.
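As a worked example of the formula, the sketch below computes the RMSD of two already superimposed atom lists in plain Python. The function name `rmsd` and the toy coordinates are made up for illustration; each term of the sum is the squared distance between the i-th atoms of the two structures.

```python
import math


def rmsd(coords_a, coords_b):
    """RMSD between two equally long lists of (x, y, z) coordinates."""
    if len(coords_a) != len(coords_b):
        raise ValueError("Both structures must have the same number of atoms")
    if not coords_a:
        raise ValueError("At least one atom is required")
    # Squared Euclidean distance between the i-th atoms of both structures.
    squared = [
        sum((a - b) ** 2 for a, b in zip(atom_a, atom_b))
        for atom_a, atom_b in zip(coords_a, coords_b)
    ]
    return math.sqrt(sum(squared) / len(squared))


# Toy coordinates (made up): every atom is shifted by 1 along x.
structure_a = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (3.0, 0.0, 0.0)]
structure_b = [(1.0, 0.0, 0.0), (2.5, 0.0, 0.0), (4.0, 0.0, 0.0)]
print(rmsd(structure_a, structure_b))  # 1.0
```

Every atom is displaced by exactly one unit, so each squared distance is 1 and the RMSD is 1.0.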

When comparing structures this way, one array of atoms (the moving atoms) is translated and rotated with respect to the reference atoms. The final orientation chosen for the comparison is the one that minimizes the RMSD between the two structures. The smaller the RMSD between two structures, the more similar they are.
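The orientation that minimizes the RMSD has a closed-form solution based on singular value decomposition, usually called the Kabsch algorithm. The NumPy sketch below illustrates the idea under the assumption that both inputs are (N, 3) arrays with atoms in matching order; `superimposed_rmsd` is a hypothetical name and this is not the project's own code.

```python
import numpy as np


def superimposed_rmsd(reference, moving):
    """Minimum RMSD of `moving` onto `reference` over rotations and translations."""
    reference = np.asarray(reference, dtype=float)
    moving = np.asarray(moving, dtype=float)
    if reference.shape != moving.shape:
        raise ValueError("Both atom arrays must have the same shape (N, 3)")

    # Remove the translation by centring both sets on their centroids.
    ref_centred = reference - reference.mean(axis=0)
    mov_centred = moving - moving.mean(axis=0)

    # SVD of the covariance matrix gives the optimal rotation (Kabsch).
    u, _, vt = np.linalg.svd(mov_centred.T @ ref_centred)
    # Flip one axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(u @ vt))
    rotation = u @ np.diag([1.0, 1.0, d]) @ vt

    diff = mov_centred @ rotation - ref_centred
    return float(np.sqrt((diff ** 2).sum() / len(reference)))
```

If `moving` is only a rotated and shifted copy of `reference`, the function returns a value that is zero up to floating-point error.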

### Biopython module used

To get this value we have chosen the **Bio.SVDSuperimposer** module from BioPython. We also tested other modules such as QCPSuperimposer, but the selected class gave the most accurate values for the proteins we tested. The basic code is easy to apply: we extract the atom coordinates from the PDB files, and the class *SVDSuperimposer()* performs the superposition and returns the RMSD value. It also provides the initial RMSD, before superposition, to compare with the final result.
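A minimal use of the class might look like the sketch below. The coordinates are toy values made up for illustration (in the project they would come from the PDB files), and `moving` is built as a rotated and shifted copy of `reference` so that a perfect fit exists.

```python
import numpy as np
from Bio.SVDSuperimposer import SVDSuperimposer

# Toy coordinates (made up): `moving` is `reference` rotated 90 degrees
# about the z axis and then shifted, so a perfect fit exists.
reference = np.array(
    [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
)
rotation_z = np.array([[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]])
moving = reference @ rotation_z + np.array([5.0, -2.0, 1.0])

sup = SVDSuperimposer()
sup.set(reference, moving)  # first argument stays fixed, second is fitted
sup.run()                   # compute the optimal rotation and translation

initial_rmsd = sup.get_init_rms()  # RMSD before superposition
final_rmsd = sup.get_rms()         # RMSD after superposition
rot, tran = sup.get_rotran()       # transformation applied to `moving`
fitted = sup.get_transformed()     # `moving` after rotation and translation
```

Here `initial_rmsd` is large because the two sets start far apart, while `final_rmsd` drops to zero up to floating-point error.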

A PDB file structure can contain several models of a protein, each with chains, residues and atoms. To obtain the best value with this module, we have written functions and methods that compute the RMSD for all the models found in the structures and select the minimum RMSD. We have tested different kinds of proteins and isoforms, so in some cases the number of atoms is not the same in the two structures. In those cases we matched the lengths of the arrays in order to observe the results, and the values show some variation. For this reason, the method works best with isoforms.
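The search for the minimum RMSD over all models could be sketched as follows. This is an illustration under stated assumptions, not the project's implementation: `best_model_rmsd` is a hypothetical name, each model is passed as an (N, 3) coordinate array, and arrays of different length are simply cut to the shorter one.

```python
from itertools import product

import numpy as np
from Bio.SVDSuperimposer import SVDSuperimposer


def best_model_rmsd(models_a, models_b):
    """Lowest superimposed RMSD over all pairs of models.

    `models_a` and `models_b` are lists with one (N, 3) coordinate array
    per model. Arrays of different length are cut to the shorter one,
    which is an assumption and only sensible for near-identical chains.
    """
    best = None
    for coords_a, coords_b in product(models_a, models_b):
        n = min(len(coords_a), len(coords_b))
        sup = SVDSuperimposer()
        sup.set(np.asarray(coords_a[:n], dtype=float),
                np.asarray(coords_b[:n], dtype=float))
        sup.run()
        rms = sup.get_rms()
        if best is None or rms < best:
            best = rms
    return best
```

For a Bio.PDB structure, one coordinate array per model can be built with `np.array([atom.coord for atom in model.get_atoms()])`. Truncating by position does not guarantee that equivalent atoms are paired, which is one reason the result is more reliable for isoforms than for unrelated proteins.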

## 3.2 TM Score <a name="sec_3_2"/>

We have complemented the previous measure with the TM-score. We found a Python package, **tmscoring**, that gives good results: its values are very close to those of the TM-score tool.
This scoring function assesses the similarity of protein structures.

It makes the score more sensitive to global fold similarity than to local structural variations, and it normalizes the distances. The TM-score lies in (0,1], where 1 indicates a perfect match between two structures. Its interpretation is based on statistics:
- 0.0 < TM-score < 0.17 : random structural similarity                 
- 0.5 < TM-score < 1.00 : in about the same fold   
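The normalization behind the score can be sketched in a few lines. For one fixed superposition, each aligned residue pair at distance d contributes 1 / (1 + (d / d0)^2), where d0 = 1.24 (L - 15)^(1/3) - 1.8 depends on the length L of the reference protein. The published TM-score is the maximum of this sum over all superpositions, a search the sketch leaves out. The function names, the rejection of targets shorter than 19 residues, and the "intermediate" label for scores between the two listed ranges are assumptions made for this example.

```python
def tm_score(distances, target_length):
    """TM-score for one fixed superposition.

    `distances` holds the distance in angstroms between each pair of
    aligned residues; `target_length` is the number of residues of the
    reference protein. The maximization over superpositions is omitted.
    """
    if target_length < 19:
        # Assumption: below 19 residues d0 is not positive, so reject.
        raise ValueError("This sketch needs a target of at least 19 residues")
    # Length-dependent scale that makes the score comparable across sizes.
    d0 = 1.24 * (target_length - 15) ** (1.0 / 3.0) - 1.8
    return sum(1.0 / (1.0 + (d / d0) ** 2) for d in distances) / target_length


def interpret(score):
    """Map a TM-score to the two ranges listed above."""
    if score < 0.17:
        return "random structural similarity"
    if score > 0.5:
        return "about the same fold"
    return "intermediate"  # label chosen for this sketch
```

A perfect superposition, with all distances equal to zero and every residue aligned, gives a score of exactly 1, while distant pairs contribute almost nothing, which is why the score reflects the global fold more than local deviations.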

We obtained coherent results in an efficient way. This function takes the PDB files of the proteins as parameters, so no atom selection is needed and the results may be more accurate.


## 4. 3D PROTEIN VISUALIZATION <a name="sec_4"/>

We have used NGLview to display 3D protein structures. We combined the previous alignment with **Bio.PDB.Superimposer()** to show the superimposed proteins and make the differences clearly visible. We also used **Bio.PDB.PDBIO()** to create new PDB files with the aligned atom structures.
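The combination of **Bio.PDB.Superimposer()** and **Bio.PDB.PDBIO()** might look like the sketch below. It is an illustration rather than the project's code: `align_and_save` is a hypothetical name, and pairing the alpha carbons of the first model by position is an assumption about how the atoms are selected.

```python
from Bio.PDB import PDBIO, PDBParser, Superimposer


def align_and_save(reference_file, moving_file, output_file):
    """Fit `moving_file` onto `reference_file` and write the moved copy."""
    parser = PDBParser(QUIET=True)
    reference = parser.get_structure("reference", reference_file)
    moving = parser.get_structure("moving", moving_file)

    # Assumption: alpha carbons of the first model, paired by position.
    ref_atoms = [a for a in reference[0].get_atoms() if a.get_id() == "CA"]
    mov_atoms = [a for a in moving[0].get_atoms() if a.get_id() == "CA"]
    n = min(len(ref_atoms), len(mov_atoms))

    sup = Superimposer()
    sup.set_atoms(ref_atoms[:n], mov_atoms[:n])  # fixed atoms, moving atoms
    sup.apply(list(moving.get_atoms()))          # move every atom of `moving`

    io = PDBIO()
    io.set_structure(moving)
    io.save(output_file)
    return sup.rms
```

The file written to `output_file` can then be loaded next to the reference structure in the same NGLview widget.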

To display the Jupyter widget with the 3D protein structures, **NGLview** must be installed:
- conda install -c conda-forge nglview or conda upgrade nglview --force
- pip install nglview
- might need: jupyter-nbextension enable nglview --py --sys-prefix


In [2]:
# VISUALIZATION FUNCTION
import nglview
def visualizeNGLview(fileNameProtein):
    view = nglview.NGLWidget()
    view.add_component(fileNameProtein)
    
    return view


In [3]:
# Function to get the structure -> it is needed for some visualizations
from Bio.PDB.PDBParser import PDBParser
from Bio.PDB.Structure import Structure
def read_protein_pdb(file: str, proteinId: str):
    
    parser = PDBParser(PERMISSIVE=1)
    structure: Structure = parser.get_structure(proteinId, file)

    return structure

### 2PKA & 2PKB (V-ATPase a2-subunit isoforms)
We get a display similar to PyMol's thanks to the alignment of atoms with Bio.PDB.Superimposer: we set the atoms and apply the superimposer to one of the structures. We save the result in a new PDB file and compare it with the other structure.

In [4]:
import nglview
view = nglview.NGLWidget()
view.add_component("aligned_ver1.pdb")
view.add_component("2kpb.pdb")
view


### Creatine Kinase from Human Muscle & Creatine Kinase from Human Brain

In [5]:
visualizeNGLview("G2_aligned.pdb")


These two proteins have a high RMSD, so the similarity between their structures is lower.

### Human Leukocyte Antigen

In [6]:
# HLA complex 
visualizeNGLview("4i0p.pdb")


In [10]:
# Concrete antigen alignment with HLA complex
view_2 = nglview.NGLWidget()
view_2.add_component("4i0p.pdb")
view_2.add_component("align_ver2.pdb")
view_2.add_licorice('ALA, GLU')
view_2


The histocompatibility antigen fits the HLA complex. The new blue region that appears corresponds to the alignment we have calculated. For this pair we obtained a high TM-score and a low RMSD.

**Example of alignment code:**

![caption](files/image1.png)

### Cool representation of BRCA1 

In [None]:
struct = nglview.PdbIdStructure("6GVW")

initial_repr = [
    {"type": "licorice", "params": {
        "sele": "atoms", "color": "residueindex"
    }}
]

view_3 = nglview.NGLWidget(struct, representations = initial_repr)
view_3.add_licorice('ALA, GLU')
view_3


In [9]:
visualizeNGLview("6gvw.pdb")


## 5. SOURCES  <a name="sec_5"/>

### PyMol

We obtained reference RMSD values from PyMol. Its 3D protein structures are rendered with high quality and resolution.

![caption](files/image2.png)

### TM-SCORE

![caption](files/image3.png)

### BioPython Tutorial:  The PDB Module