In [1]:
import os
import requests
import mdtraj as md
import nglview as nv
from Bio import Align



# Protein Analysis: Barnase-Barstar

## Downloading the 1BRS pdb

In [2]:
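def fetch_pdb(pdb_id, download_path="./"):
    """Download a PDB file from RCSB into download_path; print a message on failure."""
    url = 'https://files.rcsb.org/download/{}.pdb'.format(pdb_id)
    try:
        res = requests.get(url, allow_redirects=True, timeout=30)
        res.raise_for_status()  # treat HTTP errors (e.g. 404) as failures too
    except requests.RequestException:
        print("Could not fetch pdb from {}".format(url))
        return

    # Write the raw bytes of the response to <download_path>/<pdb_id>.pdb
    file_path = os.path.join(download_path, pdb_id + ".pdb")
    with open(file_path, "wb") as f:
        f.write(res.content)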

In [3]:
fetch_pdb("1brs")

In [4]:
ls

1brs.pdb  README.md  system_inspection.ipynb


<div class="alert alert-warning">
<strong>What?</strong> A function has been defined to download the pdb file. 
</div>

## Inspection with MDTraj and NGLView

After obtaining the pdb file from RCSB, we load it with MDTraj:

In [5]:
brs = md.load('1brs.pdb')

In [6]:
print(brs)

<mdtraj.Trajectory with 1 frames, 5151 atoms, 1101 residues, and unitcells>


The summary shows the number of atoms and residues in the structure. Note that the pdb file has 1 frame, which means that it contains a single snapshot of the system. To get these quantities separately, we use the following commands:

In [7]:
print('The 1BRS structure has %s atoms' % brs.n_atoms)

The 1BRS structure has 5151 atoms


In [8]:
print('The 1BRS structure has %s residues' % brs.n_residues)

The 1BRS structure has 1101 residues


To get the exact position of an atom, we need to specify both the frame index and the atom index. Since there is only 1 frame, the frame index is 0.

<div class="alert alert-info">
<strong>Note:</strong> Python uses zero-based indexing for the elements of a list, tuple, array, etc. The first element always has index 0.
</div>

In [9]:
frame_idx = 0
atom_idx = 36
print(f'Where is the {atom_idx+1} atom at frame {frame_idx+1}?') 
print('x: %s\ty: %s\tz: %s' % tuple(brs.xyz[frame_idx, atom_idx,:]))

Where is the 37 atom at frame 1?
x: 1.563	y: 3.7888	z: 1.5591
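
MDTraj stores coordinates in nanometers, while the pdb file itself lists them in ångströms, so the values printed above differ from those in the file by a factor of 10. A minimal sketch of the conversion, using the numbers from the output above (the helper name `nm_to_angstrom` is only illustrative):

```python
def nm_to_angstrom(position_nm):
    """Convert an (x, y, z) position from nanometers to angstroms."""
    return tuple(10.0 * coordinate for coordinate in position_nm)

# Position of the atom with index 36, as printed by MDTraj (nanometers).
position_nm = (1.563, 3.7888, 1.5591)
position_angstrom = nm_to_angstrom(position_nm)
print("x: %.3f\ty: %.3f\tz: %.3f" % position_angstrom)
```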


To inspect the topology, which holds the information about atoms, residues, chains and bonds, we can type:

In [10]:
topo = brs.topology
print(topo)

<mdtraj.Topology with 12 chains, 1101 residues, 5151 atoms, 4740 bonds>


Atom names are not unique inside a pdb file (every residue has, for example, an atom named `CA`), so to get the name of a specific atom we address it by its atom index:

<div class="alert alert-info">
<strong>Note:</strong> A pdb file can contain many atoms with the same name. They are distinguishable by their atom index, together with the residue and chain they belong to.
</div>

In [11]:
atom_nu = 3
print(f'The name of the {atom_nu+1} atom is: {topo.atom(atom_nu).name}')

The name of the 4 atom is: O


To list all the atoms:

In [12]:
print([atom for atom in topo.atoms])

[VAL3-N, VAL3-CA, VAL3-C, VAL3-O, VAL3-CB, VAL3-CG1, VAL3-CG2, ILE4-N, ILE4-CA, ILE4-C, ILE4-O, ILE4-CB, ILE4-CG1, ILE4-CG2, ILE4-CD1, ASN5-N, ASN5-CA, ASN5-C, ASN5-O, ASN5-CB, ASN5-CG, ASN5-OD1, ASN5-ND2, THR6-N, THR6-CA, THR6-C, THR6-O, THR6-CB, THR6-OG1, THR6-CG2, PHE7-N, PHE7-CA, PHE7-C, PHE7-O, PHE7-CB, PHE7-CG, PHE7-CD1, PHE7-CD2, PHE7-CE1, PHE7-CE2, PHE7-CZ, ASP8-N, ASP8-CA, ASP8-C, ASP8-O, ASP8-CB, ASP8-CG, ASP8-OD1, ASP8-OD2, GLY9-N, GLY9-CA, GLY9-C, GLY9-O, VAL10-N, VAL10-CA, VAL10-C, VAL10-O, VAL10-CB, VAL10-CG1, VAL10-CG2, ALA11-N, ALA11-CA, ALA11-C, ALA11-O, ALA11-CB, ASP12-N, ASP12-CA, ASP12-C, ASP12-O, ASP12-CB, ASP12-CG, ASP12-OD1, ASP12-OD2, TYR13-N, TYR13-CA, TYR13-C, TYR13-O, TYR13-CB, TYR13-CG, TYR13-CD1, TYR13-CD2, TYR13-CE1, TYR13-CE2, TYR13-CZ, TYR13-OH, LEU14-N, LEU14-CA, LEU14-C, LEU14-O, LEU14-CB, LEU14-CG, LEU14-CD1, LEU14-CD2, GLN15-N, GLN15-CA, GLN15-C, GLN15-O, GLN15-CB, GLN15-CG, GLN15-CD, GLN15-OE1, GLN15-NE2, THR16-N, THR16-CA, THR16-C, THR16-O, THR16-C

The same goes for residues:

In [13]:
res_nu = 5
print(f'The name of the {res_nu+1} residue is: {topo.residue(res_nu)}')

The name of the 6 residue is: ASP8


For all residues:

In [14]:
print([residue for residue in topo.residues])

[VAL3, ILE4, ASN5, THR6, PHE7, ASP8, GLY9, VAL10, ALA11, ASP12, TYR13, LEU14, GLN15, THR16, TYR17, HIS18, LYS19, LEU20, PRO21, ASP22, ASN23, TYR24, ILE25, THR26, LYS27, SER28, GLU29, ALA30, GLN31, ALA32, LEU33, GLY34, TRP35, VAL36, ALA37, SER38, LYS39, GLY40, ASN41, LEU42, ALA43, ASP44, VAL45, ALA46, PRO47, GLY48, LYS49, SER50, ILE51, GLY52, GLY53, ASP54, ILE55, PHE56, SER57, ASN58, ARG59, GLU60, GLY61, LYS62, LEU63, PRO64, GLY65, LYS66, SER67, GLY68, ARG69, THR70, TRP71, ARG72, GLU73, ALA74, ASP75, ILE76, ASN77, TYR78, THR79, SER80, GLY81, PHE82, ARG83, ASN84, SER85, ASP86, ARG87, ILE88, LEU89, TYR90, SER91, SER92, ASP93, TRP94, LEU95, ILE96, TYR97, LYS98, THR99, THR100, ASP101, HIS102, TYR103, GLN104, THR105, PHE106, THR107, LYS108, ILE109, ARG110, ALA1, GLN2, VAL3, ILE4, ASN5, THR6, PHE7, ASP8, GLY9, VAL10, ALA11, ASP12, TYR13, LEU14, GLN15, THR16, TYR17, HIS18, LYS19, LEU20, PRO21, ASP22, ASN23, TYR24, ILE25, THR26, LYS27, SER28, GLU29, ALA30, GLN31, ALA32, LEU33, GLY34, TRP35, VAL

Knowing the number and nature of the chains in the structure is also vital for predicting possible bonds.

In [15]:
print('The structure 1BRS from the PDB has %s chains' % brs.n_chains)

The structure 1BRS from the PDB has 12 chains


Visualizing the complex with NGLView shows 6 main protein chains; however, the topology reports 12 chains. Looking at the structure again, we noticed water molecules, so the extra chains may correspond to them. To verify this, we remove the solvent and obtain the topology again.

In [16]:
view = nv.show_mdtraj(brs)
view.add_licorice()
view

NGLWidget()

In [17]:
brs_nw = brs.remove_solvent()
brs_nw

<mdtraj.Trajectory with 1 frames, 4638 atoms, 588 residues, and unitcells at 0x7f9e98e60910>

In [18]:
view_mdt = nv.show_mdtraj(brs_nw)
view_mdt.add_licorice()
view_mdt

NGLWidget()

In [19]:
brs_nw_top = brs_nw.topology
print(brs_nw_top)

<mdtraj.Topology with 6 chains, 588 residues, 4638 atoms, 4740 bonds>


As shown, the new topology contains 6 chains, and inspecting its content shows that these chains correspond to the proteins in the system.

In [20]:
table, chains = brs_nw_top.to_dataframe()
print(table)

      serial name element  resSeq resName  chainID segmentID
0          1    N       N       3     VAL        0          
1          2   CA       C       3     VAL        0          
2          3    C       C       3     VAL        0          
3          4    O       O       3     VAL        0          
4          5   CB       C       3     VAL        0          
...      ...  ...     ...     ...     ...      ...       ...
4633    4641   CA       C      89     SER        5          
4634    4642    C       C      89     SER        5          
4635    4643   CB       C      89     SER        5          
4636    4644   OG       O      89     SER        5          
4637    4645  OXT       O      89     SER        5          

[4638 rows x 7 columns]
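
The hypothesis that the six extra chains hold only water can also be checked directly on the original topology, before removing the solvent. A minimal sketch, assuming MDTraj's `Topology.chains`, `Chain.residues`, `Residue.name` and `Residue.is_water`; the helper `summarize_chains` and the stand-in objects are illustrative only and are there so the snippet runs without the pdb file:

```python
from collections import Counter
from types import SimpleNamespace

def summarize_chains(topology):
    """Return one dict per chain: index, residue count, common residue names, water-only flag."""
    summary = []
    for chain in topology.chains:
        residues = list(chain.residues)
        names = Counter(residue.name for residue in residues)
        summary.append({
            "chain": chain.index,
            "n_residues": len(residues),
            "top_residues": names.most_common(3),
            "only_water": all(residue.is_water for residue in residues),
        })
    return summary

# Stand-in topology: one protein-like chain and one chain made of three waters.
protein = SimpleNamespace(index=0, residues=[
    SimpleNamespace(name="VAL", is_water=False),
    SimpleNamespace(name="ILE", is_water=False),
])
water = SimpleNamespace(index=1, residues=[SimpleNamespace(name="HOH", is_water=True)] * 3)
rows = summarize_chains(SimpleNamespace(chains=[protein, water]))
for row in rows:
    print(row)

# With the real data: summarize_chains(brs.topology)
```

If the hypothesis holds, six of the twelve rows for `brs.topology` should be flagged as water-only.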


We now proceed to separate each chain.

<div class="alert alert-success">
<strong>Well Done!</strong> Let me continue from this point to solve the rest of the tasks
</div>

## Are all receptors, and ligands, equivalent?

The 1brs pdb structure has three protein complexes: three barnases and three barstars forming three heterodimers. So, if we have to extract a single complex to work with, which one should we choose? Do we need to build a new complex by hand, with a receptor and a ligand coming from different dimers in the structure?

First of all, let's study the receptors (Barnase):

In [21]:
atoms_in_chain_A = brs_nw.topology.select("chainid == 0")
atoms_in_chain_B = brs_nw.topology.select("chainid == 1")
atoms_in_chain_C = brs_nw.topology.select("chainid == 2")

<div class="alert alert-info">
<strong>Note:</strong> Have a look at <a href='https://mdtraj.org/1.9.4/atom_selection.html'>the selection tool of mdtraj</a>
</div>

Let's now extract the three barnase proteins. Do they all have the same number of atoms and residues? Do they all have a comparable structure? What is the RMSD between them?

In [22]:
barnase_A = brs_nw.atom_slice(atoms_in_chain_A)
barnase_B = brs_nw.atom_slice(atoms_in_chain_B)
barnase_C = brs_nw.atom_slice(atoms_in_chain_C)

In [23]:
print(f'Atoms in barnase A: {barnase_A.n_atoms}')
print(f'Atoms in barnase B: {barnase_B.n_atoms}')
print(f'Atoms in barnase C: {barnase_C.n_atoms}')

Atoms in barnase A: 864
Atoms in barnase B: 878
Atoms in barnase C: 839


In [24]:
print(f'Residues in barnase A: {barnase_A.n_residues}')
print(f'Residues in barnase B: {barnase_B.n_residues}')
print(f'Residues in barnase C: {barnase_C.n_residues}')

Residues in barnase A: 108
Residues in barnase B: 110
Residues in barnase C: 108


Not all barnase chains are equally resolved. The structure of chain B was solved with more residues and atoms: two residues more than chains A and C. But what should the canonical sequence of this protein be? On the [1BRS web page of the RCSB PDB](https://www.rcsb.org/structure/1BRS) the UniProt ID can be found: P00648. Let's then obtain the canonical sequence from [the UniProtKB database](https://www.uniprot.org/uniprot/P00648).

In [25]:
barnase_canonical_sequence = 'MMKMEGIALKKRLSWISVCLLVLVSAAGMLFSTAAKTETSSHKAHTEAQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR'
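
Instead of pasting the sequence by hand, it could also be downloaded in FASTA format. The following is only a sketch: it assumes the UniProt REST address `https://rest.uniprot.org/uniprotkb/<accession>.fasta` and network access, and the helper names `parse_fasta` and `fetch_uniprot_sequence` are illustrative.

```python
from urllib.request import urlopen

def parse_fasta(text):
    """Return the sequence of the first record in a FASTA-formatted string."""
    lines = []
    for line in text.strip().splitlines():
        if line.startswith(">"):
            if lines:
                break  # a second record starts here; keep only the first one
            continue   # skip the header line of the first record
        lines.append(line.strip())
    return "".join(lines)

def fetch_uniprot_sequence(accession):
    """Download a UniProtKB entry as FASTA and return its sequence (needs network access)."""
    url = f"https://rest.uniprot.org/uniprotkb/{accession}.fasta"  # assumed endpoint
    with urlopen(url, timeout=30) as response:
        return parse_fasta(response.read().decode("utf-8"))

# In the notebook (requires network access):
# barnase_canonical_sequence = fetch_uniprot_sequence("P00648")
```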

And let's see what we had in the 1BRS pdb file.

In [26]:
sequence_A = ''.join([residue.code for residue in barnase_A.top.residues])
sequence_B = ''.join([residue.code for residue in barnase_B.top.residues])
sequence_C = ''.join([residue.code for residue in barnase_C.top.residues])

In [27]:
sequence_A

'VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR'

In [28]:
sequence_B

'AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR'

In [29]:
sequence_C

'VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR'

Let's make some sequence pairwise alignments:

<div class="alert alert-danger">
<strong>What's a sequence alignment?</strong> Performing a sequence alignment between two or more amino acid sequences is not straightforward; there is no unique way to do it. The sequences must be aligned by matching common segments, keeping in mind that a sequence can be modified by mutations, insertions and deletions. It is not a trivial task, and multiple sequence alignment is a problem far from being solved. Try to make pairwise (one-to-one) alignments of these 3 sequences with the help of <a href='https://biopython.org/'>Biopython</a>. Check the <a href='https://biopython.org/docs/1.75/api/Bio.Align.html#Bio.Align.PairwiseAligner'>PairwiseAligner</a> of Biopython and try it.
</div>

In [30]:
aligner = Align.PairwiseAligner()
aligner.mode = 'global'
aligner.match_score = 1.0
aligner.mismatch_score = 0.0
aligner.open_gap_score = -0.5
aligner.extend_gap_score = -0.1
aligner.target_end_gap_score = 0.0
aligner.query_end_gap_score = 0.0

In [44]:
alignment_A = aligner.align(barnase_canonical_sequence, sequence_A)
alignment_B = aligner.align(barnase_canonical_sequence, sequence_B)
alignment_C = aligner.align(barnase_canonical_sequence, sequence_C)

In [45]:
print(alignment_A[0].format())

MMKMEGIALKKRLSWISVCLLVLVSAAGMLFSTAAKTETSSHKAHTEAQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
-------------------------------------------------||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-------------------------------------------------VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR



In [46]:
print(alignment_B[0].format())

MMKMEGIALKKRLSWISVCLLVLVSAAGMLFSTAAKTETSSHKAHTEAQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
-----------------------------------------------||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-----------------------------------------------AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR



In [47]:
print(alignment_C[0].format())

MMKMEGIALKKRLSWISVCLLVLVSAAGMLFSTAAKTETSSHKAHTEAQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR
-------------------------------------------------||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
-------------------------------------------------VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR



Apparently, when the sequences are aligned against the canonical one, no defects or mutations are found. All chains lack the first segment of the protein, although chain B lacks two amino acids fewer than A and C.
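
Because the alignments show no internal gaps or mismatches, each crystallized sequence should be an exact substring of the canonical one, so the missing N-terminal segment can be cross-checked without an aligner. A minimal sketch using plain `str.find`; the helper name `locate` is illustrative and the sequences are copied from the cells above:

```python
canonical = 'MMKMEGIALKKRLSWISVCLLVLVSAAGMLFSTAAKTETSSHKAHTEAQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR'
seq_a = 'VINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR'
seq_b = 'AQVINTFDGVADYLQTYHKLPDNYITKSEAQALGWVASKGNLADVAPGKSIGGDIFSNREGKLPGKSGRTWREADINYTSGFRNSDRILYSSDWLIYKTTDHYQTFTKIR'
fragments = {"A": seq_a, "B": seq_b, "C": seq_a}  # chain C has the same sequence as chain A

def locate(fragment, reference):
    """Return (start, end) of an exact, gap-free match of fragment in reference, or None."""
    start = reference.find(fragment)
    if start < 0:
        return None
    return start, start + len(fragment)

spans = {chain: locate(sequence, canonical) for chain, sequence in fragments.items()}
for chain, span in spans.items():
    if span is None:
        print(f"chain {chain}: no exact match, a real alignment is needed")
    else:
        start, end = span
        print(f"chain {chain}: {start} leading residues missing, "
              f"{len(canonical) - end} trailing residues missing")
```

This shortcut only works when there are no mutations, insertions or deletions; in any other case the pairwise alignment remains the right tool.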

But wait a minute! Do A and C have a different number of atoms but the same residues?

In [61]:
print(f'{barnase_A.top.n_atoms} atoms and {barnase_A.top.n_residues} residues in A')
print(f'{barnase_C.top.n_atoms} atoms and {barnase_C.top.n_residues} residues in C')

864 atoms and 108 residues in A
839 atoms and 108 residues in C


How is this possible? Let's print out the amino acids with a different number of atoms:

In [74]:
for aux_index in range(barnase_A.top.n_residues):
    if barnase_A.top.residue(aux_index).n_atoms!=barnase_C.top.residue(aux_index).n_atoms:
        print(f'{barnase_A.top.residue(aux_index)} with residue index {aux_index}')

LYS19 with residue index 16
ASP22 with residue index 19
GLU29 with residue index 26
GLN31 with residue index 28
LYS39 with residue index 36
VAL45 with residue index 42
LYS49 with residue index 46
SER67 with residue index 64
ARG110 with residue index 107


In [75]:
res16_A = barnase_A.top.residue(16)

In [76]:
res16_C = barnase_C.top.residue(16)

In [77]:
res16_A.n_atoms

9

In [78]:
res16_C.n_atoms

5

In [79]:
for atom in res16_A.atoms:
    print(atom)

LYS19-N
LYS19-CA
LYS19-C
LYS19-O
LYS19-CB
LYS19-CG
LYS19-CD
LYS19-CE
LYS19-NZ


In [80]:
for atom in res16_C.atoms:
    print(atom)

LYS19-N
LYS19-CA
LYS19-C
LYS19-O
LYS19-CB


LYS19 has no side chain beyond the CB atom in chain C. This sometimes happens with structures solved by X-ray crystallography: some atoms do not give a good signal in the diffraction pattern and are missing from the structure. So there can be defects in the pdb structure. But this is not the only anomaly we can find. Some diffraction patterns show the same atoms in two different locations. In this latter case, the pdb structure contains duplicated atoms (with alternate locations).
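
The manual inspection of residue 16 can be generalized to report, for every residue, which atom names are absent from the other chain. A minimal sketch, assuming MDTraj's `Topology.residues`, `Residue.atoms`, `Residue.name`, `Residue.resSeq` and `Atom.name`; the helper names and the stand-in objects are illustrative, and residues are paired by position, which is only valid for chains with the same residues such as A and C:

```python
from types import SimpleNamespace

def missing_atoms(residue_ref, residue_other):
    """Atom names present in residue_ref but absent from residue_other."""
    other_names = {atom.name for atom in residue_other.atoms}
    return [atom.name for atom in residue_ref.atoms if atom.name not in other_names]

def report_missing(top_ref, top_other):
    """Map residue label -> atom names missing in top_other, pairing residues by position."""
    report = {}
    for res_ref, res_other in zip(top_ref.residues, top_other.residues):
        lost = missing_atoms(res_ref, res_other)
        if lost:
            report[f"{res_ref.name}{res_ref.resSeq}"] = lost
    return report

# Stand-in residues mimicking LYS19 in chains A and C, so the sketch runs without the pdb file.
def fake_residue(name, res_seq, atom_names):
    atoms = [SimpleNamespace(name=atom_name) for atom_name in atom_names]
    return SimpleNamespace(name=name, resSeq=res_seq, atoms=atoms)

lys_a = fake_residue("LYS", 19, ["N", "CA", "C", "O", "CB", "CG", "CD", "CE", "NZ"])
lys_c = fake_residue("LYS", 19, ["N", "CA", "C", "O", "CB"])
report = report_missing(SimpleNamespace(residues=[lys_a]), SimpleNamespace(residues=[lys_c]))
print(report)

# With the real data: report_missing(barnase_A.top, barnase_C.top)
```

The report is one-directional: atoms present only in the second chain show up when the arguments are swapped.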

<div class="alert alert-danger">
<strong>To be continued</strong>
</div>

Let's now see how the structures compare. To do this, a least-RMSD fit of chains A and C against B will be performed. Be aware of the different number of residues between them: the same number of atoms has to be used on both sides to compute the translation and rotation that minimize the RMSD between the two groups of atoms.
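
One possible way to obtain equally sized atom groups is to pair the CA atoms of residues that carry the same residue number in both chains. The sketch below is only an illustration of that pairing step: `matched_atom_indices` is a hypothetical helper, the stand-in chains replace the real topologies so the snippet runs on its own, and the commented calls assume that `Trajectory.superpose` and `md.rmsd` accept `atom_indices` and `ref_atom_indices` as in recent MDTraj versions.

```python
from types import SimpleNamespace

def matched_atom_indices(top_mobile, top_reference, atom_name="CA"):
    """Pair atoms called atom_name in residues that share the same resSeq number."""
    reference_atoms = {
        residue.resSeq: atom.index
        for residue in top_reference.residues
        for atom in residue.atoms
        if atom.name == atom_name
    }
    mobile_indices, reference_indices = [], []
    for residue in top_mobile.residues:
        for atom in residue.atoms:
            if atom.name == atom_name and residue.resSeq in reference_atoms:
                mobile_indices.append(atom.index)
                reference_indices.append(reference_atoms[residue.resSeq])
    return mobile_indices, reference_indices

# Stand-in chains with three backbone atoms per residue.
def fake_chain(first_resseq, n_residues):
    residues, index = [], 0
    for offset in range(n_residues):
        atoms = [SimpleNamespace(name=name, index=index + k)
                 for k, name in enumerate(("N", "CA", "C"))]
        index += len(atoms)
        residues.append(SimpleNamespace(resSeq=first_resseq + offset, atoms=atoms))
    return SimpleNamespace(residues=residues)

mobile = fake_chain(first_resseq=3, n_residues=2)     # like chains A and C, starting at residue 3
reference = fake_chain(first_resseq=1, n_residues=4)  # like chain B, starting at residue 1
idx_mobile, idx_reference = matched_atom_indices(mobile, reference)
print(idx_mobile, idx_reference)

# With the real data (assumed MDTraj usage, results in nanometers):
# idx_A, idx_B = matched_atom_indices(barnase_A.top, barnase_B.top)
# barnase_A.superpose(barnase_B, atom_indices=idx_A, ref_atom_indices=idx_B)
# rmsd_nm = md.rmsd(barnase_A, barnase_B, atom_indices=idx_A, ref_atom_indices=idx_B)
```

Restricting the fit to CA atoms sidesteps the missing side-chain atoms found in chain C; a fit on all shared heavy atoms would need the pairing to include the atom name as well.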

# Notes

Remove the waters (do it with mdtraj). Do all the monomers have the same topology? The same structure? How similar are the three complexes? Are they completely identical?

Least-RMSD fit (only if they have the same topology):
an algorithm that finds the translation and rotation that minimize the RMSD over the atoms of the two monomers (this is called "fitting").
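
Written out, the quantity that the fit minimizes can be stated as follows, where $x_i$ and $y_i$ are the positions of the $N$ paired atoms in the two monomers, $R$ is a rotation matrix and $t$ a translation vector (this notation is introduced here only for illustration):

$$
\mathrm{RMSD}_{\min} = \min_{R,\,t} \sqrt{\frac{1}{N} \sum_{i=1}^{N} \left\lVert R\,x_i + t - y_i \right\rVert^{2}}
$$

The sum runs over atom pairs, which is why both monomers must contribute the same atoms in the same order.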

Hydrogen atoms are not resolved by X-ray crystallography.
Unlike NMR, X-ray crystallography yields a single frame.
An NMR entry can contain several structures, including hydrogens.
A unit cell made of 3 barnase-barstar complexes was used.

# Observations

Understanding the MDTraj and NGLView documentation has been a challenge. Despite going through the docs and the examples, the proper use of 'attributes' and 'methods' in each library is still not clear to me; I struggle to understand the order and the values they need in order to do what I want. This is why I have only been able to separate (or so I think) the topologies of the chains, but not their trajectories, which is needed to visualize each of them with NGLView.

Is it possible to create all the 'chain' variables with a for loop?
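
Regarding the question above, a loop over the chains of the topology can build every per-chain object at once. A minimal sketch, assuming MDTraj's `Topology.chains`, `Chain.atoms`, `Atom.index` and `Trajectory.atom_slice`; `split_by_chain` is a hypothetical helper and the stand-in trajectory only echoes the selected indices so the snippet runs without the pdb file:

```python
from types import SimpleNamespace

def split_by_chain(trajectory):
    """Return {chain index: sub-trajectory containing only the atoms of that chain}."""
    pieces = {}
    for chain in trajectory.topology.chains:
        atom_indices = [atom.index for atom in chain.atoms]
        pieces[chain.index] = trajectory.atom_slice(atom_indices)
    return pieces

# Stand-in trajectory with two chains of 3 and 2 atoms.
fake_chains = [
    SimpleNamespace(index=0, atoms=[SimpleNamespace(index=i) for i in range(0, 3)]),
    SimpleNamespace(index=1, atoms=[SimpleNamespace(index=i) for i in range(3, 5)]),
]
fake_trajectory = SimpleNamespace(
    topology=SimpleNamespace(chains=fake_chains),
    atom_slice=lambda atom_indices: list(atom_indices),  # stand-in: returns the indices
)
pieces = split_by_chain(fake_trajectory)
print(pieces)

# With the real data:
# chains = split_by_chain(brs_nw)
# nv.show_mdtraj(chains[0])
```

Since `atom_slice` returns a trajectory (coordinates plus topology) rather than a bare topology, each entry of the dictionary can be passed to NGLView directly.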

Using pandas was useful to look at the content of the topology in a different way.