[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rareylab/proteins_plus_examples/blob/main/notebooks/Structureprofiler_example.ipynb)


# StructureProfiler: an all-in-one tool for 3D protein structure profiling

In this notebook we show you how to use StructureProfiler to automatically profile a given protein structure. The profiling is objective and based on the selection criteria most frequently used to assemble benchmark datasets.
StructureProfiler requires a protein file. If an electron density map is supplied (as a file or via a PDB code), StructureProfiler additionally applies filter criteria that depend on EDIAscorer. Optionally, a ligand of interest can be included as well.

For further information:
[StructureProfiler: an all-in-one tool for 3D protein structure profiling
Agnes Meyder, Stefanie Kampen, Jochen Sieg, Rainer Fährrolfes, Nils-Ole Friedrich, Florian Flachsenberg, and Matthias Rarey,
Bioinformatics, 2019 35 (5), 874–876](https://academic.oup.com/bioinformatics/article/35/5/874/5075170)

Note: NGLview triggers the Colab code snippet sidebar every time a structure is visualized. Don't close it; resize it instead. In addition, NGL views sometimes stay white and show no structure. In that case, just run the cell again.

In [1]:
# colab allow nglview plugin
from google.colab import output
output.enable_custom_widget_manager()

In [2]:
# colab install dependencies
!pip install biopython &>> output.log
!pip install nglview &>> output.log
!pip install rdkit &>> output.log

In [3]:
# imports
import json
import os
import io
from pathlib import Path
import requests
import sys
import time
import pandas as pd
from urllib.parse import urljoin

from Bio.PDB import PDBList, PDBParser
import nglview as nv
from rdkit import Chem



In [4]:
#@title function for coloring results table (unhide if you're interested)

def table_style(row):
    """Creates color schema for result Table, rowwise
    
    :param row: Array of values in single row of result table
    :type row: array
    :return: list with color for each field in row
    :rtype: list
    """
    # rows that depend on EDIA are highlighted in blue
    if row.name in ('EDIAm', 'residueEDIATest'):
        return ['background-color: lightblue'] * len(row.values)
    # test rows: mark every failed (False) entry in red
    if ('Test' in row.name or 'Clash' in row.name
            or row.name in ('noCrystalContacts', 'noAltLocs')):
        # == rather than `is` so that NumPy booleans are also matched
        return ['background-color: red' if value == False else 'background-color: white'
                for value in row.values]
    # all remaining rows stay white
    return ['background-color: white'] * len(row.values)

In [5]:
# constants
PROTEINS_PLUS_URL = 'https://proteins.plus/api/v2/'
UPLOAD = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/upload/')
UPLOAD_JOBS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/upload/jobs/')
PROTEINS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/proteins/')
LIGANDS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/ligands/')
STRUCTUREPROFILER = urljoin(PROTEINS_PLUS_URL, 'structureprofiler/')
STRUCTUREPROFILER_JOBS = urljoin(PROTEINS_PLUS_URL, 'structureprofiler/jobs/')
OUTPUT_DATA = urljoin(PROTEINS_PLUS_URL, 'structureprofiler/output/')
EBI_URL = 'https://www.ebi.ac.uk/pdbe/coordinates/files/'

In [6]:
#@title Utils functions to call API (unhide if you're interested)

# check server connection
try:
    response = requests.get(PROTEINS_PLUS_URL)
except requests.ConnectionError as error:
    if 'Connection refused' in str(error):
        print('WARNING: could not establish a connection to the server', file=sys.stderr)
    raise
    
def get_density_for_pdbcode(pdb_code):
    """Downloads electron density file in ccp4 format for PDB code.
      
    :param pdb_code: The PDB code, like 1g9v
    :type pdb_code: str
    :return: filepath to downloaded density file.
    :rtype: str
    """
    req = requests.get(urljoin(EBI_URL, f'{pdb_code}.ccp4'))
    if req.status_code != 200:
        raise RuntimeError(f'Failed to retrieve density for {pdb_code}\n'
                           f'{req.text}')
    density_file = f'{pdb_code}.ccp4'
    with open(density_file, 'wb') as file:
        file.write(bytearray(req.content))
    return density_file
    
def poll_job(job_id, poll_url, poll_interval=1, max_polls=10):
    """Poll the progress of a job
    
    Continuously polls the server at regular intervals and updates the job information, especially the status.
    
    :param job_id: UUID of the job to poll
    :type job_id: str
    :param poll_url: URL to send the polling request to
    :type poll_url: str
    :param poll_interval: time interval between polls in seconds
    :type poll_interval: int
    :param max_polls: maximum number of times to poll before exiting
    :type max_polls: int
    :return: polled job
    :rtype: dict
    """
    job = requests.get(poll_url + job_id + '/').json()
    status = job['status']
    current_poll = 0
    while status == 'pending' or status == 'running':
        print(f'Job {job_id} is { status }')
        current_poll += 1
        if current_poll >= max_polls:
            print(f'Job {job_id} has not completed after {max_polls} polling requests' \
                  f' and {poll_interval * max_polls} seconds')
            return job
        time.sleep(poll_interval)
        job = requests.get(poll_url + job_id + '/').json()
        status = job['status']
    print(f'Job {job_id} completed with { status }')
    return job
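
Note that `poll_job` returns the job dictionary even when it runs out of polls, so the caller has to check the status again. The following is a minimal sketch of a variant that raises instead. The names `poll_until_done`, `JobTimeout` and the injected `fetch_job` callable are hypothetical and not part of the ProteinsPlus API; the status values `pending` and `running` are the ones used by `poll_job`.

```python
import time


class JobTimeout(RuntimeError):
    """Raised when a job is still pending or running after the last poll."""


def poll_until_done(fetch_job, poll_interval=1.0, max_polls=10, sleep=time.sleep):
    """Call fetch_job() until the job leaves the 'pending'/'running' states.

    fetch_job is a hypothetical zero-argument callable returning the job dict,
    e.g. lambda: requests.get(poll_url + job_id + '/').json()
    """
    if max_polls < 1:
        raise ValueError('max_polls must be at least 1')
    for attempt in range(max_polls):
        job = fetch_job()
        if job['status'] not in ('pending', 'running'):
            return job
        # wait before asking again, but not after the final attempt
        if attempt < max_polls - 1:
            sleep(poll_interval)
    raise JobTimeout(f'job still {job["status"]} after {max_polls} polls')


# usage with a fake server that finishes on the third poll (no network needed)
responses = iter([{'status': 'pending'}, {'status': 'running'}, {'status': 'success'}])
finished = poll_until_done(lambda: next(responses), sleep=lambda seconds: None)
print(finished['status'])
```

Passing the fetch function and the sleep function as arguments keeps the loop testable without a server; in the notebook the default `time.sleep` would be used.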

You can use StructureProfiler for automatic, objective and customizable profiling of X-ray protein structures. The given protein structure is evaluated against the most frequently applied selection criteria. Results are reported for the complex, the active sites and the ligands.

Let's take a look at our example protein, 4AGM, by visualizing its structure.


In [7]:
# fetch the protein 4agm from the PDB
file_4agm = Path(PDBList().retrieve_pdb_file('4agm', file_format='pdb'))
os.rename(file_4agm, '4agm.pdb')
file_4agm = '4agm.pdb' # ProteinsPlus needs .pdb extension

# build a biopython protein
protein_structure = PDBParser().get_structure('4agm',file_4agm)
view = nv.show_biopython(protein_structure)
view.add_representation(repr_type='ball+stick', selection='ligand')
view

Downloading PDB structure '4agm'...




NGLWidget()

We can upload a protein file and start a job like this:

Note: Uploading a protein file is mandatory. However, there is an additional option to include an electron density map as well as a specific ligand. 

But let's keep it simple for now. 


In [8]:
with open(file_4agm) as upload_file:
    query = {'protein_file': upload_file}
    job_submission = requests.post(STRUCTUREPROFILER, files=query).json()
structureprofiler_job = poll_job(job_submission['job_id'], STRUCTUREPROFILER_JOBS)
output_data = requests.get(OUTPUT_DATA + structureprofiler_job['output_data'] + '/').json()

Job 0917996b-cff2-45fd-a3d9-7d094cd86e1c completed with success


StructureProfiler produces a single output called "output_data", which can be split into three tables covering the complex, the active sites and the ligands. Where possible, the specific values are shown; otherwise the table indicates whether a test is passed, that is, whether the filter criterion is fulfilled. Failed criteria are marked in red.
The number of columns in the ligand and active-site tables differs between proteins, depending on the number of ligands and active sites. The protein used here has two ligands, shown in the structure above.
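
Because the number of columns varies, hard-coded column labels only fit proteins with exactly that many ligands. A small sketch of deriving the labels from the data instead; the helper `numbered_columns` is hypothetical, and the `ligands` dictionary is a hand-made stand-in whose keys are assumptions, not real API output.

```python
def numbered_columns(prefix, entries):
    """Return one label per entry, e.g. ['Ligand_1', 'Ligand_2']."""
    return [f'{prefix}_{index}' for index in range(1, len(entries) + 1)]


# hypothetical stand-in for the ligand results; assumed to hold one entry per ligand
ligands = {'0': {'name': 'P86_A_400'}, '1': {'name': 'P86_B_400'}}
print(numbered_columns('Ligand', ligands))  # ['Ligand_1', 'Ligand_2']
```

With a pandas table, the same helper could be applied as `ligand_data.columns = numbered_columns('Ligand', ligand_data.columns)`, which works for any number of ligands.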

In [9]:
complex_data = pd.DataFrame.from_dict([output_data['output_data']['complex']])
complex_data = complex_data.transpose()
complex_data.style.apply(table_style, axis=1)


| |0|
|-----|-----|
|DPI|0.115000|
|rFree|0.197000|
|rFactor|0.173000|
|resolution|1.520000|
|overfittingTest|True|
|significanceTest|True|
|complexStructureProfilerTests|True|


In [10]:
ligand_data = pd.DataFrame.from_dict(output_data['output_data']['ligands'])
ligand_data.columns = ['Ligand_1', 'Ligand_2']
ligand_data.style.apply(table_style, axis=1)

| |Ligand_1|Ligand_2|
|-----|-----|-----|
|ID|400|400|
|NROT|5|5|
|OWAB|18.200000|20.100000|
|logP|1.070000|1.070000|
|name|P86_A_400|P86_B_400|
|chain|A|B|
|HETCode|P86|P86|
|noAltLocs|True|True|
|heavyAtoms|21|21|
|stereoCenters|0|0|


In [11]:
active_site_data = pd.DataFrame.from_dict(output_data['output_data']['active_sites'])
active_site_data.columns = ['Active_site_1', 'Active_site_2']
active_site_data.style.apply(table_style, axis=1)

| |Active_site_1|Active_site_2|
|-----|-----|-----|
|chains|A,B|A,B|
|ligand|P86_A_400|P86_B_400|
|noAltLocs|True|True|
|uniprotID|P04637|P04637|
|bondAnglesTest|False|False|
|bondLengthsTest|True|True|
|bFactorRatioTest|True|True|
|noIntermolecularClash|True|True|
|noIntramolecularClash|True|True|
|activeSiteStructureProfilerTests|False|False|


Let's briefly discuss the tests seen so far to get an idea of what failing a test means.
Most of the tests have cutoff values that have not been explicitly mentioned yet. We cannot cover every test here; for more information, please read the paper mentioned in the introduction.


Test|Description with default cutoffs
-----|-----
bondAnglesTest|No bond angle may deviate by more than 16° from the VSEPR angle.
bondLengthsTest|No bond length may deviate by more than 0.2 Å from the sum of the covalent radii.
noCrystalContacts|No crystal symmetry contact may be closer than 6 Å to the ligand.
ringPlanarityTest|No aromatic ring with a maximum size of 6 may deviate by more than 20° from planarity.

*complexStructureProfilerTests*, *ligandStructureProfilerTests* and *activeSiteStructureProfilerTests* indicate whether __all__ respective tests have been passed.
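
To make the idea of a cutoff and of the summary flags concrete, here is a toy illustration. It is not the StructureProfiler implementation: the helper `within_cutoff`, the key `allTestsPassed`, the approximate covalent radii and the measured values are all assumptions chosen for the example, and the real tests check every bond or angle rather than a single one.

```python
def within_cutoff(observed, reference, cutoff):
    """True if observed deviates from reference by no more than cutoff."""
    return abs(observed - reference) <= cutoff


# illustrative numbers only: approximate covalent radii in Å for carbon and oxygen
COVALENT_RADIUS = {'C': 0.76, 'O': 0.66}
reference_length = COVALENT_RADIUS['C'] + COVALENT_RADIUS['O']  # about 1.42 Å

results = {
    # a 1.43 Å C-O bond lies within 0.2 Å of the reference length
    'bondLengthsTest': within_cutoff(1.43, reference_length, cutoff=0.2),
    # a 128° angle deviates by 18.5° from a 109.5° tetrahedral angle, more than 16°
    'bondAnglesTest': within_cutoff(128.0, 109.5, cutoff=16.0),
}
# the summary flag is True only if every individual test passed
results['allTestsPassed'] = all(results.values())
print(results)
```

A single failed test is therefore enough to turn the summary flag to False, which matches the behaviour described for the `...StructureProfilerTests` rows.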

The protein we use for demonstration does not pass some ligand and active-site tests. However, whether these failed tests make it unsuitable for use depends on the intended application.
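
When deciding whether a structure fits an application, it can help to list the failed criteria explicitly rather than reading them off the colored table. A minimal sketch, assuming each result entry is a dictionary that maps criterion names to values; the helper `failed_tests` is hypothetical, and the example dictionary is a hand-copied excerpt of the first active site shown above.

```python
def failed_tests(entry):
    """Return the names of all boolean criteria in entry that are False."""
    return sorted(name for name, value in entry.items() if value is False)


# hand-copied excerpt of one active-site entry (assumed structure)
active_site = {
    'ligand': 'P86_A_400',
    'noAltLocs': True,
    'bondAnglesTest': False,
    'bondLengthsTest': True,
    'activeSiteStructureProfilerTests': False,
}
print(failed_tests(active_site))
# ['activeSiteStructureProfilerTests', 'bondAnglesTest']
```

The check uses `is False` so that numeric values such as a count of 0 are not mistaken for failed tests.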

Let's continue with the same protein as before. This time we also provide an electron density map as well as a non-native ligand. The ligand NXG_A_1294 comes from another protein (PDB code: 4AGN).

In [12]:
# call the MoleculeHandler through the API for 4agn
query = {'pdb_code': '4agn'}
job_submission = requests.post(UPLOAD, data=query).json()
job = poll_job(job_submission['job_id'], UPLOAD_JOBS)    
protein_id = job['output_protein']
protein_json = requests.get(PROTEINS + protein_id + '/').json()

# select the ligand we are looking for
ligand = None
for ligand_id in protein_json['ligand_set']:
  lig = requests.get(LIGANDS + ligand_id + '/').json()
  if lig['name'] == 'NXG_A_1294':
    ligand = lig 
    break

if ligand is not None:
  print('Successfully extracted ligand NXG_A_1294!')
  # write ligand as SDF
  with open('NXG_A_1294.sdf', 'w') as f:
    f.write(ligand['file_string'])
else: 
  print('Failed to extract ligand :(')

Job b2d8bc1f-6e4d-40c4-9a80-ae2188a5eddf completed with success
Successfully extracted ligand NXG_A_1294!
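
The ligand names in this notebook, such as NXG_A_1294 or P86_A_400, appear to be built from the HET code, the chain and the residue ID. The sketch below splits a name accordingly. That pattern is inferred from the result tables only and is an assumption; names that do not consist of exactly three underscore-separated parts make the hypothetical helper `parse_ligand_name` raise a `ValueError`.

```python
from typing import NamedTuple


class LigandName(NamedTuple):
    het_code: str
    chain: str
    residue_id: int


def parse_ligand_name(name):
    """Split a name such as 'NXG_A_1294' into HET code, chain and residue ID."""
    het_code, chain, residue_id = name.split('_')
    return LigandName(het_code, chain, int(residue_id))


print(parse_ligand_name('NXG_A_1294'))
# LigandName(het_code='NXG', chain='A', residue_id=1294)
```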


In [13]:
with open('NXG_A_1294.sdf') as upload_ligand_file:
    with open('4agm.pdb') as upload_file:
        query = {'protein_file': upload_file, 'ligand_file': upload_ligand_file}
        params = {'pdb_code': '4agm'}
        job_submission = requests.post(STRUCTUREPROFILER,files=query, data=params).json()
structureprofiler_job = poll_job(job_submission['job_id'], STRUCTUREPROFILER_JOBS,poll_interval=5, max_polls=100)
output_data = requests.get(OUTPUT_DATA + structureprofiler_job['output_data'] + '/').json()

Job 5a0bc572-6529-43a3-be09-0551d14f78ce is running
Job 5a0bc572-6529-43a3-be09-0551d14f78ce is running
Job 5a0bc572-6529-43a3-be09-0551d14f78ce is running
Job 5a0bc572-6529-43a3-be09-0551d14f78ce is running
Job 5a0bc572-6529-43a3-be09-0551d14f78ce is running
Job 5a0bc572-6529-43a3-be09-0551d14f78ce completed with success


This call sent the PDB file of 4AGM, a PDB code and the ligand file to the server. The PDB code is then used to retrieve the electron density map.

We could also use a local density file. This would look like this:

In [14]:
# density_4agm = get_density_for_pdbcode('4agm')

# with open('4agm.pdb') as upload_file:
#     with open(density_4agm, 'rb') as density_file:
#         with open('NXG_A_1294.sdf') as upload_ligand_file:
#             # send protein, local density map and ligand together
#             query = {'protein_file': upload_file,
#                      'electron_density_map': density_file,
#                      'ligand_file': upload_ligand_file}
#             job_submission = requests.post(STRUCTUREPROFILER, files=query).json()
# structureprofiler_job = poll_job(job_submission['job_id'], STRUCTUREPROFILER_JOBS, poll_interval=5, max_polls=100)
# output_data = requests.get(OUTPUT_DATA + structureprofiler_job['output_data'] + '/').json()
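
The two ways of supplying density (a PDB code or a local file) differ only in which request fields are filled. The sketch below collects the fields for either case. The field names `protein_file`, `ligand_file`, `electron_density_map` and `pdb_code` are the ones used in this notebook; the helper `profiler_request_fields` is hypothetical, and rejecting a request that gives both a PDB code and a density file is a choice of this sketch, not a documented rule of the API.

```python
def profiler_request_fields(protein_path, ligand_path=None, pdb_code=None, density_path=None):
    """Return (file_fields, data_fields) for a StructureProfiler submission.

    file_fields maps form-field names to local paths, data_fields holds plain values.
    """
    if pdb_code is not None and density_path is not None:
        raise ValueError('give either pdb_code or density_path, not both')
    file_fields = {'protein_file': protein_path}  # the protein file is mandatory
    data_fields = {}
    if ligand_path is not None:
        file_fields['ligand_file'] = ligand_path
    if density_path is not None:
        file_fields['electron_density_map'] = density_path
    if pdb_code is not None:
        data_fields['pdb_code'] = pdb_code
    return file_fields, data_fields


files, data = profiler_request_fields('4agm.pdb', ligand_path='NXG_A_1294.sdf', pdb_code='4agm')
print(files)  # {'protein_file': '4agm.pdb', 'ligand_file': 'NXG_A_1294.sdf'}
print(data)   # {'pdb_code': '4agm'}
```

To submit, each path in `files` would be opened (the density map in binary mode) and the resulting dictionaries passed to `requests.post(STRUCTUREPROFILER, files=..., data=...)`, as in the cells above.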

Let's take a look at our results. As you can see, the ligand and active-site tables now have additional rows for the tests that depend on EDIA (marked blue). These entries only appear in the results if an electron density map was given. The tables also have an additional column for the non-native input ligand (Ligand_3 and Active_site_3).
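
Since the EDIA-dependent entries are only present when density was available, code that consumes the results should not assume they exist. A minimal sketch, assuming each ligand or active-site entry is a dictionary keyed by row name; the helper `edia_values` is hypothetical, the row names are the ones `table_style` colors blue, and 0.32 is the EDIAm reported for P86_A_400 in this run.

```python
EDIA_ROWS = ('EDIAm', 'residueEDIATest')  # row names highlighted in blue by table_style


def edia_values(entry):
    """Return the EDIA-dependent criteria of one ligand or active-site entry."""
    return {name: entry[name] for name in EDIA_ROWS if name in entry}


# hand-made excerpts of a ligand entry, with and without density (assumed structure)
with_density = {'name': 'P86_A_400', 'EDIAm': 0.32, 'noAltLocs': True}
without_density = {'name': 'P86_A_400', 'noAltLocs': True}
print(edia_values(with_density))     # {'EDIAm': 0.32}
print(edia_values(without_density))  # {}
```

An empty result then signals that the job ran without an electron density map.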

In [15]:
complex_data = pd.DataFrame.from_dict([output_data['output_data']['complex']])
ligand_data = pd.DataFrame.from_dict(output_data['output_data']['ligands'])
active_sites_data = pd.DataFrame.from_dict(output_data['output_data']['active_sites'])


In [16]:
complex_data = pd.DataFrame.from_dict([output_data['output_data']['complex']])
complex_data = complex_data.transpose()
complex_data.style.apply(table_style, axis=1)

| |0|
|-----|-----|
|DPI|0.115000|
|rFree|0.197000|
|rFactor|0.173000|
|resolution|1.520000|
|overfittingTest|True|
|significanceTest|True|
|complexStructureProfilerTests|True|


In [17]:
ligand_data = pd.DataFrame.from_dict(output_data['output_data']['ligands'])
ligand_data.columns = ['Ligand_1', 'Ligand_2', 'Ligand_3']
ligand_data.style.apply(table_style, axis=1)

| |Ligand_1|Ligand_2|Ligand_3|
|-----|-----|-----|-----|
|ID|400|400|1294|
|NROT|5|5|5|
|OWAB|18.200000|20.100000|0.000000|
|logP|2.490000|1.070000|2.650000|
|name|P86_A_400|P86_B_400|NXG_A_1294|
|EDIAm|0.320000|0.710000|0.380000|
|chain|A|B|A|
|HETCode|P86|P86|NXG|
|noAltLocs|True|True|True|
|heavyAtoms|21|21|24|


In [18]:
active_site_data = pd.DataFrame.from_dict(output_data['output_data']['active_sites'])
active_site_data.columns = ['Active_site_1', 'Active_site_2', 'Active_site_3']
active_site_data.style.apply(table_style, axis=1)

| |Active_site_1|Active_site_2|Active_site_3|
|-----|-----|-----|-----|
|chains|A,B|A,B|A,B|
|ligand|P86_A_400|P86_B_400|NXG_A_1294|
|noAltLocs|True|True|True|
|uniprotID|P04637|P04637|P04637|
|bondAnglesTest|False|False|False|
|bondLengthsTest|True|True|True|
|residueEDIATest|True|False|True|
|bFactorRatioTest|True|True|True|
|noIntermolecularClash|True|True|True|
|noIntramolecularClash|True|True|True|
