# Protoss: Protonate protein-ligand complexes 

Protoss is a high-quality, fully automated hydrogen prediction tool for protein-ligand complexes. It adds missing hydrogen atoms to protein structures and determines reasonable protonation states, tautomers, and hydrogen coordinates for both protein and ligand molecules.

* [Bietz, S.; Urbaczek, S.; Schulz, B.; Rarey, M., Protoss: a holistic approach to predict tautomers and protonation states in protein-ligand complexes. J Cheminform 2014, 6, 12.](https://doi.org/10.1186/1758-2946-6-12)
* [Lippert, T.; Rarey, M., Fast automated placement of polar hydrogen atoms in protein-ligand complexes. J Cheminform 2009, 1 (1), 13.](https://doi.org/10.1186/1758-2946-1-13)

Note: NGLview opens the Colab code snippet sidebar every time a structure is visualized. Resize the sidebar instead of closing it. Occasionally an NGL view stays white and shows no structure; if that happens, run the cell again.

In [1]:
from google.colab import output
output.enable_custom_widget_manager()

In [2]:
# install dependencies
!pip install biopython &>> output.log
!pip install nglview &>> output.log
!pip install rdkit-pypi &>> output.log

In [3]:
# imports
import os
import io
from pathlib import Path
import requests
import sys
import time
from urllib.parse import urljoin
import warnings

from IPython.display import Image
from Bio.PDB import PDBIO, PDBList, PDBParser, Select
from Bio.PDB.PDBExceptions import PDBConstructionWarning
import nglview as nv
from rdkit import Chem



In [4]:
# constants
PROTEINS_PLUS_URL = 'https://proteins.plus/api/v2/'
UPLOAD = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/upload/')
UPLOAD_JOBS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/upload/jobs/')
PROTEINS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/proteins/')
LIGANDS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/ligands/')
PROTOSS = urljoin(PROTEINS_PLUS_URL, 'protoss/')
PROTOSS_JOBS = urljoin(PROTEINS_PLUS_URL, 'protoss/jobs/')
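
The trailing slash on `PROTEINS_PLUS_URL` matters for `urljoin`. The following sketch shows the standard-library behaviour with the same base URL; the variable names are only for illustration and are not used elsewhere in this notebook.

```python
from urllib.parse import urljoin

BASE = 'https://proteins.plus/api/v2/'

# Base ends with '/': the relative path is appended below v2/.
with_slash = urljoin(BASE, 'protoss/')

# Base without the trailing '/': the last path segment ('v2') is replaced.
without_slash = urljoin(BASE.rstrip('/'), 'protoss/')

# A relative path starting with '/' replaces the whole path of the base.
from_root = urljoin(BASE, '/protoss/')

print(with_slash)     # https://proteins.plus/api/v2/protoss/
print(without_slash)  # https://proteins.plus/api/protoss/
print(from_root)      # https://proteins.plus/protoss/
```

This is why the endpoint constants above are written as relative paths without a leading slash.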

In [5]:
#@title Utility functions to call the API (unhide if you're interested)

# check server connection
try:
    response = requests.get(PROTEINS_PLUS_URL)
except requests.ConnectionError as error:
    if 'Connection refused' in str(error):
        print('WARNING: could not establish a connection to the server', file=sys.stderr)
    raise
    
def poll_job(job_id, poll_url, poll_interval=1, max_polls=10):
    """Poll the progress of a job
    
    Continuously polls the server at regular intervals and updates the job information, especially the status.
    
    :param job_id: UUID of the job to poll
    :type job_id: str
    :param poll_url: URL to send the polling request to
    :type poll_url: str
    :param poll_interval: time interval between polls in seconds
    :type poll_interval: int
    :param max_polls: maximum number of times to poll before exiting
    :type max_polls: int
    :return: polled job
    :rtype: dict
    """
    job = requests.get(poll_url + job_id + '/').json()
    status = job['status']
    current_poll = 0
    while status == 'pending' or status == 'running':
        print(f'Job {job_id} is { status }')
        current_poll += 1
        if current_poll >= max_polls:
            print(f'Job {job_id} has not completed after {max_polls} polling requests' \
                  f' and {poll_interval * max_polls} seconds')
            return job
        time.sleep(poll_interval)
        job = requests.get(poll_url + job_id + '/').json()
        status = job['status']
    print(f'Job {job_id} completed with { status }')
    return job
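
To see the polling logic in isolation, here is a minimal sketch that follows the same loop as `poll_job` but takes the request as a callable, so it can be tried without a server. `poll_until_done`, `fetch_job` and the `sleep` parameter are hypothetical names introduced for this illustration; they are not part of the ProteinsPlus API.

```python
import time


def poll_until_done(fetch_job, poll_interval=1.0, max_polls=10, sleep=time.sleep):
    """Call fetch_job() until the job is no longer pending or running.

    Returns a tuple (job, finished). finished is False if the polling
    budget ran out while the job was still pending or running.
    """
    job = fetch_job()
    polls = 0
    while job['status'] in ('pending', 'running'):
        polls += 1
        if polls >= max_polls:
            return job, False
        sleep(poll_interval)
        job = fetch_job()
    return job, True


# Simulated server answers instead of real HTTP requests.
fake_answers = iter([
    {'status': 'pending'},
    {'status': 'running'},
    {'status': 'success'},
])
demo_job, demo_finished = poll_until_done(
    lambda: next(fake_answers),
    sleep=lambda seconds: None,  # skip the real waiting in the demo
)
print(demo_job['status'], demo_finished)
```

Returning the `finished` flag makes it explicit whether the loop ended because the job completed or because `max_polls` was reached, which `poll_job` only reports through a printed message.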

## Add hydrogens to a PDB structure

The following PDB file does not have any hydrogens. Missing hydrogen atoms are very common when working with crystal structures from the PDB, which makes hydrogen prediction a common preprocessing step for all kinds of downstream work, like docking, molecular dynamics calculations and many more.

In [6]:
# download the protein file from the PDB
file_4agm = Path(PDBList().retrieve_pdb_file('4agm', file_format='pdb'))
os.rename(file_4agm, file_4agm.stem + '.pdb')
file_4agm = file_4agm.stem + '.pdb' # ProteinsPlus needs .pdb extension

# visualize it
protein_structure = PDBParser().get_structure('4agm', file_4agm)
view = nv.show_biopython(protein_structure)
view.add_representation(repr_type='ball+stick', selection='protein')
view

Downloading PDB structure '4agm'...




NGLWidget()

We only have to upload the PDB file to the Protoss API endpoint. Protoss will predict the hydrogens in just a few seconds.

In [7]:
with open(file_4agm) as upload_file:
    query = {'protein_file': upload_file}
    job_submission = requests.post(PROTOSS, files=query).json()
protoss_job = poll_job(job_submission['job_id'], PROTOSS_JOBS)
protossed_protein = requests.get(PROTEINS + protoss_job['output_protein'] + '/').json()

Job 349d587c-a001-4b70-a7eb-07842320a0b4 completed with success
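
Since `poll_job` also returns when a job is still running or has failed, it can help to check the status before requesting the output. The sketch below assumes that any status other than `'success'` (the value printed above) means there is no usable output. `require_success` and `JobNotSuccessfulError` are hypothetical helpers, not part of the API.

```python
class JobNotSuccessfulError(RuntimeError):
    """Raised when a polled job did not end with status 'success'."""


def require_success(job, job_id='<unknown>'):
    """Return the job unchanged if it succeeded, otherwise raise.

    Assumption: 'success' is the only status that guarantees output;
    'pending', 'running' and anything else are treated as not usable.
    """
    status = job.get('status')
    if status != 'success':
        raise JobNotSuccessfulError(
            f'Job {job_id} has status {status!r}; no output to fetch'
        )
    return job


# Demo with plain dictionaries instead of real server answers.
checked = require_success({'status': 'success', 'output_protein': 'abc'}, 'demo')
print(checked['output_protein'])

try:
    require_success({'status': 'running'}, 'demo')
except JobNotSuccessfulError as error:
    print(error)
```

In this notebook it could be applied as `require_success(protoss_job, job_submission['job_id'])` before reading `protoss_job['output_protein']`.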


The server returns a PDB file with hydrogens. We can look at the structure and see the hydrogens:



In [8]:
protein_file = io.StringIO(protossed_protein['file_string'])
protein_structure = PDBParser().get_structure(protossed_protein['name'], protein_file)
view = nv.show_biopython(protein_structure)
view.add_representation(repr_type='ball+stick', selection='protein')
view



NGLWidget()
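
Besides the visual check, the hydrogens can be counted directly in the PDB text. The following is a minimal sketch that reads the element symbol from columns 77-78 of `ATOM` and `HETATM` records, assuming the file fills in that column. `count_elements` and `demo_atom_line` are hypothetical helpers, and the demo records are made up for illustration.

```python
from collections import Counter


def count_elements(pdb_text):
    """Count ATOM/HETATM records per element symbol (PDB columns 77-78)."""
    counts = Counter()
    for line in pdb_text.splitlines():
        if line.startswith(('ATOM', 'HETATM')):
            element = line[76:78].strip().upper()
            if element:
                counts[element] += 1
    return counts


def demo_atom_line(serial, atom_name, element):
    """Build a fixed-width ATOM record with dummy coordinates (demo only)."""
    return (
        f"{'ATOM':<6}{serial:>5} {atom_name:<4}{'GLY':>4} {'A'}{1:>4}{'':4}"
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}"
    )


# Made-up records standing in for a structure before and after Protoss.
before_text = '\n'.join([
    demo_atom_line(1, 'N', 'N'),
    demo_atom_line(2, 'CA', 'C'),
])
after_text = '\n'.join([
    demo_atom_line(1, 'N', 'N'),
    demo_atom_line(2, 'CA', 'C'),
    demo_atom_line(3, 'H', 'H'),
    demo_atom_line(4, 'HA2', 'H'),
])

hydrogens_before = count_elements(before_text)['H']
hydrogens_after = count_elements(after_text)['H']
print(hydrogens_before, hydrogens_after)
```

With the real data, `count_elements(protossed_protein['file_string'])['H']` would give the number of hydrogens in the returned structure.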

## Predict the hydrogens for a non-native ligand
Protoss can take an additional ligand file into account, for example from a docking experiment, and predict hydrogens for the whole protein-ligand complex. To demonstrate this, we will take the ligand NXG_A_1294 from PDB entry 4AGN and place it into 4AGM, replacing clashing ligands in the process.


To do this we first download 4AGN from the PDB and extract the ligand NXG_A_1294 to an SDF file.

In [9]:
# selector to extract the ligand we want from the biopython structure
class SingleResidueSelect(Select):

  def __init__(self, name, chain, identifier):
    """Selector to select specific residue from biopython structure.

    Residue can be amino acid, ligand, metal, water, etc. 
        
    :param name: residue name
    :type name: str
    :param chain: chain id
    :type chain: str
    :param identifier: ligand infile id
    :type identifier: int
    """
    self.name = name
    self.chain = chain
    self.identifier = identifier

  def accept_residue(self, residue):
    """Accept or reject a residue.

    :param residue: residue to check
    :type residue: Bio.PDB.Residue.Residue
    :return: 1 if residue should be selected. 0 otherwise.
    :rtype: int
    """
    chain = residue.get_full_id()[2]
    identifier = residue.get_id()[1]
    if residue.get_resname() == self.name \
        and self.chain == chain \
        and self.identifier == identifier:
      return 1
    else:
      return 0

# fetch the protein 4agn from the PDB
file_4agn = Path(PDBList().retrieve_pdb_file('4agn', file_format='pdb'))
os.rename(file_4agn, file_4agn.stem + '.pdb')
file_4agn = file_4agn.stem + '.pdb' # ProteinsPlus needs .pdb extension
with warnings.catch_warnings():
  warnings.simplefilter('ignore', PDBConstructionWarning)
  structure_4agn = PDBParser().get_structure('4agn', file_4agn)
# save ligand NXG_A_1294 to PDB file using biopython
pdbio = PDBIO()
pdbio.set_structure(structure_4agn)
pdbio.save("NXG_A_1294.pdb", SingleResidueSelect('NXG', 'A', 1294))
# read ligand again and save it as SDF with rdkit
mol_NXG_A_1294 = Chem.MolFromPDBFile("NXG_A_1294.pdb")
with Chem.SDWriter("NXG_A_1294.sdf") as w:
  w.write(mol_NXG_A_1294)

Downloading PDB structure '4agn'...
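
The three criteria used by `SingleResidueSelect` (residue name, chain, sequence number) correspond to fixed columns in PDB `ATOM`/`HETATM` records. As a rough standard-library sketch of the same selection on raw text, with `select_residue_lines` and `demo_hetatm_line` as hypothetical helpers and made-up demo records:

```python
def select_residue_lines(pdb_text, name, chain, identifier):
    """Return the ATOM/HETATM lines of one residue.

    Uses the fixed PDB columns: residue name 18-20, chain 22,
    residue sequence number 23-26.
    """
    selected = []
    for line in pdb_text.splitlines():
        if not line.startswith(('ATOM', 'HETATM')):
            continue
        if (line[17:20].strip() == name
                and line[21:22] == chain
                and int(line[22:26]) == identifier):
            selected.append(line)
    return selected


def demo_hetatm_line(serial, atom_name, resname, chain, resseq, element):
    """Build a fixed-width HETATM record with dummy coordinates (demo only)."""
    return (
        f"{'HETATM':<6}{serial:>5} {atom_name:<4}{resname:>4} {chain}{resseq:>4}{'':4}"
        f"{0.0:8.3f}{0.0:8.3f}{0.0:8.3f}{1.0:6.2f}{0.0:6.2f}{'':10}{element:>2}"
    )


# Made-up records: the same ligand name in two chains plus a water.
demo_text = '\n'.join([
    demo_hetatm_line(1, 'C1', 'NXG', 'A', 1294, 'C'),
    demo_hetatm_line(2, 'N1', 'NXG', 'A', 1294, 'N'),
    demo_hetatm_line(3, 'C1', 'NXG', 'B', 1294, 'C'),
    demo_hetatm_line(4, 'O', 'HOH', 'A', 2001, 'O'),
])

ligand_lines = select_residue_lines(demo_text, 'NXG', 'A', 1294)
print(len(ligand_lines))
```

The Biopython selector above remains the more robust choice, since it works on the parsed structure rather than on column positions.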


Let's look at the ligand in SDF format:

In [10]:
!head "NXG_A_1294.sdf"


     RDKit          3D

 24 25  0  0  0  0  0  0  0  0999 V2000
   91.4340   93.5340  -45.2570 C   0  0  0  0  0  0  0  0  0  0  0  0
   90.7660   95.8310  -44.8540 C   0  0  0  0  0  0  0  0  0  0  0  0
   90.7020   98.2900  -44.6610 N   0  0  0  0  0  0  0  0  0  0  0  0
   90.9900   96.8240  -42.6710 O   0  0  0  0  0  0  0  0  0  0  0  0
   91.8080   93.4050  -43.9090 C   0  0  0  0  0  0  0  0  0  0  0  0
   90.1700   97.1080  -45.3810 C   0  0  0  0  0  0  0  0  0  0  0  0
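
The fourth line of the output is the V2000 counts line: its first two three-character fields hold the number of atoms and bonds. A minimal sketch for reading them, where `read_counts_line` is a hypothetical helper and the demo header is copied from the output above:

```python
def read_counts_line(mol_block):
    """Return (atom_count, bond_count) from a V2000 mol block.

    The counts line is the fourth line, after the three header lines.
    """
    lines = mol_block.splitlines()
    if len(lines) < 4:
        raise ValueError('mol block has no counts line')
    counts = lines[3]
    if not counts.rstrip().endswith('V2000'):
        raise ValueError('only V2000 mol blocks are handled in this sketch')
    return int(counts[0:3]), int(counts[3:6])


# Header lines as shown in the SDF output above.
demo_header = '\n'.join([
    '',
    '     RDKit          3D',
    '',
    ' 24 25  0  0  0  0  0  0  0  0999 V2000',
])

atom_count, bond_count = read_counts_line(demo_header)
print(atom_count, bond_count)  # 24 25
```

The ligand extracted here therefore has 24 heavy atoms and 25 bonds before Protoss adds hydrogens.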


Now we submit the 4AGN ligand together with the 4AGM protein to Protoss:



In [11]:
with open('NXG_A_1294.sdf') as upload_ligand_file:
    with open(file_4agm) as upload_file:
        query = {'protein_file': upload_file, 'ligand_file': upload_ligand_file}
        other_job_submission = requests.post(PROTOSS, files=query).json()
other_protoss_job = poll_job(other_job_submission['job_id'], PROTOSS_JOBS)
other_protossed_protein = requests.get(PROTEINS + other_protoss_job['output_protein'] + '/').json()
other_protossed_ligand = requests.get(LIGANDS + other_protossed_protein['ligand_set'][0] + '/').json()

Job 95f38f68-fcc2-4b33-9872-8c4a328a25b6 completed with success


In [12]:
other_protein_file = io.StringIO(other_protossed_protein['file_string'])
other_protein_structure = PDBParser().get_structure(other_protossed_protein['name'], other_protein_file)
ligand_structure = Chem.MolFromMolBlock(other_protossed_ligand['file_string'], removeHs=False)

view = nv.show_biopython(other_protein_structure)
# uncomment for protein hydrogens
# view.add_representation(repr_type='ball+stick', selection='protein')
view.add_structure(nv.RdkitStructure(ligand_structure))
view



NGLWidget()

NXG_A_1294 replaced one of the ligands in the PDB entry, but the other was kept. If you want to remove all detected ligands, you can preprocess the protein first and then submit the empty protein with a custom ligand to Protoss:

In [13]:
with open(file_4agm) as upload_file:
    query = {'protein_file': upload_file}
    preprocessing_job_submission = requests.post(UPLOAD, files=query).json()
preprocessing_job = poll_job(preprocessing_job_submission['job_id'], UPLOAD_JOBS)

with open('NXG_A_1294.sdf') as upload_ligand_file:
    query = {'ligand_file': upload_ligand_file}
    params = {'protein_id': preprocessing_job['output_protein']}  # remember to pass params that aren't files as data
    replacing_protoss_job_submission = requests.post(PROTOSS, data=params, files=query).json()
replacing_protoss_job = poll_job(replacing_protoss_job_submission['job_id'], PROTOSS_JOBS)
replaced_protossed_protein = requests.get(PROTEINS + replacing_protoss_job['output_protein'] + '/').json()
replaced_protossed_ligand = requests.get(LIGANDS + replaced_protossed_protein['ligand_set'][0] + '/').json()

Job 0d7a3ec5-e74c-450a-8c7b-9231f8b356f8 completed with success
Job a7b1e512-368d-4f6c-997b-234e31f69481 completed with success
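
The cell above passes the protein ID via `data=` and the ligand via `files=`. To see how `requests` combines the two into one multipart form, a request can be prepared without sending it. This is only a sketch: `build_protoss_request`, the placeholder protein ID, and the ligand content are made up for illustration.

```python
import requests


def build_protoss_request(url, protein_id, ligand_name, ligand_bytes):
    """Prepare (but do not send) a POST with one form field and one file."""
    request = requests.Request(
        'POST',
        url,
        data={'protein_id': protein_id},                      # plain form field
        files={'ligand_file': (ligand_name, ligand_bytes)},   # file upload
    )
    return request.prepare()


prepared = build_protoss_request(
    'https://proteins.plus/api/v2/protoss/',
    'hypothetical-protein-id',   # placeholder, not a real ID
    'ligand.sdf',
    b'placeholder ligand content',
)

# Both parts end up in the same multipart/form-data body.
print(prepared.headers['Content-Type'].split(';')[0])
print(b'name="protein_id"' in prepared.body)
print(b'name="ligand_file"' in prepared.body)
```

Nothing is sent to the server here; `prepare()` only builds the headers and body that `requests.post` would transmit.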


In [15]:
replaced_protein_file = io.StringIO(replaced_protossed_protein['file_string'])
replaced_protein_structure = PDBParser().get_structure(replaced_protossed_protein['name'], replaced_protein_file)
replaced_ligand_structure = Chem.MolFromMolBlock(replaced_protossed_ligand['file_string'], removeHs=False)

view = nv.show_biopython(replaced_protein_structure)
# uncomment for protein hydrogens
# view.add_representation(repr_type='ball+stick', selection='protein')
view.add_structure(nv.RdkitStructure(replaced_ligand_structure))
view



NGLWidget()