[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/rareylab/proteins_plus_examples/blob/main/notebooks/MoleculeHandler_example.ipynb)



# MoleculeHandler: Working with proteins and ligands 
You can upload PDB structure files before running any ProteinsPlus tools. The uploaded structure file is automatically split into the protein(s) and ligand(s) it contains, which can then be accessed, viewed, and further processed by all ProteinsPlus tools through the API.

The server handles molecular structures using the [NAOMI Chembio Suite](https://software.zbh.uni-hamburg.de/). We suggest processing PDB/molecule files with the molecule handler rather than splitting protein and ligand with other common libraries such as Biopython and RDKit. The main reason is that the ProteinsPlus tools are well tested with NAOMI. In addition, NAOMI comes with a strong chemical model and can handle many edge cases in molecular input.


Note: NGLview triggers the Colab code snippet sidebar every time a structure is visualized. Resize the sidebar rather than closing it. Sometimes an NGL view stays white and shows no structure; in that case, just run the cell again.

In [None]:
# colab allow nglview plugin
from google.colab import output
output.enable_custom_widget_manager()

In [None]:
# colab install dependencies
!pip install biopython &>> output.log
!pip install nglview &>> output.log
!pip install rdkit &>> output.log

In [None]:
# imports
import io
from pathlib import Path
import requests
import sys
import time
from urllib.parse import urljoin

from IPython.display import Image
from Bio.PDB import PDBParser
import nglview as nv
from rdkit import Chem

In [None]:
# constants
PROTEINS_PLUS_URL = 'https://proteins.plus/api/v2/'
UPLOAD = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/upload/')
UPLOAD_JOBS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/upload/jobs/')
PROTEINS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/proteins/')
LIGANDS = urljoin(PROTEINS_PLUS_URL, 'molecule_handler/ligands/')
PROTOSS = urljoin(PROTEINS_PLUS_URL, 'protoss/')
PROTOSS_JOBS = urljoin(PROTEINS_PLUS_URL, 'protoss/jobs/')

In [None]:
#@title Utils functions to call API (unhide if you're interested)
# utils

# check server connection
try:
    response = requests.get(PROTEINS_PLUS_URL)
except requests.ConnectionError as error:
    if 'Connection refused' in str(error):
        print('WARNING: could not establish a connection to the server',
              file=sys.stderr)
    raise
    
def poll_job(job_id, poll_url, poll_interval=1, max_polls=10):
    """Poll the progress of a job
    
    Continuously polls the server in regular intervals and updates the job
    information, especially the status.
    
    :param job_id: UUID of the job to poll
    :type job_id: str
    :param poll_url: URL to send the polling request to
    :type poll_url: str
    :param poll_interval: time interval between polls in seconds
    :type poll_interval: int
    :param max_polls: maximum number of times to poll before exiting
    :type max_polls: int
    :return: polled job
    :rtype: dict
    """
    job = requests.get(poll_url + job_id + '/').json()
    status = job['status']
    current_poll = 0
    while status == 'pending' or status == 'running':
        print(f'Job {job_id} is { status }')
        current_poll += 1
        if current_poll >= max_polls:
            print(f'Job {job_id} has not completed after {max_polls} polling '
                  f'requests and {poll_interval * max_polls} seconds')
            return job
        time.sleep(poll_interval)
        job = requests.get(poll_url + job_id + '/').json()
        status = job['status']
    print(f'Job {job_id} completed with { status }')
    return job

def print_data_fields(model):
    """Print the fields of a model
    
    :param model: data model
    :type model: dict
    """
    for field in model.keys():
        print(f' - "{field}"')

## Upload by PDB code or file
The molecule handler is an entry point for working with molecular data. It is largely optional because most other API calls can be made without a round trip to the molecule handler. It registers a protein in the database so that it can be referred to by its ID alone. It also detects ligands and generates 2D images for them.

Let's start with a PDB entry. To work with a PDB entry we only need to POST the PDB code to the server and it will query the PDB for us.

In [None]:
query = {'pdb_code': '4agm'}
job_submission = requests.post(UPLOAD, data=query).json()

This call is equivalent to the following file upload:

In [None]:
# # upload a file via colab
# from google.colab import files
# uploaded = files.upload()
# upload_file = io.StringIO(next(iter(uploaded.values())).decode())
# # send a file to ProteinsPlus
# query = {'protein_file': upload_file}
# job_submission = requests.post(UPLOAD, files=query).json()
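
Outside Colab, the same upload can be made from a file on disk. The following is a minimal sketch, not part of the original notebook: the helper name `upload_structure_file` and its `post` parameter are hypothetical, and only the `protein_file` form field and the upload URL come from the cells above.

```python
from pathlib import Path

# Same endpoint as the UPLOAD constant defined earlier in this notebook.
UPLOAD = 'https://proteins.plus/api/v2/molecule_handler/upload/'


def upload_structure_file(path, post=None):
    """Submit a local structure file and return the parsed job submission.

    `post` is a hypothetical injection point so the helper can be exercised
    without network access; by default it falls back to requests.post.
    """
    if post is None:
        import requests  # imported lazily, only needed for real uploads
        post = requests.post
    # Send the file under the 'protein_file' form field used above.
    with Path(path).open('rb') as handle:
        response = post(UPLOAD, files={'protein_file': handle})
    return response.json()
```

Calling `upload_structure_file('my_structure.pdb')` (a placeholder file name) should then yield the same kind of job submission dict as the PDB code query.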

We have immediately parsed the JSON response and can now keep working with a Python dict containing the job submission data.

In [None]:
job_id = job_submission['job_id']
if job_submission['retrieved_from_cache']:
    print(f'Job {job_id} could be retrieved from cache')

Job 1196ac48-13d4-4163-b283-cc61d0559524 could be retrieved from cache


The job submission data contains the job ID (a UUID) and a flag indicating whether the job was retrieved from cache. Caching jobs saves the server and you a lot of CPU time. If you are working on a PDB entry, chances are it has already been processed and you can retrieve it instantly. Let's do that now:

In [None]:
job = poll_job(job_id, UPLOAD_JOBS)
print('Job data fields:')
print_data_fields(job)
    
print()
protein_id = job['output_protein']
print(f'Preprocessed protein ID: {protein_id}')

Job 1196ac48-13d4-4163-b283-cc61d0559524 completed with success
Job data fields:
 - "id"
 - "status"
 - "date_created"
 - "date_last_accessed"
 - "error"
 - "protein_name"
 - "pdb_code"
 - "output_protein"
 - "protein_string"
 - "ligand_string"

Preprocessed protein ID: b6d5dfa3-2ad1-40e5-9d64-357f54be0ae8
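
Because `poll_job` also returns when it gives up polling, the returned job is not guaranteed to have succeeded. A small guard can make that explicit. It is sketched here under the assumption that `'success'` is the only status with a usable `output_protein`; the helper name `require_output_protein` is hypothetical.

```python
def require_output_protein(job):
    """Return the output protein ID of a job, or raise if the job is unusable."""
    status = job.get('status')
    # poll_job returns unfinished jobs once max_polls is reached.
    if status in ('pending', 'running'):
        raise TimeoutError(f"Job {job.get('id')} is still {status}")
    # Assumption: any other status than 'success' means the job failed.
    if status != 'success':
        raise RuntimeError(
            f"Job {job.get('id')} ended with status {status!r}: "
            f"{job.get('error')}"
        )
    return job['output_protein']
```

With this guard, `protein_id = require_output_protein(job)` fails early with a readable message instead of passing a missing ID on to later requests.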


A job has a number of data fields, many of which are shared across jobs, such as "status" or "date_created". You can find a full list of fields in the [reference documentation](https://proteins.plus/api/v2/). In this case we have preprocessed a PDB entry and so are interested in the "output_protein". This will be the ID of our protein. Let's retrieve our protein:

In [None]:
protein = requests.get(PROTEINS + protein_id + '/').json()
print('Protein data fields:')
print_data_fields(protein)

Protein data fields:
 - "id"
 - "name"
 - "pdb_code"
 - "file_type"
 - "ligand_set"
 - "file_string"
 - "date_created"
 - "date_last_accessed"
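
Since "file_string" holds the complete file content, it can also be written to disk for use with local tools. This sketch assumes that "file_type" contains a plain file extension such as `pdb`, which the field list above does not confirm; the helper name `save_protein_file` is hypothetical.

```python
from pathlib import Path


def save_protein_file(protein, directory='.'):
    """Write a protein's file_string to <directory>/<name>.<file_type>."""
    directory = Path(directory)
    directory.mkdir(parents=True, exist_ok=True)
    # Assumption: file_type is usable as a file extension (e.g. 'pdb').
    target = directory / f"{protein['name']}.{protein['file_type']}"
    target.write_text(protein['file_string'])
    return target
```

For example, `save_protein_file(protein, 'structures')` would return the path of the written file.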


As you can see, the protein has a "file_string". We can use it to load the protein with Biopython and display it in nglview:

In [None]:
protein_file = io.StringIO(protein['file_string'])
protein_structure = PDBParser().get_structure(protein['name'], protein_file)

view = nv.show_biopython(protein_structure)
view.add_representation(repr_type='ball+stick', selection='protein')
view



NGLWidget()

You can see that we're missing the ligands in the structure. The ligands are associated with the protein via the "ligand_set" field. Let's retrieve them:

In [None]:
print('Ligand IDs: ' + str(protein['ligand_set']))
ligand = requests.get(LIGANDS + protein['ligand_set'][0] + '/').json()  # get the first ligand
other_ligand = requests.get(LIGANDS + protein['ligand_set'][1] + '/').json()  # get the second ligand
Image(url=ligand['image'], width=400, height=400)  # freely scalable SVG

Ligand IDs: ['6b870bf2-8042-4e35-9a94-da495d10debb', '91fff082-d8f1-4128-9a24-0b09e0aa44e1']
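
The cell above indexes the first two ligands directly, which only works for structures with at least two ligands. A more general sketch loops over the whole "ligand_set". The helper name `fetch_ligands` and its `get_json` parameter are hypothetical; `get_json` stands in for something like `lambda url: requests.get(url).json()`.

```python
# Same endpoint as the LIGANDS constant defined earlier in this notebook.
LIGANDS = 'https://proteins.plus/api/v2/molecule_handler/ligands/'


def fetch_ligands(protein, get_json, ligands_url=LIGANDS):
    """Fetch every ligand referenced by a protein's ligand_set.

    `get_json` is any callable that takes a URL and returns parsed JSON.
    """
    return [
        get_json(f'{ligands_url}{ligand_id}/')
        for ligand_id in protein['ligand_set']
    ]
```

The result is a list with one ligand dict per ID, in the order of "ligand_set", and it is empty for structures without ligands.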


Preprocessing a structure splits the ligands from the protein and tries to generate 2D images for them. We can also load these into nglview and look at ligands and proteins individually:

In [None]:
# the ligand's experimentally determined 3D structure
ligand_structure = Chem.MolFromMolBlock(ligand['file_string'], removeHs=True)
view = nv.show_rdkit(ligand_structure)
view.add_representation(repr_type='ball+stick', selection='protein')
view

NGLWidget()

In [None]:
# the protein-ligand(s) complex
ligand_structure = Chem.MolFromMolBlock(ligand['file_string'], removeHs=False)
other_ligand_structure = Chem.MolFromMolBlock(other_ligand['file_string'], removeHs=False)

view = nv.NGLWidget()
view.add_structure(nv.RdkitStructure(ligand_structure))
view.add_structure(nv.RdkitStructure(other_ligand_structure))
view.add_structure(nv.BiopythonStructure(protein_structure))
view

NGLWidget()

## Run tools with uploaded data

We can use the IDs of the ligands and the protein with other tools on [proteins.plus](https://proteins.plus), for example [Protoss](https://doi.org/10.1186/1758-2946-6-12), to predict hydrogens for the complex:

In [None]:
# run protoss on the server (detailed explanation in the protoss example)
query = {'protein_id': protein['id']}  # our preprocessed protein ID
protoss_job_submission = requests.post(PROTOSS, data=query).json()
protoss_job = poll_job(protoss_job_submission['job_id'], PROTOSS_JOBS)
protossed_protein = requests.get(PROTEINS + protoss_job['output_protein'] + '/').json()
protossed_protein_file = io.StringIO(protossed_protein['file_string'])

# load and visualize the protein with protoss hydrogens
protossed_protein_structure = PDBParser().get_structure(protossed_protein['name'], protossed_protein_file)
view = nv.show_biopython(protossed_protein_structure)
view.add_representation(repr_type='ball+stick', selection='protein')
view

Job d79244c1-187f-44b5-90ab-2a23e9ab95c2 completed with success




NGLWidget()

Notice how all we had to do was give the server the ID of the protein. The server will keep such entries for about a week after they were last accessed.