# Talktorial 11 (part B)

# CADD web services that can be used via a Python API

__Developed at AG Volkamer, Charité__

Dr. Jaime Rodríguez-Guerra

## Aim of this talktorial

> This is part B of the "Online webservices" talktorial:
>
> - 11a. Querying KLIFS & PubChem for potential kinase inhibitors
> - __11b. Docking the candidates against the target obtained in 11a__
> - 11c. Assessing the results and comparing against known data


After obtaining the input structures, we will use molecular docking software to find good protein-ligand poses.

## Learning goals

### Theory

- Molecular docking basics
- Available software

### Practical

- Prepare the structures
- Run the calculation
- Save the results

### Discussion

Pending.

### Quiz

Pending.

***

# Theory: what is molecular docking?

Protein-ligand interactions are mainly governed by non-covalent interactions.

There are several ways to analyze the vast search space that results from exploring multiple conformations and chemical variations:

- Molecular mechanics
- Shape recognition
- Knowledge-based
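
To make the molecular mechanics idea more concrete, here is a minimal sketch of a pairwise scoring function. It sums a 12-6 Lennard-Jones term over ligand-protein atom pairs. The function names, the `epsilon` and `sigma` values, and the coordinates are illustrative assumptions; they do not correspond to the scoring function of any particular docking program.

```python
import math


def lennard_jones(distance, epsilon=0.2, sigma=3.5):
    """12-6 Lennard-Jones energy for one atom pair (toy parameters)."""
    ratio = sigma / distance
    return 4 * epsilon * (ratio ** 12 - ratio ** 6)


def score_pose(ligand_atoms, protein_atoms, cutoff=8.0):
    """Sum the pair energies for all ligand-protein pairs within the cutoff."""
    total = 0.0
    for ligand_xyz in ligand_atoms:
        for protein_xyz in protein_atoms:
            distance = math.dist(ligand_xyz, protein_xyz)
            if distance < cutoff:
                total += lennard_jones(distance)
    return total


# Hypothetical coordinates: two protein atoms and two candidate ligand poses
protein = [(0.0, 0.0, 0.0), (4.0, 0.0, 0.0)]
good_pose = [(2.0, 3.4, 0.0)]      # sits at a comfortable contact distance
clashing_pose = [(0.5, 0.0, 0.0)]  # overlaps with the first protein atom

print("good pose:", score_pose(good_pose, protein))
print("clashing pose:", score_pose(clashing_pose, protein))
```

A lower (more negative) score means a more favorable pose in this toy model, while the clashing pose is heavily penalized. Production scoring functions typically combine several such terms and are parameterized much more carefully.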

## Known limitations

- False positives
- Energetic accuracy
- Choosing the correct binding site


## Existing software

Commercial:
- GOLD
- Schrödinger

Free (or free for academics):
- AutoDock Vina
- DOCK
- OpenEye

# Practice


## Docking

There are a couple of webservices available online for free use: SwissDock and the OPAL webservices (which include AutoDock Vina).

### SwissDock

* Role: Perform docking calculations
* Website: http://www.swissdock.ch
* API: Yes, SOAP-based. No official client, use `suds`.
* Documentation: http://www.swissdock.ch/pages/soap_access
* Literature:
    * Nucleic Acids Res. 2011 Jul;39(Web Server issue):W270-7. doi: 10.1093/nar/gkr366. https://academic.oup.com/nar/article/39/suppl_2/W270/2506492
    * J Comput Chem. 2011 Jul 30;32(10):2149-59. doi: 10.1002/jcc.21797. https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.21797

> SwissDock, a web service to predict the molecular interactions that may occur between a target protein and a small molecule.
> SwissDock is based on the docking software EADock DSS, whose algorithm consists of the following steps:
> 1. Many binding modes are generated either in a box (local docking) or in the vicinity of all target cavities (blind docking).
> 2. Simultaneously, their CHARMM energies are estimated on a grid.
> 3. The binding modes with the most favorable energies are evaluated with FACTS, and clustered.
> 4. The most favorable clusters can be visualized online and downloaded on your computer.


### OPAL webservices
* Role: CADD as a service
* Website: http://nbcr-222.ucsd.edu/opal2/dashboard
* API: Yes, SOAP-based. No official client, use `suds`.
* Documentation: http://nbcr-222.ucsd.edu/opal2/dashboard?command=docs (currently offline)
* Literature:
    * Nucleic Acids Res. 2010 Jul;38(Web Server issue):W724-31. doi: 10.1093/nar/gkq503 https://academic.oup.com/nar/article/38/suppl_2/W724/1122840
    * J Comput Chem. 2010 Jan 30; 31(2): 455–461. doi: 10.1002/jcc.21334 https://onlinelibrary.wiley.com/doi/abs/10.1002/jcc.21334
    * Opal: Simple Web Services Wrappers for Scientific Applications http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.533.7960&rep=rep1&type=pdf
    
> Biomedical applications have become increasingly complex, and they often require large-scale high-performance computing resources with a large number of processors and memory. The complexity of application deployment and the advances in cluster, grid and cloud computing require new modes of support for biomedical research. Scientific Software as a Service (sSaaS) enables scalable and transparent access to biomedical applications through simple standards-based Web interfaces. Towards this end, we built a production web server (http://ws.nbcr.net) in August 2007 to support the bioinformatics application called MEME. The server has grown since to include docking analysis with AutoDock and AutoDock Vina, electrostatic calculations using PDB2PQR and APBS, and off-target analysis using SMAP. All the applications on the servers are powered by Opal, a toolkit that allows users to wrap scientific applications easily as web services without any modification to the scientific codes, by writing simple XML configuration files. Opal allows both web forms-based access and programmatic access of all our applications. The Opal toolkit currently supports SOAP-based Web service access to a number of popular applications from the National Biomedical Computation Resource (NBCR) and affiliated collaborative and service projects. In addition, Opal's programmatic access capability allows our applications to be accessed through many workflow tools, including Vision, Kepler, Nimrod/K and VisTrails. From mid-August 2007 to the end of 2009, we have successfully executed 239,814 jobs. The number of successfully executed jobs more than doubled from 205 to 411 per day between 2008 and 2009. The Opal-enabled service model is useful for a wide range of applications. It provides for interoperation with other applications with Web Service interfaces, and allows application developers to focus on the scientific tool and workflow development. Web server availability: http://ws.nbcr.net.


## Binding site prediction

The docking calculation will work best if we perform it in a reasonably small search space, normally covering a single binding 
pocket. To guesstimate the best one, we can use DoGSiteScorer, available for free and online at proteins.plus.
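
As a rough sketch of how a predicted pocket could be turned into a search space: Vina's `center_*` and `size_*` options describe a box, with the sizes given as edge lengths, so a cube that fully encloses a sphere of radius `r` needs edges of at least `2 * r`. The helper names and the optional `padding` argument below are assumptions made for illustration.

```python
def search_box_from_sphere(center, radius, padding=0.0):
    """Return (center, size) for a cube that encloses a sphere plus padding."""
    edge = 2.0 * (radius + padding)
    return list(center), [edge] * 3


def render_vina_config(center, size):
    """Format the box as Vina configuration lines (center_* and size_* keys)."""
    lines = [f"center_{axis} = {value:.3f}" for axis, value in zip("xyz", center)]
    lines += [f"size_{axis} = {value:.3f}" for axis, value in zip("xyz", size)]
    return "\n".join(lines) + "\n"


# Hypothetical pocket: centered at (1, 2, 3) with a radius of 5 Angstrom
center, size = search_box_from_sphere((1.0, 2.0, 3.0), radius=5.0, padding=1.0)
config_text = render_vina_config(center, size)
print(config_text)
```

Larger boxes cover the pocket more safely but make the search more expensive, so the padding is a trade-off rather than a fixed rule.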

### Proteins.plus DoGSiteScorer

* Role: Interactive web interface for several CADD tools
* Website: http://proteins.plus
* API: Yes, REST-based. Simple enough to apply bare `requests`
* Documentation: https://proteins.plus/help/index
* Literature:
    * J. Chem. Inf. Model. 2010, 50, 11, 2041-2052, https://doi.org/10.1021/ci100241y
    * Bioinformatics, Volume 28, Issue 15, 1 August 2012, Pages 2074–2075, https://doi.org/10.1093/bioinformatics/bts310

> Automated prediction of protein active sites is essential for large-scale protein function prediction, classification, and druggability estimates. In this work, we present DoGSite, a new structure-based method to predict active sites in proteins based on a Difference of Gaussian (DoG) approach which originates from image processing. In contrast to existing methods, DoGSite splits predicted pockets into subpockets, revealing a refined description of the topology of active sites. DoGSite correctly predicts binding pockets for over 92% of the PDBBind and the scPDB data set, being in line with the best-performing methods available. In 63% of the PDBBind data set the detected pockets can be subdivided into smaller subpockets. The cocrystallized ligand is contained in exactly one subpocket in 87% of the predictions. Furthermore, we introduce a more precise prediction performance measure by taking the pairwise ligand and pocket coverage into account. In 90% of the cases DoGSite predicts a pocket that contains at least half of the ligand. In 70% of the cases additionally more than a quarter of the respective pocket itself is covered by the cocrystallized ligand. Consideration of subpockets produces an increase in coverage yielding a success rate of 83% for the latter measure.
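
The Difference of Gaussian idea mentioned in the abstract can be illustrated in one dimension. The sketch below is only a toy: it blurs a signal with a narrow and a wide Gaussian and subtracts the two, which highlights features of a particular width. DoGSite itself works on three-dimensional protein structures; the helper names, the sigma values, and the input signal are assumptions made for illustration.

```python
import math


def gaussian_kernel(sigma, radius):
    """Normalized discrete Gaussian weights on [-radius, radius]."""
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def convolve_same(signal, kernel):
    """Convolve with zero padding, keeping the original signal length."""
    radius = len(kernel) // 2
    out = []
    for i in range(len(signal)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            j = i + k - radius
            if 0 <= j < len(signal):
                acc += signal[j] * weight
        out.append(acc)
    return out


def difference_of_gaussians(signal, sigma_narrow, sigma_wide, radius=6):
    """Subtract the widely blurred signal from the narrowly blurred one."""
    narrow = convolve_same(signal, gaussian_kernel(sigma_narrow, radius))
    wide = convolve_same(signal, gaussian_kernel(sigma_wide, radius))
    return [n - w for n, w in zip(narrow, wide)]


# Hypothetical 1D "density" with a small blob at positions 10-12
signal = [0.0] * 10 + [1.0, 1.0, 1.0] + [0.0] * 10
response = difference_of_gaussians(signal, sigma_narrow=1.0, sigma_wide=3.0)
peak = max(range(len(response)), key=response.__getitem__)
print("strongest response at index", peak)
```

The filter responds most strongly at the center of the blob, which is the kind of signal a pocket detector can then threshold and group.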

# Get files from part A

In [327]:
from rdkit import Chem
from rdkit.Chem import AllChem
# Retrieve protein and ligands from previous steps
with open('data/complex.pdb') as f:
    pdbcomplex = f.read()
with open('data/protein.mol2') as f:
    protein = f.read()
with open('data/similar_smiles.txt') as f:
    smiles = [line.strip() for line in f]

ligands = []
for s in smiles:
    m = Chem.AddHs(Chem.MolFromSmiles(s))
    AllChem.EmbedMolecule(m)
    ligands.append(m)

## Use SwissDock

SwissDock uses a SOAP interface, so we will need to install `suds` for that.

> Notice: The SwissDock servers have not been working lately. Go to the OPAL alternative below!

In [2]:
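from suds.client import Client
import random  # used in dock() to build a random job name
import string
import time  # used by the polling loops below
import zlib
import requests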

def swissdock_client():
    # The server seems to be down at the moment:
    # http://swissdock.vital-it.ch/soap/ replies with 503 Unavailable
    # because the WSDL points to the wrong domain, so we patch a local copy
    SWISSDOCK_WSDL_URL = "http://www.swissdock.ch/soap/wsdl"
    r = requests.get(SWISSDOCK_WSDL_URL)
    r.raise_for_status()
    WSDL = r.text.replace("http://swissdock.vital-it.ch/soap/", "http://www.swissdock.ch/soap/")
    with open("data/swissdock.wsdl", "w") as f:
        f.write(WSDL)
    HERE = _dh[0]
    return Client(f"file://{HERE}/data/swissdock.wsdl")


def prepare_protein(client, protein):
    """
    Given a PDB file (string contents), returns PSF and CRD
    """
    encoded_protein = zlib.compress(protein.encode('utf-8'))
    job_id = client.service.prepareTarget(target=encoded_protein)
    while True:
        result = client.service.isTargetPrepared(jobID=job_id)
        if result is None:
            raise ValueError("No such job present")
        if result in (False, "false", 0):
            time.sleep(5)
        else:  # ready!
            break
    protein_files = client.service.getPreparedTarget(job_id)
    if protein_files is None or len(protein_files) != 2:
        raise ValueError("Could not prepare protein!")
    return protein_files
            

def prepare_ligand(client, ligand):
    """
    Given a MOL2 file (string contents), returns PDB, RTF, PAR.
    
    Ligand must be protonated beforehand!
    """
    encoded_ligand = zlib.compress(ligand.encode('utf-8'))
    job_id = client.service.prepareLigand(ligand=encoded_ligand)
    while True:
        result = client.service.isLigandPrepared(jobID=job_id)
        if result is None:
            raise ValueError("No such job present")
        if result in (False, "false", 0):
            time.sleep(5)
        else:  # ready!
            break
    ligand_files = client.service.getPreparedLigand(job_id)
    if ligand_files is None or len(ligand_files) != 3:
        raise ValueError("Could not prepare ligand!")
    return ligand_files

def dock(client, protein, ligand, name=None):
    protein_psf, protein_crd = prepare_protein(client, protein)
    ligand_pdb, ligand_rtf, ligand_par = prepare_ligand(client, ligand)
    
    if name is None:
        name = "teachopencadd" + ''.join([random.choice(string.ascii_letters) for _ in range(5)])
    job_id = client.service.startDocking(
        protein_psf, protein_crd,
        ligand_pdb,
        [ligand_rtf],
        [ligand_par],
        name)
    if job_id in (None, "None"):
        raise ValueError("Docking job could not be submitted")
    while not client.service.isDockingTerminated(job_id):
        time.sleep(5)
    all_files = client.service.getPredictedDockingAllFiles(job_id)
    with open('docking_results.zip', 'w') as f:
        f.write(all_files)
    target, docked = client.service.getPredictedDocking(job_id)
    client.service.forget(job_id)
    return target, docked
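
The helpers above poll the server in unbounded `while` loops, which can hang forever if a job never finishes. A minimal sketch of a reusable polling helper with a timeout is shown below; `poll_until` and its parameters are hypothetical names, not part of `suds` or the SwissDock API.

```python
import time


def poll_until(check, interval=5.0, timeout=600.0,
               sleep=time.sleep, clock=time.monotonic):
    """
    Call `check()` repeatedly until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass without success.
    `sleep` and `clock` are injectable so the helper is easy to test.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() >= deadline:
            raise TimeoutError(f"Gave up after {timeout} seconds")
        sleep(interval)


# Stand-in for a remote job that reports "ready" on the third status query
status_queries = {"count": 0}


def fake_job_is_ready():
    status_queries["count"] += 1
    return status_queries["count"] >= 3


poll_until(fake_job_is_ready, interval=0.0)
print("ready after", status_queries["count"], "queries")
```

With such a helper, a loop like the one in `dock` could be written as `poll_until(lambda: client.service.isDockingTerminated(job_id))`, assuming the service call returns a truthy value once the job is done.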

In [3]:
try:
    import Mol2Writer
except ImportError:
    # Ugly hack to get Mol2 writer/readers in RDKit
    import os
    working_dir = os.getcwd()
    os.chdir(_dh[0])
    !wget https://raw.githubusercontent.com/rdkit/rdkit/60081d31f45fa8d5e8cef527589264c57dce7c65/rdkit/Chem/Mol2Writer.py > /dev/null
    os.chdir(working_dir)
    import Mol2Writer

In [4]:
def step_03_swissdock(protein, ligand):
    # Protein must be PDB
    # TODO: Convert from MOL2 to PDB
    # rd_protein = Mol2Writer.MolFromCommonMol2Block(protein)
    # protein_pdb = Chem.MolToPDBBlock(rd_protein)
    protein_pdb = protein
    # Ligand must be protonated Mol2
    ligand_mol2 = Mol2Writer.MolToCommonMol2Block(ligand)
    client = swissdock_client()
    return dock(client, protein_pdb, ligand_mol2)

### Perform docking with OPAL webservices
SwissDock has not been working recently, so we resort to another webservice. The interface is a bit more rudimentary, but it should work. However, the protein and ligands must be prepared locally with `AutoDockTools`. I have prepared a Python 3-ready fork; it is not well tested, but it seems to work well enough for our purposes here.

Install it with:

The protocol for the docking calculation is the following:

1. Prepare the protein and the ligands with AutoDockTools (locally)
2. Find the best possible binding pocket with [proteins.plus' DoGSiteScorer](https://proteins.plus/2ozr#dogsite). We will use this information to configure the Vina calculation (geometric center and size of the search space)
3. Run the Vina calculation on OPAL

In [9]:
import time
import os
from io import StringIO

In [8]:
######################
#
# Structure preparation
#
######################

import MolKit
from AutoDockTools.MoleculePreparation import AD4ReceptorPreparation, AD4LigandPreparation

def opal_prepare_protein(protein):
    """
    AutoDock expects PDBQT files
    """
    mol = MolKit.Read(protein)[0]
    mol.buildBondsByDistance()
    RPO = AD4ReceptorPreparation(mol, outputfilename=protein+'.pdbqt')
    return protein + '.pdbqt'

def opal_prepare_ligand(ligand):
    """
    AutoDock expects PDBQT files
    """
    mol = MolKit.Read(ligand)[0]
    mol.buildBondsByDistance()
    RPO = AD4LigandPreparation(mol, outputfilename=ligand+'.pdbqt')
    return ligand + '.pdbqt'

You will also need `BeautifulSoup` to parse some HTML code during the binding site guessing.

In [None]:
######################
#
# Guess binding pocket
#
######################

from bs4 import BeautifulSoup
import requests

def dogsite_scorer_submit_with_pdbid(pdbid, chain=""):
    """This is the official API, but it only accepts PDB codes, not custom files..."""
    # `chain` defaults to an empty string so the function can be called with
    # just a PDB code (assumption: the API accepts an empty chain)
    # Submit job to proteins.plus
    r = requests.post("https://proteins.plus/api/dogsite_rest",
        json={
            "dogsite": {
                "pdbCode": pdbid,
                "analysisDetail": "1",
                "bindingSitePredictionGranularity": "1",
                "ligand": "",
                "chain": chain
            }
        },
        headers={'Content-type': 'application/json', 'Accept': 'application/json'}
    )

    r.raise_for_status()
    # We have to query location for updates on the calculation
    return r.json()['location']

def dogsite_scorer_submit_with_custom_pdb(pdbfile):
    """
    In order to upload a custom PDB, we have to mimic the actual HTML frontend:
    
    1. Obtain the CSRF token from the HTML of the homepage
    2. Post the file to upload it
    3. The returned HTML page will contain the URL ID, which in turn will allow
       us to obtain the internal shorthand job ID. We can use that one to
       retrieve the public job API ID by mimicking the async calls that the
       webserver performs in the frontend (as if we were using the web interface).
    4. Once we obtain the public job ID we can switch to using the REST API.
    """
    # We need to use a `session` to store intermediate cookies during the process
    session = requests.Session()
    r = session.get("https://proteins.plus/")
    r.raise_for_status()
    # The homepage contains the CSRF token needed to validate our request
    # Otherwise it wouldn't be safe! We have to use that throughout our requests
    # so the best way is to set it as part of the session HTTP headers
    html = BeautifulSoup(r.text, 'html.parser')
    token = html.find('input', {'name': 'authenticity_token'}).attrs['value']
    session.headers['X-CSRF-Token'] = token

    # 1. Upload file
    with open(pdbfile, 'rb') as f:
        r = session.post("https://proteins.plus", files={'pdb_file[pathvar]': f})
    r.raise_for_status()

    # If the REST API supported file uploads, we would have the public ID already
    # but in the meantime you will have to work around it this way
        
    # 2. Get internal location id
    html = BeautifulSoup(r.text, 'html.parser')
    pdb_id = html.find('input', {'name': "dogsite[pdbCode]"}).attrs['value']

    # 3. Get the internal job ID
    session.headers['Referer'] = "https://proteins.plus/" + pdb_id
    r = session.post(f"https://proteins.plus/{pdb_id}/dogsites",
            json={"dogsite": {"pdbCode": pdb_id}},
            headers= {'Content-type': 'application/json', 
                      'Accept': 'application/json'}
    )
    r.raise_for_status()
    job_id = r.json()['job_id']
    time.sleep(3)  # wait a bit before continuing so the server can process the request
    
    # 4. Get the public job ID
    while True:
        r = session.get(f"https://proteins.plus/{pdb_id}/dogsites/{job_id}?_={round(time.time())}",
                        headers= {
                            'Accept': 'application/json, text/javascript, */*',
                            'Sec-Fetch-Mode': 'cors',
                            'Sec-Fetch-Site': 'same-origin',
                            # this line below makes all the difference, apparently
                            # otherwise, error 406 is thrown
                            'X-Requested-With': 'XMLHttpRequest'}
                       )
        r.raise_for_status()
        if 'Calculation in progress...' in r.text:  # not finished yet
            time.sleep(5)
            continue
        if 'Error during DogSiteScorer calculation' in r.text:  # malformed file?
            raise ValueError('Could not run DoGSiteScorer!')
        break
    
    results_id = None
    for lines in r.text.splitlines():
        for line in lines.split('\\n'):
            if 'results/dogsite' in line:
                results_id = line.split('/')[3]
                break
    if results_id is None:
        raise ValueError(r.text)
        
    return f"https://proteins.plus/api/dogsite_rest/{results_id}"
    

def dogsite_scorer_guess_binding_site(protein):
    """
    Use proteins.plus' DoGSiteScorer to retrieve most probable binding site in protein.
    """
    if len(protein) == 4:  # pdb code
        job_location = dogsite_scorer_submit_with_pdbid(protein)
    elif protein.endswith('.pdb'):
        job_location = dogsite_scorer_submit_with_custom_pdb(protein)
    else:
        raise ValueError("`protein` must be a PDB ID or a path to a .pdb file!")
    
    # Check when the calculation has finished
    while True:
        result = requests.get(job_location)
        result.raise_for_status()  # if it fails, it will stop here
        if result.status_code == 202:  # still running
            time.sleep(5)
            continue
        break
    
    # the residues files contain the geometric center and radius as a comment in the PDB file
    # first file (residues[0]) is the best scored pocket
    pdb_residues = requests.get(result.json()['residues'][0]).text
    for line in pdb_residues.splitlines():
        line = line.strip()
        if line.startswith('HEADER') and 'Geometric pocket center at' in line:
            fields = line.split()
            center = [float(x) for x in fields[5:8]]
            radius = float(fields[-1])
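            break
    else:  # no header line matched, so center and radius were never set
        raise ValueError("Could not find the pocket center in the DoGSiteScorer output")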
    return center, radius  # this is what we need for our Vina calculation
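
The header parsing at the end of `dogsite_scorer_guess_binding_site` can be tried offline. The sketch below uses a regular expression instead of fixed field positions. The sample text is hypothetical: it was written to match the field positions the function above relies on (three coordinates after "Geometric pocket center at", radius as the last number) and is not copied from a real DoGSiteScorer file.

```python
import re

MARKER = "Geometric pocket center at"
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_pocket_header(pdb_text):
    """Return (center, radius) from the first matching HEADER line."""
    for line in pdb_text.splitlines():
        line = line.strip()
        if line.startswith("HEADER") and MARKER in line:
            tail = line.split(MARKER, 1)[1]
            numbers = [float(x) for x in NUMBER.findall(tail)]
            if len(numbers) < 4:
                raise ValueError(f"Unexpected header format: {line!r}")
            # First three numbers are x, y, z; the last one is the radius
            return numbers[:3], numbers[-1]
    raise ValueError("No pocket header found")


# Hypothetical sample, constructed for this example only
SAMPLE = """\
HEADER    Hypothetical pocket file
HEADER Geometric pocket center at 12.5 -3.25 40.0 with max radius 9.75
END
"""

center, radius = parse_pocket_header(SAMPLE)
print(center, radius)
```

Raising an explicit error when no header matches makes a changed file format easier to diagnose than an undefined variable would.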

In [10]:
######################
#
# Run calculation
#
######################

from suds.client import Client
from IPython.display import display, clear_output, HTML


VINA_CONFIG = """
center_x = {center[0]}
center_y = {center[1]}
center_z = {center[2]}
size_x = {size[0]}
size_y = {size[1]}
size_z = {size[2]}
"""

def opal_run_docking(protein, ligand, center, size, stream_output=True):
    """
    Connect to OPAL webservices and submit job
    """
    client = Client("http://nbcr-222.ucsd.edu/opal2/services/vina_1.1.2?wsdl")
    files = 'receptor.pdbqt', 'ligand.pdbqt', 'vina.conf'
    with open(protein) as f:
        protein_contents = f.read()
    with open(ligand) as f:
        ligand_contents = f.read()
    file_map = [
        {'name': 'receptor.pdbqt',
         'contents': base64ify(protein_contents)},
        {'name': 'ligand.pdbqt',
         'contents': base64ify(ligand_contents)},
        {'name': 'vina.conf',
         'contents': base64ify(VINA_CONFIG.format(center=center, size=size))},
        {'name': 'results.pdbqt',
         'contents': ''},
    ]
    cli_args = "--receptor receptor.pdbqt --ligand ligand.pdbqt --config vina.conf --out results.pdbqt"
    
    response = client.service.launchJob(cli_args, inputFile=file_map)
    job_id = response.jobID
    url = f"http://nbcr-222.ucsd.edu/opal-jobs/{job_id}"
    message = "Waiting for job " + url
    while True:
        status = client.service.queryStatus(job_id)
        r = requests.get(url + "/vina.out")
        try:
            r.raise_for_status()
        except requests.HTTPError:  # output file might not exist yet during the first checks
            iprint(message)
        else:
            iprint(f"{message}\n{r.text}")
        if status.code == 2:
            time.sleep(10)
            continue
        print('\nFinished!')
        break
        
    output_response = client.service.getOutputs(job_id)
    output_files = {
        'stdout.txt': requests.get(output_response.stdOut).text,
        'stderr.txt': requests.get(output_response.stdErr).text,
    }
    for f in output_response.outputFile:
        if f.name in files:
            continue
        r = requests.get(f.url)
        r.encoding = 'utf-8'
        r.raise_for_status()
        contents = r.text
        output_files[f.name] = contents 
        time.sleep(0.1)
    
    return output_files

######################
#
# Utilities
#
######################

import base64

def base64ify(bytes_or_str):
    """
    Mimic Py2k base64encode behavior
    """
    if isinstance(bytes_or_str, str):
        input_bytes = bytes_or_str.encode('utf8')
    else:
        input_bytes = bytes_or_str

    output_bytes = base64.urlsafe_b64encode(input_bytes)
    return output_bytes.decode('ascii')

def iprint(s):
    """
    We can use this function to print outputs, overwriting previous ones, so it
    looks like it's constantly updating :)
    """
    clear_output(wait=True)
    s = s.replace("\n", "<br />")
    display(HTML(f'<pre>{s}</pre>'))

def smiles_to_pdb(s, out='smiles.pdb'):
    m = Chem.AddHs(Chem.MolFromSmiles(s))
    AllChem.EmbedMolecule(m)
    Chem.MolToPDBFile(m, out)
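
One detail of `base64ify` is worth knowing: it uses `base64.urlsafe_b64encode`, whose alphabet replaces `+` and `/` with `-` and `_`. The short sketch below shows where the two standard-library encoders differ. Which alphabet the OPAL server expects is not stated in this notebook, so this is only meant as a debugging aid if uploaded files arrive corrupted.

```python
import base64

# A payload chosen so that the encoded text contains the character
# that differs between the two alphabets
payload = b"~~~"

standard = base64.b64encode(payload).decode("ascii")
urlsafe = base64.urlsafe_b64encode(payload).decode("ascii")

print("standard:", standard)
print("urlsafe: ", urlsafe)

# Each encoding round-trips with its matching decoder
assert base64.b64decode(standard) == payload
assert base64.urlsafe_b64decode(urlsafe) == payload
```

For payloads whose encoded form contains neither of the two special characters, both functions return identical text.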

Now that all the needed functions are defined, we can create the pipeline:

In [329]:
def step_03_opal(protein, smiles, pdbcomplex):
    """
    Given a protein structure and a list of SMILES strings:

    Steps:
        1. Prepare the protein for AutoDock Vina (locally)
        2. Use DoGSiteScorer to find the most probable binding site
        3. For each ligand, use RDKit to write a 3D PDB file and
           run AutoDock Vina on OPAL

    The whole thing should take around 5-15 mins

    The result is a dictionary with the output file contents of the
    last ligand processed. We are mainly interested in
    result['results.pdbqt']
    """
    prepared_protein = opal_prepare_protein(protein)
    center, radius = dogsite_scorer_guess_binding_site(pdbcomplex)
    size = [radius] * 3  # Vina supports non-cubic boxes, but we will use a cube for simplicity
    for i, smile in enumerate(smiles):
        smiles_to_pdb(smile, f'data/ligand{i}.pdb')
        prepared_ligand = opal_prepare_ligand(f'data/ligand{i}.pdb')
        result = opal_run_docking(prepared_protein, prepared_ligand, center, size)
    return result

Run it!

In [340]:
# We will only process the first ligand in `smiles`
%time result = step_03_opal('data/protein.mol2', smiles[:1], 'data/complex.pdb')


Finished!
CPU times: user 1.69 s, sys: 39.1 ms, total: 1.72 s
Wall time: 2min 59s


### Understanding the output

`result` is a dictionary with several keys, corresponding to the output files and their text contents. We are mainly interested in:

- `results.pdbqt` contains the docked ligands. It's a modified multi-model PDB file. Since we kept the protein rigid, we just need to open each ligand model together with the original protein structure.
- `vina.out` is the text output you see above. It will provide the table-like information.
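
As a quick sanity check before moving on, the predicted affinities can be read straight from the PDBQT text: Vina writes one `REMARK VINA RESULT:` line per `MODEL`, holding the affinity (kcal/mol) and two RMSD values relative to the best mode. The sketch below parses such text; `parse_vina_results` is a hypothetical helper and the sample data is made up for illustration, not taken from the run above.

```python
def parse_vina_results(pdbqt_text):
    """Collect model number, affinity and RMSD bounds for each docked pose."""
    poses = []
    model = None
    for line in pdbqt_text.splitlines():
        if line.startswith("MODEL"):
            model = int(line.split()[1])
        elif line.startswith("REMARK VINA RESULT:") and model is not None:
            affinity, rmsd_lb, rmsd_ub = (float(x) for x in line.split()[3:6])
            poses.append({
                "model": model,
                "affinity": affinity,  # kcal/mol, more negative is better
                "rmsd_lb": rmsd_lb,
                "rmsd_ub": rmsd_ub,
            })
    return poses


# Made-up sample with two poses (atom records omitted for brevity)
SAMPLE_PDBQT = """\
MODEL 1
REMARK VINA RESULT:      -9.1      0.000      0.000
ENDMDL
MODEL 2
REMARK VINA RESULT:      -8.4      1.532      2.871
ENDMDL
"""

poses = parse_vina_results(SAMPLE_PDBQT)
best = min(poses, key=lambda pose: pose["affinity"])
print(f"Best pose: model {best['model']} ({best['affinity']} kcal/mol)")
```

With the real output, the same function could be called as `parse_vina_results(result['results.pdbqt'])`.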

Save them to disk so we can retrieve them later.

In [341]:
with open('data/results.pdbqt', 'w') as f:
    f.write(result['results.pdbqt'])
with open('data/vina.out', 'w') as f:
    f.write(result['vina.out'])

# Visualize docking results

Once the calculation has run and the files have been downloaded, it's time to visualize them! You will see how to do that in Part C.

# Discussion

Pending

# Quiz

- How can you tell that the docking ran successfully on the remote server?
- Why do we need to prepare the AutoDock Vina input files locally?