# T04 - Fully Automated Drug Design Pipeline

**Authors:**
- Armin Ariamajd, CADD seminar, 2021, Charité/Freie Universität Berlin
- Melanie Vogel,  CADD seminar, 2021, Charité/Freie Universität Berlin


## Aim of This Talktorial
<br>
<div style="text-align: justify">     
In this talktorial we will learn to develop an <b>automated structure-based virtual screening pipeline, particularly suited for hit expansion and lead optimization</b> phases of a drug discovery project, where a known promising ligand (i.e. an initial hit or lead compound) needs to be structurally modified in order to improve its binding affinity and selectivity for the target protein. The general architecture of the pipeline can thus be summarized as follows (Figure 1):
</div>
<br>

* **Input**
    * Target protein structure and a promising ligand (e.g. lead or hit compound), plus specifications of the processes that need to be performed.
<br><br>
* **Processes**
    1. Detection of the most druggable binding site on the given protein structure (if not defined in the input data).
    2. Finding derivatives and structural analogs of the given ligand by performing similarity search on available compound databases, and filtering the results to obtain the most drug-like analogs.
    3. Performing docking experiments on the selected binding site, using the found analogs.
    4. Analyzing the protein–ligand interactions in the calculated binding modes of each analog.
    5. Visualizing the binding modes and the corresponding interactions.
<br><br>
* **Output**
    * Optimized ligand structure(s) in terms of binding affinity and selectivity.

<br>

<p style="text-align:center;"><img src="images/fig1.png"/></p>

**Figure 1.** General architecture of the automated structure-based virtual screening pipeline.


### Table of Contents

1. [**Theory**](#theory)
    * [Drug Design Pipeline](#drug_design_theory)
    * [Binding Site Detection](#binding_site_theory) 
    * [Chemical Similarity](#similarity_theory)
    * [Molecular Docking](#docking_theory)
    * [Protein-Ligand Interactions](#interactions_theory)
    * [Visual Inspection of The Docking Results](#visual_theory)
<br><br>
2. [**Practical**](#practical)
    * [Outline of the Virtual Screening Pipeline](#1_practical)
    * [Reading the Input Data and Initializing Output Paths](#2_practical)
    * [Processing the Input Protein Data](#3_practical)
    * [Processing the Input Ligand Data](#4_practical)
    * [Binding Site Detection](#5_practical)
    * [Ligand Similarity Search](#6_practical)
    * [Molecular Docking](#7_practical)
    * [Analysis of Protein–Ligand Interactions](#8_practical)
    * [Selection of the Best Optimized Ligand](#9_practical)
    * [Putting The Pieces Together: A Fully Automated Pipeline](#10_practical)
<br><br>
3. [**Discussion**](#discussion)
4. [**Quiz**](#quiz)
5. [**Supplementary Information**](#supp)

### References
* ***TeachOpenCADD*** Teaching Platform
    1. Journal article on *TeachOpenCADD* teaching platform for computer-aided drug design: [D. Sydow <i>et al.</i>, <i>J. Cheminform.</i> <b>2019</b>, 11, 29.](https://doi.org/10.1186/s13321-019-0351-x)
    2. [*TeachOpenCADD* website](https://projects.volkamerlab.org/teachopencadd/index.html) at [Volkamer lab](https://volkamerlab.org/)
    2. This talktorial is inspired by the *TeachOpenCADD* Talktorials [T013 - T017](https://github.com/volkamerlab/teachopencadd/tree/t011-base/teachopencadd/talktorials)
<br><br>
* **Drug Design Pipeline**
    3. Book on drug design: [G. Klebe, <i>Drug Design</i>, Springer, <b>2013</b>.](https://doi.org/10.1007/978-3-642-17907-5)
    4. Review article on early stages of drug discovery: [Hughes <i>et al.</i>, <i>Br. J. Pharmacol.</i> <b>2011</b>, 162, 1239-1249.](https://doi.org/10.1111/j.1476-5381.2010.01127.x)
    5. Review article on computational drug design: [Sliwoski <i>et al.</i>, <i>Pharmacol. Rev.</i> <b>2014</b>, 66, 334-395.](https://doi.org/10.1124/pr.112.007336)
    6. Review article on computational drug discovery: [Leelananda <i>et al.</i>, <i>Beilstein J. Org. Chem.</i> <b>2016</b>, 12, 2694-2718.](https://doi.org/10.3762/bjoc.12.267)
    7. Review article on free software for building a virtual screening pipeline: [Glaab, <i>Brief. Bioinform.</i> <b>2016</b>, 17, 352-366.](https://doi.org/10.1093/bib/bbv037)
    8. Review article on automating drug discovery: [Schneider, <i>Nat. Rev. Drug Discov.</i> <b>2018</b>, 17, 97-113.](https://doi.org/10.1038/nrd.2017.232)
    9. Review article on structure-based drug discovery: [Batool <i>et al.</i>, <i>Int. J. Mol. Sci.</i> <b>2019</b>, 20, 2783.](https://doi.org/10.3390/ijms20112783)
<br><br>
* **Binding Site Detection and The *DoGSiteScorer* Program** 
    10. Book chapter on prediction and analysis of binding sites: [Volkamer <i>et al.</i>, <i>Applied Chemoinformatics</i>, Wiley, <b>2018</b>, pp. 283-311.](https://doi.org/10.1002/9783527806539.ch6g)
    11. Journal article on binding-site and druggability predictions using *DoGSiteScorer*: [Volkamer <i>et al.</i>, <i>J. Chem. Inf. Model.</i> <b>2012</b>, <i>52</i>, 360-372.](https://doi.org/10.1021/ci200454v)
    12. Journal article describing the *ProteinsPlus* web-portal: [R. Fahrrolfes <i>et al.</i>, <i>Nucleic Acids Res.</i> <b>2017</b>, 45, W337-W343.](https://doi.org/10.1093/nar/gkx333)
    13. [*ProteinsPlus* website](https://proteins.plus/), and information regarding the usage of its *DoGSiteScorer* [REST-API](https://proteins.plus/help/dogsite_rest)
    14. *TeachOpenCADD* Talktorial on binding-site detection: [Talktorial T014](https://github.com/volkamerlab/teachopencadd/tree/t011-base/teachopencadd/talktorials/T014_binding_site_detection)
    15. *TeachOpenCADD* talktorial on querying online API web-services: [Talktorial T011](https://github.com/volkamerlab/teachopencadd/tree/t011-base/teachopencadd/talktorials/T011_query_online_api_webservices)
<br><br>
* **Chemical Similarity and The *PubChem* Online Database**
    16. Review article on molecular similarity in medicinal chemistry: [G. Maggiora <i>et al.</i>, <i>J. Med. Chem.</i> <b>2014</b>, 57, 3186-3204.](https://doi.org/10.1021/jm401411z)
    17. Journal article on extended-connectivity fingerprints: [D. Rogers <i>et al.</i>, <i>J. Chem. Inf. Model.</i> <b>2010</b>, 50, 742-754.](https://doi.org/10.1021/ci100050t)
    18. Journal article describing the latest developments of the *PubChem* web-services: [S. Kim <i>et al.</i>, <i>Nucleic Acids Res.</i> <b>2019</b>, 47, D1102-D1109.](https://doi.org/10.1093/nar/gky1033)
    19. [*PubChem* website](https://pubchem.ncbi.nlm.nih.gov/), and information regarding the usage of its [APIs](https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access)
    20. Description of *PubChem*'s [custom substructure fingerprint](https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf) and [*Tanimoto* similarity measure](https://jcheminf.biomedcentral.com/articles/10.1186/s13321-016-0163-1) used in its similarity search engine.  
    20. *TeachOpenCADD* talktorial on compound similarity: [Talktorial T004](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T004_compound_similarity/talktorial.ipynb)
    21. *TeachOpenCADD* talktorial on data acquisition from PubChem: [Talktorial T013](https://github.com/volkamerlab/teachopencadd/blob/t011-base/teachopencadd/talktorials/T013_query_pubchem/talktorial.ipynb) 
<br><br>
* **Molecular Docking and The *Smina* Program**
    22. Review article on molecular docking algorithms: [X. Y. Meng <i>et al.</i>, <i>Curr. Comput. Aided Drug Des.</i> <b>2011</b>, 7, 146-157.](https://doi.org/10.2174/157340911795677602)
    23. Review article on different software used for molecular docking: [N. S. Pagadala <i>et al.</i>, <i>Biophys. Rev.</i> <b>2017</b>, 9, 91-102.](https://doi.org/10.1007/s12551-016-0247-1)
    24. Review article on evaluation and comparison of different docking algorithms: [G. L. Warren <i> et al.</i>, <i>J. Med. Chem.</i> <b>2006</b>, 49, 5912-5931.](https://doi.org/10.1021/jm050362n)
    25. Review article on evaluation of ten docking programs on a diverse set of protein-ligand complexes: [Z. Wang <i> et al.</i>, <i>Phys. Chem. Chem. Phys.</i> <b>2016</b>, 18, 12964-12975.](https://doi.org/10.1039/C6CP01555G)
    26. Journal article describing the Smina docking program and its scoring function: [D. R. Koes <i>et al.</i>, <i>J. Chem. Inf. Model.</i> <b>2013</b>, 53, 1893-1904.](https://doi.org/10.1021/ci300604z) 
    27. [*OpenBabel* documentation](http://openbabel.org/wiki/Main_Page)
    28. [*Smina* documentation](https://sourceforge.net/projects/smina/)
    28. *TeachOpenCADD* talktorial on protein–ligand docking: [Talktorial T015](https://github.com/volkamerlab/teachopencadd/blob/t011-base/teachopencadd/talktorials/T015_protein_ligand_docking/talktorial.ipynb)
<br><br>
* **Protein-Ligand Interactions and the *PLIP* Program**
    29. Review article on protein-ligand interactions: [X. Du <i>et al.</i>, <i>Int. J. Mol. Sci.</i> <b>2016</b>, 17, 144.](https://doi.org/10.3390/ijms17020144)
    30. Journal article analyzing the types and frequencies of different protein-ligand interactions in available protein-ligand complex structures: [R. Ferreira de Freitas <i> et al.</i>, <i>Med. Chem. Commun.</i> <b>2017</b>, 8, 1970-1981.](https://doi.org/10.1039/C7MD00381A)
    31. Journal article describing the *PLIP* algorithm: [S. Salentin <i>et al.</i>, <i>Nucleic Acids Res.</i> <b>2015</b>, 43, W443-447.](https://doi.org/10.1093/nar/gkv315)
    32. [*PLIP* website](https://plip-tool.biotec.tu-dresden.de/plip-web/plip/index)
    34. [*PLIP* documentation](https://github.com/pharmai/plip)
    33. *TeachOpenCADD* talktorial on protein-ligand interactions: [Talktorial T016](https://github.com/volkamerlab/teachopencadd/blob/t011-base/teachopencadd/talktorials/T016_protein_ligand_interactions/talktorial.ipynb)
<br><br>
* **Visual Inspection of Docking Results and the *NGLView* Program**
    34. Journal article describing the NGLView program: [H. Nguyen <i>et al.</i>, <i>Bioinformatics</i> <b>2018</b>, 34, 1241-1242.](https://doi.org/10.1093/bioinformatics/btx789)
    35. [*NGLView* documentation](http://nglviewer.org/nglview/latest/api.html)
    36. *TeachOpenCADD* talktorial on advanced NGLView usage: [Talktorial T017](https://github.com/volkamerlab/teachopencadd/blob/t011-base/teachopencadd/talktorials/T017_advanced_nglview_usage/talktorial.ipynb)

<a id='theory'></a>
## Theory

<a id='drug_design_theory'></a>
### Drug Design Pipeline
<br>
<div style="text-align: justify">
Modern drug discovery and development is a complex process, which can be categorized into several phases (Figure 2). The whole process is highly time- and capital-intensive, and on average it takes about 2–4 years and hundreds of millions of dollars only to get to the pre-clinical phase. Moreover, it is not uncommon for the drug candidates to fail in the pre-clinical and clinical phases (e.g. due to ineffectiveness or side-effects in human subjects, which could not have been predicted beforehand), so that the process has to be repeated. Therefore, development of computer-aided drug design pipelines that can greatly reduce the time and cost of drug discovery projects is an attractive alternative and an active research area.
</div>  
<br>
<p style="text-align:center;"><img src="images/fig2.png" width="800" height=auto/></p>

**Figure 2.** Schematic representation of the main phases in a modern drug discovery pipeline.
<br><br>
<div style="text-align: justify">
    In this talktorial, we are only going to focus on the <b>hit-to-lead</b> (also known as lead generation) and <b>lead optimization</b> phases of the pipeline. In both phases, a similar procedure takes place, where the initial hit/lead compound is derivatized to synthesize a set of structural analogs, which are then tested against the target protein in screening experiments, with the goal of finding analogs with improved binding affinities, selectivities, and physicochemical and pharmacokinetic properties. Here, we are going to implement this procedure <i>in silico</i>: given a target protein structure and a hit/lead compound, instead of chemically synthesizing a variety of analogs and testing their potency in high-throughput screening experiments, we will obtain these analogs via a similarity-search algorithm. We can then calculate their physicochemical properties and select the best analogs to perform virtual docking experiments on, in order to obtain their binding affinities to the target protein. Subsequently, the analogs with the highest calculated binding affinities are selected, and we investigate their binding modes to choose those compounds that exhibit the desired protein-ligand interactions, e.g. in order to obtain higher selectivities for the target protein as well.
</div>

<a id='binding_site_theory'></a>
### Binding Site Detection
<br>
<div style="text-align: justify">
Binding sites (also known as binding pockets) are cavities in the 3-dimensional structure of a protein in its native state (Figure 3). These are mostly found on the surface of the protein structure, and are the main regions through which the protein interacts with other entities, such as small molecules (also called ligands). For a constructive interaction, the two binding partners need to have complementary steric and electronic properties (cf. lock-and-key principle, induced-fit model). Therefore, attractive intermolecular interactions between the residues of the binding pocket and the ligand are one of the key factors in determining the ligand's potency as a drug. Furthermore, a protein may possess several binding sites, and interaction of a ligand with each of these binding sites may result in a different modulation of the protein structure, and thus its function. Consequently, detailed knowledge of the target protein's binding sites is of utmost importance in a drug discovery pipeline in order to be able to design ligands that interact most favorably with a specific binding site of the protein, leading to the desired modulation (e.g. inhibition or stimulation of the protein's function). In addition, it is necessary to limit the docking experiments to the binding site of interest, so that the calculated affinities correspond to binding of the ligand to the desired pocket, and not just anywhere on the protein structure. This is not a problem when a crystal structure of the target protein is available, in which a ligand has been co-crystallized in the desired binding site. In that case, the residues surrounding the co-crystallized ligand can be simply used to define the binding pocket. However, <b>when the available protein structure does not contain a co-crystallized ligand in the desired binding pocket, it is essential for the docking experiment to first detect and define a druggable pocket in the protein</b>. 
</div>
<p style="text-align:center;">
<img src="images/fig3.png" width="600" height=auto class="center"/>
</p>

**Figure 3.** Crystal structure of a protein (EGFR; PDB-code: 3W32) with a co-crystallized ligand in its main (orthosteric) binding site. The protein's surface is shown in gray. The binding site is colored blue. The ligand's carbon atoms are colored green. The image was created using *PyMol*.

#### Binding Site Detection With *DoGSiteScorer*
<br>
<div style="text-align: justify"> 
For binding site detection in this talktorial, we will use the <b><i>DoGSiteScorer</i></b> functionality of the <b><i><a href="https://proteins.plus/">ProteinsPlus</a></i></b> webserver, which uses a <b>geometry- and grid-based detection algorithm</b>. More specifically, the protein is embedded into a Cartesian 3D-grid, where each grid point is labeled as either free or occupied, depending on whether it lies within the van-der-Waals radius of any protein atom. Subsequently, an edge-detection algorithm used in image processing, called <b>Difference of Gaussians</b> (DoG; hence the name <i>DoGSiteScorer</i>), is applied to the grid to identify, per grid point, the cavities on the protein surface that can accommodate a spherical object. Lastly, cavities on neighboring grid-points are clustered together based on specific cut-off criteria, resulting in defined sub-pockets, which are then merged into pockets (Figure 4). For each (sub-)pocket, the algorithm then calculates several <b>descriptors</b>, e.g. volume, surface area, depth, hydrophobicity, number of hydrogen-bond donors/acceptors and amino acid count. Moreover, <i>DoGSiteScorer</i> also calculates two different <b>druggability</b> estimates for each (sub-)pocket: a <i>simple druggability score</i> is calculated based on a linear combination of the three descriptors describing volume, hydrophobicity and enclosure. In addition, another score is calculated by incorporating a subset of meaningful descriptors into a support vector machine (SVM) model, trained and tested on the freely available (non-redundant) druggability dataset consisting of 1069 targets. Both calculated druggability scores are between 0 and 1, where a higher score corresponds to a more druggable binding site. Using these calculated descriptors and druggability estimates, we can then choose the most suitable pocket depending on the specifics of the project at hand.
</div>
<br>
<p style="text-align:center;"><img src="images/fig4.png" width="600" height=auto/></p>
<div style="text-align: justify"> 
<b>Figure 4.</b> Visualization of some of the sub-pockets (colored meshed volumes) for the EGFR protein (PDB-code: 3W32) as detected by the <i>DoGSiteScorer</i> web-service. The co-crystallized ligand is mostly contained in the purple sub-pocket.  
</div>
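
The first step of the grid-based detection described above (labeling grid points as free or occupied) can be sketched in a few lines. This is only a minimal illustration and not the actual <i>DoGSiteScorer</i> implementation: the two-atom "protein", the grid spacing, the padding and the function name `label_grid` are all hypothetical, and the subsequent Difference-of-Gaussians filtering and clustering steps are omitted.

```python
import itertools
import math

# Hypothetical toy "protein": (x, y, z, van der Waals radius) per atom, in angstroms.
atoms = [
    (0.0, 0.0, 0.0, 1.7),
    (3.0, 0.0, 0.0, 1.7),
]


def label_grid(atoms, spacing=1.0, padding=2.0):
    """Map each grid point to True (occupied) or False (free).

    A point counts as occupied if it lies within the van der Waals
    radius of at least one atom. Spacing and padding are assumed values.
    """
    mins = [min(a[i] for a in atoms) - padding for i in range(3)]
    maxs = [max(a[i] for a in atoms) + padding for i in range(3)]
    axes = [
        [mins[i] + k * spacing for k in range(int((maxs[i] - mins[i]) / spacing) + 1)]
        for i in range(3)
    ]
    grid = {}
    for point in itertools.product(*axes):
        grid[point] = any(math.dist(point, a[:3]) <= a[3] for a in atoms)
    return grid


grid = label_grid(atoms)
n_occupied = sum(grid.values())
print(f"{n_occupied} of {len(grid)} grid points are occupied")
```

In a real application the atom coordinates and radii would come from the protein structure file, and a much finer grid would be used.
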

<a id='similarity_theory'></a>
### Chemical Similarity
<br>
<div style="text-align: justify"> 
As described above, in the hit expansion and lead optimization steps of an experimental drug design pipeline, several derivatives of the initial hit/lead compound are chemically synthesized, in order to find the most suitable derivative in terms of potency, selectivity and physicochemical properties. This process can be considerably accelerated in a virtual screening pipeline, where the derivatives can be obtained by performing a <b>similarity search</b> on databases of existing chemical compounds. To do so, first a <b>molecular descriptor</b> is required, i.e. a numerical description that can be calculated for each compound, so that compounds can be compared to each other based on their descriptor values. In addition, a scoring function is needed to calculate the <b>similarity score</b> between two compounds, using their descriptor values. The simplest descriptors for a molecule are the so-called <b>1D-descriptors</b>, which are scalar values corresponding to a certain property of the molecule, e.g. molecular weight, octanol-water partition coefficient (<i>logP</i>; a measure for lipophilicity/hydrophilicity), topological polar surface area etc. However, these descriptors usually do not contain enough information to assess the structural and chemical similarity of two compounds. For this purpose, usually <b>2D-descriptors</b> (also called <b>molecular fingerprints</b>) are used, which are vectors that can represent a specific molecular structure in much more detail, using a set of scalars. A variety of different algorithms are available for generation of 2D-descriptors from chemical structures, e.g. <b><i>MACCS</i></b> <b>structural keys</b> and <b><i>ECFP/Morgan</i></b> <b>fingerprints</b>. 
Generally, these algorithms work by extracting a set of specific features from the structure (Figure 5), generating a numerical representation for each feature, and using these to produce either a bit-vector where each component is a bit defining the presence/absence of a particular feature (e.g. MACCS keys and the binary form of Morgan fingerprints), or a count-vector where each value corresponds to the number of times a specific feature is present in the structure (e.g. count-based Morgan fingerprints). Moreover, there are also different scoring functions (i.e. comparison metrics) to calculate the similarity between two molecules based on their 2D-descriptors. These include <b><i>Euclidean</i></b>-<b>distance</b> and <b><i>Manhattan</i></b>-<b>distance</b> calculations, where both presence and absence of attributes are considered, or <b><i>Tanimoto</i></b> and <b><i>Dice</i></b> coefficients, which only consider the presence of attributes. It should be noted that there is no single correct approach to calculate molecular similarity, and depending on the purpose of the project different descriptors and metrics may be used, which can result in vastly different similarity scores.
</div>
<br>
<p style="text-align:center;">
<img src="images/fig5.png" width="350" height=auto class="center"/>
</p>
<div style="text-align: justify"> 
<b>Figure 5.</b> A simplified depiction of the process of calculating the similarity between two compounds. First, the structure of each compound is encoded into a molecular fingerprint bit-vector, where each bit corresponds, for example, to the presence/absence of a particular fragment in the structure. These fingerprints can then be compared using different similarity metrics in order to calculate a similarity score.
    </div>
<br>
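
To make the comparison step concrete, the <i>Tanimoto</i> and <i>Dice</i> coefficients mentioned above can be written out for two bit-vector fingerprints. The sketch below represents each fingerprint simply as the set of its on-bit positions; the two fingerprints and the function names are made up for illustration and do not come from any real molecule or library.

```python
def tanimoto(fp_a, fp_b):
    """Shared on-bits divided by the on-bits present in either fingerprint."""
    a, b = set(fp_a), set(fp_b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def dice(fp_a, fp_b):
    """Twice the shared on-bits divided by the total number of on-bits."""
    a, b = set(fp_a), set(fp_b)
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


# Hypothetical fingerprints, given as the indices of the bits that are set.
fp_query = {1, 4, 7, 9}
fp_analog = {1, 4, 8, 9, 12}

print(f"Tanimoto: {tanimoto(fp_query, fp_analog):.2f}")
print(f"Dice:     {dice(fp_query, fp_analog):.2f}")
```

Both coefficients only look at bits that are set in at least one fingerprint, which is why the absence of a feature in both molecules does not contribute to the score.
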

<div style="text-align: justify"> 
In this talktorial, we will use the <b><i><a href="https://pubchem.ncbi.nlm.nih.gov/">PubChem</a></i></b> web-services for performing the similarity search on the input ligand. <i>PubChem</i>, which is maintained by the <a href="https://www.ncbi.nlm.nih.gov/">U.S. National Center for Biotechnology Information (NCBI)</a>, contains an open database with 110 million chemical compounds and their properties (e.g. identifiers, physicochemical properties, biological activities etc.), which can be accessed through both a web-based interface, and several different web-service APIs (here, we will use their <a href="https://pubchemdocs.ncbi.nlm.nih.gov/programmatic-access">PUG-REST API</a>). It also allows for directly performing similarity searches on the database, using a <a href="https://ftp.ncbi.nlm.nih.gov/pubchem/specifications/pubchem_fingerprints.pdf">custom substructure fingerprint</a> as the 2D-descriptor, and the <a href="https://jcheminf.biomedcentral.com/articles/10.1186/s13321-016-0163-1"><i>Tanimoto</i> similarity measure</a> as the metric. Therefore, by submitting a compound's identifier (e.g. SMILES, CID, InChI etc.) to the <i>PubChem</i> API, and providing a similarity threshold and the maximum number of desired results, a set of compounds within the given similarity threshold can be obtained.
</div>
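
As a rough illustration of such a request, the following sketch only assembles the URL for a PUG-REST similarity search from a SMILES string, without sending it. The endpoint layout and the `Threshold`/`MaxRecords` parameters reflect our reading of the PUG-REST documentation and should be checked against it; the helper name `similarity_search_url`, the default values and the example SMILES (aspirin) are assumptions made for this example.

```python
from urllib.parse import quote

PUG_REST = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def similarity_search_url(smiles, threshold=80, max_records=10):
    """Build the URL of a PubChem 2D similarity search for a SMILES string.

    threshold is the minimum Tanimoto similarity in percent;
    max_records limits the number of returned compounds.
    """
    # SMILES may contain characters such as '(', '=' or '#',
    # which need to be percent-encoded before use in a URL.
    escaped = quote(smiles, safe="")
    return (
        f"{PUG_REST}/compound/similarity/smiles/{escaped}/JSON"
        f"?Threshold={threshold}&MaxRecords={max_records}"
    )


url = similarity_search_url("CC(=O)Oc1ccccc1C(=O)O")
print(url)
```

To our understanding, this particular operation is asynchronous: the response contains a job key, which then has to be polled until the list of similar compounds is ready.
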

<a id='docking_theory'></a>
### Molecular Docking
<br>
<div style="text-align: justify"> 
After defining an appropriate binding site in the target protein and obtaining a set of analogs for the ligand of interest, the next step is to assess the suitability of each analog in terms of its binding mode and -affinity. This can be done using a molecular docking algorithm, which works by <b>sampling the ligand's conformational space</b> inside the protein's binding site, and evaluating the energetics of protein-ligand interactions for each generated conformation, using a <b>scoring function</b>. Doing so, the <b>binding affinity</b> of each conformation is calculated, which can then be used to determine the energetically most-favorable <b>binding modes</b> (also called docking poses) of a ligand. In order to provide accurate results, most docking programs require some preparation of the protein and ligand structures. For example: 
</div>

* Hydrogen atoms that are usually absent in crystal structures should be added to the protein. 
* The correct protonation state of each titratable group should be determined for a given pH value (usually the physiological pH, i.e. 7.4). 
* Partial charges should be assigned to all atoms. 
* For ligands, which are usually provided only by an identifier (e.g. their SMILES), a reasonable low-energy conformer should also be generated, which is then used as the starting point in the conformational sampling process. 

<div style="text-align: justify"> 
However, most of these calculations cannot be accurately performed, since they are either computationally costly (e.g. calculating the lowest-energy conformation of the ligand) or require information that is not available beforehand (e.g. the protonation states of the protein and ligand depend on the interactions between the two, and cannot be accurately calculated without knowing the actual docking pose of the ligand). These inaccuracies, along with other limitations of docking programs (e.g. the computational cost of sampling all relevant conformations of the ligand, or the inability to account for the structural changes in both protein and ligand resulting from their interactions), can result in somewhat unreliable docking results. For example, in many cases, the docking pose with the highest calculated binding affinity does not correspond to the experimentally determined binding mode of the ligand (Figure 6).
    </div>
<br>

<p style="text-align:center;">
<img src="images/fig6.png" width="750" height=auto class="center"/>
</p>
<div style="text-align: justify"> 
<b>Figure 6.</b> An example of two generated docking poses (red) in a re-docking experiment performed using the <i>Smina</i> program, superimposed over the corresponding protein structure (EGFR; PDB-code: 3W32) and the co-crystallized ligand in its native binding mode (green). While the generated docking pose shown on the left is calculated to have a higher binding affinity, it also displays a higher distance-RMSD to the native docking pose. The image was created using <i>PyMol</i>.
    </div>
<br>
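
The distance-RMSD used to compare a docking pose with the native binding mode can be computed directly from the atomic coordinates. The sketch below is a simplified version that assumes both poses list the same atoms in the same order and already share the protein's coordinate frame (so no superposition is performed and molecular symmetry is ignored); the coordinates are invented for illustration.

```python
import math


def rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two equally ordered coordinate lists."""
    if len(coords_a) != len(coords_b):
        raise ValueError("Both poses must contain the same number of atoms.")
    squared_sum = sum(math.dist(p, q) ** 2 for p, q in zip(coords_a, coords_b))
    return math.sqrt(squared_sum / len(coords_a))


# Hypothetical heavy-atom coordinates (in angstroms) of a native and a docked pose.
native_pose = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0), (1.5, 1.5, 0.0)]
docked_pose = [(0.0, 0.0, 1.0), (1.5, 0.0, 1.0), (1.5, 1.5, 1.0)]

print(f"RMSD: {rmsd(native_pose, docked_pose):.2f} Å")
```
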
<div style="text-align: justify"> 
In this talktorial, we will use the <b><i><a href="https://sourceforge.net/projects/smina/">Smina</a></i></b> docking program, which is an open-source fork of the docking program <b><i><a href="http://vina.scripps.edu/">AutoDock Vina</a></i></b>. Compared to <i>AutoDock Vina</i>, <i>Smina</i> is more focused on improved scoring functions and minimization; it uses a custom empirical scoring function by default, but supports user-parameterized scoring functions as well. In addition, in order to prepare the protein and ligand structures for the docking experiment (as described above), we will also use the <b><i><a href="https://github.com/pybel/pybel">Pybel</a></i></b> package, a <i>Python</i> package for the <b><i><a href="http://openbabel.org/wiki/Main_Page">OpenBabel</a></i></b> program.
</div>
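
Since <i>Smina</i> is a command-line program, a docking run from <i>Python</i> essentially amounts to assembling a list of arguments and passing it to a shell process. The following sketch only builds such an argument list and does not execute anything. The option names follow the <i>Smina</i>/<i>AutoDock Vina</i> command-line help as we understand it and should be verified with `smina --help`; the file names, box coordinates, default values and the helper `build_smina_command` are hypothetical.

```python
def build_smina_command(
    receptor, ligand, output, center, size, num_modes=9, exhaustiveness=8
):
    """Assemble the argument list for a Smina docking run (nothing is executed).

    center and size are (x, y, z) tuples describing the search box in angstroms.
    """
    command = ["smina", "--receptor", str(receptor), "--ligand", str(ligand)]
    command += ["--out", str(output)]
    for axis, value in zip("xyz", center):
        command += [f"--center_{axis}", str(value)]
    for axis, value in zip("xyz", size):
        command += [f"--size_{axis}", str(value)]
    command += ["--num_modes", str(num_modes)]
    command += ["--exhaustiveness", str(exhaustiveness)]
    return command


# Hypothetical input files and binding-site box.
command = build_smina_command(
    receptor="protein.pdbqt",
    ligand="analog_1.pdbqt",
    output="analog_1_poses.sdf",
    center=(10.0, 20.0, 30.0),
    size=(15.0, 15.0, 15.0),
)
print(" ".join(command))
```

Such a list could then be handed to `subprocess.run(command, capture_output=True, text=True)` once <i>Smina</i> is installed and the prepared input files exist.
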



<a id='interactions_theory'></a>
### Protein–Ligand Interactions
<br>
<div style="text-align: justify"> 
As mentioned earlier, the number and type of non-covalent intermolecular interactions between the protein and the ligand are a main determining factor of the ligand's potency as a drug. These interactions, which are mostly governed by steric and electronic properties of the two interacting partners, can be categorized into several main types (Figure 7):
</div>

* Hydrophobic interactions
* Hydrogen bonding (direct or water-bridged)
* Metal complexation
* Electrostatic interactions
* $\pi$–$\pi$- and cation–$\pi$ interactions
* Halogen bonds

<p style="text-align:center;">
<img src="images/fig7.gif" width="500" height=auto class="center"/>
</p>
<div style="text-align: justify"> 
<b>Figure 7.</b> Most common non-covalent intermolecular interactions in available protein-ligand complexes in the <i>Protein Data Bank</i> (PDB), and their frequency distribution. Reprinted from reference [<a href="https://doi.org/10.1039/C7MD00381A">30</a>].
</div>
<br>
<div style="text-align: justify"> 
While these interactions are already taken into account implicitly by the scoring function of the docking algorithm, it is still useful to explicitly analyze them in the binding modes generated by the docking experiment. This information can be used to validate the calculated binding poses, or to further narrow down the choice of an optimal lead compound in terms of selectivity, e.g. by choosing those derivatives that exhibit interactions with specific mutated or non-conserved residues in the protein. There are several programs available for assessing protein–ligand interactions, e.g. the <b><i><a href="https://klifs.net/index.php">KLIFS</a></i></b> webserver, which is a kinase-centric tool that can identify interactions in the PDB entries containing kinase–ligand complexes. A more general tool, which we are going to use here, is the <b><i><a href="https://plip-tool.biotec.tu-dresden.de/plip-web/plip/index">Protein–Ligand Interaction Profiler (PLIP)</a></i></b>, an open-source program that is available both as a webserver and as a <i>Python</i> package. <i>PLIP</i> can analyze protein-ligand interactions in any given protein-ligand complex structure, by first determining a distance cut-off value based on the ligand's structure, and subsequently selecting pairs of atoms (one from the protein and one from the ligand) that lie within this cut-off value. It then identifies the potential interactions between selected pairs based on several electronic and geometric considerations. Thus, for each detected interaction, a set of information is outputted, including the interaction type, the involved atoms in the ligand and protein, as well as several properties specific to each interaction type. This information can then be used to visualize the protein-ligand interactions (Figure 8) or to analyze them algorithmically.
</div>
<br>
<p style="text-align:center;">
<img src="images/fig8.png" width="600" height=auto class="center"/>
</p>
<div style="text-align: justify">   
<b>Figure 8.</b> Visualization of the protein-ligand interactions in a protein-ligand complex structure (EGFR; PDB-code: 3W32) detected by the <i>PLIP</i> web-service. From the protein, only the interacting residues are shown, with their carbon atoms colored blue. The ligand's carbon atoms are colored brown. Hydrophobic interactions are shown as gray dashed lines. Hydrogen bonds are depicted as blue lines. $\pi$-stacking interactions are shown as green dashed lines. Halogen bonds are displayed using cyan lines.
</div>
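
The distance-based pre-selection of atom pairs described above can be illustrated with a few lines of plain <i>Python</i>. This is a simplified sketch of the general idea and not <i>PLIP</i>'s actual implementation: the atom labels, coordinates and the cut-off value of 4.5 Å are invented for the example, and the subsequent classification of each pair into an interaction type is left out.

```python
import math


def pairs_within_cutoff(protein_atoms, ligand_atoms, cutoff):
    """Return (protein atom, ligand atom, distance) for all pairs within the cut-off."""
    pairs = []
    for protein_label, protein_xyz in protein_atoms:
        for ligand_label, ligand_xyz in ligand_atoms:
            distance = math.dist(protein_xyz, ligand_xyz)
            if distance <= cutoff:
                pairs.append((protein_label, ligand_label, round(distance, 2)))
    return pairs


# Hypothetical atoms: (label, (x, y, z)) with coordinates in angstroms.
protein_atoms = [("LYS:NZ", (0.0, 0.0, 0.0)), ("MET:N", (10.0, 0.0, 0.0))]
ligand_atoms = [("N1", (3.0, 0.0, 0.0)), ("C7", (9.0, 4.0, 0.0))]

for pair in pairs_within_cutoff(protein_atoms, ligand_atoms, cutoff=4.5):
    print(pair)
```

Only the pairs that survive this filter would then be examined further, e.g. with respect to atom types and angles, to decide whether they form a hydrogen bond, a hydrophobic contact or another interaction.
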

<a id='visual_theory'></a>
### Visual Inspection of The Docking Results
<br>
<div style="text-align: justify"> 
Visual inspection of the calculated docking poses and their corresponding protein-ligand interactions is especially important due to the current limitations of docking algorithms, to the extent that most experts find docking scores to be the least important criterion for selecting docking poses. Instead, some of the usual considerations when selecting a binding mode are as follows:
</div>

* Similarity to experimentally observed binding modes in available crystal structures of the target protein
* Steric and electronic complementarity
* Minimized number of unsaturated hydrogen bonds
* No solvent-exposed hydrophobic moieties in the ligand
* Displacement of, or interactions with, water molecules in the pocket
* Steric strain induced by ligand binding

<div style="text-align: justify"> 
However, since manual inspection is a time-consuming process, it can only be performed on a small subset of calculated docking poses, which are usually those with the highest calculated binding affinities. To visualize the results, we use <b><i>NGLView</i></b>, which is a <i>Jupyter</i> widget (using a <i>Python</i> wrapper for the <i>JavaScript</i>-based <i>NGL</i> library) allowing for visualization of structures within a <i>Jupyter</i> notebook in an interactive 3D view.</div>
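
Shortlisting poses for visual inspection can be as simple as sorting them by their predicted affinity. The sketch below assumes that affinities are reported in kcal/mol, where more negative values indicate stronger predicted binding; the pose labels, scores and the helper `select_for_inspection` are hypothetical.

```python
# Hypothetical docking results: (pose label, predicted affinity in kcal/mol).
poses = [
    ("analog_1_pose_1", -9.4),
    ("analog_2_pose_1", -7.1),
    ("analog_3_pose_1", -10.2),
    ("analog_1_pose_2", -8.8),
]


def select_for_inspection(poses, n=2):
    """Return the n poses with the most negative (most favorable) predicted affinity."""
    return sorted(poses, key=lambda pose: pose[1])[:n]


shortlist = select_for_inspection(poses)
for label, affinity in shortlist:
    print(f"{label}: {affinity} kcal/mol")
```
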

<a id='practical'></a>
## Practical

In this section, we will implement and demonstrate the automated structure-based virtual screening pipeline step by step.

First of all, we import all the dependencies that we are going to need for the development of the pipeline:

In [None]:
# Standard library:
from enum import Enum  # for creating enumeration classes
import gzip  # for decompressing .gz files downloaded from DoGSiteScorer
import io  # for creating file-like objects from strings of data (needed as input for some functions)
import logging  # for setting the logging level of some packages (i.e. to suppress the excessive default logging of packages such as PLIP)
from pathlib import Path  # for creating folders and handling local paths
import subprocess  # for creating shell processes (needed to communicate with Smina program)
import time  # for creating pauses during the runtime (to wait for the response of API requests)
from urllib.parse import quote  # for url quoting

# 3rd-party packages:
from biopandas.pdb import PandasPdb  # for working with PDB files
from IPython.display import (
    Markdown,
    Image,
)  # for more display options in the Jupyter Notebook
from ipywidgets import (
    AppLayout,
    Layout,
    Select,
    Button,
)  # for interactive outputs in the Jupyter Notebook
import matplotlib as mpl  # for changing the display settings of plots (see bottom of the cell: Settings)
from matplotlib import (
    colors,
)  # for plotting color-maps (for visualization of protein-ligand interactions)
import matplotlib.pyplot as plt  # for plotting of data
import nglview as nv  # for visualization of the protein and protein-related data (e.g. binding sites, docking poses)
import numpy as np  # for some more functionalities when using Pandas (e.g. for handling NaN values)
from openbabel import (
    pybel,
)  # for preparing protein and ligand for docking, and other manipulations of PDB files
import pandas as pd  # for creating dataframes and handling data
import plip  # for changing the logging setting of the package (see bottom of the cell: Settings)
from plip.structure.preparation import (
    PDBComplex,
)  # for calculating protein-ligand interactions
from plip.exchange.report import (
    BindingSiteReport,
)  # for calculating protein-ligand interactions
import pypdb  # for communicating with the RCSB Protein Data Bank (PDB) to fetch PDB files
from rdkit import (
    Chem,
)  # for handling ligand data and calculating ligand-related properties
from rdkit.Chem import Draw, AllChem, Descriptors, PandasTools, rdFMCS
import requests  # for communicating with web-service APIs

# In-house packages:
from opencadd.structure.core import Structure  # for manipulating PDB files

# Settings:
logging.getLogger(plip.__name__).setLevel(
    logging.WARNING
)  # disabling excessive INFO logs of the PLIP package
mpl.rcParams["figure.dpi"] = 300  # for plots with higher resolution
mpl.rcParams["agg.path.chunksize"] = 10000  # for handling plots with large number of data points
pd.set_option("display.max_columns", 100)  # showing 100 columns at most
pd.set_option("display.max_colwidth", 200)  # increasing the maximum column width
pd.set_option("display.width", None)  # showing each row in a single line

<a id='1_practical'></a>
### Outline of the Virtual Screening Pipeline
<br>
<div style="text-align: justify"> 
Since this is a relatively large project with several different functionalities, it is good practice to use classes. By doing so, possible name collisions are avoided, and the code will be better structured, and thus easier to follow, maintain, reuse and expand. Therefore, we are going to implement the following classes, each handling a specific part of the pipeline:</div>

* ***Specs***: Reads and internalizes all the specifications in the input data.
* ***Protein***: Creates a protein object, containing all the protein data and required functionalities to process the protein.
* ***Ligand***: Creates a ligand object, containing all the ligand data and required functionalities to process the ligand.
* ***BindingSiteDetection***: Automatically performs all the necessary processes for binding-site detection.
* ***LigandSimilaritySearch***: Automatically performs all the necessary processes for finding analogs.
* ***Docking***: Carries out the docking experiments for all provided analogs.
* ***InteractionAnalysis***: Analyzes the protein-ligand interactions in the calculated docking modes.
* ***OptimizedLigands***: Analyzes the whole data obtained throughout the pipeline to choose the best analogs according to the input specifications.

<div style="text-align: justify"> 
If we did not want to present our code step by step, we would start by implementing the above classes first, and then use them at the end to build the pipeline. While this approach would be much more convenient for programming, it is not appropriate for presentation purposes. First of all, the program would not be presentable until the end, due to the bottom-up approach. Moreover, it would result in large blocks of code, preventing us from demonstrating the workings of the program in a step-by-step fashion. To circumvent this problem, we are going to structure the code as follows:</div>

1. For each specific functionality of the pipeline (e.g. reading input data, handling PDB files etc.) a separate helper-class will be defined, containing all the required functions as static methods.
2. All functions of the helper-classes can then be separately demonstrated (in [***Supplementary Information***](#supp)). 
3. Using these helper-classes, the main classes of the pipeline can then be developed and demonstrated in a step-by-step fashion.
4. At the end, when all the necessary parts of the pipeline have been developed, a function called <b><i>run_pipeline</i></b> will be defined, which automatically carries out all of the processes we demonstrated through the talktorial, and generates the required output, merely by providing it with an input specification file for the project. 
<br>

<div style="text-align: justify"> 
By doing so, the pipeline can be implemented and executed step by step, so that the inner workings of the code can be better presented, while we still end up with a fully automated pipeline. Therefore, for our demonstration we will first define a container class <b><i>LeadOptimizationPipeline</i></b> with only a single instance attribute called <b><i>name</i></b>, which allows us to instantiate the class and create a new pipeline project by providing the project's name. For the rest of the talktorial we are then going to define the other mentioned classes, instantiate them with the necessary input data, and assign each instance to our created project, in order to have all the generated data always organized in one place.</div>

In [None]:
class LeadOptimizationPipeline:
    def __init__(self, project_name):
        self.name = project_name

#### Creating a New Project: Instantiating The ***LeadOptimizationPipeline*** Class 
<br>
<div style="text-align: justify"> 
    We can now create an instance of the <b><i>LeadOptimizationPipeline</i></b> class, which we will call <b><i>project1</i></b>. For demonstrating the pipeline, we have chosen the same protein and ligand illustrated in <i>Figure 1</i>, i.e. the epidermal growth factor receptor (EGFR) as our target protein, and the ligand with the ChEMBL-ID CHEMBL328216. Thus, we will simply name our lead optimization project <b>Project1_EGFR_CHEMBL328216</b>:</div>
<br>

***Note:*** For more clarity, we will always explicitly write the names of all input parameters for each function and class when using them.

In [None]:
project1 = LeadOptimizationPipeline(project_name="Project1_EGFR_CHEMBL328216")

The instance-attribute ***name*** can then be accessed as follows:

In [None]:
project1.name
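
The structure described above (helper-classes with static methods, main classes built on top of them, and instances attached to the project container) can be illustrated with a small sketch. The names `NameHelpers` and `ExampleStep` are hypothetical placeholders and do not appear in the pipeline; only `LeadOptimizationPipeline` mirrors the class defined above.

```python
class NameHelpers:
    """Hypothetical helper-class: groups related functions as static methods."""

    @staticmethod
    def normalize_label(raw_label):
        # Static methods need no instance; they are called on the class itself.
        return raw_label.strip().replace(" ", "_")


class LeadOptimizationPipeline:
    """Container class, as defined above."""

    def __init__(self, project_name):
        self.name = project_name


class ExampleStep:
    """Hypothetical main class that relies on the helper-class."""

    def __init__(self, label):
        self.label = NameHelpers.normalize_label(raw_label=label)


demo_project = LeadOptimizationPipeline(project_name="Demo_Project")
# Attaching the instance to the project keeps all generated data in one place.
demo_project.ExampleStep = ExampleStep(label=" binding site ")
print(demo_project.name, demo_project.ExampleStep.label)
```

Because the helper functions are static, each of them can be called and demonstrated on its own, without first constructing any of the main classes.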

<a id='2_practical'></a>
### Reading the Input Data and Initializing Output Paths
<br>
<div style="text-align: justify"> 
The first thing the pipeline should be able to do is to read and process the input data for the protein and the ligand, as well as specifications for the processes that need to be performed on them. Since this involves many parameters, it is best to use a file to store all the necessary input data for a specific project. Thus for each project, the user only has to fill in a template input file with all the necessary data, and then specify the filepath of the input data when running the program. Here, we use a CSV file, which is stored in the folder <b><i>data</i></b>, under the name <b><i>InputData_Template</i></b>. For demonstration purposes, we can open the empty template file here to have a closer look:</div>

In [None]:
pd.read_csv("data/InputData_Template.csv")

As can be seen, the table contains four columns:
* ***Subject***: Specifies the subject of the input parameter. As discussed, we need to input a ***Protein***, a ***Ligand*** and a set of specifications corresponding to each part of the pipeline, namely ***Binding Site***, ***Ligand Similarity Search***, ***Docking***, ***Interaction Analysis*** and ***Optimized Ligand***. 

* ***Property***: Specifies a specific property of the ***Subject***. Required properties are marked with an asterisk. All other properties are optional (i.e. have default values set in the program), and some are dependent on other properties. For example, if the ***Binding Site Definition Method*** is not ***'coordinates'***, then there is no need to enter the value for the ***Binding Site Coordinates*** row. 

* ***Value***: The only column that should be filled by the user. Each value corresponds to a specific ***Property*** of a specific ***Subject***.

* ***Description***: Provides a short description as to what input data is expected in each specific row, and when it should be provided.
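
To make the four-column layout more concrete, the following sketch builds a miniature input table in memory and reads it with <i>Pandas</i>. The rows and their values are illustrative assumptions (the real template contains more rows), and the column names are taken from the list above.

```python
import io

import pandas as pd

# Hypothetical miniature of the input file, with the same four columns.
csv_text = """Subject,Property,Value,Description
Protein,Input Type*,pdb_code,How the protein is specified
Protein,Input Value*,3W32,Identifier of the protein
Ligand,Input Type*,smiles,How the ligand is specified
Binding Site,Definition Method,,Optional row that is left empty here
"""

table = pd.read_csv(io.StringIO(csv_text))

# Using Subject and Property as a two-level index makes each value addressable
# by the pair (subject, property).
values = table.set_index(["Subject", "Property"])["Value"]

# All values belonging to one subject:
protein_values = values.xs("Protein", level=0)
print(protein_values["Input Value*"])

# Empty cells are read as NaN, which is how optional rows can be recognized.
definition_method = values.loc[("Binding Site", "Definition Method")]
print(pd.isna(definition_method))
```

Addressing values by subject and property, rather than by row position, means that reordering the rows of the file does not affect the lookup.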


#### Defining The Required Constants For Handling the Input Data: The ***Consts*** Data-Class
<br>
<div style="text-align: justify"> 
In order to import and process the input file and extract all the different parameters, we will use the <i>Pandas</i> package, which can directly read CSV files and transform them into a <i>DataFrame</i> object, the <i>Pandas</i> equivalent of a table in a database. In a <i>quick-and-dirty</i> approach, the input parameters would then be accessed, whenever needed in the code, using their respective index and column names in the dataframe. However, this leads to code that is not easily maintainable or expandable, since a small change in the input data requires the whole code to be revised accordingly. A more robust approach is to first internalize all the input data in a data-class, so that the rest of the program needs only to communicate with this class and not with the input dataframe. This approach has several advantages. First, any error in the input data is recognized at the beginning, before any process is performed. Second, during the initial processing, all missing values can be replaced with their default values at once. Third, in future versions of the program, any change in the input data needs only to be accounted for in this class, and not in the entire code. Finally, the class gives an overview of all the possible inputs for the program. For these purposes, we define a data-class called <i><b>Consts</b></i>, which contains all the possible keywords in the input dataframe, such as the column names, index names, subject names, properties, etc. These keywords are all stored in their respective sub-classes as <b>Enumerations</b>, so that the rest of the code needs only to refer to the enumeration names, and not their values, which are subject to change in future versions of the program.
    </div>

In [None]:
class Consts:
    """
    Data-class containing the required constants for defining the expected input data.
    The rest of the code will refer to these constants.
    """

    # Constants for the input dataframe
    class DataFrame:

        # Name of the columns
        class ColumnNames(Enum):
            SUBJECT = "Subject"
            PROPERTY = "Property"
            VALUE = "Value"
            DESCRIPTION = "Description"

        # Name of the subjects
        class SubjectNames(Enum):
            PROTEIN = "Protein"
            LIGAND = "Ligand"
            BINDING_SITE = "Binding Site"
            LIGAND_SIMILARITY_SEARCH = "Ligand Similarity Search"
            DOCKING = "Docking"
            INTERACTION_ANALYSIS = "Interaction Analysis"
            OPTIMIZED_LIGAND = "Optimized Ligand"

    # Constants for the input protein data
    # i.e. name of protein properties, and their respective allowed values
    class Protein:
        class Properties(Enum):
            INPUT_TYPE = "Input Type*"
            INPUT = "Input Value*"

        class InputTypes(Enum):
            PDB_CODE = "pdb_code"
            PDB_FILEPATH = "pdb_filepath"

    # Constants for the input ligand data
    class Ligand:
        class Properties(Enum):
            INPUT_TYPE = "Input Type*"
            INPUT = "Input Value*"

        class InputTypes(Enum):
            NAME = "name"
            IUPAC_NAME = "iupac_name"
            SMILES = "smiles"
            CID = "cid"
            INCHI = "inchi"
            INCHIKEY = "inchikey"

    # Constants for the input specification data regarding binding-site
    class BindingSite:
        class Properties(Enum):
            DEFINITION_METHOD = "Definition Method"
            COORDINATES = "Coordinates"
            LIGAND = "LIGAND"
            DETECTION_METHOD = "Detection Method"
            SELECTION_METHOD = "Selection Method"
            SELECTION_CRITERIA = "Selection Criteria"
            PROTEIN_CHAIN_ID = "Protein Chain-ID"
            PROTEIN_LIGAND_ID = "Protein Ligand-ID"

        class DefinitionMethods(Enum):
            DETECTION = "detection"
            LIGAND = "ligand"
            COORDINATES = "coordinates"

        class DetectionMethods(Enum):
            DOGSITESCORER = "dogsitescorer"

        class SelectionMethods(Enum):
            SORTING = "sorting"
            FUNCTION = "function"

        class SelectionCriteria(Enum):
            LIGAND_COVERAGE = "lig_cov"
            POCKET_COVERAGE = "poc_cov"
            VOLUME = "volume"
            ENCLOSURE = "enclosure"
            SURFACE = "surface"
            DEPTH = "depth"
            SURFACE_VOLUME_RATIO = "surf/vol"
            ELLIPSOID_MAIN_AXIS_C_A_RATIO = "ell c/a"
            ELLIPSOID_MAIN_AXIS_B_A_RATIO = "ell b/a"
            NUM_POCKET_ATOMS = "siteAtms"
            NUM_CARBONS = "Cs"
            NUM_NITROGENS = "Ns"
            NUM_OXYGENS = "Os"
            NUM_SULFURS = "Ss"
            NUM_OTHER_ELEM = "Xs"
            NUM_H_ACCEPTORS = "accept"
            NUM_H_DONORS = "donor"
            NUM_HYDROPHOBIC_INTERACTIONS = "hydrophobic_interactions"
            NUM_HYDROPHOBICITY = "hydrophobicity"
            NUM_METALS = "metal"
            NUM_NEGATIVE_AA = "negAA"
            NUM_POSITIVE_AA = "posAA"
            NUM_POLAR_AA = "polarAA"
            NUM_APOLAR_AA = "apolarAA"
            SIMPLE_SCORE = "simpleScore"
            DRUG_SCORE = "drugScore"

    # Constants for the input specification data regarding ligand similarity search
    class LigandSimilaritySearch:
        class Properties(Enum):
            SEARCH_ENGINE = "Search Engine"
            MIN_SIMILARITY_PERCENT = "Minumum Similarity [%]"
            MAX_NUM_RESULTS = "Maximum Number of Results"
            MAX_NUM_DRUGLIKE = "Maximum Number of Most Drug-Like Analogs to Continue With"

        class SearchEngines(Enum):
            PUBCHEM = "pubchem"

    # Constants for the input specification data regarding docking
    class Docking:
        class Properties(Enum):
            PROGRAM = "Program"
            NUM_POSES_PER_LIGAND = "Number of Docking Poses per Ligand"
            EXHAUSTIVENESS = "Exhaustiveness"
            RANDOM_SEED = "Random Seed"

        class Programs(Enum):
            SMINA = "smina"

    # Constants for the input specification data regarding interaction analysis
    class InteractionAnalysis:
        class Properties(Enum):
            PROGRAM = "Program"

        class Programs(Enum):
            PLIP = "plip"

    # Constants for the input specification data regarding optimized ligands
    class OptimizedLigands:
        class Properties(Enum):
            NUM_RESULTS = "Number of Results"
            SELECTION_METHOD = "Selection Method"
            SELECTION_CRITERIA = "Selection Criteria"

        class SelectionMethods(Enum):
            SORTING = "sorting"
            FUNCTION = "function"

        class SelectionCriteria(Enum):
            BINDING_AFFINITY = "affinity"
            NUM_H_BONDS = "h_bond"
            NUM_HYDROPHOBIC_INTERACTIONS = "hydrophobic"
            NUM_SALT_BRIDGES = "salt_bridge"
            NUM_WATER_BRIDGES = "water_bridge"
            NUM_PI_STACKINGS = "pi_stacking"
            NUM_CATION_PI = "pi_cation"
            NUM_HALGON_BONDS = "halogen"
            NUM_METAL_BONDS = "metal"
            NUM_ALL_INTERACTIONS = "total_num_interactions"
            DRUG_SCORE_LIPINSKI = "drug_score_lipinski"
            DRUG_SCORE_QED = "drug_score_qed"
            DRUG_SCORE_CUSTOM = "drug_score_custom"
            DRUG_SCORE_TOTAL = "drug_score_total"

    # Constants for the input data regarding output paths
    class Output:
        class FolderNames(Enum):
            PROTEIN = "1_Protein"
            LIGAND = "2_Ligand"
            BINDING_SITE_DETECTION = "3_Binding Site Detection"
            SIMILARITY_SEARCH = "4_Ligand Similarity Search"
            DOCKING = "5_Docking"
            INTERACTION_ANALYSIS = "6_Interaction Analysis"
            VISUALIZATION = "7_Visualizations"
            OPTIMIZED_LIGANDS = "8_Optimized Ligands"
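
To illustrate why enumerations are convenient for validating user input, here is a small self-contained sketch. It re-declares a two-member enumeration mirroring <b><i>Consts.Protein.InputTypes</i></b>; the function `parse_input_type` is a hypothetical helper and not part of the pipeline.

```python
from enum import Enum


class InputTypes(Enum):
    # Mirrors Consts.Protein.InputTypes from the cell above.
    PDB_CODE = "pdb_code"
    PDB_FILEPATH = "pdb_filepath"


def parse_input_type(raw_value):
    """Convert a raw string from the input file into an enumeration member."""
    try:
        # Calling an Enum class with a value looks up the matching member.
        return InputTypes(raw_value)
    except ValueError:
        allowed = ", ".join(member.value for member in InputTypes)
        raise ValueError(
            f"Unknown input type {raw_value!r}; expected one of: {allowed}"
        ) from None


input_type = parse_input_type("pdb_code")
# The rest of the code refers to the member name, not to the string value.
print(input_type is InputTypes.PDB_CODE)

try:
    parse_input_type("pdb-code")  # typo in the input file
except ValueError as error:
    print(error)
```

With this pattern, a misspelled keyword is rejected as soon as the input is read, rather than surfacing later in the middle of a calculation.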

#### Implementing The Required Functions: The ***IO*** Helper-Class
<br>
<div style="text-align: justify"> 
Now to be able to read and process the input data, we will first define a helper class called <b><i>IO</i></b>, which contains all the necessary functions for handling the input and output data, e.g. for creating a dataframe from the input CSV file, extracting specific information from the dataframe, or creating folders for storing the output data. <br><br><b><i>Note:</i></b> For demonstration of each function, see the <a href="#io_demo">corresponding section</a> in <i>Supplementary Information</i>.</div>

In [None]:
class IO:
    """
    Set of functions for handling the input/output data.
    """

    @staticmethod
    def create_dataframe_from_csv_input_file(
        input_data_filepath, list_of_index_column_names, list_of_columns_to_keep
    ):
        """
        Read a CSV file and create a pandas DataFrame with given specifications.

        Parameters
        ----------
        input_data_filepath : str or pathlib.Path object
            Path of the CSV data file.
        list_of_index_column_names : list of strings
            List of column names in the CSV file to be used as indices for the dataframe.
        list_of_columns_to_keep : list of strings
            List of column names from the CSV file to keep in the dataframe.

        Returns
        -------
            Pandas DataFrame
        """
        input_df = pd.read_csv(input_data_filepath)
        input_df.set_index(list_of_index_column_names, inplace=True)
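        # Passing the axis as a keyword, since recent pandas versions no longer accept it positionally.
        input_df.drop(input_df.columns.difference(list_of_columns_to_keep), axis=1, inplace=True)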
        return input_df

    @staticmethod
    def copy_series_from_dataframe(input_df, index_name, column_name):
        """
        Take a multi-index dataframe, and make a copy of the data
        corresponding to a given index and column.

        Parameters
        ----------
        input_df : Pandas DataFrame
            The dataframe to extract the data from.
        index_name : str
            The index-value of the needed rows.
        column_name : str
            The column-name of the needed values.

        Returns
        -------
            Pandas Series
            Copy of the data corresponding to given index- and column-name.
        """
        subject_data = input_df.xs(index_name, level=0, axis=0)[column_name].copy()
        return subject_data

    @staticmethod
    def create_folder(folder_name, folder_path=""):
        """
        Create a folder with a given name in a given path.
        Also creates all non-existing parent folders.

        Parameters
        ----------
        folder_name : str
            Name of the folder to be created.

        folder_path : str (optional; default: current path)
            Either relative or absolute path of the folder to be created.

        Returns
        -------
            pathlib.Path object
            Full path of the created folder.
        """
        path = Path(folder_path) / folder_name
        # Creating the folder and all non-existing parent folders.
        path.mkdir(parents=True, exist_ok=True)
        return path
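
As a hedged illustration of how a folder-creating helper might be combined with an enumeration of folder names, the sketch below creates one sub-folder per enumeration member inside a temporary directory. The function `create_output_folders` and the reduced `FolderNames` enumeration are assumptions made for this example; they are not the pipeline's own implementation.

```python
import tempfile
from enum import Enum
from pathlib import Path


class FolderNames(Enum):
    # Subset of the folder names defined in Consts.Output.FolderNames.
    PROTEIN = "1_Protein"
    LIGAND = "2_Ligand"
    DOCKING = "5_Docking"


def create_output_folders(root_folder_path):
    """Create one sub-folder per enumeration member; return the paths by member name."""
    output_paths = {}
    for folder in FolderNames:
        path = Path(root_folder_path) / folder.value
        # parents=True creates missing parent folders; exist_ok=True makes re-runs harmless.
        path.mkdir(parents=True, exist_ok=True)
        output_paths[folder.name.lower()] = path
    return output_paths


with tempfile.TemporaryDirectory() as temporary_root:
    project_root = Path(temporary_root) / "Project1_EGFR_CHEMBL328216"
    output_paths = create_output_folders(root_folder_path=project_root)
    # Calling the function a second time does not raise, since the folders may already exist.
    rerun_paths = create_output_folders(root_folder_path=project_root)
    all_created = all(path.is_dir() for path in output_paths.values())
    print(sorted(output_paths))
```

A temporary directory is used here only so that the example leaves no files behind; in a real project the root folder would be the user-specified output path.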

#### Implementing The Pipeline's ***Specs*** Class
<br>
<div style="text-align: justify"> 
We can now create the <b><i>Specs</i></b> class of the pipeline, using the functionalities in the <b><i>IO</i></b> helper-class. This class is responsible for automatically reading and internalizing all the input data contained in the input file, so that the rest of the program needs only to communicate with this class and not the input dataframe for getting the required data. It also contains some logic, e.g. to check if all the necessary data for a specific project have been provided by the user, and to fill in some default values if needed. Furthermore, it creates the necessary folders for the output data, and stores their paths.</div>
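
Before looking at the full class, the default-filling logic can be sketched in isolation. Empty cells in a CSV file are read by <i>Pandas</i> as NaN, so each optional value has to be checked for NaN before it is used or converted. The function `value_or_default` below is a hypothetical helper written only for this illustration, using the standard library.

```python
import math


def value_or_default(raw_value, default, cast=None):
    """Return `default` for a missing value; otherwise the (optionally cast) value."""
    is_missing = raw_value is None or (
        isinstance(raw_value, float) and math.isnan(raw_value)
    )
    if is_missing:
        return default
    # The cast is applied only after the missing-value check,
    # because e.g. int(float("nan")) raises a ValueError.
    return cast(raw_value) if cast is not None else raw_value


missing = float("nan")
print(value_or_default(raw_value=missing, default=30, cast=int))
print(value_or_default(raw_value="12", default=30, cast=int))
print(value_or_default(raw_value="smina", default="other"))

# Two separately created NaN objects are neither identical nor equal,
# which is why a dedicated check such as math.isnan is used here.
another_missing = float("nan")
print(missing is another_missing, missing == another_missing)
```

A value-based check of this kind does not depend on which particular NaN object happens to be stored in the dataframe.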

In [None]:
class Specs:
    """
    Data-class containing all the input data and output paths of the project.
    Take the filepath to the CSV input-data as well as output path, and
    internalize all the data as instance attributes.

    Parameters
    ----------
    input_data_filepath : str or pathlib.Path object
        Relative or absolute local path of the input CSV-data-file for the project.
    output_data_root_folder_path : str or pathlib.Path object
        Relative or absolute local path of root folder to store output data in.
    """

    # Defining the default values for all optional entries
    # -----------------------------------------------------
    class BindingSiteDefaults(Enum):
        DEFINITION_METHOD = Consts.BindingSite.DefinitionMethods.DETECTION
        DETECTION_METHOD = Consts.BindingSite.DetectionMethods.DOGSITESCORER
        SELECTION_METHOD = Consts.BindingSite.SelectionMethods.SORTING
        SELECTION_CRITERIA_SORTING = [
            Consts.BindingSite.SelectionCriteria.LIGAND_COVERAGE.value,
            Consts.BindingSite.SelectionCriteria.POCKET_COVERAGE.value,
        ]
        SELECTION_CRITERIA_FUNCTION = f"(df[{Consts.BindingSite.SelectionCriteria.DRUG_SCORE.value}] + df[{Consts.BindingSite.SelectionCriteria.SIMPLE_SCORE.value}]) / df[{Consts.BindingSite.SelectionCriteria.VOLUME.value}]"
        PROTEIN_CHAIN_ID = ""
        PROTEIN_LIGAND_ID = ""

    class LigandSimilaritySearchDefaults(Enum):
        SEARCH_ENGINE = Consts.LigandSimilaritySearch.SearchEngines.PUBCHEM
        MIN_SIMILARITY_PERCENT = 80
        MAX_NUM_RESULTS = 100
        MAX_NUM_DRUGLIKE = 30

    class DockingDefaults(Enum):
        PROGRAM = Consts.Docking.Programs.SMINA
        NUM_POSES_PER_LIGAND = 5
        EXHAUSTIVENESS = 10
        RANDOM_SEED = 1111

    class InteractionAnalysisDefaults(Enum):
        PROGRAM = Consts.InteractionAnalysis.Programs.PLIP

    class OptimizedLigandsDefaults(Enum):
        NUM_RESULTS = 1
        SELECTION_METHOD = Consts.OptimizedLigands.SelectionMethods.SORTING
        SELECTION_CRITERIA_SORTING = [
            Consts.OptimizedLigands.SelectionCriteria.BINDING_AFFINITY.value,
            Consts.OptimizedLigands.SelectionCriteria.NUM_ALL_INTERACTIONS.value,
        ]
        SELECTION_CRITERIA_FUNCTION = f"-2*df[{Consts.OptimizedLigands.SelectionCriteria.BINDING_AFFINITY.value}] + df[{Consts.OptimizedLigands.SelectionCriteria.NUM_ALL_INTERACTIONS.value}] * df[{Consts.OptimizedLigands.SelectionCriteria.DRUG_SCORE_TOTAL.value}]"

    # -----------------------------------------------------

    def __init__(self, input_data_filepath, output_data_root_folder_path):
        self.RawData = self.RawData(input_data_filepath)
        self.Protein = self.Protein(self.RawData.protein)
        self.Ligand = self.Ligand(self.RawData.ligand)
        self.BindingSite = self.BindingSite(self.RawData.binding_site)
        self.LigandSimilaritySearch = self.LigandSimilaritySearch(
            self.RawData.ligand_similarity_search
        )
        self.Docking = self.Docking(self.RawData.docking)
        self.InteractionAnalysis = self.InteractionAnalysis(self.RawData.interaction_analysis)
        self.OptimizedLigands = self.OptimizedLigands(self.RawData.optimized_ligand)
        self.OutputPaths = self.OutputPaths(output_data_root_folder_path)

    # Defining a sub-class for each part of the pipeline
    # -----------------------------------------------------
    class RawData:
        def __init__(self, input_data_filepath):
            self.filepath = input_data_filepath
            self.all_data = IO.create_dataframe_from_csv_input_file(
                input_data_filepath=input_data_filepath,
                list_of_index_column_names=[
                    Consts.DataFrame.ColumnNames.SUBJECT.value,
                    Consts.DataFrame.ColumnNames.PROPERTY.value,
                ],
                list_of_columns_to_keep=[Consts.DataFrame.ColumnNames.VALUE.value],
            )

            for subject_name in Consts.DataFrame.SubjectNames:
                subject_data = IO.copy_series_from_dataframe(
                    input_df=self.all_data,
                    index_name=subject_name.value,
                    column_name=Consts.DataFrame.ColumnNames.VALUE.value,
                )
                setattr(self, subject_name.name.lower(), subject_data)

    class Protein:
        def __init__(self, input_protein_data):
            self.input_type = Consts.Protein.InputTypes(
                input_protein_data[Consts.Protein.Properties.INPUT_TYPE.value]
            )
            self.input_value = input_protein_data[Consts.Protein.Properties.INPUT.value]

    class Ligand:
        def __init__(self, input_ligand_data):
            self.input_type = Consts.Ligand.InputTypes(
                input_ligand_data[Consts.Ligand.Properties.INPUT_TYPE.value]
            )
            self.input_value = input_ligand_data[Consts.Ligand.Properties.INPUT.value]

    class BindingSite:
        def __init__(self, input_binding_site_data):
            definition_method = input_binding_site_data[
                Consts.BindingSite.Properties.DEFINITION_METHOD.value
            ]
            # check if definition method is given, if not set to default (i.e. detection)
            self.definition_method = (
                Specs.BindingSiteDefaults.DEFINITION_METHOD.value
                if definition_method is np.nan
                else Consts.BindingSite.DefinitionMethods(definition_method)
            )

            if self.definition_method is Consts.BindingSite.DefinitionMethods.COORDINATES:
                coordinates_as_string = input_binding_site_data[
                    Consts.BindingSite.Properties.COORDINATES.value
                ]
                coordinates = coordinates_as_string.split(" ")
                self.coordinates = {"center": coordinates[:3], "size": coordinates[3:]}

            elif self.definition_method is Consts.BindingSite.DefinitionMethods.LIGAND:
                self.ligand = input_binding_site_data[Consts.BindingSite.Properties.LIGAND.value]

            elif self.definition_method is Consts.BindingSite.DefinitionMethods.DETECTION:
                detection_method = input_binding_site_data[
                    Consts.BindingSite.Properties.DETECTION_METHOD.value
                ]
                self.detection_method = (
                    Specs.BindingSiteDefaults.DETECTION_METHOD.value
                    if detection_method is np.nan
                    else Consts.BindingSite.DetectionMethods(detection_method)
                )

                protein_chain_id = input_binding_site_data[
                    Consts.BindingSite.Properties.PROTEIN_CHAIN_ID.value
                ]
                self.protein_chain_id = (
                    Specs.BindingSiteDefaults.PROTEIN_CHAIN_ID.value
                    if protein_chain_id is np.nan
                    else protein_chain_id
                )

                protein_ligand_id = input_binding_site_data[
                    Consts.BindingSite.Properties.PROTEIN_LIGAND_ID.value
                ]
                self.protein_ligand_id = (
                    Specs.BindingSiteDefaults.PROTEIN_LIGAND_ID.value
                    if protein_ligand_id is np.nan
                    else protein_ligand_id
                )

                selection_method = input_binding_site_data[
                    Consts.BindingSite.Properties.SELECTION_METHOD.value
                ]
                self.selection_method = (
                    Specs.BindingSiteDefaults.SELECTION_METHOD.value
                    if selection_method is np.nan
                    else Consts.BindingSite.SelectionMethods(selection_method)
                )

                if self.selection_method is Consts.BindingSite.SelectionMethods.SORTING:
                    selection_criteria = input_binding_site_data[
                        Consts.BindingSite.Properties.SELECTION_CRITERIA.value
                    ]
                    if selection_criteria is np.nan:
                        self.selection_criteria = (
                            Specs.BindingSiteDefaults.SELECTION_CRITERIA_SORTING.value
                        )
                    else:
                        # pass the column names through the SelectionCriteria enumeration class to make sure they are valid
                        self.selection_criteria = [
                            Consts.BindingSite.SelectionCriteria(pocket_property.strip()).value
                            for pocket_property in selection_criteria.split(",")
                        ]

                elif self.selection_method is Consts.BindingSite.SelectionMethods.FUNCTION:
                    selection_criteria = input_binding_site_data[
                        Consts.BindingSite.Properties.SELECTION_CRITERIA.value
                    ]
                    self.selection_criteria = (
                        Specs.BindingSiteDefaults.SELECTION_CRITERIA_FUNCTION.value
                        if selection_criteria is np.nan
                        else selection_criteria
                    )

    class LigandSimilaritySearch:
        def __init__(self, similarity_search_data):

            search_engine = similarity_search_data[
                Consts.LigandSimilaritySearch.Properties.SEARCH_ENGINE.value
            ]
            self.search_engine = (
                Specs.LigandSimilaritySearchDefaults.SEARCH_ENGINE.value
                if search_engine is np.nan
                else Consts.LigandSimilaritySearch.SearchEngines(search_engine)
            )

            min_similarity_percent = similarity_search_data[
                Consts.LigandSimilaritySearch.Properties.MIN_SIMILARITY_PERCENT.value
            ]
            self.min_similarity_percent = (
                Specs.LigandSimilaritySearchDefaults.MIN_SIMILARITY_PERCENT.value
                if min_similarity_percent is np.nan
                else min_similarity_percent
            )

            max_num_results = similarity_search_data[
                Consts.LigandSimilaritySearch.Properties.MAX_NUM_RESULTS.value
            ]
            self.max_num_results = (
                Specs.LigandSimilaritySearchDefaults.MAX_NUM_RESULTS.value
                if max_num_results is np.nan
                else max_num_results
            )

            max_num_druglike = similarity_search_data[
                Consts.LigandSimilaritySearch.Properties.MAX_NUM_DRUGLIKE.value
            ]
            # check for a missing value before casting, since int(NaN) raises a ValueError
            self.max_num_druglike = (
                Specs.LigandSimilaritySearchDefaults.MAX_NUM_DRUGLIKE.value
                if max_num_druglike is np.nan
                else int(max_num_druglike)
            )

    class Docking:
        def __init__(self, docking_data):

            program = docking_data[Consts.Docking.Properties.PROGRAM.value]
            self.program = (
                Specs.DockingDefaults.PROGRAM
                if program is np.nan
                else Consts.Docking.Programs(program)
            )

            num_poses_per_ligand = docking_data[
                Consts.Docking.Properties.NUM_POSES_PER_LIGAND.value
            ]
            self.num_poses_per_ligand = (
                Specs.DockingDefaults.NUM_POSES_PER_LIGAND.value
                if num_poses_per_ligand is np.nan
                else num_poses_per_ligand
            )

            exhaustiveness = docking_data[Consts.Docking.Properties.EXHAUSTIVENESS.value]
            self.exhaustiveness = (
                Specs.DockingDefaults.EXHAUSTIVENESS.value
                if exhaustiveness is np.nan
                else exhaustiveness
            )

            random_seed = docking_data[Consts.Docking.Properties.RANDOM_SEED.value]
            self.random_seed = (
                Specs.DockingDefaults.RANDOM_SEED.value if random_seed is np.nan else random_seed
            )

    class InteractionAnalysis:
        def __init__(self, interaction_analysis_data):

            program = interaction_analysis_data[
                Consts.InteractionAnalysis.Properties.PROGRAM.value
            ]
            self.program = (
                Specs.InteractionAnalysisDefaults.PROGRAM
                if program is np.nan
                else Consts.InteractionAnalysis.Programs(program)
            )

    class OptimizedLigands:
        def __init__(self, optimized_ligand_data):
            num_results = optimized_ligand_data[
                Consts.OptimizedLigands.Properties.NUM_RESULTS.value
            ]
            self.num_results = (
                Specs.OptimizedLigandsDefaults.NUM_RESULTS.value
                if num_results is np.nan
                else int(num_results)
            )

            selection_method = optimized_ligand_data[
                Consts.OptimizedLigands.Properties.SELECTION_METHOD.value
            ]
            self.selection_method = (
                Specs.OptimizedLigandsDefaults.SELECTION_METHOD
                if selection_method is np.nan
                else Consts.OptimizedLigands.SelectionMethods(selection_method)
            )

            if self.selection_method is Consts.OptimizedLigands.SelectionMethods.SORTING:
                selection_criteria = optimized_ligand_data[
                    Consts.OptimizedLigands.Properties.SELECTION_CRITERIA.value
                ]
                if selection_criteria is np.nan:
                    self.selection_criteria = (
                        Specs.OptimizedLigandsDefaults.SELECTION_CRITERIA_SORTING.value
                    )
                else:
                    # pass the column names through the SelectionCriteria enumeration class to make sure they are valid
                    self.selection_criteria = [
                        Consts.OptimizedLigands.SelectionCriteria(criterion.strip()).value
                        for criterion in selection_criteria.split(",")
                    ]

            elif self.selection_method is Consts.OptimizedLigands.SelectionMethods.FUNCTION:
                selection_criteria = optimized_ligand_data[
                    Consts.OptimizedLigands.Properties.SELECTION_CRITERIA.value
                ]
                self.selection_criteria = (
                    Specs.OptimizedLigandsDefaults.SELECTION_CRITERIA_FUNCTION.value
                    if selection_criteria is np.nan
                    else selection_criteria
                )

    class OutputPaths:
        """
        Data-class containing all the output paths for different parts of the pipeline.
        Take a main output path, and create all required parent folders,
        as well as sub-folders for each part of the pipeline.
        """

        def __init__(self, output_path):
            self.root = Path(output_path)
            for folder_name in Consts.Output.FolderNames:
                folder_path = IO.create_folder(folder_name.value, output_path)
                setattr(self, folder_name.name.lower(), folder_path)

##### Processing The Input Data: Instantiating The ***Specs*** Class
<br>
<div style="text-align: justify"> 
We can now instantiate the <b><i>Specs</i></b> class by specifying the filepath of the input CSV file and the output path for storing the output data. As discussed, we assign the created instance to our project to keep everything organized in one place.</div>
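
A side note on the fallback pattern used throughout the class above: the `is np.nan` checks only match when a missing value is the `np.nan` object itself. The sketch below shows one possible way to make the check independent of that assumption by using `pd.isna`. The helper name `value_or_default` and its `converter` argument are hypothetical and not part of the pipeline; the sketch also assumes scalar inputs.

```python
import numpy as np
import pandas as pd


def value_or_default(raw_value, default, converter=None):
    """Hypothetical helper: return `default` for a missing scalar, else the (converted) value."""
    if pd.isna(raw_value):
        return default
    return converter(raw_value) if converter is not None else raw_value


# A NaN created elsewhere is a different object than the np.nan singleton,
# so an identity test does not recognize it as missing.
other_nan = np.float64("nan")
print(other_nan is np.nan)

# pd.isna treats both as missing, so the default is used in both cases.
print(value_or_default(np.nan, 10))
print(value_or_default(other_nan, 10))

# A provided value is passed through the optional converter.
print(value_or_default("25", 10, converter=int))
```

Whether such a helper is worth adopting depends on how the input CSV is read; it is shown here only to make the assumption behind the identity check explicit.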

In [None]:
project1.Specs = Specs(
    input_data_filepath="data/PipelineInputData_Project1.csv",
    output_data_root_folder_path="data/Outputs/" + project1.name,
)

<div style="text-align: justify"> 
All data available in the input CSV file of our project are now stored in the project's <b><i>Specs</i></b> instance and can be accessed via the corresponding instance attributes. A few examples are shown below.
    <br><br>
    <b><i>Note:</i></b> Another advantage of storing the data as instance attributes and assigning them to <b><i>project1</i></b> is that all of the project's data are discoverable from anywhere in the code: just type <span style="background-color: #D3D3D3">project1.</span> and press the <i>Tab</i> key, and code completion will display a list of all available options. In contrast, storing the data in a dataframe, for example, requires you to keep the dataframe in front of you to know the exact address of each piece of data you need in another part of the code.  
</div>

In [None]:
project1.Specs.Protein.input_value

In [None]:
project1.Specs.Ligand.input_value

In [None]:
project1.Specs.Docking.num_poses_per_ligand

In [None]:
project1.Specs.OutputPaths.binding_site_detection

It is also possible to see the raw input data in its entirety:

In [None]:
project1.Specs.RawData.all_data

<a id='3_practical'></a>
### Processing the Input Protein Data 
<br>
<div style="text-align: justify"> 
As mentioned above, it is a good idea to use the input protein data to create a <b><i>Protein</i></b> object with its own attributes and methods. To implement this, we first need some functions for processing protein data, such as fetching PDB files from the PDB webserver and parsing PDB files to extract useful data. For this purpose, we first define a helper-class called <b><i>PDB</i></b>, which implements these functionalities. </div>

#### Implementing The Required Functions: The ***PDB*** Helper-Class
<br>
<div style="text-align: justify"> 
This helper-class contains all the necessary functions for handling protein data, especially via processing PDB files. <br><br><b><i>Note:</i></b> For demonstration of each function, see the <a href="#pdb_demo">corresponding section</a> in <i>Supplementary Information</i>.</div>
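
As a minimal illustration of the kind of record parsing this helper-class performs, the sketch below collects ligand information from `HET` records in a few hand-written lines of PDB-style text. The function name `parse_het_records` and the sample lines are assumptions made for illustration; the sample is not a real PDB entry, and the first `HET` line merely mirrors the ligand format described later in this section.

```python
def parse_het_records(pdb_text):
    """Hypothetical helper: collect [ligand-ID, chain-ID + residue number, atom count] per HET record."""
    ligands = []
    for line in pdb_text.splitlines():
        # The record name is the first whitespace-delimited token of each line
        record_name, _, rest = line.partition(" ")
        if record_name == "HET":
            fields = rest.split()
            fields[-1] = int(fields[-1])  # last field is the number of atoms
            ligands.append(fields)
    return ligands


# Hand-written sample text (not taken from a real structure)
sample = """TITLE     HAND-WRITTEN EXAMPLE, NOT A REAL PDB ENTRY
HET    W32  A1101      39
HET     MG  A1102       1
HETATM    1  C1  W32 A1101      10.000  10.000  10.000  1.00  0.00           C
"""

print(parse_het_records(sample))
```

Note that PDB records are fixed-width, so splitting on whitespace may produce a different number of fields when the chain-ID and a shorter residue number are separated by spaces; slicing by column position would likely be more robust in that case.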

In [None]:
class PDB:
    """
    Set of functions for working with PDB files.
    """

    @staticmethod
    def read_pdb_file_content(input_type, input_value):
        """
        Read the content of a PDB file either from a local path or via fetching the file from PDB webserver.

        Parameters
        ----------
        input_type : str
            Either 'pdb_code' or 'pdb_filepath'.

        input_value : str
            Either a valid PDB-code, or a local filepath of a PDB file.

        Returns
        -------
        str
            Content of the PDB file as a single string.
        """
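        if input_type == "pdb_code":
            pdb_file_content = pypdb.get_pdb_file(input_value)
        elif input_type == "pdb_filepath":
            with open(input_value) as f:
                pdb_file_content = f.read()
        else:
            # Fail early with a clear message instead of an UnboundLocalError below
            raise ValueError(
                f"Unknown input_type {input_type!r}; expected 'pdb_code' or 'pdb_filepath'."
            )
        return pdb_file_content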

    @staticmethod
    def fetch_and_save_pdb_file(pdb_code, output_filepath):
        """
        Fetch a PDB file from the PDB webserver and save locally.

        Parameters
        ----------
        pdb_code : str
            PDB-code of the protein structure.

        output_filepath : str or pathlib.Path object
            Local file path (including file name, but not the extension) to save the PDB file in.

        Returns
        -------
        pathlib.Path object
            The full path of the saved PDB file.
        """
        pdb_file_content = pypdb.get_pdb_file(pdb_code)
        full_filepath = str(output_filepath) + ".pdb"
        with open(full_filepath, "w") as f:
            f.write(pdb_file_content)
        return Path(full_filepath)

    @staticmethod
    def extract_molecule_from_pdb_file(molecule_name, input_filepath, output_filepath):
        """
        Extract a specific molecule (i.e. the protein or a ligand)
        from a local PDB file and save as a new PDB file in a given path.

        Parameters
        ----------
        molecule_name : str
            Name of the molecule to be extracted.
            For the protein, enter 'protein'. For a ligand, enter the ligand-ID.

        input_filepath : str or pathlib.Path object
            Local path of the original PDB file.

        output_filepath : str or pathlib.Path object
            Local file path (including file name) to save the PDB file of the extracted molecule in.

        Returns
        -------
            <Universe> Structure object
            Structure object of the extracted molecule.
        """

        pdb_structure = Structure.from_string(input_filepath)
        molecule_name = f"resname {molecule_name}" if molecule_name != "protein" else molecule_name
        extracted_structure = pdb_structure.select_atoms(molecule_name)
        extracted_structure.write(output_filepath)
        return extracted_structure

    @staticmethod
    def load_pdb_file_as_dataframe(pdb_file_text_content):
        """
        Transform the textual content of a PDB file into a dictionary of Pandas DataFrames.

        Parameters
        ----------
        pdb_file_text_content : str
            Textual content of a PDB file as a single string.

        Returns
        -------
        Dictionary of Pandas DataFrames.
        The dictionary has four entries with following keys: 'ATOM', 'HETATM', 'ANISOU' and 'OTHERS'.
        Each value is a Pandas DataFrame corresponding to the specific information described by the key.
        """
        ppdb = PandasPdb()
        pdb_df = ppdb._construct_df(pdb_file_text_content.splitlines(True))
        # TODO: Change _construct_df to read_pdb_from_lines once biopandas
        # cuts a new release (currently: 0.2.7), see https://github.com/rasbt/biopandas/pull/72
        return pdb_df

    @staticmethod
    def extract_info_from_pdb_file_content(pdb_file_text_content):
        """
        Extract some useful information from the contents of a PDB file.

        Parameters
        ----------
        pdb_file_text_content : str
            Textual content of a PDB file as a single string.

        Returns
        -------
            dict
            Dictionary of the successfully extracted information.
            Possible keys are:
                'Structure Title' : str
                    Title of the PDB structure.
                'Name' : str
                    Name of the protein.
                'Chains' : list of strings
                    List of chain-IDs of the available chains in the protein.
                'Ligands' : list of lists
                    List of ligand information: [ligand-ID, chain-ID+residue number, number of heavy atoms]
        """

        pdb_content = pdb_file_text_content.strip().split("\n")
        for index in range(len(pdb_content)):
            pdb_content[index] = pdb_content[index].split(" ", 1)
            try:
                pdb_content[index][1] = pdb_content[index][1].strip()
                if pdb_content[index][1][-1] == ";":
                    pdb_content[index][1] = pdb_content[index][1][:-1]
                if (
                    pdb_content[index][0] in ["REMARK", "COMPND"]
                    and pdb_content[index][1][0].isdigit()
                ):
                    try:
                        pdb_content[index][1] = pdb_content[index][1].split(" ", 1)[1]
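                    except IndexError:
                        # No text follows the leading number; keep the field unchanged
                        pass
            except IndexError:
                # The line has no content after the record name
                pdb_content[index].append(" ")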

        info = {}
        ligands = []
        for content in pdb_content:
            if content[0] == "TITLE":
                info["Structure Title"] = content[1]
            if content[0] == "COMPND" and content[1].startswith("MOLECULE: "):
                info["Name"] = content[1].split("MOLECULE: ")[1]
            if content[0] == "COMPND" and content[1].startswith("CHAIN: "):
                info["Chains"] = content[1].split("CHAIN: ")[1].split(", ")
            if content[0] == "HET":
                lig = list(filter(lambda x: x != "", content[1].split(" ")))
                lig[-1] = int(lig[-1])
                ligands.append(lig)
        info["Ligands"] = ligands
        return info

#### Implementing The Required Functions: The ***NGLView*** Helper-Class
<br>
We also implement another helper-class, containing all the necessary functions for visualizing protein-related data, such as the protein structure, and later the protein binding site, docking poses of ligands and the protein-ligand interactions present in them.
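
Two small string-building techniques recur in the class below: a JavaScript snippet is filled in through a Python format string (where literal JavaScript braces must be doubled), and residue numbers are joined into a single selection string. The following sketch shows both in isolation. The template is a simplified stand-in rather than the snippet actually used, and the helper `residue_selection` is hypothetical.

```python
# Simplified stand-in for a JavaScript snippet: `{index}` and `{selection}` are
# placeholders, while doubled braces `{{ }}` remain literal braces after formatting.
JS_TEMPLATE = """
var ligand = this.stage.compList[{index}];
protein.addRepresentation("licorice", {{sele: "{selection}"}});
"""


def residue_selection(residue_numbers):
    """Hypothetical helper: join residue numbers into one selection string that hides hydrogens."""
    return " or ".join(f"({number} and not _H)" for number in residue_numbers)


selection = residue_selection([45, 93])
js_code = JS_TEMPLATE.format(index=1, selection=selection)
print(selection)
print(js_code)
```

Forgetting to double the braces would make `str.format` interpret the JavaScript object literal as a placeholder and raise an error, which is why the doubled form is needed.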

In [None]:
class NGLView:
    @staticmethod
    def protein(input_type, input_value, output_image_filename=None):
        """
        Visualize the protein.

        Parameters
        ----------
        input_type : str
            Either "pdb_code" or a file extension e.g. "pdb".
        input_value: str or pathlib.Path object
            Either the PDB-code of the protein, or a local filepath.
        output_image_filename : str (optional; default: None)
            Filename to save a static image of the protein.

        Returns
        -------
            NGLViewer object
            Interactive NGL viewer of the given Protein
            and (if available) its co-crystallized ligand.
        """

        if input_type == "pdb_code":
            viewer = nv.show_pdbid(input_value, height="600px")
        else:
            with open(input_value) as f:
                viewer = nv.show_file(
                    f, ext=input_type, height="600px", default_representation=False
                )
                viewer.add_representation("cartoon", selection="protein")

        viewer.add_representation(repr_type="ball+stick", selection="hetero and not water")
        viewer.center("protein")

        if output_image_filename is not None:
            viewer.render_image(trim=True, factor=2)
            viewer._display_image()
            viewer.download_image(output_image_filename)
        return viewer

    @staticmethod
    def binding_site(protein_input_type, protein_input_value, ccp4_filepath):
        """
        3D visualization of a binding pocket using a CCP4 file.

        Parameters
        ----------
        protein_input_type : str
            Either "pdb_code" or a file extension e.g. "pdb".
        protein_input_value: str or pathlib.Path object
            Either the PDB-code of the protein, or a local filepath.
        ccp4_filepath : str
            Local file path of the output of the Binding Site Detection.

        Returns
        -------
        NGL viewer that visualizes the selected pocket at its respective position.
        """
        viewer = NGLView.protein(protein_input_type, protein_input_value)
        with open(ccp4_filepath, "rb") as f:
            viewer.add_component(f, ext="ccp4")
        viewer.center()

        return viewer

    @staticmethod
    def docking(
        protein_filepath,
        protein_file_extension,
        list_docking_poses_filepaths,
        docking_poses_file_extension,
        list_docking_poses_labels,
        list_docking_poses_affinities,
    ):
        """
        Visualize a list of docking poses
        in the protein structure, using NGLView.

        Parameters
        ----------
        protein_filepath : str or pathlib.Path object
            Filepath of the extracted protein structure used in docking experiment.
        protein_file_extension : str
            File extension of the protein file, e.g. "pdb", "pdbqt" etc.
        list_docking_poses_filepaths : list of strings/pathlib.Path objects
            List of filepaths for the separated docking poses.
        docking_poses_file_extension : str
            File extension of the docking-pose files, e.g. "pdb", "pdbqt" etc.
        list_docking_poses_labels : list of strings
            List of labels for docking poses to be used for the selection menu.
        list_docking_poses_affinities : list of strings/numbers
            List of binding affinities in kcal/mol, to be viewed for each docking pose.

        Returns
        -------
            NGLView viewer
            Interactive viewer containing the protein structure and all docking poses,
            with menu to select between docking poses.
        """

        # JavaScript code needed to update residues around the ligand
        # because this part is not exposed in the Python widget
        # Based on: http://nglviewer.org/ngl/api/manual/snippets.html
        _RESIDUES_AROUND = """
        var protein = this.stage.compList[0];
        var ligand_center = this.stage.compList[{index}].structure.atomCenter();
        var around = protein.structure.getAtomSetWithinPoint(ligand_center, {radius});
        var around_complete = protein.structure.getAtomSetWithinGroup(around);
        var last_repr = protein.reprList[protein.reprList.length-1];
        protein.removeRepresentation(last_repr);
        protein.addRepresentation("licorice", {{sele: around_complete.toSeleString()}});
        """
        print("Docking modes")
        print("(CID - mode)")
        # Create viewer widget
        viewer = nv.NGLWidget(height="860px")
        viewer.add_component(protein_filepath, ext=protein_file_extension)
        # viewer.add_representation("cartoon", selection="protein")
        # Select first atom in molecule (@0) so it holds the affinity label
        label_kwargs = dict(
            labelType="text",
            sele="@0",
            showBackground=True,
            backgroundColor="black",
        )
        list_docking_poses_affinities = list(
            map(lambda x: str(x) + " kcal/mol", list_docking_poses_affinities)
        )
        for docking_pose_filepath, ligand_label in zip(
            list_docking_poses_filepaths, list_docking_poses_affinities
        ):
            ngl_ligand = viewer.add_component(
                docking_pose_filepath, ext=docking_poses_file_extension
            )
            ngl_ligand.add_label(labelText=[str(ligand_label)], **label_kwargs)

        # Create selection widget
        #   Options is a list of (text, value) tuples. When we click on select, the value will be passed
        #   to the callable registered in `.observe(...)`
        selector = Select(
            options=[(label, i) for (i, label) in enumerate(list_docking_poses_labels, 1)],
            description="",
            rows=len(list_docking_poses_filepaths)
            if len(list_docking_poses_filepaths) <= 52
            else 52,
            layout=Layout(flex="flex-grow", width="auto"),
        )

        # Arrange GUI elements
        # The selection box will be on the left, the viewer will occupy the rest of the window
        display(AppLayout(left_sidebar=selector, center=viewer, pane_widths=[1, 6, 1]))

        # This is the event handler - action taken when the user clicks on the selection box
        # We need to define it here so it can "see" the viewer variable
        def _on_selection_change(change):
            # Update only if the user clicked on a different entry
            if change["name"] == "value" and (change["new"] != change["old"]):
                viewer.hide(
                    list(range(1, len(list_docking_poses_filepaths) + 1))
                )  # Hide all ligands (components 1-n)
                component = getattr(viewer, f"component_{change['new']}")
                component.show()  # Display the selected one
                component.center(500)  # Zoom view
                # Call the JS code to show sidechains around ligand
                viewer._execute_js_code(_RESIDUES_AROUND.format(index=change["new"], radius=6))

        # Register event handler
        selector.observe(_on_selection_change)
        # Trigger event manually to focus on the first solution
        _on_selection_change({"name": "value", "new": 1, "old": None})
        return viewer

    @staticmethod
    def interactions(
        protein_filepath,
        protein_file_extension,
        list_docking_poses_filepaths,
        docking_poses_file_extension,
        list_docking_poses_labels,
        list_docking_poses_affinities,
        list_docking_poses_plip_dicts,
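    ):
        """
        Visualize the protein-ligand interactions of a list of docking poses
        in the protein structure, using NGLView.

        Parameters
        ----------
        protein_filepath : str or pathlib.Path object
            Filepath of the extracted protein structure used in docking experiment.
        protein_file_extension : str
            File extension of the protein file, e.g. "pdb", "pdbqt" etc.
        list_docking_poses_filepaths : list of strings/pathlib.Path objects
            List of filepaths for the separated docking poses.
        docking_poses_file_extension : str
            File extension of the docking-pose files, e.g. "pdb", "pdbqt" etc.
        list_docking_poses_labels : list of strings
            List of labels for docking poses to be used for the selection menu.
        list_docking_poses_affinities : list of strings/numbers
            List of binding affinities in kcal/mol, to be viewed for each docking pose.
        list_docking_poses_plip_dicts : list of dicts
            For each docking pose, a dictionary mapping each interaction type
            (the keys of `color_map` below) to a list whose first entry holds the column names
            and whose remaining entries are the interaction records.
            The columns 'LIGCOO', 'PROTCOO' and 'RESNR' are used here.

        Returns
        -------
            ipywidgets.AppLayout
            Layout containing the selection menu and, for the selected docking pose,
            an NGL viewer showing the protein, the ligand and their interactions.
        """
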
        color_map = {
            "hydrophobic": [0.90, 0.10, 0.29],
            "hbond": [0.26, 0.83, 0.96],
            "waterbridge": [1.00, 0.88, 0.10],
            "saltbridge": [0.67, 1.00, 0.76],
            "pistacking": [0.75, 0.94, 0.27],
            "pication": [0.27, 0.60, 0.56],
            "halogen": [0.94, 0.20, 0.90],
            "metal": [0.90, 0.75, 1.00],
        }

        # Create selection widget
        # Options is a list of (text, value) tuples.
        # When we click on select, the value will be passed
        # to the callable registered in `.observe(...)`
        selector = Select(
            options=[(label, i) for (i, label) in enumerate(list_docking_poses_labels, 1)],
            description="",
            rows=len(list_docking_poses_filepaths)
            if len(list_docking_poses_filepaths) <= 52
            else 52,
            layout=Layout(flex="flex-grow", width="auto"),
        )

        # Arrange GUI elements
        # The selection box will be on the left,
        # the viewer will occupy the rest of the window (but it will be added later)
        app = AppLayout(
            left_sidebar=selector,
            center=None,
            pane_widths=[1, 6, 1],
            height="860px",
        )

        # Show color-map
        fig, axs = plt.subplots(nrows=2, ncols=4, figsize=(12, 1))
        plt.subplots_adjust(hspace=1)
        fig.suptitle("Color-map of interactions", size=10, y=1.3)
        for ax, (interaction, color) in zip(fig.axes, color_map.items()):
            ax.imshow(np.zeros((1, 5)), cmap=colors.ListedColormap(color_map[interaction]))
            ax.set_title(interaction, loc="center", fontsize=10)
            ax.set_axis_off()
        plt.show()

        list_docking_poses_affinities = list(
            map(lambda x: str(x) + " kcal/mol", list_docking_poses_affinities)
        )

        # This is the event handler - action taken when the user clicks on the selection box
        # We need to define it here so it can "see" the viewer variable
        def _on_selection_change(change):
            # Update only if the user clicked on a different entry
            if change["name"] == "value" and (change["new"] != change["old"]):
                if app.center is not None:
                    app.center.close()

                # NGL Viewer
                app.center = viewer = nv.NGLWidget(height="860px", default=True, gui=True)
                prot_component = viewer.add_component(
                    protein_filepath, ext=protein_file_extension, default_representation=False
                )  # add protein
                prot_component.add_representation("cartoon")
                time.sleep(1)

                label_kwargs = dict(
                    labelType="text",
                    sele="@0",
                    showBackground=True,
                    backgroundColor="black",
                )
                # Option values are 1-based (see `enumerate(..., 1)` above), but the lists are 0-based
                pose_index = change["new"] - 1
                lig_component = viewer.add_component(
                    list_docking_poses_filepaths[pose_index], ext=docking_poses_file_extension
                )  # add selected ligand
                lig_component.add_label(
                    labelText=[str(list_docking_poses_affinities[pose_index])], **label_kwargs
                )
                time.sleep(1)
                lig_component.center(duration=500)

                # Add interactions
                interactions = list_docking_poses_plip_dicts[pose_index]

                interacting_residues = []

                for interaction_type, interaction_list in interactions.items():
                    color = color_map[interaction_type]
                    if len(interaction_list) == 1:
                        continue
                    df_interactions = pd.DataFrame.from_records(
                        interaction_list[1:], columns=interaction_list[0]
                    )
                    for _, interaction in df_interactions.iterrows():
                        name = interaction_type
                        viewer.shape.add_cylinder(
                            interaction["LIGCOO"],
                            interaction["PROTCOO"],
                            color,
                            [0.1],
                            name,
                        )
                        interacting_residues.append(interaction["RESNR"])
                # Display interacting residues
                res_sele = " or ".join([f"({r} and not _H)" for r in interacting_residues])
                res_sele_nc = " or ".join(
                    [f"({r} and ((_O) or (_N) or (_S)))" for r in interacting_residues]
                )

                prot_component.add_ball_and_stick(
                    sele=res_sele, colorScheme="chainindex", aspectRatio=1.5
                )
                prot_component.add_ball_and_stick(
                    sele=res_sele_nc, colorScheme="element", aspectRatio=1.5
                )

        # Register event handler
        selector.observe(_on_selection_change)
        # Trigger event manually to focus on the first solution
        _on_selection_change({"name": "value", "new": 1, "old": None})
        return app

#### Implementing The Pipeline's ***Protein*** Class
<br>
<div style="text-align: justify"> Now using the functionalities defined in the helper-classes, we can implement the <b><i>Protein</i></b> class of the pipeline, which takes in the protein input data and creates an object with extended attributes and methods.</div>
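
The class below relies on a compact pattern: an enumeration whose member values are the keys of the parsed information dictionary, and whose lower-cased member names become the instance attribute names. The sketch below isolates that pattern. The `Record` class and the sample dictionary are hypothetical stand-ins, not part of the pipeline.

```python
from enum import Enum


class Properties(Enum):
    # name  -> attribute name (after lower-casing)
    # value -> key expected in the parsed information dictionary
    STRUCTURE_TITLE = "Structure Title"
    CHAINS = "Chains"
    LIGANDS = "Ligands"


class Record:
    """Hypothetical minimal container, standing in for the Protein class."""

    def __init__(self, info):
        # Only properties that were actually extracted become attributes
        for prop in Properties:
            if prop.value in info:
                setattr(self, prop.name.lower(), info[prop.value])


record = Record({"Structure Title": "HAND-WRITTEN EXAMPLE", "Chains": ["A", "B"]})
print(record.structure_title)
print(record.chains)
print(hasattr(record, "ligands"))
```

One consequence of this design is that an attribute may be absent when the corresponding record is missing from the file, so code that reads such attributes may want to check with `hasattr` first.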

In [None]:
class Protein:
    """
    Protein object with properties as attributes and methods to visualize and work with the protein.
    Take a protein identifier type and corresponding value,
    and create a Protein object, while assigning some properties as attributes.

    Parameters
    ----------
    identifier_type : enum 'InputTypes' from the 'Consts.Protein' class
        Type of the protein identifier, e.g. InputTypes.PDB_CODE.
    identifier_value : str
        Value of the protein identifier, e.g. its PDB-code.
    protein_output_path : str or pathlib.Path object
        Output path of the project for protein data.
    """

    class Consts:
        # Available properties that are assigned as instance attributes upon instantiation.
        class Properties(Enum):
            STRUCTURE_TITLE = "Structure Title"
            NAME = "Name"
            CHAINS = "Chains"
            LIGANDS = "Ligands"
            RESIDUE_NUMBER_FIRST = "First Residue Number"
            RESIDUE_NUMBER_LAST = "Last Residue Number"
            RESIDUES_LENGTH = "Number of Residues"

    def __init__(self, identifier_type, identifier_value, protein_output_path):

        setattr(self, identifier_type.name.lower(), identifier_value)

        self.file_content = PDB.read_pdb_file_content(identifier_type.value, identifier_value)

        dict_of_dataframes = PDB.load_pdb_file_as_dataframe(self.file_content)
        for key, value in dict_of_dataframes.items():
            setattr(self, f"dataframe_PDBcontent_{key.lower()}", value)

        self.residue_number_first = self.dataframe_PDBcontent_atom.iloc[0]["residue_number"]
        self.residue_number_last = self.dataframe_PDBcontent_atom.iloc[-1]["residue_number"]
        self.residues_length = self.residue_number_last - self.residue_number_first + 1

        protein_info = PDB.extract_info_from_pdb_file_content(self.file_content)

        for protein_property in self.Consts.Properties:
            if protein_property.value in protein_info:
                setattr(
                    self,
                    protein_property.name.lower(),
                    protein_info[protein_property.value],
                )

        if identifier_type is Consts.Protein.InputTypes.PDB_CODE:
            self.pdb_filepath = PDB.fetch_and_save_pdb_file(
                identifier_value, str(protein_output_path) + "/" + identifier_value
            )

    def __call__(self):
        for protein_property in self.Consts.Properties:
            if hasattr(self, protein_property.name.lower()):
                display(
                    Markdown(
                        f"<span style='color:black'>&nbsp;&nbsp;&nbsp;&nbsp;{protein_property.value}: </span><span style='color:black'>**{getattr(self, protein_property.name.lower())}**</span>"
                    )
                )
        if hasattr(self, "pdb_code"):
            viewer = NGLView.protein("pdb_code", self.pdb_code)
        else:
            viewer = NGLView.protein("pdb", self.pdb_filepath)

        return viewer

    def __repr__(self):
        # `name` is only set when the PDB file contains a COMPND MOLECULE record
        return f"<Protein: {getattr(self, 'name', 'unknown')}>"

##### Creating a ***Protein*** Object From The Protein Input Data: Instantiating The ***Protein*** Class
<br>
<div style="text-align: justify"> 
    We now create an instance of the <b><i>Protein</i></b> class, by inputting the protein data of our project, i.e. the protein input type, the corresponding input value, and the output path of the project for storing the protein data. We then assign this instance to our project:</div>

In [None]:
project1.Protein = Protein(
    identifier_type=project1.Specs.Protein.input_type,
    identifier_value=project1.Specs.Protein.input_value,
    protein_output_path=project1.Specs.OutputPaths.protein,
)

We have implemented a ***\_\_call\_\_*** method for the ***Protein*** class, which prints out some useful information and visualizes the protein's structure, simply by calling the object:

In [None]:
project1.Protein()

<div style="text-align: justify"> 
All of this information, along with some other properties, is also stored separately as instance attributes. For example, a list of information on all co-crystallized ligands:
<br><br>
    <b><i>Note:</i></b> Each sub-list corresponds to one ligand, where the first entry is the ligand-ID, the second entry is the chain-ID (to which the ligand is bound) followed by the ligand residue number, and the third and last entry is the number of heavy atoms in the ligand. For example, here the first ligand has the ID 'W32', is on chain 'A' at residue number '1101', and has 39 heavy atoms.</div>
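
To show how such a list might be used downstream, the sketch below picks the ligand with the most heavy atoms and splits its location entry into chain-ID and residue number. The helper `split_chain_and_residue` is hypothetical and assumes the chain-ID is a single character fused with the residue number, as in the example above. The second entry in the sample list is made up for illustration.

```python
def split_chain_and_residue(location):
    """Hypothetical helper: split e.g. 'A1101' into ('A', 1101)."""
    return location[0], int(location[1:])


# Hand-written entries in the format described above;
# only the first one mirrors the example in the text, the second is made up.
ligands = [["W32", "A1101", 39], ["XYZ", "B2001", 5]]

# Select the ligand with the largest number of heavy atoms (third entry)
largest = max(ligands, key=lambda entry: entry[2])
chain_id, residue_number = split_chain_and_residue(largest[1])
print(largest[0], chain_id, residue_number)
```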

In [None]:
project1.Protein.ligands

Moreover, when the protein is specified by its PDB-code, the PDB file is also automatically downloaded and stored in the defined output path for the protein output data. The full path is accessible via the attribute ***pdb_filepath***: 

In [None]:
project1.Protein.pdb_filepath

<a id='4_practical'></a>
### Processing the Input Ligand Data
<br>
<div style="text-align: justify"> 
    Similar to the protein, we can also use the ligand input data to create a <b><i>Ligand</i></b> object with some extended attributes and methods. For this, we first need to define one helper-class for obtaining additional information on the input ligand, and another helper-class for calculating the ligand's properties, creating 3D conformations, visualizations etc. </div>

#### Implementing The Required Functions: The ***PubChem*** Helper-Class
<br>
<div style="text-align: justify"> 
Here, we define a class called <b><i>PubChem</i></b>, which contains all the necessary functions to use the <i>PubChem</i> web-service APIs, in order to obtain new information on ligands, such as other identifiers (e.g. trivial name, IUPAC name, SMILES, CID, InChI and InChIKey), physicochemical properties, descriptions etc. <i>PubChem</i> can also perform similarity searches on a given ligand, which we will also implement here, and use later in the <b><i>LigandSimilaritySearch</i></b> class of the pipeline.<br><br><b><i>Note:</i></b> For demonstration of each function, see the <a href="#pubchem_demo">corresponding section</a> in <i>Supplementary Information</i>.  </div>
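
To make the request format concrete, the following sketch assembles a request URL from its parts (prolog, input type, input value, operation and output type) without sending it. The helper `build_request_url` is hypothetical and only illustrates the URL layout; the example CID 2244 is the one used in the docstring of the class.

```python
PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/"


def build_request_url(input_part, input_value, operation, output_type, params=""):
    """Hypothetical helper: join the parts of a request URL in the expected order."""
    url = PROLOG + input_part + str(input_value) + operation + output_type
    if params:  # Optional parameters are appended as a query string
        url += "?" + params
    return url


example_url = build_request_url("compound/cid/", 2244, "/property/CanonicalSMILES/", "TXT")
print(example_url)
```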

In [None]:
class PubChem:
    """
    Implementation of the functionalities of PubChem PUG REST API.
    """

    # -----------------------------------------------------------------------------
    # Constants for API requests
    class APIConsts:
        """
        Constants for API requests.
        Request URLs should have the format:
            APIConsts.URLs.PROLOG + APIConsts.URLs.Inputs.<type>.value + ...
            ... <input_value> + APIConsts.URLs.Operations.GET_<property>.value + ...
            ... APIConsts.URLs.Outputs.<type>.value + <?optional parameters>
        """

        class URLs:
            PROLOG = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/"

            class Inputs(Enum):
                CID = "compound/cid/"
                NAME = "compound/name/"
                SMILES = "compound/smiles/"
                INCHI = "compound/inchi/"
                INCHIKEY = "compound/inchikey/"
                SIMILARITY_FROM_SMILES = "compound/similarity/smiles/"
                SIMILARITY_RESULTS = "compound/listkey/"

            class Operations(Enum):
                GET_CID = "/cids/"
                GET_NAME = "/property/title/"
                GET_SMILES = "/property/CanonicalSMILES/"
                GET_INCHI = "/property/InChI/"
                GET_INCHIKEY = "/property/InChIKey/"
                GET_IUPAC_NAME = "/property/IUPACName/"
                GET_DESCRIPTION = "/description/"
                GET_RECORD = "/record/"

            class Outputs(Enum):
                TXT = "TXT"
                JSON = "JSON"
                PNG = "PNG"
                CSV = "CSV"
                SDF = "SDF"
                XML = "XML"

        class ResponseMsgs:
            class SimilaritySearch(Enum):
                JOBKEY_KEY1 = "Waiting"
                JOBKEY_KEY2 = "ListKey"
                RESULT_KEY1 = "PropertyTable"
                RESULT_KEY2 = "Properties"

            class GetRecords(Enum):
                RESPONSE_KEY = "PC_Compounds"

            class GetDescription(Enum):
                RESPONSE_KEY1 = "InformationList"
                RESPONSE_KEY2 = "Information"

    # -----------------------------------------------------------------------------

    @staticmethod
    def send_request(partial_url, response_type="txt", optional_params=""):
        """
        Send an API request to PubChem and get the response data.

        Parameters
        ----------
        partial_url : str
            The URL part of the request consisting of input-type, input-value and operation.
            E.g. 'compound/cid/2244/property/CanonicalSMILES/' requests the SMILES of
            the compound with a CID of 2244.
        response_type : str (optional; default: txt)
            Expected response-type of the API request.
            Valid values are 'txt', 'json', 'png', 'csv', 'sdf' and 'xml'.
            Valid values are stored in: PubChem.APIConsts.URLs.Outputs
        optional_params : str
            The URL part of the request consisting of optional parameters.

        Returns
        -------
            Datatype depends on the value of input parameter 'response_type'.
            The response data of the API request.
        """
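        full_url = (
            PubChem.APIConsts.URLs.PROLOG
            + partial_url
            + getattr(PubChem.APIConsts.URLs.Outputs, response_type.upper()).value
        )
        if optional_params:  # Append a query string only when parameters are given
            full_url += f"?{optional_params}"
        response = requests.get(full_url)
        response.raise_for_status()
        # Compare in lower case, since the lookup of the output type above
        # is case-insensitive as well
        if response_type.lower() == "txt":
            response_data = response.text
        elif response_type.lower() == "json":
            response_data = response.json()
        else:
            response_data = response.content
        return response_data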

    @staticmethod
    def convert_compound_identifier(
        input_id_type, input_id_value, output_id_type, output_data_type="txt"
    ):
        """
        Convert an identifier to another identifier, e.g. CID to SMILES, SMILES to IUPAC-name etc.

        Parameters
        ----------
        input_id_type : str
            Type of the input identifier.
            Valid values are: 'name', 'cid', 'smiles', 'inchi' and 'inchikey'.
            Valid values are stored in: PubChem.APIConsts.URLs.Inputs
        input_id_value : str, integer, list of strings or list of integers
            Value of the input identifier.
        output_id_type : str
            Type of the output identifier.
            Valid values are: 'name', 'cid', 'smiles', 'inchi', 'inchikey', 'iupac_name'.
            Valid values are stored in: PubChem.APIConsts.URLs.Operations
        output_data_type : str (optional; default: 'txt')
            Datatype of the output data.
            Valid values are 'txt', 'json', 'csv'.
            A list of all valid values are stored in: PubChem.APIConsts.URLs.Outputs

        Returns
        -------
            Datatype depends on the value of input parameter 'output_data_type'
            The response data of the API request.
        """
        # Remember whether several identifiers were given,
        # before they are joined into a single comma-separated string
        is_multiple = isinstance(input_id_value, list)
        if is_multiple:
            input_id_value = ",".join(map(str, input_id_value))
        url = (
            getattr(PubChem.APIConsts.URLs.Inputs, input_id_type.upper()).value
            + str(input_id_value)
            + getattr(PubChem.APIConsts.URLs.Operations, f"GET_{output_id_type}".upper()).value
        )
        response_data = PubChem.send_request(url, output_data_type)
        if is_multiple:
            return response_data.strip().split("\n")
        else:
            return response_data.strip()

    @staticmethod
    def get_compound_record(input_id_type, input_id_value, output_data_type="json"):
        """
        Get a full record of all physicochemical properties of the compound.

        Parameters
        ----------
        input_id_type : str
            Type of the input identifier.
            Valid values are: 'name', 'cid', 'smiles', 'inchi' and 'inchikey'.
            Valid values are stored in: PubChem.APIConsts.URLs.Inputs
        input_id_value : str or integer
            Value of the input identifier.
        output_data_type : str (optional; default: 'json')
            Datatype of the output data.
            Only 'json' is supported by this function, since the record
            is extracted from the parsed JSON response.

        Returns
        -------
            dict
            Dictionary keys are: 'id', 'atoms', 'bonds', 'coords', 'charge', 'props', 'count'
        """
        url = (
            getattr(PubChem.APIConsts.URLs.Inputs, input_id_type.upper()).value
            + str(input_id_value)
            + getattr(PubChem.APIConsts.URLs.Operations, "GET_RECORD").value
        )
        response_data = PubChem.send_request(url, output_data_type)[
            PubChem.APIConsts.ResponseMsgs.GetRecords.RESPONSE_KEY.value
        ][0]
        return response_data

    @staticmethod
    def get_description_from_smiles(smiles, output_data_type="json", printout=False):
        """
        Get a textual description of a molecule, including its usage, properties, source etc.

        Parameters
        ----------
        smiles : str
            SMILES of the molecule.
        output_data_type : str (optional; default: 'json')
            Datatype of the output data.
            Only 'json' is supported by this function, since the descriptions
            are extracted from the parsed JSON response.
        printout : bool
            Whether to print the descriptions, or return the data.

        Returns
        -------
            list of dicts
            When the parameter printout is set to False, the raw data is
            returned as a list of dicts, where each element of the list
            corresponds to a description from a specific source (e.g. a journal article).
        """
        url = (
            PubChem.APIConsts.URLs.Inputs.SMILES.value
            + smiles
            + PubChem.APIConsts.URLs.Operations.GET_DESCRIPTION.value
        )
        response_data = PubChem.send_request(url, output_data_type)[
            PubChem.APIConsts.ResponseMsgs.GetDescription.RESPONSE_KEY1.value
        ][PubChem.APIConsts.ResponseMsgs.GetDescription.RESPONSE_KEY2.value]

        if printout:
            for entry in response_data:
                try:
                    print(entry["Description"] + "\n")
                except KeyError:
                    # Some entries carry no description text; skip them
                    pass
        else:
            return response_data

    @staticmethod
    def similarity_search(
        smiles,
        min_similarity=80,
        max_num_results=100,
        output_data_type="json",
        max_num_attempts=30,
    ):
        """
        Run a similarity search on a molecule and get all the similar ligands.

        Parameters
        ----------
        smiles : str
            The canonical SMILES string for the given compound.
        min_similarity : int (optional; default: 80)
            The threshold of similarity in percent.
        max_num_results : int (optional; default: 100)
            The maximum number of feedback records.
        output_data_type : str (optional; default: 'json')
            Datatype of the output data.
            Valid values are 'txt', 'json', 'csv'.
            A list of all valid values are stored in: PubChem.APIConsts.URLs.Outputs
        max_num_attempts : int (optional; default: 30)
            Maximum number of attempts to fetch the API response, after the job has been submitted.
            Each failed attempt is followed by a 10-second pause.

        Returns
        -------
            Datatype depends on the 'output_data_type' parameter
            E.g. when set to "json", returns a list of dicts
            Each dictionary in the list corresponds to a similar compound,
            which has a 'CID' and a 'CanonicalSMILES' key.
        """
        escaped_smiles = quote(smiles).replace("/", ".")
        url = PubChem.APIConsts.URLs.Inputs.SIMILARITY_FROM_SMILES.value + escaped_smiles + "/"
        response_data = PubChem.send_request(
            url,
            output_data_type,
            f"Threshold={min_similarity}&MaxRecords={max_num_results}",
        )
        job_key = response_data[PubChem.APIConsts.ResponseMsgs.SimilaritySearch.JOBKEY_KEY1.value][
            PubChem.APIConsts.ResponseMsgs.SimilaritySearch.JOBKEY_KEY2.value
        ]

        url = (
            PubChem.APIConsts.URLs.Inputs.SIMILARITY_RESULTS.value
            + job_key
            + PubChem.APIConsts.URLs.Operations.GET_SMILES.value
        )

        num_attempts = 0
        while num_attempts < max_num_attempts:
            response_data = PubChem.send_request(url, output_data_type)
            if PubChem.APIConsts.ResponseMsgs.SimilaritySearch.RESULT_KEY1.value in response_data:
                similar_compounds = response_data[
                    PubChem.APIConsts.ResponseMsgs.SimilaritySearch.RESULT_KEY1.value
                ][PubChem.APIConsts.ResponseMsgs.SimilaritySearch.RESULT_KEY2.value]
                break
            time.sleep(10)
            num_attempts += 1
        else:
            raise ValueError(f"Could not find matches in the response URL: {url}")
        return similar_compounds
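
The submit-then-poll pattern used in `similarity_search` can also be looked at in isolation. The sketch below replaces the web request with a hypothetical `fetch` callable, so that it runs offline; `poll_until_ready` and `fake_fetch` are illustrative names, the response contents are made up, and the pause is set to zero only to keep the example fast.

```python
import time


def poll_until_ready(fetch, result_key, max_num_attempts=30, pause_seconds=10):
    """Call `fetch` repeatedly until its response contains `result_key`."""
    for _ in range(max_num_attempts):
        response_data = fetch()
        if result_key in response_data:
            return response_data[result_key]
        time.sleep(pause_seconds)  # Job still running; wait before asking again
    raise ValueError(f"No '{result_key}' in the response after {max_num_attempts} attempts.")


# Hypothetical stand-in for the server: reports "Waiting" twice, then returns results.
responses = iter(
    [
        {"Waiting": {"ListKey": "123"}},
        {"Waiting": {"ListKey": "123"}},
        {"PropertyTable": {"Properties": [{"CID": 2244}]}},
    ]
)


def fake_fetch():
    return next(responses)


table = poll_until_ready(fake_fetch, "PropertyTable", pause_seconds=0)
print(table["Properties"])
```

Raising an error after the last attempt, rather than returning an empty result, keeps a timed-out job distinguishable from a search without matches.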

#### Implementing The Required Functions: The ***RDKit*** Helper-Class
<br>
<div style="text-align: justify"> 
We will also define another helper-class for working with ligands, called <b><i>RDKit</i></b>. This class uses the <i>RDKit</i> Python library to implement some useful functionalities for ligands, such as calculating properties and descriptors, e.g. molecular weight, partition coefficients, number of hydrogen-bond acceptors/donors, number of rotatable bonds etc. These properties will then also be used to calculate several drug-likeness scores, for example Lipinski's rule of five and the quantitative estimate of drug-likeness (QED). Other useful functionalities that will be implemented here include visualization of molecular structures, and saving molecules to file, either as an image or an SDF file. We will also define a function to calculate the similarity between two molecules, based on the Dice similarity metric using circular (Morgan) fingerprints. This can be used later to assess the similarity of each of the ligand's analogs found by the similarity search performed using <i>PubChem</i>, since <i>PubChem</i> does not report the exact similarity value of each search result, but only accepts a threshold value.
    <br><br>
    <b><i>Note:</i></b> For demonstration of each function, see the <a href="#rdkit_demo">corresponding section</a> in <i>Supplementary Information</i>.</div>
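
For reference, the Dice similarity of two fingerprints with on-bit sets A and B is `2 * |A ∩ B| / (|A| + |B|)`. The sketch below computes it on plain Python sets standing in for the on-bits of two fingerprints. The bit indices are made up, and the function `dice_similarity` is only an illustration of the formula; the helper-class itself relies on the implementation in <i>RDKit</i>.

```python
def dice_similarity(on_bits_a, on_bits_b):
    """Dice coefficient of two sets: 2 * |A & B| / (|A| + |B|)."""
    total = len(on_bits_a) + len(on_bits_b)
    if total == 0:
        return 0.0  # Convention chosen here for two empty sets
    return 2 * len(on_bits_a & on_bits_b) / total


fp_a = {1, 5, 9, 12}
fp_b = {1, 5, 7}
similarity = dice_similarity(fp_a, fp_b)  # 2 * 2 / (4 + 3)
print(round(similarity, 2))
```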

In [None]:
class RDKit:
    @staticmethod
    def create_molecule_object(input_type, input_value):
        """
        Create an RDKit molecule object from various sources.

        Parameters
        ----------
        input_type : str
            Type of the input.
            Allowed input-types are: 'smiles', 'inchi', 'smarts', 'pdb_file'
        input_value : str
            Value of the corresponding input type.

        Returns
        -------
            RDKit molecule object.
        """

        functions = {
            "smiles": Chem.MolFromSmiles,
            "inchi": Chem.MolFromInchi,
            "smarts": Chem.MolFromSmarts,
            "pdb_file": Chem.MolFromPDBFile,
        }
        mol_obj = functions[input_type](input_value)
        return mol_obj

    @staticmethod
    def draw_molecules(
        list_mol_objs,
        list_legends=None,
        mols_per_row=3,
        sub_img_size=(550, 550),
        filepath=None,
    ):
        """
        Take a list of RDKit molecule objects and draw them as a grid image.

        Parameters
        ----------
        list_mol_objs: list
            List of RDKit molecule objects to be drawn.
        list_legends: list (optional)
            List of legends for the molecules.
            If not provided, the list indices (+1) will be used as legend.
        mols_per_row : int (optional; default: 3)
            Number of structures to show per row.
        sub_img_size : tuple (int, int)
            Size of each structure.
        filepath : str or pathlib.Path object
            Full filepath to save the image in.

        Returns
        -------
            RDKit MolsToGridImage object.
        """
        if list_legends is None:
            # MolsToGridImage expects string legends, so convert the 1-based indices
            list_legends = [str(index) for index in range(1, len(list_mol_objs) + 1)]
        figure = Draw.MolsToGridImage(
            list_mol_objs,
            molsPerRow=mols_per_row,
            subImgSize=sub_img_size,
            legends=list_legends,
        )
        if filepath is not None:
            with open(str(filepath) + ".png", "wb") as f:
                f.write(figure.data)
        return figure

    @staticmethod
    def save_molecule_image_to_file(mol_obj, filepath):
        """
        Save the image of a single molecule as a PNG file.

        Parameters
        ----------
        mol_obj : RDKit molecule object
            The molecule to be saved as image.
        filepath : str or pathlib.Path object
            Full filepath to save the image in.

        Returns
        -------
            None
        """
        Draw.MolToFile(mol_obj, str(filepath) + ".png")

    @staticmethod
    def save_3D_molecule_to_SDfile(mol_obj, filepath):
        """
        Generate a 3D conformer and save as SDF file.

        Parameters
        ----------
        mol_obj : RDKit molecule object
            The molecule to be saved as SDF file.
        filepath : str or pathlib.Path object
            Full filepath to save the SDF file in.

        Returns
        -------
            None
        """
        mol = Chem.AddHs(mol_obj)
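        embedding = AllChem.EmbedMolecule(mol, maxAttempts=1000, clearConfs=True)
        # EmbedMolecule returns the ID of the new conformer (0 here), or -1 on failure
        if embedding != 0:
            raise ValueError("Embedding failed.")
        uffoptim = AllChem.UFFOptimizeMolecule(mol, maxIters=1000)
        # UFFOptimizeMolecule returns 0 only when the optimization has converged.
        # Checking both values separately avoids -1 and 1 cancelling out in a sum.
        if uffoptim != 0:
            raise ValueError("Optimization failed to converge.")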
        session = Chem.SDWriter(str(filepath) + ".sdf")
        session.write(mol)
        session.close()

    @staticmethod
    def calculate_similarity_dice(mol_obj1, mol_obj2):
        """
        Calculate the Dice similarity between two molecules,
        based on 4096-bit Morgan fingerprints with a radius of 2.

        Parameters
        ----------
        mol_obj1 : RDKit molecule object
            The first molecule.
        mol_obj2 : RDKit molecule object
            The second molecule.

        Returns
        -------
            float
            Dice similarity between the two molecules
        """
        morgan_fp_mol1 = AllChem.GetMorganFingerprintAsBitVect(mol_obj1, radius=2, nBits=4096)
        morgan_fp_mol2 = AllChem.GetMorganFingerprintAsBitVect(mol_obj2, radius=2, nBits=4096)
        dice_similarity = round(
            AllChem.DataStructs.DiceSimilarity(morgan_fp_mol1, morgan_fp_mol2), 2
        )
        return dice_similarity

    @staticmethod
    def calculate_druglikeness(mol_obj):
        """
        Calculate several molecular properties and drug-likeness scores,
        from an RDKit molecule object.

        Parameters
        ----------
        mol_obj : RDKit molecule object
            Molecule object of interest.

        Returns
        -------
            dict
            The calculated values are returned in a dictionary with the following keys:
            mol_weight, num_H_acceptors, num_H_donors, logp, tpsa, num_rot_bonds, saturation,
            drug_score_qed, drug_score_lipinski, drug_score_custom, drug_score_total
        """
        properties = {
            "mol_weight": round(Descriptors.MolWt(mol_obj), 3),
            "num_H_acceptors": Descriptors.NumHAcceptors(mol_obj),
            "num_H_donors": Descriptors.NumHDonors(mol_obj),
            "logp": round(Descriptors.MolLogP(mol_obj), 2),
            "tpsa": round(Descriptors.TPSA(mol_obj), 2),
            "num_rot_bonds": Descriptors.NumRotatableBonds(mol_obj),
            "saturation": round(Descriptors.FractionCSP3(mol_obj), 2),
            "drug_score_qed": round(Descriptors.qed(mol_obj), 2),
        }

        # Calculating Lipinski score
        l1 = int(properties["mol_weight"] < 500)
        l2 = int(properties["num_H_acceptors"] <= 10)
        l3 = int(properties["num_H_donors"] <= 5)
        l4 = int(properties["logp"] < 5)
        properties["drug_score_lipinski"] = round((l1 + l2 + l3 + l4) / 4, 2)

        # Calculating druglikeness score with custom scoring functions
        # derived from Hopkins paper
        def molWt_score(molWt):
            if molWt <= 440:
                return np.exp(-((molWt - 300) ** 2) / 15000)
            else:
                return np.exp(-(molWt - 180) / 190) + 0.01

        def molLogP_score(molLogP):
            return np.exp(-((molLogP - 2.5) ** 2) / 9)

        def numHDonors_score(numHDonors):
            if numHDonors == 0:
                return 0.6
            elif numHDonors < 5:
                return np.exp(-((numHDonors - 1) ** 2) / 5)
            else:
                return np.exp(-((numHDonors - 1) ** 2) / 5) + (0.4 / numHDonors)

        def numHAcceptors_score(numHAcceptors):
            if numHAcceptors < 4:
                return np.exp(-((numHAcceptors - 3) ** 2) / 3)
            else:
                return np.exp(-0.3 * numHAcceptors / 0.8 + 1.4)

        def TPSA_score(TPSA):
            if TPSA < 50:
                return 0.015 * TPSA + 0.25
            else:
                return np.exp(-((TPSA - 50) ** 2) / 8000)

        def numRotBonds_score(numRotBonds):
            if numRotBonds < 10:
                return np.exp(-((numRotBonds - 4) ** 2) / 19)
            else:
                return np.exp(-((numRotBonds - 4) ** 2) / 19) + (1.5 / numRotBonds ** 1.5)

        def saturation_score(saturation):
            return np.exp(-((saturation - 0.625) ** 2) / 0.05)

        d1 = molWt_score(properties["mol_weight"])
        d2 = numHAcceptors_score(properties["num_H_acceptors"])
        d3 = numHDonors_score(properties["num_H_donors"])
        d4 = molLogP_score(properties["logp"])
        d5 = TPSA_score(properties["tpsa"])
        d6 = numRotBonds_score(properties["num_rot_bonds"])
        d7 = saturation_score(properties["saturation"])
        properties["drug_score_custom"] = round((d1 + d2 + d3 + d4 + d5 + d6 + d7) / 7, 2)

        properties["drug_score_total"] = round(
            (
                3 * properties["drug_score_qed"]
                + 2 * properties["drug_score_custom"]
                + properties["drug_score_lipinski"]
            )
            / 6,
            2,
        )
        return properties
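
To illustrate how the individual scores are combined, the following standalone sketch repeats the arithmetic of the Lipinski score and of the weighted total score on made-up property values. The function names `lipinski_score` and `total_drug_score` are hypothetical; the thresholds and weights are the ones used in `calculate_druglikeness` above.

```python
def lipinski_score(mol_weight, num_h_acceptors, num_h_donors, logp):
    """Fraction of the four criteria that are met (same thresholds as above)."""
    criteria = [
        mol_weight < 500,
        num_h_acceptors <= 10,
        num_h_donors <= 5,
        logp < 5,
    ]
    return round(sum(criteria) / 4, 2)


def total_drug_score(qed, custom, lipinski):
    """Weighted mean with weights 3 (QED), 2 (custom score) and 1 (Lipinski)."""
    return round((3 * qed + 2 * custom + lipinski) / 6, 2)


# Made-up property values for illustration only
example_lipinski = lipinski_score(mol_weight=350.4, num_h_acceptors=5, num_h_donors=2, logp=5.6)
example_total = total_drug_score(qed=0.6, custom=0.5, lipinski=example_lipinski)
print(example_lipinski, example_total)
```

In this example only the logP criterion is violated, so three of the four criteria are met. The QED score enters the total with half of the overall weight.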

#### Implementing The Pipeline's ***Ligand*** Class
<br>
<div style="text-align: justify">Now using the defined functionalities in the <b><i>PubChem</i></b> and <b><i>RDKit</i></b> helper-classes, we can implement the pipeline's <b><i>Ligand</i></b> class. Similar to the <b><i>Protein</i></b> class, this class also takes in the ligand's input data, and creates an object with extended attributes and methods to work with ligands.</div>

In [None]:
class Ligand:
    """
    Ligand object with properties as attributes and methods to visualize and work with ligands.
    Take a ligand identifier type and corresponding value,
    and create a Ligand object, while assigning some properties as attributes.

    Parameters
    ----------
    identifier_type : enum 'InputTypes' from the 'Consts.Ligand' class
        Type of the ligand identifier, e.g. InputTypes.SMILES.

    identifier_value : str
        Value of the ligand identifier, e.g. its SMILES.

    ligand_output_path : pathlib.Path object
        Output path of the project for storing the ligand data.
    """

    class Consts:
        # Available properties that are assigned as instance attributes upon instantiation.
        class IdentifierTypes(Enum):
            NAME = "name"
            IUPAC_NAME = "iupac_name"
            SMILES = "smiles"
            CID = "cid"
            INCHI = "inchi"
            INCHIKEY = "inchikey"

    def __init__(self, identifier_type, identifier_value, ligand_output_path):

        self.dataframe = pd.DataFrame(columns=["Value"])
        self.dataframe.index.name = "Property"

        setattr(self, identifier_type.name.lower(), identifier_value)
        for identifier in self.Consts.IdentifierTypes:
            try:
                new_id = PubChem.convert_compound_identifier(
                    identifier_type.value, identifier_value, identifier.value
                )
                setattr(self, identifier.value, new_id)
                self.dataframe.loc[identifier.value] = new_id
            except Exception:
                # Skip identifiers that could not be retrieved from PubChem
                pass

        self.rdkit_obj = RDKit.create_molecule_object("smiles", self.smiles)

        dict_of_properties = RDKit.calculate_druglikeness(self.rdkit_obj)
        for property_ in dict_of_properties:
            setattr(self, property_, dict_of_properties[property_])
            self.dataframe.loc[property_] = dict_of_properties[property_]

        self.save_as_image(ligand_output_path / ("CID_" + self.cid))
        self.dataframe.to_csv(ligand_output_path / ("CID_" + self.cid + ".csv"))

    def __repr__(self):
        return f"<Ligand CID: {self.cid}>"

    def __call__(self):
        df = pd.DataFrame(columns=["smiles"])
        df.loc[1] = self.smiles
        PandasTools.AddMoleculeColumnToFrame(df, smilesCol="smiles")
        romol = df.loc[1, "ROMol"]

        return pd.concat({romol: self.dataframe}, names=["Structure"])

    def remove_counterion(self):
        """
        Remove the counter-ion from the SMILES of a salt.

        Returns
        -------
            str
            SMILES of the molecule without its counter-ion.
        """
        if (
            "." in self.smiles
        ):  # SMILES of salts contain a dot, separating the anion and the cation
            ions = self.smiles.split(".")
            length_ions = list(map(len, ions))
            molecule_index = length_ions.index(
                max(length_ions)
            )  # The parent molecule is almost always larger than its corresponding counter-ion
            molecule_smiles = ions[molecule_index]
        else:
            molecule_smiles = self.smiles
        return molecule_smiles

    def dice_similarity(self, mol_obj):
        """
        Calculate Dice similarity between the ligand and another input ligand,
        based on 4096-bit Morgan fingerprints with a radius of 2.

        Parameters
        ----------
        mol_obj : RDKit molecule object
            The molecule to calculate the Dice similarity with.

        Returns
        -------
            float
            Dice similarity between the two ligands.
        """
        return RDKit.calculate_similarity_dice(self.rdkit_obj, mol_obj)

    def save_as_image(self, filepath):
        """
        Save the ligand as image.

        Parameters
        ----------
        filepath : str or pathlib.Path object
            Full filepath of the image to be saved.

        Returns
        -------
            None
        """
        RDKit.save_molecule_image_to_file(self.rdkit_obj, filepath)

    def save_3D_structure_as_SDF_file(self, filepath):
        """
        Generate a 3D conformer and save as SDF file.

        Parameters
        ----------
        filepath : str or pathlib.Path object
            Full filepath to save the SDF file in.

        Returns
        -------
            None
        """
        RDKit.save_3D_molecule_to_SDfile(self.rdkit_obj, filepath)

##### Creating a ***Ligand*** Object From The Ligand Input Data: Instantiating The ***Ligand*** Class
<br>
<div style="text-align: justify">
We can now create an instance of the <b><i>Ligand</i></b> class, using the input data of our project's ligand, and assign it to our project. </div>

In [None]:
project1.Ligand = Ligand(
    identifier_type=project1.Specs.Ligand.input_type,
    identifier_value=project1.Specs.Ligand.input_value,
    ligand_output_path=project1.Specs.OutputPaths.ligand,
)

Similar to the ***Protein*** object, we have implemented a ***\_\_call\_\_*** method for our ***Ligand***, which prints out some useful information and visualizes the ligand's structure:

In [None]:
project1.Ligand()

All of this information, along with some other properties, is also stored separately as instance attributes. For example, the ligand's identifiers:

In [None]:
project1.Ligand.iupac_name

In [None]:
project1.Ligand.cid

Or its physicochemical properties:

In [None]:
project1.Ligand.mol_weight

In [None]:
project1.Ligand.logp

<div style="text-align: justify">
We have also implemented some useful functions as methods for the <b><i>Ligand</i></b> class. For example, the <b><i>remove_counterion</i></b> method can be used to strip the counter-ion of a salt compound from its SMILES, leaving only the main molecule. This is necessary for the docking process, since <i>Smina</i> has some problems processing PDBQT files of salt compounds. Calling this method will simply return the modified SMILES (in the case of our ligand, which is not charged and does not have a counter-ion, the original SMILES is returned): </div>

In [None]:
project1.Ligand.remove_counterion()
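
The effect on an actual salt can be seen in the following standalone sketch, which applies the same length-based heuristic to the SMILES of sodium acetate. The function `strip_counterion` is a hypothetical re-statement of the method's logic and not part of the ***Ligand*** class.

```python
def strip_counterion(smiles):
    """Keep the longest dot-separated fragment of a SMILES string."""
    # Fragments of a salt are separated by dots; the string length serves as a
    # rough proxy for the fragment size, as in the method above.
    return max(smiles.split("."), key=len)


# Sodium acetate: the acetate anion is kept, the sodium cation is dropped
acetate = strip_counterion("CC(=O)[O-].[Na+]")
# A SMILES without a dot is returned unchanged (ethanol)
ethanol = strip_counterion("CCO")
print(acetate, ethanol)
```

Note that the string length is only a heuristic: it can fail when the counter-ion has a longer SMILES than the molecule of interest.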

<a id='5_practical'></a>
### Binding Site Detection
<br>
<div style="text-align: justify">
Now that the processing of all input data is completed, we can commence with implementing the pipeline's processes. We start with the binding-site detection process, where a suitable binding site should be identified for the target protein. In the input CSV file, the user has the option to select between three definition methods for the binding site:
</div>

1. ***coordinates***: The user must specify the coordinates of the binding site, in which case there is no need for binding-site detection. 
2. ***ligand***: The user should specify the ID of a co-crystallized ligand in the protein structure, which will then be used here to define the binding site. 
3. ***detection***: The user should specify a ***detection method***. 
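
The three options above amount to a simple dispatch on the definition method. The following is a minimal sketch of that idea, under the assumption that the method is given as one of the three strings listed above; the names `DefinitionMethod` and `next_step` are hypothetical and do not belong to the pipeline.

```python
from enum import Enum


class DefinitionMethod(Enum):
    """Hypothetical names for the three definition methods listed above."""

    COORDINATES = "coordinates"
    LIGAND = "ligand"
    DETECTION = "detection"


def next_step(definition_method):
    """Describe what has to happen for a given definition method."""
    method = DefinitionMethod(definition_method)  # Raises ValueError for unknown methods
    if method is DefinitionMethod.COORDINATES:
        return "use the given coordinates directly"
    if method is DefinitionMethod.LIGAND:
        return "derive the binding site from the co-crystallized ligand"
    return "run the selected detection method"


for name in ("coordinates", "ligand", "detection"):
    print(f"{name}: {next_step(name)}")
```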

<div style="text-align: justify">
For this talktorial we are only going to implement one detection method, using the <b><i>DoGSiteScorer</i></b> functionality of the <a href="https://proteins.plus"><b><i>ProteinsPlus</i></b></a> webserver. Therefore, we first need to create a helper-class for communicating with the webserver's API and submitting detection jobs. Afterwards, we will implement the pipeline's <b><i>BindingSiteDetection</i></b> class, which processes the input specification data on the binding site, and acts accordingly.
</div>


#### Implementing The Required Functions: The ***DoGSiteScorer*** Helper-Class
<br>
<div style="text-align: justify"><i>DoGSiteScorer</i> provides a web-service <a href="https://proteins.plus/help/dogsite_rest">API</a>, which can be used to submit binding-site detection jobs, either by providing the PDB-code of the protein structure, or by uploading its PDB file. It then returns a table of all detected pockets and sub-pockets, and their corresponding descriptors. Moreover, for each detected (sub-)pocket, a PDB file and a CCP4 file are generated, which can be downloaded and used to define the coordinates of the (sub-)pocket (needed for the docking experiment), as well as for visualization purposes. We will also define the function <b><i>select_best_pocket</i></b>, which provides several methods for selecting the most suitable binding-site.
    <br><br> <i><b>Note:</b></i> For demonstration of each function, see the <a href="#dogsitescorer_demo">corresponding section</a> in <i>Supplementary Information</i>.</div>
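
One simple way of selecting a pocket is to take the one with the highest value of a single descriptor. The sketch below shows this on made-up data and is not the implementation of <b><i>select_best_pocket</i></b>: the function `pick_pocket` is hypothetical, and the keys `"name"`, `"volume"` and `"drugScore"` are assumptions that may differ from the columns of the actual result table.

```python
# Made-up pocket descriptors; the keys "name", "volume" and "drugScore" are
# assumptions for illustration and may differ from the actual result table.
example_pockets = [
    {"name": "P_0", "volume": 1200.5, "drugScore": 0.81},
    {"name": "P_1", "volume": 640.2, "drugScore": 0.86},
    {"name": "P_2", "volume": 310.9, "drugScore": 0.42},
]


def pick_pocket(pockets, sort_by="drugScore"):
    """Return the name of the pocket with the highest value of `sort_by`."""
    best = max(pockets, key=lambda pocket: pocket[sort_by])
    return best["name"]


best_by_score = pick_pocket(example_pockets)
best_by_volume = pick_pocket(example_pockets, sort_by="volume")
print(best_by_score, best_by_volume)
```

As the example shows, different descriptors can favor different pockets, which is why it is useful to offer more than one selection method.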

In [None]:
class DoGSiteScorer:
    """
    Class containing all the required functions and constants
    to communicate with the REST API of DoGSiteScorer.
    """

    class APIConsts:
        # See https://proteins.plus/help/
        # and https://proteins.plus/help/dogsite_rest
        # for API specifications.

        class FileUpload:
            URL = "https://proteins.plus/api/pdb_files_rest"
            REQUEST_MSG = "pdb_file[pathvar]"
            RESPONSE_MSG = {
                "status": "status_code",
                "status_codes": {"accepted": "accepted", "denied": "bad_request"},
                "message": "message",
                "url_of_id": "location",
            }
            RESPONSE_MSG_FETCH_ID = {"message": "message", "id": "id"}

        class SubmitJob:
            URL = "https://proteins.plus/api/dogsite_rest"
            QUERY_HEADERS = {
                "Content-type": "application/json",
                "Accept": "application/json",
            }

            RESPONSE_MSG = {"url_of_job": "location"}

            RESPONSE_MSG_FETCH_BINDING_SITES = {
                "result_table": "result_table",
                "pockets_pdb_files": "residues",
                "pockets_ccp4_files": "pockets",
            }

    @staticmethod
    def upload_pdb_file(filepath):
        """
        Upload a PDB file to DoGSiteScorer webserver using their API
        and get back a dummy PDB-code, which can be used to submit a detection job.

        Parameters
        ----------
        filepath : str
            Relative or absolute path of the PDB file.

        Returns
        -------
            str
            Dummy PDB-code of the uploaded structure,
            which can then be used instead of a real PDB-code.
        """
        url = DoGSiteScorer.APIConsts.FileUpload.URL  # Read API URL from Constants
        request_msg = (
            DoGSiteScorer.APIConsts.FileUpload.REQUEST_MSG
        )  # Read API request message from Constants
        with open(filepath, "rb") as f:  # Open the local PDB file for reading in binary mode
            response = requests.post(
                url, files={request_msg: f}
            )  # Post API query and get the response
        # Raise an HTTPError if the upload request was rejected
        response.raise_for_status()
        response_values = response.json()  # Turn the response values into a dict
        # If the request is accepted, the response will contain a URL,
        # from which the needed ID of the uploaded protein can be obtained.
        url_of_id_key = DoGSiteScorer.APIConsts.FileUpload.RESPONSE_MSG["url_of_id"]
        if url_of_id_key not in response_values:
            raise ValueError(
                "Uploading PDB file failed.\n"
                + f"The response values are as follows: {response_values}"
            )
        # Here, we store this URL from the response values in the url_of_id variable.
        url_of_id = response_values[url_of_id_key]
        # After getting the URL, it may take some time for the server to process the uploaded file
        # and return an ID. Thus, we query the URL up to 30 times in intervals of 5 seconds,
        # until we get the ID.
        id_key = DoGSiteScorer.APIConsts.FileUpload.RESPONSE_MSG_FETCH_ID["id"]
        for try_nr in range(30):
            id_response = requests.get(url_of_id)  # Query the URL containing the ID
            id_response_values = id_response.json()  # Turn the response values into a dict
            # The response should contain the ID keyword:
            if id_response.ok and id_key in id_response_values:
                protein_id = id_response_values[id_key]
                break
            time.sleep(5)
        else:
            # The loop finished without a break, i.e. no ID was returned
            raise ValueError(
                "Fetching the ID of uploaded protein failed.\n"
                + f"The response values are as follows: {id_response_values}"
            )
        return protein_id

    @staticmethod
    def submit_job(pdb_id, ligand_id="", chain_id="", num_attempts=30):
        """
        Submit a protein structure to DoGSiteScorer webserver using their API
        and get back all the information on the detected binding-sites.

        Parameters
        ----------
        pdb_id : str
            Either a valid 4-letter PDB-code (e.g. '3w32'),
            or a dummy PDB-code of an uploaded PDB file.
        ligand_id : str (optional; default: none)
            DoGSiteScorer name of the co-crystallized ligand of interest, e.g. 'W32_A_1101'.
            DoGSiteScorer naming convention is: <PDB ligand-ID>_<chain-ID>_<PDB residue number of the ligand>
        chain_id : str (optional; default: none)
            Chain-ID to limit the binding-site detection to.
        num_attempts : int (optional; default: 30)
            Number of times to attempt to fetch the results after the job has been submitted.
            After each failed attempt there is a 10-second pause.

        Returns
        -------
            Pandas DataFrame
            Dataframe containing all the information on all detected binding-sites.
        """
        response = requests.post(
            DoGSiteScorer.APIConsts.SubmitJob.URL,
            json={
                "dogsite": {
                    "pdbCode": pdb_id,  # PDB code of protein
                    "analysisDetail": "1",  # 1 = include subpockets in results
                    "bindingSitePredictionGranularity": "1",  # 1 = include druggability scores
                    "ligand": ligand_id,  # if name is specified, ligand coverage is calculated
                    "chain": chain_id,  # if chain is specified, calculation is only performed on this chain
                }
            },
            headers=DoGSiteScorer.APIConsts.SubmitJob.QUERY_HEADERS,
        )

        response.raise_for_status()
        response_values = response.json()
        url_of_job = response_values[DoGSiteScorer.APIConsts.SubmitJob.RESPONSE_MSG["url_of_job"]]

        attempt_count = 0
        while attempt_count < num_attempts:
            job_response = requests.get(url_of_job)
            job_response.raise_for_status()
            job_response_values = job_response.json()

            if (
                DoGSiteScorer.APIConsts.SubmitJob.RESPONSE_MSG_FETCH_BINDING_SITES["result_table"]
                in job_response_values
            ):
                binding_site_data_url = job_response_values[
                    DoGSiteScorer.APIConsts.SubmitJob.RESPONSE_MSG_FETCH_BINDING_SITES[
                        "result_table"
                    ]
                ]
                binding_sites_pdb_files_urls = job_response_values[
                    DoGSiteScorer.APIConsts.SubmitJob.RESPONSE_MSG_FETCH_BINDING_SITES[
                        "pockets_pdb_files"
                    ]
                ]
                binding_sites_ccp4_files_urls = job_response_values[
                    DoGSiteScorer.APIConsts.SubmitJob.RESPONSE_MSG_FETCH_BINDING_SITES[
                        "pockets_ccp4_files"
                    ]
                ]
                break
            attempt_count += 1
            time.sleep(10)
        else:
            raise ValueError(
                "Fetching the binding-site data failed.\n"
                + f"The response values are as follows: {job_response_values}"
            )

        binding_site_data_table = requests.get(binding_site_data_url).text
        binding_site_data_file = io.StringIO(binding_site_data_table)
        binding_site_df = pd.read_csv(binding_site_data_file, sep="\t").set_index("name")
        binding_site_df["pdb_file_url"] = binding_sites_pdb_files_urls
        binding_site_df["ccp4_file_url"] = binding_sites_ccp4_files_urls
        return binding_site_df

    @staticmethod
    def save_binding_sites_to_file(binding_site_df, output_path):
        """
        Download and save the PDB and CCP4 files corresponding to the calculated binding-sites.

        Parameters
        ----------
        binding_site_df : Pandas DataFrame
            Binding-site data retrieved from the DoGSiteScorer webserver.
        output_path : str or pathlib.Path object
            Local file path to save the files in.

        Returns
        -------
            None
        """
        for binding_site in binding_site_df.index:
            for column in ["pdb_file_url", "ccp4_file_url"]:
                response = requests.get(binding_site_df.loc[binding_site, column])
                response.raise_for_status()
                if column == "pdb_file_url":
                    response_file_content = response.content
                    file_extension = ".pdb"
                else:
                    response_file_content = gzip.decompress(response.content)
                    file_extension = ".ccp4"

                file_name = binding_site + file_extension
                # Path() makes sure that a plain string is also accepted as output_path
                with open(Path(output_path) / file_name, "wb") as f:
                    f.write(response_file_content)
        return

    @staticmethod
    def select_best_pocket(binding_site_df, selection_method, selection_criteria, ascending=False):
        """
        Select the best binding-site from the table of all detected binding-sites,
        either by sorting the binding-sites based on a set of properties in the table,
        or by applying a function on the property values.

        Parameters
        ----------
        binding_site_df : Pandas DataFrame
            Binding-site data retrieved from the DoGSiteScorer webserver.
        selection_method : str
            Selection method for selecting the best binding-site.
            Either 'sorting' or 'function'.
        selection_criteria : str or list
            When 'selection_method' is 'sorting':
                List of one or several property names.
            When 'selection_method' is 'function':
                Any valid python syntax that generates a list-like object
                with the same length as the number of detected binding-sites.
        ascending : bool (optional; default: False)
            If set to True, the binding-site with the lowest value will be selected,
            otherwise, the binding-site with the highest value is selected.

        Returns
        -------
            str
            Name of the selected binding-site.
        """
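        df = binding_site_df
        if selection_method == "sorting":
            sorted_df = df.sort_values(by=selection_criteria, ascending=ascending)
        elif selection_method == "function":
            # Note: the criteria string is evaluated as Python code and may refer to `df`;
            # only use trusted input here.
            df["function_score"] = eval(selection_criteria)
            sorted_df = df.sort_values(by="function_score", ascending=ascending)
        else:
            raise ValueError(
                f"Unknown selection method '{selection_method}'; use 'sorting' or 'function'."
            )

        selected_pocket_name = sorted_df.iloc[0].name
        return selected_pocket_name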

    @staticmethod
    def calculate_pocket_coordinates_from_pocket_pdb_file(filepath):
        """
        Calculate the coordinates of a binding-site using the binding-site's PDB file
        downloaded from DoGSiteScorer.

        Parameters
        ----------
        filepath : str or pathlib.Path object
            Local filepath (including filename, without extension) of the binding-site's PDB file.

        Returns
        -------
            dict of lists of floats
            Binding-site coordinates in format:
            {'center': [x, y, z], 'size': [x, y, z]}
        """
        with open(str(filepath) + ".pdb") as f:
            pdb_file_text_content = f.read()
        pdb_file_df = PDB.load_pdb_file_as_dataframe(pdb_file_text_content)
        pocket_coordinates_data = pdb_file_df["OTHERS"].loc[5, "entry"]
        coordinates_data_as_list = pocket_coordinates_data.split(" ")
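        coordinates = []
        for elem in coordinates_data_as_list:
            # keep only the tokens that can be parsed as numbers
            try:
                coordinates.append(float(elem))
            except ValueError:
                pass
        pocket_coordinates = {
            "center": coordinates[:3],
            # the last number is treated as the pocket radius; the box edge is its diameter
            "size": [coordinates[-1] * 2 for _ in range(3)],
        }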
        return pocket_coordinates

    @staticmethod
    def get_pocket_residues(pocket_residues_url):
        """
        Gets residue IDs and names of a specified pocket (via URL).

        Parameters
        ----------
        pocket_residues_url : str
            URL of selected pocket file on the DoGSiteScorer web server.

        Returns
        -------
            pandas.DataFrame
            Table of residues names and IDs for the selected binding site.
        """
        # Retrieve PDB file content from URL
        result = requests.get(pocket_residues_url)
        # Get content of PDB file
        pdb_residues = result.text
        # Load PDB format as DataFrame
        ppdb = PandasPdb()
        # TODO: Change _construct_df to read_pdb_from_lines once biopandas
        # cuts a new release (currently: 0.2.7), see https://github.com/rasbt/biopandas/pull/72
        pdb_df = ppdb._construct_df(pdb_residues.splitlines(True))["ATOM"]
        # Drop duplicates
        # PDB file contains per atom entries, we only need per residue info
        pdb_df.sort_values("residue_number", inplace=True)
        pdb_df.drop_duplicates(subset="residue_number", keep="first", inplace=True)
        return pdb_df[["residue_number", "residue_name"]]
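
The coordinate extraction in `calculate_pocket_coordinates_from_pocket_pdb_file` keeps every token of one header line that parses as a number, takes the first three as the pocket center, and doubles the last one to obtain the edge length of a cubic box. A minimal stand-alone sketch of that idea is shown here. The helper name `parse_pocket_box` and the example line are assumptions made for illustration, and the real file layout may differ.

```python
def parse_pocket_box(header_line):
    """Extract a cubic docking box from a text line holding x, y, z and a radius."""
    numbers = []
    for token in header_line.split():
        try:
            numbers.append(float(token))
        except ValueError:
            continue  # skip words such as "center" or "radius"
    if len(numbers) < 4:
        raise ValueError(f"Expected at least four numbers, got {len(numbers)}")
    radius = numbers[-1]
    return {"center": numbers[:3], "size": [2 * radius] * 3}


# Hypothetical header line; the real file layout may differ.
example_line = "HEADER Geometric pocket center at 14.5 -3.25 22.0 with max radius 9.5"
box = parse_pocket_box(example_line)
print(box)
```

Checking the number of parsed values up front gives a clear error message if the line does not have the expected content, instead of a silently wrong box.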

#### Implementing The Pipeline's ***BindingSiteDetection*** Class
<br>
<div style="text-align: justify">
Now we can automate the binding-site detection process by implementing the <b><i>BindingSiteDetection</i></b> class of the pipeline. Depending on the binding-site specifications in the input data, the class carries out the corresponding processes and outputs the coordinates of the selected binding site.</div>
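
The class selects the process to run by building a method name from an enumeration member and looking it up with `getattr`. The following toy sketch isolates that dispatch mechanism. The `Dispatcher` class and its return strings are made up for illustration; only the enumeration values and the `compute_by_` prefix mirror the class in this section.

```python
from enum import Enum


class DefinitionMethods(Enum):
    DETECTION = "detection"
    LIGAND = "ligand"
    COORDINATES = "coordinates"


class Dispatcher:
    """Toy class that routes each enum member to a method named after it."""

    def run(self, method):
        handler_name = "compute_by_" + method.name.lower()
        handler = getattr(self, handler_name, None)
        if handler is None:
            raise NotImplementedError(f"No handler named {handler_name}")
        return handler()

    def compute_by_coordinates(self):
        return "use the coordinates given in the input"

    def compute_by_ligand(self):
        return "derive a box around the co-crystallized ligand"


dispatcher = Dispatcher()
print(dispatcher.run(DefinitionMethods.LIGAND))
```

With this design, supporting a new definition method only requires a new enumeration member and a method with the matching name; no `if`/`elif` chain has to be extended.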

In [None]:
class BindingSiteDetection:
    """
    Automated binding-site detection process of the pipeline.
    Take in the Protein object, Specs.BindingSite object and
    the corresponding output path, and automatically run all the necessary calculations
    to output the suitable binding-site coordinates based on the input specifications of the project.

    Parameters
    ----------
    Protein : Protein object
        The Protein object of the project.
    BindingSiteSpecs : Specs.BindingSite object
        The binding-site specification data-class of the project.
    binding_site_output_path : str or pathlib.Path object
        Output path of the project's binding-site information.
    """

    class Consts:
        class DefinitionMethods(Enum):
            DETECTION = "detection"
            LIGAND = "ligand"
            COORDINATES = "coordinates"

    def __init__(self, Protein, BindingSiteSpecs, binding_site_output_path):

        self.output_path = binding_site_output_path
        self.Protein = Protein
        # derive the relevant function name from definition method
        definition_method_name = "compute_by_" + BindingSiteSpecs.definition_method.name.lower()
        # get the function from its name
        definition_method = getattr(self, definition_method_name)
        # call the function
        definition_method(Protein, BindingSiteSpecs, binding_site_output_path)

    def compute_by_coordinates(self, Protein, BindingSiteSpecs, binding_site_output_path):
        Protein.binding_site_coordinates = BindingSiteSpecs.coordinates

    def compute_by_ligand(self, Protein, BindingSiteSpecs, binding_site_output_path):
        ligand_object = PDB.extract_molecule_from_pdb_file(
            BindingSiteSpecs.protein_ligand_id,
            Protein.pdb_filepath,
            binding_site_output_path / BindingSiteSpecs.protein_ligand_id,
        )
        # calculate the geometric center of the molecule,
        # which represents the center of the rectangular box,
        # as well as the length of the molecule in each dimension,
        # which corresponds to the edge lengths of the rectangular box.
        # Also, we will add a 5 Å buffer in each dimension to allow the
        # correct placements of ligands that are bigger than the co-crystallized ligands
        # or bind in a different fashion.
        pocket_center = (
            ligand_object.positions.max(axis=0) + ligand_object.positions.min(axis=0)
        ) / 2
        pocket_size = ligand_object.positions.max(axis=0) - ligand_object.positions.min(axis=0) + 5
        pocket_coordinates = {
            "center": pocket_center.tolist(),
            "size": pocket_size.tolist(),
        }
        Protein.binding_site_coordinates = pocket_coordinates
        return

    def compute_by_detection(self, Protein, BindingSiteSpecs, binding_site_output_path):
        # derive the relevant function name from detection method
        detection_method_name = "detect_by_" + BindingSiteSpecs.detection_method.name.lower()
        # get the function from its name
        detection_method = getattr(self, detection_method_name)
        # call the function
        detection_method(Protein, BindingSiteSpecs, binding_site_output_path)

    def detect_by_dogsitescorer(self, Protein, BindingSiteSpecs, binding_site_output_path):

        if hasattr(Protein, "pdb_code"):
            self.dogsitescorer_pdb_id = Protein.pdb_code
        elif hasattr(Protein, "pdb_filepath"):
            self.dogsitescorer_pdb_id = DoGSiteScorer.upload_pdb_file(Protein.pdb_filepath)

        # try to get the chain_id for binding-site detection if it's available in input data
        if hasattr(BindingSiteSpecs, "protein_chain_id"):
            self.dogsitescorer_chain_id = BindingSiteSpecs.protein_chain_id
        else:
            # if chain_id is not in input data, try to set it to the first chain found in pdb file
            try:
                self.dogsitescorer_chain_id = Protein.chains[0]
            # if no chain is found in pdb file either, leave the chain_id empty
            except Exception:
                self.dogsitescorer_chain_id = ""

        # by default, no ligand is specified for the detection calculation
        self.dogsitescorer_ligand_id = ""
        # try to get the ligand_id for detection calculation if it's available in input data
        if hasattr(BindingSiteSpecs, "protein_ligand_id"):
            ligand_id = BindingSiteSpecs.protein_ligand_id
            # check if the given ligand_id is already given in the DoGSiteScorer format
            if "_" in ligand_id:
                self.dogsitescorer_ligand_id = ligand_id
                self.dogsitescorer_chain_id = ligand_id.split("_")[1][0]
            else:
                # otherwise, build the DoGSiteScorer name from the ligands found in the pdb file
                try:
                    for ligand in Protein.ligands:
                        if (ligand[0] == ligand_id) and (
                            ligand[1][0] == self.dogsitescorer_chain_id
                        ):
                            self.dogsitescorer_ligand_id = (
                                ligand_id + "_" + self.dogsitescorer_chain_id + "_" + ligand[1][1:]
                            )
                            break
                except Exception:
                    self.dogsitescorer_ligand_id = ""

        self.dogsitescorer_binding_sites_df = DoGSiteScorer.submit_job(
            self.dogsitescorer_pdb_id,
            self.dogsitescorer_ligand_id,
            self.dogsitescorer_chain_id,
        )

        DoGSiteScorer.save_binding_sites_to_file(
            self.dogsitescorer_binding_sites_df, binding_site_output_path
        )

        self.best_binding_site_name = DoGSiteScorer.select_best_pocket(
            self.dogsitescorer_binding_sites_df,
            BindingSiteSpecs.selection_method.value,
            BindingSiteSpecs.selection_criteria,
        )

        self.best_binding_site_data = self.dogsitescorer_binding_sites_df.loc[
            self.best_binding_site_name
        ]

        self.best_binding_site_coordinates = (
            DoGSiteScorer.calculate_pocket_coordinates_from_pocket_pdb_file(
                binding_site_output_path / (self.best_binding_site_name)
            )
        )

        Protein.binding_site_coordinates = self.best_binding_site_coordinates
        return

    def visualize(self, pocket_name):
        """
        Visualize a detected pocket.

        Parameters
        ----------
        pocket_name : str
            Name of the detected pocket, which has a corresponding CCP4 file
            with the same name stored in the output folder for binding site data.

        Returns
        -------
            NGLView viewer
            Viewer showing the given pocket.
        """
        if hasattr(self.Protein, "pdb_code"):
            viewer = NGLView.binding_site(
                "pdb_code",
                self.Protein.pdb_code,
                str(self.output_path / pocket_name) + ".ccp4",
            )
        else:
            viewer = NGLView.protein(
                "pdb",
                self.Protein.pdb_filepath,
                str(self.output_path / pocket_name) + ".ccp4",
            )
        return viewer

    def visualize_best(self):
        """
        Visualize the selected binding pocket.
        The binding pocket should have a corresponding CCP4 file
        with the same name stored in the output folder for binding site data.

        Returns
        -------
            NGLView viewer
            Viewer showing the given pocket.
        """
        return self.visualize(self.best_binding_site_name)
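
The ligand-based definition in `compute_by_ligand` amounts to an axis-aligned bounding box around the atom positions of the co-crystallized ligand, enlarged by a fixed buffer. The following sketch reproduces this calculation with plain Python lists instead of the array returned by the pipeline's helper. The function name `bounding_box` and the atom positions are made up for illustration.

```python
def bounding_box(positions, buffer=5.0):
    """Return center and edge lengths of the axis-aligned box around 3D points."""
    if not positions:
        raise ValueError("At least one atom position is required")
    mins = [min(axis) for axis in zip(*positions)]
    maxs = [max(axis) for axis in zip(*positions)]
    center = [(hi + lo) / 2 for hi, lo in zip(maxs, mins)]
    size = [hi - lo + buffer for hi, lo in zip(maxs, mins)]
    return {"center": center, "size": size}


# Three made-up atom positions in Å
atoms = [(0.0, 0.0, 0.0), (4.0, 2.0, -2.0), (2.0, 6.0, 1.0)]
box = bounding_box(atoms)
print(box)
```

Note that adding the buffer to the edge length extends the box by half of that value on each side of the ligand.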

##### Automatically Calculating the Binding-Site: Instantiating The ***BindingSiteDetection*** Class
We can now instantiate the ***BindingSiteDetection*** class, using the ***Protein*** object, the ***Specs.BindingSite*** object, and the binding-site output path of our project. This will automatically run all the processes needed to obtain the coordinates of the selected binding-site, based on the specifications in the input data. The results will then be stored as instance attributes in the instantiated ***BindingSiteDetection*** object. We thus assign this object to our project:

In [None]:
project1.BindingSiteDetection = BindingSiteDetection(
    project1.Protein,
    project1.Specs.BindingSite,
    project1.Specs.OutputPaths.binding_site_detection,
)

All intermediate information leading to the selected binding-pocket's coordinates is now stored in the ***BindingSiteDetection*** instance of our project. For example, a dataframe containing all retrieved information on all detected binding sites:

In [None]:
project1.BindingSiteDetection.dogsitescorer_binding_sites_df

Name of the selected binding site:

In [None]:
project1.BindingSiteDetection.best_binding_site_name

The coordinates of the selected binding site are also assigned to the ***Protein*** object in the project:

In [None]:
project1.Protein.binding_site_coordinates

We can also visualize the selected binding pocket:

In [None]:
project1.BindingSiteDetection.visualize_best()

Or any other binding pocket, by providing its name:

In [None]:
project1.BindingSiteDetection.visualize("P_0")

<a id='6_practical'></a>
### Ligand Similarity Search
<br>
<div style="text-align: justify">
With the coordinates of the protein's binding-site in hand, we now focus on the ligand similarity-search part of the pipeline. For performing the similarity-search, we will use the <i>PubChem</i> webserver. We have already implemented all the required functions to perform the similarity-search in the <b><i>PubChem</i></b> helper-class that we used for developing the <b><i>Ligand</i></b> class. Therefore, we can now directly commence with writing the <b><i>LigandSimilaritySearch</i></b> class of the pipeline. </div>

#### Implementing The Pipeline's ***LigandSimilaritySearch*** Class
<br>
<div style="text-align: justify"> 
Here, we will automate the similarity-search process of the pipeline. This class will take in the <b><i>Ligand</i></b> object and the <b><i>Specs.LigandSimilaritySearch</i></b> object of the pipeline, and initialize a similarity-search using <i>PubChem</i> webserver. Subsequently, several drug-likeness scores are calculated for each of the initially retrieved analogs. Using these scores, a given number (specified in the input file) of most drug-like analogs are then selected and used to create <b><i>Ligand</i></b> objects with the help of the <b><i>Ligand</i></b> class that we defined earlier. </div>
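
The two numerical steps of this class, scoring the similarity of each analog to the input ligand and keeping the most drug-like analogs, can be illustrated without any cheminformatics library. In the sketch below, fingerprints are represented as plain sets of on-bit indices and all values are made up. The pipeline itself relies on its <i>RDKit</i> helper-class for both calculations, so this is only an illustration of the underlying arithmetic; the key `drug_score_total` mirrors the column name used in the class.

```python
def dice_similarity(bits_a, bits_b):
    """Dice coefficient of two fingerprints given as sets of on-bit indices."""
    if not bits_a and not bits_b:
        return 0.0
    return 2 * len(bits_a & bits_b) / (len(bits_a) + len(bits_b))


def select_most_druglike(analogs, max_num_druglike):
    """Return the CIDs of the analogs with the highest drug-likeness score."""
    ranked = sorted(analogs, key=lambda analog: analog["drug_score_total"], reverse=True)
    return [analog["cid"] for analog in ranked[:max_num_druglike]]


query_bits = {1, 4, 7, 9}
analogs = [
    {"cid": "A", "bits": {1, 4, 7, 9}, "drug_score_total": 0.50},
    {"cid": "B", "bits": {1, 4, 8}, "drug_score_total": 0.75},
    {"cid": "C", "bits": {2, 3}, "drug_score_total": 0.25},
]
for analog in analogs:
    analog["dice_similarity"] = dice_similarity(query_bits, analog["bits"])

print([(a["cid"], round(a["dice_similarity"], 2)) for a in analogs])
print(select_most_druglike(analogs, max_num_druglike=2))
```

As in the class, similarity and drug-likeness are treated separately: the similarity is stored for each analog, while the final selection is ranked by the drug-likeness score alone.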

In [None]:
class LigandSimilaritySearch:
    """
    Automated ligand similarity-search process of the pipeline.
    Take in input Ligand object, Specs.LigandSimilaritySearch object,
    and the corresponding output path, and automatically run all the necessary
    processes to output a set of analogs with the highest drug-likeness scores.

    Parameters
    ----------
    Ligand_obj : Ligand object
        The Ligand object of the project.
    SimilaritySearchSpecs : Specs.LigandSimilaritySearch object
        The similarity-search specification data-class of the project.
    similarity_search_output_path : str or pathlib.Path object
        Output path of the project's similarity-search information.
    """

    def __init__(self, Ligand_obj, SimilaritySearchSpecs, similarity_search_output_path):

        if (
            SimilaritySearchSpecs.search_engine
            is Consts.LigandSimilaritySearch.SearchEngines.PUBCHEM
        ):
            analogs_info = PubChem.similarity_search(
                Ligand_obj.smiles,
                SimilaritySearchSpecs.min_similarity_percent,
                SimilaritySearchSpecs.max_num_results,
            )
        else:
            raise ValueError(
                f"Unsupported search engine: {SimilaritySearchSpecs.search_engine}"
            )

        # create dataframe from initial results
        all_analog_identifiers_df = pd.DataFrame(analogs_info)
        all_analog_identifiers_df["Mol"] = all_analog_identifiers_df["CanonicalSMILES"].apply(
            lambda smiles: RDKit.create_molecule_object("smiles", smiles)
        )
        all_analog_identifiers_df["dice_similarity"] = all_analog_identifiers_df["Mol"].apply(
            lambda mol: RDKit.calculate_similarity_dice(Ligand_obj.rdkit_obj, mol)
        )
        all_analog_properties_df = pd.DataFrame(
            (
                all_analog_identifiers_df["Mol"].apply(
                    lambda mol: RDKit.calculate_druglikeness(mol)
                )
            ).tolist()
        )
        all_analogs_df = pd.concat([all_analog_identifiers_df, all_analog_properties_df], axis=1)
        all_analogs_df.index = all_analogs_df["CID"]
        all_analogs_df.drop("CID", axis=1, inplace=True)
        all_analogs_df.sort_values(by="dice_similarity", ascending=False, inplace=True)
        self.all_analogs = all_analogs_df
        all_analogs_df.drop("Mol", axis=1).to_csv(
            similarity_search_output_path / "analogs_all.csv"
        )

        analogs_dict = {}
        for analog_cid in (
            self.all_analogs.sort_values(by="drug_score_total", ascending=False)
            .head(SimilaritySearchSpecs.max_num_druglike)
            .index
        ):
            new_analog_object = Ligand(
                Consts.Ligand.InputTypes.CID,
                analog_cid,
                similarity_search_output_path,
            )

            new_analog_object.dice_similarity = RDKit.calculate_similarity_dice(
                Ligand_obj.rdkit_obj, new_analog_object.rdkit_obj
            )

            new_analog_object.dataframe.loc["similarity"] = new_analog_object.dice_similarity

            analogs_dict[new_analog_object.cid] = new_analog_object

        Ligand_obj.analogs = analogs_dict

##### Finding Structural Analogs of the Input Ligand: Instantiating The ***LigandSimilaritySearch*** Class
<br>
<div style="text-align: justify">
    By instantiating the <b><i>LigandSimilaritySearch</i></b> class with the <b><i>Ligand</i></b> object and the <b><i>Specs.LigandSimilaritySearch</i></b> data-class of the project, analogs are automatically found and turned into <b><i>Ligand</i></b> objects, which are then assigned to the input <b><i>Ligand</i></b> object, while the table of all retrieved analogs is stored as an instance attribute of the <b><i>LigandSimilaritySearch</i></b> object. We instantiate the class and assign it to our project:
<br><br>
<b><i>Note:</i></b> This process will take about 2 minutes to complete.</div>

In [None]:
project1.LigandSimilaritySearch = LigandSimilaritySearch(
    project1.Ligand,
    project1.Specs.LigandSimilaritySearch,
    project1.Specs.OutputPaths.similarity_search,
)

Now we can view the full list of all fetched analogs, and their calculated physicochemical properties and drug-likeness scores (here showing only the first 5 entries):

In [None]:
project1.LigandSimilaritySearch.all_analogs.head()

From these analogs, a certain number of most drug-like compounds are selected according to the input specifications.  These selected analogs are then turned into <b><i>Ligand</i></b> objects and assigned to the input ***Ligand*** under the attribute name ***analogs***: 

In [None]:
project1.Ligand.analogs

As can be seen, the dictionary contains ***Ligand*** objects, each corresponding to a found analog. Each analog thus has its own attributes and methods (just like our input ***Ligand***), which can then be accessed separately, via the analog's CID. For example:

In [None]:
project1.Ligand.analogs["65997"]()

<a id='7_practical'></a>
### Molecular Docking 
<br>
<div style="text-align: justify">
We now have successfully defined the binding site of our input protein, and found a list of analogs for our input ligand. The next step in the pipeline is thus to perform docking experiments on the input protein's binding site, using the ligand analogs. But before that, we need to implement some helper functions that are needed to prepare the protein and the analogs for docking.</div>

#### Implementing The Required Functions: The ***OBabel*** Helper-Class
Here we will use the <b><i><a href="https://github.com/pybel/pybel">Pybel</a></i></b> package of the <b><i><a href="http://openbabel.org/wiki/Main_Page">OpenBabel</a></i></b> program in order to implement the functions needed to prepare the protein and the ligand analogs for docking.

In [None]:
class OBabel:
    """
    A set of functions based on the OpenBabel's pybel package,
    for preparing proteins and ligands for the docking experiment.
    """

    @staticmethod
    def optimize_structure_for_docking(
        pybel_structure_object,
        add_hydrogens=True,
        protonate_for_pH=7.4,
        calculate_partial_charges=True,
        generate_3d_structure=False,
    ):
        """
        Take a pybel structure object and prepare for docking.

        Parameters
        ----------
        pybel_structure_object : object
            The structure to optimize.
        add_hydrogens : bool (optional; default: True)
            Whether to add hydrogen atoms to the structure.
        protonate_for_pH : float (optional; default: 7.4)
            pH value to protonate the structure at.
            If set to 0 or False, will not protonate.
        calculate_partial_charges : bool (optional; default: True)
            Whether to calculate partial charges for each atom.
        generate_3d_structure : bool (optional; default: False)
            Whether to generate a 3D conformation.
            Must be set to True if the pybel structure is 2D,
            for example if the structure is created from a SMILES string.

        Returns
        -------
            None
            The structure object is optimized in place.
        """
        if add_hydrogens:
            pybel_structure_object.addh()
        if protonate_for_pH:
            pybel_structure_object.OBMol.CorrectForPH(protonate_for_pH)
        if generate_3d_structure:
            pybel_structure_object.make3D(forcefield="mmff94s", steps=10000)
        if calculate_partial_charges:
            for atom in pybel_structure_object.atoms:
                atom.OBAtom.GetPartialCharge()
        return

    @staticmethod
    def create_pdbqt_from_pdb_file(pdb_filepath, pdbqt_filepath, pH=7.4):
        """
        Convert a PDB file to a PDBQT file,
        while adding hydrogen atoms, correcting the protonation state,
        and assigning partial charges.

        Parameters
        ----------
        pdb_filepath: str or pathlib.Path
            Path to input PDB file.
        pdbqt_filepath: str or pathlib.Path
            Path to output PDBQT file.
        pH: float
            pH value for defining the protonation state of the atoms.

        Returns
        -------
            Pybel Molecule object
            Molecule object of PDB file optimized for docking.
        """
        # readfile() provides an iterator over the Molecules in a file.
        # To access the first (and possibly only) molecule in a file,
        # we use next()
        molecule = next(pybel.readfile("pdb", str(pdb_filepath)))
        OBabel.optimize_structure_for_docking(molecule, protonate_for_pH=pH)
        molecule.write("pdbqt", str(pdbqt_filepath), overwrite=True)
        return molecule

    @staticmethod
    def create_pdbqt_from_smiles(smiles, pdbqt_path, pH=7.4):
        """
        Convert a SMILES string to a PDBQT file,
        while adding hydrogen atoms, correcting the protonation state, assigning partial charges,
        and generating a 3D conformer.

        Parameters
        ----------
        smiles: str
            SMILES string.
        pdbqt_path: str or pathlib.Path
            Path to output PDBQT file.
        pH: float
            Protonation at given pH.

        Returns
        -------
            None
        """
        molecule = pybel.readstring("smi", smiles)
        OBabel.optimize_structure_for_docking(
            molecule, protonate_for_pH=pH, generate_3d_structure=True
        )
        molecule.write("pdbqt", str(pdbqt_path), overwrite=True)
        return

    @staticmethod
    def split_multistructure_file(filetype, filepath, output_folder_path=None):
        """
        Split a multi-structure file into separate files (with the same format)
        for each structure. Each file is named with consecutive numbers (starting at 1)
        at the end of the original filename.

        Parameters
        ----------
        filetype : str
            Type of the multimodel file to be split.
            Examples: 'sdf', 'pdb', 'pdbqt' etc.
            For a full list of acceptable file-types, call pybel.informats
        filepath : str or pathlib.Path
            Path of the file to be split.
        output_folder_path : str or pathlib.Path
            (optional; default: same folder as input filepath)
            Path of the output folder to save the split files.

        Returns
        -------
            list of pathlib.Path objects
            List of the full-paths for each split file.
        """
        filepath = Path(filepath)
        filename = filepath.stem
        if output_folder_path is None:
            output_folder_path = filepath.parent
        else:
            output_folder_path = Path(output_folder_path)
            output_folder_path.mkdir(parents=True, exist_ok=True)

        structures = pybel.readfile(filetype, str(filepath))
        output_filepaths = []
        for i, structure in enumerate(structures, 1):
            output_filepath = output_folder_path / f"{filename}_{i}.{filetype}"
            output_filepaths.append(output_filepath)
            structure.write(filetype, str(output_filepath), overwrite=True)
        return output_filepaths

    @staticmethod
    def merge_molecules_to_single_file(
        list_of_pybel_molecule_objects, output_filetype, output_filepath
    ):
        """
        Create a single file containing several molecules.

        Parameters
        ----------
        list_of_pybel_molecule_objects : list of pybel Molecule objects
            List of molecule objects to be merged into a single file.
        output_filetype : str
            Type of the output file.
            Examples: 'sdf', 'pdb', 'pdbqt' etc.
            For a full list of acceptable file-types, call pybel.outformats
        output_filepath : str or pathlib.Path
            Path of the output file including file-name, but excluding extension.

        Returns
        -------
            pathlib.Path object
            Full-path (including extension) of the output file.
        """
        fullpath = Path(str(output_filepath) + "." + output_filetype)

        merged_molecule_file = pybel.Outputfile(output_filetype, str(fullpath))

        for pybel_molecule_object in list_of_pybel_molecule_objects:
            merged_molecule_file.write(pybel_molecule_object)

        merged_molecule_file.close()
        return fullpath
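
The naming scheme of `split_multistructure_file` can be illustrated without Open Babel. The following is only a minimal sketch: the helper name `numbered_filepaths` and the example path are hypothetical, and the number of structures is passed in by hand, whereas the method above obtains it by iterating over the file with `pybel.readfile`.

```python
from pathlib import Path


def numbered_filepaths(filepath, num_structures, output_folder_path=None):
    """Return the paths that the split files would get (numbering starts at 1)."""
    filepath = Path(filepath)
    # Default to the folder of the input file, as in split_multistructure_file
    folder = filepath.parent if output_folder_path is None else Path(output_folder_path)
    return [
        folder / f"{filepath.stem}_{i}{filepath.suffix}"
        for i in range(1, num_structures + 1)
    ]


# Hypothetical docking output containing three poses
paths = numbered_filepaths("docking/CID_123_docking_poses.pdbqt", 3)
for path in paths:
    print(path.name)
```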

#### Implementing The Required Functions: The ***Smina*** Helper-Class
<br>
<div style="text-align: justify">For the docking, we are going to use the <a href="https://sourceforge.net/projects/smina/"><b><i>Smina</i></b></a> program, which does not have a Python-API. However, we can simply communicate with the program via the <b><i>subprocess</i></b> library, which can execute shell commands. We also need to write a function to read the output log of the program and extract the useful data. These functions are implemented here in the <b><i>Smina</i></b> helper-class.</div>
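
To make the idea of driving a command-line program from Python more concrete, here is a minimal sketch. The helper `build_command` and the file names are hypothetical and not part of the pipeline; the flags are the ones used in the `Smina.dock` method of the helper-class. So that the sketch also runs on a machine without Smina, the actual `subprocess` call is demonstrated with the Python interpreter itself as a stand-in program.

```python
import subprocess
import sys


def build_command(program, required_args, log_path=None, seed=None):
    """Assemble an argument list; optional flags are appended only when set."""
    command = [program]
    for flag, value in required_args.items():
        command += [f"--{flag}", str(value)]
    if log_path is not None:
        command += ["--log", str(log_path)]
    if seed is not None:
        command += ["--seed", str(seed)]
    return command


# Hypothetical file names; nothing is executed at this point
command = build_command(
    "smina",
    {"ligand": "ligand.pdbqt", "receptor": "protein.pdbqt", "num_modes": 10},
    seed=1111,
)
print(command)

# Stand-in call: run the Python interpreter instead of Smina and capture its output
output_text = subprocess.check_output(
    [sys.executable, "-c", "print('docking finished')"],
    universal_newlines=True,
)
print(output_text.strip())
```

Passing the command as a list of strings (rather than one shell string) avoids quoting problems with file paths that contain spaces.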

In [None]:
class Smina:
    """
    Set of functions for communicating with the Smina docking program,
    and extracting data from its log file.
    """

    @staticmethod
    def dock(
        ligand_path,
        protein_path,
        pocket_center,
        pocket_size,
        output_path,
        output_format="pdbqt",
        num_poses=10,
        exhaustiveness=10,
        random_seed="",
        log=True,
    ):
        """
        Perform docking with Smina.

        Parameters
        ----------
        ligand_path : str or pathlib.Path
            Path to ligand PDBQT file that should be docked.
        protein_path : str or pathlib.Path
            Path to protein PDBQT file that should be docked to.
        pocket_center : iterable of float or int
            Coordinates defining the center of the binding site.
        pocket_size : iterable of float or int
            Lengths of edges defining the binding site.
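        output_path : str or pathlib.Path
            Path (without file extension) to which docking poses should be saved.
        output_format : str (optional; default: "pdbqt")
            File format of the saved docking poses; also used as the file extension.
        num_poses : int or str
            Maximum number of poses to generate.
        exhaustiveness : int or str
            Exhaustiveness of the search; higher values increase
            the search effort and the runtime.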
        random_seed : int or str
            Seed number to make the docking deterministic for reproducibility.
        log : bool (optional; default: True)
            Whether to also write a log-file in the same output path for each docking.

        Returns
        -------
        output_text: str
            The output log of the docking calculation.
        """
        smina_command = (
            [
                "smina",
                "--ligand",
                str(ligand_path),
                "--receptor",
                str(protein_path),
                "--out",
                str(output_path) + "." + output_format,
                "--center_x",
                str(pocket_center[0]),
                "--center_y",
                str(pocket_center[1]),
                "--center_z",
                str(pocket_center[2]),
                "--size_x",
                str(pocket_size[0]),
                "--size_y",
                str(pocket_size[1]),
                "--size_z",
                str(pocket_size[2]),
                "--num_modes",
                str(num_poses),
                "--exhaustiveness",
                str(exhaustiveness),
            ]
            + (["--log", str(output_path) + "_log.txt"] if log else [])
            + (["--seed", str(random_seed)] if random_seed != "" else [])
        )

        output_text = subprocess.check_output(
            smina_command,
            universal_newlines=True,
        )
        return output_text

    @staticmethod
    def convert_log_to_dataframe(raw_log):
        """
        Convert docking's raw output log into a Pandas DataFrame.

        Parameters
        ----------
        raw_log : str
            Raw output log generated after docking.

        Returns
        -------
            Pandas DataFrame
            DataFrame containing columns 'mode', 'affinity[kcal/mol]',
            'dist from best mode_rmsd_l.b', and 'dist from best mode_rmsd_u.b'
            for each generated docking pose.
        """

        # Remove the unnecessary parts and extract the results table as list of lines
        # The table starts after the line containing: -----+------------+----------+----------
        # and ends before the word "Refine"
        log = (
            raw_log.split("-----+------------+----------+----------")[1]
            .split("Refine")[0]
            .strip()
            .split("\n")
        )

        # parse each line and remove everything except the numbers
        for index in range(len(log)):
            # turn each line into a list
            log[index] = log[index].strip().split(" ")
            # First element is the mode, which is an int.
            # The rest of the elements are either empty strings, or floats
            # Elements that are not empty strings should be extracted
            # (first element as int and the rest as floats)
            log[index] = [int(log[index][0])] + [
                float(value) for value in log[index][1:] if value != ""
            ]

        df = pd.DataFrame(
            log,
            columns=[
                "mode",
                "affinity[kcal/mol]",
                "dist from best mode_rmsd_l.b",
                "dist from best mode_rmsd_u.b",
            ],
        )
        df.index = df["mode"]
        df.drop("mode", axis=1, inplace=True)
        return df
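
The parsing logic of `convert_log_to_dataframe` can be tried out on a small example. The log text below is a hand-written mock-up with made-up numbers, laid out the way the method above expects (a separator line before the table, and the word "Refine" after it); it is not real Smina output. The sketch is a simplified variant that uses only the standard library and returns plain tuples instead of a DataFrame.

```python
SEPARATOR = "-----+------------+----------+----------"

# Hand-written mock-up of a docking log (numbers are made up)
mock_log = """\
mode |   affinity | dist from best mode
     | (kcal/mol) | rmsd l.b.| rmsd u.b.
-----+------------+----------+----------
1       -8.5       0.000      0.000
2       -8.1       1.532      2.101
3       -7.9       2.240      3.876
Refine time 1.23
"""


def parse_results_table(raw_log):
    """Extract (mode, affinity, rmsd_lb, rmsd_ub) tuples from a raw log."""
    table = raw_log.split(SEPARATOR)[1].split("Refine")[0].strip()
    rows = []
    for line in table.splitlines():
        # split() without arguments collapses runs of whitespace
        mode, *values = line.split()
        rows.append((int(mode), *map(float, values)))
    return rows


rows = parse_results_table(mock_log)
best_affinity = min(row[1] for row in rows)
print(rows)
print(best_affinity)
```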

#### Implementing The Pipeline's ***Docking*** Class
We can now implement the ***Docking*** class, which automatically performs docking experiments on all provided ligands, and processes the results in several useful ways. 

In [None]:
class Docking:
    """
    Automated docking process of the pipeline.
    Take in a Protein and a list of ligands, and
    dock each ligand on the protein, using the given specifications.

    Parameters
    ----------
    Protein_object : object; instance of Protein class
        The protein to perform docking on.
    list_Ligand_objects : list of Ligand objects (instances of Ligand class)
        List of ligands to dock on the protein.
    DockingSpecs_object : object; instance of Specs.Docking class
        Specifications for the docking experiment.
    docking_output_path : str or pathlib.Path object
        Output folder path to store the docking data in.
    """

    def __init__(
        self,
        Protein_object,
        list_Ligand_objects,
        DockingSpecs_object,
        docking_output_path,
    ):
        self.pdb_filepath_extracted_protein = docking_output_path / (
            Protein_object.pdb_code + "_extracted_protein.pdb"
        )
        Protein_object.Universe = PDB.extract_molecule_from_pdb_file(
            "protein", Protein_object.pdb_filepath, self.pdb_filepath_extracted_protein
        )

        self.pdbqt_filepath_extracted_protein = docking_output_path / (
            Protein_object.pdb_code + "_extracted_protein_ready_for_docking.pdbqt"
        )

        OBabel.create_pdbqt_from_pdb_file(
            self.pdb_filepath_extracted_protein, self.pdbqt_filepath_extracted_protein
        )

        temp_list_results_df = []
        temp_list_master_df = []

        for ligand in list_Ligand_objects:
            ligand.pdbqt_filepath = docking_output_path / ("CID_" + ligand.cid + ".pdbqt")
            OBabel.create_pdbqt_from_smiles(ligand.remove_counterion(), ligand.pdbqt_filepath)

            ligand.docking_poses_filepath = docking_output_path / (
                "CID_" + ligand.cid + "_docking_poses.pdbqt"
            )

            raw_log = Smina.dock(
                ligand.pdbqt_filepath,
                self.pdbqt_filepath_extracted_protein,
                Protein_object.binding_site_coordinates["center"],
                Protein_object.binding_site_coordinates["size"],
                str(ligand.docking_poses_filepath.with_suffix("")),
                output_format="pdbqt",
                num_poses=DockingSpecs_object.num_poses_per_ligand,
                exhaustiveness=DockingSpecs_object.exhaustiveness,
                random_seed=DockingSpecs_object.random_seed,
                log=True,
            )

            ligand.docking_poses_split_filepaths = OBabel.split_multistructure_file(
                "pdbqt", ligand.docking_poses_filepath
            )

            # Assigning the dataframe of the Smina output
            # to the ligand's attribute 'dataframe_docking'
            df = Smina.convert_log_to_dataframe(raw_log)
            ligand.dataframe_docking = df.copy()

            # Extracting some useful information from the Smina-output dataframe
            # and assigning them as separate ligand attributes.
            # Adding the same summarized information to the ligand's general dataframe as well
            ligand.dataframe.loc["binding_affinity_best"] = ligand.binding_affinity_best = df[
                "affinity[kcal/mol]"
            ].min()
            ligand.dataframe.loc["binding_affinity_mean"] = ligand.binding_affinity_mean = df[
                "affinity[kcal/mol]"
            ].mean()
            ligand.dataframe.loc["binding_affinity_std"] = ligand.binding_affinity_std = df[
                "affinity[kcal/mol]"
            ].std()
            ligand.dataframe.loc[
                "docking_poses_dist_rmsd_lb_mean"
            ] = ligand.docking_poses_dist_rmsd_lb_mean = df["dist from best mode_rmsd_l.b"].mean()
            ligand.dataframe.loc[
                "docking_poses_dist_rmsd_lb_std"
            ] = ligand.docking_poses_dist_rmsd_lb_std = df["dist from best mode_rmsd_l.b"].std()
            ligand.dataframe.loc[
                "docking_poses_dist_rmsd_ub_mean"
            ] = ligand.docking_poses_dist_rmsd_ub_mean = df["dist from best mode_rmsd_u.b"].mean()
            ligand.dataframe.loc[
                "docking_poses_dist_rmsd_ub_std"
            ] = ligand.docking_poses_dist_rmsd_ub_std = df["dist from best mode_rmsd_u.b"].std()

            df["CID"] = ligand.cid
            df["drug_score_total"] = ligand.drug_score_total
            df.set_index(["CID", df.index], inplace=True)

            master_df = df.copy()
            master_df["filepath"] = ligand.docking_poses_split_filepaths

            temp_list_results_df.append(df)
            temp_list_master_df.append(master_df)

        self.results_dataframe = pd.concat(temp_list_results_df)
        self.master_df = pd.concat(temp_list_master_df)
        self.results_dataframe.to_csv(docking_output_path / "Results_Summary.csv")

    def visualize_all_poses(self):
        """
        Visualize docking poses of all analogs, using NGLView.

        Returns
        -------
            NGLViewer object
            Interactive viewer of all analogs' docking poses,
            sorted by their binding affinities.
        """
        df = self.master_df.sort_values(by=["affinity[kcal/mol]", "CID", "mode"])
        self.visualize(df)
        return

    def visualize_analog_poses(self, cid):
        """
        Visualize docking poses of a certain analog, using NGLView.

        Parameters
        ----------
        cid : str or int
            CID of the analog.

        Returns
        -------
            NGLViewer object
            Interactive viewer of given analog's docking poses,
            sorted by their binding affinities.
        """
        df = self.master_df.xs(str(cid), level=0, axis=0, drop_level=False)
        self.visualize(df)
        return

    def visualize(self, fitted_master_df):
        """
        Visualize any collection of docking poses, using NGLView.

        Parameters
        ----------
        fitted_master_df : Pandas DataFrame
            Any section of the master docking dataframe,
            stored under self.master_df.

        Returns
        -------
            NGLViewer object
            Interactive viewer of given analog's docking poses,
            sorted by their binding affinities.
        """
        list_docking_poses_labels = list(
            map(lambda x: x[0] + " - " + str(x[1]), fitted_master_df.index.tolist())
        )
        NGLView.docking(
            self.pdb_filepath_extracted_protein,
            "pdb",
            fitted_master_df["filepath"].tolist(),
            "pdbqt",
            list_docking_poses_labels,
            fitted_master_df["affinity[kcal/mol]"].tolist(),
        )
        return

##### Performing Docking Experiments on The Selected Analogs: Instantiating The ***Docking*** Class
<br>
<div style="text-align: justify">By instantiating the <b><i>Docking</i></b> class with the <b><i>Protein</i></b> object, the list of analogs, and the <b><i>Specs.Docking</i></b> data-class of the project, analogs are automatically docked onto the protein, and the results are stored separately for each docking pose. Moreover, some meaningful information is extracted from the results of all docking poses for each analog, and stored separately. We instantiate the class and assign it to our project:
<br><br>
<b><i>Note:</i></b> This part is the most computationally intensive process of the pipeline, and takes roughly 5 to 10 minutes (for 20 ligands) to complete.</div>

In [None]:
project1.Docking = Docking(
    project1.Protein,
    list(project1.Ligand.analogs.values()),
    project1.Specs.Docking,
    project1.Specs.OutputPaths.docking,
)

We can now view all the calculated docking parameters for each docking pose of each analog (here showing only the first 5 entries):

In [None]:
project1.Docking.results_dataframe.sort_values(by="affinity[kcal/mol]").head()

Alternatively, by accessing a specific analog, we can view the full results for that analog using the attribute ***dataframe_docking***:

In [None]:
project1.Ligand.analogs["11292933"].dataframe_docking

A summary of the docking results (e.g. highest/mean binding affinities) is also added to the main dataframe of each analog, and can be viewed, e.g. by calling its object (here showing only the relevant rows, i.e. the last 7 rows):

In [None]:
project1.Ligand.analogs["11292933"]().tail(7)

The same summary results are also added as instance attributes for each object:

In [None]:
project1.Ligand.analogs["11292933"].binding_affinity_best
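
As a brief illustration of how these summary values relate to the per-pose results, the sketch below recomputes them for a made-up list of affinities using only the standard library. Since the affinities are reported in kcal/mol and a more negative value indicates stronger predicted binding, the "best" affinity is the minimum. The sample standard deviation (`statistics.stdev`) is used here because it matches the default behavior of the pandas `std` method used in the `Docking` class.

```python
import statistics

# Made-up affinities (kcal/mol) of three docking poses of one ligand
affinities = [-8.5, -8.1, -7.9]

summary = {
    "binding_affinity_best": min(affinities),  # most negative = strongest binding
    "binding_affinity_mean": statistics.mean(affinities),
    "binding_affinity_std": statistics.stdev(affinities),  # sample standard deviation
}

for name, value in summary.items():
    print(f"{name}: {value:.3f}")
```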

Now with the help of the <b><i>NGLView</i></b> helper-class we had defined earlier, we have also added methods to the <b><i>Docking</i></b> class for visualization of the docking poses. For example, we can just view all docking poses together in an interactive way, using the <b><i>visualize_all_poses</i></b> method. In this method, the docking poses are sorted by their binding affinities, and labeled by their CID and respective docking pose number (i.e. mode). By selecting an analog from the menu below, the viewer automatically shows the protein residues in close proximity (i.e. 6 Å) of the ligand, as well as its corresponding binding affinity.
<br><br>
<b><i>Note:</i></b> If we are interested in visualization of a certain analog's docking poses, we can use the <b><i>visualize_analog_poses</i></b> method instead, and provide the analog's CID.

In [None]:
project1.Docking.visualize_all_poses()
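
The ordering and labeling of the poses in the viewer menu can be sketched in a few lines. The example below is a simplified stand-in for the DataFrame operations in `visualize_all_poses` and `visualize`: poses are sorted by affinity first, then by CID and mode, and each label combines the CID and the mode. The affinities and the second CID are made up for illustration.

```python
# (CID, mode, affinity in kcal/mol); values are made up for illustration
poses = [
    ("11292933", 1, -8.5),
    ("11292933", 2, -8.1),
    ("2044", 1, -8.3),
]

# Sort by affinity first, then by CID and mode to break ties
sorted_poses = sorted(poses, key=lambda pose: (pose[2], pose[0], pose[1]))

# Build one menu label per pose
labels = [f"{cid} - {mode}" for cid, mode, _ in sorted_poses]
print(labels)
```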

Now let's also dock the input ligand in order to be able to compare the results later and see whether any of the analogs has a higher binding affinity:

In [None]:
project1.Ligand.Docking = Docking(
    project1.Protein,
    [project1.Ligand],
    project1.Specs.Docking,
    project1.Specs.OutputPaths.ligand,
)

Similar to the analogs, the docking results of the input ligand are also stored in its object. For example, to see the docking dataframe:

In [None]:
project1.Ligand.dataframe_docking

<a id='8_practical'></a>
### Analysis of Protein–Ligand Interactions
With the calculated docking poses of each analog in hand, we can now focus on analyzing the protein-ligand interactions in each docking pose of each analog. For the analysis we use the ***PLIP*** package, for which we first define a helper-class.

#### Implementing The Required Functions: The ***PLIP*** Helper-Class
Here we will use the [***PLIP***](https://github.com/pharmai/plip) package to implement the functions needed to analyze the protein-ligand interactions.

In [None]:
class PLIP:
    """
    Set of functions required to analyze protein-ligand interactions using the PLIP package.
    """

    class Consts:
        class InteractionTypes(Enum):
            H_BOND = "hbond"
            HYDROPHOBIC = "hydrophobic"
            SALT_BRIDGE = "saltbridge"
            WATER_BRIDGE = "waterbridge"
            PI_STACKING = "pistacking"
            PI_CATION = "pication"
            HALOGEN = "halogen"
            METAL = "metal"

    @staticmethod
    def calculate_interactions(pdb_filepath):
        """
        Calculate protein-ligand interactions in a PDB file.

        Parameters
        ----------
        pdb_filepath : str or pathlib.Path object
            Filepath of the PDB file containing the protein-ligand complex.

        Returns
        -------
            dict of dicts
            Dictionary of all different interaction data for all detected ligands.
            The keys of first dictionary correspond to the ligand-IDs of detected ligands in the PDB file.
            The keys of each sub-dictionary correspond to interaction types, as defined in PLIP.Consts.InteractionTypes.
        """
        protein_ligand_complex = PDBComplex()
        protein_ligand_complex.load_pdb(str(pdb_filepath))

        for ligand in protein_ligand_complex.ligands:
            protein_ligand_complex.characterize_complex(ligand)

        all_ligands_interactions = {}

        for ligand, ligand_binding_site in sorted(protein_ligand_complex.interaction_sets.items()):

            interaction_object = BindingSiteReport(
                ligand_binding_site
            )  # collect data about interactions

            interaction_data = {
                interaction_type.value: (
                    [getattr(interaction_object, interaction_type.value + "_features")]
                    + getattr(interaction_object, interaction_type.value + "_info")
                )
                for interaction_type in PLIP.Consts.InteractionTypes
            }

            all_ligands_interactions[ligand] = interaction_data

        return all_ligands_interactions

    @staticmethod
    def create_dataframe_of_ligand_interactions(ligand_interaction_data, interaction_type):
        """
        Create a Pandas DataFrame from interaction data of a specific ligand,
        for a specific interaction type.

        Parameters
        ----------
        ligand_interaction_data : dict
            Interaction data calculated by the 'calculate_interactions' function,
            where only one ligand's interaction data is chosen from the output of
            that function.
        interaction_type : Enum
            One of the enumerations from PLIP.Consts.InteractionTypes,
            specifying the type of interaction, for which the DataFrame should be created.

        Returns
        -------
            pandas DataFrame
            DataFrame containing all the information for the specified interactions
            in the input ligand interaction data.
        """

        interaction_df = pd.DataFrame.from_records(
            # interaction data are stored after the first element
            ligand_interaction_data[interaction_type.value][1:],
            # the first element corresponds to interaction parameters,
            # which are used here as column names.
            columns=ligand_interaction_data[interaction_type.value][0],
        )

        return interaction_df

    @staticmethod
    def create_protein_ligand_complex(
        protein_pdbqt_filepath, docking_pose_pdbqt_filepath, ligand_id, output_filepath
    ):
        """
        Create a protein-ligand-complex PDB file out of separate protein and ligand files.

        Parameters
        ----------
        protein_pdbqt_filepath : str or pathlib.Path object
            Filepath of the PDBQT file containing the protein.
        docking_pose_pdbqt_filepath : str or pathlib.Path object
            Filepath of the PDBQT file containing the ligand's docking pose.
        ligand_id : str
            An identifier for the ligand to write into the PDB file.
        output_filepath : str or pathlib.Path object
            Output filepath of the PDB file containing the protein-ligand complex.

        Returns
        -------
            pathlib.Path object
            Complete filepath of the created protein-ligand-complex PDB file.
        """

        def pdbqt_to_pdbblock(pdbqt_filepath):
            lines = []
            with open(pdbqt_filepath) as file:
                for line in file:
                    if line[:4] == "ATOM":
                        lines.append(line[:67].strip())
            return "\n".join(lines)

        protein_pdbblock = pdbqt_to_pdbblock(protein_pdbqt_filepath)

        ligand_pdbblock = pdbqt_to_pdbblock(docking_pose_pdbqt_filepath)
        # ligand_pdbblock = ligand_pdbblock.replace('ATOM', 'HETATM')
        # ligand_pdbblock = ligand_pdbblock.replace('UNL     ', (ligand_id[:8]+
        #                                    " "*(8-len(ligand_id))))

        full_output_filepath = str(output_filepath) + ".pdb"
        with open(full_output_filepath, "w") as file:
            file.write(protein_pdbblock)
            file.write(f"\nCOMPND    {ligand_id}\n")
            file.write(ligand_pdbblock)

        return Path(full_output_filepath)
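
The inner function `pdbqt_to_pdbblock` relies on the fixed-column layout of the file formats. As a minimal sketch, assuming the usual layout in which a PDBQT atom record is a PDB record (up to the temperature factor ending at column 66) followed by the partial charge and the atom type, cutting each line after column 67 and stripping trailing whitespace yields a plain PDB record. The atom data below are made up, and the line is assembled with a format string so that the column positions are explicit.

```python
# Assemble a PDBQT atom record column by column (made-up atom data)
pdbqt_line = (
    "ATOM  {serial:>5d} {name:<4s} {res:>3s} {chain}{resnr:>4d}    "
    "{x:>8.3f}{y:>8.3f}{z:>8.3f}{occ:>6.2f}{temp:>6.2f}"
    "    {charge:>6.3f} {atype:<2s}"
).format(
    serial=1, name=" N  ", res="MET", chain="A", resnr=1,
    x=27.340, y=24.430, z=2.614, occ=1.00, temp=0.00,
    charge=-0.066, atype="N",
)

# Same slicing as in pdbqt_to_pdbblock: keep the PDB part, drop charge and atom type
pdb_line = pdbqt_line[:67].strip()

print(repr(pdbqt_line))
print(repr(pdb_line))
```

Note that the helper in the class only keeps lines starting with `ATOM`, so any other record types in the PDBQT files are dropped from the resulting complex file.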

#### Implementing The Pipeline's ***InteractionAnalysis*** Class
We can now implement the ***InteractionAnalysis*** class, which automatically calculates all interaction information for each docking pose of each ligand. 
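
One step inside this class deserves a closer look: because the residue numbers in the protein PDBQT file start at 1, the residue numbers reported by PLIP have to be shifted back to the original numbering. The following is a minimal sketch of that correction on mock data. The function name `restore_residue_numbers`, the first residue number, and the records are hypothetical; only the layout (a header of column names followed by one tuple per interaction, with the residue number as first element) follows the data structure used in the class.

```python
def restore_residue_numbers(interaction_records, first_residue_number):
    """Shift residue numbers (first element of each record) back to the original numbering."""
    header, *records = interaction_records
    restored = [
        # tuples are immutable, so a new tuple is built for each record
        (record[0] + first_residue_number - 1, *record[1:])
        for record in records
    ]
    return [header, *restored]


# Mock data: header with column names, followed by two interaction records
mock_records = [
    ("RESNR", "RESTYPE", "DIST"),
    (1, "LEU", 3.6),
    (76, "MET", 3.9),
]

restored = restore_residue_numbers(mock_records, first_residue_number=696)
print(restored)
```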

In [None]:
class InteractionAnalysis:
    """
    Automated protein-ligand interaction analysis process of the pipeline.

    Parameters
    ----------
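    separated_protein_pdbqt_filepath : str or pathlib.Path object
        Filepath of the separated protein PDBQT file used in docking.
    separated_protein_pdb_filepath : str or pathlib.Path object
        Filepath of the separated protein PDB file.
    protein_first_residue_number : int
        Residue number of the first residue in the original protein structure,
        used to restore the original residue numbering in the PLIP data.
    list_Ligand_objects : list of Ligand objects (instances of Ligand class)
        List of ligands to analyze their interactions with the protein.
    docking_master_df : pandas DataFrame
        Master dataframe created by the Docking class (including the 'filepath' column).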
    InteractionAnalysisSpecs_object : object; instance of Specs.InteractionAnalysis class
        Specifications for the interaction analysis processes.
    interaction_analysis_output_path : str or pathlib.Path object
        Output folder path to store the analyzed data in.
    """

    def __init__(
        self,
        separated_protein_pdbqt_filepath,
        separated_protein_pdb_filepath,
        protein_first_residue_number,
        list_Ligand_objects,
        docking_master_df,
        InteractionAnalysisSpecs_object,
        interaction_analysis_output_path,
    ):

        self._analogs = list_Ligand_objects
        self._pdb_filepath_extracted_protein = separated_protein_pdb_filepath

        if InteractionAnalysisSpecs_object.program is Consts.InteractionAnalysis.Programs.PLIP:
            results_df = docking_master_df.copy()
            results_df.drop("filepath", axis=1, inplace=True)
            results_df["total_num_interactions"] = 0

            interaction_master_df = docking_master_df.copy()

            for analog in list_Ligand_objects:
                analog.dataframe.loc["average_num_total_interactions", "Value"] = 0
                for interaction_type in PLIP.Consts.InteractionTypes:
                    analog.dataframe.loc[
                        f"average_num_{interaction_type.name.lower()}", "Value"
                    ] = 0

                for index, docking_pose_filepath in zip(
                    range(len(analog.docking_poses_split_filepaths)),
                    analog.docking_poses_split_filepaths,
                ):
                    analog.protein_complex_filepath = PLIP.create_protein_ligand_complex(
                        separated_protein_pdbqt_filepath,
                        docking_pose_filepath,
                        analog.cid,
                        interaction_analysis_output_path / (f"CID_{analog.cid}_{index+1}"),
                    )

                    # interaction_data will be dict of dicts, where each of the
                    # outer dict's items correspond to a ligand found in the pdb file
                    interaction_data = PLIP.calculate_interactions(analog.protein_complex_filepath)

                    # Since we are only passing PDB files with a single ligand, the outer dict
                    # will only have one item, which we extract:
                    ligand_interaction_data = interaction_data[list(interaction_data.keys())[0]]
                    # This extracted item is again a dict, where items correspond to
                    # different interaction-types.

                    # NOTICE: when using pybel to create the protein PDBQT file
                    # for the docking experiment, the residue numbers are reset (i.e. start at 1).
                    # Since this PDBQT file is also used to create a protein-ligand complex file
                    # for interaction-analysis with PLIP, the residue numbers in PLIP are also affected.
                    # Now we have to fix this here, before further processing the PLIP data.
                    # To do so, we simply loop through all the interaction data to find all the
                    # residue numbers, add the protein's first residue number to them, and subtract 1,
                    # so that we get back the original residue numbers. Also, since the data are stored
                    # in tuples, we have to convert them to a list first, in order to be able to change them.
                    for certain_interactions_data in ligand_interaction_data.values():
                        for single_interaction_data in range(1, len(certain_interactions_data)):
                            list_from_tuple = list(
                                certain_interactions_data[single_interaction_data]
                            )
                            list_from_tuple[0] += protein_first_residue_number - 1
                            certain_interactions_data[single_interaction_data] = tuple(
                                list_from_tuple
                            )

                    interaction_master_df.loc[(analog.cid, index + 1), "plip_dict"] = [
                        interaction_data[list(interaction_data.keys())[0]]
                    ]

                    for interaction_type in PLIP.Consts.InteractionTypes:
                        df = PLIP.create_dataframe_of_ligand_interactions(
                            interaction_data[list(interaction_data.keys())[0]],
                            interaction_type,
                        )
                        setattr(
                            analog,
                            f"docking_pose_{index+1}_interactions_{interaction_type.name.lower()}",
                            df,
                        )

                        results_df.loc[
                            (analog.cid, index + 1), interaction_type.name.lower()
                        ] = len(df)
                        analog.dataframe.loc[
                            f"average_num_{interaction_type.name.lower()}", "Value"
                        ] += len(df)
                        analog.dataframe.loc["average_num_total_interactions", "Value"] += len(df)

                        results_df[interaction_type.name.lower()] = pd.to_numeric(
                            results_df[interaction_type.name.lower()],
                            downcast="integer",
                        )
                        results_df.loc[(analog.cid, index + 1), "total_num_interactions"] += len(
                            df
                        )

                analog.num_total_interactions_highest = results_df.loc[
                    analog.cid, "total_num_interactions"
                ].max()
                analog.dataframe.loc["average_num_total_interactions", "Value"] /= len(
                    analog.docking_poses_split_filepaths
                )
                analog.num_total_interactions_mean = analog.dataframe.loc[
                    "average_num_total_interactions", "Value"
                ]
                for interaction_type in PLIP.Consts.InteractionTypes:
                    analog.dataframe.loc[
                        f"average_num_{interaction_type.name.lower()}", "Value"
                    ] /= len(analog.docking_poses_split_filepaths)

            results_df["total_num_interactions"] = pd.to_numeric(
                results_df["total_num_interactions"], downcast="integer"
            )
            self.results = results_df
            self.master_df = interaction_master_df

    def find_poses_with_specific_interactions(self, list_interaction_data, all_or_any):
        """
        Find docking poses containing a specific set of interactions.

        Parameters
        ----------
        list_interaction_data : list of lists
            List of desired interactions. Each sub-list should have the following format:
            [interaction_type, residue_nr]
            interaction_type : str
                Type of desired interaction. Allowed values are:
                'h_bond', 'hydrophobic', 'salt_bridge', 'water_bridge',
                'pi_stacking', 'pi_cation', 'halogen', 'metal'
            residue_nr : int
                Residue number involved in the given interaction_type
            Example: [["h_bond", 793], ["hydrophobic", 860]]
        all_or_any : str
            Allowed values: "all", "any"
            "all": all given interactions should be present in a docking pose.
            "any": it is enough when one of the given interactions is present in a docking pose.

        Returns
        -------
            list of tuples
            List of identifiers for the docking poses in the format:
            (CID, pose_nr)
            CID : str
                CID of the analog with the eligible docking pose
            pose_nr : int
                Docking pose number of the eligible docking pose
        """
        list_eligible_analog_cid_docking_pose_nr_tuple = []
        for analog in self._analogs:
            for analog_docking_pose_nr in range(1, len(analog.dataframe_docking) + 1):
                num_hits_in_analog_docking_pose = 0
                for interaction_type, residue_nr in list_interaction_data:
                    interaction_df = getattr(
                        analog,
                        f"docking_pose_{analog_docking_pose_nr}_interactions_{interaction_type}",
                    )
                    if residue_nr in interaction_df["RESNR"].values:
                        num_hits_in_analog_docking_pose += 1
                if (
                    (all_or_any == "all")
                    and (num_hits_in_analog_docking_pose == len(list_interaction_data))
                ) or ((all_or_any == "any") and (num_hits_in_analog_docking_pose != 0)):
                    list_eligible_analog_cid_docking_pose_nr_tuple.append(
                        (analog.cid, analog_docking_pose_nr)
                    )
        return list_eligible_analog_cid_docking_pose_nr_tuple

    def visualize_all_interactions(self):
        """
        Visualize docking poses of all analogs, using NGLView.
        The docking poses are sorted by their binding affinities in a menu.

        Returns
        -------
            NGLView viewer object
            Interactive view, as returned by NGLView.interactions.
        """
        df = self.master_df.sort_values(by=["affinity[kcal/mol]", "CID", "mode"])
        view = self.visualize(df)
        return view

    def visualize_analog_interactions(self, cid):
        """
        Visualize interactions in the docking poses of a certain analog, using NGLView.
        The docking poses are sorted by their binding affinities in a menu.

        Parameters
        ----------
        cid : str or int
            CID of the analog.

        Returns
        -------
            NGLView viewer object
            Interactive view, as returned by NGLView.interactions.
        """
        df = self.master_df.xs(str(cid), level=0, axis=0, drop_level=False)
        view = self.visualize(df)
        return view

    def visualize_docking_poses_interactions(self, list_of_cid_pose_nr_tuples):
        """
        Visualize interactions in a given set of docking poses, using NGLView.
        The docking poses are sorted in a menu, in the same order as the given list.

        Parameters
        ----------
        list_of_cid_pose_nr_tuples : list of tuples
            List of identifiers for the docking poses in the format:
            (CID, pose_nr)
            CID : str
                CID of the analog
            pose_nr : int
                Docking pose number
            Example: [("184614", 1), ("3628", 4)]

        Returns
        -------
            NGLView viewer object
            Interactive view, as returned by NGLView.interactions.
        """
        df = self.master_df.loc[list_of_cid_pose_nr_tuples]
        view = self.visualize(df)
        return view

    def visualize(self, fitted_master_df):
        """
        Visualize interactions in any collection of docking poses, using NGLView.
        The docking poses are sorted by their given order in a menu.

        Parameters
        ----------
        fitted_master_df : Pandas DataFrame
            Any section of the master InteractionAnalysis dataframe,
            stored under self.master_df.

        Returns
        -------
            NGLView viewer object
            Interactive view, as returned by NGLView.interactions.
        """
        list_docking_poses_labels = list(
            map(lambda x: x[0] + " - " + str(x[1]), fitted_master_df.index.tolist())
        )
        view = NGLView.interactions(
            self._pdb_filepath_extracted_protein,
            "pdb",
            fitted_master_df["filepath"].tolist(),
            "pdbqt",
            list_docking_poses_labels,
            fitted_master_df["affinity[kcal/mol]"].tolist(),
            fitted_master_df["plip_dict"].tolist(),
        )
        return view

    def plot_interaction_affinity_correlation(self):
        """
        View a correlation plot between binding affinity and number of interactions
        in each docking pose.

        Returns
        -------
            None
        """
        df = self.results.sort_values(by="affinity[kcal/mol]", ascending=True)

        fig, ax1 = plt.subplots()
        color = "tab:red"
        ax1.set_ylabel("Binding Affinity [kcal/mol]", color=color)
        ax1.tick_params(axis="y", labelcolor=color)
        ax1.plot(
            list(map(abs, df["affinity[kcal/mol]"].tolist())),
            linewidth=0.5,
            linestyle="--",
            color="r",
            marker=".",
            markersize=2,
            markerfacecolor="blue",
            markeredgecolor="red",
        )

        ax2 = ax1.twinx()
        color = "tab:blue"
        ax2.set_ylabel("Number of Interactions", color=color)
        ax2.tick_params(axis="y", labelcolor=color)

        ax2.plot(
            df["total_num_interactions"].tolist(),
            linewidth=0.5,
            label="Total Interactions",
            color="black",
        )
        ax2.plot(
            df["h_bond"].tolist(),
            linewidth=0.5,
            label="H-bond Interactions",
            color="g",
            linestyle="--",
        )
        ax2.plot(
            df["hydrophobic"].tolist(),
            linewidth=0.5,
            label="Hydrophobic Interactions",
            linestyle="dashed",
        )

        plt.legend(fontsize=7)
        plt.show()
        return

##### Analyzing the Protein-Ligand Interactions in Calculated Docking Poses: Instantiating The ***InteractionAnalysis*** Class
The ***InteractionAnalysis*** class can be instantiated by providing the filepaths of the separated protein structure (in PDBQT and PDB format), the first residue number of the protein, a list of all analogs (as ***Ligand*** objects), the master dataframe of the docking process, the interaction analysis specifications, and the output path for storing the interaction analysis data:

In [None]:
project1.InteractionAnalysis = InteractionAnalysis(
    project1.Docking.pdbqt_filepath_extracted_protein,
    project1.Docking.pdb_filepath_extracted_protein,
    project1.Protein.residue_number_first,
    list(project1.Ligand.analogs.values()),
    project1.Docking.master_df,
    project1.Specs.InteractionAnalysis,
    project1.Specs.OutputPaths.interaction_analysis,
)

The interactions can now be inspected collectively for all docking poses of all analogs. Here, only the number of interactions is recorded for each interaction type (showing only the first 5 entries):

In [None]:
project1.InteractionAnalysis.results.sort_values(
    by="total_num_interactions", ascending=False
).head()

If we are interested in the details of a specific interaction type for a specific docking pose, we can access this information from the corresponding ***Ligand*** objects of the analogs. For example, we can access the data for the hydrophobic interactions of docking pose 1 of analog 65997 (showing only the first 5 entries):

In [None]:
project1.Ligand.analogs["65997"].docking_pose_1_interactions_hydrophobic.head()

Or hydrogen-bond interactions of docking pose 2 of analog 11292933:

In [None]:
project1.Ligand.analogs["11292933"].docking_pose_2_interactions_h_bond.head()
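
The per-pose interaction tables follow a regular attribute naming scheme, `docking_pose_<n>_interactions_<type>`, so they can also be looked up programmatically with `getattr`, which is how ***find_poses_with_specific_interactions*** accesses them. The following is only a minimal sketch on a stand-in object: `toy_analog`, its residue numbers, and the helper `get_interaction_table` are hypothetical and not part of the pipeline.

```python
from types import SimpleNamespace

import pandas as pd

# Hypothetical stand-in for a Ligand object; only the attribute naming
# scheme is taken from the pipeline, the data is made up.
toy_analog = SimpleNamespace(
    docking_pose_1_interactions_h_bond=pd.DataFrame({"RESNR": [793, 855]}),
    docking_pose_2_interactions_h_bond=pd.DataFrame({"RESNR": [745]}),
)


def get_interaction_table(analog, pose_nr, interaction_type):
    """Return the interaction dataframe of one pose and one interaction type."""
    attribute_name = f"docking_pose_{pose_nr}_interactions_{interaction_type}"
    return getattr(analog, attribute_name)


residues = get_interaction_table(toy_analog, 1, "h_bond")["RESNR"].tolist()
print(residues)
```

With a real analog, the same helper could be called in a loop over pose numbers and interaction types instead of typing each attribute name by hand.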

Again, a summary of the interaction analysis data is also added to each analog's main dataframe, and can be viewed, for example, by calling the object (here showing only the relevant data, i.e. the last 9 rows):

In [None]:
project1.Ligand.analogs["11292933"]().tail(9)

Also, let's use the ***plot_interaction_affinity_correlation*** method to see if there is any visible correlation between the calculated binding affinities and the number of interactions for each docking pose:

In [None]:
project1.InteractionAnalysis.plot_interaction_affinity_correlation()

As can be seen, no obvious correlation is visible between the two sets of data. At most, the total number of interactions appears to be very weakly correlated with the binding affinity, with several outliers.
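
The visual impression can be complemented with a number. The sketch below computes a Spearman rank correlation as the Pearson correlation of the ranks, using only pandas. It runs on made-up values, not on the pipeline's results: the column names mirror those of the results dataframe, while `rank_correlation` is a hypothetical helper.

```python
import pandas as pd


def rank_correlation(df, x_col, y_col):
    """Spearman correlation, computed as the Pearson correlation of ranks."""
    return df[x_col].rank().corr(df[y_col].rank())


# Made-up values for illustration only.
toy_results = pd.DataFrame(
    {
        "affinity[kcal/mol]": [-9.1, -8.4, -7.9, -7.0],
        "total_num_interactions": [12, 9, 10, 6],
    }
)
# Use the magnitude of the affinity, as in the plot above.
toy_results["abs_affinity"] = toy_results["affinity[kcal/mol]"].abs()

rho = rank_correlation(toy_results, "abs_affinity", "total_num_interactions")
print(round(rho, 3))
```

A value near 1 would indicate that poses with stronger predicted binding also tend to show more interactions, while a value near 0 would support the impression of no clear relationship.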

Now with the help of the <b><i>NGLView</i></b> helper-class we had defined earlier, we have also added methods to the <b><i>InteractionAnalysis</i></b> class for visualization of the protein-ligand interactions. For example, we can just view the interactions for all docking poses in an interactive way. Here, by selecting each docking pose (labeled by their CID and mode-number, and sorted by their binding affinities), all the interacting residues are shown, and the interactions are visualized with colored lines, for which a color-map is also provided.
<br><br>
<b><i>Note:</i></b> If we are interested in visualization of a certain analog's interactions, we can use the <b><i>visualize_analog_interactions</i></b> method instead, and provide the analog's CID.

In [None]:
project1.InteractionAnalysis.visualize_all_interactions()

We can also search for docking poses with a specific set of interactions with specific residues of the protein, using the ***find_poses_with_specific_interactions*** method. For example, many EGFR inhibitors are involved in hydrogen-bonding with the M793 residue of the protein. To find the docking poses that exhibit this interaction:

In [None]:
desired_interactions = project1.InteractionAnalysis.find_poses_with_specific_interactions(
    [["h_bond", 793]], "any"
)
desired_interactions
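
The difference between the `"all"` and `"any"` modes can be illustrated without the pipeline. The following is a simplified sketch of the matching rule only: the dictionary layout, the CIDs, and the residue numbers are made up, and `pose_matches` is a hypothetical helper rather than a method of ***InteractionAnalysis***.

```python
def pose_matches(pose_interactions, wanted, all_or_any):
    """
    Check one docking pose against a list of (interaction_type, residue_nr).

    pose_interactions maps an interaction type to the set of residue
    numbers involved in that interaction type for this pose.
    """
    hits = sum(
        1
        for interaction_type, residue_nr in wanted
        if residue_nr in pose_interactions.get(interaction_type, set())
    )
    if all_or_any == "all":
        return hits == len(wanted)
    if all_or_any == "any":
        return hits > 0
    raise ValueError("all_or_any must be 'all' or 'any'")


# Made-up poses, keyed by (CID, pose_nr).
toy_poses = {
    ("184614", 1): {"h_bond": {793}, "hydrophobic": {718, 860}},
    ("184614", 2): {"h_bond": {745}},
    ("3628", 1): {"hydrophobic": {860}},
}
wanted = [("h_bond", 793), ("hydrophobic", 860)]

matches_all = [key for key, pose in toy_poses.items() if pose_matches(pose, wanted, "all")]
matches_any = [key for key, pose in toy_poses.items() if pose_matches(pose, wanted, "any")]
print(matches_all)
print(matches_any)
```

With a single desired interaction, as in the M793 example, both modes select the same poses; they only differ once several interactions are requested.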

And we can also visualize the results separately, instead of looking for them in the menu of the viewer above.

In [None]:
project1.InteractionAnalysis.visualize_docking_poses_interactions(desired_interactions)

Now let's also analyze the interactions in the docking poses of the input ligand:

In [None]:
project1.Ligand.InteractionAnalysis = InteractionAnalysis(
    project1.Docking.pdbqt_filepath_extracted_protein,
    project1.Docking.pdb_filepath_extracted_protein,
    project1.Protein.residue_number_first,
    [project1.Ligand],
    project1.Ligand.Docking.master_df,
    project1.Specs.InteractionAnalysis,
    project1.Specs.OutputPaths.ligand,
)

The results can be viewed in the same way as for the analogs. For example, to view the summary data:

In [None]:
project1.Ligand.InteractionAnalysis.results.sort_values(
    by="total_num_interactions", ascending=False
)

<a id='9_practical'></a>
### Selection of the Best Optimized Ligand
<br>
<div style="text-align: justify"> 
At this point, we have carried out all the processes in the pipeline, and have gathered all the information in our <b><i>LeadOptimizationPipeline</i></b> instance, i.e. <b><i>project1</i></b>. The project now contains all the information required for selecting the most suitable analog. The choice of the best analog is highly dependent on the specific project at hand, and can be approached from different angles. To recap, we have gathered a number of analogs of the input ligand, and filtered them to keep those with the highest drug-likeness scores. We then calculated several docking poses and corresponding binding affinities for each analog, and analyzed their interactions with the protein. Now we define a class <b><i>OptimizedLigands</i></b>, which takes in the whole project and, based on the specifications in the input file, selects the best analog(s). </div>

In [None]:
class OptimizedLigands:
    """
    The automated selection process of optimized analog(s) at the end of the pipeline.
    Take in the whole project, create a short summary of results, and select the best
    optimized analogs based on user's specifications defined in the input data.
    """
    def __init__(self, project):

        self._project = project

        df = project.InteractionAnalysis.results.rename(columns={"affinity[kcal/mol]": "affinity"})

        self.higher_affinity_poses = (
            df[df["affinity"] < project.Ligand.binding_affinity_best]
            .copy()
            .sort_values(by="affinity")
        )
        self.higher_affinity_analogs = [
            project.Ligand.analogs[cid]
            for cid in self.higher_affinity_poses.index.get_level_values(0).unique()
        ]
        self.higher_interacting_poses = (
            df[
                df["total_num_interactions"]
                > project.Ligand.InteractionAnalysis.results["total_num_interactions"].max()
            ]
            .copy()
            .sort_values(by="total_num_interactions", ascending=False)
        )
        self.higher_interacting_analogs = [
            project.Ligand.analogs[cid]
            for cid in self.higher_interacting_poses.index.get_level_values(0).unique()
        ]
        self.higher_affinity_and_interacting_poses = df.loc[
            self.higher_affinity_poses.index.intersection(self.higher_interacting_poses.index)
        ]
        self.higher_affinity_and_interacting_analogs = [
            project.Ligand.analogs[cid]
            for cid in self.higher_affinity_and_interacting_poses.index.get_level_values(
                0
            ).unique()
        ]
        self.higher_affinity_and_interacting_and_druglike_analogs = [
            analog
            for analog in self.higher_affinity_and_interacting_analogs
            if analog.drug_score_total > project.Ligand.drug_score_total
        ]

        if (
            project.Specs.OptimizedLigands.selection_method
            is Consts.OptimizedLigands.SelectionMethods.SORTING
        ):
            df["affinity"] = df["affinity"].apply(abs)
            final_results_cids = (
                df.sort_values(
                    by=project.Specs.OptimizedLigands.selection_criteria,
                    ascending=False,
                )
                .index.get_level_values(0)
                .unique()
            )

        elif (
            project.Specs.OptimizedLigands.selection_method
            is Consts.OptimizedLigands.SelectionMethods.FUNCTION
        ):
            # The criteria string is read from the project specifications
            df["function_score"] = eval(project.Specs.OptimizedLigands.selection_criteria)
            final_results_cids = (
                df.sort_values(by="function_score", ascending=False)
                .index.get_level_values(0)
                .unique()
            )

        self.output = [project.Ligand.analogs[cid] for cid in final_results_cids][
            : int(project.Specs.OptimizedLigands.num_results)
        ]

    def show_higher_affinity_analogs(self):
        for analog in self.higher_affinity_analogs:
            display(analog())

    def show_higher_interacting_analogs(self):
        for analog in self.higher_interacting_analogs:
            display(analog())

    def show_higher_affinity_and_interacting_analogs(self):
        for analog in self.higher_affinity_and_interacting_analogs:
            display(analog())

    def show_higher_affinity_and_interacting_and_druglike_analogs(self):
        for analog in self.higher_affinity_and_interacting_and_druglike_analogs:
            display(analog())

    def show_final_output(self):
        for analog in self.output:
            display(analog())

    def __call__(self):
        def pprint(text1, text2):
            display(
                Markdown(
                    f"<span style='color:blue'>{text1}</span><span style='color:black'>{text2}</span>"
                )
            )

        pprint(
            "Number of docking poses with higher binding affinity than highest binding affinity of ligand: ",
            len(self.higher_affinity_poses),
        )
        pprint(
            "&nbsp;&nbsp;&nbsp;&nbsp;CIDs of analogs corresponding to these docking poses: ",
            [analog.cid for analog in self.higher_affinity_analogs],
        )
        pprint(
            "Number of docking poses with higher number of total interactions than highest interacting pose of ligand: ",
            len(self.higher_interacting_poses),
        )
        pprint(
            "&nbsp;&nbsp;&nbsp;&nbsp;CIDs of analogs corresponding to these docking poses: ",
            [analog.cid for analog in self.higher_interacting_analogs],
        )
        pprint(
            "Number of docking poses with higher affinity and number of total interactions than best corresponding poses of ligand: ",
            len(self.higher_affinity_and_interacting_poses),
        )
        pprint(
            "&nbsp;&nbsp;&nbsp;&nbsp;CIDs of analogs corresponding to these docking poses: ",
            [analog.cid for analog in self.higher_affinity_and_interacting_analogs],
        )
        pprint(
            "CIDs of analogs with higher binding affinity, number of total interactions and drug-likeness score than ligand: ",
            [analog.cid for analog in self.higher_affinity_and_interacting_and_druglike_analogs],
        )
        pprint(
            "**CIDs of selected analogs as final output:** ",
            [analog.cid for analog in self.output],
        )

        pprint("Comparison between the input ligand and optimized analog: ", "")

        self.comparison_dataframe = df = pd.DataFrame(columns=["Input Ligand", "Optimized Analog"])
        df.loc["Drug-Score"] = [
            self._project.Ligand.drug_score_total,
            self._project.OptimizedLigands.output[0].drug_score_total,
        ]
        df.loc["Highest Binding Affinity"] = [
            self._project.Ligand.binding_affinity_best,
            self._project.OptimizedLigands.output[0].binding_affinity_best,
        ]
        df.loc["Highest Number of Total Interactions"] = [
            self._project.Ligand.num_total_interactions_highest,
            self._project.OptimizedLigands.output[0].num_total_interactions_highest,
        ]
        display(self.comparison_dataframe)
        return

We now instantiate the class, and assign it to our project:

In [None]:
project1.OptimizedLigands = OptimizedLigands(project1)

The class has a ***\_\_call\_\_*** method, which prints out a summary:

In [None]:
project1.OptimizedLigands()

As you can see, ***the pipeline successfully found an analog, which is better than the input ligand in all of the three metrics, i.e. drug-likeness, binding affinity, and total number of protein-ligand interactions.***
We can also simply visualize each of the above groups of analogs. For example for visualizing the final output:

In [None]:
project1.OptimizedLigands.show_final_output()
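
To make the sorting-based selection more concrete, here is a minimal sketch of the idea on a toy table: poses are sorted by the chosen criteria, and the CIDs are then read off the first index level in order of first appearance. The values, the CIDs, and the helper `select_top_cids` are made up for illustration; only the index layout `(CID, mode)` and the column names follow the class above.

```python
import pandas as pd


def select_top_cids(results, criteria, num_results):
    """Rank poses by the given columns and return the best unique CIDs."""
    ranked = results.copy()
    # Affinities are negative; sort on their magnitude so larger is better.
    ranked["affinity"] = ranked["affinity"].abs()
    ranked = ranked.sort_values(by=criteria, ascending=False)
    # unique() keeps the order of first appearance, i.e. the ranking order.
    cids = ranked.index.get_level_values(0).unique()
    return list(cids[:num_results])


# Made-up docking poses, indexed by (CID, mode).
index = pd.MultiIndex.from_tuples(
    [("111", 1), ("111", 2), ("222", 1), ("333", 1)], names=["CID", "mode"]
)
toy_results = pd.DataFrame(
    {
        "affinity": [-8.0, -7.0, -9.0, -8.0],
        "total_num_interactions": [10, 12, 8, 11],
    },
    index=index,
)

top_cids = select_top_cids(toy_results, ["affinity", "total_num_interactions"], 2)
print(top_cids)
```

Listing several columns as criteria makes later columns act as tie-breakers: here the two poses with an affinity of -8.0 are ordered by their number of interactions.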

<a id='10_practical'></a>
### Putting the Pieces Together: A Fully Automated Pipeline
<br>
<div style="text-align: justify"> 
Now that we have implemented all the necessary parts of our pipeline, we can write a function to automatically run the whole pipeline and display the results. The function thus takes in the project name, the filepath of the input data, and the output path, and performs all the necessary processes to generate the final output. It will also print the intermediate results of each part of the pipeline, so that the process can be followed.</div>

In [None]:
def run_pipeline(project_name, input_data_filepath, output_data_root_folder_path):
    """
    Automatically run the whole lead optimization pipeline to completion,
    and print out a summary in real-time.

    Parameters
    ----------
    project_name : str
        Name of the lead optimization project.
    input_data_filepath : str or pathlib.Path object
        Filepath of the input CSV file containing all the specifications of the project.
    output_data_root_folder_path : str or pathlib.Path object
        Root folder path to save the pipeline's data in.

    Returns
    -------
        LeadOptimizationPipeline object
        Object containing all the information about the pipeline.
    """

    def pprint(markdown_list):
        markdown_command = ""
        for command in markdown_list:
            text, color = command
            markdown_command += f"<span style='color:{color}'>{text}</span>"
        display(Markdown(markdown_command))

    def pprint_header(header):
        display(
            Markdown(
                f"<span style='color:blue'>**{header}:** </span><span style='color:green'>Successful</span>"
            )
        )

    project = LeadOptimizationPipeline(project_name=project_name)
    pprint_header("1. Initializing Project")
    pprint([(f"&nbsp;&nbsp;&nbsp;&nbsp;Project name: **{project.name}**", "black")])

    project.Specs = Specs(
        input_data_filepath=input_data_filepath,
        output_data_root_folder_path=f"{output_data_root_folder_path}/{project.name}",
    )
    pprint_header("2. Initializing Input/Output")
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Input data read from: **{input_data_filepath}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Output folders created at: **{output_data_root_folder_path}/{project.name}**",
                "black",
            )
        ]
    )

    project.Protein = Protein(
        identifier_type=project.Specs.Protein.input_type,
        identifier_value=project.Specs.Protein.input_value,
        protein_output_path=project.Specs.OutputPaths.protein,
    )
    pprint_header("3. Processing Protein Data")
    display(project.Protein())

    project.Ligand = Ligand(
        identifier_type=project.Specs.Ligand.input_type,
        identifier_value=project.Specs.Ligand.input_value,
        ligand_output_path=project.Specs.OutputPaths.ligand,
    )
    pprint_header("4. Processing Ligand Data")
    display(project.Ligand())

    project.BindingSiteDetection = BindingSiteDetection(
        project.Protein,
        project.Specs.BindingSite,
        project.Specs.OutputPaths.binding_site_detection,
    )
    pprint_header("5. Binding Site Detection")
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Binding site definition method: **{project.Specs.BindingSite.definition_method.name}**",
                "black",
            )
        ]
    )
    if (
        project.Specs.BindingSite.definition_method
        is Consts.BindingSite.DefinitionMethods.DETECTION
    ):
        pprint(
            [
                (
                    f"&nbsp;&nbsp;&nbsp;&nbsp;Binding site detection method: **{project.Specs.BindingSite.detection_method.name}**",
                    "black",
                )
            ]
        )
        pprint(
            [
                (
                    f"&nbsp;&nbsp;&nbsp;&nbsp;Selection method for best binding site: **{project.Specs.BindingSite.selection_method.name}**",
                    "black",
                )
            ]
        )
        pprint(
            [
                (
                    f"&nbsp;&nbsp;&nbsp;&nbsp;Selection criteria for best binding site: **{project.Specs.BindingSite.selection_criteria}**",
                    "black",
                )
            ]
        )
        pprint(
            [
                (
                    f"&nbsp;&nbsp;&nbsp;&nbsp;Name of selected binding site: **{project.BindingSiteDetection.best_binding_site_name}**",
                    "black",
                )
            ]
        )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Binding site coordinates - center: **{project.Protein.binding_site_coordinates['center']}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Binding site coordinates - size: **{project.Protein.binding_site_coordinates['size']}**",
                "black",
            )
        ]
    )

    display(project.BindingSiteDetection.visualize_best())

    project.LigandSimilaritySearch = LigandSimilaritySearch(
        project.Ligand,
        project.Specs.LigandSimilaritySearch,
        project.Specs.OutputPaths.similarity_search,
    )
    pprint_header("6. Ligand Similarity Search")
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Search engine: **{project.Specs.LigandSimilaritySearch.search_engine.name}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Number of fetched analogs with a similarity higher than **{project.Specs.LigandSimilaritySearch.min_similarity_percent}%**: **{len(project.LigandSimilaritySearch.all_analogs)}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Dice-similarity range of fetched analogs: **{project.LigandSimilaritySearch.all_analogs['dice_similarity'].min()} - {project.LigandSimilaritySearch.all_analogs['dice_similarity'].max()}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;CID of analog with the highest Dice-similarity: **{project.LigandSimilaritySearch.all_analogs.sort_values(by='dice_similarity', ascending=False).head(1).index.values[0]}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Number of selected drug-like analogs: **{len(project.Ligand.analogs)}**",
                "black",
            )
        ]
    )
    sorted_analogs_df = project.LigandSimilaritySearch.all_analogs.sort_values(
        by="drug_score_total", ascending=False
    ).head(len(project.Ligand.analogs))
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Range of drug-likeness score in selected analogs: **{sorted_analogs_df['drug_score_total'].min()} - {sorted_analogs_df['drug_score_total'].max()}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;CID of analog with the highest drug-likeness score: **{sorted_analogs_df.head(1).index.values[0]}**",
                "black",
            )
        ]
    )

    project.Docking = Docking(
        project.Protein,
        list(project.Ligand.analogs.values()),
        project.Specs.Docking,
        project.Specs.OutputPaths.docking,
    )
    project.Ligand.Docking = Docking(
        project.Protein,
        [project.Ligand],
        project.Specs.Docking,
        project.Specs.OutputPaths.ligand,
    )
    pprint_header("7. Docking Experiment")
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Highest binding affinity of input ligand: **{project.Ligand.binding_affinity_best}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Binding affinity range of analogs: **{project.Docking.results_dataframe['affinity[kcal/mol]'].max()} - {project.Docking.results_dataframe['affinity[kcal/mol]'].min()}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Number of analog docking poses with higher affinity than ligand: **{len(project.Docking.results_dataframe[project.Docking.results_dataframe['affinity[kcal/mol]']<project.Ligand.binding_affinity_best].sort_values(by=['affinity[kcal/mol]']))}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Number of analogs with higher affinity than ligand: **{len(set(project.Docking.results_dataframe[project.Docking.results_dataframe['affinity[kcal/mol]']<project.Ligand.binding_affinity_best].sort_values(by=['affinity[kcal/mol]']).index.get_level_values(0)))}**",
                "black",
            )
        ]
    )

    highest_aff_cids = set(
        project.Docking.results_dataframe[
            project.Docking.results_dataframe["affinity[kcal/mol]"]
            < project.Ligand.binding_affinity_best
        ]
        .sort_values(by=["affinity[kcal/mol]"])
        .index.get_level_values(0)
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;CIDs of analogs with higher affinity than ligand: **{highest_aff_cids}**",
                "black",
            )
        ]
    )

    display(project.Docking.visualize_all_poses())

    project.InteractionAnalysis = InteractionAnalysis(
        project.Docking.pdbqt_filepath_extracted_protein,
        project.Docking.pdb_filepath_extracted_protein,
        project.Protein.residue_number_first,
        list(project.Ligand.analogs.values()),
        project.Docking.master_df,
        project.Specs.InteractionAnalysis,
        project.Specs.OutputPaths.interaction_analysis,
    )

    project.Ligand.InteractionAnalysis = InteractionAnalysis(
        project.Docking.pdbqt_filepath_extracted_protein,
        project.Docking.pdb_filepath_extracted_protein,
        project.Protein.residue_number_first,
        [project.Ligand],
        project.Ligand.Docking.master_df,
        project.Specs.InteractionAnalysis,
        project.Specs.OutputPaths.ligand,
    )
    pprint_header("8. Protein-Ligand Interaction Analysis")
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Highest number of total interactions in a docking pose of input ligand: **{project.Ligand.InteractionAnalysis.results.sort_values(by='total_num_interactions',ascending=False).head(1)['total_num_interactions'].values[0]}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Range of total number of interactions in docking poses of analogs: **{project.InteractionAnalysis.results['total_num_interactions'].min()} - {project.InteractionAnalysis.results['total_num_interactions'].max()}**",
                "black",
            )
        ]
    )
    pprint(
        [
            (
                f"&nbsp;&nbsp;&nbsp;&nbsp;Correlation plot between binding affinity and number of interactions in docking poses of analogs:",
                "black",
            )
        ]
    )
    project.InteractionAnalysis.plot_interaction_affinity_correlation()

    display(project.InteractionAnalysis.visualize_all_interactions())

    pprint_header("9. Selecting The Optimized Analog")
    project.OptimizedLigands = OptimizedLigands(project)
    project.OptimizedLigands()
    print("\n")
    pprint([("**Selected analogs as final output:**", "blue")])
    project.OptimizedLigands.show_final_output()

    cid_pose_nr_list = []
    for final_analog in project.OptimizedLigands.output:
        for docking_pose_nr in range(1, len(final_analog.dataframe_docking) + 1):
            cid_pose_nr_list.append((final_analog.cid, docking_pose_nr))
    display(project.InteractionAnalysis.visualize_docking_poses_interactions(cid_pose_nr_list))

    pprint_header("10. Pipeline Completed")

    return project

By calling the above function with the necessary parameters, the whole pipeline will automatically run, and we will get a full summary of the pipeline processes and obtain the final results.

For this demonstration, let's use the optimized ligand we just found as the input ligand, to see whether we can improve on our previous optimization results and obtain an even better analog. We first need to create the corresponding input file. To do so, we can simply open the input file of our project, modify the ligand input value, and save the file as a new input file:

In [None]:
new_input_df = pd.read_csv(project1.Specs.RawData.filepath)
new_input_df.head()

In [None]:
new_input_df.loc[3, "Value"] = project1.OptimizedLigands.output[0].smiles
new_input_df.head()

In [None]:
# index=False keeps the column layout identical to the original input file
new_input_df.to_csv("data/PipelineInputData_Project2.csv", index=False)
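
When saving a modified table with pandas, it matters whether the row index is written, since a file written with the index gains an extra unnamed column when it is read back. The sketch below shows this with an in-memory buffer; apart from `Value`, the column names and all entries are placeholders and not the actual layout of the pipeline's input file.

```python
import io

import pandas as pd


def roundtrip_columns(df, write_index):
    """Write df to CSV in memory, read it back, and return the column names."""
    buffer = io.StringIO()
    df.to_csv(buffer, index=write_index)
    buffer.seek(0)
    return list(pd.read_csv(buffer).columns)


# Placeholder table; only the "Value" column name is taken from the notebook.
toy_input = pd.DataFrame(
    {
        "Subject": ["Protein", "Ligand"],
        "Property": ["Input Value", "Input Value"],
        "Value": ["placeholder_protein", "placeholder_ligand"],
    }
)

columns_with_index = roundtrip_columns(toy_input, write_index=True)
columns_without_index = roundtrip_columns(toy_input, write_index=False)
print(columns_with_index)
print(columns_without_index)
```

Writing with `index=False` therefore keeps the new file in the same column layout as the one that was originally read.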

<div style="text-align: justify"> 
Now let's run the pipeline again, this time fully automated and with the already optimized ligand. Other than printing out a short summary of results in real-time, the function also returns the whole project at the end. Thus, by assigning the return value to a variable (here <b><i>project2</i></b>), we can later investigate the information generated by the pipeline in more detail.</div>

In [None]:
project2 = run_pipeline(
    "Project2_EGFR_CID11292933",
    input_data_filepath="data/PipelineInputData_Project2.csv",
    output_data_root_folder_path="data/Outputs",
)

<a id='discussion'></a>
## Discussion
<br>
<div style="text-align: justify"> 
In this talktorial, we have successfully implemented a <b>fully-automated virtual screening pipeline, particularly targeted at hit-expansion and lead-optimization phases of a drug discovery project</b>. As the input, the pipeline accepts a single CSV data-file, specifying a protein structure (either as PDB-code or local PDB-file), a ligand (as one of the many possible identifiers, e.g. SMILES, CID, InChI etc.), and several other settings regarding the different processes of the pipeline. It then <b>performs all the necessary processes in order to arrive at an optimized analog of the provided ligand, in terms of binding affinity, number of protein-ligand interactions and drug-likeness</b>. The process starts with <b>binding-site definition</b>, which can be carried out via several provided methods, including binding-site detection by the <i>DoGSiteScorer</i> functionality of the <i>ProteinsPlus</i> webserver. The next step is <b>ligand similarity-search</b>, which is performed using the <i>PubChem</i> web-services via the provided API backend. Subsequently, the structural analogs detected via the similarity search are filtered based on their physicochemical properties, in order to <b>select the most drug-like analogs</b>. These are then assessed in a virtual screening process using the <i>Smina</i> program, in order to <b>calculate their binding affinities and corresponding binding-poses</b>. Using the <i>PLIP</i> package, these generated docking poses can be <b>analyzed in terms of their specific non-covalent interactions with the target protein</b>, and analogs displaying a specific set of desired interactions can be selected, e.g. in order to achieve <b>higher selectivities for the target protein</b>. 
After all required processes are carried out, the pipeline can <b>output one or several optimized analogs</b>, based on the criteria defined by the user in the input data, such as binding affinity, physicochemical and pharmacokinetic properties, and specific interactions with the protein structure. 
<br><br><div style="text-align: justify"> 
The code is specifically structured in such a way that it is easy to digest, maintain, and expand. For each functionality of the pipeline, one or several helper classes are defined, containing static methods that are required to develop a certain part of the pipeline. While this may have added some extra workload in the development of the pipeline, it avoids long classes that would make the demonstration impractical, and thus makes the code more readable and communicative. Moreover, these <b>static methods can be easily reused in other projects as well</b>. On the other hand, the pipeline is also contained in a single class, with sub-classes for every process, which makes access to all parts of the data much easier. Furthermore, after each part of the pipeline is separately developed and demonstrated, a function is provided at the end of the pipeline, which puts the whole program together and <b>automatically runs a given project to completion, merely by providing an input specification file</b>. Since the pipeline provides many more functionalities and options than the scope of this talktorial allows us to demonstrate, we have also provided a <i>Supplementary Information</i> section, which showcases the possibilities of the pipeline in more detail. Furthermore, we have structured the code to be easily expandable, so it can be freely adapted and built upon for various situations and needs.<br><br>
During the talktorial we demonstrated the pipeline with a sample project, using EGFR (PDB-code: 3W32) as the target protein, and a promising inhibitor with the ChEMBL-ID CHEMBL328216. The abilities of the pipeline were successfully displayed with this example, where at the end of the pipeline <b>we were able to identify an analog with much higher binding affinity and drug-likeness score</b> (Figure 9). Moreover, to showcase the fully automated version of the pipeline at the end, we re-ran the pipeline using this optimized ligand as the input. We were then delighted to see that <b>a second run could still improve the ligand and result in yet another analog with higher binding affinity</b>. The automated function at the end of the pipeline also prints out a full report and interactive visualization of the processes while running, allowing the user to have a better understanding of the inner workings of the program in real-time. 
</div><br>
<p style="text-align:center;">
<img src="images/fig9.png" width="950" height=auto class="center"/>
</p>
<b>Figure 9.</b> Results of the pipeline's performance in the demonstrated examples. Starting with a ligand with a binding affinity of -8.7 kcal/mol, the pipeline was able to find an optimized ligand with an improved affinity of -10.3 kcal/mol. Subjecting this optimized ligand to the same process further improved the binding affinity, and a yet more suitable analog was found with a binding affinity of -10.9 kcal/mol. <i><b>Note:</b></i> The results of the pipeline are not fully deterministic, due to several randomized events taking place during the processes. Therefore, re-running the pipeline under the same conditions may sometimes result in different outputs. </div>
    
    

<a id='quiz'></a>
## Quiz
**Conceptual Questions:**
1. Describe the processes involved in the hit-to-lead and lead optimization phases of a drug design pipeline. How are these processes implemented in the virtual pipeline described in this talktorial?
2. Describe the importance of binding-site definition in a virtual screening pipeline. What options are available for binding-site definition and -selection in this pipeline? 
3. What are the possibilities for a virtual ligand-derivatization process? Which one is used in this pipeline? 
4. Why do some of the obtained analogs have a lower calculated similarity measure than the threshold specified in the input data?
5. Which criteria are used in this pipeline to filter the initially obtained analogs for the docking experiment?
6. Why don't we carry out the docking experiment for all initially obtained analogs, so that we can choose the best candidate between all available options at the end?
7. What are the general preparative steps for performing a docking experiment?
8. Why do we calculate protein-ligand interactions for each docking pose, although these interactions have been implicitly taken into account by the scoring function of the docking algorithm?
9. How can the interaction-analysis data be used to select an analog with higher selectivity for the target protein?
10. Why do we have to visualize the docking poses and their corresponding interactions and manually inspect them? What criteria should we be looking for in these inspections?
11. What criteria are generally used for selecting the best optimized ligand at the end of the pipeline? What options does our pipeline offer?

**Exercises:**
1. In order to prevent the talktorial from being too lengthy, we only demonstrated a sub-set of the options and information that the pipeline offers. Execute the pipeline and explore its other options by looking up the attributes and methods of different classes comprising the pipeline. Since we have organized the whole project in a ***LeadOptimizationPipeline*** instance, this should be a fairly easy task. 
2. Try to implement a loop, where the final output of a pipeline is re-entered as the input for a new pipeline (i.e. try to optimize your initial input ligand through several runs, same as we did for the demonstration of the ***run_pipeline*** function). On average, after how many cycles does the pipeline reach a plateau where no better analog can be found anymore?
3. Generate your own input CSV-files, each time choosing a different set of specifications, and compare the results. Can you find a set of specifications that performs considerably better than the others?
4. Since this was a relatively large project and we have also offered many different options for each process of the pipeline, it may be the case that some options are not fully thought-out and developed. Try to run the pipeline with all possible combinations of specifications, and see if you can find a problem in any of the branches. 

<a id='supp'></a>
## Supplementary Information
<br>
<div style="text-align: justify"> 
Here we will also demonstrate the functions of each helper-class used in the pipeline, to provide a better understanding of the inner workings of the code.<br><br>
    <b><i>Note:</i></b> Some cells use variables defined in earlier cells. Therefore, in order for all cells to work you have to run them sequentially.
</div>

<a id='io_demo'></a>
### Demonstration of The ***IO*** Helper-Class
Using this class, the input CSV file can be easily imported via the ***create_dataframe_from_csv_input_file*** function, by specifying the corresponding filepath, the name of the columns to be used as indices for the dataframe, and the name of the other columns we want to keep:

In [None]:
example_input_df = IO.create_dataframe_from_csv_input_file(
    input_data_filepath="data/PipelineInputData_Project1.csv",
    list_of_index_column_names=[
        Consts.DataFrame.ColumnNames.SUBJECT.value,
        Consts.DataFrame.ColumnNames.PROPERTY.value,
    ],
    list_of_columns_to_keep=[Consts.DataFrame.ColumnNames.VALUE.value],
)

Note that for inputting the ***list_of_index_column_names*** and ***list_of_columns_to_keep*** we didn't write the name of the columns (i.e. '***Subject***', '***Property***' and '***Value***') explicitly (although we could), but used the respective constants stored in the ***Consts*** class. In this way, if the column names in the input file are changed in the future, they only need to be corrected in the ***Consts*** class, and nowhere else in the code.
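
The idea of keeping such names in one place can be sketched with Python's `enum` module. The flat `ColumnNames` class here is a simplified stand-in for the nested ***Consts*** class, and the rows of the example file are made up for illustration.

```python
import csv
import io
from enum import Enum


class ColumnNames(Enum):
    """Column names of the input file, defined in one place only."""

    SUBJECT = "Subject"
    PROPERTY = "Property"
    VALUE = "Value"


# Stand-in for an input file; the rows are made up for illustration.
csv_text = """Subject,Property,Value
Protein,Input Value,3W32
Ligand,Input Type,smiles
"""

# All downstream code refers to the enum members, never to the raw strings.
records = {}
for row in csv.DictReader(io.StringIO(csv_text)):
    key = (row[ColumnNames.SUBJECT.value], row[ColumnNames.PROPERTY.value])
    records[key] = row[ColumnNames.VALUE.value]

print(records[("Protein", "Input Value")])
```

If the header of the input file were renamed, only the three strings inside `ColumnNames` would need to change.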

We can now view the imported data in its entirety:

In [None]:
example_input_df

Moreover, we can also extract all the data corresponding to a specific ***Subject***, using the ***copy_series_from_dataframe*** function and specifying the main dataframe and the name of the index-level and column that we want to extract. For example, to extract the binding-site specification data:

In [None]:
example_input_protein_data = IO.copy_series_from_dataframe(
    input_df=example_input_df,
    index_name=Consts.DataFrame.SubjectNames.BINDING_SITE.value,
    column_name=Consts.DataFrame.ColumnNames.VALUE.value,
)

# Displaying the extracted data:
example_input_protein_data
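
How such an extraction can work on a dataframe with a two-level index is sketched here with plain pandas. This is an illustration under assumptions and not necessarily how ***copy_series_from_dataframe*** is implemented; the subjects, properties and values in the table are made up.

```python
import pandas as pd

# Stand-in input table; the subjects, properties and values are made up.
df = pd.DataFrame(
    {
        "Subject": ["Protein", "Binding Site", "Binding Site"],
        "Property": ["Input Value", "Definition Method", "Protein Chain-ID"],
        "Value": ["3W32", "detection", "A"],
    }
).set_index(["Subject", "Property"])

# xs selects all rows of one Subject and drops that index level, leaving
# a table indexed by Property; copy() decouples the result from df.
binding_site_series = df.xs("Binding Site", level="Subject")["Value"].copy()
print(binding_site_series["Protein Chain-ID"])
```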

Later, for storing the output data we will also need the ***create_folder*** function, which can create a folder (and all its necessary parent folders) with a given name at a given path. The function then returns the full path of the created folder. We now create a folder named *Examples*, which we will use to store the files used in the demonstration sections.

In [None]:
example_output_path = IO.create_folder(folder_name="Examples", folder_path="data/Outputs")

# Displaying the full path of the generated folder:
example_output_path
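
A folder-creating helper of this kind can be written in a few lines with `pathlib`. The function `create_folder_sketch` is a hypothetical stand-in written for this illustration, not the implementation of ***create_folder***.

```python
import tempfile
from pathlib import Path


def create_folder_sketch(folder_name, folder_path):
    """Create folder_path/folder_name (and missing parents); return its path."""
    new_folder = Path(folder_path) / folder_name
    # parents=True creates intermediate folders, exist_ok=True makes reruns safe.
    new_folder.mkdir(parents=True, exist_ok=True)
    return new_folder


# Demonstrated in a temporary directory so that nothing is left behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    created = create_folder_sketch("Examples", Path(tmp_dir) / "data" / "Outputs")
    folder_exists = created.is_dir()
    print(created.name, folder_exists)
```

Returning a `Path` object rather than a string allows filepaths to be composed with the `/` operator.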

<a id='pdb_demo'></a>
### Demonstration of The ***PDB*** Helper-Class
We can use the ***fetch_and_save_pdb_file*** function to download a protein's PDB file using its PDB-code. The function takes in a full filepath, including the name of the file (but excluding the .PDB extension), and returns the full filepath (including the .PDB extension).

In [None]:
example_downloaded_protein_filepath = PDB.fetch_and_save_pdb_file(
    pdb_code="3w32", output_filepath=example_output_path / "3W32"
)

example_downloaded_protein_filepath

The PDB file is now downloaded on the disk. Now, the file content of the PDB file can be obtained as follows:

In [None]:
example_pdb_file_content = PDB.read_pdb_file_content(
    input_type="pdb_filepath", input_value=example_downloaded_protein_filepath
)

# Here we are only showing the first 800 characters, since the full content is too long:
example_pdb_file_content[:800]

Note that alternatively we also could have set the ***input_type*** to 'pdb_code' and entered a valid PDB-code as the ***input_value*** to directly fetch and read the file contents of a protein PDB file from the PDB webserver. However, in that case, the PDB file itself would not have been saved on the disk.

This file content can then be used to extract some useful information from the PDB file:

In [None]:
PDB.extract_info_from_pdb_file_content(example_pdb_file_content)

Or directly without storing the PDB file content in a variable (here using another PDB-code as example):

In [None]:
PDB.extract_info_from_pdb_file_content(
    pdb_file_text_content=PDB.read_pdb_file_content(input_type="pdb_code", input_value="5gty")
)

The ***load_pdb_file_as_dataframe*** function can also be used, but it does not contain all the information that the ***extract_info_from_pdb_file_content*** function can extract. The function returns a dictionary, where each value is a *Pandas DataFrame*:

In [None]:
example_pdb_dataframe = PDB.load_pdb_file_as_dataframe(example_pdb_file_content)
example_pdb_dataframe.keys()

For example, all the information regarding the atoms in the protein can be accessed via the ***ATOM*** key. This data was used in the pipeline, for example to extract the range of residue numbers of the protein.

In [None]:
example_pdb_dataframe["ATOM"]
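
To illustrate where the residue numbers come from, the following sketch reads them directly from the fixed-width ATOM records of the PDB format, in which columns 23-26 hold the residue sequence number. It works on made-up records and does not use the dataframe representation shown above; both function names are hypothetical.

```python
def format_atom_line(serial, atom_name, res_name, chain_id, res_seq):
    """Build the first 26 columns of a PDB ATOM record (fixed-width layout)."""
    return f"ATOM  {serial:>5} {atom_name:<4} {res_name:>3} {chain_id}{res_seq:>4}"


def residue_number_range(pdb_text):
    """Return (lowest, highest) residue number found in the ATOM records."""
    residue_numbers = [
        int(line[22:26])  # columns 23-26 hold the residue sequence number
        for line in pdb_text.splitlines()
        if line.startswith("ATOM")
    ]
    return min(residue_numbers), max(residue_numbers)


# Made-up records for illustration; real records continue with coordinates.
example_text = "\n".join(
    [
        "HEADER    EXAMPLE",
        format_atom_line(1, "N", "GLY", "A", 696),
        format_atom_line(2, "CA", "GLY", "A", 696),
        format_atom_line(3, "N", "LEU", "A", 1017),
    ]
)
print(residue_number_range(example_text))
```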

<a id='pubchem_demo'></a>
### Demonstration of The ***PubChem*** Helper-Class
The ***convert_compound_identifier*** function can be used to get a specific compound's identifier, by providing another identifier. For example, getting the SMILES of Aspirin, providing its name: 

In [None]:
PubChem.convert_compound_identifier(
    input_id_type="name", input_id_value="aspirin", output_id_type="smiles"
)

Or getting its IUPAC name:

In [None]:
PubChem.convert_compound_identifier(
    input_id_type="name", input_id_value="aspirin", output_id_type="iupac_name"
)

Or getting its name from its SMILES:

In [None]:
PubChem.convert_compound_identifier(
    input_id_type="smiles",
    input_id_value="CC(=O)OC1=CC=CC=C1C(=O)O",
    output_id_type="name",
)

Or getting its CID from its name:

In [None]:
PubChem.convert_compound_identifier(
    input_id_type="name", input_id_value="aspirin", output_id_type="cid"
)
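
Behind such conversions is the PubChem PUG REST interface, whose URLs follow the pattern input/operation/output. The sketch here only builds the URL for a CID lookup and sends no request. Whether the ***PubChem*** helper-class assembles its requests in this way is an assumption, and `build_cid_lookup_url` is a hypothetical name.

```python
from urllib.parse import quote

PUG_REST_BASE = "https://pubchem.ncbi.nlm.nih.gov/rest/pug"


def build_cid_lookup_url(input_id_type, input_id_value):
    """Build a PUG REST URL that asks for the CID(s) of a compound."""
    # quote() percent-encodes characters that are not allowed in a URL path.
    encoded_value = quote(input_id_value, safe="")
    return f"{PUG_REST_BASE}/compound/{input_id_type}/{encoded_value}/cids/JSON"


print(build_cid_lookup_url("name", "aspirin"))
```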

Using the ***get_compound_record*** function, all available records on the molecule can be obtained in a dictionary with following keys: '***id***', '***atoms***', '***bonds***', '***coords***', '***charge***', '***props***', '***count***'.
As an example, we access the '***props***' key, which contains the physicochemical properties of the compound: 

In [None]:
# Only showing the first 2 entries, since the full list is too long:
PubChem.get_compound_record(input_id_type="name", input_id_value="aspirin")["props"][:2]

The ***get_description_from_smiles*** function provides a textual description of the compound. However, it is not available for every compound. If the parameter ***printout*** is set to False, the function will return a dictionary containing the descriptions, sources, and other information. By setting it to True, only the description part will be printed out. For example, for aspirin: 

In [None]:
PubChem.get_description_from_smiles(smiles="CC(=O)OC1=CC=CC=C1C(=O)O", printout=True)

<a id='rdkit_demo'></a>
### Demonstration of The ***RDKit*** Helper-Class
To use any other function in the ***RDKit*** class, we first have to generate an RDKit molecule object, using the ***create_molecule_object*** function, which can accept several different identifiers as input, for example the SMILES (here again using Aspirin as an example):

In [None]:
example_mol_obj = RDKit.create_molecule_object(
    input_type="smiles", input_value="CC(=O)OC1=CC=CC=C1C(=O)O"
)

The RDKit molecule objects can be called directly, which simply visualizes the molecule:

In [None]:
example_mol_obj

The ***calculate_druglikeness*** function can then be used to calculate several molecular properties and drug-likeness scores, which will be returned as a dictionary:

In [None]:
RDKit.calculate_druglikeness(mol_obj=example_mol_obj)

The ***calculate_similarity_dice*** function is used to calculate the Dice similarity metric between two ligands, using 4096-bit Morgan fingerprints with a radius of 2. Here, as an example we will input the same molecule twice, which results in a similarity score of 1 (i.e. 100% similarity): 

In [None]:
RDKit.calculate_similarity_dice(mol_obj1=example_mol_obj, mol_obj2=example_mol_obj)
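
For two fingerprints with on-bit sets A and B, the Dice coefficient is 2·|A ∩ B| / (|A| + |B|). The sketch here computes it in plain Python on made-up on-bit positions, so that the formula can be followed without RDKit. It illustrates the metric only and does not generate Morgan fingerprints.

```python
def dice_similarity(on_bits_a, on_bits_b):
    """Dice coefficient of two fingerprints given as sets of on-bit positions."""
    total_bits = len(on_bits_a) + len(on_bits_b)
    if total_bits == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return 2 * len(on_bits_a & on_bits_b) / total_bits


# Made-up on-bit positions within a 4096-bit fingerprint.
fingerprint_a = {12, 305, 1880, 4001}
fingerprint_b = {12, 305, 2047}

print(dice_similarity(fingerprint_a, fingerprint_a))  # identical fingerprints
print(dice_similarity(fingerprint_a, fingerprint_b))  # two of seven bits shared
```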

Using the ***save_molecule_image_to_file*** function, an image file containing the ligand's structure can be saved at a given path as well: 

In [None]:
RDKit.save_molecule_image_to_file(
    mol_obj=example_mol_obj, filepath=example_output_path / "aspirin"
)

In addition, we can use the ***save_3D_molecule_to_SDfile*** function, which first prepares the molecule for docking experiments (i.e. adds hydrogen atoms, creates 3D conformation and runs energy minimizations etc.) and then saves it to an SDF file: 

In [None]:
RDKit.save_3D_molecule_to_SDfile(mol_obj=example_mol_obj, filepath=example_output_path / "aspirin")

<a id='dogsitescorer_demo'></a>
### Demonstration of The ***DoGSiteScorer*** Class
If we want to detect the binding sites of a protein from a local PDB file, we should first upload the file to the *DoGSiteScorer* webserver, using the ***upload_pdb_file*** function. For this, we will use the PDB file we downloaded earlier: 

In [None]:
example_dummy_pdb_id = DoGSiteScorer.upload_pdb_file(filepath=example_downloaded_protein_filepath)

example_dummy_pdb_id

The ***upload_pdb_file*** function returns a dummy PDB-code for the uploaded structure, which can be used in place of a valid PDB-code to submit a binding-site detection job. Notice that the ***submit_job*** function can also take in a chain-ID to limit the detection to that specific chain, and a ligand-ID to also calculate the coverage of each detected binding-site. However, the ligand-ID that *DoGSiteScorer* accepts has its own format, and is not the same as the ligand-ID in the protein PDB file. Instead, it has the following format: (ligand-ID)\_(chain-ID)\_(ligand residue-ID). In the ***BindingSiteDetection*** class, we circumvent this by automatically generating the *DoGSiteScorer* ligand-ID from the normal ligand-ID, so that the user does not have to manually look up and enter the ligand-ID in this specific format. 

In [None]:
example_binding_site_data = DoGSiteScorer.submit_job(
    pdb_id=example_dummy_pdb_id, ligand_id="W32_A_1101", chain_id="A"
)

example_binding_site_data
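
One way to generate an identifier in this format automatically is to look up the chain-ID and residue number of the ligand in the HETATM records of the PDB file. The sketch here does this on a made-up record; it is an assumption about how such a conversion can be done, not the code of the ***BindingSiteDetection*** class, and both function names are hypothetical.

```python
def format_hetatm_line(serial, atom_name, res_name, chain_id, res_seq):
    """Build the first 26 columns of a PDB HETATM record (fixed-width layout)."""
    return f"HETATM{serial:>5} {atom_name:<4} {res_name:>3} {chain_id}{res_seq:>4}"


def find_dogsitescorer_ligand_id(pdb_text, ligand_id):
    """Return '(ligand-ID)_(chain-ID)_(ligand residue-ID)' for the first match."""
    for line in pdb_text.splitlines():
        # Columns 18-20 hold the residue name, i.e. the ligand-ID.
        if line.startswith("HETATM") and line[17:20].strip() == ligand_id:
            chain_id = line[21]  # column 22
            residue_number = int(line[22:26])  # columns 23-26
            return f"{ligand_id}_{chain_id}_{residue_number}"
    raise ValueError(f"Ligand {ligand_id!r} not found in HETATM records.")


# Made-up record for illustration.
example_text = format_hetatm_line(2500, "C1", "W32", "A", 1101)
print(find_dogsitescorer_ligand_id(example_text, "W32"))
```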

Alternatively, we can also use a valid PDB-code to directly submit a binding-site detection job:  

In [None]:
DoGSiteScorer.submit_job(
    pdb_id=project1.Protein.pdb_code,
    ligand_id="W32_A_1101",
    chain_id=project1.Protein.chains[0],
)

As can be seen, the ***submit_job*** function returns a DataFrame containing all the detected binding-sites and their respective properties. It also contains the URLs of the PDB and CCP4 files for each of the detected binding-sites, which will be used to download the files later. Among the calculated properties, the most important are ***drugScore***, ***simpleScore***, ***volume***, as well as ***lig_cov*** and ***poc_cov*** for when a ligand-ID is also inputted. The list of all properties can be accessed by retrieving the column names of the DataFrame: 

In [None]:
example_binding_site_data.columns

The ***simpleScore*** is a simple druggability score, based on a linear combination of the three descriptors describing volume, hydrophobicity and enclosure. In addition, a subset of meaningful descriptors is incorporated in a support vector machine (libsvm) to predict a druggability score called ***drugScore***, which has a value between 0 and 1 (the higher the score the more druggable the binding-site is estimated to be). ***lig_cov*** gives the percentage of the ligand volume that is covered by the binding-site, and ***poc_cov*** gives the percentage of the binding-site volume that is covered by the ligand. <br>

Depending on the specifics of the project, these values can be used to select the most-suitable detected binding-site. Here, we have implemented two possibilities for this selection process in the ***select_best_pocket*** function. The first option is to provide a list of properties in order of importance, based on which the binding-site with the highest or lowest value is to be chosen. The function then returns the name of the selected binding-site (The list of properties should be inputted as a list):

In [None]:
example_best_pocket_name = DoGSiteScorer.select_best_pocket(
    binding_site_df=example_binding_site_data,
    selection_method="sorting",
    selection_criteria=["lig_cov", "poc_cov"],
    ascending=False,
)

example_best_pocket_name

For example, the above ***selection_criteria*** sorts the binding-sites by their ***lig_cov*** values, and if there are two or more binding-sites with the same ***lig_cov*** value, it then sorts them by their ***poc_cov*** values.<br> 
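
The tie-breaking behaviour described above corresponds to sorting by several columns at once, which can be sketched with pandas on a made-up pocket table. The pocket names and values are invented for illustration.

```python
import pandas as pd

# Made-up pocket table; real values come from the binding-site detection job.
pockets = pd.DataFrame(
    {"lig_cov": [85.0, 85.0, 40.0], "poc_cov": [30.0, 55.0, 90.0]},
    index=["P_0", "P_1", "P_2"],
)

# Later criteria only break ties left by earlier ones.
ranked = pockets.sort_values(by=["lig_cov", "poc_cov"], ascending=False)
best_pocket_name = ranked.index[0]
print(best_pocket_name)
```

Here `P_0` and `P_1` tie on ***lig_cov***, so the higher ***poc_cov*** of `P_1` decides the selection.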

Another possibility for selecting a binding site is to provide any valid Python expression that generates a list-like object with the same length as the number of detected binding-sites. This Python expression is inputted as a string, so that the user can directly enter it in the input CSV file, and it will be evaluated at runtime. For example, using this method we can perform any calculation on the properties in the binding-site DataFrame (note that for referring to the binding-site dataframe, **'df'** should be used):

In [None]:
DoGSiteScorer.select_best_pocket(
    binding_site_df=example_binding_site_data,
    selection_method="function",
    selection_criteria="((df['lig_cov'] + df['poc_cov']) / 2) * ((df['drugScore'] + df['simpleScore']) / 2) / df['volume']",
    ascending=False,
)

For example, the above ***selection_criteria*** is a function that calculates the average of ***lig_cov*** and ***poc_cov***, as well as the average of ***drugScore*** and ***simpleScore***, multiplies the two together and then divides the result by the ***volume***. The binding-site that has the highest value is then chosen (if we want to have the lowest value, we can set the ***ascending*** parameter to True). Note that the dataframe should always be referred to as **'df'**. Another example is: 

In [None]:
DoGSiteScorer.select_best_pocket(
    binding_site_df=example_binding_site_data,
    selection_method="function",
    selection_criteria="df[['drugScore', 'simpleScore']].min(axis=1) * df[['poc_cov', 'lig_cov']].max(axis=1)",
    ascending=False,
)

The above example calculates, for each binding-site, the minimum of ***drugScore*** and ***simpleScore***, and multiplies the result by the maximum of ***poc_cov*** and ***lig_cov***. 
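
How an expression given as a string can be turned into a ranking is sketched here with Python's built-in `eval` on a made-up pocket table. That the pipeline evaluates the expression in this way is an assumption. Since `eval` executes arbitrary code, such a mechanism should only be used with input files from a trusted source.

```python
import pandas as pd

# Made-up pocket table for illustration.
pockets = pd.DataFrame(
    {
        "drugScore": [0.80, 0.60, 0.30],
        "simpleScore": [0.50, 0.70, 0.20],
        "lig_cov": [90.0, 60.0, 10.0],
        "poc_cov": [40.0, 80.0, 5.0],
    },
    index=["P_0", "P_1", "P_2"],
)

expression = (
    "df[['drugScore', 'simpleScore']].min(axis=1) * df[['poc_cov', 'lig_cov']].max(axis=1)"
)

# The string is evaluated with the pocket table bound to the name 'df';
# built-in functions are hidden from the expression.
scores = eval(expression, {"__builtins__": {}}, {"df": pockets})

# With descending order, the pocket with the highest score comes first.
best_pocket_name = scores.sort_values(ascending=False).index[0]
print(best_pocket_name)
```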

We now want to calculate the coordinates of the selected binding-site, so that we can use it in the docking process. For this, we first need to download the PDB files of the detected binding-sites, using the ***save_binding_sites_to_file*** function: 

In [None]:
DoGSiteScorer.save_binding_sites_to_file(
    binding_site_df=example_binding_site_data, output_path=example_output_path
)

Having downloaded the necessary binding-site files, we can now calculate the coordinates of the selected binding-site: 

In [None]:
DoGSiteScorer.calculate_pocket_coordinates_from_pocket_pdb_file(
    filepath=example_output_path / example_best_pocket_name
)
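
A docking search box is commonly described by a center and an edge length along each axis. One simple way to derive both from a pocket PDB file is the bounding box of the pocket atoms, sketched here on made-up records. Whether ***calculate_pocket_coordinates_from_pocket_pdb_file*** uses exactly this approach is an assumption; both function names in the sketch are hypothetical.

```python
def format_pocket_atom_line(serial, x, y, z):
    """Build a PDB ATOM record up to the coordinates (columns 31-54)."""
    return f"ATOM  {serial:>5}  C   ALA A{serial:>4}    {x:8.3f}{y:8.3f}{z:8.3f}"


def pocket_center_and_size(pdb_text):
    """Return the center and edge lengths of the atoms' bounding box."""
    coords = [
        (float(line[30:38]), float(line[38:46]), float(line[46:54]))
        for line in pdb_text.splitlines()
        if line.startswith(("ATOM", "HETATM"))
    ]
    minima = [min(axis) for axis in zip(*coords)]
    maxima = [max(axis) for axis in zip(*coords)]
    center = [(low + high) / 2 for low, high in zip(minima, maxima)]
    size = [high - low for low, high in zip(minima, maxima)]
    return {"center": center, "size": size}


# Made-up pocket atoms for illustration.
example_text = "\n".join(
    [
        format_pocket_atom_line(1, 10.0, -4.0, 2.5),
        format_pocket_atom_line(2, 14.0, 0.0, 7.5),
        format_pocket_atom_line(3, 12.0, -2.0, 5.0),
    ]
)
pocket_box = pocket_center_and_size(example_text)
print(pocket_box)
```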