<div class="alert alert-block alert-info">

<b>Thank you for contributing to TeachOpenCADD!</b>

</div>

<div class="alert alert-block alert-info">

<b>Set up your PR</b>: Please check out our <a href="https://github.com/volkamerlab/teachopencadd/issues/41">issue</a> on how to set up a PR for new talktorials, including standard checks and TODOs.

</div>

# T04 · Predicting Drug-Drug Interactions using SVM

Authors:

- Vanessa Siegel, 2023, CADD Seminar, Centre for Bioinformatics

## Aim of this talktorial

This talktorial introduces drug-drug interactions (DDI) and their different types, paying special attention to the concepts of antagonism, additivity, and synergism. We then take a closer look at Support Vector Machines (SVM) that use soft margin classifiers, followed by a more detailed explanation of how to use a combined similarity matrix as a pairwise kernel function to solve the non-linear classification problem of predicting new DDI by comparing them to already known DDI.

To build the combined similarity matrix, we will look at 2D and 3D structural similarity as well as similarity between interaction profiles and create databases for each of them. The dataset used during the practical part of this talktorial will be retrieved from [__DrugBank__](https://www.drugbank.ca) and filtered to only contain small molecule drugs that are annotated as approved for medical use in at least one country and for which 2D and 3D structural data are available.


### Contents in *Theory*

-	Drug-Drug Interactions
    - Importance of drug-drug interactions
    - Drug-drug interaction types
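-	Drug Similarity
    - Defining drug similarity to a computer
    - 2D structural similarity
    - 3D structural similarity
    - Interaction profile similarity
-	Support Vector Machines
    - Soft Margin Classifier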
    - Kernel Trick
-	DrugBank
    - History
    - Drug Entries
-	Workflow: Similarity-Based SVM for DDI Prediction
    - Feature Selection
    - Data Selection
    - Creating drug-drug similarity databases
    - Creating drug-pair similarity matrices
    - Creating the pairwise kernel
    - Modelling and evaluating the SVM


### Contents in *Practical*

-	Retrieve data from DrugBank
-	Create a drug-drug 2D molecular structure similarity database using RDKit
-	Create a drug-drug 3D pharmacophoric similarity database using E3FP and RDKit
-	Create a drug-drug interaction profile database using RDKit
-	Construct a combined pairwise similarity matrix for the kernel function
-	Model and evaluate the SVM


### References

* Paper 
* Tutorial links
* Other useful resources

*We suggest the following citation style:*
* Keyword describing resource: <i>Journal</i> (year), <b>volume</b>, pages (link to resource) 

*Example:*
* ChEMBL web services: [<i>Nucleic Acids Res.</i> (2015), <b>43</b>, 612-620](https://academic.oup.com/nar/article/43/W1/W612/2467881) 

## Theory

<div class="alert alert-block alert-info">

<b>Sync section titles with TOC</b>: Please make sure that all section titles in the <i>Theory</i> section are synced with the bullet point list provided in the <i>Aim of this talktorial</i> > <i>Contents in Theory</i> section.

</div>

### Drug-Drug Interactions

#### Importance of drug-drug interactions


Clinical investigations and biomedical research have shown that the administration of a single drug is often not enough to treat more complex diseases. Diseases like HIV, cancer, or kidney failure, to name a few, often require a combination of drugs to achieve satisfactory results or improvements in the patients' health. However, the simultaneous use of multiple drugs often leads to drug-drug interactions (DDI).

DDI are caused when one drug interferes with another in one or more stages of its life cycle in the body and thereby influences the effectiveness of said drug. This means they either cause an unexpected medical effect or create an unexpected but measurable difference in the concentration of the two drugs in the patient's bloodstream.

Notably, it does not matter if said influence or effect is beneficial or harmful to the biological system to be classified as a DDI. Thus, while DDI can be taken advantage of to increase a therapeutic effect, they are also a major cause of unwanted or unexpected adverse side effects in patients. Therefore, extensive knowledge about potential DDI is crucial in medical care.

Knowledge of DDI has a wide field of applications: ensuring that a patient does not take drugs that are known to negatively affect each other (e.g., causing adverse reactions, making one drug unusable, potentiating side effects to a severely harmful degree), preventing existing medical conditions from worsening, and taking advantage of synergistic DDI to give better treatment. As such, analysing new drugs or drug combinations for potential DDI is an important aspect in the advancement of personalized medicine.

Since finding and verifying DDI is a costly process due to the number of in vitro and in vivo experiments required, computational methods have been developed to filter the large number of potential drug combinations for those that show a high probability of expressing DDI. Many of these computational methods use machine-learning techniques, such as the one discussed in more detail in this talktorial.

#### Drug-drug interaction types

Drug-drug interactions are commonly classified by the cause of their occurrence, meaning they are either caused by pharmacokinetic (PK) or pharmacodynamic (PD) interactions.

__Pharmacokinetic DDI__ occur when drug A influences drug B's concentration in the bloodstream. It does not matter whether the interaction is caused by the active component of drug A or by one of the additives included to assist in the delivery of the drug.
PK DDI are separated into four categories depending on which stage of a drug's life cycle is affected: absorption, distribution, metabolism, or excretion (ADME).

__Pharmacodynamic DDI__, on the other hand, occur when the pharmacological effect of drug A influences the pharmacological effect of drug B. This happens when A and B target similar or related biological pathways or targets. However, a PD DDI can also occur when the pathways affected by the drugs seem to be completely unrelated, but their pharmacological effects still cause an unexpected medical observation.

PD DDI are classified into three groups: 
-	__Antagonistic__: 
The combined effect caused by a drug combination is smaller than the sum of the pharmacological effects seen when each drug is given alone.
-	__Additive__:
The combined effect caused by a drug combination is the sum of the pharmacological effects seen when each drug is given alone.
-	__Synergistic__:
The combined effect caused by a drug combination is greater than the sum of the pharmacological effects seen when each drug is given alone.


![Comparison Pharmacodynamic DDI Types](./images/PD%20DDI.jpg)

*Figure 1:* Graphical representation of the three types of pharmacodynamic drug-drug interactions
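
The three groups can be expressed as a simple comparison. The following is a minimal sketch, assuming that the effects of both drugs and of their combination are measured on the same numeric scale; the function name `classify_interaction` and the `tolerance` parameter are hypothetical and only serve as illustration.

```python
import math


def classify_interaction(effect_a, effect_b, combined_effect, tolerance=1e-9):
    """Compare a combined effect to the sum of the individual effects."""
    expected = effect_a + effect_b
    # Treat the combined effect as equal to the sum within a small tolerance
    if math.isclose(combined_effect, expected, abs_tol=tolerance):
        return "additive"
    return "synergistic" if combined_effect > expected else "antagonistic"


print(classify_interaction(2.0, 3.0, 4.0))  # antagonistic
print(classify_interaction(2.0, 3.0, 5.0))  # additive
print(classify_interaction(2.0, 3.0, 7.5))  # synergistic
```

In real measurements the tolerance would have to reflect experimental noise rather than floating-point precision.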

It is important to note that a DDI can be of both types, signifying that this way of classification is to be seen more as a widely used guideline rather than hard-split categories. Similarly, the words antagonistic, additive, synergistic, and their synonyms are sometimes used to also categorize PK DDI, especially in the medical field where the distinction between PK and PD may not be of importance.

<div class="alert alert-block alert-info">

__IMPORTANT__:

For the remainder of this talktorial, unless specified otherwise, DDI will not be differentiated into PK or PD. Likewise, the terms antagonistic, additive, and synergistic – if applied – will be used to describe all DDI as smaller, equal, and greater than the sum of their __therapeutic effect__ respectively as to not distract from the actual topic of this talktorial.

</div>

### Drug Similarity

A base assumption in drug discovery is that similar drugs express similar properties and thus behave similarly when introduced to a biological system. Computer-based methods use that assumption to cluster drugs or find molecules that have the potential for an increased therapeutic effect compared to already known drugs. As such, it is important to clearly define how *similarity* is to be judged computationally.


#### Defining drug similarity to a computer

There are multiple different categories and properties that can be used to define similarity between different drugs, and machine learning algorithms often use a combination of them to make better predictions. This process is called feature selection and can have significant impact on the reliability of a model. But how does a computer compare two drugs?

The most common way is to use fingerprints for the different properties and calculate similarity between them with the Tanimoto coefficient as described in Talktorial T004.

For the Tanimoto coefficient we will use the following formula during this talktorial:

$ TC(A,B) = |A \cap B| / |A \cup B| $

As shown, the formula divides the number of features shared by fingerprints A and B (their intersection) by the number of features present in the union of both fingerprints. The resulting *similarity* is a float value between 0 and 1, with 0 representing no similarity at all and 1 indicating that the two drugs are *identical* in the given property.
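
As a minimal sketch of this formula, fingerprints can be represented as sets of the features (on-bits) they contain. The function name `tanimoto` and the example feature sets are made up for illustration. The coefficient is undefined if both fingerprints are empty; returning 0.0 in that case is an arbitrary choice made for this sketch.

```python
def tanimoto(features_a, features_b):
    """Tanimoto coefficient of two fingerprints given as collections of on-bits."""
    a, b = set(features_a), set(features_b)
    union = a | b
    if not union:
        # Undefined for two empty fingerprints; 0.0 is a convention of this sketch
        return 0.0
    return len(a & b) / len(union)


fingerprint_a = {1, 2, 3, 5}
fingerprint_b = {2, 3, 4, 5, 6}
# 3 shared features, 6 features in the union
print(tanimoto(fingerprint_a, fingerprint_b))  # 0.5
```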

![Schematic example of 2D structural similarity calculation](./images/Fingerprint-based-molecular-similarity-approache-Modified.png)

*Figure 2:*
Example for the comparison between fingerprints of two molecules and the resulting Tanimoto coefficient. 

(Modified figure. Original taken from [*Molecular informatics.* (2021), __40__](https://www.researchgate.net/publication/351084895_Differential_Consistency_Analysis_Which_Similarity_Measures_can_be_Applied_in_Drug_Discovery))

#### 2D structural similarity
For 2D structural similarity, we will create MACCS fingerprints and use the Tanimoto coefficient to calculate our 2D similarity score. A more detailed explanation of MACCS fingerprints can be found in [__Talktorial T004__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T004_compound_similarity/talktorial.ipynb).


#### 3D structural similarity

For the calculation of the 3D structural similarity, we will create 3D fingerprints with the E3FP package and compare them with the Tanimoto coefficient, as in the 2D case. How this package derives fingerprints from 3D structures can be read up in the package's documentation and will not be discussed here.


#### Interaction profile similarity

As with 2D structural similarity, we use the fingerprint method and calculate the Tanimoto coefficient. Here, the fingerprint of a drug is a vector with one position per drug in our database; a position is set to 1 if the drug has a known DDI with the drug corresponding to that position and to 0 otherwise.
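
The construction of such interaction profile fingerprints can be sketched in a few lines. The drug names `A` to `D` and their interactions are placeholders, not DrugBank data, and the helper names are hypothetical.

```python
# Placeholder drugs and known interactions (not real data)
drugs = ["A", "B", "C", "D"]
known_ddi = {"A": {"B", "D"}, "B": {"A", "D"}, "C": set(), "D": {"A", "B"}}


def interaction_profile(drug):
    """One position per drug in the database: 1 if a DDI is known, 0 otherwise."""
    return [1 if other in known_ddi[drug] else 0 for other in drugs]


def tanimoto_bits(fp_a, fp_b):
    """Tanimoto coefficient of two equally long 0/1 vectors."""
    common = sum(1 for x, y in zip(fp_a, fp_b) if x and y)
    union = sum(1 for x, y in zip(fp_a, fp_b) if x or y)
    return common / union if union else 0.0


print(interaction_profile("A"))  # [0, 1, 0, 1]
print(interaction_profile("B"))  # [1, 0, 0, 1]
# A and B share one interaction partner (D) out of three partners in total
print(tanimoto_bits(interaction_profile("A"), interaction_profile("B")))
```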


### Support Vector Machines

Support Vector Machines (SVM) are supervised learning models, used to analyse data for classification and regression purposes. Initially developed by Vladimir Vapnik and colleagues during the 1990s, SVMs became more and more popular as one of the most robust prediction methods in machine learning.

Using a set of training examples and a binary label system, the SVM training algorithm maps each labelled training example to a point in space and then determines an optimal hyperplane to separate the two classes from each other. The hyperplane itself gets defined by a set of points from both classes, the so-called support vectors, and situated to maximize the margin, meaning the distance between itself and the chosen support vectors on both sides, to reliably separate the two classes.


![SVM schema](./images/optimal-hyperplane.png)

*Figure 3:* A schematic example of an optimal separating hyperplane.
Figure was taken from [Open Source Computer Vision](https://docs.opencv.org/4.x/d1/d73/tutorial_introduction_to_svm.html)
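
To make the notion of a margin more concrete, the sketch below computes the distance of labelled points to a hyperplane given by a weight vector and a bias, and compares two candidate hyperplanes. The points, weights, and bias values are hand-picked for illustration; an actual SVM finds the optimal hyperplane by solving an optimisation problem instead.

```python
import math


def signed_distance(point, weights, bias):
    """Signed distance of a point to the hyperplane weights . x + bias = 0."""
    norm = math.sqrt(sum(w * w for w in weights))
    return (sum(w * x for w, x in zip(weights, point)) + bias) / norm


def margin(points, labels, weights, bias):
    """Smallest label-signed distance of any point (negative if one is misclassified)."""
    return min(
        label * signed_distance(point, weights, bias)
        for point, label in zip(points, labels)
    )


points = [(1, 1), (1, 2), (3, 4), (4, 4)]
labels = [-1, -1, 1, 1]

# Two hand-picked candidate hyperplanes that both separate the classes
margin_diagonal = margin(points, labels, weights=(1, 1), bias=-5)
margin_vertical = margin(points, labels, weights=(1, 0), bias=-2.5)
print(margin_diagonal, margin_vertical)  # the diagonal one has the larger margin
```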

#### Soft Margin Classifier

In this talktorial, we will use a so-called __Soft Margin Classifier__, as opposed to the stricter Hard Margin Classifier. This means that we allow both outliers and potential misclassifications in our training data when determining the position of the optimal separating hyperplane.

That way, the positioning of the separating hyperplane becomes more robust and less prone to overfitting our training data, making the whole model more reliable when classifying new data points.

![Hard vs Soft Margin Classifier](./images/Hard-vs-Soft-Margin.jpg)

*Figure 4:* A comparison between a Hard Margin Classifier (left) and a Soft Margin Classifier (right). Both optimal separating hyperplanes would be identical if the one blue outlier was not part of the data set. Figure was taken from [Mubaris NK](https://mubaris.com/posts/svm/).

However, to do so, the error parameter C, which is used to punish the presence of outliers and misclassifications, must be chosen carefully so as to neither overfit our model nor reduce specificity to the point that predictions become meaningless. Choosing C, a parameter unique to Soft Margin Classifiers, is a very important step in handling the bias-variance trade-off typical for machine learning models.

![Effect of C](./images/Effect-of-soft-margin-constant-C-Modified.png)

*Figure 5:* Comparison of different values for the error parameter C and their impact on determining the optimal separating hyperplane.

(Figure was modified. Original from [*PLoS computational biology.* (2008), __4__](https://www.researchgate.net/publication/23442384_Support_Vector_Machines_and_Kernels_for_Computational_Biology))
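
The role of C can be illustrated with the usual textbook formulation of the soft-margin objective, $\frac{1}{2}\|w\|^2 + C \sum_i \max(0, 1 - y_i (w \cdot x_i + b))$. This formulation is not spelled out in this talktorial and is given here as an assumption; the one-dimensional data set with a single outlier is made up.

```python
def soft_margin_objective(points, labels, weights, bias, C):
    """Textbook soft-margin objective: 0.5 * ||w||^2 + C * sum of hinge losses."""
    regulariser = 0.5 * sum(w * w for w in weights)
    hinge_losses = [
        max(0.0, 1.0 - y * (sum(w * x for w, x in zip(weights, point)) + bias))
        for point, y in zip(points, labels)
    ]
    return regulariser + C * sum(hinge_losses)


# One-dimensional toy data; the last point is an outlier of the positive class
points = [(-2.0,), (-1.0,), (1.0,), (2.0,), (-0.5,)]
labels = [-1, -1, 1, 1, 1]

for C in (0.1, 1, 10):
    # The same outlier contributes more to the objective as C grows
    print(C, soft_margin_objective(points, labels, weights=(1.0,), bias=0.0, C=C))
```

With a small C, the outlier is cheap to ignore and the margin term dominates; with a large C, the optimiser is pushed towards hyperplanes that classify the outlier correctly.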

#### Kernel Trick

Notably the figures above all show examples of linear classification problems, meaning the two classes can be easily separated from each other with a linear hyperplane. However, this does not work when faced with a classification problem as depicted in the figure below. 

![Non-linear Classification Problem](./images/nonlinear%20data.png)

*Figure 6:* An example of a non-linear classification problem.
Figure was taken from [Andrea Perlato](https://www.andreaperlato.com/theorypost/introduction-to-support-vector-machine/).

For problems like this, SVMs have to make use of the __kernel trick__ to change a non-linear classification problem into a linear one before finding the optimal separating hyperplane.

To do so, SVMs take the data points and artificially move them into a higher dimension with the help of a __kernel function__, which maps the data points in a manner that makes them linearly separable. In this higher dimension, the SVM will then find an optimal separating hyperplane as described. In the last step, said hyperplane is transformed back into the original dimension, where it may no longer be linear.

A simple example of how the kernel trick works is detailed in Figure 7, where we artificially map the data points from a one-dimensional space into a two-dimensional space with a polynomial kernel function, determine the hyperplane, and then transform the data back into the original space.

![Kernel Trick for Non-Linear Classification Problem](./images/Kernel%20trick.png)

*Figure 7:* An example of the kernel trick on one-dimensional drug-dosage data. The chosen kernel function is a polynomial function of degree two: $f(x) = dosage^2$.
Figure was taken from [Andrea Perlato](https://www.andreaperlato.com/theorypost/introduction-to-support-vector-machine/).

Since the kernel function does not actually transform the data into a higher dimension and only calculates the relationships between every pair of points as if they were in the higher dimension, the whole method is called the kernel __trick__.
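
The following sketch illustrates both ideas on made-up dosage values: an explicit mapping of each dosage to the point (dosage, dosage²) makes the classes linearly separable, and a kernel function returns the same dot product without ever building the two-dimensional points. The separating line (`weights`, `bias`) is hand-picked for this example and not the result of an SVM fit.

```python
def feature_map(dosage):
    """Explicit mapping from one dimension to two: (dosage, dosage^2)."""
    return (dosage, dosage ** 2)


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def kernel(a, b):
    """Dot product of feature_map(a) and feature_map(b), computed in one dimension."""
    return a * b + (a * b) ** 2


# Made-up data: only medium dosages are effective
dosages = [1.0, 2.0, 5.0, 6.0, 9.0, 10.0]
effective = [False, False, True, True, False, False]

# Hand-picked line in the two-dimensional space: dosage^2 - 11 * dosage + 26.25 = 0
weights, bias = (-11.0, 1.0), 26.25
predicted = [dot(weights, feature_map(d)) + bias < 0 for d in dosages]
print(predicted == effective)  # True

# The kernel gives the same value as the explicit mapping
print(kernel(2.0, 3.0), dot(feature_map(2.0), feature_map(3.0)))  # 42.0 42.0
```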

As for the kernel function itself, there are many different ways of creating one. The kernel function introduced in this talktorial is a pairwise kernel method which uses a similarity matrix of drug pairs to perform the kernel trick. How exactly we calculate said function will be discussed in detail in the corresponding subsection of the *Workflow* chapter.

### DrugBank

#### History

DrugBank is a comprehensive, free-to-access, online database containing information on drugs and drug targets. First established in 2006 by Dr David Wishart's lab at the University of Alberta as a project to grant academic researchers easier access to detailed, structured information about drugs, DrugBank grew in size and popularity thanks to the backing of various research organisations as well as government funding. Now in its 5th version (version 5.1.10 as of 1st April 2023), DrugBank contains over 15,000 drug entries with 5,296 non-redundant protein sequences linked to them. Each entry contains more than 200 data fields and combines detailed drug data (chemical, pharmacological, pharmaceutical, etc.) with comprehensive drug target data like sequence, structure, and pathway, collected from bioinformatics and cheminformatics resources.

#### Drug Entries

Each entry in the DrugBank database is clearly structured, systematically ordered, and possesses a unique number through which each drug can be clearly identified and addressed within the database. The toolbar to the left provides an easy way to navigate through the different categories and subcategories for which information on the drug at hand is provided, allowing for intuitive and easy access to the sought-after information.

![Screenshot of DrugBank Entry Aspirin](./images/DrugBank%20Aspirin.png)

*Figure 8:* 
Screenshot of the DrugBank Entry of [__Aspirin__](https://go.drugbank.com/drugs/DB00945). The red-framed toolbar provides easy access to different categories of drug information provided by DrugBank. The blue square marks the unique accession number through which each drug can be identified within the database.

Note that while categories for which no information is available might be removed from the navigation bar, the presence of a certain field does not guarantee the presence of information. The example below illustrates that while the category Pharmacology is present, information is not available for most of its subfields.

![Screenshot of DrugBank Entry 1,2-Benzodiazepine](./images/DrugBank%201%202-Benzodiazepine.png)

*Figure 9:* 
Screenshot of the DrugBank Entry of [__1,2-Benzodiazepine__](https://go.drugbank.com/drugs/DB12537).

For the purpose of this talktorial the only relevant information needed from these entries are:

-	The __DrugBank Accession Number__
-	The __Type__ of the drug (Small Molecule or Biotech)
-	Which __Group(s)__ the drug belongs to
-	The 2D structure in the __SMILES__ format
-	The 3D structure in the __SDF file__ format
-	The list of known __Drug Interactions__


The DrugBank accession number will serve as an identification system during computations and consists of the letters DB followed by a five-digit number.
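
The format described above can be checked with a short regular expression. This is a minimal sketch; the function name `is_accession_number` is hypothetical, and the two valid examples are the accession numbers of the entries shown in Figures 8 and 9.

```python
import re

# The letters DB followed by exactly five digits
ACCESSION_PATTERN = re.compile(r"DB\d{5}")


def is_accession_number(text):
    """Return True if the whole string has the form of a DrugBank accession number."""
    return ACCESSION_PATTERN.fullmatch(text) is not None


print(is_accession_number("DB00945"))  # True
print(is_accession_number("DB12537"))  # True
print(is_accession_number("DB945"))    # False
```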

The Type and Group information is required to filter the database for only those drugs that qualify as small molecule drugs, and which were approved for use. 

The 2D and 3D structure information will be used to create fingerprints with RDKit and the E3FP package, respectively. 

Lastly, the drug interactions will be used to create interaction profile fingerprints for each drug and also serve as the foundation for the assignment of labels to our training data.

<div class="alert alert-block alert-info">

__IMPORTANT__:

Note that unlike the *Type*, the *Groups* are not an either-or classification system, since DrugBank incorporates information from various countries and a drug that is approved in one country can still be illicit or experimental in another. 

</div>

### Workflow: Similarity-Based SVM for DDI Prediction

#### Feature Selection

Feature selection is a major part of machine learning, especially when handling biological data, where there are many features but only a limited number of data points. How one chooses features depends on the availability of the data, the impact a feature might have on the problem at hand, and the computational effort required to use it.

As such, the first thing we need to do is decide which features we want to focus on. As previously mentioned, three different features were chosen for this talktorial: 2D structure, 3D structure, and drug interactions.

2D structure is a commonly used feature in drug discovery since it allows for relatively easy first comparisons between drugs and has proved reliable enough to form the basis for QSAR methods in pharmacy and drug discovery. Moreover, a 2D molecular representation is generally available for drugs.

The 3D structure was chosen to take into account that spatial orientation plays a very important role in all forms of protein-ligand interactions, and just like 2D structure, 3D structure is often available as well.

Drugs that already share a list of interaction partners are more likely to also share interactions for which experimental data exists for only one of them. Drug interactions were therefore chosen as a feature that is both closely related to the purpose of the SVM we build and sufficiently available for our training dataset.


#### Data Selection

Now that we have our features, it is time to collect our data. To predict DDI with an SVM, we require labelled data points for training, with enough examples of both classes to prevent too much bias towards one category.

In our practical, we label drug pairs as having (1) or not having a DDI (0). For this example, neither the strength nor the type of DDI is of importance. If one wishes to pay more attention to a certain type of interaction, they need to adjust the labels accordingly.
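
A minimal sketch of this labelling step is shown below. The drug names and the set of known interactions are placeholders; storing each pair as a `frozenset` is a design choice of this sketch so that the order of the two drugs does not matter.

```python
from itertools import combinations

# Placeholder drugs and known interactions (not real data)
drugs = ["A", "B", "C", "D"]
known_interactions = {frozenset(("A", "B")), frozenset(("A", "D"))}

# Every unordered drug pair becomes one data point for the SVM
pairs = list(combinations(drugs, 2))
labels = [1 if frozenset(pair) in known_interactions else 0 for pair in pairs]

for pair, label in zip(pairs, labels):
    print(pair, label)
```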

After downloading the DrugBank database version 5.1.10, we filter it for all small molecule drugs that were approved for use in at least one country. In the next step, we filter out all drugs for which DrugBank is unable to provide a SMILES string of the 2D structure or an SDF file for the 3D structure. If Drug Interactions are marked as 'Not Available', we will assume that these drugs do not show any noteworthy DDI.

The list of drugs that remains will be our final dataset for the Practical.
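
The filtering steps can be sketched as follows. The records, their field names (`type`, `groups`, `smiles`, `sdf_file`, `interactions`), and the accession numbers are made up for illustration and do not reflect the actual layout of the DrugBank download.

```python
# Hypothetical, simplified records; field names and values are made up
entries = [
    {"accession": "DB90001", "type": "Small Molecule", "groups": ["approved"],
     "smiles": "CCO", "sdf_file": "DB90001.sdf", "interactions": ["DB90004"]},
    {"accession": "DB90002", "type": "Biotech", "groups": ["approved"],
     "smiles": None, "sdf_file": None, "interactions": []},
    {"accession": "DB90003", "type": "Small Molecule", "groups": ["experimental"],
     "smiles": "CCN", "sdf_file": "DB90003.sdf", "interactions": []},
    {"accession": "DB90004", "type": "Small Molecule", "groups": ["approved", "illicit"],
     "smiles": "CCC", "sdf_file": "DB90004.sdf", "interactions": "Not Available"},
]


def keep_entry(entry):
    """Approved small molecule drugs with both 2D and 3D structure data."""
    return (
        entry["type"] == "Small Molecule"
        and "approved" in entry["groups"]
        and bool(entry.get("smiles"))
        and bool(entry.get("sdf_file"))
    )


def normalise_interactions(interactions):
    """Treat interactions marked as 'Not Available' as an empty list."""
    return [] if interactions == "Not Available" else list(interactions)


dataset = [
    {**entry, "interactions": normalise_interactions(entry["interactions"])}
    for entry in entries
    if keep_entry(entry)
]
print([entry["accession"] for entry in dataset])  # ['DB90001', 'DB90004']
```

Checking for `"approved"` with `in` rather than equality reflects that a drug can belong to several groups at once.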

#### Creating drug-drug similarity databases

With our dataset ready, it is now time to calculate similarity for each of the three features while looking at all possible drug combinations.

To create the 2D structure similarity database we use RDKit (…) to transform the SMILES representation of each drug into a MACCS fingerprint. Once we have these, we calculate the Tanimoto coefficient for all possible combinations of drugs and store these values in a similarity matrix for easy access.

For the 3D structure similarity database, we use, as mentioned before, the E3FP package to create fingerprints directly from the 3D SDF files. These are then converted into RDKit fingerprints before we calculate a similarity score for each possible drug combination.

Lastly, to create an interaction profile database, we first create fingerprints whose length equals the number of drugs in our dataset and then convert them into RDKit fingerprints. Afterwards, we use the Tanimoto coefficient to calculate the similarity between the interaction profiles of two drugs.

All these fingerprints as well as the three final databases will be stored for quick lookups during later calculations.

#### Creating drug-pair similarity matrices

Now that we have tables to look up similarity scores for each possible drug combination, it is time to build the drug-pair similarity matrices. The process is the same for all three features.

Since the SVM treats pairs of drugs as singular instances, the similarity matrix needs to contain similarity scores between drug-pairs rather than individual drugs as is the case in the previously created databases.

To calculate the entries in the similarity matrices for each feature we use the following formula:

$S\big((d_1,d_1'),(d_2,d_2')\big) = \max\big( s(d_1,d_2) \cdot s(d_1',d_2') ,\ s(d_1',d_2) \cdot s(d_1,d_2')\big)$

The letter $s$ denotes the similarity scores previously calculated and stored in the above-mentioned databases, while $S$ stands for the similarity score of the drug-pairs and will be saved in the new similarity matrix.

Although the values $s(d_1,d_1')$ and $s(d_1',d_1)$ are identical, there is a difference between $s(d_1,d_2) \cdot s(d_1',d_2')$ and $s(d_1',d_2) \cdot s(d_1,d_2')$. Therefore, we choose the maximum of those two values for $S$, so that the stored score reflects the pairing of individual drugs with maximum similarity.

Notably, the resulting drug-pair similarity matrices contain each pair $(d_1,d_1')$ only once, meaning there is no column or row for $(d_1',d_1)$.

#### Creating the pairwise kernel

In this part we simply add the scores of all three drug-pair similarity matrices together, giving each feature the same importance. As a result, the values in the final similarity matrix we use as our pairwise kernel function range from 0 to 3 rather than 0 to 1 as in the underlying drug-pair similarity matrices.
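
As a minimal sketch, the element-wise sum of the three drug-pair similarity matrices can be written as below. The 2×2 matrices contain made-up values, and plain Python lists are used only to keep the sketch free of dependencies; with NumPy arrays the same operation is a simple `+`.

```python
def combine_kernels(*matrices):
    """Element-wise sum of equally shaped matrices given as nested lists."""
    return [
        [sum(values) for values in zip(*rows)]  # entries of one column position
        for rows in zip(*matrices)              # same row taken from each matrix
    ]


# Made-up drug-pair similarity matrices for two drug pairs
similarity_2d = [[1.0, 0.4], [0.4, 1.0]]
similarity_3d = [[1.0, 0.1], [0.1, 1.0]]
similarity_ip = [[1.0, 0.25], [0.25, 1.0]]

pairwise_kernel = combine_kernels(similarity_2d, similarity_3d, similarity_ip)
print(pairwise_kernel)  # diagonal entries reach the maximum value of 3
```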

#### Modelling and evaluating the SVM

Adding our labelled training dataset and our new kernel function we create the SVM. Since we use a soft margin classifier, we add an error penalty parameter C with C>0 which is called the soft-margin constant.

TODO (Look into how exactly to set up the SVM with all the appropriate parameters)

We then use 10-fold cross-validation to evaluate the model, using the area under the receiver operating characteristic curve (AUC) as the criterion because it is not affected by the ratio of positives to negatives. Within each of these rounds, we perform an additional 3-fold cross-validation to select the best-performing value for the error parameter $C$ from the set {0.1, 1, 10} according to AUC.
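
One possible way to set up this nested cross-validation with scikit-learn is sketched below. It is not necessarily the final setup of this talktorial: the kernel matrix and labels are synthetic stand-ins for the pairwise kernel and the DDI labels, and passing the matrix via `SVC(kernel="precomputed")` is an assumption about how the kernel is handed to the SVM.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in data: 60 "drug pairs" with balanced labels
rng = np.random.default_rng(0)
n_pairs = 60
labels = np.tile([0, 1], n_pairs // 2)
features = rng.normal(size=(n_pairs, 5))
features[:, 0] += 2.0 * labels - 1.0  # make the classes partly separable

# Symmetric matrix playing the role of the precomputed pairwise kernel
kernel_matrix = features @ features.T

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
outer_cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# Inner loop: select C from {0.1, 1, 10} according to AUC
search = GridSearchCV(
    SVC(kernel="precomputed"),
    param_grid={"C": [0.1, 1, 10]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Outer loop: 10-fold cross-validation of the whole selection procedure
auc_scores = cross_val_score(search, kernel_matrix, labels, scoring="roc_auc", cv=outer_cv)
print(f"Mean AUC over {len(auc_scores)} folds: {auc_scores.mean():.2f}")
```

For precomputed kernels, scikit-learn slices both the rows and the columns of the matrix when splitting, so the full square kernel matrix can be passed to `cross_val_score` directly.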


## Practical

In this practical, we will create drug similarity databases for 2D similarity, 3D similarity, and interaction profiles. These databases will then be used to create a combined drug-pair similarity matrix, which will serve as the kernel function for a soft-margin-classifier SVM to predict DDI.

<div class="alert alert-block alert-info">

<b>Sync section titles with TOC</b>: Please make sure that all section titles in the <i>Practical</i> section are synced with the bullet point list provided in the <i>Aim of this talktorial</i> > <i>Contents in Practical</i> section.

</div>

In [159]:
from pathlib import Path
#import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#from matplotlib.lines import Line2D
#import matplotlib.patches as mpatches

# RDkit imports
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    Descriptors,
    Draw,
    PandasTools,
    MACCSkeys,
    rdFingerprintGenerator,
    AllChem,
)
# E3FP package imports
from e3fp.fingerprint.generate import (
    fp,
    fprints_dict_from_sdf,
)

# Scikit-Learn imports
from sklearn import (
    svm,
    datasets,
    metrics
)

<div class="alert alert-block alert-info">

<b>Imports</b>: Please add all your imports on top of this section, ordered by standard library / 3rd party packages / our own (<code>teachopencadd.*</code>). 
Read more on imports and import order in the <a href="https://www.python.org/dev/peps/pep-0008/#imports">"PEP 8 -- Style Guide for Python Code"</a>.
    
</div>

In [134]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

<div class="alert alert-block alert-info">

<b>Relative paths</b>: Please define all paths relative to this talktorial's path by using the global variable <code>HERE</code>.
If your talktorial has input/output data, please define the global <code>DATA</code>, which points to this talktorial's data folder (check out the default folder structure of each talktorial).
    
</div>

### Retrieve data from DrugBank

In [135]:
# Molecules in SMILES format
drug_smiles = [
    #"CC1C2C(C3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)O)C(=O)N)N(C)C)O",
    "CC1(C(N2C(S1)C(C2=O)NC(=O)C(C3=CC=C(C=C3)O)N)C(=O)O)C",
    "C1=COC(=C1)CNC2=CC(=C(C=C2C(=O)O)S(=O)(=O)N)Cl",
    "C1NC2=CC(=C(C=C2S(=O)(=O)N1)S(=O)(=O)N)Cl",
    "CC1=C(C(CCC1)(C)C)C=CC(=CC=CC(=CC(=O)O)C)C",
    #"CC1(C2CC3C(C(=O)C(=C(C3(C(=O)C2=C(C4=C1C=CC=C4O)O)O)O)C(=O)N)N(C)C)O",
]

# List of molecule names
drug_names = [
    #"Doxycycline",
    "Amoxicilline",
    "Furosemide",
    "Hydrochlorothiazide",
    "Isotretinoin",
    #"Tetracycline",
]

drug_sdf = [
    #"Doxycycline.sdf",
    "Amoxicilline.sdf",
    "Furosemide.sdf",
    "Hydrochlorothiazide.sdf",
    "Isotretinoin.sdf",
    #"Tetracycline.sdf",
]

drug_interactions = [
    ["Furosemide", "Isotretinoin"],
    ["Amoxicilline", "Isotretinoin"],
    [],
    ["Amoxicilline"],
]


In [136]:
# Input Data
drugs = pd.DataFrame({
    "id": drug_names, 
    "smiles":drug_smiles, 
    "sdf_file":drug_sdf,
    "ddi": drug_interactions
})

# constants used throughout the tutorial
num_drugs = drugs["id"].size

### Create a drug-drug 2D molecular structure similarity database using RDKit

We use the DataFrame that maps the drugs to their SMILES and create molecules from said SMILES in order to calculate the MACCS fingerprints with the function provided by RDKit. Once we have the fingerprints, we create a similarity matrix containing the Tanimoto scores calculated from the MACCS fingerprints.

Since a similarity matrix is always symmetrical, we save calculation time by starting the inner loop at position i. That way, we compare a drug only to itself and to those following it, since the comparisons to the preceding drugs were already done in previous runs of the loop.

For easier access to the scores saved within the similarity matrix, we create a pandas DataFrame, naming the columns and rows after our drugs. That way, we can access the scores by using the names of the drugs rather than having to figure out at which specific index said drugs are within the list.

In [137]:
# create molecules from SMILES
drugs["2D_molecule"] = [Chem.MolFromSmiles(x) for x in drugs["smiles"]]
# create MACCS fingerprints
drugs["maccs"] = [MACCSkeys.GenMACCSKeys(x) for x in drugs["2D_molecule"]]

# calculate Tanimoto coefficients of the MACCS fingerprints
scores_2D = np.zeros((num_drugs, num_drugs))
for i in range(num_drugs):
    # start at i: the matrix is symmetric, so earlier pairs are already filled
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["maccs"][i], drugs["maccs"][j])
        scores_2D[i][j] = scores_2D[j][i] = score
drugs_2D_database = pd.DataFrame(scores_2D, index=drugs["id"], columns=drugs["id"])
#print(drugs_2D_database)
#print(drugs_2D_database["Amoxicilline"]["Furosemide"])

### Create a drug-drug 3D pharmacophoric similarity database using E3FP and RDKit

The first step is creating E3FP fingerprints with the help of the identically named package. This can be done directly from the provided 3D SDF files with the *fprints_dict_from_sdf* function. This function takes the name of the SDF file and an additional parameter *first=1* to define that we only want to look at the first conformer within the respective file.

After that, we convert the E3FP fingerprint into a fingerprint as recognized by RDKit before calculating and saving the Tanimoto coefficients in a similarity matrix in the same manner as above.

In [138]:
# create 3D fingerprints directly from the provided SDF files
# (first=1: only use the first conformer in each file)
fp_dict = [fprints_dict_from_sdf(str(DATA / x), first=1) for x in drugs["sdf_file"]]

# get the fingerprints out of the dictionaries
# (key 5 holds the list of fingerprints, of which we take the first and only one)
fingerprints = [fprints[5][0] for fprints in fp_dict]

# convert E3FP fingerprint objects into RDKit fingerprint objects
drugs["e3fp"] = [fprint.fold().to_rdkit() for fprint in fingerprints]

# calculate Tanimoto coefficients
scores_3D = np.zeros((num_drugs, num_drugs))
for i in range(num_drugs):
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["e3fp"][i], drugs["e3fp"][j])
        scores_3D[i][j] = scores_3D[j][i] = score
drugs_3D_database = pd.DataFrame(scores_3D, index=drugs["id"], columns=drugs["id"])
#print(drugs_3D_database)
#print(drugs_3D_database["Amoxicilline"]["Furosemide"])


2023-06-19 21:05:33,090|INFO|Generating fingerprints for 33613.
2023-06-19 21:05:33,222|INFO|Generated 1 fingerprints for 33613.
2023-06-19 21:05:33,228|INFO|Generating fingerprints for 3440.
2023-06-19 21:05:33,329|INFO|Generated 1 fingerprints for 3440.
2023-06-19 21:05:33,332|INFO|Generating fingerprints for 3639.
2023-06-19 21:05:33,396|INFO|Generated 1 fingerprints for 3639.
2023-06-19 21:05:33,403|INFO|Generating fingerprints for 5282379.
2023-06-19 21:05:33,507|INFO|Generated 1 fingerprints for 5282379.


### Create a drug-drug interaction profile database using RDkit

Unlike in the cases above, we create the interaction profile fingerprints ourselves, as described in the Theory part. The resulting NumPy arrays are converted into bit strings and then into fingerprints recognized by RDKit, which are saved for later comparisons.

After that, we once again calculate the Tanimoto coefficients and save them in a similarity matrix as a pandas DataFrame, with the drug names labelling the rows and columns of the matrix.

In [139]:
# Create interaction profile fingerprints
fps = []
for i in range(num_drugs):
    # use an integer array so that every entry becomes a single "0" or "1" character
    DDI_fp = np.zeros(num_drugs, dtype=int)
    for j in range(num_drugs):
        if drugs["id"][j] in drugs["ddi"][i]:
            DDI_fp[j] = 1
    # change array into bitstring
    bitstring = "".join(DDI_fp.astype(str))
    # change bitstring to RDKit fingerprint and add to fingerprints
    fps.append(DataStructs.cDataStructs.CreateFromBitString(bitstring))
    
drugs["ipfp"] = fps

# calculate Tanimoto Coefficient
scores_IP = np.zeros((num_drugs,num_drugs))
for i in range(num_drugs):
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["ipfp"][i],drugs["ipfp"][j])
        scores_IP[i][j] = scores_IP[j][i] = score
drugs_IP_database = pd.DataFrame(scores_IP, drugs["id"], drugs["id"])
print(drugs_IP_database)
print(drugs_IP_database["Amoxicilline"]["Furosemide"])

id                   Amoxicilline  Furosemide  Hydrochlorothiazide   
id                                                                   
Amoxicilline             1.000000    0.333333                  0.0  \
Furosemide               0.333333    1.000000                  0.0   
Hydrochlorothiazide      0.000000    0.000000                  1.0   
Isotretinoin             0.000000    0.500000                  0.0   

id                   Isotretinoin  
id                                 
Amoxicilline                  0.0  
Furosemide                    0.5  
Hydrochlorothiazide           0.0  
Isotretinoin                  1.0  
0.3333333333333333
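To make the interaction profile idea concrete, here is a minimal pure-Python sketch that needs neither RDKit nor NumPy. The drug names `A` to `D` and their interaction sets are hypothetical toy data, not taken from DrugBank, and returning `0.0` for two empty profiles is a convention chosen for this sketch only.

```python
# Hypothetical toy data: each drug maps to the set of drugs it interacts with.
interactions = {
    "A": {"B", "C"},
    "B": {"A", "C", "D"},
    "C": {"A", "B"},
    "D": {"B"},
}
drug_ids = sorted(interactions)


def profile(drug):
    """Return the interaction profile of `drug` as a list of 0/1 integers."""
    return [1 if other in interactions[drug] else 0 for other in drug_ids]


def tanimoto(fp1, fp2):
    """Tanimoto coefficient of two equal-length binary vectors."""
    both = sum(a & b for a, b in zip(fp1, fp2))    # bits set in both profiles
    either = sum(a | b for a, b in zip(fp1, fp2))  # bits set in at least one
    # Assumption of this sketch: two empty profiles get similarity 0.0.
    return both / either if either else 0.0


profiles = {drug: profile(drug) for drug in drug_ids}

# Bit string of drug A, analogous to the string passed to RDKit in the cell above.
bitstring_a = "".join(str(bit) for bit in profiles["A"])

# A and C share one interaction partner out of three distinct ones in total.
sim_ac = tanimoto(profiles["A"], profiles["C"])
print(bitstring_a, sim_ac)
```

Two drugs are considered similar here when they interact with largely the same partners, regardless of their chemical structure.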


### Construct a combined pairwise similarity matrix for the kernel function

We first create the actual training data points for the SVM: each drug pair is one instance, labelled 1 if the two drugs are known to interact and 0 otherwise.

In [None]:
# create labelled drug-pair data points
drug_pairs_partner = []
drug_pairs_names = []
drug_pairs_labels = []
for i in range(num_drugs):
    for j in range(i+1, num_drugs):
        drug_pairs_partner.append([drugs["id"][i], drugs["id"][j]])
        drug_pairs_names.append(drugs["id"][i] + " + " + drugs["id"][j])
        if drugs["id"][j] in drugs["ddi"][i] or drugs["id"][i] in drugs["ddi"][j]:
            drug_pairs_labels.append(1)
        else:
            drug_pairs_labels.append(0)

print(drug_pairs_partner)
print(drug_pairs_names)
print(drug_pairs_labels)

# create our training dataset for the SVM
pairs = pd.DataFrame({
    "partner" : drug_pairs_partner,
    "id": drug_pairs_names, 
    "label": drug_pairs_labels, 
})
num_pairs = pairs["id"].size

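Now that each drug pair is a single instance, we calculate pair-level similarity matrices for 2D, 3D, and interaction profile similarity and save them. Two pairs (a, b) and (c, d) can be matched in two ways, a with c and b with d, or a with d and b with c. For each matching we multiply the two drug-drug similarities and keep the larger product, so the order of the drugs within a pair does not matter.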

In [None]:
# create pairwise similarity matrices
def pair_similarity(database, d1, d2):
    # similarity of two drug pairs: the better of the two possible drug-to-drug matchings
    score1 = database[d1[0]][d2[0]] * database[d1[1]][d2[1]]
    score2 = database[d1[1]][d2[0]] * database[d1[0]][d2[1]]
    return max(score1, score2)

pair_scores_2D = np.zeros((num_pairs,num_pairs))
pair_scores_3D = np.zeros((num_pairs,num_pairs))
pair_scores_IP = np.zeros((num_pairs,num_pairs))
for i in range(num_pairs):
    for j in range(i, num_pairs):
        d1 = pairs["partner"][i]
        d2 = pairs["partner"][j]
        # calculate 2D pairwise similarity
        pair_scores_2D[i][j] = pair_scores_2D[j][i] = pair_similarity(drugs_2D_database, d1, d2)

        # calculate 3D pairwise similarity
        pair_scores_3D[i][j] = pair_scores_3D[j][i] = pair_similarity(drugs_3D_database, d1, d2)

        # calculate IP pairwise similarity
        pair_scores_IP[i][j] = pair_scores_IP[j][i] = pair_similarity(drugs_IP_database, d1, d2)

# save pairwise similarity databases for quick checkups
pair_2D_database = pd.DataFrame(pair_scores_2D, pairs["id"], pairs["id"])
pair_3D_database = pd.DataFrame(pair_scores_3D, pairs["id"], pairs["id"])
pair_IP_database = pd.DataFrame(pair_scores_IP, pairs["id"], pairs["id"])
#print(pair_IP_database)
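The pair-level similarity computed in the cell above can be traced by hand on a small example. The following sketch uses only the standard library; the drug names and similarity values are hypothetical and chosen purely for illustration.

```python
from itertools import combinations

# Hypothetical, symmetric drug-drug similarities (self-similarity is 1.0).
drug_sim = {
    frozenset(["A", "B"]): 0.8,
    frozenset(["A", "C"]): 0.1,
    frozenset(["A", "D"]): 0.2,
    frozenset(["B", "C"]): 0.3,
    frozenset(["B", "D"]): 0.4,
    frozenset(["C", "D"]): 0.9,
}


def sim(x, y):
    """Look up the similarity of two drugs, independent of their order."""
    return 1.0 if x == y else drug_sim[frozenset([x, y])]


def pair_similarity(p, q):
    """Similarity of two drug pairs: the better of the two possible matchings."""
    (a, b), (c, d) = p, q
    return max(sim(a, c) * sim(b, d), sim(a, d) * sim(b, c))


# Four drugs give six unordered drug pairs, as in the talktorial's toy dataset.
pairs = list(combinations(["A", "B", "C", "D"], 2))

# (A, B) vs (C, D): max(0.1 * 0.4, 0.2 * 0.3)
print(pair_similarity(("A", "B"), ("C", "D")))
# Swapping the drugs inside a pair does not change the result.
print(pair_similarity(("A", "B"), ("D", "C")))
```

Taking the maximum over both matchings is what makes the measure independent of how the two drugs of a pair happen to be ordered.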

The next step is to combine these three similarity matrices into one by summing them element-wise. The resulting matrix is used as the precomputed kernel.

In [163]:
pairwise_kernel_matrix = pair_2D_database + pair_3D_database + pair_IP_database
#print(pairwise_kernel_matrix)
kernel = metrics.pairwise.pairwise_kernels(pairwise_kernel_matrix, metric='precomputed')
print(kernel)

[[3.         0.87848297 0.70433821 0.3935743  0.48167239 0.16666578]
 [0.87848297 3.         0.11111111 0.83130699 0.16666578 0.31500573]
 [0.70433821 0.11111111 3.         0.04777607 0.83130699 0.3935743 ]
 [0.3935743  0.83130699 0.04777607 3.         0.12247791 0.70433821]
 [0.48167239 0.16666578 0.83130699 0.12247791 3.         0.87848297]
 [0.16666578 0.31500573 0.3935743  0.70433821 0.87848297 3.        ]]


### Model and evaluate the SVM

Now it is time to build the Support Vector Machine. Because the kernel is precomputed, the classifier is fitted on the kernel matrix rather than on the drug pairs themselves.

In [165]:
classifier = svm.SVC(kernel='precomputed',C=1.0)
print(drug_pairs_labels)
# with a precomputed kernel, fit expects the (num_pairs x num_pairs) kernel matrix
classifier.fit(kernel,drug_pairs_labels)

[1, 0, 1, 0, 1, 0]
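When an SVM uses a precomputed kernel, scikit-learn expects a square kernel matrix between the training samples for `fit` and a matrix of shape (number of test samples, number of training samples) for `predict`. The sketch below illustrates this on hypothetical toy points with a plain linear kernel. It does not use the DDI data or the combined similarity kernel from this talktorial, and the train/test split is an arbitrary choice for the example.

```python
import numpy as np
from sklearn import svm

# Hypothetical toy data: two well separated clusters in 2D.
X = np.array([
    [0.0, 0.0], [0.2, 0.1], [0.1, 0.2],   # class 0
    [2.0, 2.0], [2.1, 1.9], [1.9, 2.1],   # class 1
])
y = np.array([0, 0, 0, 1, 1, 1])

# Arbitrary split for illustration.
train_idx = [0, 1, 3, 4]
test_idx = [2, 5]

# Full kernel matrix between all samples (linear kernel as a stand-in).
K = X @ X.T

# Rows and columns restricted to training samples for fitting.
K_train = K[np.ix_(train_idx, train_idx)]
# Rows are test samples, columns are training samples for prediction.
K_test = K[np.ix_(test_idx, train_idx)]

clf = svm.SVC(kernel="precomputed", C=1.0)
clf.fit(K_train, y[train_idx])
predictions = clf.predict(K_test)
print(predictions)
```

The same slicing pattern would apply to a drug-pair kernel: the pairs held out for testing must be excluded from both the rows and the columns of the matrix passed to `fit`.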

## Discussion

Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.

## Quiz

Ask three questions that the user should be able to answer after doing this talktorial. Choose important take-aways from this talktorial for your questions.

1. Question
2. Question
3. Question

<div class="alert alert-block alert-info">

<b>Useful checks at the end</b>: 
    
<ul>
<li>Clear output and rerun your complete notebook. Does it finish without errors?</li>
<li>Check if your talktorial's runtime is as expected. If not, try to find out which step(s) take unexpectedly long.</li>
<li>Flag code cells with <code># TODO: CI</code> that have deterministic output and should be tested within our Continuous Integration (CI) framework.</li>
</ul>

</div>