<div class="alert alert-block alert-info">

<b>Thank you for contributing to TeachOpenCADD!</b>

</div>

<div class="alert alert-block alert-info">

<b>Set up your PR</b>: Please check out our <a href="https://github.com/volkamerlab/teachopencadd/issues/41">issue</a> on how to set up a PR for new talktorials, including standard checks and TODOs.

</div>

# T04 · Predicting Drug Drug Interactions using SVM

Authors:

- Vanessa Siegel, 2023, CADD Seminar, Centre for Bioinformatics

## Aim of this talktorial

This talktorial introduces and explores the subject of drug-drug interactions (DDI) and their different types, paying special attention to the concepts of antagonism, additivity, and synergism. This will be followed by a closer look at Support Vector Machines (SVM) that use soft margin classifiers and then move towards a more detailed explanation on how to use a combined similarity matrix as a pairwise kernel function to solve the non-linear classification problem of predicting new DDI by comparing them to already known DDI.

To build the combined similarity matrix, we will look at 2D and 3D structural similarity as well as similarity between interaction profiles and create databases for each of them. The dataset used during the practical part of this talktorial will be retrieved from [__DrugBank__](www.drugbank.ca) and filtered to only contain small molecule drugs that are annotated as approved for medical use in at least one country and for which 2D and 3D structural data is available.


### Contents in *Theory*

-	Drug-Drug Interactions
    - Importance of drug-drug interactions
    - Drug-drug interaction types
-	Drug Similarity
    - Defining drug similarity to a computer
    - 2D similarity
    - 3D similarity
    - Interaction profile similarity
-	Support Vector Machines
    - Soft Margin Classifer
    - Kernel Trick
-	DrugBank
    - History
    - Drug Entries
-	Workflow: Similarity-Based SVM for DDI Prediction
    - Feature Selection
    - Data Selection
    - Creating drug-drug similarity databases
    - Creating drug-pair similarity matrices
    - Creating the pairwise kernel
    - Modelling and evaluating the SVM


### Contents in *Practical*

-	Reading input data from DrugBank
-	Create a drug-drug 2D molecular structure similarity database using RDKit
-	Create a drug-drug 3D pharmacophoric similarity database using the E3FP and RDKit
-	Create a drug-drug interaction profile database using RDKit
-	Construct a combined pairwise similarity matrix for the kernel function
-	Model and evaluate the SVM with Scikit-Learn


### References

* Similarity-based SVM predictor for DDI: [*J. Clin. Pharm. Ther.* (2018), __44(2)__, 268-275](https://pubmed.ncbi.nlm.nih.gov/30565313/)
* Explanation of E3FP: [*J. Med. Chem.* (2017), __60__, 7393-7409](https://pubs.acs.org/doi/full/10.1021/acs.jmedchem.7b00696)
* DrugBank: [*Nucleic Acids Res.* (2017), __8__](https://pubmed.ncbi.nlm.nih.gov/29126136/)
* Fingerprints and Drug Similarity: [__Talktorial 004__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T004_compound_similarity/talktorial.ipynb)
* RDKit package [__documentation__](https://www.rdkit.org/docs/index.html)
* E3FP package [__documentation__](https://e3fp.readthedocs.io/en/latest/index.html)
* Scikit-Learn package [__Documentation__](https://scikit-learn.org/stable/index.html)

## Theory

### Drug-Drug Interactions

#### Importance of drug-drug interactions


Clinical investigations and research in the biomedical field showed that to treat more complex diseases the administration of just one drug is often not enough. Diseases like HIV, cancer, or kidney failure, to name a few, often require a combination of drugs to achieve satisfactory results or improvements in the patients' health. However, the simultaneous use of multiple drugs often leads to the occurrence of drug-drug interactions (DDI).

DDI are caused when one drug interfers with another in one or more stages of its lifetime circle in the body and through that influences the effectiveness of said drug. This means they either cause an unexpected medical effect or creates an unexpected but measurable difference of the two drugs in the patient's bloodstream. 

Notably, it does not matter if said influence or effect is beneficial or harmful to the biological system to be classified as a DDI. Thus, while DDI can be taken advantage of to increase a therapeutic effect, they are also a mayor cause for unwanted or unexpected adverse side effects in patients. Therefore, extensive knowledge about potential DDI is crucial in medical care.

From ensuring that the patient doesn't take drugs which are known to negatively affect each other (e.g., causing adverse reactions, making one drug unusable, potentiating side effects to a severely harmful degree), over preventing existing medical conditions from worsening, up to taking advantage of synergistic DDI to give better treatment, knowledge of DDI has a wide field of applications. As such analysing new drugs or drug combinations for potential DDI is an important aspect in the advancement of personalized medicine.

Since finding and verifying DDI is a costly process due to the amount of in vitro and in vivo experiments, computational means started to develop to filter the large number of potential drug combinations for those who show a high possibility for expressing DDI. Many of these computational methods use machine-learning techniques, such as the one discussed in this talktorial in more detail.

#### Drug-drug interaction types

Drug-drug interactions are commonly classified by the cause of their occurrence, meaning they are either caused by pharmacokinetic (PK) or pharmacodynamic (PD) interactions.

__Pharmacokinetic DDI__ occur when drug A influences drug B's concentration in the blood stream. It does not matter if it is the active component of drug A or one of the additives, added to assist in the delivery of the drug, that is the cause of this interaction. 
PK DDI are separated into four categories depending on which stage of a drug's lifetime circle is affected: absorption, distribution, metabolism, or excretion (ADME).

__Pharmacodynamic DDI__, on the other hand, occur when the pharmacological effect of drug A influences the pharmacological effect of drug B. This happens when A and B target similar or related biological pathways or targets. However, a PD DDI can also occur when the drugs affected pathways seem to be completely unrelated, but their pharmacological effects still cause an unexpected medical observation.

PD DDI get classified into three groups: 
-	__Antagonistic__: 
The combined effect caused by a drug combination is smaller than the sum of the pharmacological effects seen when each drug is given alone.
-	__Additive__:
The combined effect caused by a drug combination is the sum of the pharmacological effects seen when each drug is given alone.
-	__Synergistic__:
The combined effect caused by a drug combination is greater than the sum of the pharmacological effects seen when each drug is given alone.


![Comparison Pharmacodynamic DDI Types](./images/PD%20DDI.jpg)

*Figure 1:* Graphical representation of the three types of pharmacodynamic drug-drug interactions

It is important to note that a DDI can be of both types, signifying that this way of classification is to be seen more as a widely used guideline rather than hard-split categories. Similarly, the words antagonistic, additive, synergistic, and their synonyms are sometimes used to also categorize PK DDI, especially in the medical field where the distinction between PK and PD may not be of importance.

<div class="alert alert-block alert-info">

__IMPORTANT__:

For the remainder of this talktorial, unless specified otherwise, DDI will not be differentiated into PK or PD. Likewise, the terms antagonistic, additive, and synergistic – if applied – will be used to describe all DDI as smaller, equal, and greater than the sum of their __therapeutic effect__ respectively as to not distract from the actual topic of this talktorial.

</div>

### Drug Similarity

A base assumption in drug discovery is that similar drugs express similar properties and thus behave similar when introduced to a biological system. Computer-based methods use that assumption to cluster drugs or find molecules that have the potential to have an increased therapeutic effect to already known drugs. As such, it is important to clearly define how *similarity* is to be judged computationally.


#### Defining drug similarity to a computer

There are multiple different categories and properties that can be used to define similarity between different drugs, and machine learning algorithms often use a combination of them to make better predictions. This process is called feature selection and can have significant impact on the reliability of a model. Similarly impactful is how eaxctly we compare the chosen properties, meaning which feature encoding methods we employ during the calculations of drug similarity. But how does a computer compare two drugs?

The most common way is to use fingerprints for the different properties and calculate similarity between them with the Tanimoto coefficient as described in Talktorial T004.

For the Tanimoto coefficient we will use the following formula during this talktorial:

$ TC(A,B) = |A \cap B| / |A \cup B| $

As shown the formula divides the intersection of fingerprints A and B with the number of features present in the union of both fingerprints and calculates the *similarity* as a float value between 0 and 1 with 0 representing no similarity at all and 1 indicating the two drugs are *identical* in the given property.

![Schematic example of 2D structural similarity calculation](./images/Fingerprint-based-molecular-similarity-approache-Modified.png)

*Figure 2:*
Example for the comparison between fingerprints of two molecules and the resulting Tanimoto coefficient. 

(Modified figure. Original taken from [*Molecular informatics.* (2021), __40__](https://www.researchgate.net/publication/351084895_Differential_Consistency_Analysis_Which_Similarity_Measures_can_be_Applied_in_Drug_Discovery))

#### 2D structural similarity
For 2D structural similarity, we will create MACCS fingerprints and use the Tanimoto coefficient to calculate our 2D similarity score. A more detailed explanation of MACCS fingerprints can be found in [__Talktorial T004__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T004_compound_similarity/talktorial.ipynb).


#### 3D structural similarity

For the calculation of the 3D structural similarity, we will use the [__Extended 3-Dimensional FingerPirnt__](https://e3fp.readthedocs.io/en/latest/index.html) (E3FP) package developped and provided by Keiser Lab at University of California San Francisco.

This package calculates 3D fingerprints by applying the logic behind extended connectivity fingerprints (ECFP) to a 3-dimensional space. E3FP represent, similar to MACCS fingerprints, the absence or presence of structural fragments within a molecule. These fragments retain their 3-dimensional orientation, meaning that while two fragments would be identical in the 2-dimensional space, the spacial orientation of the molecular bonds will be taken into account and declare them different enough to warrant seperate representations within the E3FP.

How the E3FP package calculates said fingerprints is explained in more detail in the corresponding [__research document__](https://pubs.acs.org/doi/full/10.1021/acs.jmedchem.7b00696) and will not be discussed here any further. Likewise the [__documentation__](https://e3fp.readthedocs.io/en/latest/index.html) of the package offers further inside into the used functions if necessary.

Once we have the E3FP we will then convert them into fingerprints as recognized by RDkit before using them to calculate the Tanimoto Coefficients for our 3D similarity matrix

#### Interaction profile similarity

Like 2D structural similarity, we will use the fingerprint method here as well and calculate the Tanimoto coefficient. For this, our vector will be the size of the number of drugs in our database, each position filled with a 1 if the drug has a known DDI with the drug corresponding to the respective cell and 0 otherwise.


### Support Vector Machines

Support Vector Machines (SVM) are supervised learning models, used to analyse data for classification and regression purposes. Initially developed by Vladimir Vapnik and colleagues during the 1990s, SVMs became more and more popular as one of the most robust prediction methods in machine learning.

Using a set of training examples and a binary label system, the SVM training algorithm maps each labelled training example to a point in space and then determines an optimal hyperplane to separate the two classes from each other. The hyperplane itself gets defined by a set of points from both classes, the so-called support vectors, and situated to maximize the margin, meaning the distance between itself and the chosen support vectors on both sides, to reliably separate the two classes.


![SVM schema](./images/optimal-hyperplane.png)

*Figure 3:* A schematic example of an optimal separating hyperplane.
Figure was taken from [Open Source Computer Vision](https://docs.opencv.org/4.x/d1/d73/tutorial_introduction_to_svm.html)

#### Soft Margin Classifer

In this talktorial, we will use a so-called __Soft Margin Classifier__, opposed to the stricter Hard Margin Classifier. This means that we allow both, outliers and potential misclassifications, in our training data when determining the position of the optimal separating hyperplane.

That way, we guarantee that the positioning of the separating hyperplane is much more robust and less prone to overfit our training data, making the whole model more reliable when classifying new data points.

![Hard vs Soft Margin Classifier](./images/Hard-vs-Soft-Margin.jpg)

*Figure 4:* A comparison between a Hard Margin Classifier (left) and a Soft Margin Classifier (right). Both optimal separating hyperplanes would be identical if the one blue outlier was not part of the data set. Figure was taken from [Mubaris NK](https://mubaris.com/posts/svm/).

However, to do so, the error parameter C which is used to punish the presence of outliers and misclassifications must be carefully chosen as to neither overfit our model nor to reduce specificity to the point that predictions will mean nothing. It is a very important step in handling the Bias/Variance Trade-off typical for machine learning models and unique to Soft Margin Classifiers.

![Effect of C](./images/Effect-of-soft-margin-constant-C-Modified.png)

*Figure 5:* Comparison of different values for error parameter C and their impact on determining the optimal separating hyperplane.

(Figure was modified. Original from [*PLoS computational biology.* (2008), __4__](https://www.researchgate.net/publication/23442384_Support_Vector_Machines_and_Kernels_for_Computational_Biology))

#### Kernel Trick

Notably the figures above all show examples of linear classification problems, meaning the two classes can be easily separated from each other with a linear hyperplane. However, this does not work when faced with a classification problem as depicted in the figure below. 

![CNon-linear Classification Problem](./images/nonlinear%20data.png)

*Figure 6:* An example of a non-linear classification problem.
Figure was taken from [Andrea Perlato](https://www.andreaperlato.com/theorypost/introduction-to-support-vector-machine/).

For problems like this, SVMs have to make use of the __kernel trick__ to change a non-linear classification problem into a linear one before finding the optimal separating hyperplane.

To do so, SVMs take the data points and artificially move them into a higher dimension with the help of a __kernel function__ which plots the data points in a manner to make them linearly separable. In this higher dimension, the SVM will then find an optimal separating hyperplane as described. In the last step, said hyperplane will get transformed back into the original dimension where it may no longer be linear.

A simple example of how the kernel trick works is detailed in Figure 7, where we artificially plot the datapoints from a one-dimensional space into a two-dimensional space with a polynomial kernel function, determine the hyperplane, and then transform the data back into the original space.

![Kernel Trick for Non-Linear Classification Problem](./images/Kernel%20trick.png)

*Figure 7:* An example of the kernel trick on one-dimensional drug-dosage data. The chosen kernel function is a polynomial function of degree two: $f(x) = dosage^2$.
Figure was taken from [Andrea Perlato](https://www.andreaperlato.com/theorypost/introduction-to-support-vector-machine/).

Since the kernel function does not actually transform the data into a higher dimension and only calculates the relationships between every pair of points as if they were in the higher dimension, the whole method is called the kernel __trick__.

As for the kernel function itself, there are many different ways of creating one. The kernel function introduced in this talktorial is a pairwise kernel method which uses a similarity matrix of drug pairs to perform the kernel trick. How exactly we calculate said function will be discussed in detail in the corresponding subsection of the *Workflow* chapter.

### DrugBank

#### History

[__DrugBank__](https://go.drugbank.com/about) is a comprehensive, free-to-access, online database containing information on drugs and drug targets. First established in 2006 by Dr David Wishart's lab at the University of Alberta as a project to grant academic researchers easier access to detailed, structured information about drugs, DrugBank grew in size and popularity thanks to the backing of various research organisations as well as government funding. Now in its 5th version (version 5.1.10 as of 1st April 2023), Drug Bank contains over 15,000 drug entries with almost 5,296 non-redundant protein sequences being linked to them. Each entry contains more than 200 data fields and combines detailed drug data (chemical, pharmacological, pharmaceutical, etc.) with comprehensive drug target data like sequence, structure, and pathway, collected from bioinformatics and cheminformatics resources. [cited from https://go.drugbank.com/about]

#### Drug Entries

Each entry in the DrugBank database is clearly structured, systemically ordered, and possesses a unique number through which each drug can be clearly identified and addressed within the database. The toolbar to the left provides an easy way to navigate through the different categories and subcategories for which information for the drug at hand is provided, allowing for intuitive and easy access to the sought after information.

![Screenshot of DrugBank Entry Aspirin](./images/DrugBank%20Aspirin.png)

*Figure 8:* 
Screenshot of the DrugBank Entry of [__Aspirin__](https://go.drugbank.com/drugs/DB00945). The red-framed toolbar provides easy access to different categories of drug information provided by DrugBank. The blue square marks the unique accession number through which each drug can be identified within the database.

It is to note that while categories for which no information is available might be removed from the navigation help, the presence of a certain field does not necessitate the presence of information. The example below nicely illustrates that while the category Pharmacology is present, for most of its subfields, information is not available.

![Screenshot of DrugBank Entry 1,2-Benzodiazepin](./images/DrugBank%201%202-Benzodiazepine.png)

*Figure 9:* 
Screenshot of the DrugBank Entry of [__1,2-Benzodiazepin__](https://go.drugbank.com/drugs/DB12537).

For the purpose of this talktorial the only relevant information needed from these entries are:

-	The __DrugBank Accession Number__
-	The __Type__ of the drug (Small Molecule or Biotech)
-	Which __Group(s)__ the drug belongs to
-	The 2D structure in the __SMILE__ format
-	The 3D structure in the __3D-SDF file__ format
-	The list of known __Drug Interactions__


The DrugBank accession number will serve as an identification system during computations and consists out of the letters DB followed by a five-digit number.

The Type and Group information is required to filter the database for only those drugs that qualify as small molecule drugs, and which were approved for use. 

The 2D and 3D structure information will be used to create fingerprints with RDkit and the E3FP package. 

Lastly the drug interactions will be used to create interaction profile fingerprints for each drug and also serve as the foundation for the assignment of labels to our training data.

<div class="alert alert-block alert-info">

__IMPORTANT__:

It is to note that unlike the *Type*, the *Groups* are not an either-or classification system since DrugBank incorporates information from various countries and a drug that might be approved in one can still be illicit or experimental in another. 

</div>

### Workflow: Similarity-Based SVM for DDI Prediction

The work-flow described here is for the most part taken from the paper Similarity-based SVM predictor for DDI. In this talktorial we use some simplifications of the algorithm to focus more on the underlying ideas rather than on tweaking the accuracy of the SVM to its best possible outcome. The paper can be found in the References and will not be cited again during the following paragraphs.

#### Feature Selection

Feature selection is a very big part of machine learning algorithms, especially when handling biological data where there are lots of features but only a limited amount of datapoints to use. How one chooses features depends on the availability of the data, the impact said feature might have on the problem at hand, and the computational effort to use it.

As such, the first thing we need to do is decide which features we want to focus on. As previously mentioned, three different features were chosen for the sake of this tutorial: 2D structure, 3D structure, and drug interactions.

2D structure is a commonly used feature in drug discovery since it allows for relatively easy first comparisons between drugs and proved reliable enough to build the basis for QSAR methods in pharmacy and drug Discovery. Moreover, a 2D molecular representation is generally available for drugs.

The 3D structure was chosen to take into account that spatial orientation plays a very important role in all forms of protein-ligand interactions, and just like 2D structure, 3D structure is often available as well.

Since drugs who already share a list of drugs which they interact with are much more likely to also interact with the drugs for which experimental data only exists for one of them, drug interactions were chosen as a feature that is both closely related to the purpose of the SVM we build and has to be sufficiently available for our future training dataset.


#### Data Selection

Now that we have our features, it is time to collect our data. To predict DDI with a SVM, we require labelled datapoints for training with sufficiently enough examples for both classes to prevent too much bias towards one category.

In our practical, we label drug pairs as having (1) or not having a DDI (0). For this example, neither the strength nor the type of DDI is of importance. If one wishes to pay more attention to a certain type of interaction, they need to adjust the labels accordingly.

We took our data from the DrugBank database version 5.1.10 manually and stored the information in different files to be used later for the algorithm in the Practical part.

The drugs we chose are all small molecule drugs that were approved for use in at least one country and which have SMILE strings and 3D-SDF files provided by DrugBank. If Drug Interactions are marked as 'Not Available' we will assume that these drugs do not show any noteworthy DDI.

#### Creating drug-drug similarity databases

With our dataset ready, it is now time to calculate similarity for each of the three features while looking at all possible drug combinations.

To create the 2D structure similarity database we use RDkit to transform the SMILE representation of each drug into a MACCS fingerprint. Once we have these, we calculate the Tanimoto Coefficient for all possible combinations of drugs and store these values in a similarity matrix for easy access.

For the 3D structure similarity database, we use, as mentioned before, the E3FP package to create fingerprints directly from the 3D-SDF files. Those will then be transformed into fingerprints as they are used by RDkit, before we calculate a similarity score for each possible drug combination.

Lastly, to create an interaction profile database we first create fingerprints the size of our number of drugs and then transform them into RDkit fingerprints. Afterwards, we use the Tanimoto Coefficient to calculate the similarity between the interaction profiles between two drugs.

All these fingerprints as well as the three final databases will be stored for quick check-ups during later calculations.

#### Creating drug-pair similarity matrices

Now that we have tables to look up similarity scores for each possible drug combination, it is time to build the drug-pair similarity matrices. The process is the same for all three features.

Since the SVM treats pairs of drugs as singular instances, the similarity matrix needs to contain similarity scores between drug-pairs rather than individual drugs as is the case in the previously created databases.

To calculate the entries in the similarity matrices for each feature we use the following formula:

$S\big((d_1,d_1') , (d_2,d_2')) = max \big( s(d_1,d_2) \cdot s(d_1',d_2') , s(d_2,d_1') \cdot s(d_1, d_2'))$

The letter $s$ denotes the similarity scores previously calculated and stored in the above-mentioned databases, while $S$ stands for the similarity score of the drug-pairs and will be saved in the new similarity matrix.

Although the values $s(d_1,d_1')$ and $s(d_1',d_1)$ are identical, there is a difference between $s(d_1,d_2) \cdot s(d_1',d_2')$ and $s(d_1',d_2) \cdot s(d_1,d_2')$. Therefore, we choose the maximum between those two values for $S$ to ensure we store the score for the comparison with maximum similarity between the individual drugs for better accuracy and predictability.

Notably, the resulting drug-pair similarity matrices only contains the pair $(d_1,d_1')$ once and no cell or row marked with $(d_1',d_1)$.

#### Creating the pairwise kernel

In this part we simply add the scores of all three drug-pair similarity matrices together, giving each feature the same importance. As a result, the values in the final similarity matrix we use as our pairwise kernel function range from 0 to 3 rather than 0 to 1 as in the underlying drug-pair similarity matrices.

#### Modelling and Evaluating the SVM

Before we build the actual model with the help of the Scikit-Learn package, we split our DDI instances into a training set of 90% and a testing set of 10% and adjust our kernel function accordingly to match our training data by removing the data rows and columns of our test set.

Using our labelled training dataset and our new kernel function we create the SVM. Since we use a soft margin classifier, we add an error penalty parameter C with C>0 which is called the soft-margin constant. We chose four different values for $C$ in the range of {0.01, 0.1, 1.0, 10.0} and built consequently four different SVM.

Afterwards, we predict the labels for our testing data and compare the predictions to the lables taken from DrugBank's Drug Interactions, counting how many predictions are 'correct' or 'false' and output the results.

## Practical

In this practical, we will create drug similarity databases for 2D similarity, 3D similarity and interaction profile. Said databases will then be used to create a combined drug-pair similarity matrix shich will serve as the kernel function for a soft-margin-classifier SVM to predict DDI.

In [27]:
from pathlib import Path
#import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#from matplotlib.lines import Line2D
#import matplotlib.patches as mpatches

# RDkit imports
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    Descriptors,
    Draw,
    PandasTools,
    MACCSkeys,
    rdFingerprintGenerator,
    AllChem,
)
# E3FP package imports
from e3fp.fingerprint.generate import (
    fp,
    fprints_dict_from_sdf,
)

# Scikit-Learn imports
from sklearn import (
    svm,
    datasets,
    metrics
)
from sklearn.model_selection import (
    train_test_split,
    cross_validate,
    cross_val_predict,
    cross_val_score,
)

In [28]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Reading input data from DrugBank

Our input data, which was retrieved from DrugBank's webserver, was saved in the form of an Excel Sheet. Each row is one drug entry, and each entry has four fields or columns. The columns are the following:

* DrugBank Acession Number
* SMILES
* Name of the Drug
* Comma-separated list of DrugBank Accession Numbers indicating DDI

We use the pandas package to parse the Excel file into a dictionary with the fields *id*, *smiles*, *name*, and *ddi* to represent the four columns.

In [29]:
drugs = pd.read_excel("data/Input Data.xlsx", header=None, names=["id", "smiles", "name", "ddi"], index_col=None)

# constants used throughout the tutorial
num_drugs = drugs["id"].size

### Create a drug-drug 2D molecular structure similarity database using RDkit

We create a function *create_2D_fp* which takes a SMILE and returns a MACCS fingerprint.

To do so, our function first builds a molecule from the provided SMILE and then uses said molecule to calculate the MACCS fingerprint. Both will be done by the provided functions from the RDKit package.

In [30]:
def create_2D_fp(smile):
    """
    takes a SMILE string and return a MACCS fingerprint
    """
    mol = Chem.MolFromSmiles(smile)
    fp = MACCSkeys.GenMACCSKeys(mol)
    return fp

The resulting MACCS fingerprints will be stored as a new category in our *drugs* dictionary.

In [31]:
drugs["maccs"] = [create_2D_fp(x) for x in drugs["smiles"]]

Once we have the fingerprints we create a similarity matrix containing the Tanimoto Scores calculated from the MACCS fingerprints.

Since a similarity matrix is always symmetrical, we save calculation time by starting the inner loop at position i. That way we compare a drug only to those following after, since a comparison to the ones before were already done in the previous runs of the loop.

For easier access to the scores saved within the similarity matrix, we create a panda DataFrame, naming the columns and rows after our drugs. That way, we can access the scores by using the names of the drugs rather than having to figure out at which specific index said drugs are within the list.

In [32]:
# calculate Tanimoto Coefficent of MACCS fingerprints
scores_2D = np.zeros((num_drugs,num_drugs))
for i in range(num_drugs):
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["maccs"][i],drugs["maccs"][j])
        scores_2D[i][j] = scores_2D[j][i] = score
drugs_2D_database = pd.DataFrame(scores_2D, drugs["id"], drugs["id"])
#print(drugs_2D_database)
#print(drugs_2D_database["Amoxicilline"]["Furosemide"])

### Create a drug-drug 3D pharmacophoric similarity database using the E3FP and RDkit

The first step we do is build a function *create_3D_fp* which takes the path to a 3D-SDF file and creates an E3FP fingerprint with the help of the identically named package.

For this, the E3FP package provides a helpful function *fprints_dict_from_sdf* which can calculate an E3FP fingerprint directly from the provided 3D-SDF file. This function takes the name of the SDF-file and an additional parameter *first=1* to define that we only want to look at the first conformer within the respective file.

After that, we convert the E3FP fingerprint into a fingerprint as recognized by RDkit.

In [33]:
def create_3D_fp(path):
    """
    takes a 3D-SDF file and returns an E3FP fingerprint 

    params: path = str
    path towards the 3D-SDF file
    """
    fp_dict = fprints_dict_from_sdf(path, first=1)
    fps = fp_dict[5][0]
    fp = fps.fold().to_rdkit()
    return fp

Again, we create fingerprints for all our drugs and save them in our dictionary. 

In [34]:
drugs["e3fp"] = [create_3D_fp(x) for x in "data/"+drugs["name"]+".sdf"]

2023-07-03 00:49:01,696|INFO|Generating fingerprints for 2244.
2023-07-03 00:49:01,721|INFO|Generated 1 fingerprints for 2244.
2023-07-03 00:49:01,724|INFO|Generating fingerprints for 441300.
2023-07-03 00:49:01,778|INFO|Generated 1 fingerprints for 441300.
2023-07-03 00:49:01,782|INFO|Generating fingerprints for 71158.
2023-07-03 00:49:01,803|INFO|Generated 1 fingerprints for 71158.
2023-07-03 00:49:01,805|INFO|Generating fingerprints for 9811704.
2023-07-03 00:49:01,957|INFO|Generated 1 fingerprints for 9811704.
2023-07-03 00:49:01,960|INFO|Generating fingerprints for 1978.
2023-07-03 00:49:02,023|INFO|Generated 1 fingerprints for 1978.
2023-07-03 00:49:02,025|INFO|Generating fingerprints for 71771.
2023-07-03 00:49:02,100|INFO|Generated 1 fingerprints for 71771.
2023-07-03 00:49:02,103|INFO|Generating fingerprints for 1981.
2023-07-03 00:49:02,216|INFO|Generated 1 fingerprints for 1981.
2023-07-03 00:49:02,220|INFO|Generating fingerprints for 54676537.
2023-07-03 00:49:02,288|INFO|G

Lastly we calculate and save the Tanimoto Coefficients for the E3FP fingerprints in a similarity matrix in the same manner as above.

In [35]:
# calculate Tanimoto coefficient
scores_3D = np.zeros((num_drugs,num_drugs))
for i in range(num_drugs):
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["e3fp"][i],drugs["e3fp"][j])
        scores_3D[i][j] = scores_3D[j][i] = score
drugs_3D_database = pd.DataFrame(scores_3D, drugs["id"], drugs["id"])
#print(drugs_3D_database)
#print(drugs_3D_database["Amoxicilline"]["Furosemide"])


### Create a drug-drug interaction profile database using RDkit

We create a function *create_IP_fp* to calculate the interaction profile fingerprint of a single drug.

Unlike in the above cases, instead of calling pre-existing functions, we have to create the interaction profile fingerprint from scratch. As described in the respetive Theory part, we create our own fingerprint-template and fill it with 0 and 1 accordingly.

The fingerprint-template is defined by the parameter *y* which gives us a list of drugs in a predetermined order. The fingerprint, represented as a numpy array, will have the same length as *y* and each position in this array correlates to a drug at the same position in *y*. This ensures that all created fingerprints are uniform as long as *y* does not change.

The resulting numpy array will then be converted into a bitstring and then a fingerprint as recognized by RDkit.

In [36]:
def create_IP_fp(x, y):
    """
    Takes 2 lists of strings and creates an interaction profile fingerprint

    params:
    x = list of drugs with known DDI
    y = list of drugs used as basis for the fingerprint
    """
    DDI_fp = np.zeros(y.size)
    for i in range(y.size):
        if y[i] in x:
            DDI_fp[i] = 1
    # change array into bitstring
    bitstring = "".join(DDI_fp.astype(str))
    # change bitstring to RDkit fingerprint
    fp = DataStructs.cDataStructs.CreateFromBitString(bitstring)
    return fp

We call this function for all elements within our drugs dataset and save the resulting fingerprints.

In [37]:
drugs["ipfp"] = [create_IP_fp(x,drugs["id"]) for x in drugs["ddi"]]

After that, we once again calculate and save the Tanimoto Coefficients in a similarity matrix as a pandas DataFrame with the names of the drugs to specify columns and rows of the matrix.

In [38]:
# calculate Tanimoto Coefficient
scores_IP = np.zeros((num_drugs,num_drugs))
for i in range(num_drugs):
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["ipfp"][i],drugs["ipfp"][j])
        scores_IP[i][j] = scores_IP[j][i] = score
drugs_IP_database = pd.DataFrame(scores_IP, drugs["id"], drugs["id"])
#print(drugs_IP_database)
#print(drugs_IP_database["Amoxicilline"]["Furosemide"])

### Construct a combined pairwise similarity matrix for the kernel function

We first create our actual training-data points for the SVM, meaning we create instances of DDI (aka drug-pairs) and label them with 1 for an interaction and 0 otherwise.

In [39]:
# create labelled drug-pair data points
drug_pairs_partner = []
drug_pairs_names = []
drug_pairs_labels = []
for i in range(num_drugs):
    for j in range(i+1, num_drugs):
        drug_pairs_partner.append([drugs["id"][i], drugs["id"][j]])
        drug_pairs_names.append(drugs["name"][i] + " + " + drugs["name"][j])
        if drugs["id"][j] in drugs["ddi"][i] or drugs["id"][i] in drugs["ddi"][j]:
            drug_pairs_labels.append(1)
        else:
            drug_pairs_labels.append(0)

#print(drug_pairs_partner)
#print(drug_pairs_names)
#print(drug_pairs_labels)

num_pairs = len(drug_pairs_names)

Now that we have our drug-pairs as single instances, we have to calculate similarity matrices for 2D, 3D, and interaction profile similarity and save them.

In [40]:
# create pairwise similarity matrices
pair_scores_2D = np.zeros((num_pairs,num_pairs))
pair_scores_3D = np.zeros((num_pairs,num_pairs))
pair_scores_IP = np.zeros((num_pairs,num_pairs))
for i in range(num_pairs):
    for j in range(i, num_pairs):
        d1 = drug_pairs_partner[i]
        d2 = drug_pairs_partner[j]
        # calculate 2D pairwise similarity
        score1 = np.dot(drugs_2D_database[d1[0]][d2[0]] , drugs_2D_database[d1[1]][d2[1]])
        score2 = np.dot(drugs_2D_database[d1[1]][d2[0]] , drugs_2D_database[d1[0]][d2[1]])
        pair_scores_2D[i][j] = pair_scores_2D[j][i] = max(score1, score2)
        
        #calculate 3D pairwise similarity
        score1 = np.dot(drugs_3D_database[d1[0]][d2[0]] , drugs_3D_database[d1[1]][d2[1]])
        score2 = np.dot(drugs_3D_database[d1[1]][d2[0]] , drugs_3D_database[d1[0]][d2[1]])
        pair_scores_3D[i][j] = pair_scores_3D[j][i] = max(score1, score2)

        #calculate IP pairwise similarity
        score1 = np.dot(drugs_IP_database[d1[0]][d2[0]] , drugs_IP_database[d1[1]][d2[1]])
        score2 = np.dot(drugs_IP_database[d1[1]][d2[0]] , drugs_IP_database[d1[0]][d2[1]])
        pair_scores_IP[i][j] = pair_scores_IP[j][i] = max(score1, score2)

# save pairwise similarity databases for quick checkups
pair_2D_database = pd.DataFrame(pair_scores_2D, drug_pairs_names, drug_pairs_names)
pair_3D_database = pd.DataFrame(pair_scores_3D, drug_pairs_names, drug_pairs_names)
pair_IP_database = pd.DataFrame(pair_scores_IP, drug_pairs_names, drug_pairs_names)
#print(pair_IP_database)

The next step is to combine these three similarity matrices into one that will be used for the kernel function by performing a simple matrix addition.

In [41]:
combined_similarity_matrix = pair_2D_database + pair_3D_database + pair_IP_database

In [42]:
combined_similarity_matrix

Unnamed: 0,Aspirin + Abacavir,Aspirin + Acamprosate,Aspirin + Acarbose,Aspirin + Acebutolol,Aspirin + Aceclofenac,Aspirin + Acemetacin,Aspirin + Acenocoumarol,Aspirin + Benziodarone,Aspirin + Benzylpenicillin,Aspirin + Bepridil,...,Desvenlafaxine + Cholecalciferol,Desvenlafaxine + Cisapride,Desvenlafaxine + Amphetamine,Desvenlafaxine + Amyl Nitrite,Cholecalciferol + Cisapride,Cholecalciferol + Amphetamine,Cholecalciferol + Amyl Nitrite,Cisapride + Amphetamine,Cisapride + Amyl Nitrite,Amphetamine + Amyl Nitrite
Aspirin + Abacavir,3.000000,0.445389,0.609524,0.640335,0.772584,0.801610,0.683420,0.395382,0.729177,0.734696,...,0.182980,0.246319,0.305352,0.095773,0.198760,0.194130,0.077670,0.284634,0.093931,0.120369
Aspirin + Acamprosate,0.445389,3.000000,0.468112,0.613094,0.465688,0.479514,0.458279,0.404861,0.554216,0.281948,...,0.079757,0.110612,0.108409,0.114941,0.086620,0.077953,0.084203,0.107962,0.115223,0.083186
Aspirin + Acarbose,0.609524,0.468112,3.000000,0.991841,0.898018,0.781938,0.751327,0.560429,0.763201,0.585270,...,0.282001,0.287635,0.256554,0.306619,0.283767,0.217588,0.183875,0.247272,0.288974,0.241609
Aspirin + Acebutolol,0.640335,0.613094,0.991841,3.000000,0.990330,0.948382,0.925766,0.535153,0.850612,1.254789,...,0.366656,0.437728,0.419276,0.235098,0.379933,0.278310,0.159617,0.415895,0.226217,0.180046
Aspirin + Aceclofenac,0.772584,0.465688,0.898018,0.990330,3.000000,1.630169,0.900549,0.574863,0.795383,0.692873,...,0.226046,0.390838,0.368951,0.181740,0.252595,0.220404,0.114998,0.351631,0.178231,0.146324
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
Cholecalciferol + Amphetamine,0.194130,0.077953,0.217588,0.278310,0.220404,0.205427,0.196075,0.083083,0.137448,0.301382,...,0.932854,0.379479,0.939104,0.245960,0.663581,3.000000,0.444940,0.803030,0.163681,0.541008
Cholecalciferol + Amyl Nitrite,0.077670,0.084203,0.183875,0.159617,0.114998,0.109826,0.135929,0.069316,0.121056,0.171560,...,0.496010,0.220297,0.164338,0.939104,0.576809,0.444940,3.000000,0.154237,0.803030,0.597291
Cisapride + Amphetamine,0.284634,0.107962,0.247272,0.415895,0.351631,0.363012,0.302794,0.097821,0.242336,0.366744,...,0.358707,0.932854,1.257953,0.247396,0.597291,0.803030,0.154237,3.000000,0.444940,0.576809
Cisapride + Amyl Nitrite,0.093931,0.115223,0.288974,0.226217,0.178231,0.190175,0.191928,0.091763,0.196902,0.185998,...,0.314005,0.496010,0.232432,1.257953,0.541008,0.163681,0.803030,0.444940,3.000000,0.663581


### Model and evaluate the SVM

Now it is time to build the Support Vector Machine. For this, we will use the provided functions from the Scikit-Learn package, a python package used to build different machine learning models. More information can be aquired from the documentation of [__Scikit-Learn__](https://scikit-learn.org/stable/index.html).


First thing we have to do is split our data into training and testing sets. The parameter *test_size* determines which percentile of our data will be used for testing and which will be used for training. In the code below, we take a tenth of the DDI instances as our test data.

In [43]:
training_pairs, test_pairs, training_labels, test_labels = train_test_split(
    drug_pairs_names, drug_pairs_labels, test_size = 0.1, random_state=0)

With our split dataset, we also have to split our similarity matrix into one containing only the combinations of our *training_pairs* and one containing rows of our *test_pairs* and columns of our *training_pairs*.

For this we create true copies of the *combined_similarity_matrix* and then remove the rows and columns we do not need respectively.

Once this is done, we change the pandas DataFrames into ordinary matrices, removing the labels for columns and rows so we can use them as input into the SVM.

In [44]:
# create the similarity matrix used for training the SVM
training_matrix = combined_similarity_matrix.copy()
# remove all rows and columns labeled with names of test_set
for i in range(len(test_pairs)):
    # inplace=True as to not create a copy
    # axis=0 drop rows    axis=1 drop columns
    training_matrix.drop(test_pairs[i], axis=0, inplace=True) 
    training_matrix.drop(test_pairs[i], axis=1, inplace=True)

# create feature vectors for the prediction of testing data
test_matrix = combined_similarity_matrix.copy()
# remove all columns labeled with names of test_pairs
for i in range(len(test_pairs)):
    test_matrix.drop(test_pairs[i], axis=1, inplace=True)
# remove all rows labeled with names of training_pairs
for i in range(len(training_pairs)):
    test_matrix.drop(training_pairs[i], axis=0, inplace=True)

# turn pandas DataFrame to ordinary matrix (remove labels)
kernel_matrix = training_matrix.to_numpy()
test_data = test_matrix.to_numpy()

Since we want to use a similarity matrix as a kernel function, we have to declare during the creation of the SVM that we will use a precomputed kernel. This is done by setting the parameter *kernel* of the *svm.SVC* function to 'precomputed' instead of keeping it on 'linear' per default. Then, because we use a Soft Margin Classifier, we have to set the parameter *C* which determines the punishment score for missclassifications and outliers.

To show the influence of *C* on the model, we create four different SVM, each with a different value for *C*.
* classifier_1 : C=0.01
* classifier_2 : C=0.1
* classifier_3 : C=1.0
* classifier_4 : C=10.0

In [45]:
classifier_1 = svm.SVC(kernel='precomputed', C=0.01)
classifier_2 = svm.SVC(kernel='precomputed', C=0.1)
classifier_3 = svm.SVC(kernel='precomputed', C=1.0)
classifier_4 = svm.SVC(kernel='precomputed', C=10.0)

With setup done, we next train our four SVM with the *fit* function giving them the kernel matrix as the first argument and the list of labels for our training DDI instances as the second one.

In [46]:
# train the SVM
classifier_1.fit(kernel_matrix,training_labels)
classifier_2.fit(kernel_matrix,training_labels)
classifier_3.fit(kernel_matrix,training_labels)
classifier_4.fit(kernel_matrix,training_labels)

Now that we have our trained SVM, we use our test-data for predictions.

In [26]:
def predict(classifier, X, y):
    """
    Takes a SVM classifier, test_data X, correct classfification labels y
    and predicts the class the test_data belongs to. After that the algorithm
    compares the predicted labels to the true labels and counts how many had
    been predicted correctly and how many had been predicted wrongly before
    outputting them.

    params:
        classifier = a SVM classifier
        X = set of test data to be fed to the SVM classifier for prediction
        y = array containing the correct classification labels of X
    """

    predictions = classifier.predict(X)
    #print(predictions)
    correct_predictions = 0
    false_predictions = 0
    for i in range(len(predictions)):
        if predictions[i] == y[i]:
            correct_predictions += 1
        else:
            false_predictions += 1

    print("correct_predictions = " + str(correct_predictions))
    print("false_predictions = " + str(false_predictions))

print("Classifier C=0.01:")
predict(classifier_1, test_data, test_labels)
print("Classifier C=0.1:")
predict(classifier_2, test_data, test_labels)
print("Classifier C=1.0:")
predict(classifier_3, test_data, test_labels)
print("Classifier C=10.0:")
predict(classifier_4, test_data, test_labels)

Classifier C=0.01:
correct_predictions = 72
false_predictions = 51
Classifier C=0.1:
correct_predictions = 66
false_predictions = 57
Classifier C=1.0:
correct_predictions = 64
false_predictions = 59
Classifier C=10.0:
correct_predictions = 59
false_predictions = 64


## Discussion

Wrap up the talktorial's content here and discuss pros/cons and open questions/challenges.

### Results

We created four different SVM for a similarity-based prediction of DDI. Each model took the same input and output data, but differed in the value assigned to the punishment score *C*.

### Possible Improvements to the algorithm

This talktorial is clearly only an example on how we can use a similarity matrix as a kernel function for predicting DDI and as such has several points where improvements can be made.

First of all, the bigger the dataset the more reliable the SVM classifier becomes. We only used 100 drugs here, which certainly creates a large number of possible combinations, but there are far more drugs out on the market that only using 100 of them to train the model leaves us with a substantial lack of information.

The next aspect that can be improved - as already mentioned in the Theory part of this talktorial - is which features are chosen and how we handle them computationally. Using more varied properties of drugs to determine similarity can improve the accuracy of predictions, and fingerprints are one of the more basic ways of translating chemical information into data that algorithms can work with. There are other methods as well or various intermediate steps to prepare the input data in advance that can have an effect on the end results. Likewise, the Tanimoto Coefficient is not the only distance metric that can be used to compare between fingerprints and each metric has their own strengths and weaknesses.

The similarity matrix in this talktorial is fairly straight-forward. We simply performed matrix addition for the pair-wise similarity matrices of the different features. However, if one wants to put more emphasis on one feature over another the different matrices could be weighted by scalar multiplication before being added to each other. Thus, valuing 3D similarity more than 2D similarity for example.

All the above-mentioned improvements are at the discreation of the one setting up the model and depend highly on which input is available and what preferences one has as well as computational capacity.

However, an improvement to the above algorithm of the Practical part that should be taken into account is the usage of cross-validation during the fitting process of the SVM classifier. Cross-validation is a good way to prevent overfitting one specific set of training data and allows to see how well the model does overall on different training data. We chose to not show this during this talktorial because we wanted to focus more on the creation and usage of a similarity matrix for SVM modelling rather than on machine modelling etiquette.


## Quiz


1. Which types of DDI do exist?
2. Why should you employ a Soft Margin Classifier for the SVM?
3. How can one create a pairwise similarity matrix?

<div class="alert alert-block alert-info">

<b>Useful checks at the end</b>: 
    
<ul>
<li>Clear output and rerun your complete notebook. Does it finish without errors?</li>
<li>Check if your talktorial's runtime is as excepted. If not, try to find out which step(s) take unexpectedly long.</li>
<li>Flag code cells with <code># TODO: CI</code> that have deterministic output and should be tested within our Continuous Integration (CI) framework.</li>
</ul>

</div>