<div class="alert alert-block alert-info">

<b>Thank you for contributing to TeachOpenCADD!</b>

</div>

<div class="alert alert-block alert-info">

<b>Set up your PR</b>: Please check out our <a href="https://github.com/volkamerlab/teachopencadd/issues/41">issue</a> on how to set up a PR for new talktorials, including standard checks and TODOs.

</div>

# T04 · Predicting Drug-Drug Interactions using SVM

Authors:

- Vanessa Siegel, 2023, CADD Seminar, Centre for Bioinformatics
- Floriane Odje, [Volkamer Lab](https://volkamerlab.org/)
- Prof. Dr. Andrea Volkamer, [Volkamer Lab](https://volkamerlab.org/)

## Aim of this talktorial

This talktorial introduces and explores the subject of drug-drug interactions (DDI) and their different types, paying special attention to the concepts of antagonism, additivity, and synergism. This will be followed by a closer look at Support Vector Machines (SVM) that use soft margin classifiers and then move towards a more detailed explanation of how to use a combined similarity matrix as a pairwise kernel function to solve the non-linear classification problem of predicting new DDI by comparing them to already known DDI.

To build the combined similarity matrix, we will look at 2D and 3D structural similarity as well as the similarity between interaction profiles and create databases for each of them. The dataset used during the practical part of this talktorial will be retrieved from [__DrugBank__](www.drugbank.ca) and filtered to only contain small molecule drugs that are annotated as approved for medical use in at least one country and for which 2D and 3D structural data is available.


### Contents in *Theory*

-	Drug-Drug Interactions
    - Importance of drug-drug interactions
    - Drug-drug interaction types
-	Drug Similarity
    - Defining drug similarity to a computer
    - 2D similarity
    - 3D similarity
    - Interaction profile similarity
-	Support Vector Machines
    - Soft Margin Classifier
    - Kernel Trick
-	DrugBank
    - History
    - Drug Entries
-	Workflow: Similarity-Based SVM for DDI Prediction
    - Feature Selection
    - Data Selection
    - Creating drug-drug similarity databases
    - Creating drug-pair similarity matrices
    - Creating the pairwise kernel
    - Modelling and evaluating the SVM


### Contents in *Practical*

-	Reading input data from DrugBank
-	Create a drug-drug 2D molecular structure similarity database using RDKit
-	Create a drug-drug 3D pharmacophoric similarity database using E3FP and RDKit
-	Create a drug-drug interaction profile similarity database using RDKit
-	Construct a combined pairwise similarity matrix for the kernel function
-	Model and evaluate the SVM with Scikit-Learn


### References

* Similarity-based SVM predictor for DDI: [*J. Clin. Pharm. Ther.* (2018), __44(2)__, 268-275](https://pubmed.ncbi.nlm.nih.gov/30565313/)
* Explanation of E3FP: [*J. Med. Chem.* (2017), __60__, 7393-7409](https://pubs.acs.org/doi/full/10.1021/acs.jmedchem.7b00696)
* DrugBank: [*Nucleic Acids Res.* (2017), __8__](https://pubmed.ncbi.nlm.nih.gov/29126136/)
* [__Introduction__](https://www.andreaperlato.com/theorypost/introduction-to-support-vector-machine/) to SVMs
* Fingerprints and Drug Similarity: [__Talktorial 004__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T004_compound_similarity/talktorial.ipynb)
* RDKit package [__documentation__](https://www.rdkit.org/docs/index.html)
* E3FP package [__documentation__](https://e3fp.readthedocs.io/en/latest/index.html)
* Scikit-Learn package [__Documentation__](https://scikit-learn.org/stable/index.html)

## Theory

### Drug-Drug Interactions

#### Importance of drug-drug interactions


Clinical investigations and research in the biomedical field showed that to treat more complex diseases the administration of just one drug is often not enough. Diseases like HIV, cancer, or kidney failure, to name a few, often require a combination of drugs to achieve satisfactory results or improvements in the patient's health. However, the simultaneous use of multiple drugs often leads to the occurrence of drug-drug interactions (DDI).

DDIs are caused when one drug interferes with another in one or more stages of its lifetime circle in the body and through that influences the effectiveness of said drug. This means they either cause an unexpected medical effect or creates an unexpected but measurable difference in the two drugs' concentration in the patient's bloodstream. 

Notably, it does not matter if said influence or effect is beneficial or harmful to the biological system to be classified as a DDI. Thus, while DDI can be taken advantage of to increase a therapeutic effect, they are also a major cause of unwanted or unexpected adverse side effects in patients. Therefore, extensive knowledge about potential DDI is crucial in medical care.

From ensuring that the patient doesn't take drugs which are known to negatively affect each other (e.g., causing adverse reactions, making one drug unusable, potentiating side effects to a severely harmful degree), over preventing existing medical conditions from worsening, up to taking advantage of synergistic DDI to give better treatment, knowledge of DDI has a wide field of applications. As such analysing new drugs or drug combinations for potential DDI is an important aspect of the advancement of personalized medicine.

Since finding and verifying DDI is a costly process due to the amount of in vitro and in vivo experiments, computational means started to develop to filter the large number of potential drug combinations for those who show a high possibility for expressing DDI. Many of these computational methods use machine-learning techniques, such as the one discussed in this talktorial in more detail.

#### Drug-drug interaction types

Drug-drug interactions are commonly classified by the cause of their occurrence, meaning they are either caused by pharmacokinetic (PK) or pharmacodynamic (PD) interactions.

__Pharmacokinetic DDIs__ occur when drug A influences drug B's concentration in the bloodstream. It does not matter if it is the active component of drug A or one of the additives, added to assist in the delivery of the drug, that is the cause of this interaction. 
PK DDIs are separated into four categories depending on which stage of a drug's lifetime circle is affected: absorption, distribution, metabolism, or excretion (ADME).

__Pharmacodynamic DDIs__, on the other hand, occur when the pharmacological effect of drug A influences the pharmacological effect of drug B. This happens when A and B target similar or related biological pathways or targets. However, a PD DDI can also occur when the drugs' affected pathways seem to be completely unrelated, but their pharmacological effects still cause an unexpected medical observation.

PD DDIs get classified into three groups: 
-	__Antagonistic__: 
The combined effect caused by a drug combination is smaller than the sum of the pharmacological effects seen when each drug is given alone.
-	__Additive__:
The combined effect caused by a drug combination is the sum of the pharmacological effects seen when each drug is given alone.
-	__Synergistic__:
The combined effect caused by a drug combination is greater than the sum of the pharmacological effects seen when each drug is given alone.


![Comparison Pharmacodynamic DDI Types](./images/PD_DDI.jpg)

*Figure 1:* Graphical representation of the three types of pharmacodynamic drug-drug interactions

It is important to note that a DDI can be of both types, signifying that this way of classification is to be seen more as a widely used guideline rather than a set of hard-split categories. Similarly, the words antagonistic, additive, synergistic, and their synonyms are sometimes used to also categorize PK DDI, especially in the medical field where the distinction between PK and PD may not be of importance.

<div class="alert alert-block alert-info">

__IMPORTANT__:

For the remainder of this talktorial, unless specified otherwise, DDI will not be differentiated into PK or PD. Likewise, the terms antagonistic, additive, and synergistic – if applied – will be used to describe all DDI as smaller, equal, and greater than the sum of their __therapeutic effect__ respectively to not distract from the actual topic of this talktorial.

</div>

### Drug Similarity

A base assumption in drug discovery is that similar drugs express similar properties and thus behave similarly when introduced to a biological system. Computer-based methods use that assumption to cluster drugs or find molecules that have the potential to have an increased therapeutic effect compared to already known drugs. As such, it is important to clearly define how *similarity* is to be judged computationally.

#### Defining drug similarity to a computer

Multiple different categories and properties can be used to define a similarity between different drugs, and machine learning algorithms often use a combination of them to make better predictions. This process is called feature selection and can have a significant impact on the reliability of a model. Likewise, how exactly we compare the chosen properties, meaning which feature encoding methods we employ during the calculations of drug similarity, can have an impact on model reliability. But how does a computer compare two drugs?

The most common way is to use fingerprints for the different properties and calculate the similarity between them with the Tanimoto coefficient as described in [__Talktorial T004__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T004_compound_similarity/talktorial.ipynb).

For the Tanimoto coefficient, we will use the following formula during this talktorial:

$ TC(A,B) = |A \cap B| / |A \cup B| $

As shown the formula divides the intersection of fingerprints A and B with the number of features present in the union of both fingerprints and calculates the *similarity* as a float value between 0 and 1 with 0 representing no similarity at all and 1 indicating the two drugs are *identical* in the given property.

![Schematic example of 2D structural similarity calculation](./images/Fingerprint_based_molecular_similarity_Modified.png)

*Figure 2:*
Example for the comparison between fingerprints of two molecules and the resulting Tanimoto coefficient. 
(Modified figure. The original was taken from [*Molecular informatics.* (2021), __40__](https://www.researchgate.net/publication/351084895_Differential_Consistency_Analysis_Which_Similarity_Measures_can_be_Applied_in_Drug_Discovery))

#### 2D structural similarity
For 2D structural similarity, we will create MACCS fingerprints and use the Tanimoto coefficient to calculate our 2D similarity score. A more detailed explanation of MACCS fingerprints can be found in [__Talktorial T004__](https://github.com/volkamerlab/teachopencadd/blob/master/teachopencadd/talktorials/T004_compound_similarity/talktorial.ipynb).


#### 3D structural similarity

For the calculation of the 3D structural similarity, we will use the [__Extended 3-Dimensional FingerPirnt__](https://e3fp.readthedocs.io/en/latest/index.html) (E3FP) package developed and provided by Keiser Lab at University of California San Francisco.

This package calculates 3D fingerprints by applying the logic behind extended connectivity fingerprints (ECFP) to a 3-dimensional space. E3FP represent, similar to MACCS fingerprints, the absence or presence of structural fragments within a molecule. These fragments retain their 3-dimensional orientation, meaning that while two fragments would be identical in the 2-dimensional space, the spatial orientation of the molecular bonds will be taken into account and thus consider them different enough to warrant separate representations within the E3FP.

How the E3FP package calculates said fingerprints is explained in more detail in the corresponding [__research document__](https://pubs.acs.org/doi/full/10.1021/acs.jmedchem.7b00696) and will not be discussed here any further. Likewise, the [__documentation__](https://e3fp.readthedocs.io/en/latest/index.html) of the package offers further inside into the used functions if necessary.

Once we have the E3FP we will then convert them into fingerprints as recognized by RDkit before using them to calculate the Tanimoto Coefficients for our 3D similarity matrix

#### Interaction profile similarity

Like 2D structural similarity, we will use the fingerprint method here as well and calculate the Tanimoto coefficient. For this, our vector will be the size of the number of drugs in our database, each position filled with a 1 if the drug has a known DDI with the drug corresponding to the respective cell and 0 otherwise.


### Support Vector Machines

Support Vector Machines (SVM) are supervised learning models, used to analyse data for classification and regression purposes. Initially developed by Vladimir Vapnik and colleagues during the 1990s, SVMs became more and more popular as one of the most robust prediction methods in machine learning.

Using a set of training examples and a binary label system, the SVM training algorithm maps each labelled training example to a point in space and then determines an optimal hyperplane to separate the two classes from each other. The hyperplane itself gets defined by a set of points from both classes, the so-called support vectors, and is situated to maximize the margin, meaning the distance between itself and the chosen support vectors on both sides, to reliably separate the two classes.


![SVM schema](./images/optimal_hyperplane.png)

*Figure 3:* A schematic example of an optimal separating hyperplane.
(The figure was taken from [Open Source Computer Vision](https://docs.opencv.org/4.x/d1/d73/tutorial_introduction_to_svm.html))

#### Soft Margin Classifier

In this talktorial, we will use a so-called __Soft Margin Classifier__, as opposed to the stricter Hard Margin Classifier. This means that we allow both, outliers and potential misclassifications, in our training data when determining the position of the optimal separating hyperplane.

That way, we guarantee that the positioning of the separating hyperplane is much more robust and less prone to overfit our training data, making the whole model more reliable when classifying new data points.

![Hard vs Soft Margin Classifier](./images/Hard_vs_Soft_Margin.jpg)

*Figure 4:* A comparison between a Hard Margin Classifier (left) and a Soft Margin Classifier (right). Both optimal separating hyperplanes would be identical if the one blue outlier was not part of the data set. (The figure was taken from [Mubaris NK](https://mubaris.com/posts/svm/))

However, to do so, the error parameter C which is used to punish the presence of outliers and misclassifications must be carefully chosen to neither overfit our model nor to reduce specificity to the point that predictions will mean nothing. It is a very important step in handling the Bias/Variance Trade-off typical for machine learning models and unique to Soft Margin Classifiers.

![Effect of C](./images/Effect_of_soft_margin_constant_C_Modified.png)

*Figure 5:* Comparison of different values for error parameter C and their impact on determining the optimal separating hyperplane. (The figure was modified. The original is from [*PLoS computational biology.* (2008), __4__](https://www.researchgate.net/publication/23442384_Support_Vector_Machines_and_Kernels_for_Computational_Biology))

#### Kernel Trick

Notably, the figures above all show examples of linear classification problems, meaning the two classes can be easily separated from each other with a linear hyperplane. However, this does not work when faced with a classification problem as depicted in the figure below. 

![CNon-linear Classification Problem](./images/nonlinear_data.png)

*Figure 6:* An example of a non-linear classification problem. (The figure was taken from [Andrea Perlato](https://www.andreaperlato.com/theorypost/introduction-to-support-vector-machine/))

For problems like this, SVMs have to make use of the __kernel trick__ to change a non-linear classification problem into a linear one before finding the optimal separating hyperplane.

To do so, SVMs take the data points and artificially move them into a higher dimension with the help of a __kernel function__ which plots the data points in a manner to make them linearly separable. In this higher dimension, the SVM will then find an optimal separating hyperplane as described. In the last step, said hyperplane will get transformed back into the original dimension where it may no longer be linear.

A simple example of how the kernel trick works is detailed in Figure 7, where we artificially plot the data points from a one-dimensional space into a two-dimensional space with a polynomial kernel function, determine the hyperplane, and then transform the data back into the original space.

![Kernel Trick for Non-Linear Classification Problem](./images/Kernel_trick.png)

*Figure 7:* An example of the kernel trick on one-dimensional drug-dosage data. The chosen kernel function is a polynomial function of degree two: $f(x) = dosage^2$. (The figure was taken from [Andrea Perlato](https://www.andreaperlato.com/theorypost/introduction-to-support-vector-machine/))

Since the kernel function does not actually transform the data into a higher dimension and only calculates the relationships between every pair of points as if they were in the higher dimension, the whole method is called the kernel __trick__.

As for the kernel function itself, there are many different ways of creating one. The kernel function introduced in this talktorial is a pairwise kernel method which uses a similarity matrix of drug pairs to perform the kernel trick. How exactly we calculate said function will be discussed in detail in the corresponding subsection of the *Workflow* chapter.

### DrugBank

#### History

[__DrugBank__](https://go.drugbank.com/about) is a comprehensive, free-to-access, online database containing information on drugs and drug targets. First established in 2006 by Dr David Wishart's lab at the University of Alberta as a project to grant academic researchers easier access to detailed, structured information about drugs, DrugBank grew in size and popularity thanks to the backing of various research organisations as well as government funding. Now in its 5th version (version 5.1.10 as of 1st April 2023), Drug Bank contains over 15,000 drug entries with almost 5,296 non-redundant protein sequences being linked to them. Each entry contains more than 200 data fields and combines detailed drug data (chemical, pharmacological, pharmaceutical, etc.) with comprehensive drug targets data like sequence, structure, and pathway, collected from bioinformatics and cheminformatics resources. [cited from https://go.drugbank.com/about]

#### Drug Entries

Each entry in the DrugBank database is clearly structured, systemically ordered, and possesses a unique number through which each drug can be identified and addressed within the database. The toolbar to the left provides an easy way to navigate through the different categories and subcategories for which information on the drug at hand is provided, allowing for intuitive and easy access to the sought-after information.

![Screenshot of DrugBank Entry Aspirin](./images/DrugBank_Aspirin.png)

*Figure 8:* 
Screenshot of the DrugBank Entry of [__Aspirin__](https://go.drugbank.com/drugs/DB00945). The red-framed toolbar provides easy access to different categories of drug information provided by DrugBank. The blue square marks the unique accession number through which each drug can be identified within the database.

It is to note that while categories for which no information is available might be removed from the navigation help, the presence of a certain field does not necessitate the presence of information. The example below nicely illustrates that while the category Pharmacology is present, for most of its subfields, information is not available.

![Screenshot of DrugBank Entry 1,2-Benzodiazepin](./images/DrugBank_1_2-Benzodiazepine.png)

*Figure 9:* 
Screenshot of the DrugBank Entry of [__1,2-Benzodiazepin__](https://go.drugbank.com/drugs/DB12537).

For the purpose of this talktorial the only relevant information needed from these entries are:

-	The __DrugBank Accession Number__
-	The __Type__ of the drug (Small Molecule or Biotech)
-	Which __Group(s)__ the drug belongs to
-	The 2D structure in the __SMILE__ format
-	The 3D structure in the __3D-SDF file__ format
-	The list of known __Drug Interactions__


The DrugBank accession number will serve as an identification system during computations and consists of the letters 'DB' followed by a five-digit number.

The Type and Group information is required to filter the database for only those drugs that qualify as small molecule drugs, and which were approved for use. 

The 2D and 3D structure information will be used to create fingerprints with RDkit and the E3FP package.

Lastly, the drug interactions will be used to create interaction profile fingerprints for each drug and also serve as the foundation for the assignment of labels to our training data.

<div class="alert alert-block alert-info">

__IMPORTANT__:

It is to note that, unlike the *Type*, the *Groups* are not an either-or classification system since DrugBank incorporates information from various countries and a drug that might be approved in one can still be illicit or experimental in another. 

</div>

### Workflow: Similarity-Based SVM for DDI Prediction

The work-flow described here is for the most part taken from the paper [__Similarity-based SVM predictor for DDI__](https://pubmed.ncbi.nlm.nih.gov/30565313/). In this talktorial we use some simplifications of the algorithm to focus more on the underlying ideas rather than on tweaking the accuracy of the SVM to its best possible outcome. The paper can be found in the References and will not be cited again in the following paragraphs.

#### Feature Selection

Feature selection is a very big part of machine learning algorithms, especially when handling biological data where there are lots of features but only a limited amount of data points to use. How one chooses features depends on the availability of the data, the impact said feature might have on the problem at hand, and the computational effort to use it.

As such, the first thing we need to do is decide which features we want to focus on. As previously mentioned, three different features were chosen for the sake of this tutorial: 2D structure, 3D structure, and drug interactions.

2D structure is a commonly used feature in drug discovery since it allows for relatively easy first comparisons between drugs and proved reliable enough to build the basis for QSAR methods in pharmacy and drug Discovery. Moreover, a 2D molecular representation is generally available for drugs.

The 3D structure was chosen to take into account that spatial orientation plays a very important role in all forms of protein-ligand interactions, and just like 2D structure, 3D structure is often available as well.

Since drugs that already share a set of drug-interaction partners are much more likely to also interact with drugs for which experimental data only exists for one of them, drug interactions were chosen as a feature that is both closely related to the purpose of the SVM we build and has to be sufficiently available for our future training dataset.


#### Data Selection

Now that we have our features, it is time to collect our data. To predict DDI with an SVM, we require labelled data points for training with sufficiently enough examples for both classes to prevent too much bias towards one category.

In our practical, we label drug pairs as having (1) or not having a DDI (0). For this example, neither the strength nor the type of DDI is of importance. If one wishes to pay more attention to a certain type of interaction, they need to adjust the labels accordingly.

We took our data from the DrugBank database version 5.1.10 manually and stored the information in different files to be used later for the algorithm in the Practical part. Most information was compiled in a singular Excel file (DrugBank Accession Number, SMILES, drug names, and reported DDI), but the 3D structural information was saved in individual SDF files.

The drugs we chose are all small molecule drugs that were approved for use in at least one country and which have SMILE strings and 3D-SDF files provided by DrugBank. If Drug Interactions are marked as 'Not Available' we will assume that these drugs do not show any noteworthy DDI.

#### Creating drug-drug similarity databases

With our dataset ready, it is now time to calculate the similarity databases for each of the three features while looking at all possible drug combinations.

To create the 2D structure similarity database we use RDkit to transform the SMILE representation of each drug into a MACCS fingerprint. Once we have these, we calculate the Tanimoto Coefficient for all possible combinations of drugs and store these values in a similarity matrix for easy access.

For the 3D structure similarity database, we use, as mentioned before, the E3FP package to create fingerprints directly from the downloaded 3D-SDF files. Those will then be transformed into fingerprints as they are used by RDkit, before we calculate a similarity score for each possible drug combination.

Lastly, to create an interaction profile database we first create fingerprints the size of our number of drugs and then transform them into RDkit fingerprints. Afterwards, we use the Tanimoto Coefficient to calculate the similarity between the interaction profiles between two drugs.

All these fingerprints as well as the three final databases will be stored for quick check-ups during later calculations.

#### Creating drug-pair similarity matrices

Now that we have tables to look up similarity scores for each possible drug combination, it is time to build the drug-pair similarity matrices. The process is the same for all three features.

Since the SVM treats pairs of drugs as singular instances, the similarity matrix needs to contain similarity scores between drug pairs rather than individual drugs as is the case in the previously created databases.

To calculate the entries in the similarity matrices for each feature we use the following formula:

$S\big((d_1,d_1') , (d_2,d_2')) = max \big( s(d_1,d_2) \cdot s(d_1',d_2') , s(d_2,d_1') \cdot s(d_1, d_2'))$

The letter $s$ denotes the similarity scores previously calculated and stored in the above-mentioned databases, while $S$ stands for the similarity score of the drug pairs and will be saved in the new similarity matrix.

Although the values $s(d_1,d_1')$ and $s(d_1',d_1)$ are identical, there is a difference between $s(d_1,d_2) \cdot s(d_1',d_2')$ and $s(d_1',d_2) \cdot s(d_1,d_2')$. Therefore, we choose the maximum between those two values for $S$ to ensure we store the score for the comparison with maximum similarity between the individual drugs for better accuracy and predictability.

Notably, the resulting drug-pair similarity matrices only contains the pair $(d_1,d_1')$ once and no cell or row marked with $(d_1',d_1)$.

#### Creating the pairwise kernel

In this part we simply add the scores of all three drug-pair similarity matrices together, giving each feature the same importance. As a result, the values in the final similarity matrix we use as our pairwise kernel function range from 0 to 3 rather than 0 to 1 as in the underlying drug-pair similarity matrices.

#### Modelling and Evaluating the SVM

Before we build the actual model with the help of the Scikit-Learn package, we split our DDI instances into a training set of 90% and a testing set of 10% and adjust our kernel function accordingly to match our training data by removing the data rows and columns of our test set.

Using our labelled training dataset and our new kernel function we create the SVM. Since we use a soft margin classifier, we add an error penalty parameter C with C>0 which is called the soft-margin constant. We chose four different values for $C$ in the range of {0.01, 0.1, 1.0, 10.0} and built consequently four different SVM.

Afterwards, we predict the labels for our testing data and compare the predictions to the labels taken from DrugBank's Drug Interactions, counting how many predictions are 'correct' or 'false' and output the results.

## Practical

In this practical, we will create drug similarity databases for 2D similarity, 3D similarity and interaction profile similarity. Said databases will then be used to create a combined drug-pair similarity matrix which will serve as the kernel function for a soft-margin-classifier SVM to predict DDI.

In [36]:
from pathlib import Path
#import math

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
#from matplotlib.lines import Line2D
#import matplotlib.patches as mpatches

# RDkit imports
from rdkit import Chem, DataStructs
from rdkit.Chem import (
    Descriptors,
    Draw,
    PandasTools,
    MACCSkeys,
    rdFingerprintGenerator,
    AllChem,
)
# E3FP package imports
from e3fp.fingerprint.generate import (
    fp,
    fprints_dict_from_sdf,
)

# Scikit-Learn imports
from sklearn import (
    svm,
    datasets,
    metrics
)
from sklearn.model_selection import (
    train_test_split,
    cross_validate,
    cross_val_predict,
    cross_val_score,
)

In [37]:
HERE = Path(_dh[-1])
DATA = HERE / "data"

### Reading input data from DrugBank

Our input data, which was retrieved from DrugBank's webserver, was saved in the form of an Excel file. Each row represents one drug entry, and each entry has four fields or columns:

* DrugBank Accession Number
* SMILES
* Name of the Drug
* Comma-separated list of DrugBank Accession Numbers indicating DDI

We use the pandas package to parse the Excel file into a dictionary with the fields *id*, *smiles*, *name*, and *ddi* to represent the four columns.

In [38]:
drugs = pd.read_excel("data/Input Data.xlsx", header=None, names=["id", "smiles", "name", "ddi"], index_col=None)

# constants used throughout the tutorial
num_drugs = drugs["id"].size

### Create a drug-drug 2D molecular structure similarity database using RDkit

We create a function *create_2D_fp* which takes a SMILE and returns a MACCS fingerprint.

To do so, our function first builds a molecule from the provided SMILE and then uses said molecule to calculate the MACCS fingerprint. Both will be done by the provided functions from the RDKit package.

In [39]:
def create_2D_fp(smile):
    """
    takes a SMILE string and returns a MACCS fingerprint

    params:
    smile = smile string

    returns:
    fp = a MACCS fingerprint
    """
    mol = Chem.MolFromSmiles(smile)
    fp = MACCSkeys.GenMACCSKeys(mol)
    return fp

The resulting MACCS fingerprints will be stored as a new category in our *drugs* dictionary.

In [40]:
drugs["maccs"] = [create_2D_fp(x) for x in drugs["smiles"]]

Once we have the fingerprints we create a similarity matrix containing the Tanimoto Scores calculated from the MACCS fingerprints.

Since a similarity matrix is always symmetrical, we save calculation time by starting the inner loop at position i. That way we compare a drug only to those following after, since a comparison to the ones before has already been done in the previous runs of the loop.

For easier access to the scores saved within the similarity matrix, we create a panda DataFrame, naming the columns and rows after our drugs. That way, we can access the scores by using the names of the drugs rather than having to figure out at which specific index said drugs are within the list.

In [41]:
# calculate Tanimoto Coefficient of MACCS fingerprints
scores_2D = np.zeros((num_drugs,num_drugs))
for i in range(num_drugs):
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["maccs"][i],drugs["maccs"][j])
        scores_2D[i][j] = scores_2D[j][i] = score
drugs_2D_database = pd.DataFrame(scores_2D, drugs["id"], drugs["id"])

### Create a drug-drug 3D pharmacophoric similarity database using E3FP and RDkit

The first step we perform is build a function *create_3D_fp* which takes the path to a 3D-SDF file and creates an E3FP fingerprint with the help of the identically named package.

For this, the E3FP package provides a helpful function *fprints_dict_from_sdf* which can calculate an E3FP fingerprint directly from the provided 3D-SDF file. This function takes the name of the SDF file and an additional parameter *first=1* to define that we only want to look at the first conformer within the respective file.

After that, we convert the E3FP fingerprint into a fingerprint as recognized by RDkit.

In [42]:
def create_3D_fp(path):
    """
    takes a 3D-SDF file and returns an E3FP fingerprint 

    params:
    path = the path towards the 3D-SDF file

    returns:
    fp = an E3FP fingerprint
    """
    fp_dict = fprints_dict_from_sdf(path, first=1)
    fps = fp_dict[5][0]
    fp = fps.fold().to_rdkit()
    return fp

Again, we create fingerprints for all our drugs and save them in our dictionary. 

In [43]:
drugs["e3fp"] = [create_3D_fp(x) for x in "data/"+drugs["name"]+".sdf"]

2023-07-03 18:28:28,955|INFO|Generating fingerprints for 2244.
2023-07-03 18:28:29,043|INFO|Generated 1 fingerprints for 2244.
2023-07-03 18:28:29,050|INFO|Generating fingerprints for 441300.
2023-07-03 18:28:29,164|INFO|Generated 1 fingerprints for 441300.
2023-07-03 18:28:29,168|INFO|Generating fingerprints for 71158.
2023-07-03 18:28:29,210|INFO|Generated 1 fingerprints for 71158.
2023-07-03 18:28:29,216|INFO|Generating fingerprints for 9811704.


2023-07-03 18:28:29,536|INFO|Generated 1 fingerprints for 9811704.
2023-07-03 18:28:29,543|INFO|Generating fingerprints for 1978.
2023-07-03 18:28:29,653|INFO|Generated 1 fingerprints for 1978.
2023-07-03 18:28:29,658|INFO|Generating fingerprints for 71771.
2023-07-03 18:28:29,758|INFO|Generated 1 fingerprints for 71771.
2023-07-03 18:28:29,764|INFO|Generating fingerprints for 1981.
2023-07-03 18:28:29,937|INFO|Generated 1 fingerprints for 1981.
2023-07-03 18:28:29,937|INFO|Generating fingerprints for 54676537.
2023-07-03 18:28:30,061|INFO|Generated 1 fingerprints for 54676537.
2023-07-03 18:28:30,069|INFO|Generating fingerprints for 6237.
2023-07-03 18:28:30,173|INFO|Generated 1 fingerprints for 6237.
2023-07-03 18:28:30,179|INFO|Generating fingerprints for 5904.
2023-07-03 18:28:30,279|INFO|Generated 1 fingerprints for 5904.
2023-07-03 18:28:30,289|INFO|Generating fingerprints for 2351.
2023-07-03 18:28:30,404|INFO|Generated 1 fingerprints for 2351.
2023-07-03 18:28:30,417|INFO|Gener

Lastly, we calculate and save the Tanimoto Coefficients for the E3FP fingerprints in a similarity matrix in the same manner as above.

In [44]:
# calculate Tanimoto coefficient
scores_3D = np.zeros((num_drugs,num_drugs))
for i in range(num_drugs):
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["e3fp"][i],drugs["e3fp"][j])
        scores_3D[i][j] = scores_3D[j][i] = score
drugs_3D_database = pd.DataFrame(scores_3D, drugs["id"], drugs["id"])


### Create a drug-drug interaction profile similarity database using RDkit

We create a function *create_IP_fp* to calculate the interaction profile fingerprint of a single drug.

Unlike in the above cases, instead of calling pre-existing functions, we have to create the interaction profile fingerprint from scratch. As described in the respective Theory part, we create our own fingerprint template and fill it with 0 and 1 accordingly.

The fingerprint template is defined by the parameter *y* which gives us a list of drugs in a predetermined order. The fingerprint, represented as a numpy array, will have the same length as *y* and each position in this array correlates to a drug at the same position in *y*. This ensures that all created fingerprints are uniform as long as *y* does not change.

The resulting numpy array will then be converted into a bitstring and then a fingerprint as recognized by RDkit.

In [45]:
def create_IP_fp(x, y):
    """
    Takes 2 lists of strings and creates an interaction profile fingerprint fp the size of y
    and containing 1 or 0 depending if the drug corresponding to the position is found in the
    list x or not.

    params:
    x = list of drugs with known DDI
    y = list of drugs used as the basis for the fingerprint

    returns:
    fp = interaction profile fingerprint
    """
    DDI_fp = np.zeros(y.size)
    for i in range(y.size):
        if y[i] in x:
            DDI_fp[i] = 1
    # Change array into bitstring
    bitstring = "".join(DDI_fp.astype(str))
    # Change bitstring to RDKit fingerprint
    fp = DataStructs.cDataStructs.CreateFromBitString(bitstring)
    return fp

We call this function for all elements within our drugs dataset and save the resulting fingerprints.

In [46]:
drugs["ipfp"] = [create_IP_fp(x,drugs["id"]) for x in drugs["ddi"]]

After that, we once again calculate and save the Tanimoto Coefficients in a similarity matrix as a pandas DataFrame with the names of the drugs to specify columns and rows of the matrix.

In [47]:
# calculate Tanimoto Coefficient
scores_IP = np.zeros((num_drugs,num_drugs))
for i in range(num_drugs):
    for j in range(i, num_drugs):
        score = DataStructs.TanimotoSimilarity(drugs["ipfp"][i],drugs["ipfp"][j])
        scores_IP[i][j] = scores_IP[j][i] = score
drugs_IP_database = pd.DataFrame(scores_IP, drugs["id"], drugs["id"])

### Construct a combined pairwise similarity matrix for the kernel function

We first create our actual training data points for the SVM, meaning we create instances of DDI (aka drug pairs) and label them with 1 for an interaction and 0 otherwise.

In [48]:
# Create labelled drug-pair data points
drug_pairs_partner = []
drug_pairs_names = []
drug_pairs_labels = []
for i in range(num_drugs):
    for j in range(i+1, num_drugs):
        drug_pairs_partner.append([drugs["id"][i], drugs["id"][j]])
        drug_pairs_names.append(drugs["name"][i] + " + " + drugs["name"][j])
        if drugs["id"][j] in drugs["ddi"][i] or drugs["id"][i] in drugs["ddi"][j]:
            drug_pairs_labels.append(1)
        else:
            drug_pairs_labels.append(0)

num_pairs = len(drug_pairs_names)

Now that we have our drug pairs as single instances, we have to calculate similarity matrices for 2D, 3D, and interaction profile similarity and save them.

In [49]:
# create pairwise similarity matrices
pair_scores_2D = np.zeros((num_pairs,num_pairs))
pair_scores_3D = np.zeros((num_pairs,num_pairs))
pair_scores_IP = np.zeros((num_pairs,num_pairs))
for i in range(num_pairs):
    for j in range(i, num_pairs):
        d1 = drug_pairs_partner[i]
        d2 = drug_pairs_partner[j]
        # calculate 2D pairwise similarity
        score1 = np.dot(drugs_2D_database[d1[0]][d2[0]] , drugs_2D_database[d1[1]][d2[1]])
        score2 = np.dot(drugs_2D_database[d1[1]][d2[0]] , drugs_2D_database[d1[0]][d2[1]])
        pair_scores_2D[i][j] = pair_scores_2D[j][i] = max(score1, score2)
        
        #calculate 3D pairwise similarity
        score1 = np.dot(drugs_3D_database[d1[0]][d2[0]] , drugs_3D_database[d1[1]][d2[1]])
        score2 = np.dot(drugs_3D_database[d1[1]][d2[0]] , drugs_3D_database[d1[0]][d2[1]])
        pair_scores_3D[i][j] = pair_scores_3D[j][i] = max(score1, score2)

        #calculate IP pairwise similarity
        score1 = np.dot(drugs_IP_database[d1[0]][d2[0]] , drugs_IP_database[d1[1]][d2[1]])
        score2 = np.dot(drugs_IP_database[d1[1]][d2[0]] , drugs_IP_database[d1[0]][d2[1]])
        pair_scores_IP[i][j] = pair_scores_IP[j][i] = max(score1, score2)

# save pairwise similarity databases for quick checkups
pair_2D_database = pd.DataFrame(pair_scores_2D, drug_pairs_names, drug_pairs_names)
pair_3D_database = pd.DataFrame(pair_scores_3D, drug_pairs_names, drug_pairs_names)
pair_IP_database = pd.DataFrame(pair_scores_IP, drug_pairs_names, drug_pairs_names)

The next step is to combine these three similarity matrices into one that will be used for the kernel function by performing a simple matrix addition.

In [50]:
combined_similarity_matrix = pair_2D_database + pair_3D_database + pair_IP_database

### Model and evaluate the SVM

Now it is time to build the Support Vector Machine. For this, we will use the provided functions from the Scikit-Learn package, a python package used to build different machine-learning models. More information can be acquired from the documentation of [__Scikit-Learn__](https://scikit-learn.org/stable/index.html).


The first thing we have to do is split our data into training and testing sets. The parameter *test_size* determines which percentile of our data will be used for testing and which will be used for training. In the code below, we take a tenth of the DDI instances as our test data.

In [51]:
training_pairs, test_pairs, training_labels, test_labels = train_test_split(
    drug_pairs_names, drug_pairs_labels, test_size = 0.1, random_state=0)

With our split dataset, we also have to split our similarity matrix into one containing only the combinations of our *training_pairs* and one containing rows of our *test_pairs* and columns of our *training_pairs*.

For this, we create true copies of the *combined_similarity_matrix* and then remove the rows and columns we do not need respectively.

Once this is done, we change the resulting pandas DataFrames into ordinary matrices, removing the labels for columns and rows so we can use them as input into the SVM.

In [52]:
# create the similarity matrix used for training the SVM
training_matrix = combined_similarity_matrix.copy()
# remove all rows and columns labelled with names of test_set
for i in range(len(test_pairs)):
    # inplace=True as to not create a copy
    # axis=0 drop rows    axis=1 drop columns
    training_matrix.drop(test_pairs[i], axis=0, inplace=True) 
    training_matrix.drop(test_pairs[i], axis=1, inplace=True)

# Create feature vectors for the prediction of testing data
test_matrix = combined_similarity_matrix.copy()
# Remove all columns labelled with names of test_pairs
for i in range(len(test_pairs)):
    test_matrix.drop(test_pairs[i], axis=1, inplace=True)
# Remove all rows labelled with names of training_pairs
for i in range(len(training_pairs)):
    test_matrix.drop(training_pairs[i], axis=0, inplace=True)

# Turn pandas DataFrame to an ordinary matrix (remove labels)
kernel_matrix = training_matrix.to_numpy()
test_data = test_matrix.to_numpy()

Since we want to use a similarity matrix as a kernel function, we have to declare during the creation of the SVM that we will use a precomputed kernel. This is done by setting the parameter *kernel* of the *svm.SVC* function to 'precomputed' instead of keeping it on 'linear' per default. Then, because we use a Soft Margin Classifier, we have to set the parameter *C* which determines the punishment score for misclassifications and outliers.

To show the influence of *C* on the model, we create four different SVM, each with a different value for *C*.
* classifier_1 : C=0.01
* classifier_2 : C=0.1
* classifier_3 : C=1.0
* classifier_4 : C=10.0

In [53]:
classifier_1 = svm.SVC(kernel='precomputed', C=0.01)
classifier_2 = svm.SVC(kernel='precomputed', C=0.1)
classifier_3 = svm.SVC(kernel='precomputed', C=1.0)
classifier_4 = svm.SVC(kernel='precomputed', C=10.0)

With setup done, we next train our four SVM with the *fit* function giving them the kernel matrix as the first argument and the list of labels for our training DDI instances as the second one.

In [54]:
# train the SVM
classifier_1.fit(kernel_matrix,training_labels)
classifier_2.fit(kernel_matrix,training_labels)
classifier_3.fit(kernel_matrix,training_labels)
classifier_4.fit(kernel_matrix,training_labels)

Now that we have our trained SVM, we use our test data for predictions and output the results.

In [55]:
def evaluate(predictions, records):
    """
    Takes two arrays containing predicted and reported classification labels
    of a test set X respectively and compares them. The function counts how
    many had been predicted correctly and how many had been predicted wrongly
    before returning the values as an array.

    params:
        predictions = array containing the predicted classification labels of test_data
        records = array containing the recorded and correct classification labels of test_data

    returns:
    [correct_p, false_p] = number of correct and false predictions respectively compared to the labels in records
    """
    correct_p = 0
    false_p = 0
    for i in range(len(predictions)):
        if predictions[i] == records[i]:
            correct_p += 1
        else:
            false_p += 1
            
    return[correct_p, false_p]


# Classifier C=0.01
predictions_1 = classifier_1.predict(test_data)
row_1 = evaluate(predictions_1, test_labels)
report_1 = pd.DataFrame(metrics.classification_report(test_labels, predictions_1, output_dict=True))

# Classifier C=0.1
predictions_2 = classifier_2.predict(test_data)
row_2 = evaluate(predictions_2, test_labels)
report_2 = pd.DataFrame(metrics.classification_report(test_labels, predictions_2, output_dict=True))

# Classifier C=1.0
predictions_3 = classifier_3.predict(test_data)
row_3 = evaluate(predictions_3, test_labels)
report_3 = pd.DataFrame(metrics.classification_report(test_labels, predictions_3, output_dict=True))

# Classifier C=10.0
predictions_4 = classifier_4.predict(test_data)
row_4 = evaluate(predictions_4, test_labels)
report_4 = pd.DataFrame(metrics.classification_report(test_labels, predictions_4, output_dict=True))

# Create DataFrame to store results for viewing
results = pd.DataFrame([row_1, row_2, row_3, row_4], index=["C=0.01", "C=0.1", "C=1.0", "C=10.0"], columns = ["correct predictions", "false predictions"])

  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


## Discussion

### Results

We created four different SVM for a similarity-based prediction of DDI. Each model took the same input and output data but differed in the value assigned to the punishment score *C*. 

In [56]:
results

Unnamed: 0,correct predictions,false predictions
C=0.01,72,51
C=0.1,66,57
C=1.0,64,59
C=10.0,59,64


If we look just at the numbers of 'correctly' and 'falsely' predicted DDI we already see that choosing an appropriate value for *C* has a measurable impact on the classification of new data points.

On the first glance it would seem that *C=0.01* would be the best choice since we have the highest number of correctly predicted DDI in this case. However, if we look at the more detailed classification report below, we see that this is solely by chance and due to the distribution of examples within the test data.

In [57]:
report_1

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.0,0.585366,0.585366,0.292683,0.342653
recall,0.0,1.0,0.585366,0.5,0.585366
f1-score,0.0,0.738462,0.585366,0.369231,0.43227
support,51.0,72.0,0.585366,123.0,123.0


We can clearly see that *precision*, *recall* and *f1-score* are all 0.0 for the label 0 indicating no DDI. This means that our classifier sorted everything into class 1 and the number of correct predictions relies entirely on the fact that we have more pairs in our test data that have DDI than haven't.

This is an excellent example of how using a too small value for *C* can give rise to a bias towards one class over another. We allowed so many misclassifications and outliers that per default sorting everything into class 1 (has DDI) gives better results than making an actual effort at trying to separate our two classes. 

Let's compare this to our classifier which used *C=0.1*.

In [58]:
report_2

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.442308,0.605634,0.536585,0.523971,0.537913
recall,0.45098,0.597222,0.536585,0.524101,0.536585
f1-score,0.446602,0.601399,0.536585,0.524,0.537215
support,51.0,72.0,0.536585,123.0,123.0


Here we can see immediately that our test data points had been sorted into class 1 or class 0. Our accuracy, however, is barely above 50% which is just as well as flipping a coin.

We have a similar result for our classifier using *C=1.0* with only a small drop in accuracy of barely 0.017. This means that for our example there is very little difference between *C=0.1* and *C=1.0*.

In [59]:
report_3

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.4375,0.610169,0.520325,0.523835,0.538575
recall,0.54902,0.5,0.520325,0.52451,0.520325
f1-score,0.486957,0.549618,0.520325,0.518287,0.523637
support,51.0,72.0,0.520325,123.0,123.0


Let us now look at our last SVM, which used a the value 10.0 for the soft margin constant *C*.

In [60]:
report_4

Unnamed: 0,0,1,accuracy,macro avg,weighted avg
precision,0.4,0.568966,0.479675,0.484483,0.498907
recall,0.509804,0.458333,0.479675,0.484069,0.479675
f1-score,0.448276,0.507692,0.479675,0.477984,0.483056
support,51.0,72.0,0.479675,123.0,123.0


We immediately see a significant drop in accuracy, falling far below the 50% mark where tossing a coin is more likely to predict DDI correctly than our model. This shows that we overfitted our training data, resulting in bad predictions of new drug pair combinations. We ended up on the variance side of the Bias/Variance Tradeoff, making the model just as useless as the one where we sorted everything into category 1 by default.


Now that we found out that our SVM using the values *C=0.01* and *C=10.0* are unreliable due to drifting off towards either too much bias or too much variance, let us look at the remaining two models a little closer.


As stated before, both models have an accuracy just above 50% making them just a little better than tossing a coin when it comes to predicting whether a given drug pair has a DDI or not. There are multiple reasons as to why this is the case. From the small training data set, over the number of chosen features to compare, up to the way in which we calculated the similarity between individual drugs. It is likely that a combination of all of these is at fault for the less than stellar accuracy of our models.

However, we also have to be aware that our correct labels with which we compared our predictions are not necessarily a representation of the truth either.

Our labels were recovered from DrugBank, which in itself only provides information collected from reports from various sources which might be biased themselves. One source could declare two drugs to have a drug-drug-interaction due to a small deviation from what was expected, while another source might interpret the same deviation insignificant and close enough to an additive state to not warrant it being an actual DDI. Likewise, with technological improvements and an increased number of experiments researchers collect more and more valuable information which might indicate or prove that previously reported DDIs were the results of other influences independent of the drugs, or vise versa. Hence, the terms 'correct' and 'false' labels are to be taken with a grain of salt here, and only reflects the current state of knowledge as retrieved from DrugBank.

Regardless of the fact that our accuracy for our best performing model is only 0.537, we showed that an SVM using a similarity matrix as a kernel function can be used to predict possible DDI, even if the algorithm that was developed in the Practical part has a low reliability.

### Possible Improvements to the algorithm

This talktorial is clearly only an example of how we can use a similarity matrix as a kernel function for predicting DDI and as such has several points where improvements can be made.

First of all, the bigger the dataset the more reliable the SVM classifier becomes. We only used 50 drugs here, which did create 1225 possible combinations, but that pales compared to the more than 2,500 approved, small molecule drugs currently out on the market (stand: DrugBank, 3 July 2023). Thus, only using 50 of them to train and test the model leaves us with a substantial lack of information and potential bias.

The next aspect that can be improved - as already mentioned in the Theory part of this talktorial - is which features are chosen and how we handle them computationally. Using more varied properties of drugs to determine similarity can improve the accuracy of predictions, and fingerprints are one of the more basic ways of translating chemical information into data that algorithms can work with. There are other methods as well or various intermediate steps to prepare the input data in advance that can affect the results. Likewise, the Tanimoto Coefficient is not the only distance metric that can be used to compare fingerprints and each metric has its strengths and weaknesses.

The combined similarity matrix in this talktorial is fairly straightforward as well. We simply performed matrix addition for the pair-wise similarity matrices of the different features. However, if one wants to put more emphasis on one feature over another the different matrices could be weighted by scalar multiplication before being added to each other. Thus, valuing 3D similarity more than 2D similarity for example.

Choosing a more appropriate value for our Soft Margin Constant *C* via cross-validation can help as well in improving the accuracy of the algorithm.

All the above-mentioned improvements are at the discretion of the one setting up the model and depend highly on which input is available and what preferences one has as well as computational capacity.

However, an improvement to the above algorithm of the Practical part that should be taken into account is the usage of cross-validation during the fitting process of the SVM classifier. Cross-validation is a good way to prevent overfitting one specific set of training data and allows us to see how well the model does overall on different training data. We chose to not show this during this talktorial because we wanted to focus more on the creation and usage of a similarity matrix for SVM modelling rather than on machine modelling etiquette. Nevertheless, it is something to look into to improve the results of the SVM.

### Pros and Cons

Now that we discussed the results and possible algorithmic improvements, it is time to look at the pros and cons of predicting DDI with an SVM that uses a pairwise similarity matrix as a kernel function.

PROS:
+ __Easy to expand__: The most noticeable advantage of this algorithm is the simplicity with which more features can get added without having to rework the entire program. The pairwise kernel uses a combination of separately computed pairwise similarity matrices. It is easy enough to simply add another similarity matrix to the mix that considers similarity for another property of drugs, like drug-target interactions.
+ __High modularity__: In the same way, the high modularity of the algorithm allows tweaking the computation of one feature similarity without having to adjust the computation of the similarity scores of another feature. 
+ __Feature-Independence__: We can also weigh different features as more important than others, or use a mix of distance measurements to try and improve predictions.

CONS:
- __Lots of setup calculations__: Setting up an SVM with a similarity matrix as a pairwise kernel is not entirely intuitive and requires us a lot of previous calculations. Since all test data has to fit the structure of our similarity matrix before we can predict anything, we have to calculate a lot of fingerprints and similarity values both for the singular drugs as well as for the pairs. We also have to ensure that the resulting vector aligns with the structure of the kernel matrix or predictions will be either impossible or useless due to comparing the wrong values with each other.
- __Memory usage__: Furthermore, using so many similarity matrices for storage, checkups, training, and predictions costs a significant amount of memory. Efficient storage methods and exploiting the symmetry of similarity matrices can certainly reduce the amount of memory space needed, but it will still be a lot.
- __Data availability__: On that note, the comprehensiveness of the available data for the different drugs is very much a limiting factor for this type of model, be it limiting the number of drugs available for training or the number of features that can be compared for a fixed set of drugs.
- __Rerunning algorithm for new drugs__: Another big drawback of the algorithm as described above is that there is no easy way to predict DDI for new drugs without the need to rerun the whole algorithm. This is due to the previously mentioned necessity of test data to be of a certain format, but also because of issues arising with the construction of the interaction profile fingerprints. While MACCS and E3FP fingerprints are standardized, our interaction profile fingerprints are not. The way we construct them assumes that we have no new drugs to add and one comprehensive list and order of the drugs for which we calculate interaction profiles. Said order is the one used in the Exel file we take as input. Hence, to add a new drug, the entire program needs to be run again to ensure all interaction profile fingerprints are consistent and exhaustive.

Overall, the method to predict DDI via the usage of a pairwise similarity matrix as an SVM kernel has its good and bad sides like any other model and can, if carefully attuned, be a valuable tool to predict possible DDI. Especially when our fingerprints are all standardized. This would reduce the necessity to rerun the whole algorithm and allow for easier adding of new compounds for testing. Using a large set of drugs for the creation of the drug similarity databases can help as well. Especially when only a subset of drug pairs are of interest for comparisons since the calculation of the similarity matrices is where the bulk of computations happens. Once these matrices exist, training the SVMs and predicting DDIs is relatively quick.

## Quiz


1. Focusing on therapeutic effect, which types of DDI do exist?
2. Why should you employ a Soft Margin Classifier for the SVM?
3. How can one create a pairwise similarity matrix?

### Answers:

1: Antagonistic, additive and synergistic DDI.

2: To allow for outliers and misclassification which are common in biological data, as well as to handle other noise issues. It also helps with the Bias/Variance Trade-Off.

3: By comparing drug similarity between the individual drugs of two compared pairs and taking the maximum similarity value.
$S\big((d_1,d_1') , (d_2,d_2')) = max \big( s(d_1,d_2) \cdot s(d_1',d_2') , s(d_2,d_1') \cdot s(d_1, d_2'))$
