# DengueDrugRep: Drug repurposing for dengue using a biomedicine knowledge graph and graph neural networks

**Author:** [Sebastián Ayala Ruano][myweb]

This notebook is part of my capstone project for the Scientific Programming course from the [MSc in Systems Biology][sysbio] at [Maastricht University][maasuni].

<a href="https://colab.research.google.com/github/sayalaruano/DengueDrugRep/blob/main/Training_KGNN_models_Pykeen.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### About the project

[Dengue][dengue] is a viral infection transmitted to humans through the bite of *Aedes* mosquitoes. It is a neglected tropical disease that mainly affects poor populations without access to safe water, sanitation, and high-quality healthcare. There is currently no specific treatment for dengue, so care focuses on relieving pain symptoms. New drugs to treat this disease are therefore urgently needed.

The goal of this project is to predict new repurposed drugs for dengue using a biomedical knowledge graph and graph neural networks. A knowledge graph (KG) is a heterogeneous network with different types of nodes and edges, where nodes represent entities (e.g., drugs, diseases, genes) and edges represent semantic relationships between entities (e.g., drug-disease, drug-gene). Graph neural networks (GNNs) are a class of neural networks that can learn from graph data. The following figure illustrates the general structure of a knowledge graph neural network (KGNN):

<figure>
  <img src="../img/KGNN_pipeline.png" alt="Anatomy of a knowledge graph neural network"/>
  <figcaption><strong>Figure 1.</strong> Anatomy of a Knowledge Graph Neural Network. Obtained from <a href="https://kge-tutorial-ecai2020.github.io/">[1]</a>. </figcaption>
</figure>


The drug repurposing problem can be formulated as a link prediction task in a KG. The goal is to predict new drug-disease associations. For this project, four GNN algorithms, namely [PairRE][pairRE], [DistMult][distmult], [ERMLP][ermlp], and [TransR][transr], were trained to predict new drug-disease associations using the Drug Repurposing Knowledge Graph ([DRKG][drkg]). These algorithms are implemented in the [PyKEEN][pykeen] library.
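To make the link prediction framing concrete, here is a minimal sketch of how an embedding model can score and rank candidate triples. It uses the DistMult scoring function (the element-wise product of head, relation, and tail embeddings, summed over dimensions). The entity names, the relation name, and the 3-dimensional vectors are made up for illustration; they are not taken from DRKG or from a trained model.

```python
# Toy embeddings (hypothetical values, not learned from DRKG)
entity_emb = {
    "drug_A": [0.9, 0.1, 0.3],
    "drug_B": [-0.2, 0.8, 0.1],
    "drug_C": [0.4, 0.4, -0.5],
    "dengue": [1.0, 0.0, 0.5],
}
relation_emb = {"treats": [1.0, 0.5, 1.0]}


def distmult_score(head, relation, tail):
    """DistMult score: sum_i h_i * r_i * t_i (higher means more plausible)."""
    h, r, t = entity_emb[head], relation_emb[relation], entity_emb[tail]
    return sum(hi * ri * ti for hi, ri, ti in zip(h, r, t))


def rank_heads(relation, tail, candidates):
    """Rank candidate head entities for the query (?, relation, tail)."""
    scored = [(c, distmult_score(c, relation, tail)) for c in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Which toy drug is the most plausible head for (?, treats, dengue)?
ranking = rank_heads("treats", "dengue", ["drug_A", "drug_B", "drug_C"])
for drug, score in ranking:
    print(f"{drug}: {score:.2f}")
```

In a real run, the embeddings are learned during training, and the highest-ranked drugs that are not already linked to the disease become the repurposing candidates.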

### Dataset

The [DRKG][drkg] is a large-scale biomedical KG that integrates information from six existing databases: DrugBank, Hetionet, Global Network of Biomedical Relationships (GNBR), STRING, IntAct, and DGIdb. This KG contains 97,238 nodes belonging to 13 entity types (e.g., drugs, diseases, genes) and 5,874,257 triplets belonging to 107 edge types. The following figure shows a schematic representation of the DRKG:

<figure>
  <img src="../img/DRKG.png" alt="Schematic of the entity types and interactions in the DRKG"/>
  <figcaption><strong>Figure 2.</strong> Interactions in the DRKG. The number next to an edge indicates the number of relation-types for that entity-pair in the KG. Obtained from <a href="https://github.com/gnn4dr/DRKG">[2]</a>. </figcaption>
</figure>
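
Judging from the relation labels used later in this notebook, such as `DRUGBANK::treats::Compound:Disease`, the drug-disease relation names in DRKG appear to follow a `source::relation::HeadType:TailType` pattern. The sketch below splits such labels into their parts, assuming that pattern holds for the labels passed in. The helper `parse_relation` and the `RelationInfo` type are hypothetical names introduced here for illustration; labels that do not match the pattern raise an error instead of being guessed at.

```python
from collections import Counter
from typing import NamedTuple


class RelationInfo(NamedTuple):
    source: str
    relation: str
    head_type: str
    tail_type: str


def parse_relation(label: str) -> RelationInfo:
    """Split a label of the assumed form 'source::relation::Head:Tail'."""
    parts = label.split("::")
    if len(parts) != 3:
        raise ValueError(f"Unexpected relation label format: {label!r}")
    source, relation, type_pair = parts
    head_type, tail_type = type_pair.split(":", 1)
    return RelationInfo(source, relation, head_type, tail_type)


# Three of the drug-disease relation labels used in this notebook
labels = [
    "DRUGBANK::treats::Compound:Disease",
    "GNBR::T::Compound:Disease",
    "Hetionet::CtD::Compound:Disease",
]
parsed = [parse_relation(label) for label in labels]
per_source = Counter(info.source for info in parsed)

print(parsed[0])
print(per_source)
```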

### Project structure

This notebook is divided into the following sections:

0. [Work environment setup](#0-work-environment-setup)
1. [Training the KGNN models](#1-training-the-kgnn-models)
2. [Further analysis](#2-further-analysis)

### Acknowledgments and further details of the project

**Credits:** Part of the code for this project was inspired by the PyKEEN [tutorials][pykeen_tutorials], a [conference paper][confpaper], and this [GitHub repository][githubrepo].

You can find more details about the project in this [GitHub repository][githubrepo].

[dengue]: https://www.who.int/news-room/fact-sheets/detail/dengue-and-severe-dengue
[confpaper]: https://link.springer.com/chapter/10.1007/978-3-031-40942-4_8
[pykeen_tutorials]: https://github.com/pykeen/pykeen/tree/master/notebooks
[sysbio]: https://www.maastrichtuniversity.nl/education/master/systems-biology
[maasuni]: https://www.maastrichtuniversity.nl/
[myweb]: https://sayalaruano.github.io/
[githubrepo]: https://github.com/sayalaruano/DengueDrugRep
[pairRE]: http://arxiv.org/abs/2011.03798
[distmult]: https://arxiv.org/abs/1412.6575
[transr]: http://www.aaai.org/ocs/index.php/AAAI/AAAI15/paper/download/9571/9523/
[ermlp]: https://storage.googleapis.com/pub-tools-public-publication-data/pdf/45634.pdf
[pykeen]: https://github.com/pykeen/pykeen
[drkg]: https://github.com/gnn4dr/DRKG

## 0. Work environment setup
### 0.1. Installing PyKEEN

In [None]:
# Install packages if they're not already found
! pip install --upgrade pip
! python -c "import pykeen" || pip install git+https://github.com/pykeen/pykeen.git
! python -c "import wordcloud" || pip install wordcloud

### 0.2. Importing libraries

In [None]:
import os

import matplotlib.pyplot as plt
import torch

import pykeen
from pykeen.datasets import DRKG
from pykeen.pipeline import pipeline
from pykeen import predict

%matplotlib inline
%config InlineBackend.figure_format = 'svg'

In [None]:
pykeen.env()

## 1. Training the KGNN models

### 1.1. General evaluation
In this setting, the models are evaluated on triplets of every relation type in the DRKG, so the results reflect link prediction performance across all entity pairs in the KG, not only drug-disease pairs.
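
Link prediction performance is usually summarized with rank-based metrics such as mean reciprocal rank (MRR) and Hits@k, which are among the metrics PyKEEN reports. The sketch below computes both from a list of ranks, where each rank is the position of the true entity among all scored candidates for one test triplet. The ranks are hypothetical and are not results from this project.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all evaluated triplets (1.0 is a perfect score)."""
    return sum(1.0 / r for r in ranks) / len(ranks)


def hits_at_k(ranks, k):
    """Fraction of triplets whose true entity is ranked in the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)


# Hypothetical ranks of the true entity for five test triplets
ranks = [1, 3, 12, 2, 50]

mrr = mean_reciprocal_rank(ranks)
hits1 = hits_at_k(ranks, 1)
hits10 = hits_at_k(ranks, 10)

print(f"MRR: {mrr:.3f}")
print(f"Hits@1: {hits1:.2f}")
print(f"Hits@10: {hits10:.2f}")
```

Both metrics range from 0 to 1, and higher values mean the true entities are ranked closer to the top.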

#### 1.1.1. PairRE

In [None]:
# Define the PairRE KGNN model
result_PairRE_50ep_genev = pipeline(
    dataset="DRKG",
    model="PairRE",
    loss = "MarginRankingLoss",
    device='cuda',
    # Training configuration
    training_kwargs=dict(
        num_epochs=50,
        use_tqdm_batch=False,
        checkpoint_name='results_PairRE_genev.pt',
        checkpoint_on_failure=True,
        checkpoint_frequency=2
    ),
    # Runtime configuration
    random_seed=1235,
)

In [None]:
# Save the PairRE model to a directory
PairRE_dir_genenv = "results_PairRE_50ep_genenv"
result_PairRE_50ep_genev.save_to_directory(PairRE_dir_genenv)
os.listdir(PairRE_dir_genenv)

In [None]:
# Zip the directory to download it
!zip -r result_PairRE_50epochs.zip results_PairRE_50ep_genenv

#### 1.1.2. DistMult

In [None]:
# Define the DistMult KGNN model
result_DisMult_50ep_genev = pipeline(
    dataset="DRKG",
    model="DistMult",  # fixed: "DISMULT" does not match PyKEEN's DistMult model name
    loss = "MarginRankingLoss",
    device='cuda',
    # Training configuration
    training_kwargs=dict(
        num_epochs=50,
        use_tqdm_batch=False,
        checkpoint_name='results_DisMult_genev.pt',
        checkpoint_on_failure=True,
        checkpoint_frequency=2
    ),
    # Runtime configuration
    random_seed=1235,
)

In [None]:
# Save the DistMult model to a directory
DisMult_dir_genenv = "results_DisMult_50ep_genenv"
result_DisMult_50ep_genev.save_to_directory(DisMult_dir_genenv)
os.listdir(DisMult_dir_genenv)

In [None]:
# Zip the directory to download it
!zip -r result_DisMult_50epochs.zip results_DisMult_50ep_genenv

#### 1.1.3. ERMLP

In [None]:
# Define the ERMLP KGNN model
result_ERMLP_50ep_genev = pipeline(
    dataset="DRKG",
    model="ERMLP",
    loss = "MarginRankingLoss",
    device='cuda',
    # Training configuration
    training_kwargs=dict(
        num_epochs=50,
        use_tqdm_batch=False,
        checkpoint_name='results_ERMLP_genev.pt',
        checkpoint_on_failure=True,
        checkpoint_frequency=2
    ),
    # Runtime configuration
    random_seed=1235,
)

In [None]:
# Save the ERMLP model to a directory
ERMLP_dir_genenv = "results_ERMLP_50ep_genenv"
result_ERMLP_50ep_genev.save_to_directory(ERMLP_dir_genenv)
os.listdir(ERMLP_dir_genenv)

In [None]:
# Zip the directory to download it
!zip -r result_ERMLP_50epochs.zip results_ERMLP_50ep_genenv

#### 1.1.4. TransR

In [None]:
# Define the TransR KGNN model
result_TransR_50ep_genev = pipeline(
    dataset="DRKG",
    model="TransR",
    loss = "MarginRankingLoss",
    device='cuda',
    # Training configuration
    training_kwargs=dict(
        num_epochs=50,
        use_tqdm_batch=False,
        checkpoint_name='results_TransR_genev.pt',
        checkpoint_on_failure=True,
        checkpoint_frequency=2
    ),
    # Runtime configuration
    random_seed=1235,
)

In [None]:
# Save the TransR model to a directory
TransR_dir_genenv = "results_TransR_50ep_genenv"
result_TransR_50ep_genev.save_to_directory(TransR_dir_genenv)
os.listdir(TransR_dir_genenv)

In [None]:
# Zip the directory to download it
!zip -r result_TransR_50epochs.zip results_TransR_50ep_genenv

### 1.2. Drug repurposing evaluation
Instead of evaluating on all the triplets in the KG, the models can be evaluated only on the triplets that link drugs and diseases. I refer to this as the drug repurposing evaluation, because it measures performance specifically on the task of predicting new drug-disease associations. In the cells below, this restriction is passed to the pipeline through the `evaluation_relation_whitelist` argument.
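
Conceptually, restricting the evaluation to a relation whitelist amounts to keeping only the evaluation triplets whose relation is in the list. The sketch below illustrates that idea with made-up triplets. The entity identifiers and the `OTHER::...` relations are placeholders, `restrict_to_relations` is a hypothetical helper, and this is not PyKEEN's internal implementation.

```python
# Two of the drug-disease relations from this notebook
whitelist = {
    "DRUGBANK::treats::Compound:Disease",
    "Hetionet::CtD::Compound:Disease",
}

# Made-up evaluation triplets as (head, relation, tail); identifiers are placeholders
test_triples = [
    ("Compound::X1", "DRUGBANK::treats::Compound:Disease", "Disease::Y1"),
    ("Gene::G1", "OTHER::interacts::Gene:Gene", "Gene::G2"),
    ("Compound::X2", "Hetionet::CtD::Compound:Disease", "Disease::Y2"),
    ("Compound::X3", "OTHER::binds::Compound:Gene", "Gene::G3"),
]


def restrict_to_relations(triples, allowed_relations):
    """Keep only the triplets whose relation is in the allowed set."""
    return [t for t in triples if t[1] in allowed_relations]


drug_disease_triples = restrict_to_relations(test_triples, whitelist)
for head, relation, tail in drug_disease_triples:
    print(head, relation, tail)
```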

In [None]:
# List of the drug-disease relations in DRKG
drud_disease_relations = ['DRUGBANK::treats::Compound:Disease',
                        'GNBR::C::Compound:Disease',
                        'GNBR::J::Compound:Disease',
                        'GNBR::Mp::Compound:Disease',
                        'GNBR::Pa::Compound:Disease',
                        'GNBR::Pr::Compound:Disease',
                        'GNBR::Sa::Compound:Disease',
                        'GNBR::T::Compound:Disease',
                        'Hetionet::CpD::Compound:Disease',
                        'Hetionet::CtD::Compound:Disease']

#### 1.2.1. PairRE

In [None]:
# Define the PairRE KGNN model
result_PairRE_drev = pipeline(
    dataset="DRKG",
    model="PairRE",
    loss = "MarginRankingLoss",
    device='cuda',
    evaluation_relation_whitelist=drud_disease_relations,
    # Training configuration
    training_kwargs=dict(
        num_epochs=10,
        use_tqdm_batch=False,
        checkpoint_name='results_PairRE_10.pt',
        checkpoint_on_failure=True,
        checkpoint_frequency=2
    ),
    # Runtime configuration
    random_seed=1235,
)

In [None]:
# Save the PairRE model to a directory
PairRE_dir_drev = "results_PairRE_10ep_drev"
result_PairRE_drev.save_to_directory(PairRE_dir_drev)
os.listdir(PairRE_dir_drev)

In [None]:
# Zip the directory to download it
!zip -r result_PairRE_drev.zip results_PairRE_10ep_drev

#### 1.2.2. DistMult

In [None]:
# Define the DistMult KGNN model
result_DisMult_drev = pipeline(
    dataset="DRKG",
    model="DistMult",  # fixed: "DISMULT" does not match PyKEEN's DistMult model name
    loss = "MarginRankingLoss",
    device='cuda',
    evaluation_relation_whitelist=drud_disease_relations,
    # Training configuration
    training_kwargs=dict(
        num_epochs=10,
        use_tqdm_batch=False,
        checkpoint_name='results_DisMult_10.pt',
        checkpoint_on_failure=True,
        checkpoint_frequency=2
    ),
    # Runtime configuration
    random_seed=1235,
)

In [None]:
# Save the DistMult model to a directory
DisMult_dir_drev = "results_DisMult_10ep_drev"
result_DisMult_drev.save_to_directory(DisMult_dir_drev)
os.listdir(DisMult_dir_drev)

In [None]:
# Zip the directory to download it
!zip -r result_DisMult_drev.zip results_DisMult_10ep_drev

#### 1.2.3. ERMLP

In [None]:
# Define the ERMLP KGNN model
result_ERMLP_drev = pipeline(
    dataset="DRKG",
    model="ERMLP",
    loss = "MarginRankingLoss",
    device='cuda',
    evaluation_relation_whitelist=drud_disease_relations,
    # Training configuration
    training_kwargs=dict(
        num_epochs=10,
        use_tqdm_batch=False,
        checkpoint_name='results_ERMLP_10.pt',
        checkpoint_on_failure=True,
        checkpoint_frequency=2
    ),
    # Runtime configuration
    random_seed=1235,
)

In [None]:
# Save the ERMLP model to a directory
ERMLP_dir_drev = "results_ERMLP_10ep_drev"
result_ERMLP_drev.save_to_directory(ERMLP_dir_drev)
os.listdir(ERMLP_dir_drev)

In [None]:
# Zip the directory to download it
!zip -r result_ERMLP_drev.zip results_ERMLP_10ep_drev

#### 1.2.4. TransR

In [None]:
# Define the TransR KGNN model
result_TransR_drev = pipeline(
    dataset="DRKG",
    model="TransR",
    loss = "MarginRankingLoss",
    device='cuda',
    evaluation_relation_whitelist=drud_disease_relations,
    # Training configuration
    training_kwargs=dict(
        num_epochs=10,
        use_tqdm_batch=False,
        checkpoint_name='results_TransR_10.pt',
        checkpoint_on_failure=True,
        checkpoint_frequency=2
    ),
    # Runtime configuration
    random_seed=1235,
)

In [None]:
# Save the TransR model to a directory
TransR_dir_drev = "results_TransR_10ep_drev"
result_TransR_drev.save_to_directory(TransR_dir_drev)
os.listdir(TransR_dir_drev)

In [None]:
# Zip the directory to download it
!zip -r result_TransR_drev.zip results_TransR_10ep_drev

## 2. Further analysis
After training the KGNN models on Google Colab, I downloaded them to my local machine. Then, I ran the scripts to perform the internal and external evaluation. These scripts and the results are available in this [GitHub repository](https://github.com/sayalaruano/DengueDrugRep).