# iBKH-based Knowledge Discovery Pipeline

This is the implementation of the Knowledge Discovery pipeline used in our iBKH portal at http://ibkh.ai/.

Given a target entity of interest, the task is to discover the Top-N entities from different entity types (currently supporting gene, drug, symptom, and pathway entities) that potentially link to the target entity. 


The pipeline consists of three steps:
1. Data preparation (triplet generation);

2. Knowledge graph embedding learning;

3. Knowledge discovery based on link prediction, for example predicting drug entities that potentially link to a target disease such as Parkinson's disease.

### Step 1: Data preparation (triplet generation)

######  Collecting iBKH knowledge graph source data

Download the latest version of iBKH knowledge graph data (entities and relations) at: https://github.com/wcm-wanglab/iBKH/tree/main/iBKH


Please make sure the downloaded files are arranged in the structure below.

```
.
├── Case Study-AD Drug Repurposing.ipynb
├── data
│   ├── iBKH                                 
│   │   ├── Entity
│   │   │   ├── anatomy_vocab.csv
│   │   │   ├── disease_vocab.csv
│   │   │   ├── drug_vocab.csv
│   │   │   ├── dsp_vocab.csv
│   │   │   ├── gene_vocab.csv
│   │   │   ├── molecule_vocab.csv
│   │   │   ├── pathway_vocab.csv
│   │   │   ├── sdsi_vocab.csv
│   │   │   ├── side_effect_vocab.csv
│   │   │   ├── symptom_vocab.csv
│   │   │   ├── tc_vocab.csv
│   │   │   ├── ...
│   │   │   │ 
│   │   ├── Relation
│   │   │   ├── A_G_res.csv
│   │   │   ├── D_D_res.csv
│   │   │   ├── D_Di_res.csv
│   │   │   ├── D_G_res.csv
│   │   │   ├── D_Pwy_res.csv
│   │   │   ├── D_SE_res.csv
│   │   │   ├── Di_Di_res.csv
│   │   │   ├── Di_G_res.csv
│   │   │   ├── Di_Pwy_res.csv
│   │   │   ├── Di_Sy_res.csv
│   │   │   ├── DSP_SDSI_res.csv
│   │   │   ├── G_G_res.csv
│   │   │   ├── G_Pwy_res.csv
│   │   │   ├── SDSI_A_res.csv
│   │   │   ├── SDSI_D_res.csv
│   │   │   ├── SDSI_Di_res.csv
│   │   │   ├── SDSI_Sy.csv
│   │   │   ├── SDSI_TC_res.csv
│   │   │   ├── ...
│   │   │   └──                      
│   │   └── 
│   └── ...
└── ...
```
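Before running the notebook, it can help to confirm that the files are where the code expects them. The following is a minimal sketch, not part of the iBKH code base: the helper name `find_missing_files` and the short `EXPECTED_FILES` list are illustrative assumptions, and the folder `data/iBKH` is taken from the `kg_folder` value used later in this protocol.

```python
from pathlib import Path

# A subset of the files shown in the tree above (illustrative; extend as needed).
EXPECTED_FILES = [
    "Entity/disease_vocab.csv",
    "Entity/drug_vocab.csv",
    "Entity/gene_vocab.csv",
    "Relation/D_Di_res.csv",
    "Relation/D_G_res.csv",
    "Relation/Di_G_res.csv",
]


def find_missing_files(kg_folder, expected=EXPECTED_FILES):
    """Return the expected relative paths that do not exist under kg_folder."""
    root = Path(kg_folder)
    return [rel for rel in expected if not (root / rel).is_file()]


# Hypothetical helper, not part of funcs.KG_processing.
missing = find_missing_files("data/iBKH")
if missing:
    print("Missing files:", ", ".join(missing))
else:
    print("All expected files found.")
```

Note that folder names are case-sensitive on Linux and macOS file systems configured that way, so the folder on disk should match the path used in the code exactly.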

In [None]:
# import required packages

import warnings
warnings.filterwarnings("ignore")

import pandas as pd
import numpy as np
import pickle

import torch as th
import torch.nn.functional as fn

from sklearn.preprocessing import MinMaxScaler

import os

import sys
sys.path.append('.') # Use only with Jupyter Notebook

import funcs.KG_processing as KG_processing

######  Generating the triplet set from iBKH

A triplet (h, r, t), consisting of a head entity h, a relation r, and a tail entity t, is the basic unit of a knowledge graph. We generate a triplet set from iBKH, which is then used for knowledge graph embedding learning.

In [None]:
kg_folder = 'data/iBKH/'  # Folder that stores the iBKH knowledge graph data
triplet_path = 'data/triplets/'  # Folder that stores the generated triplets
os.makedirs(triplet_path, exist_ok=True)
output_path = 'data/dataset/'  # Folder that stores the DGL-KE input files
os.makedirs(output_path, exist_ok=True)
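To make the notion of a triplet concrete, here is a small illustrative sketch. The entity identifiers (`drug_A`, `disease_X`, `disease_Y`) are made up for the example; real iBKH triplets use the identifiers from the vocabulary files. The relation names are the ones used later in this protocol.

```python
from collections import namedtuple

# A triplet (h, r, t): head entity, relation, tail entity.
Triplet = namedtuple("Triplet", ["head", "relation", "tail"])

# Illustrative identifiers only, not taken from iBKH.
triplets = {
    Triplet("drug_A", "Treats_DDi", "disease_X"),
    Triplet("drug_A", "Palliates_DDi", "disease_Y"),
    Triplet("drug_A", "Treats_DDi", "disease_X"),  # exact duplicate, dropped by the set
}

# Entities and relation types that appear in the triplet set.
entities = sorted({t.head for t in triplets} | {t.tail for t in triplets})
relations = sorted({t.relation for t in triplets})
print(len(triplets), entities, relations)
```

Storing triplets in a set is one simple way to avoid feeding duplicate facts into embedding training.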

Generate triplets for each pair of entity types.

In [None]:
KG_processing.DDi_triplets(kg_folder, triplet_path)
KG_processing.DG_triplets(kg_folder, triplet_path)
KG_processing.DPwy_triplets(kg_folder, triplet_path)
KG_processing.DSE_triplets(kg_folder, triplet_path)
KG_processing.DiDi_triplets(kg_folder, triplet_path)
KG_processing.DiG_triplets(kg_folder, triplet_path)
KG_processing.DiPwy_triplets(kg_folder, triplet_path)
KG_processing.DiSy_triplets(kg_folder, triplet_path)
KG_processing.GG_triplets(kg_folder, triplet_path)
KG_processing.GPwy_triplets(kg_folder, triplet_path)
KG_processing.DD_triplets(kg_folder, triplet_path)

Combine the triplet sets extracted from the individual relation files, then convert the combined set from .csv to .tsv format, as required by DGL-KE.

In [None]:
# Specify the triplet types you want to use.
included_pair_type = ['DDi', 'DiG', 'DG', 'GG', 'DD', 'DiDi',
                      'GPwy', 'DiPwy', 'DPwy', 'DiSy',  'DSE']

# Running the call below produces a CSV file that combines all triplets extracted by the functions above.
KG_processing.generate_triplet_set(triplet_path=triplet_path)

In [None]:
# Split the data into training, validation, and testing sets,
# and convert them to TSV files following DGL-KE requirements.
KG_processing.generate_DGL_data_set(triplet_path=triplet_path, 
                                    output_path=output_path, 
                                    train_val_test_ratio=[.9, .05, .05])
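For readers who want to see what such a split involves, the following is a rough sketch under stated assumptions. It is not the implementation of `generate_DGL_data_set`; the function name `split_triplets`, the column names, and the toy data are all hypothetical.

```python
import pandas as pd


def split_triplets(df, ratios=(0.9, 0.05, 0.05), seed=0):
    """Shuffle the rows of df and split them into train/validation/test frames."""
    assert abs(sum(ratios) - 1.0) < 1e-9, "ratios must sum to 1"
    shuffled = df.sample(frac=1.0, random_state=seed).reset_index(drop=True)
    n_train = round(len(shuffled) * ratios[0])
    n_val = round(len(shuffled) * ratios[1])
    train = shuffled.iloc[:n_train]
    val = shuffled.iloc[n_train:n_train + n_val]
    test = shuffled.iloc[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


# Toy triplet table with made-up identifiers.
toy = pd.DataFrame({
    "head": [f"drug_{i}" for i in range(100)],
    "relation": ["Treats_DDi"] * 100,
    "tail": [f"disease_{i % 7}" for i in range(100)],
})
train, val, test = split_triplets(toy)
print(len(train), len(val), len(test))

# The raw_udd_hrt format used below is one tab-separated
# (head, relation, tail) triplet per line, without a header, e.g.:
# train.to_csv("training_triplets.tsv", sep="\t", header=False, index=False)
```

Fixing the random seed keeps the split reproducible, so that metrics from different models are computed on the same test triplets.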

### Step 2:  Knowledge graph embedding

We invoke the command-line toolkit provided by DGL-KE to learn embeddings of the entities and relations in iBKH. Here, we use four models to learn these representations: TransE, TransR, DistMult, and ComplEx. To use other KGE models or AWS instances, please refer to the DGL-KE <a href="https://aws-dglke.readthedocs.io/en/latest/index.html" target="_blank">documentation</a>.


Open a command line (Windows and UNIX) or terminal (macOS) and change to the project directory:

In [None]:
cd [your file path]/iBKH-KD-protocol

Train and evaluate the knowledge graph embedding model by running the command below.

In [None]:
DGLBACKEND=pytorch \
dglke_train --dataset iBKH --data_path ./data/dataset \
            --data_files training_triplets.tsv \
                          validation_triplets.tsv \
                          testing_triplets.tsv \
            --format raw_udd_hrt --model_name [model name] \
            --batch_size [batch size] --hidden_dim [hidden dim] \
            --neg_sample_size [neg sample size] --gamma [gamma] \
            --lr [learning rate] --max_step [max step] \
            --log_interval [log interval] \
            --batch_size_eval [batch size eval] \
            -adv --regularization_coef [regularization coef] \
            --num_thread [num thread] --num_proc [num proc] \
            --neg_sample_size_eval [neg sample size eval] \
            --save_path ./data/embeddings --test

Running the above command trains the specified knowledge graph embedding model on the training set and evaluates its link prediction performance on the testing set. The evaluation reports several metrics: Hit@k (the fraction of positive triplets ranked among the top k candidates); Mean Rank (MR, the average rank of the positive triplets); and Mean Reciprocal Rank (MRR, the average reciprocal rank of the positive triplets). Higher Hit@k and MRR and lower MR indicate better performance.
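The three metrics can be computed from the rank that each positive test triplet receives among its candidates. The sketch below illustrates the definitions only; the function name `link_prediction_metrics` and the example ranks are made up, and DGL-KE computes these values internally.

```python
import numpy as np


def link_prediction_metrics(ranks, ks=(1, 3, 10)):
    """Compute MR, MRR, and Hit@k from 1-based ranks of the positive triplets."""
    ranks = np.asarray(ranks, dtype=float)
    metrics = {
        "MR": ranks.mean(),            # mean rank, lower is better
        "MRR": (1.0 / ranks).mean(),   # mean reciprocal rank, higher is better
    }
    for k in ks:
        # Fraction of positives ranked within the top k, higher is better.
        metrics[f"Hit@{k}"] = (ranks <= k).mean()
    return metrics


# Made-up ranks for four positive test triplets.
example_ranks = [1, 2, 5, 20]
print(link_prediction_metrics(example_ranks))
```

For these four ranks, MR is 7.0, MRR is 0.4375, and Hit@1, Hit@3, and Hit@10 are 0.25, 0.5, and 0.75. A single badly ranked triplet (rank 20 here) affects MR much more than MRR.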


Note that the above command can also be used to search for optimal hyperparameters. For simplicity, you can instead use the hyperparameters we suggest below.

```
Arguments               TransE      TransR      ComplEx     DistMult
--model_name            TransE_l2   TransR      ComplEx     DistMult
--batch_size            1024        1024        1024        1024
--batch_size_eval       1000        1000        1000        1000
--neg_sample_size       256         256         256         256
--neg_sample_size_eval  1000        1000        1000        1000
--hidden_dim            400         200         200         400
--gamma                 12.0        12.0        12.0        12.0
--lr                    0.1         0.005       0.005       0.005
--max_step              10000       10000       10000       10000
--log_interval          100         100         100         100
--regularization_coef   1.00E-09    1.00E-07    1.00E-07    1.00E-07
```

After identifying hyperparameters that give satisfactory performance, we retrain the model on the whole dataset by running:

In [None]:
DGLBACKEND=pytorch \
dglke_train --dataset iBKH --data_path ./data/dataset \
            --data_files whole_triplets.tsv \
            --format raw_udd_hrt --model_name [model name] \
            --batch_size [batch size] --hidden_dim [hidden dim] \
            --neg_sample_size [neg sample size] --gamma [gamma] \
            --lr [learning rate] --max_step [max step] \
            --log_interval [log interval] \
            -adv --regularization_coef [regularization coef] \
            --num_thread [num thread] --num_proc [num proc] \
            --save_path ./data/embeddings

This will generate two output files for each model: `iBKH_[model name]_entity.npy`, containing the low-dimensional embeddings of the entities in iBKH, and `iBKH_[model name]_relation.npy`, containing the low-dimensional embeddings of the relations in iBKH. These embeddings can be used in downstream knowledge discovery tasks.
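As an illustration of how such embeddings support link prediction, the sketch below scores candidate head entities with the TransE (L2) scoring function, gamma - ||h + r - t||. It is a sketch under assumptions rather than the code in `funcs.KG_link_pred`: the arrays are random stand-ins for the saved `.npy` files, and the row indices of the drug, disease, and relation are hypothetical.

```python
import numpy as np


def transe_l2_scores(head_embs, rel_emb, tail_emb, gamma=12.0):
    """Score candidate heads h for (h, r, t) as gamma - ||h + r - t||_2."""
    return gamma - np.linalg.norm(head_embs + rel_emb - tail_emb, axis=1)


# Random stand-ins for the arrays that would normally be read with np.load
# from the entity and relation .npy files written by DGL-KE.
rng = np.random.default_rng(0)
entity_emb = rng.normal(size=(6, 4))    # 6 entities, 4 dimensions
relation_emb = rng.normal(size=(2, 4))  # 2 relation types

candidate_ids = np.array([0, 1, 2, 3])  # hypothetical row indices of drug entities
target_id, relation_id = 5, 0           # hypothetical disease and relation indices

scores = transe_l2_scores(
    entity_emb[candidate_ids], relation_emb[relation_id], entity_emb[target_id]
)
ranking = candidate_ids[np.argsort(-scores)]  # best-scoring candidates first
print(ranking)
```

A higher score means the candidate's embedding, translated by the relation vector, lands closer to the target's embedding, so the triplet is considered more plausible.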

### Step 3: Knowledge Discovery Based on iBKH - Hypothesis Generation

This step conducts knowledge discovery based on iBKH. 

We showcase an example: drug repurposing hypothesis generation for Parkinson's disease.

In [None]:
from funcs.KG_link_pred import generate_hypothesis,\
                               generate_hypothesis_ensemble_model

In [None]:
PD = ["parkinson's disease", "late onset parkinson's disease"]

In [None]:
r_type = ["Treats_DDi", "Palliates_DDi"]

######  Drug repurposing hypothesis generation based on graph embedding using the TransE model.

In [None]:
proposed_df = generate_hypothesis(target_entity=PD, candidate_entity_type='drug',
                                  relation_type=r_type, embedding_folder='data/embeddings',
                                  method='transE_l2', kg_folder = 'data/iBKH', 
                                  triplet_folder = 'data/triplets', topK=100, 
                                  save_path='output', save=True,
                                  without_any_rel=False)

This will result in an output CSV file stored in the "output" folder.

In [None]:
# print the predicted drugs.

proposed_df

We also provide an ensemble model that integrates TransE, TransR, ComplEx, and DistMult to generate hypotheses.

In [None]:
ensemble_proposed_df = generate_hypothesis_ensemble_model(target_entity=PD, candidate_entity_type='drug',
                                                          relation_type=r_type, 
                                                          embedding_folder='data/embeddings',
                                                          kg_folder = 'data/iBKH', 
                                                          triplet_folder = 'data/triplets',
                                                          topK=100, save_path='output', save=True, 
                                                          without_any_rel=False)

In [None]:
# print the predicted drugs using ensemble method
ensemble_proposed_df
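One plausible way to combine scores from several embedding models is to rescale each model's scores to [0, 1] and average them. The sketch below shows that idea only; it is an assumption for illustration and may differ from what `generate_hypothesis_ensemble_model` actually does. The helper names and score values are made up.

```python
import numpy as np


def min_max(x):
    """Rescale a score vector to [0, 1]; a constant vector maps to all zeros."""
    x = np.asarray(x, dtype=float)
    span = x.max() - x.min()
    return np.zeros_like(x) if span == 0 else (x - x.min()) / span


def ensemble_rank(scores_by_model, top_k=3):
    """Average min-max-normalized scores across models and return the top_k indices."""
    normalized = [min_max(s) for s in scores_by_model.values()]
    combined = np.mean(normalized, axis=0)
    order = np.argsort(-combined)[:top_k]
    return order, combined[order]


# Made-up scores for four candidate drugs from two models (hypothetical values).
scores_by_model = {
    "TransE_l2": [11.2, 9.8, 10.5, 7.1],
    "DistMult": [0.4, 0.9, 0.7, 0.1],
}
order, combined = ensemble_rank(scores_by_model, top_k=2)
print(order, combined)
```

Normalizing first matters because the raw scores of different models live on different scales, so a plain average would be dominated by the model with the largest score range.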

######  Interpreting prediction results in knowledge graph.

Finally, we interpret the predicted repurposing drug candidates using the knowledge graph. We extract the intermediate entities that form the shortest paths linking the target entity (i.e., Parkinson's disease) to the predicted drug candidates.

1. To achieve this, we first deploy the iBKH knowledge graph using Neo4j on an AWS server. Please refer to the following instructions to set up the knowledge graph: https://docs.google.com/document/d/1cLDPLp_nVCJ5xrDlJ-B-Q3wf24tb-Dyq55nAXxaNgTM/edit

2. Interpret the repurposing drug candidates.

In [None]:
import funcs.knowledge_visualization as knowledge_visualization

In [None]:
# List of predicted repurposing drug candidates to interpret

drug_list = ['Glutathione', 'Clioquinol', 'Steroids', 'Taurine']

In [None]:
knowledge_visualization.subgraph_visualization(target_type='Disease', target_list=PD,
                                               predicted_type='Drug', predicted_list=drug_list, 
                                               neo4j_url = "neo4j://54.210.251.104:7687", 
                                               username = "neo4j", password = "password",
                                               alpha=1.5, k=0.8, figsize=(15, 10), save=True)
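
The shortest-path idea behind this interpretation step can be illustrated without a Neo4j server. The following is a minimal sketch using breadth-first search on a toy undirected graph; the graph, its node names, and the function `shortest_path` are hypothetical and unrelated to the implementation in `funcs.knowledge_visualization`.

```python
from collections import deque


def shortest_path(graph, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    if source == target:
        return [source]
    parents = {source: None}  # also serves as the visited set
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor not in parents:
                parents[neighbor] = node
                if neighbor == target:
                    # Walk back through the parents to rebuild the path.
                    path = [target]
                    while parents[path[-1]] is not None:
                        path.append(parents[path[-1]])
                    return path[::-1]
                queue.append(neighbor)
    return None


# Toy adjacency list with made-up entities.
toy_graph = {
    "drug_A": ["gene_1", "gene_2"],
    "gene_1": ["drug_A", "pathway_1"],
    "gene_2": ["drug_A", "disease_X"],
    "pathway_1": ["gene_1", "disease_X"],
    "disease_X": ["gene_2", "pathway_1"],
}
path = shortest_path(toy_graph, "drug_A", "disease_X")
print(path)
```

In this toy graph, the intermediate entity on the shortest path (`gene_2`) plays the role of the explanatory link between the drug candidate and the disease.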