# Machine learning on semantic type and category information for identifying drug-disease indications

## Introduction

This document describes the experiments we performed on the RepoDB dataset. We created a feature set by extracting the direct and indirect paths between the drugs and the diseases from the Euretos Knowledge Platform, a generic knowledge graph that contains information from almost 200 data sources. The features count how often specific semantic types and semantic groups occur among the intermediate concepts that connect a drug and a disease. One additional binary feature indicates whether there is a direct relationship between the drug and the disease.
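The feature construction described above can be illustrated with a minimal sketch. It assumes each drug-disease pair comes with a list of intermediate concepts, each annotated with a semantic type and a semantic group. The vocabularies, the `build_features` helper, and the feature names are hypothetical and are not taken from the Euretos Knowledge Platform API.

```python
from collections import Counter

# Hypothetical vocabularies; the real ones would come from the knowledge graph.
SEMANTIC_TYPES = ["Gene or Genome", "Enzyme", "Pathologic Function"]
SEMANTIC_GROUPS = ["Genes & Molecular Sequences", "Chemicals & Drugs", "Disorders"]


def build_features(intermediates, has_direct_link):
    """Build one feature row for a single drug-disease pair.

    intermediates: list of (semantic_type, semantic_group) tuples, one per
        intermediate concept on an indirect path between drug and disease.
    has_direct_link: True if the graph holds a direct drug-disease relationship.
    """
    type_counts = Counter(sem_type for sem_type, _ in intermediates)
    group_counts = Counter(sem_group for _, sem_group in intermediates)

    # One count feature per semantic type and per semantic group.
    features = {f"type:{t}": type_counts.get(t, 0) for t in SEMANTIC_TYPES}
    features.update({f"group:{g}": group_counts.get(g, 0) for g in SEMANTIC_GROUPS})
    # Binary feature for the direct relationship.
    features["direct"] = int(has_direct_link)
    return features


example = build_features(
    [
        ("Gene or Genome", "Genes & Molecular Sequences"),
        ("Enzyme", "Chemicals & Drugs"),
        ("Gene or Genome", "Genes & Molecular Sequences"),
    ],
    has_direct_link=False,
)
```

Using a fixed vocabulary keeps the columns identical across pairs, so rows built this way could be stacked into a single table for a classifier.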

Based on this feature set, we try to recreate the “Approved” and “Terminated” classes of the RepoDB reference set (referred to in this document as “VALID” and “INVALID”, or “positive” and “negative”, respectively). Furthermore, we predict which drugs are likely to be therapeutic for polycystic kidney disease (PKD). The experiments we performed for this investigation are described below.
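As a small illustration of the class naming, the sketch below maps RepoDB statuses to the labels used in this document. The record layout, the `status` key, and the choice to skip any status other than “Approved” and “Terminated” are assumptions of this sketch, not a description of the actual preprocessing.

```python
# Mapping from RepoDB status to the class names used in this document.
STATUS_TO_CLASS = {"Approved": "VALID", "Terminated": "INVALID"}


def label_pairs(records):
    """Return (record, label) tuples with 1 = VALID/positive, 0 = INVALID/negative.

    Records whose status is neither "Approved" nor "Terminated" are skipped
    (an assumption of this sketch).
    """
    labelled = []
    for record in records:
        cls = STATUS_TO_CLASS.get(record["status"])
        if cls is None:
            continue
        labelled.append((record, 1 if cls == "VALID" else 0))
    return labelled


# Hypothetical records; drug and disease names are placeholders.
records = [
    {"drug": "drug_a", "disease": "disease_x", "status": "Approved"},
    {"drug": "drug_b", "disease": "disease_y", "status": "Terminated"},
    {"drug": "drug_c", "disease": "disease_z", "status": "Other"},
]
labels = [label for _, label in label_pairs(records)]
```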

## Methods and Results

This section describes the experiments and analyses we performed to substantiate our conclusions. We also performed a number of control experiments to ensure that we do not fall into pitfalls described in the literature. All code shown below is Python code. The Python version and the versions of the packages are listed at the end of the document.

In [None]:
# Machine learning and data handling
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import roc_auc_score, roc_curve, precision_recall_curve, average_precision_score

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Parallel processing (20 cores)
import joblib
from joblib import Parallel, delayed

# Set seaborn style
sns.set()

# Set number of parallel jobs for joblib-based parallelization (used in scikit-learn)
n_jobs = 20