# Split PharmacotherapyDB into multiple pieces

Tong Shu Li

For cross validation, the original training data must be split into multiple pieces so that training and testing data stay separate.

In [1]:
import pandas as pd
import numpy as np

In [2]:
np.__version__

'1.10.4'

In [3]:
np.random.seed(20161018)

---

## Load PharmacotherapyDB

In [4]:
goldstd = pd.read_csv("data/indications.tsv", sep = '\t')

In [5]:
goldstd.shape

(1388, 7)

In [6]:
goldstd.head()

| | doid_id | drugbank_id | disease | drug | category | n_curators | n_resources |
|---|---|---|---|---|---|---|---|
| 0 | DOID:10652 | DB00843 | Alzheimer's disease | Donepezil | DM | 2 | 1 |
| 1 | DOID:10652 | DB00674 | Alzheimer's disease | Galantamine | DM | 1 | 4 |
| 2 | DOID:10652 | DB01043 | Alzheimer's disease | Memantine | DM | 1 | 3 |
| 3 | DOID:10652 | DB00989 | Alzheimer's disease | Rivastigmine | DM | 1 | 3 |
| 4 | DOID:10652 | DB00245 | Alzheimer's disease | Benzatropine | SYM | 3 | 1 |


In [7]:
goldstd["category"].value_counts()

DM     755
SYM    390
NOT    243
Name: category, dtype: int64

---

## Split into multiple pieces

For K-fold cross validation, the entire workflow must be run K times. We choose K = 5 to keep the computational cost manageable.

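We split the data by assigning each row a random piece number from 0 to K-1 and then grouping rows by piece number. Every row is assigned to exactly one piece, so all of the data is used. Because the assignment is uniformly random, the piece sizes and the ratios of true/false examples per piece are only approximately equal, not guaranteed to be identical; the counts below confirm that they are close.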

In [8]:
K = 5
goldstd["piece"] = np.random.randint(0, K, len(goldstd))

In [9]:
goldstd.head()

| | doid_id | drugbank_id | disease | drug | category | n_curators | n_resources | piece |
|---|---|---|---|---|---|---|---|---|
| 0 | DOID:10652 | DB00843 | Alzheimer's disease | Donepezil | DM | 2 | 1 | 0 |
| 1 | DOID:10652 | DB00674 | Alzheimer's disease | Galantamine | DM | 1 | 4 | 0 |
| 2 | DOID:10652 | DB01043 | Alzheimer's disease | Memantine | DM | 1 | 3 | 2 |
| 3 | DOID:10652 | DB00989 | Alzheimer's disease | Rivastigmine | DM | 1 | 3 | 4 |
| 4 | DOID:10652 | DB00245 | Alzheimer's disease | Benzatropine | SYM | 3 | 1 | 1 |


In [10]:
goldstd["piece"].value_counts()

0    291
1    289
2    275
4    269
3    264
Name: piece, dtype: int64

Each piece holds roughly 20% of the 1388 rows (between 19.0% and 21.0%).
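The share of rows in each piece can be computed rather than read off by eye. A minimal sketch, using the piece counts printed above as a stand-in Series (the names `piece_counts` and `piece_share` are illustrative and not part of the original notebook):

```python
import pandas as pd

# Piece sizes copied from the value_counts() output above.
piece_counts = pd.Series({0: 291, 1: 289, 2: 275, 3: 264, 4: 269})

# Fraction of all rows that landed in each piece.
piece_share = piece_counts / piece_counts.sum()
print(piece_share.round(3))
```

On the real dataframe, `goldstd["piece"].value_counts(normalize=True)` should give the same fractions directly.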

In [11]:
# Verify that the category ratios are similar across pieces
for group, df in goldstd.groupby("piece"):
    print("Group:", group)
    print(df["category"].value_counts())
    print()

Group: 0
DM     163
SYM     79
NOT     49
Name: category, dtype: int64

Group: 1
DM     160
SYM     81
NOT     48
Name: category, dtype: int64

Group: 2
DM     158
SYM     72
NOT     45
Name: category, dtype: int64

Group: 3
DM     134
SYM     76
NOT     54
Name: category, dtype: int64

Group: 4
DM     140
SYM     82
NOT     47
Name: category, dtype: int64
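Purely random assignment makes the category ratios similar only on average, which is why the counts above vary somewhat between groups (for example, 134 to 163 DM rows). If tighter balance were required, one alternative, which is not what this notebook does, is to assign piece numbers within each category. A minimal sketch on toy data; `assign_stratified_pieces` is a hypothetical helper and assumes the index is unique:

```python
import pandas as pd

def assign_stratified_pieces(labels, k, seed):
    """Return a piece number (0..k-1) per row, balanced within each label.

    Hypothetical helper for illustration; assumes labels.index is unique.
    """
    # Shuffle the rows reproducibly, then number rows within each label.
    shuffled = labels.sample(frac=1, random_state=seed)
    pieces = shuffled.groupby(shuffled).cumcount() % k
    # Restore the original row order.
    return pieces.reindex(labels.index)

toy = pd.DataFrame({"category": ["DM"] * 10 + ["SYM"] * 5 + ["NOT"] * 5})
toy["piece"] = assign_stratified_pieces(toy["category"], k=5, seed=0)
print(pd.crosstab(toy["piece"], toy["category"]))
```

With this scheme the per-category counts in any two pieces differ by at most one, at the cost of a slightly more involved assignment step.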



## Sort the data so that the row order is stable

In [12]:
goldstd = goldstd.sort_values(["doid_id", "drugbank_id"])
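Sorting by `doid_id` and `drugbank_id` fully determines the row order only if each disease-drug pair appears once. A minimal sketch of how that could be checked, on toy rows taken from the preview above (the names `toy`, `keys`, and `has_duplicate_keys` are illustrative):

```python
import pandas as pd

toy = pd.DataFrame({
    "doid_id": ["DOID:10652", "DOID:10652", "DOID:10652"],
    "drugbank_id": ["DB00843", "DB00674", "DB01043"],
})
keys = ["doid_id", "drugbank_id"]

# True if any (doid_id, drugbank_id) pair occurs more than once.
has_duplicate_keys = toy.duplicated(subset=keys).any()
print("duplicate sort keys:", has_duplicate_keys)

# With unique keys, the sorted order does not depend on the input order.
sorted_toy = toy.sort_values(keys).reset_index(drop=True)
print(sorted_toy)
```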

## Save split pieces to file

In [13]:
goldstd.to_csv("data/split_indications/labeled_pharmacotherapydb.tsv", sep = '\t', index = False)

In [14]:
for group, df in goldstd.groupby("piece"):
    fname = "pharmacotherapydb_piece{}.tsv".format(group)
    df.to_csv("data/split_indications/{}".format(fname), sep = '\t', index = False)
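
Each saved piece can then serve once as the held-out test set while the other K-1 pieces form the training set. A minimal sketch of that loop on a toy dataframe with a `piece` column (the helper name `iter_folds` and the toy data are illustrative assumptions; the downstream training workflow is not part of this notebook):

```python
import pandas as pd

def iter_folds(df, piece_col="piece"):
    """Yield (piece, train, test), holding out each piece exactly once."""
    for piece in sorted(df[piece_col].unique()):
        is_test = df[piece_col] == piece
        yield piece, df[~is_test], df[is_test]

toy = pd.DataFrame({"row": range(10), "piece": [0, 1, 2, 3, 4] * 2})
for piece, train, test in iter_folds(toy):
    print("piece", piece, "train rows:", len(train), "test rows:", len(test))
```

Building the training set by exclusion keeps the test rows of each fold out of the training data for that fold.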