# Can we use the semmeddb gold standard for evaluating hetionet with deepwalk?

In [1]:
import pandas as pd
import re

## Read hetionet to umls mapping

In [2]:
hmap = pd.read_csv("../merge/hetionet/hetio_nodes.tsv", sep='\t')

In [3]:
hmap.head()

Unnamed: 0,hetio_id,name,het_type,cui
0,DB00795,Sulfasalazine,Compound,UMLS:C0036078
1,DB00795,Sulfasalazine,Compound,UMLS:C0699547
2,DB00795,Sulfasalazine,Compound,UMLS:C4255898
3,N0000000151,Histamine H2 Receptor Antagonists,Pharmacologic Class,UMLS:C2757005
4,100996420,DNM1P50,Gene,100996420


## Read original full gold std

In [4]:
orig = (pd
    .read_csv("../semmed/data/indications_slim.csv", sep=',')
    [["compound_umlscui", "compound_name", "disease_umlscui", "disease_name", "etype"]]
    .rename(columns={
        "compound_umlscui": "chemical_id",
        "compound_name": "chemical_name",
        "disease_umlscui": "disease_id",
    })
)

In [5]:
orig.shape

(8176, 5)

In [6]:
orig.head()

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,etype
0,C3661315,rucaparib,C1299247,ovarian cancer,TREATS_CDtDO
1,C4044947,baricitinib,C0003873,rheumatoid arthritis,TREATS_CDtDO
2,C0520442,acetyldigitoxin,C0004238,Atrial fibrillation,TREATS_CDtDO
3,C0520442,acetyldigitoxin,C0018802,Congestive heart failure,TREATS_CDtDO
4,C0771809,acexamic acid,C0037299,Skin ulcer,TREATS_CDtDO


## Read filtered gold std used for semmeddb only

In [7]:
gold = pd.read_csv("data/merged_gold.tsv", sep='\t')

In [8]:
gold.head()

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,etype,chemical_uid,disease_uid
0,UMLS:C0520442,acetyldigitoxin,UMLS:C0004238,Atrial fibrillation,TREATS_CDtDO,108642,22585
1,UMLS:C0033497,propranolol,UMLS:C0004238,Atrial fibrillation,TREATS_CDtDO,32398,22585
2,UMLS:C0012265,digoxin,UMLS:C0004238,Atrial fibrillation,TREATS_CDtDO,25280,22585
3,UMLS:C0084273,quinidine,UMLS:C0004238,Atrial fibrillation,TREATS_CDtDO,55039,22585
4,UMLS:C0766326,dronedarone,UMLS:C0004238,Atrial fibrillation,TREATS_CDtDO,134609,22585


In [9]:
gold.shape

(6329, 7)

---

## How many chemicals and diseases from the original gold std are in hetionet?

In [10]:
def is_cui(s):
    # A UMLS CUI is the letter "C" followed by exactly seven digits.
    return re.match(r'^C[0-9]{7}$', s) is not None

In [11]:
orig["chemical_id"].map(is_cui).all()

True

In [12]:
orig["disease_id"].map(is_cui).all()

False

In [13]:
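# Prefix CUIs with "UMLS:" so they match the format of the hetionet mapping.
ochem = set(orig["chemical_id"].map(lambda v: "UMLS:{}".format(v)))

# Not every disease_id is a CUI (see the False above); leave non-CUI ids unchanged.
odise = set(orig["disease_id"].map(lambda v: "UMLS:{}".format(v) if is_cui(v) else v))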
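Since not every `disease_id` is a CUI, it may be worth looking at which values fail the check before deciding how to treat them. The snippet below is a minimal sketch on made-up data: the list `disease_ids` and the value `EXAMPLE:0001` are hypothetical and do not come from the gold standard file.

```python
import re

def is_cui(s):
    # Same pattern as the notebook's is_cui: "C" followed by exactly seven digits.
    return re.match(r'^C[0-9]{7}$', s) is not None

# Hypothetical disease identifiers; "EXAMPLE:0001" is invented for illustration.
disease_ids = ["C0004238", "C0018802", "EXAMPLE:0001", "C0037299"]

# Collect the distinct identifiers that are not plain CUIs.
non_cui = sorted({v for v in disease_ids if not is_cui(v)})

# Apply the same conditional prefixing as the cell above.
prefixed = {"UMLS:{}".format(v) if is_cui(v) else v for v in disease_ids}

print(non_cui)
print(sorted(prefixed))
```

On the real data, the equivalent check would be something like `orig.loc[~orig["disease_id"].map(is_cui), "disease_id"].unique()`.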

---

In [14]:
len(ochem)

2165

In [15]:
len(odise)

1245

---

In [16]:
hcuis = set(hmap["cui"])

In [17]:
len(hcuis & ochem)

1056

In [18]:
len(hcuis & odise)

618

In [19]:
len(hcuis & ochem) / len(ochem) * 100

48.775981524249424

In [20]:
len(hcuis & odise) / len(odise) * 100

49.63855421686747

The original gold standard (unfiltered by semmeddb) overlaps only about 50% with hetionet: 48.8% of its chemicals and 49.6% of its diseases have a matching identifier in the hetionet mapping.

### semmed filtered gold std

In [21]:
gold.head(3)

Unnamed: 0,chemical_id,chemical_name,disease_id,disease_name,etype,chemical_uid,disease_uid
0,UMLS:C0520442,acetyldigitoxin,UMLS:C0004238,Atrial fibrillation,TREATS_CDtDO,108642,22585
1,UMLS:C0033497,propranolol,UMLS:C0004238,Atrial fibrillation,TREATS_CDtDO,32398,22585
2,UMLS:C0012265,digoxin,UMLS:C0004238,Atrial fibrillation,TREATS_CDtDO,25280,22585


In [22]:
gchems = set(gold["chemical_id"])
gdises = set(gold["disease_id"])

In [23]:
len(gchems)

1618

In [24]:
len(gdises)

963

In [25]:
len(hcuis & gchems) / len(gchems) * 100

57.045735475896166

In [26]:
len(hcuis & gdises) / len(gdises) * 100

60.020768431983385

Using the gold standard that was used for semmeddb gives somewhat better overlap with hetionet: 57.0% of chemicals and 60.0% of diseases, compared with roughly 50% for the original.
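The same overlap percentage is computed four times above, so it could be wrapped in a small helper. This is only a sketch: `pct_overlap` is a name introduced here, and the identifiers in the example are toy values, not real CUIs.

```python
def pct_overlap(ids, reference):
    """Percentage of the distinct values in `ids` that also appear in `reference`."""
    ids = set(ids)
    if not ids:
        raise ValueError("ids is empty, so the percentage is undefined")
    return len(ids & set(reference)) / len(ids) * 100

# Toy identifier sets (hypothetical, for illustration only).
hetio_ids = {"UMLS:C0000001", "UMLS:C0000002", "UMLS:C0000003"}
gold_chem_ids = {"UMLS:C0000001", "UMLS:C0000002", "UMLS:C0000009", "UMLS:C0000010"}

# Two of the four gold chemicals are present in the hetionet set.
pct = pct_overlap(gold_chem_ids, hetio_ids)
print(pct)
```

In the notebook this would be called as `pct_overlap(gchems, hcuis)` and `pct_overlap(gdises, hcuis)`.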

---

In [27]:
temp = (gold
    .query("chemical_id in @hcuis and disease_id in @hcuis")
)

In [28]:
gold.shape

(6329, 7)

In [29]:
temp.shape

(2924, 7)

In [30]:
len(temp) / len(gold) * 100

46.200031600568806

So subsetting the semmed gold standard to pairs where both the chemical and the disease exist in hetionet keeps only about 46% of the dataset (2924 of 6329 rows), which is substantially smaller than what we used to evaluate semmeddb. We can still run it, but it is not clear that the results will be comparable.
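To see why rows drop out, one option is to break the gold standard down by which side of each pair is missing from hetionet. The sketch below uses a toy `gold` frame and a toy `hcuis` set with made-up identifiers. Only the column names match the notebook.

```python
import pandas as pd

# Toy stand-ins for the notebook's hcuis and gold (identifiers are hypothetical).
hcuis = {"UMLS:C1", "UMLS:C2", "UMLS:D1"}
gold = pd.DataFrame({
    "chemical_id": ["UMLS:C1", "UMLS:C1", "UMLS:C3", "UMLS:C3"],
    "disease_id": ["UMLS:D1", "UMLS:D2", "UMLS:D1", "UMLS:D2"],
})

# Boolean masks: is each side of the pair present in hetionet?
chem_ok = gold["chemical_id"].isin(hcuis)
dise_ok = gold["disease_id"].isin(hcuis)

# Count rows in each of the four possible cases.
breakdown = pd.Series({
    "both mapped": int((chem_ok & dise_ok).sum()),
    "chemical missing only": int((~chem_ok & dise_ok).sum()),
    "disease missing only": int((chem_ok & ~dise_ok).sum()),
    "both missing": int((~chem_ok & ~dise_ok).sum()),
})
print(breakdown)
```

The "both mapped" count corresponds to `len(temp)` above. The other three counts would show whether the loss comes mostly from unmapped chemicals or unmapped diseases.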