# Examine how well we have mapped Hetionet to the UMLS

In [1]:
import pandas as pd

## Read Hetionet nodes

In [2]:
hnodes = pd.read_csv("hetionet/hetio_nodes.tsv", sep='\t')

In [3]:
hnodes.shape

(58700, 4)

In [4]:
hnodes.head()

,hetio_id,name,het_type,cui
0,DB00795,Sulfasalazine,Compound,UMLS:C0036078
1,DB00795,Sulfasalazine,Compound,UMLS:C0699547
2,DB00795,Sulfasalazine,Compound,UMLS:C4255898
3,N0000000151,Histamine H2 Receptor Antagonists,Pharmacologic Class,UMLS:C2757005
4,100996420,DNM1P50,Gene,100996420


## Some mapping statistics

In [5]:
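# Flag rows whose cui is a UMLS identifier, then keep one row per
# distinct (node, node type, flag) combination
temp = (hnodes
    .assign(iscui = lambda df: df["cui"].str.startswith("UMLS"))
    [["hetio_id", "het_type", "iscui"]]
    .drop_duplicates()
)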

In [6]:
temp.groupby("het_type")["iscui"].value_counts().to_frame()

het_type,iscui,count
Anatomy,True,402
Biological Process,True,11321
Biological Process,False,60
Cellular Component,True,1385
Cellular Component,False,6
Compound,True,1544
Compound,False,8
Disease,True,135
Disease,False,2
Gene,True,19383


In [7]:
(temp
    .groupby("het_type")
    ["iscui"]
    .value_counts(normalize=True)
    .multiply(100)
    .to_frame()
)

het_type,iscui,percent
Anatomy,True,100.0
Biological Process,True,99.472806
Biological Process,False,0.527194
Cellular Component,True,99.568656
Cellular Component,False,0.431344
Compound,True,99.484536
Compound,False,0.515464
Disease,True,98.540146
Disease,False,1.459854
Gene,True,92.542373
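
Because `temp` keeps one row per distinct (`hetio_id`, `het_type`, `iscui`) combination, a node with both a UMLS row and a non-UMLS row would appear under both True and False. The following is a minimal sketch of a per-node alternative, in which a node counts as mapped if any of its rows has a UMLS CUI. It assumes the same column names as `hnodes` and uses a small toy frame instead of the real file; the node `X1` and the CUI `UMLS:C0000000` are invented for illustration.

```python
import pandas as pd

# Toy stand-in for hnodes. The first four rows come from the head() output
# above; the two rows for the hypothetical node "X1" are invented.
toy = pd.DataFrame({
    "hetio_id": ["DB00795", "DB00795", "N0000000151", "100996420", "X1", "X1"],
    "het_type": ["Compound", "Compound", "Pharmacologic Class", "Gene", "Gene", "Gene"],
    "cui": ["UMLS:C0036078", "UMLS:C0699547", "UMLS:C2757005",
            "100996420", "UMLS:C0000000", "X1"],
})

# A node is considered mapped if at least one of its rows carries a UMLS CUI
node_mapped = (
    toy.assign(iscui=lambda df: df["cui"].astype(str).str.startswith("UMLS"))
    .groupby(["het_type", "hetio_id"])["iscui"]
    .any()
)

# Percentage of mapped nodes per node type
pct_mapped = node_mapped.groupby(level="het_type").mean().mul(100)
print(pct_mapped)
```

With this aggregation every node contributes exactly one True or False, so the percentages per type are taken over nodes rather than over (node, flag) pairs.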


The vast majority of concepts were mapped from Hetionet to the UMLS: for the node types listed above, more than 92% of nodes map to the UMLS, and every type except Gene exceeds 98%. The main category left out of the mapping was the Pathway node type. One contributing factor to the unmapped concepts is that some have been deprecated and removed from their source ontologies since Hetionet was created.

These concepts will need to be updated to their latest versions before they can be harmonized with the UMLS.
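
To see which concepts would need updating, one option is to list the nodes that have no UMLS CUI at all. This is a rough sketch under the assumption that the frame has the `hnodes` columns shown earlier; the helper name `unmapped_nodes` is hypothetical, and the example runs on a toy frame built from the `head()` output rather than on the real file.

```python
import pandas as pd

def unmapped_nodes(nodes: pd.DataFrame) -> pd.DataFrame:
    """Return one row per node that has no cui starting with 'UMLS'."""
    is_cui = nodes["cui"].astype(str).str.startswith("UMLS")
    # Identifiers of nodes with at least one UMLS CUI
    mapped_ids = set(nodes.loc[is_cui, "hetio_id"])
    return (
        nodes.loc[~nodes["hetio_id"].isin(mapped_ids), ["hetio_id", "name", "het_type"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )

# Toy stand-in for hnodes, taken from the head() output above
toy = pd.DataFrame({
    "hetio_id": ["DB00795", "DB00795", "N0000000151", "100996420"],
    "name": ["Sulfasalazine", "Sulfasalazine",
             "Histamine H2 Receptor Antagonists", "DNM1P50"],
    "het_type": ["Compound", "Compound", "Pharmacologic Class", "Gene"],
    "cui": ["UMLS:C0036078", "UMLS:C0699547", "UMLS:C2757005", "100996420"],
})

todo = unmapped_nodes(toy)
print(todo)
# Number of unmapped nodes per node type
print(todo["het_type"].value_counts())
```

The resulting list could then be checked against the current releases of the source ontologies to find identifiers that have been replaced or retired.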