# Examine how well we have mapped Hetionet to the UMLS

In [1]:
import pandas as pd

## Read Hetionet nodes

In [2]:
hnodes = pd.read_csv("hetionet/hetio_nodes.tsv", sep='\t')

In [3]:
hnodes.shape

(58700, 4)

In [4]:
hnodes.head()

|   | uid | name | node_type | fuid |
|---|---|---|---|---|
| 0 | DB00795 | Sulfasalazine | Compound | UMLS:C0036078 |
| 1 | DB00795 | Sulfasalazine | Compound | UMLS:C0699547 |
| 2 | DB00795 | Sulfasalazine | Compound | UMLS:C4255898 |
| 3 | N0000000151 | Histamine H2 Receptor Antagonists | Pharmacologic Class | UMLS:C2757005 |
| 4 | 100996420 | DNM1P50 | Gene | 100996420 |


## Read Hetionet edges

In [5]:
hedges = pd.read_csv("hetionet/hetio_edges.tsv", sep='\t')

In [6]:
hedges.head()

|   | edge_type | source_uid | target_uid |
|---|---|---|---|
| 0 | PARTICIPATES_GpBP | UMLS:C1857700 | UMLS:C1149564 |
| 1 | PARTICIPATES_GpBP | UMLS:C1857700 | UMLS:C1156140 |
| 2 | PARTICIPATES_GpBP | UMLS:C1419751 | UMLS:C1149564 |
| 3 | PARTICIPATES_GpBP | UMLS:C1419751 | UMLS:C1156140 |
| 4 | PARTICIPATES_GpBP | UMLS:C1335641 | UMLS:C1149564 |


In [7]:
hedges.shape

(3434083, 3)

## Some mapping statistics

In [8]:
temp = (hnodes
    .assign(iscui = lambda df: df["fuid"].str.startswith("UMLS"))
    [["uid", "node_type", "iscui"]]
    .drop_duplicates()
)

In [9]:
temp.groupby("node_type")["iscui"].value_counts()

node_type            iscui
Anatomy              True       402
Biological Process   True     11321
                     False       60
Cellular Component   True      1385
                     False        6
Compound             True      1544
                     False        8
Disease              True       135
                     False        2
Gene                 True     19383
                     False     1562
Molecular Function   True      2871
                     False       13
Pathway              False     1822
Pharmacologic Class  True       344
                     False        1
Side Effect          True      5734
Symptom              True       438
Name: iscui, dtype: int64

In [10]:
temp.groupby("node_type")["iscui"].value_counts(normalize=True)*100

node_type            iscui
Anatomy              True     100.000000
Biological Process   True      99.472806
                     False      0.527194
Cellular Component   True      99.568656
                     False      0.431344
Compound             True      99.484536
                     False      0.515464
Disease              True      98.540146
                     False      1.459854
Gene                 True      92.542373
                     False      7.457627
Molecular Function   True      99.549237
                     False      0.450763
Pathway              False    100.000000
Pharmacologic Class  True      99.710145
                     False      0.289855
Side Effect          True     100.000000
Symptom              True     100.000000
Name: iscui, dtype: float64

The vast majority of Hetionet concepts were mapped to the UMLS: every node type except Pathway and Gene is at least 98.5% mapped. The one category left out entirely was Pathway (0 of 1,822 nodes mapped), and Gene has the next-largest gap, with 1,562 unmapped identifiers (7.5%). One contributing factor to the unmapped concepts is that some have been deprecated and removed from their source ontologies since Hetionet was created.

These concepts will need to be upgraded to their latest versions before they can be harmonized with the UMLS.
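
To see which identifiers would need upgrading, the unmapped nodes can be listed directly. The following is a minimal sketch, not part of the original analysis: it assumes the same columns as `hnodes`, uses the five rows from `hnodes.head()` as a stand-in, and the helper name `unmapped_nodes` is hypothetical.

```python
import pandas as pd

# Stand-in for hnodes: the five rows shown by hnodes.head() above.
toy_nodes = pd.DataFrame({
    "uid": ["DB00795", "DB00795", "DB00795", "N0000000151", "100996420"],
    "name": ["Sulfasalazine", "Sulfasalazine", "Sulfasalazine",
             "Histamine H2 Receptor Antagonists", "DNM1P50"],
    "node_type": ["Compound", "Compound", "Compound",
                  "Pharmacologic Class", "Gene"],
    "fuid": ["UMLS:C0036078", "UMLS:C0699547", "UMLS:C4255898",
             "UMLS:C2757005", "100996420"],
})


def unmapped_nodes(nodes: pd.DataFrame) -> pd.DataFrame:
    """Return one row per node whose final id (fuid) is not a UMLS CUI.

    Hypothetical helper: a missing fuid is treated as unmapped (na=False).
    """
    is_cui = nodes["fuid"].str.startswith("UMLS", na=False)
    return (nodes.loc[~is_cui, ["uid", "name", "node_type"]]
            .drop_duplicates()
            .reset_index(drop=True))


unmapped = unmapped_nodes(toy_nodes)
# Number of unmapped nodes per node type.
unmapped_counts = unmapped.groupby("node_type").size()
```

On the full `hnodes` table, `unmapped_nodes(hnodes)` would give the list of identifiers to check against the current versions of their source ontologies.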

---

## Which edges are missing UMLS ids?

In [11]:
temp = (hedges
    .assign(scui = lambda df: df["source_uid"].str.startswith("UMLS"))
    .assign(tcui = lambda df: df["target_uid"].str.startswith("UMLS"))
    .assign(both = lambda df: df["scui"] & df["tcui"])
)

In [12]:
temp["both"].value_counts(normalize=True) * 100

True     97.413283
False     2.586717
Name: both, dtype: float64

In [13]:
temp.groupby(["scui", "tcui"]).size()

scui   tcui 
False  False         30
       True        1001
True   False      87799
       True     3345253
dtype: int64

Only about 2.6% of edges (88,830 of 3,434,083) have at least one endpoint that is not mapped to the UMLS, and almost all of these (87,799) are missing only the target. This is small enough for a first pass. We can go back and update the networks later.
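
A natural follow-up is to ask which edge types account for the incomplete edges. The sketch below is an illustration under assumptions rather than part of the original notebook: it assumes the same columns as `hedges`, the helper name `pct_incomplete_by_edge_type` is hypothetical, and the `EXAMPLE_TYPE` rows with `placeholder:` ids are invented solely to give the toy frame some unmapped endpoints.

```python
import pandas as pd

# Toy stand-in for hedges. The first two rows come from hedges.head();
# the EXAMPLE_TYPE rows and "placeholder:" ids are invented for illustration.
toy_edges = pd.DataFrame({
    "edge_type": ["PARTICIPATES_GpBP", "PARTICIPATES_GpBP",
                  "EXAMPLE_TYPE", "EXAMPLE_TYPE"],
    "source_uid": ["UMLS:C1857700", "UMLS:C1857700",
                   "UMLS:C1857700", "placeholder:1"],
    "target_uid": ["UMLS:C1149564", "UMLS:C1156140",
                   "placeholder:2", "UMLS:C1149564"],
})


def pct_incomplete_by_edge_type(edges: pd.DataFrame) -> pd.Series:
    """Percentage of edges per edge_type with at least one non-UMLS endpoint."""
    both = (edges["source_uid"].str.startswith("UMLS", na=False)
            & edges["target_uid"].str.startswith("UMLS", na=False))
    # The mean of a boolean Series is the fraction of True values.
    pct = (~both).groupby(edges["edge_type"]).mean() * 100
    return pct.sort_values(ascending=False)


pct_incomplete = pct_incomplete_by_edge_type(toy_edges)
```

Running `pct_incomplete_by_edge_type(hedges)` on the real table would show whether the missing 2.6% is spread evenly or concentrated in a few edge types.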