# Checking to see if the map from Hetionet ids to UMLS CUIs could have been better

2019-01-24

The first step in merging Hetionet with SemmedDB is to map the Hetionet ids to UMLS CUIs so that they can be cross-referenced with SemmedDB nodes. Here we check whether that mapping to CUIs could have been better.

In [1]:
import pandas as pd

## Read the Hetionet to UMLS mapping file

In [2]:
hnodes = pd.read_csv("../pipeline/hetionet/hetionet_nodes_umls.tsv", sep='\t')

In [3]:
hnodes.shape

(58700, 4)

In [4]:
hnodes.head()

Unnamed: 0,hetio_id,name,het_type,cui
0,1,A1BG,Gene,UMLS:C1412045
1,10,NAT2,Gene,UMLS:C0796518
2,100,ADA,Gene,UMLS:C1412179
3,1000,CDH2,Gene,UMLS:C1413277
4,10000,AKT3,Gene,UMLS:C1332074


---

## Questions to answer

1. Percentage of nodes with CUIs (grouped by node type)
2. Which nodes could not be mapped to CUIs

### Number of unique Hetionet nodes

In [5]:
hnodes["hetio_id"].nunique()

47031
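
The file has 58700 rows but only 47031 unique `hetio_id` values, so some ids appear on more than one row. The sketch below shows one way to check whether those repeats are one-to-many mappings (one Hetionet id with several final ids). The toy frame and its values are made up for illustration; only the column names `hetio_id` and `cui` come from the file above.

```python
import pandas as pd

# Toy stand-in for hnodes; these rows are illustrative, not real mappings.
toy = pd.DataFrame({
    "hetio_id": ["1", "1", "10", "100"],
    "cui": ["UMLS:C0000001", "UMLS:C0000002", "UMLS:C0000003", "UMLS:C0000004"],
})

# Number of distinct final ids attached to each Hetionet id.
cuis_per_id = toy.groupby("hetio_id")["cui"].nunique()

# Hetionet ids that map to more than one final id.
multi = cuis_per_id[cuis_per_id > 1]
print(multi)
```

Run against the real `hnodes` frame, the length of `multi` would show how many ids are one-to-many. Because of these repeats, the counts below use `nunique()` or `drop_duplicates()` rather than raw row counts.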

### Check whether final id is a CUI

In [6]:
hnodes = hnodes.assign(is_cui = lambda df: df["cui"].str.startswith("UMLS:C"))

In [7]:
hnodes.head()

Unnamed: 0,hetio_id,name,het_type,cui,is_cui
0,1,A1BG,Gene,UMLS:C1412045,True
1,10,NAT2,Gene,UMLS:C0796518,True
2,100,ADA,Gene,UMLS:C1412179,True
3,1000,CDH2,Gene,UMLS:C1413277,True
4,10000,AKT3,Gene,UMLS:C1332074,True
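
Note that `str.startswith("UMLS:C")` accepts any string with that prefix and returns a missing value for missing inputs. A stricter check is sketched below, assuming a CUI is the letter `C` followed by seven digits and that mapped ids carry the `UMLS:` prefix seen in the output above. Apart from the first entry, the example values are hypothetical.

```python
import pandas as pd

# First value is taken from the head() output; the rest are hypothetical.
ids = pd.Series(["UMLS:C1412045", "UMLS:CXYZ", "PATHWAY:example", None])

# Prefix test as used in the notebook, with missing values treated as False.
loose = ids.str.startswith("UMLS:C", na=False)

# Stricter test: the whole string must be "UMLS:" plus C and seven digits.
strict = ids.str.fullmatch(r"UMLS:C\d{7}", na=False)

print(pd.DataFrame({"id": ids, "loose": loose, "strict": strict}))
```

If the two columns agreed on the real data, the prefix test used here would be sufficient.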


### Number of unique Hetionet ids that have UMLS CUIs

In [8]:
hnodes.groupby("is_cui")["hetio_id"].nunique()

is_cui
False     3474
True     43557
Name: hetio_id, dtype: int64

In [9]:
(hnodes
    [["hetio_id", "is_cui"]]
    .drop_duplicates()
    ["is_cui"]
    .value_counts(normalize=True)
    .multiply(100)
)

True     92.613383
False     7.386617
Name: is_cui, dtype: float64

Of the 47031 Hetionet nodes, 92.6% were successfully mapped to their respective CUIs.
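
As a quick arithmetic cross-check, the percentage can be recomputed directly from the two counts printed by cell 8:

```python
# Counts copied from the groupby output above.
mapped, unmapped = 43557, 3474

total = mapped + unmapped
pct_mapped = 100 * mapped / total

print(total, round(pct_mapped, 1))  # 47031 92.6
```

The total matches the number of unique Hetionet ids reported earlier, so no id is counted as both mapped and unmapped.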

### Number of each node type mapped to CUIs

In [10]:
hnodes.groupby(["het_type", "is_cui"])["hetio_id"].nunique()

het_type             is_cui
Anatomy              True        402
Biological Process   False        60
                     True      11321
Cellular Component   False         6
                     True       1385
Compound             False         8
                     True       1544
Disease              False         2
                     True        135
Gene                 False      1562
                     True      19383
Molecular Function   False        13
                     True       2871
Pathway              False      1822
Pharmacologic Class  False         1
                     True        344
Side Effect          True       5734
Symptom              True        438
Name: hetio_id, dtype: int64

### Percentage of each node type mapped to CUIs

In [11]:
(hnodes
    [["hetio_id", "het_type", "is_cui"]]
    .drop_duplicates()
    .groupby("het_type")
    ["is_cui"]
    .value_counts(normalize=True)
    .multiply(100)
)

het_type             is_cui
Anatomy              True      100.000000
Biological Process   True       99.472806
                     False       0.527194
Cellular Component   True       99.568656
                     False       0.431344
Compound             True       99.484536
                     False       0.515464
Disease              True       98.540146
                     False       1.459854
Gene                 True       92.542373
                     False       7.457627
Molecular Function   True       99.549237
                     False       0.450763
Pathway              False     100.000000
Pharmacologic Class  True       99.710145
                     False       0.289855
Side Effect          True      100.000000
Symptom              True      100.000000
Name: is_cui, dtype: float64
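
The counts and percentages above can also be combined into one wide table with a row per node type. This is a minimal sketch on made-up data; the column names `mapped`, `unmapped`, and `pct_mapped` are illustrative choices, not part of the original notebook.

```python
import pandas as pd

# Toy stand-in for hnodes with the is_cui column already added.
toy = pd.DataFrame({
    "hetio_id": ["g1", "g2", "g3", "p1", "a1"],
    "het_type": ["Gene", "Gene", "Gene", "Pathway", "Anatomy"],
    "is_cui": [True, True, False, False, True],
})

# Unique ids per (node type, mapped?) pair, spread into two columns.
counts = (toy
    .drop_duplicates()
    .groupby(["het_type", "is_cui"])
    ["hetio_id"]
    .nunique()
    .unstack(fill_value=0)  # node types missing one outcome get 0
    .rename(columns={True: "mapped", False: "unmapped"})
)

# Share of each node type that received a CUI.
counts["pct_mapped"] = 100 * counts["mapped"] / (counts["mapped"] + counts["unmapped"])
print(counts)
```

Using `fill_value=0` keeps node types such as Pathway (no mapped ids) or Anatomy (no unmapped ids) in the table instead of leaving gaps.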

### What were the node types of nodes without UMLS CUIs?

In [12]:
hnodes.query("~is_cui")["het_type"].value_counts(normalize=True).multiply(100)

Pathway                52.446747
Gene                   44.962579
Biological Process      1.727116
Molecular Function      0.374208
Compound                0.230282
Cellular Component      0.172712
Disease                 0.057571
Pharmacologic Class     0.028785
Name: het_type, dtype: float64
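
The second question (which nodes could not be mapped) can be answered by listing the unmapped rows themselves. A minimal sketch on made-up data follows; it assumes, for illustration only, that unmapped rows keep a non-UMLS identifier in the `cui` column.

```python
import pandas as pd

# Toy stand-in for hnodes; names and ids are hypothetical.
toy = pd.DataFrame({
    "hetio_id": ["g1", "g2", "p1", "p2"],
    "name": ["gene one", "gene two", "pathway one", "pathway two"],
    "het_type": ["Gene", "Gene", "Pathway", "Pathway"],
    "cui": ["UMLS:C0000001", "g2", "p1", "p2"],
})
toy = toy.assign(is_cui=lambda df: df["cui"].str.startswith("UMLS:C"))

# One row per unmapped node, ordered by node type for easier review.
unmapped = (toy
    .query("~is_cui")
    [["hetio_id", "name", "het_type"]]
    .drop_duplicates()
    .sort_values(["het_type", "hetio_id"])
)
print(unmapped)
```

On the real data this would give the 3474 unmapped nodes, which could then be inspected by type, starting with Pathways and Genes.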

## Conclusion

Of the 47031 unique Hetionet nodes, 43557 (92.6%) were mapped to UMLS CUIs.

Two node types, Pathways (1822 nodes) and Genes (1562 nodes), account for 97.4% of the 3474 nodes without CUI mappings. No Pathway node was mapped at all, while 92.5% of Genes were.

Based on these results, the mapping from Hetionet ids to CUIs appears to be very good for every node type except Pathway, and it is probably not the main problem.