## How many UMLS CUIs do I need to map to MeSH?

How many UMLS CUIs have MeSH terms that aren't already listed as HPO cross references?

In [1]:
import pandas as pd

---

In [2]:
terms = pd.read_csv("data/hpo_terms.tsv", sep='\t')
xrefs = pd.read_csv("data/hpo_xrefs.tsv", sep='\t')
alts = pd.read_csv("data/hpo_alts.tsv", sep='\t')

---

In [3]:
phenos = pd.read_csv("data/orpha_phenos.tsv", sep='\t')

---

## Are any of the annotated phenotypes using alternate HPO terms?

In [4]:
len(set(phenos["hpo_id"]) & set(alts["alt_id"]))

25

---

## Map the alternate ids to the official ids

In [5]:
alt_ids = set(alts["alt_id"])

In [6]:
phenos["is_alt_id"] = phenos["hpo_id"].map(lambda v: v in alt_ids)

In [7]:
phenos["is_alt_id"].value_counts()

False    55985
True       277
Name: is_alt_id, dtype: int64

These 277 annotations use alternate HPO ids, most of them obsolete, so they need to be replaced with the official ids.

In [8]:
temp = (phenos
    .query("is_alt_id")
    .drop("is_alt_id", axis=1)
    .rename(columns={"hpo_id": "alt_id"})
    .merge(alts, how="inner", on="alt_id")
    .drop("alt_id", axis=1)
)

In [9]:
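# DataFrame.append was removed in pandas 2.0, so combine the frames with pd.concat.
good_phenos = (
    pd.concat([phenos.query("~is_alt_id"), temp])
    .drop(["is_alt_id", "hpo_name"], axis=1)
    # Re-attach the official term names for the (possibly remapped) ids.
    .merge(terms, how="inner", on="hpo_id")
)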

In [10]:
good_phenos.shape

(56262, 5)
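
The remapping above can also be checked on a toy example. The sketch below is not the notebook's merge-based method: it swaps in a dictionary lookup with `Series.replace`, and it assumes each alternate id maps to exactly one official id. All rows, including the id `HP:9999999`, are hypothetical.

```python
import pandas as pd

# Toy data (hypothetical rows, not taken from the real files).
phenos = pd.DataFrame({
    "dise_id": [1, 2, 3],
    "hpo_id": ["HP:0000001", "HP:9999999", "HP:0000002"],
})
alts = pd.DataFrame({
    "alt_id": ["HP:9999999"],
    "hpo_id": ["HP:0000002"],
})

# Lookup from alternate id to official id.
alt_to_official = dict(zip(alts["alt_id"], alts["hpo_id"]))

# Replace alternate ids; ids that are not in the lookup are left unchanged.
remapped = phenos.assign(hpo_id=phenos["hpo_id"].replace(alt_to_official))

# No row should be lost, and no alternate id should remain.
assert len(remapped) == len(phenos)
assert not remapped["hpo_id"].isin(alts["alt_id"]).any()
```

The same two checks (unchanged row count, no remaining alternate ids) could be run against `good_phenos` to confirm that the merge-based version behaves the same way.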

In [11]:
good_phenos.head()

| | dise_id | dise_name | freq | hpo_id | hpo_name |
|---|---|---|---|---|---|
| 0 | 166024 | Multiple epiphyseal dysplasia, Al-Gazali type | Very frequent (99-80%) | HP:0000256 | Macrocephaly |
| 1 | 58 | Alexander disease | Very frequent (99-80%) | HP:0000256 | Macrocephaly |
| 2 | 61 | Alpha-mannosidosis | Occasional (29-5%) | HP:0000256 | Macrocephaly |
| 3 | 585 | Multiple sulfatase deficiency | Frequent (79-30%) | HP:0000256 | Macrocephaly |
| 4 | 141 | Canavan disease | Frequent (79-30%) | HP:0000256 | Macrocephaly |

---

## Determine which of these terms have MeSH cross references

In [12]:
xrefs.head()

| | hpo_id | xref | xtype |
|---|---|---|---|
| 0 | HP:0000001 | UMLS:C0444868 | UMLS |
| 1 | HP:0000002 | UMLS:C4025901 | UMLS |
| 2 | HP:0000003 | MSH:D021782 | MSH |
| 3 | HP:0000003 | SNOMEDCT_US:204962002 | SNOMEDCT_US |
| 4 | HP:0000003 | SNOMEDCT_US:82525005 | SNOMEDCT_US |

In [13]:
xrefs["xtype"].value_counts()

UMLS                                     13167
SNOMEDCT_US                               4703
MSH                                       2186
MEDDRA                                      93
ICD-10                                      38
pmid                                        23
EPCC                                        13
MP                                          10
NCIT                                        10
ORPHA                                        6
MPATH                                        4
PMID                                         3
ICD-O                                        3
ICD-9                                        3
DOI                                          2
NCIT_C345                                    1
DOID                                         1
http                                         1
Absence of ceruloplasmin in the blood        1
Name: xtype, dtype: int64

In [14]:
len(set(good_phenos["hpo_id"]) & set(xrefs["hpo_id"]))

4866

In [15]:
good_phenos["hpo_id"].nunique()

4897

Most of the annotated HPO terms (4866 of 4897) have at least one cross reference, and UMLS is by far the most common type. It's not clear, however, whether the terms without a MeSH cross reference have MeSH terms at all.

A sampling of the HPO terms with UMLS CUIs but no MeSH ids shows that MeSH mappings for these terms do not exist at all. Therefore we will add the MeSH mappings when possible and just use the HPO ids when no mapping exists.
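
As a rough illustration of how the terms with a UMLS CUI but no MeSH id could be listed before sampling them, here is a minimal sketch on a toy table. The rows are hypothetical; only the column layout (`hpo_id`, `xref`, `xtype`) follows the `xrefs` table shown earlier.

```python
import pandas as pd

# Toy cross-reference table (hypothetical rows in the same layout as xrefs).
xrefs = pd.DataFrame({
    "hpo_id": ["HP:0000001", "HP:0000002", "HP:0000002", "HP:0000003"],
    "xref": ["UMLS:C0000001", "UMLS:C0000002", "MSH:D000001", "MSH:D000002"],
    "xtype": ["UMLS", "UMLS", "MSH", "MSH"],
})

# HPO ids that have at least one cross reference of each type.
has_umls = set(xrefs.loc[xrefs["xtype"] == "UMLS", "hpo_id"])
has_mesh = set(xrefs.loc[xrefs["xtype"] == "MSH", "hpo_id"])

# Terms with a UMLS CUI but no MeSH id: candidates for manual inspection.
umls_only = sorted(has_umls - has_mesh)
print(umls_only)
```

On the real data, intersecting `umls_only` with `good_phenos["hpo_id"]` would restrict the list to the terms that are actually used in the annotations.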

## Map to MeSH

In [16]:
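# Left merge: phenotypes without a MeSH cross reference are kept with a missing mesh_id.
res = good_phenos.merge(
    (xrefs
        .query("xtype == 'MSH'")
        # Strip the 4-character "MSH:" prefix to leave the bare MeSH id.
        .assign(mesh_id = lambda df: df["xref"].str[4:])
        .drop(["xref", "xtype"], axis=1)
    ),
    how="left", on="hpo_id"
)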

In [17]:
res.shape

(58866, 6)
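
The row count grows from 56262 to 58866 after the merge, presumably because some HPO terms carry more than one MSH cross reference, so each of their annotations is duplicated once per MeSH id. The toy sketch below shows this behavior of a left merge; the ids are hypothetical placeholders.

```python
import pandas as pd

# Toy data: HP:A has two MeSH ids, HP:B has one, HP:C has none (all hypothetical).
left = pd.DataFrame({"hpo_id": ["HP:A", "HP:B", "HP:C"]})
mesh = pd.DataFrame({
    "hpo_id": ["HP:A", "HP:A", "HP:B"],
    "mesh_id": ["D000001", "D000002", "D000003"],
})

merged = left.merge(mesh, how="left", on="hpo_id")

# HP:A appears twice (one row per MeSH id); HP:C is kept with a missing mesh_id.
n_rows = len(merged)
n_missing = int(merged["mesh_id"].isna().sum())

# Number of HPO terms that map to more than one MeSH id.
n_multi = int((mesh.groupby("hpo_id")["mesh_id"].nunique() > 1).sum())
print(n_rows, n_missing, n_multi)
```

If one row per annotation is required downstream, the duplicated rows would need to be collapsed or deduplicated before writing the file.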

In [18]:
res.head()

| | dise_id | dise_name | freq | hpo_id | hpo_name | mesh_id |
|---|---|---|---|---|---|---|
| 0 | 166024 | Multiple epiphyseal dysplasia, Al-Gazali type | Very frequent (99-80%) | HP:0000256 | Macrocephaly | |
| 1 | 58 | Alexander disease | Very frequent (99-80%) | HP:0000256 | Macrocephaly | |
| 2 | 61 | Alpha-mannosidosis | Occasional (29-5%) | HP:0000256 | Macrocephaly | |
| 3 | 585 | Multiple sulfatase deficiency | Frequent (79-30%) | HP:0000256 | Macrocephaly | |
| 4 | 141 | Canavan disease | Frequent (79-30%) | HP:0000256 | Macrocephaly | |

## Write to file

In [19]:
res.to_csv("dise_pheno_links.tsv", sep='\t', index=False)