# Convert UMLS CUIs in the indication file to OMIM ids

There are fewer UMLS CUIs in the indications than there are OMIM ids in the Orphanet list of rare diseases. Because the UMLS API is slow, we convert in the cheaper direction, from CUIs to OMIM ids, and use the resulting mapping to match the indications with rare diseases.

In [1]:
import pandas as pd
import numpy as np

from collections import defaultdict
from tqdm import tqdm
from src.auth import UMLSAPI

In [None]:
api_key = ""

In [3]:
api = UMLSAPI(api_key)

## Read CUIs which need to be converted

In [4]:
cuis = (pd
    .read_csv("cd_links/simple_inds.tsv", sep='\t')
    ["umls_cui"]
    .drop_duplicates()
    .tolist()
)

## Query the UMLS API to convert CUIs

In [5]:
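def map_cui(cui):
    """Fetch the OMIM atoms of a single CUI from the UMLS REST API."""
    # restrict the returned atoms to the OMIM source vocabulary
    params = {"sabs": "OMIM"}
    url = "https://uts-ws.nlm.nih.gov/rest/content/current/CUI/{}/atoms".format(cui)
    return api.query(url, params)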

In [6]:
res = dict()
for cui in tqdm(cuis):
    res[cui] = map_cui(cui)

100%|██████████| 2192/2192 [41:03<00:00,  1.13it/s] 
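
The loop above took about 41 minutes, so it may be worth writing each response to disk as it arrives, letting an interrupted run resume where it stopped. The following is only a minimal sketch under the assumption that the responses are JSON-serialisable; `query_with_cache`, `fetch`, and the cache file name are hypothetical and not part of the original notebook, and `fake_fetch` merely stands in for `map_cui`.

```python
import json
import os
import tempfile


def query_with_cache(cuis, fetch, cache_path):
    """Call fetch(cui) for each CUI not already stored in the JSON cache file."""
    # hypothetical helper: reload earlier results if a cache file exists
    if os.path.exists(cache_path):
        with open(cache_path) as fin:
            cache = json.load(fin)
    else:
        cache = {}

    for cui in cuis:
        if cui in cache:
            continue
        cache[cui] = fetch(cui)
        # rewrite the cache after every call so an interruption loses little work
        with open(cache_path, "w") as fout:
            json.dump(cache, fout)

    return cache


# toy demonstration with a stand-in for map_cui that records its calls
calls = []


def fake_fetch(cui):
    calls.append(cui)
    return {"pageCount": 1, "result": []}


demo_path = os.path.join(tempfile.mkdtemp(), "umls_cache.json")
first = query_with_cache(["C0268596", "C0149925"], fake_fetch, demo_path)
second = query_with_cache(["C0268596", "C0149925"], fake_fetch, demo_path)
```

In the notebook this could be called as `res = query_with_cache(cuis, map_cui, "umls_cache.json")`; the second call in the demonstration makes no new requests because both CUIs are already cached.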


## Convert JSON to DataFrame

In [7]:
def parse(obj):
    """Yield (term type, source code) for each OMIM atom in one API response."""
    # should only have one page of results
    assert obj["pageCount"] == 1

    for val in obj["result"]:
        assert val["obsolete"] == "false" and val["rootSource"] == "OMIM"

        # drop the first 54 characters of the code URI, leaving e.g. "OMIM/231680"
        yield (val["termType"], val["code"][54:])
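
The fixed offset of 54 in `parse` only works while the URI prefix keeps exactly that length. A sketch of a less brittle alternative splits on the `/source/` path segment instead. The URI shape used here is an assumption inferred from the slice and the output of this notebook, the release label `2017AB` is purely illustrative, and `strip_source_prefix` is a hypothetical helper name.

```python
def strip_source_prefix(uri):
    """Return the part of a UMLS code URI after "/source/", e.g. "OMIM/231680"."""
    # hypothetical helper; assumes the URI contains a "/source/" segment
    prefix, sep, code = uri.partition("/source/")
    if not sep:
        raise ValueError("unexpected code URI: {}".format(uri))
    return code


# the release label "2017AB" is an illustrative assumption
example = "https://uts-ws.nlm.nih.gov/rest/content/2017AB/source/OMIM/231680"
stripped = strip_source_prefix(example)
```

With this helper the result would not change if the release segment of the URI had a different length.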

In [8]:
ans = defaultdict(list)
for cui, data in res.items():
    if data is not None:
        for ttype, code in parse(data):
            ans["cui"].append(cui)
            ans["term_type"].append(ttype)
            ans["code"].append(code)

In [9]:
final = pd.DataFrame(ans)

In [10]:
final.head()

          code       cui term_type
0  OMIM/231680  C0268596       ACR
1  OMIM/231680  C0268596      ETAL
2  OMIM/231680  C0268596      ETAL
3  OMIM/231680  C0268596      ETAL
4  OMIM/231680  C0268596      ETAL


In [11]:
final["cui"].nunique()

815

## Subset down to proper OMIM ids

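Codes containing "MTHU" are not proper OMIM ids and need to be discarded. A proper code is "OMIM/" followed by a six-digit number, 11 characters in total, so we keep only the codes of that length and extract the number.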

In [12]:
final = (final
    # a proper code is "OMIM/" plus a six-digit number: 11 characters
    .assign(length = lambda df: df["code"].str.len())
    .query("length == 11")
    # keep only the numeric part after "OMIM/"
    .assign(omim_id = lambda df: df["code"].str[5:])
    .astype({"omim_id": np.int64})
    .drop(["length", "code"], axis=1)
    .drop_duplicates()
)
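
The length filter above relies on every 11-character code being numeric after the prefix. A stricter check can be sketched with a regular expression, assuming proper codes always have the form `OMIM/` plus six digits. `omim_number` is a hypothetical helper, and the non-matching codes in the example are made up for illustration.

```python
import re

# "OMIM/" followed by exactly six digits, nothing else
OMIM_CODE = re.compile(r"OMIM/(\d{6})")


def omim_number(code):
    """Return the OMIM number as an int, or None for any other code shape."""
    match = OMIM_CODE.fullmatch(code)
    return int(match.group(1)) if match else None


# the first code comes from the notebook output; the others are made up
codes = ["OMIM/231680", "OMIM/MTHU000001", "OMIM/12345", "OMIM/ABCDEF"]
numbers = [omim_number(code) for code in codes]
```

Unlike the length test, this rejects an 11-character code such as `OMIM/ABCDEF` up front rather than failing later in the integer conversion.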

In [13]:
final.shape

(606, 3)

In [14]:
final.head()

         cui term_type  omim_id
0   C0268596       ACR   231680
1   C0268596      ETAL   231680
6   C0268596        PT   231680
10  C0149925       ACR   182280
11  C0149925      ETAL   182280


In [15]:
final["cui"].nunique()

219

In [16]:
final["omim_id"].nunique()

351

In [17]:
final.groupby("cui")["omim_id"].nunique().value_counts()

1     116
2      85
3      12
13      1
11      1
8       1
6       1
5       1
4       1
Name: omim_id, dtype: int64
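
Just over half of the CUIs (116 of 219) map to a single OMIM id; the rest map to several, which may matter when the mapping is later matched against the rare disease list. The sketch below reproduces the same kind of count with the standard library and lists the CUIs with more than one OMIM id. The first two pairs come from the output above; the `C0000001` rows are made up for illustration.

```python
from collections import Counter, defaultdict

# toy (cui, omim_id) pairs; the C0000001 rows are made up
pairs = [
    ("C0268596", 231680),
    ("C0149925", 182280),
    ("C0000001", 100001),
    ("C0000001", 100002),
]

# collect the distinct OMIM ids of every CUI
omim_by_cui = defaultdict(set)
for cui, omim_id in pairs:
    omim_by_cui[cui].add(omim_id)

# number of CUIs for each count of distinct OMIM ids
distribution = Counter(len(ids) for ids in omim_by_cui.values())

# CUIs that map to more than one OMIM id
ambiguous = sorted(cui for cui, ids in omim_by_cui.items() if len(ids) > 1)
```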

## Save mapping to file

In [18]:
final.to_csv("cui_to_omim.tsv", sep='\t', index=False)