# Missing ICD-10 conversion

Map Orphanet diseases that have ICD-10 codes but no UMLS CUIs directly to SNOMED IDs by exact string matching on the disease names.

#### Notes:

SNOMED contains 1.2 million concepts, which is too many to map to the 1200 missing Orphanet diseases through fuzzy string matching: at ~20 seconds per disease, the full run would take about 1200 × 20 s = 24,000 s, or close to 7 hours. MONDO contains cross-references to UMLS CUIs for some diseases, but this data does not seem to be available through the API. Therefore, unless we can find a UMLS and OMIM mapper, we will just do exact string matching.
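
To make the trade-off concrete, here is a minimal sketch contrasting the two approaches on toy data. It uses `difflib` from the Python standard library as a stand-in fuzzy matcher, which is an assumption: the notes above do not say which fuzzy matcher was timed. The names in `snomed_names` and `query` are illustrative and not taken from the real files.

```python
import difflib

# Toy stand-ins for the SNOMED name list and one Orphanet name (illustrative only).
snomed_names = ["caffey's disease", "leber hereditary optic neuropathy"]
query = "caffey disease"

# Exact matching: one hash lookup per disease, regardless of how many concepts exist.
snomed_set = set(snomed_names)
exact_hit = query in snomed_set

# Fuzzy matching: the query is scored against every candidate,
# so the cost per disease grows with the number of concepts.
fuzzy_hits = difflib.get_close_matches(query, snomed_names, n=1, cutoff=0.8)

print(exact_hit, fuzzy_hits)
```

The exact lookup misses the near-identical name that the fuzzy matcher finds, which is the recall we give up in exchange for speed.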

In [1]:
import pandas as pd

In [2]:
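# Load all SNOMED concepts and lowercase the names for case-insensitive matching.
snomed = (
    pd.read_csv("all_snomed_concepts.tsv", sep="\t")
    .assign(snomed_name=lambda df: df["snomed_name"].str.lower())
)

# Load the Orphanet diseases that still need conversion, drop duplicate rows,
# and lowercase the names.
dises = (
    pd.read_csv("need_conv.tsv", sep="\t")
    .drop_duplicates()
    .assign(dise_name=lambda df: df["dise_name"].str.lower())
)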

In [3]:
dises.shape

(1499, 2)

---

## Map through exact name matching

In [4]:
matches = dises.merge(snomed, how="inner", left_on="dise_name", right_on="snomed_name")

In [5]:
matches.shape

(351, 4)
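
The behaviour of this inner merge can be illustrated on toy data. The sketch below reuses the notebook's column names, but the values are made up. It shows that names missing from SNOMED are dropped and that a name shared by several SNOMED concepts yields one row per concept, so the row count can exceed the number of distinct diseases.

```python
import pandas as pd

# Toy frames using the notebook's column names; the values are invented.
dises = pd.DataFrame({
    "orphanet_id": [1, 2, 3],
    "dise_name": ["disease a", "disease b", "disease c"],
})
snomed = pd.DataFrame({
    "snomed_id": [100, 101, 200],
    "snomed_name": ["disease a", "disease a", "disease b"],
})

# Inner join keeps only names present in both frames; "disease c" is dropped.
matches = dises.merge(snomed, how="inner", left_on="dise_name", right_on="snomed_name")

# "disease a" matches two SNOMED concepts, so it contributes two rows.
n_rows = len(matches)
n_diseases = matches["orphanet_id"].nunique()
print(n_rows, n_diseases)
```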

In [6]:
matches.head()

| | orphanet_id | dise_name | snomed_id | snomed_name |
|---|---|---|---|---|
| 0 | 104 | leber hereditary optic neuropathy | 58610003 | leber hereditary optic neuropathy |
| 1 | 169090 | combined immunodeficiency due to crac channel ... | 717811007 | combined immunodeficiency due to crac channel ... |
| 2 | 168566 | fatal mitochondrial disease due to combined ox... | 720951008 | fatal mitochondrial disease due to combined ox... |
| 3 | 1310 | caffey disease | 24752008 | caffey disease |
| 4 | 833 | encephalopathy due to sulfite oxidase deficiency | 715980003 | encephalopathy due to sulfite oxidase deficiency |


In [7]:
matches["orphanet_id"].nunique()

276

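We were able to match 276 unique diseases to SNOMED directly. The 351 matched rows cover only 276 distinct Orphanet IDs, so some diseases appear in more than one row.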
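
To see which diseases are still unmapped after exact matching, something like the following sketch could be used. The helper `unmatched_diseases` is hypothetical and not part of the notebook; it assumes the same column names as `dises` and `snomed`, and it treats a disease as matched if any of its names matches. The toy values are invented.

```python
import pandas as pd

def unmatched_diseases(dises: pd.DataFrame, snomed: pd.DataFrame) -> pd.DataFrame:
    """Return rows of dises whose Orphanet ID has no exactly matching SNOMED name."""
    # Flag every disease name that also occurs as a SNOMED name.
    has_match = dises["dise_name"].isin(snomed["snomed_name"])
    # A disease counts as matched if at least one of its names matched.
    matched_ids = dises.loc[has_match, "orphanet_id"].unique()
    return dises[~dises["orphanet_id"].isin(matched_ids)]

# Toy data: disease 1 has two names and one of them matches; disease 2 has no match.
toy_dises = pd.DataFrame({
    "orphanet_id": [1, 1, 2],
    "dise_name": ["disease a", "a disease", "disease b"],
})
toy_snomed = pd.DataFrame({"snomed_id": [100], "snomed_name": ["disease a"]})

remaining = unmatched_diseases(toy_dises, toy_snomed)
print(remaining)
```

The remaining rows would be the candidates for a slower fuzzy pass or for a UMLS/OMIM-based mapper, should one become available.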

## Save matches to file

In [8]:
matches[["orphanet_id", "snomed_id"]].to_csv("orpha_snomed_map.tsv", sep='\t', index=False)
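
Because `matches` can hold several rows for the same disease, the two saved columns may contain repeated (Orphanet ID, SNOMED ID) pairs. Whether that happens in the real data is not verified here. The sketch below, on invented values, shows one way to drop repeated pairs before writing; it writes to an in-memory buffer instead of `orpha_snomed_map.tsv` so that it can run on its own.

```python
import io

import pandas as pd

# Toy matches frame with one repeated (orphanet_id, snomed_id) pair.
matches = pd.DataFrame({
    "orphanet_id": [1, 1, 2],
    "snomed_id": [100, 100, 200],
})

# Keep each (orphanet_id, snomed_id) pair once.
pairs = matches[["orphanet_id", "snomed_id"]].drop_duplicates()

# Write as TSV to a buffer, then read it back to confirm the layout.
buffer = io.StringIO()
pairs.to_csv(buffer, sep="\t", index=False)
buffer.seek(0)
round_trip = pd.read_csv(buffer, sep="\t")
print(round_trip)
```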