# Identify rare diseases cataloged by OrphaNet

2017-06-06

1. Parse OrphaNet to get identifiers of rare diseases
2. Parse OrphaNet to identify which rare diseases are genetic in origin

## Parse out rare diseases from OrphaNet

1. Identify the diseases by OrphaNet id
2. Find cross references to other ontologies

In [1]:
import xml.etree.ElementTree as ET
from collections import defaultdict

import pandas as pd

The file was downloaded from Orphanet (http://www.orphadata.org/data/xml/en_product1.xml) and renamed to `rare_disease_list.xml` for simplicity.

In [2]:
tree = ET.parse("../data/raw/orphanet/rare_disease_list.xml")

root = tree.getroot()

In [3]:
# Collect one row per (disease, cross-reference) pair
res = defaultdict(list)

for disease in root.iter("Disorder"):
    orpha_id = disease.find("OrphaNumber").text
    name = disease.find("Name").text

    # Each ExternalReference links the disease to an id in another resource
    for ref in disease.iter("ExternalReference"):
        ref_name = ref.find("Source").text
        ref_id = ref.find("Reference").text

        res["orphanet_id"].append(orpha_id)
        res["dise_name"].append(name)

        res["ref_name"].append(ref_name)
        res["ref_id"].append(ref_id)

rare = pd.DataFrame(res)
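
The loop above adds one row per cross-reference, so a disorder with no `ExternalReference` children would not appear in `rare` at all. Below is a minimal sketch of one way to keep such disorders, run on a small inline sample rather than the real file. The sample XML (including the second, made-up disorder and the wrapper tag names) and the `parse_disorders` helper are hypothetical; only the `Disorder`, `OrphaNumber`, `Name`, `ExternalReference`, `Source`, and `Reference` tags are taken from the code above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

import pandas as pd

# Hypothetical sample; the inner tag names follow the loop above.
SAMPLE_XML = """
<JDBOR>
  <DisorderList>
    <Disorder>
      <OrphaNumber>58</OrphaNumber>
      <Name>Alexander disease</Name>
      <ExternalReferenceList>
        <ExternalReference>
          <Source>OMIM</Source>
          <Reference>203450</Reference>
        </ExternalReference>
      </ExternalReferenceList>
    </Disorder>
    <Disorder>
      <OrphaNumber>0000</OrphaNumber>
      <Name>Made-up disorder without references</Name>
    </Disorder>
  </DisorderList>
</JDBOR>
"""


def parse_disorders(root):
    """Return one row per cross-reference, or one empty row if there are none."""
    rows = defaultdict(list)
    for disease in root.iter("Disorder"):
        orpha_id = disease.find("OrphaNumber").text
        name = disease.find("Name").text

        refs = [
            (ref.find("Source").text, ref.find("Reference").text)
            for ref in disease.iter("ExternalReference")
        ]
        # Keep disorders that have no cross-references as a single empty row
        if not refs:
            refs = [(None, None)]

        for ref_name, ref_id in refs:
            rows["orphanet_id"].append(orpha_id)
            rows["dise_name"].append(name)
            rows["ref_name"].append(ref_name)
            rows["ref_id"].append(ref_id)
    return pd.DataFrame(rows)


sample = parse_disorders(ET.fromstring(SAMPLE_XML.strip()))
```

Whether the real file contains any disorders without cross-references is not checked here; comparing `rare["orphanet_id"].nunique()` against the number of `Disorder` elements would tell.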

In [4]:
rare.shape

(21445, 4)

In [5]:
rare.head()

| | dise_name | orphanet_id | ref_id | ref_name |
|---|---|---|---|---|
| 0 | Multiple epiphyseal dysplasia, Al-Gazali type | 166024 | Q77.3 | ICD-10 |
| 1 | Multiple epiphyseal dysplasia, Al-Gazali type | 166024 | 607131 | OMIM |
| 2 | Multiple epiphyseal dysplasia, with miniepiphyses | 166032 | Q77.3 | ICD-10 |
| 3 | Multiple epiphyseal dysplasia, with miniepiphyses | 166032 | 609325 | OMIM |
| 4 | Alexander disease | 58 | 203450 | OMIM |


---

### Which are the most common cross references?

In [6]:
rare.groupby("ref_name")["orphanet_id"].nunique().sort_values(ascending=False)

ref_name
ICD-10    7076
OMIM      4381
UMLS      2879
MeSH      1760
MedDRA    1166
Name: orphanet_id, dtype: int64

ICD-10 is the most common cross-reference, covering 7,076 OrphaNet diseases, followed by OMIM with 4,381.

### Are ICD-10 codes uniquely assigned to OrphaNet rare diseases?

In [7]:
rare.query("ref_name == 'ICD-10'").groupby("orphanet_id").size().value_counts()

1     6700
2      140
3       55
5       49
4       46
6       23
7       19
8       18
10      16
9        6
11       1
22       1
17       1
13       1
dtype: int64

Most rare diseases with an ICD-10 cross-reference (6,700 of 7,076) map to exactly one ICD-10 code; the remaining 376 map to between 2 and 22 codes.
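
The query above counts codes per disease, so it does not show whether one ICD-10 code is shared by several diseases. The reverse direction can be checked by grouping on `ref_id` instead. A small sketch on a toy frame whose rows are copied from the `rare.head()` output (`toy`, `diseases_per_code`, and `shared` are illustrative names):

```python
import pandas as pd

# Rows copied from the rare.head() output; dise_name omitted for brevity
toy = pd.DataFrame({
    "orphanet_id": ["166024", "166024", "166032", "166032", "58"],
    "ref_name": ["ICD-10", "OMIM", "ICD-10", "OMIM", "OMIM"],
    "ref_id": ["Q77.3", "607131", "Q77.3", "609325", "203450"],
})

# Number of distinct OrphaNet diseases attached to each ICD-10 code
diseases_per_code = (toy
    .query("ref_name == 'ICD-10'")
    .groupby("ref_id")["orphanet_id"]
    .nunique()
)

# Codes that are shared by more than one disease
shared = diseases_per_code[diseases_per_code > 1]
```

In these five rows, `Q77.3` is attached to two different diseases, so an ICD-10 code alone does not always identify a single OrphaNet entry.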

---

## Determine which rare diseases are genetic in origin

The original file was downloaded from http://www.orphadata.org/data/xml/en_product3_156.xml and renamed to `rare_genetic_dises.xml` for simplicity.

In [8]:
tree = ET.parse("../data/raw/orphanet/rare_genetic_dises.xml")

root = tree.getroot()

This file appears to encode a hierarchy of classifications, which is not important for now. We only need to know whether a disease is genetic in origin (as opposed to, say, a rare infectious disease).
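
Since `Element.iter` walks the whole tree in document order, it finds `Disorder` elements at any nesting depth, and a disorder that sits under several parents is visited once per occurrence. A minimal sketch on a made-up nested fragment follows; the `ClassificationNode` / `ClassificationNodeChildList` layout is an assumption about how the file nests its entries, and the ids and names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: "Child A" appears under two classification nodes
NESTED_XML = """
<ClassificationNode>
  <Disorder><OrphaNumber>1</OrphaNumber><Name>Parent group</Name></Disorder>
  <ClassificationNodeChildList>
    <ClassificationNode>
      <Disorder><OrphaNumber>2</OrphaNumber><Name>Child A</Name></Disorder>
    </ClassificationNode>
    <ClassificationNode>
      <Disorder><OrphaNumber>2</OrphaNumber><Name>Child A</Name></Disorder>
    </ClassificationNode>
  </ClassificationNodeChildList>
</ClassificationNode>
"""

nested_root = ET.fromstring(NESTED_XML.strip())

# iter() flattens the hierarchy: every Disorder is returned, repeats included
found = [d.find("OrphaNumber").text for d in nested_root.iter("Disorder")]

# Deduplicate to get the set of disorders regardless of position in the tree
unique_ids = sorted(set(found))
```

Flattening this way discards the parent/child relationships, which is acceptable here because only membership in the genetic classification matters.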

In [9]:
# Flatten the hierarchy: record every disorder found at any depth
res = defaultdict(list)

for disease in root.iter("Disorder"):
    orpha_id = disease.find("OrphaNumber").text
    name = disease.find("Name").text

    res["orphanet_id"].append(orpha_id)
    res["dise_name"].append(name)

# A disorder can appear more than once in the hierarchy, so drop repeats
genetic = (pd
    .DataFrame(res)
    .assign(dise_type = "rare_genetic")
    .drop_duplicates()
)

In [10]:
genetic.shape

(6490, 3)

In [11]:
genetic.head()

| | dise_name | orphanet_id | dise_type |
|---|---|---|---|
| 0 | Rare genetic disease | 98053 | rare_genetic |
| 1 | Biological anomaly without phenotypic characte... | 447874 | rare_genetic |
| 2 | Congenital deficiency in alpha-fetoprotein | 168612 | rare_genetic |
| 3 | Hereditary persistence of alpha-fetoprotein | 168615 | rare_genetic |
| 4 | Genetic hyperferritinemia without iron overload | 254704 | rare_genetic |


---

## Annotate which rare diseases are genetic

In [12]:
res = (rare
    .merge(
        genetic.drop("dise_name", axis=1),
        how="left", on="orphanet_id"
    )
    # Fill only dise_type so missing values in other columns are not relabeled
    .fillna({"dise_type": "not_genetic"})
)
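
A left merge silently adds rows if the right-hand table repeats a key. One way to guard against that, sketched here on toy data, is the `validate="many_to_one"` argument of `DataFrame.merge`, which raises `pandas.errors.MergeError` when `orphanet_id` is duplicated on the right. The frames, ids, and reference values below are made up for illustration.

```python
import pandas as pd

# Hypothetical cross-reference table: disease "2" has a missing ref_id
toy_rare = pd.DataFrame({
    "orphanet_id": ["1", "1", "2"],
    "ref_name": ["OMIM", "ICD-10", "ICD-10"],
    "ref_id": ["ref_a", "ref_b", None],
})

# Hypothetical genetic table: only disease "1" is listed
toy_genetic = pd.DataFrame({
    "orphanet_id": ["1"],
    "dise_type": ["rare_genetic"],
})

# validate="many_to_one" checks that orphanet_id is unique in toy_genetic
annotated = toy_rare.merge(
    toy_genetic, how="left", on="orphanet_id", validate="many_to_one"
)

# Label unmatched diseases without touching the missing ref_id
annotated["dise_type"] = annotated["dise_type"].fillna("not_genetic")
```

The row count of `annotated` matches `toy_rare`, mirroring the shape check in the next cell.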

In [13]:
res.shape

(21445, 5)

In [14]:
res.head()

| | dise_name | orphanet_id | ref_id | ref_name | dise_type |
|---|---|---|---|---|---|
| 0 | Multiple epiphyseal dysplasia, Al-Gazali type | 166024 | Q77.3 | ICD-10 | rare_genetic |
| 1 | Multiple epiphyseal dysplasia, Al-Gazali type | 166024 | 607131 | OMIM | rare_genetic |
| 2 | Multiple epiphyseal dysplasia, with miniepiphyses | 166032 | Q77.3 | ICD-10 | rare_genetic |
| 3 | Multiple epiphyseal dysplasia, with miniepiphyses | 166032 | 609325 | OMIM | rare_genetic |
| 4 | Alexander disease | 58 | 203450 | OMIM | rare_genetic |


In [15]:
res.groupby("dise_type")["orphanet_id"].nunique()

dise_type
not_genetic     2142
rare_genetic    5620
Name: orphanet_id, dtype: int64

---

## Output results to file

In [16]:
res.to_csv("data/rare_disease_info.tsv", sep='\t', index=False)
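
When the TSV is read back, pandas infers column types by default, so `orphanet_id` would come back as integers rather than the strings used above. A small sketch of a round trip that keeps identifiers as text, using an in-memory buffer in place of the real output path (the two rows are taken from the `res.head()` output; `out`, `inferred`, and `as_text` are illustrative names):

```python
import io

import pandas as pd

out = pd.DataFrame({
    "orphanet_id": ["58", "166024"],
    "ref_id": ["203450", "Q77.3"],
    "dise_type": ["rare_genetic", "rare_genetic"],
})

# Write to an in-memory buffer with the same options as the cell above
buffer = io.StringIO()
out.to_csv(buffer, sep="\t", index=False)

# Default read: orphanet_id is inferred as an integer column
buffer.seek(0)
inferred = pd.read_csv(buffer, sep="\t")

# dtype=str keeps every column as text, matching the original frame
buffer.seek(0)
as_text = pd.read_csv(buffer, sep="\t", dtype=str)
```

Reading with `dtype=str` matters if the file is later merged with other tables on `orphanet_id`, since a string key will not match an integer one.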