# Parse all SNOMED concepts

2017-06-22

As a last-ditch attempt to extract as much information as possible, we will use exact string matching on disease names to determine whether Orphanet diseases have SNOMED identifiers.
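
As a rough illustration of what exact string matching could look like downstream, the sketch below joins two small tables on their name columns. The `orphanet` table, its column names, and its IDs are hypothetical stand-ins (the Orphanet data is not loaded in this notebook); only `snomed_id` and `snomed_name` mirror the columns produced further down.

```python
import pandas as pd

# Toy SNOMED table using the column names produced later in this notebook.
snomed = pd.DataFrame({
    "snomed_id": [5379004, 60203004],
    "snomed_name": [
        "Small intestine muscularis propria",
        "Gingival abrasion with infection",
    ],
})

# Hypothetical Orphanet table: the column names and IDs are assumptions
# made for this example only.
orphanet = pd.DataFrame({
    "orphanet_id": [1, 2],
    "orphanet_name": [
        "Gingival abrasion with infection",
        "A disease name with no SNOMED counterpart",
    ],
})

# Exact, case-sensitive matching is an inner join on the name columns.
matches = orphanet.merge(
    snomed, how="inner", left_on="orphanet_name", right_on="snomed_name"
)
```

Because the join is case-sensitive and whitespace-sensitive, names that differ only in capitalization or spacing will not match; whether to normalize names first is a separate design decision.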

In [1]:
import pandas as pd

---

## Read SNOMED ID file

In [2]:
data = pd.read_csv("../snomed_full.txt", sep='\t')

In [3]:
data.shape

(1613155, 9)

In [4]:
data.head()

Unnamed: 0,id,effectiveTime,active,moduleId,conceptId,languageCode,typeId,term,caseSignificanceId
0,100000016,20020131,0,900000000000207008,60203004,en,900000000000013009,Abrasion or friction burn of gum with infection,900000000000020002
1,10000010,20020131,1,900000000000207008,5379004,en,900000000000013009,Small intestine muscularis propria,900000000000020002
2,100001017,20020131,1,900000000000207008,60203004,en,900000000000013009,Gingival abrasion with infection,900000000000020002
3,100002012,20020131,1,900000000000207008,60204005,en,900000000000013009,Cauterization of conjunctival lesion,900000000000020002
4,100003019,20020131,1,900000000000207008,60205006,en,900000000000013009,Retired procedure,900000000000020002


---

## Extract relevant info

In [5]:
info = data[["conceptId", "active", "term"]].drop_duplicates()

In [6]:
info.head()

Unnamed: 0,conceptId,active,term
0,60203004,0,Abrasion or friction burn of gum with infection
1,5379004,1,Small intestine muscularis propria
2,60203004,1,Gingival abrasion with infection
3,60204005,1,Cauterization of conjunctival lesion
4,60205006,1,Retired procedure


## Clean up data: drop inactive

Drop any entries that are not active. One active entry has no textual name; it is dropped as well (by the `dropna` step).

In [7]:
res = (info
    .query("active == 1")
    .rename(columns={"conceptId": "snomed_id", "term": "snomed_name"})
    .drop("active", axis=1)
    .dropna(how="any")
)

In [8]:
res.shape

(1231897, 2)
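
To see which rows the `dropna` step removes, one option is to inspect the active rows with a missing `term` before dropping them. The sketch below does this on a three-row toy frame (invented values, not taken from the SNOMED file) and then applies the same cleanup chain as above.

```python
import pandas as pd

# Toy stand-in for `info`; the values are made up for illustration.
info = pd.DataFrame({
    "conceptId": [1, 2, 3],
    "active": [0, 1, 1],
    "term": ["old name", "current name", None],
})

# Keep only active rows, then look at those without a name.
active = info.query("active == 1")
missing_name = active[active["term"].isna()]

# Same cleanup as in the cell above: rename, drop the flag, drop missing names.
res = (active
    .rename(columns={"conceptId": "snomed_id", "term": "snomed_name"})
    .drop("active", axis=1)
    .dropna(how="any")
)
```

On the real data, `missing_name` would be expected to hold the single nameless entry mentioned above.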

In [9]:
res.head()

Unnamed: 0,snomed_id,snomed_name
1,5379004,Small intestine muscularis propria
2,60203004,Gingival abrasion with infection
3,60204005,Cauterization of conjunctival lesion
4,60205006,Retired procedure
5,60206007,Cutaneous actinobacillosis of sheep AND/OR cattle


## Save to file

In [10]:
res.to_csv("all_snomed_concepts.tsv", sep='\t', index=False)
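
A quick way to confirm that the saved table can be loaded again is a round-trip check. The sketch below writes a two-row toy table to an in-memory buffer instead of `all_snomed_concepts.tsv`, so it runs without touching the file system; the same `sep='\t'` argument would apply when reading the real file.

```python
import io
import pandas as pd

# Toy table with the same columns as the saved file.
res = pd.DataFrame({
    "snomed_id": [5379004, 60203004],
    "snomed_name": [
        "Small intestine muscularis propria",
        "Gingival abrasion with infection",
    ],
})

# Write tab-separated text without the index, as in the cell above.
buffer = io.StringIO()
res.to_csv(buffer, sep='\t', index=False)

# Read it back with the matching separator.
buffer.seek(0)
back = pd.read_csv(buffer, sep='\t')
```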