# Clinical Coding and Ontology Practical

First, we will take a look at some ontologies. We will use a tool called Komenti to query them and retrieve the classes they contain.

In [None]:
# Set up Komenti
!rm -rf komenti
!rm -f komenti-0.2.0-SNAPSHOT.zip
!wget https://github.com/reality/komenti/releases/download/0.2.0-SNAPSHOT-4/komenti-0.2.0-SNAPSHOT.zip
!unzip komenti-0.2.0-SNAPSHOT.zip
!mv komenti-0.2.0-SNAPSHOT komenti

# Querying Biomedical Ontologies

Earlier we talked about the Disease Ontology (DO), and how it organises diseases into a hierarchy. We can use Komenti to look up diseases in it.

To run a query on a biomedical ontology, we can use Komenti, which links to an ontology repository called AberOWL (which you can also browse on the web at https://aber-owl.net/).

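In the following command, we use komenti to look up the class labelled "cardiomyopathy" in the Disease Ontology. The arguments are:

* -q: The query you're running. Note: if the label is more than one word, it additionally needs single quotes around it, inside the double quotes (e.g. "'mood disorder'").
* -o: The ontology to query, e.g. DOID for the Disease Ontology or HP for the Human Phenotype Ontology.
* --class-mode
* --query-type: The kind of result to retrieve. This practical uses equivalent (classes equivalent to the query, here the class itself), subclass (more specific classes) and subeq (both).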

You can get a full-ish list of the query arguments [here](https://github.com/reality/Komenti)


In [2]:
!komenti query -q "cardiomyopathy" -o DOID --class-mode --query-type equivalent

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
cardiomyopathy	http://purl.obolibrary.org/obo/DOID_0050700	cardiomyopathy	DOID	1


By changing the query type, we can also retrieve more specific kinds of disease:

In [3]:
!komenti query -q "'mood disorder'" -o DOID --class-mode --query-type subclass

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
cyclothymic disorder	http://purl.obolibrary.org/obo/DOID_845	'mood disorder'	DOID	1
endogenous depression	http://purl.obolibrary.org/obo/DOID_1595	'mood disorder'	DOID	1
bipolar disorder	http://purl.obolibrary.org/obo/DOID_3312	'mood disorder'	DOID	1
mental depression	http://purl.obolibrary.org/obo/DOID_1596	'mood disorder'	DOID	1
major depressive disorder	http://purl.obolibrary.org/obo/DOID_1470	'mood disorder'	DOID	1
postpartum depression	http://purl.obolibrary.org/obo/DOID_9478	'mood disorder'	DOID	1
seasonal affective disorder	http://purl.obolibrary.org/obo/DOID_0060167	'mood disorder'	DOID	1
bipolar ll disorder	http://purl.obolibrary.org/obo/DOID_0060166	'mood disorder'	DOID	1
bipolar i disorder	http://purl.obolibrary.org/obo/DOID_14042	'mood disorder'	DOID	1
atypical depres
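The rows Komenti prints are tab-separated, so they are easy to pick apart in Python. Here is a minimal sketch, assuming each result row has the shape seen above (label, IRI, query, ontology, then a trailing field). The `KomentiRow` type and its field names are made up for illustration and are not part of Komenti:

```python
from typing import NamedTuple


class KomentiRow(NamedTuple):
    # Field names are assumptions based on the output shown above
    label: str
    iri: str
    query: str
    ontology: str


def parse_komenti_lines(lines):
    """Keep only tab-separated result rows, skipping log noise such as SLF4J warnings."""
    rows = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        # A result row has at least four fields, with an IRI in the second one
        if len(fields) >= 4 and fields[1].startswith("http"):
            rows.append(KomentiRow(*fields[:4]))
    return rows


sample = [
    "SLF4J: Defaulting to no-operation (NOP) logger implementation",
    "bipolar disorder\thttp://purl.obolibrary.org/obo/DOID_3312\t'mood disorder'\tDOID\t1",
    "postpartum depression\thttp://purl.obolibrary.org/obo/DOID_9478\t'mood disorder'\tDOID\t1",
]

rows = parse_komenti_lines(sample)
for r in rows:
    print(r.label, "->", r.iri)
```

The trailing field is ignored here because its meaning is not stated in this practical.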

We can also retrieve different phenotypes from the Human Phenotype Ontology (HP):

In [None]:
!komenti query -q "hypertension" -o HP --class-mode --query-type subeq

You can also retrieve classes directly from their IRIs. You just have to wrap the argument in angle brackets (< and >). Try going to [AberOWL](https://aber-owl.net/ontology/HP) and picking a phenotype you're interested in, then insert its IRI into the query below:

In [None]:
!komenti query -q "<[INSERT IRI HERE]>" -o HP --class-mode --query-type equivalent
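The quoting rules for `-q` (plain single words, single quotes around multi-word labels, angle brackets around IRIs) are easy to get wrong by hand. A small sketch of a helper that applies them is below. `format_query` and `komenti_args` are hypothetical names, and the helper only uses the flags that appear in the commands above:

```python
def format_query(term):
    """Format a value for -q following the conventions used in this practical."""
    if term.startswith(("http://", "https://")):
        return f"<{term}>"   # IRIs are wrapped in angle brackets
    if " " in term:
        return f"'{term}'"   # multi-word labels get single quotes
    return term              # single-word labels are passed as-is


def komenti_args(term, ontology, query_type):
    """Build the argument list for a class-mode query (does not run anything)."""
    return [
        "komenti", "query",
        "-q", format_query(term),
        "-o", ontology,
        "--class-mode",
        "--query-type", query_type,
    ]


print(komenti_args("mood disorder", "DOID", "subclass"))
print(format_query("http://purl.obolibrary.org/obo/DOID_0050700"))
```

If you wanted to run the query from Python rather than with `!`, a list like this could be handed to `subprocess.run`, which avoids the extra layer of shell double quotes.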

# Some Clinical Data!

Now we can start to take a look at our demo dataset, the MIMIC-III clinical database demo, which you may already be familiar with. Let's download it:

In [6]:
!wget https://physionet.org/static/published-projects/mimiciii-demo/mimic-iii-clinical-database-demo-1.4.zip
!unzip mimic-iii-clinical-database-demo-1.4.zip
!mv mimic-iii-clinical-database-demo-1.4 mimic

--2021-10-03 15:36:58--  https://physionet.org/static/published-projects/mimiciii-demo/mimic-iii-clinical-database-demo-1.4.zip
SSL_INIT
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving physionet.org (physionet.org)... 18.18.42.54
Connecting to physionet.org (physionet.org)|18.18.42.54|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 11350216 (11M) [application/zip]
Saving to: ‘mimic-iii-clinical-database-demo-1.4.zip’


2021-10-03 15:37:01 (5.72 MB/s) - ‘mimic-iii-clinical-database-demo-1.4.zip’ saved [11350216/11350216]

Archive:  mimic-iii-clinical-database-demo-1.4.zip
   creating: mimic-iii-clinical-database-demo-1.4/
  inflating: mimic-iii-clinical-database-demo-1.4/ADMISSIONS.csv  
  inflating: mimic-iii-clinical-database-demo-1.4/CALLOUT.csv  
  inflating: mimic-iii-clinical-database-demo-1.4/CAREGIVERS.csv  
  inflating: mimic-iii-clinical-database-demo-1.4/CHARTEVENTS.csv  
  inflating: mimic-iii-clinical-databas

We can load some of the interesting tables into pandas, and take a look:

In [44]:
import pandas as pd

diagnoses = pd.read_csv("mimic/DIAGNOSES_ICD.csv")
diagnoses

Unnamed: 0,row_id,subject_id,hadm_id,seq_num,icd9_code
0,112344,10006,142345,1,99591
1,112345,10006,142345,2,99662
2,112346,10006,142345,3,5672
3,112347,10006,142345,4,40391
4,112348,10006,142345,5,42731
...,...,...,...,...,...
1756,397673,44228,103379,7,1975
1757,397674,44228,103379,8,45182
1758,397675,44228,103379,9,99592
1759,397676,44228,103379,10,2449


So, here we have a few columns:

* row_id: Identifies the row in this table (boring)
* subject_id: Identifies the individual/patient
* hadm_id: Identifies the admission, or the particular hospital stay
* seq_num: The order (priority) of the diagnosis within the admission
* icd9_code: The ICD-9 code for the diagnosis

So, this table lists the diagnoses recorded for each patient admission, identified by ICD-9 code. These codes are assigned by professional clinical coders. We also have definitions provided in another CSV file, which we can cross-reference with our diagnoses data frame to get some more information, or at least the names, of these diagnoses.
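To get a feel for how hadm_id and seq_num work together, here is a small sketch on a toy data frame. The rows reuse a few values from the table above, deliberately shuffled, and the variable names are just for illustration:

```python
import pandas as pd

# Toy rows taken from the table above, in shuffled order
toy = pd.DataFrame({
    "hadm_id": [142345, 142345, 142345, 103379, 103379],
    "seq_num": [2, 1, 3, 10, 9],
    "icd9_code": ["99662", "99591", "5672", "2449", "99592"],
})

# seq_num == 1 picks out the first-listed diagnosis of each admission
primary = toy[toy["seq_num"] == 1]

# Collect every admission's codes as a list, ordered by seq_num
codes_by_admission = (
    toy.sort_values(["hadm_id", "seq_num"])
       .groupby("hadm_id")["icd9_code"]
       .apply(list)
       .to_dict()
)

print(primary["icd9_code"].tolist())
print(codes_by_admission)
```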

Let's first load the definitions table:

In [45]:
diag_defs = pd.read_csv("mimic/D_ICD_DIAGNOSES.csv")[['icd9_code', 'long_title']]
diag_defs

Unnamed: 0,icd9_code,long_title
0,01716,Erythema nodosum with hypersensitivity reactio...
1,01720,"Tuberculosis of peripheral lymph nodes, unspec..."
2,01721,"Tuberculosis of peripheral lymph nodes, bacter..."
3,01722,"Tuberculosis of peripheral lymph nodes, bacter..."
4,01723,"Tuberculosis of peripheral lymph nodes, tuberc..."
...,...,...
14562,V8712,Contact with and (suspected) exposure to benzene
14563,V8719,Contact with and (suspected) exposure to other...
14564,V872,Contact with and (suspected) exposure to other...
14565,V8731,Contact with and (suspected) exposure to mold


We can see here that we have label information for each ICD9 code. We can now do a simple join on the diagnoses table to add the disease titles to the table... 

In [46]:
diag_with_titles = diagnoses.merge(diag_defs, on="icd9_code", how="left")
diag_with_titles

Unnamed: 0,row_id,subject_id,hadm_id,seq_num,icd9_code,long_title
0,112344,10006,142345,1,99591,Sepsis
1,112345,10006,142345,2,99662,Infection and inflammatory reaction due to oth...
2,112346,10006,142345,3,5672,
3,112347,10006,142345,4,40391,"Hypertensive chronic kidney disease, unspecifi..."
4,112348,10006,142345,5,42731,Atrial fibrillation
...,...,...,...,...,...,...
1756,397673,44228,103379,7,1975,Secondary malignant neoplasm of large intestin...
1757,397674,44228,103379,8,45182,Phlebitis and thrombophlebitis of superficial ...
1758,397675,44228,103379,9,99592,Severe sepsis
1759,397676,44228,103379,10,2449,Unspecified acquired hypothyroidism


That will help us read it. This demonstrates the role of vocabulary codes to uniquely identify a particular concept: in this case a disease.
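Two things are worth checking after a join like this: that the codes were read as strings (ICD-9 codes such as 01716 have leading zeros that disappear if pandas parses them as numbers), and which codes found no title. A minimal sketch on toy CSV text follows. The title for 01716 is a placeholder, and `indicator=True` is an addition that the cell above does not use:

```python
import io
import pandas as pd

toy_diag = "hadm_id,icd9_code\n142345,99591\n142345,5672\n142345,42731\n"
toy_defs = (
    "icd9_code,long_title\n"
    "99591,Sepsis\n"
    "42731,Atrial fibrillation\n"
    "01716,placeholder title\n"
)

# dtype=str keeps leading zeros such as '01716' intact
diags = pd.read_csv(io.StringIO(toy_diag), dtype={"icd9_code": str})
defs = pd.read_csv(io.StringIO(toy_defs), dtype={"icd9_code": str})

# indicator=True adds a _merge column saying where each row came from
merged = diags.merge(defs, on="icd9_code", how="left", indicator=True)
unmatched = merged.loc[merged["_merge"] == "left_only", "icd9_code"].tolist()

print(defs["icd9_code"].tolist())
print("codes without a title:", unmatched)
```

In the real table the column happens to be read as text already, because some codes contain letters, but being explicit costs nothing.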

This is kind of nice, but one of the problems we have here is that ICD-9 is seriously outmoded. Seriously, I think it's older than me. This, and the lack of formal semantic features, makes it difficult to do useful things with this terminology.

Well, we can use Komenti to download metadata associated with classes. However, that would take a little time, and the result takes a bit of work to parse and match up with our diagnoses, especially if we were interested in all of them. Thankfully, we can use this handy DOID<->ICD9 mapping I prepared earlier!

In [47]:
!rm -f doid_icd_map*
!wget http://lokero.xyz/doid_icd_mappings.txt

doid_icd_map = pd.read_csv('doid_icd_mappings.txt', sep='\t')
# MIMIC stores ICD-9 codes without the decimal point, so strip it from the mapping too
doid_icd_map['icd9_code'] = doid_icd_map['icd9_code'].map(lambda icd: icd.replace('.', ''))

doid_icd_map

--2021-10-03 19:40:44--  http://lokero.xyz/doid_icd_mappings.txt
Resolving lokero.xyz (lokero.xyz)... 176.58.107.169
Connecting to lokero.xyz (lokero.xyz)|176.58.107.169|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 101061 (99K) [text/plain]
Saving to: ‘doid_icd_mappings.txt’


2021-10-03 19:40:45 (349 KB/s) - ‘doid_icd_mappings.txt’ saved [101061/101061]



Unnamed: 0,icd9_code,doid
0,11510,http://purl.obolibrary.org/obo/DOID_11315
1,3310,http://purl.obolibrary.org/obo/DOID_10652
2,1150,http://purl.obolibrary.org/obo/DOID_1759
3,37205,http://purl.obolibrary.org/obo/DOID_11203
4,37945,http://purl.obolibrary.org/obo/DOID_14523
...,...,...
2113,1112,http://purl.obolibrary.org/obo/DOID_13902
2114,37333,http://purl.obolibrary.org/obo/DOID_9140
2115,37515,http://purl.obolibrary.org/obo/DOID_10138
2116,102,http://purl.obolibrary.org/obo/DOID_10371


Very nice. Now we can do another simple left join on our diagnoses table to add the mappings, keeping only the rows that have one:

In [50]:
diag_with_doid = diag_with_titles.merge(doid_icd_map, on="icd9_code", how="left")
diag_with_doid = diag_with_doid[diag_with_doid['doid'].notnull()]
print(str(len(diag_with_doid)) + " out of")

diag_with_doid

334 out of


Unnamed: 0,row_id,subject_id,hadm_id,seq_num,icd9_code,long_title,doid
0,112344,10006,142345,1,99591,Sepsis,http://purl.obolibrary.org/obo/DOID_0040085
4,112348,10006,142345,5,42731,Atrial fibrillation,http://purl.obolibrary.org/obo/DOID_0060224
6,112350,10006,142345,7,4241,Aortic valve disorders,http://purl.obolibrary.org/obo/DOID_62
8,112352,10006,142345,9,2874,,http://purl.obolibrary.org/obo/DOID_11126
21,112394,10011,105331,1,570,Acute and subacute necrosis of liver,http://purl.obolibrary.org/obo/DOID_2237
...,...,...,...,...,...,...,...
1746,397650,44222,192189,8,3572,Polyneuropathy in diabetes,http://purl.obolibrary.org/obo/DOID_12785
1749,397653,44222,192189,11,72400,"Spinal stenosis, unspecified region",http://purl.obolibrary.org/obo/DOID_6725
1755,397672,44228,103379,6,1561,Malignant neoplasm of extrahepatic bile ducts,http://purl.obolibrary.org/obo/DOID_4606
1759,397676,44228,103379,10,2449,Unspecified acquired hypothyroidism,http://purl.obolibrary.org/obo/DOID_1459


You can see already from above that some are missing. Sad. We could be more sophisticated here: there is a whole field, ontology alignment, that might allow us to do nicer things. For now we're just going to ignore everything that's not mapped.

So this can help us do things like data lookup. Let's say we're interested in heart disease, and different kinds of heart disease:
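Before ignoring the unmapped rows, it helps to know how many there are and which codes they involve. A rough sketch on toy data is below. The four codes come from the first admission shown above, where 99591 and 42731 have a mapping and 99662 and 5672 do not. Counting distinct row_id values guards against a code that maps to more than one DOID, which would duplicate rows in the join:

```python
import pandas as pd

diags = pd.DataFrame({
    "row_id": [112344, 112345, 112346, 112348],
    "icd9_code": ["99591", "99662", "5672", "42731"],
})
mapping = pd.DataFrame({
    "icd9_code": ["99591", "42731"],
    "doid": [
        "http://purl.obolibrary.org/obo/DOID_0040085",
        "http://purl.obolibrary.org/obo/DOID_0060224",
    ],
})

merged = diags.merge(mapping, on="icd9_code", how="left")
is_mapped = merged["doid"].notnull()

# Count distinct diagnosis rows, not merged rows
n_mapped = merged.loc[is_mapped, "row_id"].nunique()
unmapped_codes = sorted(merged.loc[~is_mapped, "icd9_code"].unique())

print(f"{n_mapped} out of {len(diags)} diagnosis rows mapped")
print("unmapped codes:", unmapped_codes)
```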

In [67]:
import csv

!komenti query -q "'heart disease'" -o DOID --class-mode --query-type subeq --out hd_classes.tsv

# Map each class IRI to the first label listed for it
hd_classes = dict()
with open('hd_classes.tsv') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        if row[1] not in hd_classes:
            hd_classes[row[1]] = row[0]
hd_classes

SLF4J: Failed to load class "org.slf4j.impl.StaticLoggerBinder".
SLF4J: Defaulting to no-operation (NOP) logger implementation
SLF4J: See http://www.slf4j.org/codes.html#StaticLoggerBinder for further details.
Saved 702 labels for 256 terms to hd_classes.tsv


{'http://purl.obolibrary.org/obo/DOID_6419': 'tetralogy of fallot',
 'http://purl.obolibrary.org/obo/DOID_0110647': 'long qt syndrome 5',
 'http://purl.obolibrary.org/obo/DOID_0110648': 'long qt syndrome 6',
 'http://purl.obolibrary.org/obo/DOID_11516': 'hypertensive heart disease',
 'http://purl.obolibrary.org/obo/DOID_1882': 'atrial heart septal defect',
 'http://purl.obolibrary.org/obo/DOID_0060068': 'nonbacterial thrombotic endocarditis',
 'http://purl.obolibrary.org/obo/DOID_0110083': 'arrhythmogenic right ventricular dysplasia 12',
 'http://purl.obolibrary.org/obo/DOID_0110084': 'arrhythmogenic right ventricular dysplasia 13',
 'http://purl.obolibrary.org/obo/DOID_0110081': 'arrhythmogenic right ventricular dysplasia 10',
 'http://purl.obolibrary.org/obo/DOID_0110082': 'arrhythmogenic right ventricular dysplasia 11',
 'http://purl.obolibrary.org/obo/DOID_0110645': 'long qt syndrome 2',
 'http://purl.obolibrary.org/obo/DOID_0110646': 'long qt syndrome 3',
 'http://purl.obolibrary.

Now let's find some of these in our rows...

In [75]:
hd_admissions = diag_with_doid[diag_with_doid['doid'].isin(hd_classes.keys())]
hd_admissions['long_title'].value_counts()

Atrial fibrillation                       48
Aortic valve disorders                     6
Diseases of tricuspid valve                5
Primary pulmonary hypertension             3
Other chronic pulmonary heart diseases     3
Cardiac arrest                             3
Rheumatic heart failure (congestive)       2
Other heart block                          1
Name: long_title, dtype: int64
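Note that these counts are diagnosis rows, not admissions: one admission can carry several heart disease codes. A small sketch of the difference on toy rows follows. The IRIs are ones that appear in the tables above, and `hd_iris` stands in for the keys of `hd_classes`:

```python
import pandas as pd

# Stand-in for hd_classes.keys(): atrial fibrillation and aortic valve disorders
hd_iris = {
    "http://purl.obolibrary.org/obo/DOID_0060224",
    "http://purl.obolibrary.org/obo/DOID_62",
}

toy = pd.DataFrame({
    "hadm_id": [142345, 142345, 142345, 105331],
    "doid": [
        "http://purl.obolibrary.org/obo/DOID_0040085",
        "http://purl.obolibrary.org/obo/DOID_0060224",
        "http://purl.obolibrary.org/obo/DOID_62",
        "http://purl.obolibrary.org/obo/DOID_2237",
    ],
})

hits = toy[toy["doid"].isin(hd_iris)]
n_rows = len(hits)                        # matching diagnosis rows
n_admissions = hits["hadm_id"].nunique()  # distinct admissions with a match

print(n_rows, "rows across", n_admissions, "admission(s)")
```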

Now try this with another disease or set of diseases you're interested in:

In [None]:
import csv

# Replace 'heart disease' with the disease you're interested in
!komenti query -q "'heart disease'" -o DOID --class-mode --query-type subeq --out hd_classes.tsv

# Map each class IRI to the first label listed for it
hd_classes = dict()
with open('hd_classes.tsv') as data:
    reader = csv.reader(data, delimiter='\t')
    for row in reader:
        if row[1] not in hd_classes:
            hd_classes[row[1]] = row[0]

hd_admissions = diag_with_doid[diag_with_doid['doid'].isin(hd_classes.keys())]
hd_admissions

# Part two: Semantic analysis and characterisation

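So now we can start to think about how to characterise groups of admissions semantically. First, we group the admissions by whether they include atrial fibrillation (AF), another heart disease, or no heart disease at all, and write them to a tab-separated file: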

In [96]:
AF_IRI = 'http://purl.obolibrary.org/obo/DOID_0060224'  # atrial fibrillation

# Collect the DOID classes for each admission and flag AF and other heart disease
admissions = dict()
for index, row in diag_with_doid.iterrows():
    if row['hadm_id'] not in admissions:
        admissions[row['hadm_id']] = {
            'codes': list(),
            'contains_af': False,
            'contains_other_hd': False
        }

    admissions[row['hadm_id']]['codes'].append(row['doid'])
    if row['doid'] in hd_classes:
        if row['doid'] == AF_IRI:
            admissions[row['hadm_id']]['contains_af'] = True
        else:
            admissions[row['hadm_id']]['contains_other_hd'] = True

# One line per admission: ID, ';'-separated class IRIs, ';'-separated groups
with open('klar_input.tsv', 'w') as f:
    for aid, v in admissions.items():
        groups = list()
        if v['contains_af']:
            groups.append('AF')
        if v['contains_other_hd']:
            groups.append('OtherHD')
        if len(groups) == 0:
            groups.append('OtherNonHD')

        line = "\t".join((str(aid), ";".join(v['codes']), ";".join(groups)))
        f.write(line + '\n')
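It is worth sanity-checking the file before handing it on. The sketch below reads lines in the same three-column layout as klar_input.tsv (admission ID, ';'-separated class IRIs, ';'-separated groups) and counts group membership. It uses toy text rather than the real file, with IRIs taken from the tables above:

```python
import csv
import io
from collections import Counter

toy_tsv = (
    "142345\thttp://purl.obolibrary.org/obo/DOID_0060224;"
    "http://purl.obolibrary.org/obo/DOID_62\tAF;OtherHD\n"
    "105331\thttp://purl.obolibrary.org/obo/DOID_2237\tOtherNonHD\n"
)

group_counts = Counter()
classes_per_admission = {}
for aid, codes, groups in csv.reader(io.StringIO(toy_tsv), delimiter="\t"):
    # An admission may belong to more than one group
    group_counts.update(groups.split(";"))
    classes_per_admission[aid] = len(codes.split(";"))

print(dict(group_counts))
print(classes_per_admission)
```

To check the real file, swap `io.StringIO(toy_tsv)` for `open('klar_input.tsv')`.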
        

In [90]:
!rm -rf klarigi
!rm doid.owl
!wget http://lokero.xyz/klarigi-0.0.12-SNAPSHOT.tar
!tar -xvf klarigi-0.0.12-SNAPSHOT.tar
!mv klarigi-0.0.12-SNAPSHOT klarigi
!wget http://purl.obolibrary.org/obo/doid.owl

--2021-10-03 20:56:56--  http://lokero.xyz/klarigi-0.0.12-SNAPSHOT.tar
Resolving lokero.xyz (lokero.xyz)... 176.58.107.169
Connecting to lokero.xyz (lokero.xyz)|176.58.107.169|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 53022720 (51M) [application/x-tar]
Saving to: ‘klarigi-0.0.12-SNAPSHOT.tar’


2021-10-03 20:57:03 (7.22 MB/s) - ‘klarigi-0.0.12-SNAPSHOT.tar’ saved [53022720/53022720]

klarigi-0.0.12-SNAPSHOT/
klarigi-0.0.12-SNAPSHOT/lib/
klarigi-0.0.12-SNAPSHOT/lib/klarigi-0.0.12-SNAPSHOT.jar
klarigi-0.0.12-SNAPSHOT/lib/slib-sml-0.9.1.jar
klarigi-0.0.12-SNAPSHOT/lib/slib-tools-module-0.9.1.jar
klarigi-0.0.12-SNAPSHOT/lib/commons-cli-1.4.jar
klarigi-0.0.12-SNAPSHOT/lib/slf4j-nop-1.7.30.jar
klarigi-0.0.12-SNAPSHOT/lib/elk-owlapi-0.4.3.jar
klarigi-0.0.12-SNAPSHOT/lib/owlapi-distribution-4.5.19.jar
klarigi-0.0.12-SNAPSHOT/lib/owlapi-compatibility-4.5.19.jar
klarigi-0.0.12-SNAPSHOT/lib/owlapi-apibinding-4.5.19.jar
klarigi-0.0.12-SNAPSHOT/lib/owla

--2021-10-03 20:57:03--  http://purl.obolibrary.org/obo/doid.owl
Resolving purl.obolibrary.org (purl.obolibrary.org)... 52.3.123.63
Connecting to purl.obolibrary.org (purl.obolibrary.org)|52.3.123.63|:80... connected.
HTTP request sent, awaiting response... 302 Found
Location: https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/doid.owl [following]
--2021-10-03 20:57:04--  https://raw.githubusercontent.com/DiseaseOntology/HumanDiseaseOntology/main/src/ontology/doid.owl
SSL_INIT
Loaded CA certificate '/etc/ssl/certs/ca-certificates.crt'
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... Read error (Decryption has failed.) in headers.
Retrying.

--2021-10-03 20:57:05--  (try: 2)  https://raw.githubusercontent.com/DiseaseOntology/HumanD

In [97]:
!klarigi/bin/klarigi --data klar_input.tsv --ontology doid.owl

AF: Scoring completed. Candidates: 236
OtherHD: Scoring completed. Candidates: 236
OtherNonHD: Scoring completed. Candidates: 236
----------------
Group: AF (48 members)
Overall inclusion: 100.0%
Overall exclusion: 98.78%
Explanatory classes:
  IRI: atrial fibrillation (http://purl.obolibrary.org/obo/DOID_0060224), Power: 0.85 (inc: 1.0, exc: 0.85), IC: 0.8
----------------

----------------
Group: OtherHD (19 members)
Overall inclusion: 100.0%
Overall exclusion: 99.1%
Explanatory classes:
  IRI: heart valve disease (http://purl.obolibrary.org/obo/DOID_4079), Power: 0.51 (inc: 0.58, exc: 0.93), IC: 0.63
  IRI: atrial fibrillation (http://purl.obolibrary.org/obo/DOID_0060224), Power: 0.2 (inc: 0.63, exc: 0.57), IC: 0.8
  IRI: congestive heart failure (http://purl.obolibrary.org/obo/DOID_6000), Power: 0.52 (inc: 0.58, exc: 0.94), IC: 0.67
  IRI: heart conduction disease (http://purl.obolibrary.org/obo/DOID_10273), Power: 0.2 (inc: 0.63, exc: 0.57), IC: 0.61
----------------

------------
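When reading the explanatory classes, it helps to see how the three numbers on each line relate. The figures printed above are consistent with Power = inc + exc - 1, up to rounding. That is only an observation about this output, not a statement of how Klarigi defines power, so treat the sketch below as a quick arithmetic check:

```python
# (label, inc, exc, reported power), copied from the output above
reported = [
    ("atrial fibrillation, AF group", 1.0, 0.85, 0.85),
    ("heart valve disease", 0.58, 0.93, 0.51),
    ("atrial fibrillation, OtherHD group", 0.63, 0.57, 0.2),
    ("congestive heart failure", 0.58, 0.94, 0.52),
    ("heart conduction disease", 0.63, 0.57, 0.2),
]

differences = []
for label, inc, exc, power in reported:
    guess = inc + exc - 1  # assumed relationship, inferred from the numbers only
    differences.append(abs(guess - power))
    print(f"{label}: inc + exc - 1 = {guess:.2f}, reported power = {power}")

# The inputs are rounded to two decimals, so allow a little slack
print("all within rounding:", all(d < 0.011 for d in differences))
```

Read this way, atrial fibrillation scores highly for the AF group because it covers every member and few admissions outside it, whereas in the OtherHD group it is also common outside the group, which drags its power down.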