Welcome to chapter six of Methods in Medical Informatics! In this section, we will be working with the International Classification of Diseases (ICD). ICD is a nomenclature of the diseases occurring in humans, with each listed disease assigned a unique identifying code. The World Health Organization also produces a specialized cancer nomenclature, known as ICD-O (ICD-Oncology). In this chapter, we will use ICD and ICD-O in scripts. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# Creating the ICD Dictionary

If we have a computer-parsable list of ICD codes, we can write a short program that assigns human-readable terms (full names of diseases) to the codes in the mortality file. An electronic version of the ICD is provided by the CDC under the filename "each10.txt". We will create a dictionary data object consisting of ICD codes (as dictionary keys) and their corresponding terms (as dictionary values).

In [None]:
import re

dictionary = {}
# Read the whole file into a single string
in_text = open('./K11946_Files/each10.txt', "r")
in_text_string = in_text.read()
in_text.close()
linearray = in_text_string.split("\n")
for item in linearray:
    # Optional leading spaces or asterisks, then the code, then the term
    m = re.search(r'^[ \*]*([A-Z][0-9\.]{1,7}) ?([^0-9].+)', item)
    if m:
        code = m.group(1)
        term = m.group(2)
        dictionary[code] = term
out_text = open('./K11946_Files/each10.out', "w")
dict_list = dictionary.keys()
sort_list = sorted(dict_list)
for i in sort_list:
    # file=out_text sends the line to the output file instead of the screen
    print("%-8.08s %s" % (i, dictionary[i]), file=out_text)
out_text.close()
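
Once the dictionary is built, assigning terms to codes is a plain lookup. The following is a minimal sketch, assuming a small hand-made dictionary in place of the full one. The helper name `annotate_codes`, the sample list of codes, and the placeholder text for unknown codes are illustrative assumptions and do not come from the original script.

```python
# Stand-in for the dictionary built from each10.txt (three entries only)
icd_terms = {
    "Y89": "Sequelae of other external causes",
    "Y89.0": "Sequelae of legal intervention",
    "Y89.1": "Sequelae of war operations",
}

def annotate_codes(codes, terms):
    """Pair each code with its term; unknown codes get a placeholder."""
    return [(code, terms.get(code, "(term not found)")) for code in codes]

# Hypothetical list of codes taken from one record
record_codes = ["Y89.1", "Z99.9"]
annotated = annotate_codes(record_codes, icd_terms)
for code, term in annotated:
    print(code, term)
```

Using `dict.get` with a default avoids a `KeyError` when a record contains a code that the dictionary does not hold.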

## Script Algorithm: Creating the ICD Dictionary

Open the each10.txt file

In [None]:
import re

dictionary = {}
in_text = open('./K11946_Files/each10.txt', "r")

Put the entire file into a string variable

In [None]:
in_text_string = in_text.read()
in_text.close()

Split the string variable at every newline character, producing a list with one item per line

In [None]:
linearray = in_text_string.split("\n")
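
As a small aside on this step, `split("\n")` and `splitlines()` behave differently when the text ends with a newline. The sketch below uses a made-up two-line string (an assumption, not an excerpt of each10.txt) to show the difference.

```python
# Two lines of made-up text, ending with a newline as most text files do
sample_text = "A00 Cholera\nA01 Typhoid and paratyphoid fevers\n"

by_split = sample_text.split("\n")         # keeps a trailing empty string
by_splitlines = sample_text.splitlines()   # no trailing empty string

print(by_split)
print(by_splitlines)
```

The extra empty item from `split("\n")` is harmless in this script, because an empty string never matches the regular expression applied to each item.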

For each split item, add the code (as the key) and the term  (as the value) to the dictionary

In [None]:
for item in linearray:
    m = re.search(r'^[ \*]*([A-Z][0-9\.]{1,7}) ?([^0-9].+)', item)
    if m:
        code = m.group(1)
        term = m.group(2)
        dictionary[code] = term
out_text = open('./K11946_Files/each10.out', "w")
dict_list = dictionary.keys()
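
To see what the regular expression captures, it can be tried on a few lines in isolation. This is a minimal sketch: the pattern is the one used in the script, but the sample lines are made up to imitate the general shape of each10.txt and are an assumption about its layout.

```python
import re

# Same pattern as in the script
ICD_LINE = re.compile(r'^[ \*]*([A-Z][0-9\.]{1,7}) ?([^0-9].+)')

# Made-up lines (an assumption about the file layout)
sample_lines = [
    "Y89.9 Sequelae of unspecified external cause",
    "  *A00 Cholera",
    "Chapter I",
    "",
]

parsed = {}
for line in sample_lines:
    m = ICD_LINE.search(line)
    if m:
        # group(1) is the code, group(2) is the term
        parsed[m.group(1)] = m.group(2)

print(parsed)
```

Lines that do not start with a capital letter followed by digits or periods (after optional spaces and asterisks) are skipped, which is how headings and blank lines stay out of the dictionary.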

Print out all of the dictionary key-value pairs, with the keys sorted alphabetically, to the "each10.out" file. 

In [None]:
sort_list = sorted(dict_list)
for i in sort_list:
    # file=out_text sends the line to the output file instead of the screen
    print("%-8.08s %s" % (i, dictionary[i]), file=out_text)
out_text.close()
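
The format string and the `file` argument of `print` can be checked without touching the disk. In this sketch an `io.StringIO` buffer stands in for the output file, and the two-entry dictionary `pairs` is a hand-made sample. Both are assumptions made for illustration.

```python
import io

# In-memory stand-in for the output file, so nothing is written to disk
buffer = io.StringIO()

pairs = {
    "Y89": "Sequelae of other external causes",
    "Y89.0": "Sequelae of legal intervention",
}

for code in sorted(pairs):
    # %-8.08s: left-justify, pad to 8 characters, truncate beyond 8
    print("%-8.08s %s" % (code, pairs[code]), file=buffer)

output_text = buffer.getvalue()
print(output_text, end="")
```

If the file object is passed as the first positional argument instead of through `file=`, `print` writes the object's description to the screen and nothing reaches the file.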

## Analysis: Creating the ICD Dictionary

The output file, each10.out, contains about __ code-term pairs and has a length of about __ bytes. Its last few lines are shown here:

`Y88.2    Sequelae of adverse incidents associated with medical devices in diagnostic and therapeutic use
Y88.3    Sequelae of surgical and medical procedures as the cause of abnormal reaction of the patient, or of later
Y89      Sequelae of other external causes
Y89.0    Sequelae of legal intervention
Y89.1    Sequelae of war operations
Y89.9    Sequelae of unspecified external cause`
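
The number of pairs and the file size can be measured for any output file with a few lines of code. This is a minimal sketch, assuming one code-term pair per line. `summarize_output` is a hypothetical helper, and the demonstration runs on a small temporary file, not on each10.out itself.

```python
import os
import tempfile

def summarize_output(path):
    """Return (number of non-empty lines, file size in bytes)."""
    with open(path, "r") as handle:
        pair_count = sum(1 for line in handle if line.strip())
    return pair_count, os.path.getsize(path)

# Small stand-in file with two code-term lines
sample_lines = [
    "Y89      Sequelae of other external causes",
    "Y89.0    Sequelae of legal intervention",
]
with tempfile.NamedTemporaryFile("w", suffix=".out", delete=False,
                                 newline="\n") as tmp:
    for line in sample_lines:
        tmp.write(line + "\n")
    sample_path = tmp.name

pairs, size = summarize_output(sample_path)
os.remove(sample_path)  # clean up the temporary file
print(pairs, size)
```

Calling `summarize_output('./K11946_Files/each10.out')` after running the script would give the two figures for the real output file.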

# Building the ICD-O (Oncology) Dictionary

ICD-O is a specialized vocabulary created by the World Health Organization. ICD-O contains the dictionary of neoplasm codes and terms used by cancer registrars. It contains codes for 9,769 neoplasm terms and is freely available from SEER (Surveillance, Epidemiology, and End Results). The ICD-O file can be parsed into code-term pairs.

In [None]:
import re

f = open("./K11946_Files/ICDO3.TXT", "r")
codehash = {}
for line in f:
    # Four digits, a slash, one digit, one or more spaces, then the term
    linematch = re.search(r'([0-9]{4})\/([0-9]{1}) +(.+)$', line)
    if linematch:
        icdcode = linematch.group(1) + linematch.group(2)
        # Assign the captured term, minus any trailing whitespace
        term = linematch.group(3).rstrip()
        codehash[icdcode] = term
f.close()
# sorted() returns a new list, so its result must be kept
keylist = sorted(codehash.keys())
for item in keylist:
    print(item, codehash[item])

## Script Algorithm: Building the ICD-O (Oncology) Dictionary

Open the "ICDO3.TXT" file

In [None]:
import re

f = open("./K11946_Files/ICDO3.TXT", "r")
codehash = {}

Parse the "ICDO3.TXT" file, line by line. Each code consists of four digits, followed by a slash, followed by one digit; the code is followed by one or more spaces and then the term. Create a regular expression for the line, placing the five digits from the code into a key variable, and the term into a value variable, for a hash object (a Python dictionary).

In [None]:
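for line in f:
    linematch = re.search(r'([0-9]{4})\/([0-9]{1}) +(.+)$', line)
    if linematch:
        # Join the four digits and the single digit into a five-digit key
        icdcode = linematch.group(1) + linematch.group(2)
        # Assign the captured term, minus any trailing whitespace
        term = linematch.group(3).rstrip()
        codehash[icdcode] = term
f.close()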
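
The parsing step can be tried on a few lines without the ICD-O file. In this minimal sketch the pattern is the one from the script, while the sample lines are made up in the "dddd/d term" shape described above. They are an assumption about the layout, not a copy of the real file.

```python
import re

# Same pattern as in the script
ICDO_LINE = re.compile(r'([0-9]{4})\/([0-9]{1}) +(.+)$')

# Made-up lines, including trailing spaces and a line with no code
sample_lines = [
    "8000/3   Neoplasm, malignant  \n",
    "8010/2   Carcinoma in situ, NOS\n",
    "not a code line\n",
]

codehash = {}
for line in sample_lines:
    linematch = ICDO_LINE.search(line)
    if linematch:
        icdcode = linematch.group(1) + linematch.group(2)
        # rstrip() removes the trailing spaces captured by (.+)
        codehash[icdcode] = linematch.group(3).rstrip()

print(codehash)
```

Because the dictionary is keyed by code, a file that lists several terms under one code would keep only the last term read for that code.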

Sort the keys of the hash object, and print the key-value (code-term) pairs

In [None]:
# sorted() returns a new list, so its result must be kept
keylist = sorted(codehash.keys())
for item in keylist:
    print(item, codehash[item])
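
One detail of this step is easy to miss: `sorted()` returns a new list and leaves its argument untouched, so the result has to be assigned or iterated over directly. The sketch below illustrates this with a three-entry dictionary whose terms are placeholders (an assumption for illustration).

```python
# Keys deliberately inserted out of order; the terms are placeholders
codehash = {"80103": "term C", "80003": "term A", "80102": "term B"}

keylist = list(codehash.keys())
sorted(keylist)                  # result discarded; keylist is unchanged
unsorted_copy = list(keylist)

keylist = sorted(keylist)        # keep the returned, sorted list
print(unsorted_copy)
print(keylist)
```

The in-place alternative is `keylist.sort()`, which works only on a list and therefore not on the view object returned by `codehash.keys()`.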

## Analysis: Building the ICD-O (Oncology) Dictionary

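Here are a few of the code-term pairs printed for ICD-O. Note that every code in this listing carries the same term, "Sequelae of unspecified external cause". That string is the last ICD term left in the `term` variable by the previous script, which is what gets printed whenever `term` is not reassigned from `linematch.group(3)` inside the loop. Once it is reassigned, each code is paired with its own neoplasm term.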

`80003 Sequelae of unspecified external cause
80013 Sequelae of unspecified external cause
80023 Sequelae of unspecified external cause
80033 Sequelae of unspecified external cause
80043 Sequelae of unspecified external cause
80053 Sequelae of unspecified external cause
80102 Sequelae of unspecified external cause
80103 Sequelae of unspecified external cause`