
Welcome to chapter six of Methods in Medical Informatics! In this section, we will introduce the International Classification of Diseases (ICD). ICD is a nomenclature of the diseases occurring in humans, with each listed disease assigned a unique identifying code. The World Health Organization also produces a specialized cancer nomenclature, known as the ICD-O (ICD-Oncology). In this chapter, we will be using ICD and ICD-O in scripts. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# 6.1 Creating the ICD Dictionary

If we have a computer-parsable list of ICD codes, we can write a short program that assigns human-readable terms (full names of diseases) to the codes in the mortality file. An electronic version of the ICD is provided by the CDC under the filename "each10.txt". We will create a dictionary data object consisting of ICD codes (as dictionary keys) and their corresponding terms (as dictionary values).*

> This script will utilize the file [each10.txt](./K11946_Files/each10.txt). This is a text file which contains an electronic version of the ICD. Additional information [here](https://datamine.unc.edu/datafiles_yuchenh/)

**Description adapted from pages 99-100 of "Methods in Medical Informatics"*

In [2]:
import sys, os, re

linearray = []
dictionary = {}
code = ""
term = ""

# Read the whole ICD file into one string
in_text = open("./K11946_Files/each10.txt", "r", encoding="cp1252")
in_text_string = in_text.read()
in_text.close()

# Break the string into individual lines
linearray = in_text_string.split("\n")

# Capture the code and its term from each line that matches the pattern
for item in linearray:
    m = re.search(r'^[ \*]*([A-Z][0-9\.]{1,7}) ?([^0-9].+)', item)
    if m:
        code = m.group(1)
        term = m.group(2)
        dictionary[code] = term

out_text = open("./K11946_Files/each10.out", "w")

dict_list = dictionary.keys()
sort_list = sorted(dict_list)

# Print each pair, sorted by code, to the screen and to each10.out
for i in sort_list:
    print("%-8.08s %s" % (i, dictionary[i]))
    print("%-8.08s %s" % (i, dictionary[i]), file=out_text)
out_text.close()

A00      Cholera
A00.0    Cholera due to Vibrio cholerae 01, biovar cholerae
A00.1    Cholera due to Vibrio cholerae 01, biovar el tor
A00.9    Cholera, unspecified
A01      Typhoid and paratyphoid fevers
A01.0    Typhoid fever
A01.1    Paratyphoid fever A
A01.2    Paratyphoid fever B
A01.3    Paratyphoid fever C
A01.4    Paratyphoid fever, unspecified
A02      Other salmonella infections
A02.0    Salmonella gastroenteritis
A02.1    Salmonella septicemia
A02.2    Localized salmonella infections
A02.8    Other specified salmonella infections
A02.9    Salmonella infection, unspecified
A03      Shigellosis
A03.0    Shigellosis due to Shigella dysenteriae
A03.1    Shigellosis due to Shigella flexneri
A03.2    Shigellosis due to Shigella boydii
A03.3    Shigellosis due to Shigella sonnei
A03.8    Other shigellosis
A03.9    Shigellosis, unspecified
A04      Other bacterial intestinal infections
A04.0    Enteropathogenic Escherichia coli infection
A04.1    Enterotoxigenic Escherichia coli inf
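
Once the dictionary is built, mapping a code to its term is a single lookup. The sketch below uses a tiny hand-made dictionary with entries copied from the output above. The layout of the mortality file is not shown in this section, so the record codes are invented, the helper name `term_for` is hypothetical, and the fallback to the three-character category is only one assumed way to handle a subcode that is missing from the dictionary.

```python
# A few entries copied from the output above, standing in for the full dictionary
icd_terms = {
    "A00": "Cholera",
    "A00.0": "Cholera due to Vibrio cholerae 01, biovar cholerae",
    "A01.0": "Typhoid fever",
}


def term_for(code, terms):
    """Return the term for a code, falling back to its three-character category."""
    if code in terms:
        return terms[code]
    # Assumption: the first three characters name the parent category
    return terms.get(code[:3], "(unknown code)")


# Invented record codes: one exact hit, one fallback, one miss
for code in ["A00.0", "A00.7", "Z99.9"]:
    print(code, term_for(code, icd_terms))
```

Whether a fallback is appropriate depends on how the codes are used; for strict validation, an unknown code is probably better reported than silently mapped to its category.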

## Script Algorithm: Creating the ICD Dictionary

Open the each10.txt file*

In [3]:
import sys, os, re

linearray = []
dictionary = {}
code = ""
term = ""

in_text = open("./K11946_Files/each10.txt", "r", encoding="cp1252")


Put the entire file into a string variable

In [4]:
in_text_string = in_text.read()
in_text.close()

Split the string variable into individual lines at each newline character

In [5]:
linearray = in_text_string.split("\n")
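
A small sketch of what this split produces, using a made-up two-line string in place of the real file contents (the sample text is an assumption, not taken from each10.txt). When the text ends with a newline, `split("\n")` leaves an empty string at the end of the list, whereas `splitlines()` does not. The empty item is harmless here because the pattern in the next step simply fails to match it.

```python
# Hypothetical stand-in for the contents of each10.txt
sample_text = "A00 Cholera\nA00.0 Cholera due to Vibrio cholerae 01, biovar cholerae\n"

by_split = sample_text.split("\n")        # keeps an empty final item
by_splitlines = sample_text.splitlines()  # drops it

print(len(by_split), repr(by_split[-1]))
print(len(by_splitlines), repr(by_splitlines[-1]))
```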

For each split item, use a regular expression to capture the code and the term, then add the code (as the key) and the term (as the value) to the dictionary

In [6]:
for item in linearray:
    m = re.search(r'^[ \*]*([A-Z][0-9\.]{1,7}) ?([^0-9].+)', item)
    if m:
        code = m.group(1)
        term = m.group(2)
        dictionary[code] = term

out_text = open("./K11946_Files/each10.out", "w")
dict_list = dictionary.keys()
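
To see how the pattern behaves, here is a minimal sketch that applies it to a few invented lines. The sample lines are assumptions about what each10.txt might contain, not copies of the file; the variable names `samples` and `parsed` are illustrative.

```python
import re

# Same pattern as in the cell above: optional leading spaces or asterisks,
# a capital letter followed by 1 to 7 digits or periods, an optional space,
# then a term that starts with a non-digit character
pattern = re.compile(r'^[ \*]*([A-Z][0-9\.]{1,7}) ?([^0-9].+)')

# Hypothetical sample lines
samples = [
    "A00.0 Cholera due to Vibrio cholerae 01, biovar cholerae",
    "  *A01 Typhoid and paratyphoid fevers",
    "Tabular list of inclusions",  # no code, so it should be skipped
    "",                            # empty line, also skipped
]

parsed = {}
for line in samples:
    m = pattern.search(line)
    if m:
        parsed[m.group(1)] = m.group(2)

for code, term in parsed.items():
    print(code, "->", term)
```

Only the first two lines produce entries. The third starts with a capital letter but has no digit or period after it, so the code group cannot match.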


Print all of the dictionary key-value pairs, with the keys sorted alphabetically, to the screen and to the "each10.out" file.

In [7]:
sort_list = sorted(dict_list)

for i in sort_list:
    print("%-8.08s %s" % (i, dictionary[i]))
    print("%-8.08s %s" % (i, dictionary[i]), file=out_text) 
out_text.close()

A00      Cholera
A00.0    Cholera due to Vibrio cholerae 01, biovar cholerae
A00.1    Cholera due to Vibrio cholerae 01, biovar el tor
A00.9    Cholera, unspecified
A01      Typhoid and paratyphoid fevers
A01.0    Typhoid fever
A01.1    Paratyphoid fever A
A01.2    Paratyphoid fever B
A01.3    Paratyphoid fever C
A01.4    Paratyphoid fever, unspecified
A02      Other salmonella infections
A02.0    Salmonella gastroenteritis
A02.1    Salmonella septicemia
A02.2    Localized salmonella infections
A02.8    Other specified salmonella infections
A02.9    Salmonella infection, unspecified
A03      Shigellosis
A03.0    Shigellosis due to Shigella dysenteriae
A03.1    Shigellosis due to Shigella flexneri
A03.2    Shigellosis due to Shigella boydii
A03.3    Shigellosis due to Shigella sonnei
A03.8    Other shigellosis
A03.9    Shigellosis, unspecified
A04      Other bacterial intestinal infections
A04.0    Enteropathogenic Escherichia coli infection
A04.1    Enterotoxigenic Escherichia coli inf
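
The format string `%-8.08s` is what lines the terms up in a column: it left-justifies the code in a field eight characters wide and cuts off anything longer than eight characters. A small sketch using two codes from the output above and one deliberately overlong, made-up key:

```python
# Two codes from the output above plus one made-up, overlong key
keys = ["A00", "A00.0", "TOOLONGKEY"]

# '-' left-justifies, 8 is the minimum field width,
# and .08 caps the printed length at 8 characters
formatted = ["%-8.08s" % k for k in keys]

# Brackets make the padding visible
for text in formatted:
    print("[" + text + "]")
```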

**This section is adapted from section 6.1.1, "Script Algorithm", of page 100 from "Methods in Medical Informatics".*

## Analysis: Creating the ICD Dictionary

The output file, each10.out, contains 9272 code-term pairs and has a length of 431824 bytes.*

```
Y88.2    Sequelae of adverse incidents associated with medical devices in diagnostic and therapeutic use
Y88.3    Sequelae of surgical and medical procedures as the cause of abnormal reaction of the patient, or of later
Y89      Sequelae of other external causes
Y89.0    Sequelae of legal intervention
Y89.1    Sequelae of war operations
Y89.9    Sequelae of unspecified external cause
```

**This section is adapted from section 6.1.2, "Analysis", of pages 102-103 from "Methods in Medical Informatics".*
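
Both figures can be checked directly: the number of pairs is the length of the dictionary, and the byte count is available from `os.path.getsize`. The sketch below is self-contained, so it writes a two-entry stand-in dictionary to a temporary file rather than touching each10.out. On the real data one would use the actual dictionary and output path instead.

```python
import os
import tempfile

# Stand-in data; the real dictionary comes from parsing each10.txt
dictionary = {"A00": "Cholera", "A01.0": "Typhoid fever"}

with tempfile.TemporaryDirectory() as tmpdir:
    out_path = os.path.join(tmpdir, "each10.out")  # temporary path, not the real file
    # newline="\n" keeps line endings the same on every platform
    with open(out_path, "w", newline="\n") as out_text:
        for code in sorted(dictionary):
            print("%-8.08s %s" % (code, dictionary[code]), file=out_text)
    pair_count = len(dictionary)
    size_bytes = os.path.getsize(out_path)

print(pair_count, "code-term pairs,", size_bytes, "bytes")
```

The byte count depends on the encoding and line endings used when the file is written, so a figure obtained on one system may differ slightly from one obtained on another.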

# 6.2 Building the ICD-O (Oncology) Dictionary

ICD-O is a specialized vocabulary created by the World Health Organization. ICD-O contains the dictionary of neoplasm codes and terms used by cancer registrars. The ICD-O contains codes for 9,769 neoplasm terms, and is freely available from SEER (Surveillance, Epidemiology, and End Results). The ICD-O file can be parsed into code-term pairs.*

> This script will utilize the file [icdo3.txt](./K11946_Files/ICDO3.TXT). This is a text file which contains an electronic version of the ICD-O vocabulary. Additional information [here](https://datamine.unc.edu/datafiles_yuchenh/)


**Description adapted from pages 102-103 of "Methods in Medical Informatics".*

In [8]:
import sys, os, re
f = open("./K11946_Files/ICDO3.TXT", "r")
codehash = {}
for line in f:
    # Capture the four-digit code, the digit after the slash, and the term
    linematch = re.search(r'([0-9]{4})/([0-9]{1}) +(.+)$', line)
    if linematch:
        icdcode = linematch.group(1) + linematch.group(2)
        term = linematch.group(3)
        codehash[icdcode] = term
f.close()
keylist = codehash.keys()

# Print the code-term pairs, sorted by code
for item in sorted(keylist):
    print(item, codehash[item])

80000 Neoplasm, benign
80001 Neoplasm, uncertain whether benign or malignant
80003 Neoplasm, malignant
80010 Tumor cells, benign
80011 Tumor cells, uncertain whether benign or malignant
80013 Tumor cells, malignant
80023 Malignant tumor, small cell type
80033 Malignant tumor, giant cell type
80043 Malignant tumor, spindle cell type
80050 Clear cell tumor, NOS
80053 Malignant tumor, clear cell type
80100 Epithelial tumor, benign
80102 Carcinoma in situ, NOS
80103 Carcinoma, NOS
80113 Epithelioma, malignant
80123 Large cell carcinoma, NOS
80133 Large cell neuroendocrine carcinoma
80143 Large cell carcinoma with rhabdoid phenotype
80153 Glassy cell carcinoma
80203 Carcinoma, undifferentiated type, NOS
80213 Carcinoma, anaplastic type, NOS
80223 Pleomorphic carcinoma
80303 Giant cell and spindle cell carcinoma
80313 Giant cell carcinoma
80323 Spindle cell carcinoma
80333 Pseudosarcomatous carcinoma
80343 Polygonal cell carcinoma
80353 Carcinoma with osteoclast-like giant cells
80413 Small 

## Script Algorithm: Building the ICD-O (Oncology) Dictionary

Open the "icdo3.txt" file*

In [9]:
import sys, os, re
f = open("./K11946_Files/ICDO3.TXT", "r")
codehash = {}

Parse the "icdo3.txt" file, line by line. Each entry consists of a code (four digits, a slash, and one digit), followed by one or more spaces, followed by the term. Create a regular expression for the line, placing the five digits from the code into a key variable, and the term into a value variable, for a hash object (a Python dictionary).

In [10]:
for line in f:
    linematch = re.search(r'([0-9]{4})\/([0-9]{1}) +(.+)$', line)
    if (linematch):
        icdcode = linematch.group(1) + linematch.group(2)
        term = linematch.group(3)
        codehash[icdcode] = term
f.close()
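
A minimal sketch of the pattern applied to a few invented lines. The layout of ICDO3.TXT is assumed from the description above (four digits, a slash, one digit, spaces, then the term); the sample lines are illustrative and not copied from the file, and the name `codehash_demo` is hypothetical.

```python
import re

# Same pattern as in the cell above (the slash needs no escaping in Python)
pattern = re.compile(r'([0-9]{4})/([0-9]{1}) +(.+)$')

# Hypothetical lines, with trailing newlines as they would arrive from a file
samples = [
    "8000/3 Neoplasm, malignant\n",
    "8010/2   Carcinoma in situ, NOS\n",
    "MORPHOLOGY SECTION\n",  # made-up header line with no code
]

codehash_demo = {}
for line in samples:
    m = pattern.search(line)
    if m:
        # Join the four digits and the single digit into a five-digit key
        codehash_demo[m.group(1) + m.group(2)] = m.group(3)

for code in sorted(codehash_demo):
    print(code, codehash_demo[code])
```

Because `.` does not match a newline and `$` can match just before a trailing newline, the captured term does not include the line ending.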

Sort the keys of the hash object, and print each key (code) and value (term) pair.

In [11]:
keylist = codehash.keys()
for item in sorted(keylist):
    print(item, codehash[item])

80000 Neoplasm, benign
80001 Neoplasm, uncertain whether benign or malignant
80003 Neoplasm, malignant
80010 Tumor cells, benign
80011 Tumor cells, uncertain whether benign or malignant
80013 Tumor cells, malignant
80023 Malignant tumor, small cell type
80033 Malignant tumor, giant cell type
80043 Malignant tumor, spindle cell type
80050 Clear cell tumor, NOS
80053 Malignant tumor, clear cell type
80100 Epithelial tumor, benign
80102 Carcinoma in situ, NOS
80103 Carcinoma, NOS
80113 Epithelioma, malignant
80123 Large cell carcinoma, NOS
80133 Large cell neuroendocrine carcinoma
80143 Large cell carcinoma with rhabdoid phenotype
80153 Glassy cell carcinoma
80203 Carcinoma, undifferentiated type, NOS
80213 Carcinoma, anaplastic type, NOS
80223 Pleomorphic carcinoma
80303 Giant cell and spindle cell carcinoma
80313 Giant cell carcinoma
80323 Spindle cell carcinoma
80333 Pseudosarcomatous carcinoma
80343 Polygonal cell carcinoma
80353 Carcinoma with osteoclast-like giant cells
80413 Small 

**This section is adapted from section 6.2.1, "Script Algorithm", of page 103 from "Methods in Medical Informatics".*
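
Because the dictionary is keyed on the code, a later line with the same code overwrites the term stored by an earlier one. If the file lists more than one term under a single code (an assumption worth checking against the actual file), a dictionary of lists would keep all of them. A sketch with illustrative lines, where `terms_by_code` is a hypothetical name:

```python
import re
from collections import defaultdict

pattern = re.compile(r'([0-9]{4})/([0-9]{1}) +(.+)$')

# Illustrative lines; the second repeats the first code with another term
lines = [
    "8000/3 Neoplasm, malignant",
    "8000/3 Tumor, malignant, NOS",
    "8010/3 Carcinoma, NOS",
]

# Each code maps to a list of terms instead of a single term
terms_by_code = defaultdict(list)
for line in lines:
    m = pattern.search(line)
    if m:
        terms_by_code[m.group(1) + m.group(2)].append(m.group(3))

for code in sorted(terms_by_code):
    print(code, "; ".join(terms_by_code[code]))
```

Comparing the number of keys with the total number of stored terms is a quick way to tell whether the file contains repeated codes at all.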

## Analysis: Building the ICD-O (Oncology) Dictionary

Here are a few of the code-term pairs from ICD-O:*
```
80003 Neoplasm, malignant
80013 Tumor cells, malignant
80023 Malignant tumor, small cell type
80033 Malignant tumor, giant cell type
80043 Malignant tumor, spindle cell type
80053 Malignant tumor, clear cell type
80102 Carcinoma in situ, NOS
80103 Carcinoma, NOS
```
**This section is adapted from section 6.2.2, "Analysis", of pages 104-105 from "Methods in Medical Informatics".*
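
In the script's output, the fifth digit of each key is the digit that followed the slash in the source file. In ICD-O this digit is the behavior code (0 benign, 1 uncertain whether benign or malignant, 2 in situ, 3 malignant), which agrees with the terms printed earlier. The sketch below groups a handful of pairs taken from that output by behavior digit; the `behavior_labels` table and the variable names are illustrative.

```python
from collections import defaultdict

# A few pairs copied from the script output
codehash = {
    "80000": "Neoplasm, benign",
    "80001": "Neoplasm, uncertain whether benign or malignant",
    "80003": "Neoplasm, malignant",
    "80102": "Carcinoma in situ, NOS",
    "80103": "Carcinoma, NOS",
}

# Illustrative labels for the behavior digits that appear in this sample
behavior_labels = {"0": "benign", "1": "uncertain", "2": "in situ", "3": "malignant"}

# Group the terms by the last digit of the five-digit key
by_behavior = defaultdict(list)
for code, term in codehash.items():
    by_behavior[code[-1]].append(term)

for digit in sorted(by_behavior):
    print(digit, behavior_labels.get(digit, "other"), by_behavior[digit])
```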