Welcome to chapter fourteen of Methods in Medical Informatics! In this section, we will be exploring autocoding. Often in biomedical informatics, it is necessary to extract medical terms from text and attach a nomenclature concept code to each extracted term. A software product that computationally parses and codes medical text is called an autocoder or an automatic coder. Once text has been coded, concepts of interest can be retrieved regardless of the choice of words used to describe them. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# Neoplasm Autocoder

The script requires two external files. The first is neocl.xml, the Neoplasm Classification in XML
format, available for download as a gzipped file from
http://www.julesberman.info/neoclxml.gz
There are about 135,000 unique terms in the nomenclature. Each term is listed in a
consistent format, as shown in these two examples:
<name nci-code = "C3084300">polymorphous haemangioendothelioma</name>
<name nci-code = "C3085000">angioma</name>
Each term sits between a closing and an opening angle bracket:
>polymorphous haemangioendothelioma<
>angioma<
Each code is enclosed in double quotation marks:
"C3084300"
"C3085000"
Terms and their corresponding codes can be extracted with a simple regular expression.
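
As a minimal sketch of that extraction, the snippet below applies two regular expressions to the two example lines. It assumes the attribute values are delimited by straight double quotes, as XML requires. The function name `parse_name_line` and the pattern names are hypothetical and are not part of the book's script.

```python
import re

# Assumed patterns: a code is "C" plus seven digits inside straight quotes,
# and the term is the text between ">" and "</".
CODE_PATTERN = re.compile(r'"(C\d{7})"')
TERM_PATTERN = re.compile(r'>\s*(.+?)\s*</')

def parse_name_line(line):
    """Return (term, code) for one <name> line, or None if either part is missing."""
    code_match = CODE_PATTERN.search(line)
    term_match = TERM_PATTERN.search(line)
    if code_match is None or term_match is None:
        return None
    return term_match.group(1), code_match.group(1)

examples = [
    '<name nci-code = "C3084300">polymorphous haemangioendothelioma</name>',
    '<name nci-code = "C3085000">angioma</name>',
]
for example in examples:
    print(parse_name_line(example))
```

Returning `None` for lines that lack a code or a term mirrors the way the full script skips such lines.
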
The second file is the text to be autocoded. For this sample project, we will
parse tumorabs.txt, a file of 20,000 abstract titles extracted from PubMed
and available for download at
http://www.julesberman.info/book/tumorabs.txt
A portion of the file is shown in Figure 14.1.
The process of obtaining PubMed search result files is described in Chapter 9,
Section 9.1.

In [None]:
import re

# Build a dictionary that maps each neoplasm term to its code.
literalhash = {}
codematch = re.compile(r'"(C\d{7})"')
phrasematch = re.compile(r'"> ?(.+?) ?</')
with open("./K11946_Files/NEOCL.XML", "r") as text:
    for line in text:
        m = codematch.search(line)
        x = phrasematch.search(line)
        # Skip lines that lack either a code or a term.
        if not m or not x:
            continue
        literalhash[x.group(1)] = m.group(1)
print("Neoplasm code hash has been created. Autocoding will start now")

singular = re.compile('omas')
england = re.compile('tumo[u]?rs')
with open("./K11946_Files/tumorabs.txt", "r") as absfile, \
        open("tumorpy.out", "w") as outfile:
    for line in absfile:
        # Normalize plurals and British spelling before matching.
        sentence = singular.sub("oma", line)
        sentence = england.sub("tumor", sentence)
        sentence = sentence.rstrip()
        print("\nAbstract title..." + sentence + ".", file=outfile)
        sentence_array = sentence.split(" ")
        length = len(sentence_array)
        for i in range(length):
            # Test every phrase that starts at the current first word.
            for place_length in range(len(sentence_array)):
                last_element = place_length + 1
                phrase = ' '.join(sentence_array[0:last_element])
                if phrase in literalhash:
                    print("Neoplasm term..." + phrase + " " + literalhash[phrase], file=outfile)
            # Drop the first word so the next pass starts one word later.
            sentence_array.pop(0)
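
The two substitutions applied to each title before matching can be tried in isolation. The sketch below reuses the same two patterns on an invented sentence. The function name `normalize` and the test sentence are assumptions made for illustration.

```python
import re

# The same two substitutions the script applies before matching.
PLURAL_OMA = re.compile('omas')
TUMOR_SPELLING = re.compile('tumo[u]?rs')

def normalize(sentence):
    """Singularize "-omas" and map "tumors"/"tumours" to "tumor"."""
    sentence = PLURAL_OMA.sub("oma", sentence)
    sentence = TUMOR_SPELLING.sub("tumor", sentence)
    return sentence.rstrip()

print(normalize("treatment of lymphomas and brain tumours\n"))
```

As written, the second pattern only rewrites the plural forms, so a singular "tumour" passes through unchanged and would need its own dictionary entry to be matched.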

## Script Algorithm: Neoplasm Autocoder

1. Open the nomenclature file, which will be the source of coded terms to match
against the text that needs to be autocoded. For this example, we will use the
neoplasm taxonomy, but it could be any nomenclature that consists of codes
listed with their corresponding medical terms.
2. Create a dictionary object with keys corresponding to the terms (names of
neoplasms, in this case) of the medical nomenclature and values comprising
the corresponding codes for the terms.
3. Open the file to be parsed (tumorabs.txt).
4. Parse through the file, line by line, each line containing a sentence.
5. As each sentence is parsed, break the sentence into every possible contiguous sequence
of words (a phrase array). For example, “Everybody loves to eat pizza”
would be broken into an array containing the following items:
Everybody loves to eat pizza
Everybody loves to eat
Everybody loves to
Everybody loves
Everybody
loves to eat pizza
loves to eat
loves to
loves
to eat pizza
to eat
to
eat pizza
eat
pizza
6. For each item in the phrase array, determine whether the item matches a term
in the neoplasm dictionary object.
7. If there is a match, print the phrase and the corresponding code to an external
file.
8. The external file will consist of the lines from the text, followed by the phrases
from the lines that are neoplasm terms, along with their nomenclature codes.
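
Steps 5 through 7 can be sketched on a toy dictionary held in memory. The function names `phrase_array` and `autocode` are hypothetical. The two term/code pairs are taken from the sample output in the Analysis section, and the terms are lowercased here on the assumption that the nomenclature stores them that way.

```python
def phrase_array(sentence):
    """List every contiguous run of words, longest first for each start word."""
    words = sentence.split()
    phrases = []
    for start in range(len(words)):
        for end in range(len(words), start, -1):
            phrases.append(" ".join(words[start:end]))
    return phrases

def autocode(sentence, term_to_code):
    """Return (phrase, code) pairs for every phrase found in the dictionary."""
    return [(p, term_to_code[p]) for p in phrase_array(sentence) if p in term_to_code]

# Toy dictionary built from two term/code pairs in the sample output.
toy_terms = {"soft tissue sarcoma": "C9306000", "sarcoma": "C0000000"}

# A five-word sentence yields 5 + 4 + 3 + 2 + 1 = 15 phrases.
print(len(phrase_array("Everybody loves to eat pizza")))
print(autocode("the effect of an unplanned excision of a soft tissue sarcoma on prognosis", toy_terms))
```

This sketch lists phrases longest first to follow the order shown in step 5. The script above builds them shortest first, which finds the same matches.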

## Analysis: Neoplasm Autocoder

The output of the coder is virtually perfect. Browse through the abstract titles
in the sample output below and look for the named neoplasms in the abstract text. See if you can find
named neoplasms included in an abstract title that were excluded from the autocoded
terms that follow it.
Each abstract line parsed from the tumorabs.txt file is printed and then followed by
the list of autocoded terms extracted from the title.
Note that the terms coded “C0000000” are general neoplasm terms such as “tumor”
or “cancer” and not specific names of neoplasms, or they are names of neoplasms that
have not yet been classified within the neoplasm taxonomy. Also, the program codes
each occurrence of a neoplasm term, even if it is repeated.
Abstract title. Local versus diffuse recurrences of meningioma factors correlated
to the extent of the recurrence.
Neoplasm term. Meningioma C3230000.
Abstract title. The effect of an unplanned excision of a soft tissue sarcoma on
prognosis.
Neoplasm term. Soft tissue sarcoma C9306000.
Neoplasm term. Sarcoma C0000000.
Abstract title. Obstructive jaundice associated burkitt lymphoma mimicking
pancreatic carcinoma.
Neoplasm term. Jaundice C0000000.
Neoplasm term. Burkitt lymphoma C7188000.
Neoplasm term. Lymphoma C7065000.
Neoplasm term. Pancreatic carcinoma C3850000.
Neoplasm term. Carcinoma C0000000.
Abstract title. Efficacy of zoledronate in treating persisting isolated tumor cells
in bone marrow in patients with breast cancer a phase II pilot study.
Neoplasm term. Tumor C0000000.
Neoplasm term. Breast cancer C4872000.
Neoplasm term. Cancer C0000000.
Abstract title. Metastatic lymph node number in epithelial ovarian carcinoma
does it have any clinical significance.
Neoplasm term. Epithelial ovarian carcinoma C4908000.
Neoplasm term. Ovarian carcinoma C4908000.
Neoplasm term. Carcinoma C0000000.
Abstract title. Extended three-dimensional impedance map methods for identifying
ultrasonic scattering sites.
Abstract title. Aberrant expression of connexin 26 is associated with lung metastasis
of colorectal cancer.
Neoplasm term. Colorectal cancer C5105000.
Neoplasm term. Cancer C0000000.
Abstract title. Microrna expression profiles of esophageal cancer.
Neoplasm term. Esophageal cancer C3513000.
Neoplasm term. Cancer C0000000.
Abstract title. State and trait anxiety and depression in patients with primary
brain tumor before and after surgery 1 year longitudinal study.
Neoplasm term. Primary brain tumor C0000000.
Neoplasm term. Brain tumor C0000000.
Neoplasm term. Tumor C0000000.
Abstract title. Laparoscopic resection of large adrenal ganglioneuroma.
Neoplasm term. Ganglioneuroma C3049000.
Abstract title. Case records of the Massachusetts general hospital case 4 2008 a
33-year-old pregnant woman with swelling of the left breast and shortness
of breath.
Abstract title. Evaluation of higher order time domain perturbation theory of
photon diffusion on breast equivalent phantoms and optical mammograms.
Abstract title. Meningeal melanocytosis in a young patient an autopsy diagnosis.
Abstract title. Oncogenic hypophosphataemic osteomalacia biomarker roles of
fibroblast growth factor 23 1 25 dihydroxyvitamin d3 and lymphatic vessel
endothelial hyaluronan receptor 1.
Abstract title. Microrna expression profiles associated with prognosis and therapeutic
outcome in colon adenocarcinoma.
Neoplasm term. Colon adenocarcinoma C4349000.
Neoplasm term. Adenocarcinoma C0000000.
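
Because general terms share the code C0000000, a reader who wants only the specific neoplasm names can separate the two groups after coding. The following is a minimal sketch, not part of the book's script. The name `split_matches` is hypothetical, and the matches are copied (lowercased) from the Burkitt lymphoma title in the sample output.

```python
GENERAL_CODE = "C0000000"  # the code the text describes for general or unclassified terms

def split_matches(matches):
    """Partition (term, code) pairs into specific and general lists."""
    specific = [(t, c) for t, c in matches if c != GENERAL_CODE]
    general = [(t, c) for t, c in matches if c == GENERAL_CODE]
    return specific, general

# Matches copied from the Burkitt lymphoma title in the sample output.
matches = [
    ("jaundice", "C0000000"),
    ("burkitt lymphoma", "C7188000"),
    ("lymphoma", "C7065000"),
    ("pancreatic carcinoma", "C3850000"),
    ("carcinoma", "C0000000"),
]
specific, general = split_matches(matches)
print(specific)
print(general)
```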

# Recoding

The medical informatics literature has many descriptions of medical autocoders, but most of these descriptions do not report their speed. The autocoder included here is fast, coding 20,000 citations in about 20 seconds or less on a 2.5 GHz desktop CPU with 512 megabytes of RAM. This is a rate of about 100 kilobytes per second. By
the time this book is published, most readers will have computers that operate much
faster than mine, providing a much faster autocoding rate.
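
One way to report a coding rate for your own machine is to time the matching loop and divide the bytes processed by the elapsed time. The sketch below does this on a small in-memory list that stands in for tumorabs.txt. The function name `measure_rate` and the toy lines are assumptions, and the two term/code pairs come from the sample output.

```python
import time

def measure_rate(lines, term_to_code):
    """Autocode the lines and return (match_count, bytes_per_second)."""
    total_bytes = sum(len(line.encode("utf-8")) for line in lines)
    match_count = 0
    start = time.perf_counter()
    for line in lines:
        words = line.split()
        for begin in range(len(words)):
            for end in range(begin + 1, len(words) + 1):
                if " ".join(words[begin:end]) in term_to_code:
                    match_count += 1
    elapsed = time.perf_counter() - start
    # Guard against a zero reading on very small inputs.
    rate = total_bytes / elapsed if elapsed > 0 else float("inf")
    return match_count, rate

# Tiny stand-in for tumorabs.txt; the term/code pairs come from the sample output.
toy_terms = {"breast cancer": "C4872000", "cancer": "C0000000"}
toy_lines = ["microrna expression profiles of esophageal cancer",
             "swelling of the left breast in breast cancer"] * 1000

count, rate = measure_rate(toy_lines, toy_terms)
print(f"{count} matches at about {rate / 1000:.0f} kilobytes per second")
```

The figure printed will vary from run to run and from machine to machine, so it is only meaningful when reported together with the hardware used.
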
Why is it important to have a fast autocoder? Why can’t you load your parser with
a big file and let it run in the background, taking as long as it takes to finish?
There are three reasons why you absolutely must have a fast autocoder:
1. Medical files today are large. It is not unusual for a large medical center to
generate a terabyte of data each week. A slow autocoder could never keep up
with the volume of medical information that is produced each day.
2. Autocoders, and the nomenclatures they draw terms from, need to be modified
to accommodate unexpected oddities in the text that they parse (particularly
formatting oddities and the inclusion of idiosyncratic language to express
medical terms). The cycles of running a program, reviewing output, making
modifications in software or nomenclatures, and repeating the whole process
many times cannot be undertaken if you need to wait a week for your autocoding
software to parse your text.
3. Autocoding is as much about recoding as it is about the initial process of providing
nomenclature codes.
You need to recode (supply a new set of nomenclature codes for terms in your medical
text) whenever you want to change from one nomenclature to another.
You need to recode whenever you introduce a new version of a nomenclature.
You need to recode whenever you want to use a new coding algorithm (e.g., parsimonious
coding versus comprehensive coding, or linking a code to a particular extracted portion
of a report).
You need to recode whenever you add legacy data to your laboratory information
systems.
You need to recode whenever you merge different medical data sets (especially,
medical data sets that have been coded with different medical nomenclatures).
All of this recoding adds to the data burden placed on a medical autocoder.
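
With a dictionary-based autocoder, recoding amounts to running the same text through the same routine with a different term-to-code dictionary. The sketch below illustrates this under stated assumptions: `code_titles` is a hypothetical helper, version 1 uses a term/code pair from the sample output, and version 2 is an invented nomenclature whose codes are placeholders.

```python
def code_titles(titles, term_to_code):
    """Return {title: [(term, code), ...]} for one nomenclature."""
    coded = {}
    for title in titles:
        words = title.split()
        hits = []
        for begin in range(len(words)):
            for end in range(begin + 1, len(words) + 1):
                phrase = " ".join(words[begin:end])
                if phrase in term_to_code:
                    hits.append((phrase, term_to_code[phrase]))
        coded[title] = hits
    return coded

titles = ["laparoscopic resection of large adrenal ganglioneuroma"]

# Version 1 uses a term/code pair from the sample output.
nomenclature_v1 = {"ganglioneuroma": "C3049000"}
# Version 2 is HYPOTHETICAL: placeholder codes and an extra made-up entry.
nomenclature_v2 = {"ganglioneuroma": "X-0001", "adrenal ganglioneuroma": "X-0002"}

print(code_titles(titles, nomenclature_v1))
print(code_titles(titles, nomenclature_v2))
```

Since the whole corpus is parsed again for each new dictionary, every recoding pass costs as much as the original coding run.
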
It has been my personal observation that computational tasks that take a long time
(more than a few seconds) tend to be put on the back burner. The same observation
would apply to medical deidentification software (Chapter 15), software designed to
classify data into related groups (so-called intelligent computing) and software that
draws inferences from classes of data (so-called artificial intelligence). Smart informaticians
understand that program execution speed is always very important.