
Welcome to chapter fourteen of Methods in Medical Informatics! In this section, we will explore autocoding. In biomedical informatics, it is often necessary to extract medical terms from text and attach a nomenclature concept code to each extracted term. A software product that computationally parses and codes medical text is called an autocoder, or automatic coder. Once text has been coded, concepts of interest can be retrieved regardless of the words used to describe them. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# 14.1 Neoplasm Autocoder

The script below parses a large collection of journal article titles. The autocoder reads neoplasm terms and their codes from the Classification of Neoplasms XML file, "NEOCL.XML", and automatically codes any matching terms found in the titles.*

> This script uses the file [neocl.xml](./K11946_Files/NEOCL.XML), which is the Neoplasm Classification formatted as an XML document. Additional information is available [here](https://datamine.unc.edu/datafiles_yuchenh/).

**Description adapted from pages 209-210 of "Methods in Medical Informatics"*

In [None]:
import re

# Build a dictionary that maps each neoplasm term to its code.
text = open("./K11946_Files/NEOCL.XML", "r")
literalhash = {}
codematch = re.compile(r'(C\d{7})')           # a code is "C" plus seven digits
phrasematch = re.compile(r'(?<=>)(.+)(?=<)')  # the term sits between ">" and "<"
for line in text:
    m = codematch.search(line)
    if m:
        code = m.group()
    else:
        continue
    x = phrasematch.search(line)
    if x:
        phrase = x.group()
    else:
        continue
    literalhash[phrase] = code
text.close()
print("Neoplasm code hash has been created. Autocoding will start now")

# Autocode each title and write the results to tumorpy.out.
absfile = open("./K11946_Files/tumorabs.txt", "r")
outfile = open("tumorpy.out", "w")
singular = re.compile('omas')
england = re.compile('tumo[u]?rs')
for line in absfile:
    sentence = line
    sentence = singular.sub("oma", sentence)   # crude singularization
    sentence = england.sub("tumor", sentence)  # "tumors"/"tumours" -> "tumor"
    sentence = sentence.rstrip()
    print("\nAbstract title..." + sentence + ".", file=outfile)
    sentence_array = sentence.split(" ")
    length = len(sentence_array)
    for i in range(length):
        # Test every phrase that starts at the current first word.
        for place_length in range(len(sentence_array)):
            last_element = place_length + 1
            phrase = ' '.join(sentence_array[0:last_element])
            if phrase in literalhash:
                print("Neoplasm term..." + phrase + " " + literalhash[phrase], file=outfile)
        # Drop the first word so the next pass starts one word later.
        sentence_array.pop(0)
absfile.close()
outfile.close()

## Script Algorithm: Neoplasm Autocoder

Open the nomenclature file, which will be the source of coded terms to match
against the text that needs to be autocoded. For this example, we will use the
neoplasm taxonomy, but it could be any nomenclature that consists of codes
listed with their corresponding medical terms.*

In [None]:
import re
text = open("./K11946_Files/NEOCL.XML", "r")

Create a dictionary object with keys corresponding to the terms (names of
neoplasms, in this case) of the medical nomenclature and values comprising
the corresponding codes for the terms.

In [None]:
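literalhash = {}
# Raw strings keep the backslash in \d from being read as a string escape.
codematch = re.compile(r'(C\d{7})')           # a code is "C" plus seven digits
phrasematch = re.compile(r'(?<=>)(.+)(?=<)')  # the term sits between ">" and "<"
for line in text:
    m = codematch.search(line)
    if m:
        code = m.group()
    else:
        continue  # no code on this line
    x = phrasematch.search(line)
    if x:
        phrase = x.group()
    else:
        continue  # no term on this line
    literalhash[phrase] = code
text.close()
print("Neoplasm code hash has been created. Autocoding will start now")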
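To see what the two regular expressions capture, the sketch below applies equivalent patterns to a few sample lines. The sample lines are assumptions written to resemble an element that carries a seven-digit C code; the actual layout of NEOCL.XML may differ. The term "example carcinoma" and the code "C1234567" are invented for illustration.

```python
import re

# Patterns equivalent to those in the cell above, written as raw strings.
codematch = re.compile(r'(C\d{7})')
phrasematch = re.compile(r'(?<=>)(.+)(?=<)')

# Hypothetical sample lines; the real NEOCL.XML layout may differ.
sample_lines = [
    '<name nci-code = "C1234567">example carcinoma</name>',
    '<name nci-code = "C0000000">tumor</name>',
    '<!-- a line with no code or term -->',
]

sample_hash = {}
for line in sample_lines:
    m = codematch.search(line)
    x = phrasematch.search(line)
    # Keep only lines that yield both a code and a term.
    if m and x:
        sample_hash[x.group()] = m.group()

print(sample_hash)
```

Only the first two sample lines contribute entries; the third has neither a code nor a term between tags, so it is skipped.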

Open the file to be parsed (tumorabs.txt). Parse through the file line by line, each line containing a sentence. As each sentence is parsed, break it into every possible ordered sequence of consecutive words (a phrase array). For example, “Everybody loves to eat pizza” would be broken into an array containing the following items:

- Everybody loves to eat pizza
- Everybody loves to eat
- Everybody loves to
- Everybody loves
- Everybody
- loves to eat pizza
- loves to eat
- loves to
- loves
- to eat pizza
- to eat
- to
- eat pizza
- eat
- pizza

For each item in the phrase array, determine whether the item matches a term in the neoplasm dictionary object. If there is a match, print the phrase and the corresponding code to an external file. The external file will consist of the lines from the text, each followed by the phrases from that line that are neoplasm terms, along with their nomenclature codes.
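The phrase-array step can be tried in isolation. The following is a minimal sketch, not the notebook's own code: `phrase_array` is a hypothetical helper name, and it builds the phrases with index slices rather than by popping words off the front of the list.

```python
def phrase_array(sentence):
    """Return every run of consecutive words in the sentence."""
    words = sentence.split(" ")
    phrases = []
    for start in range(len(words)):
        # Grow the phrase one word at a time from the current start word.
        for end in range(start + 1, len(words) + 1):
            phrases.append(" ".join(words[start:end]))
    return phrases

phrases = phrase_array("Everybody loves to eat pizza")
print(len(phrases))
for p in phrases:
    print(p)
```

A sentence of n words yields n(n+1)/2 phrases, so the five-word example produces 15. The sketch emits the shortest phrase for each starting word first, which is the reverse of the order in the list above; the order does not affect which phrases match the dictionary.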

In [None]:
absfile = open("./K11946_Files/tumorabs.txt", "r")
outfile = open("tumorpy.out", "w")
singular = re.compile('omas')
england = re.compile('tumo[u]?rs')
for line in absfile:
    sentence = line
    sentence = singular.sub("oma", sentence)   # crude singularization
    sentence = england.sub("tumor", sentence)  # "tumors"/"tumours" -> "tumor"
    sentence = sentence.rstrip()
    print("\nAbstract title..." + sentence + ".", file=outfile)
    sentence_array = sentence.split(" ")
    length = len(sentence_array)
    for i in range(length):
        # Test every phrase that starts at the current first word.
        for place_length in range(len(sentence_array)):
            last_element = place_length + 1
            phrase = ' '.join(sentence_array[0:last_element])
            if phrase in literalhash:
                print("Neoplasm term..." + phrase + " " + literalhash[phrase], file=outfile)
        # Drop the first word so the next pass starts one word later.
        sentence_array.pop(0)
absfile.close()
outfile.close()

**This section is adapted from section 14.1.1, "Script Algorithm", of pages 210-212 from "Methods in Medical Informatics".*

## Analysis: Neoplasm Autocoder

Each abstract line parsed from the tumorabs.txt file is printed and then followed by the list of autocoded terms extracted from the title.
Note that the terms coded “C0000000” are general neoplasm terms such as “tumor” or “cancer” and not specific names of neoplasms, or they are names of neoplasms that have not yet been classified within the neoplasm taxonomy. Also, the program codes each occurrence of a neoplasm term, even if it is repeated.*

**This section is adapted from section 14.1.2, "Analysis", of pages 215-216 in "Methods in Medical Informatics".*
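The behavior described in the analysis can be reproduced on a single made-up title. This is a sketch under stated assumptions: `toy_hash` is an invented stand-in for the real dictionary, "C0000000" is the general-term code mentioned above, and "C1234567" is a placeholder code that does not come from the neoplasm taxonomy.

```python
from collections import Counter

# Toy nomenclature; terms and the code "C1234567" are invented for illustration.
toy_hash = {
    "tumor": "C0000000",
    "cancer": "C0000000",
    "example carcinoma": "C1234567",
}

def autocode(title, nomenclature):
    """Return (phrase, code) pairs for every matching word run, repeats included."""
    words = title.split(" ")
    hits = []
    for start in range(len(words)):
        for end in range(start + 1, len(words) + 1):
            phrase = " ".join(words[start:end])
            if phrase in nomenclature:
                hits.append((phrase, nomenclature[phrase]))
    return hits

hits = autocode("example carcinoma is a tumor and this tumor is a cancer", toy_hash)
for phrase, code in hits:
    print("Neoplasm term..." + phrase + " " + code)

# Tally how often each code was assigned.
code_counts = Counter(code for _, code in hits)
print(code_counts)
```

The word "tumor" appears twice in the title and is coded twice, so the general code "C0000000" is tallied three times (two for "tumor", one for "cancer") while the specific placeholder code is tallied once.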

# Recoding

The medical informatics literature contains many descriptions of medical autocoders, but most of them do not report speed. The autocoder included here is fast, coding 20,000 citations in about 20 seconds or less on a 2.5 GHz desktop CPU with 512 megabytes of RAM. By the time you read this, most computers will run much faster.
Why is it important to have a fast autocoder?

There are three reasons why:*
1. Medical files today are large. It is not unusual for a large medical center to generate a terabyte of data each week. A slow autocoder could never keep up with the volume of medical information that is produced each day.
2. Autocoders, and the nomenclatures they draw terms from, need to be modified to accommodate unexpected oddities in the text that they parse. The cycles of running a program, reviewing output, making modifications in software or nomenclatures, and repeating the whole process many times cannot be undertaken if you need to wait a week for your autocoding software to parse your text.
3. Autocoding is as much about recoding as it is about the initial process of providing nomenclature codes.
   - You need to recode (supply a new set of nomenclature codes for terms in your medical text) whenever you want to change from one nomenclature to another.
   - You need to recode whenever you introduce a new version of a nomenclature.
   - You need to recode whenever you want to use a new coding algorithm (e.g., parsimonious coding versus comprehensive, or linking code to a particular extracted portion of report).
   - You need to recode whenever you add legacy data to your laboratory information systems.
   - You need to recode whenever you merge different medical data sets (especially, medical data sets that have been coded with different medical nomenclatures).

   All of this recoding adds to the data burden placed on a medical autocoder.

Computational tasks that take a long time (more than a few seconds) are often put on the back burner. Smart informaticians understand that program execution speed is always very important.

**Description adapted from pages 216-217 of "Methods in Medical Informatics"*
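One way to check speed and to practice recoding is to time the same matching routine against two versions of a nomenclature. The sketch below is an illustration only: the nomenclature versions, terms, codes, and titles are all invented, and `autocode` is a hypothetical helper rather than the notebook's script. Timings depend on the machine, so none are quoted here.

```python
import time

def autocode(title, nomenclature):
    """Return the code of every run of consecutive words found in the nomenclature."""
    words = title.split(" ")
    return [
        nomenclature[" ".join(words[start:end])]
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
        if " ".join(words[start:end]) in nomenclature
    ]

# Invented data: two versions of a tiny nomenclature and 20,000 synthetic titles.
nomenclature_v1 = {"example carcinoma": "C1234567"}
nomenclature_v2 = {"example carcinoma": "C7654321", "example sarcoma": "C1111111"}
titles = ["a case of example carcinoma", "example sarcoma in adults"] * 10000

coded_counts = {}
for label, nomenclature in (("v1", nomenclature_v1), ("v2", nomenclature_v2)):
    start_time = time.perf_counter()
    codes = [autocode(title, nomenclature) for title in titles]
    elapsed = time.perf_counter() - start_time
    # Count titles that received at least one code.
    coded_counts[label] = sum(1 for c in codes if c)
    print(f"{label}: {coded_counts[label]} of {len(titles)} titles coded "
          f"in {elapsed:.2f} seconds")
```

Recoding here is nothing more than rerunning the same function with a different dictionary, which is why a fast autocoder makes repeated recoding practical.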