# Building the Doublet Hash

The utility of the doublet method is derived in part from the observation that most
medical terms are multiword terms. In the Neoplasm Classification, all but about
250 terms are multiword terms. Unlike single words, which often have several different
meanings, multiword medical terms, with very rare exceptions, have a single,
specific meaning.
In Chapter 9, Section 9.2, we learned that any multiword term can be constructed
by a concatenation of overlapping doublets.
For example:
Serous borderline ovarian tumor -> (“serous borderline,” “borderline ovarian,”
“ovarian tumor”)
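As a minimal sketch of this decomposition, a term can be split on whitespace and each word paired with its successor. The function name `doublets` is chosen here for illustration and is not part of the book's scripts.

```python
def doublets(term):
    """Return the overlapping word pairs (doublets) of a term, in order."""
    words = term.split()
    # Pair each word with the word that follows it.
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]


example = doublets("serous borderline ovarian tumor")
print(example)
```

A single-word term yields an empty list, because it contains no word pair.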
The doublets composing the multiword terms from the neoplasm nomenclature can
be combined into a list. The list of nomenclature doublets can be used to determine
whether a fragment of text is composed from doublets included in the list.
We would like to build a persistent data object (see Chapter 5, Section 5.2) containing
all of the doublet terms found in the Neoplasm Classification. We will use the
doublet list for a variety of informatics projects featured in this book.

In [None]:
import dbm, re

# Create two new, empty persistent databases ('n' overwrites any existing file).
doubhash = dbm.open('doub', 'n')        # keys: word doublets
literalhash = dbm.open('literal', 'n')  # keys: complete neoplasm terms
in_file = open('./K11946_Files/NEOCL.XML', "r")
singular = re.compile('omas')           # "sarcomas" -> "sarcoma"
england = re.compile('tumou?rs?')       # "tumour", "tumours", "tumors" -> "tumor"
for line in in_file:
    # The term is the text between '">' and the following '<'.
    neoplasm_match = re.search(r'\"\> ?(.+) ?\<', line)
    if not neoplasm_match:
        continue
    phrase = neoplasm_match.group(1)
    phrase = singular.sub("oma", phrase)
    phrase = england.sub("tumor", phrase)
    literalhash[phrase] = ""
    hoparray = phrase.split()
    for i in range(len(hoparray) - 1):
        doublet = hoparray[i] + " " + hoparray[i + 1]
        if doublet in doubhash:
            continue
        # Store only doublets containing two lowercase letter runs joined by one space.
        if re.search(r'[a-z]+ [a-z]+', doublet):
            doubhash[doublet] = ""
in_file.close()
doubhash.close()
literalhash.close()

## Script Algorithm: Building the Doublet Hash

1. Create two external database objects.
2. We will tie one external database object to a dictionary object composed of
key–value pairs, where the keys are the neoplasm terms in the Neoplasm
Classification, and the values are the empty string ("").
3. We will tie another external database object to a dictionary object composed
of key–value pairs, where the keys are the collection of word doublets from the
Neoplasm Classification, and the values are the empty string ("").
4. Open the Neoplasm Classification for parsing. The compressed file is available
for download at
http://www.julesberman.info/neoclxml.gz.
Make certain that the name and subdirectory location of the unzipped file match
the path in your script (the script above opens ./K11946_Files/NEOCL.XML).
5. Parse through the file, line by line.
6. Neoplasm terms are flanked by angle brackets and can be extracted with a
simple regular expression.
7. The neoplasm term is added as a new key to the dictionary object containing
the terms in the nomenclature.
8. The term is parsed into doublets by iterating through each word in the term
and appending the next consecutive word. Add each doublet term to the dictionary
object containing word doublets as keys.
9. After the entire nomenclature file is parsed, the two dictionary objects achieve
persistence through the external database objects to which they were tied.
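Steps 5 through 7 can be tried on a few lines before running the full file. The sketch below reuses the script's regular expressions; the sample lines are invented for illustration (including the placeholder code C0000000), and the real NEOCL.XML layout may differ in detail.

```python
import re

# Hypothetical sample lines, written to resemble term elements in the XML file.
sample_lines = [
    '<name nci-code="C3403000">teratoma</name>',
    '<name nci-code="C3752000">embryonal carcinomas</name>',
    '<name nci-code="C0000000">mixed tumours</name>',
    '<embryonic>',
]

singular = re.compile('omas')      # plural "-omas" becomes "-oma"
england = re.compile('tumou?rs?')  # British and plural spellings become "tumor"

terms = []
for line in sample_lines:
    # Capture the text between '">' and the following '<'.
    match = re.search(r'\"\> ?(.+) ?\<', line)
    if match:
        phrase = match.group(1)
        phrase = singular.sub("oma", phrase)
        phrase = england.sub("tumor", phrase)
        terms.append(phrase)

print(terms)
```

The last sample line has no attribute, so it contains no `">` sequence and is skipped.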

## Analysis: Building the Doublet Hash

We now have persistent data objects in external database files (i.e., the terms object
and the doublets object) that we can use in the next section.
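To see what persistence means here, the following sketch writes a few keys to a database file, closes it, and reopens it read-only. The file name `doub_demo` and the temporary directory are assumptions made for the demonstration, so that nothing is left behind on disk.

```python
import dbm
import os
import tempfile

with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "doub_demo")  # hypothetical file name

    # 'n' always creates a new, empty database.
    db = dbm.open(path, "n")
    db["ovarian tumor"] = ""
    db["borderline ovarian"] = ""
    db.close()

    # Reopen read-only (the default flag is 'r'); the keys are still there.
    db = dbm.open(path)
    found = "ovarian tumor" in db
    missing = "renal tumor" in db
    # dbm returns keys as bytes, so decode them for display.
    stored_keys = sorted(key.decode("utf-8") for key in db.keys())
    db.close()

print(found, missing, stored_keys)
```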

# Scanning the Literature for Candidate Terms

Here is a simple method for extracting candidate new terms from any large corpus
of text.
The method depends on the empirical observation that terms in a nomenclature are composed
almost exclusively of doublets found in other terms in the same nomenclature.
The current version of the neoplasm nomenclature contains 135,000 unique terms.
Of these terms, 126,756 terms are classified terms and are composed of at least two
words (i.e., are doublets or greater in length). Of these 126,756 terms, all but 6,308
(4.98%) are composed entirely of doublets extracted from other terms in the reference
nomenclature. This means that about 95% of the classified terms from the nomenclature are
formed entirely of doublet terms found in other terms from the same nomenclature.
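The kind of count reported here can be illustrated on a toy list. The sketch below is an assumption about how such a tally might be made, not the procedure used to produce the figures above; the function names and the four toy terms are invented.

```python
from collections import Counter


def doublets(term):
    """Return the overlapping word pairs of a term."""
    words = term.split()
    return [words[i] + " " + words[i + 1] for i in range(len(words) - 1)]


def covered_by_others(terms):
    """Return the multiword terms whose every doublet also occurs in another term."""
    # For each doublet, count how many distinct terms contain it.
    term_count = Counter()
    for term in terms:
        term_count.update(set(doublets(term)))
    covered = []
    for term in terms:
        pairs = doublets(term)
        # A count of 2 or more means some other term shares the doublet.
        if pairs and all(term_count[pair] >= 2 for pair in pairs):
            covered.append(term)
    return covered


toy_terms = [
    "serous ovarian tumor",
    "borderline ovarian tumor",
    "serous ovarian cyst",
    "renal cell carcinoma",
]
print(covered_by_others(toy_terms))
```

Only the first toy term qualifies: "serous ovarian" recurs in the third term and "ovarian tumor" in the second. The function assumes the input list holds no duplicate terms.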
The method compares connected word doublets in a medical text against a list of
word doublets found in a nomenclature. Text phrases composed of sequences of word
doublets found in an existing nomenclature are candidate new nomenclature terms.
This general method can be used with any text and any existing nomenclature. This
method permits curators to continually enhance their nomenclatures with new terms,
an essential activity needed to ensure the proper coding and annotation of biomedical
data.

In [None]:
import dbm, re

doubhash = dbm.open('doub')        # doublets built by the previous script (read-only)
literalhash = dbm.open('literal')  # complete terms already in the nomenclature
newhash = {}                       # candidate terms already reported
in_file = open('./K11946_Files/cancer_gene_titles.txt', 'r')
count = 0
singular = re.compile('omas')
england = re.compile('tumou?rs?')
trim_words = ("the", "in", "of", "and", "with", "from", "a")
for line in in_file:
    bigline = line.rstrip(" \n")
    bigline = singular.sub("oma", bigline)
    bigline = england.sub("tumor", bigline)
    englishline = ""
    hoparray = bigline.split()
    # The padding entry yields a final, unmatched doublet that flushes the last sequence.
    hoparray.append(" ")
    for i in range(len(hoparray) - 1):
        doublet = hoparray[i] + " " + hoparray[i + 1]
        if doublet in doubhash:
            # Start a new sequence or extend the current one by one word.
            if englishline != "":
                englishline = englishline + " " + hoparray[i + 1]
            else:
                englishline = doublet
        elif englishline != "":
            # The sequence is interrupted: trim it and report it if it is new.
            # englishline is not cleared here, so a later match in the same
            # title is appended to the same sequence.
            englishline = englishline.strip()
            for word in trim_words:
                englishline = re.sub('^' + word + ' ', "", englishline)
            for word in trim_words:
                englishline = re.sub(' ' + word + '$', "", englishline)
            if englishline in literalhash or englishline in newhash:
                continue
            # Require at least three words (a lowercase word with a space on each side).
            if re.search(r' [a-z]+ ', englishline):
                count = count + 1
                print(str(count) + " " + englishline)
                newhash[englishline] = ""
in_file.close()
doubhash.close()
literalhash.close()

## Script Algorithm: Scanning the Literature for Candidate Terms

1. Collect all the doublets that occur in the entire nomenclature (i.e., use the
database object created in Section 11.1).
2. Parse text (in this case individual abstract titles) into an ordered array of
overlapping doublets (as per the example shown for the text string, “serous
borderline ovarian tumor”).
The text file that we use is cancer_gene_titles.txt (1,752,432 bytes), created
in Chapter 9, Section 9.1. It contains 28,102 titles related to the topic of genes
and cancer or tumors. It is available for download at
http://www.julesberman.info/book/cancer_gene_titles.txt.
Alternatively, you can create your own file of titles by downloading a
PubMed search on a topic of your own interest and collecting the titles, using
the script provided in the previous section.
3. Compare each consecutive text doublet against the array of doublets from
the nomenclature to determine whether the doublet exists somewhere in the
nomenclature.
4. If the doublet from the text does not exist in the nomenclature, it can be
deleted. If it exists in the nomenclature, it is concatenated with the following
doublet if the following doublet exists in the nomenclature. Otherwise, it is
deleted. This process continues, concatenating doublets that exist somewhere
in the nomenclature. Extraneous leading and trailing words (the, in, of, and,
with, from, a) are automatically deleted from the final
concatenated sequence. Final concatenated sequences of two or more consecutive
doublets that match to doublets from the nomenclature are saved as
candidate terms.
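The steps above can be sketched on a single title with an in-memory set of doublets. This is a simplified variant written for illustration, not the book's script: it clears the current sequence whenever a doublet is missing from the list, trims repeated leading and trailing words, and tests length by counting words. The function names, the toy doublet set, and the sample title are all assumptions.

```python
STOP_WORDS = ("the", "in", "of", "and", "with", "from", "a")


def trim(phrase):
    """Strip leading and trailing stop words from a phrase."""
    words = phrase.split()
    while words and words[0] in STOP_WORDS:
        words.pop(0)
    while words and words[-1] in STOP_WORDS:
        words.pop()
    return " ".join(words)


def candidate_terms(title, known_doublets, known_terms=()):
    """Return runs of known doublets (three or more words) found in a title."""
    words = title.split()
    candidates = []
    run = []  # words of the current run of known doublets
    # Pairing the last word with None forces the final run to be flushed.
    for first, second in zip(words, words[1:] + [None]):
        if second is not None and first + " " + second in known_doublets:
            if not run:
                run = [first]
            run.append(second)
            continue
        if run:
            phrase = trim(" ".join(run))
            is_new = phrase not in known_terms and phrase not in candidates
            # Two or more consecutive doublets means at least three words.
            if len(phrase.split()) >= 3 and is_new:
                candidates.append(phrase)
        run = []
    return candidates


toy_doublets = {
    "extrarenal rhabdoid", "rhabdoid tumor", "tumor of", "of the", "the kidney",
}
title = "a case of extrarenal rhabdoid tumor of the kidney in an adult"
print(candidate_terms(title, toy_doublets))
```

Passing the nomenclature's own terms as `known_terms` suppresses phrases that are already present, which mirrors the check against the terms database.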

## Analysis: Scanning the Literature for Candidate Terms

Parsing the file cancer_gene_titles.txt, we found about 4,100 new candidate neoplasm
terms. Here are some final terms from the output list:
intraneural perineurioma of the oral mucosa
due to promyelocytic leukemia
spinal cord primary extragonadal
spinal cord primary extragonadal sac tumor
epithelioid and spindle cell haemangioma of bone
cervical malformation neurofibromatosis type 1
osteoblastoma of the scapula
ameloblastic carcinoma in
pancreatic serous cystadenoma endocrine tumor
extrarenal rhabdoid tumor of the cervical spine
diffuse type cell tumor of the subcutaneous
ewing sarcoma neuroectodermal tumor of the kidney
low grade fibromyxoid sarcoma of the colon
inflammatory myofibroblastic tumor of the tongue
superficial angiomyxoma the floor of the mouth
young adult with acute lymphoblastic leukemia
burkitt lymphoma in pediatric
peripheral primitive neuroectodermal tumor of the maxilla
anaplastic large cell lymphoma of bone
A cursory examination of this small portion of the 4,077 returned candidate terms
indicates that some of the terms seem to be legitimate names of neoplasms, which
should be added to our neoplasm vocabulary:
intraneural perineurioma of the oral mucosa
epithelioid and spindle cell haemangioma of bone
osteoblastoma of the scapula
pancreatic serous cystadenoma endocrine tumor
extrarenal rhabdoid tumor of the cervical spine
low grade fibromyxoid sarcoma of the colon
inflammatory myofibroblastic tumor of the tongue
peripheral primitive neuroectodermal tumor of the maxilla
anaplastic large cell lymphoma of bone
The majority of terms are phrases that happen to consist of doublets from our nomenclature,
but do not rise to the level of a new neoplasm term:
due to promyelocytic leukemia
spinal cord primary extragonadal
spinal cord primary extragonadal sac tumor
cervical malformation neurofibromatosis type 1
ameloblastic carcinoma in
diffuse type cell tumor of the subcutaneous
ewing sarcoma neuroectodermal tumor of the kidney
young adult with acute lymphoblastic leukemia
burkitt lymphoma in pediatric
There was one term that seems to be a poorly worded representation of a proper neoplasm’s
name:
superficial angiomyxoma the floor of the mouth
It should be
superficial angiomyxoma of the floor of the mouth
The original file of abstracts that contained the words cancer and gene exceeded
213 megabytes (MB) in length. The perfect curator would have read each abstract,
writing down the names of neoplasms that were not contained in the nomenclature.
The modern curator had the option of extracting the titles from the articles, and parsing
through the titles, extracting about 4,100 candidate terms, and then examining
the candidate terms to find likely new terms for the nomenclature. The semiautomated
process takes about one-half hour and provides hundreds of new terms that can be
added to the nomenclature.

# Adding Terms to the Neoplasm Classification

One of the most common tasks in informatics is the preparation of a subtraction list
(items present in one list and absent from another).
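A subtraction list can be sketched in a few lines with a set lookup. The function name and the two toy lists below are assumptions for illustration; the sketch also drops repeated candidates, which a curator would normally want.

```python
def subtraction_list(candidates, reference):
    """Return the candidates absent from reference, in order, without repeats."""
    reference_set = set(reference)  # set membership tests are fast
    seen = set()
    result = []
    for item in candidates:
        if item in reference_set or item in seen:
            continue
        seen.add(item)
        result.append(item)
    return result


# Toy reference list and toy candidates.
nomenclature = ["prostate cancer", "adenocarcinoma of prostate"]
candidates = [
    "prostate cancer",
    "spiradenocylindroma",
    "early onset cancer",
    "spiradenocylindroma",
]
new_terms = subtraction_list(candidates, nomenclature)
print(new_terms)
```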
Curators need to prepare a subtraction list whenever they want to add terms to a
preexisting nomenclature. The list of candidate terms must be checked against the list
of terms found in the nomenclature, with removal of redundant terms in the new list.
We can use the Neoplasm Classification as a sample nomenclature. We will use the
file neocl.lst (available at http://www.julesberman.info/book/neocl/lst), which contains
the following list of candidate terms:
prostate cancer
adenocarcinoma of prostate
spiradenocylindroma of the kidney
spiradenocylindroma
pleomorphic myxoid liposarcoma
spindle cell myxoid liposarcoma
matrix producing carcinoma of breast
matrix producing carcinoma of the breast
dini of breast
precancer flat epithelial atypia
matrix-producing carcinoma of the breast
early onset cancer
early-onset neoplasm
early-onset neoplasia
carcinoma of the bellini collecting duct
adenocarcinoma of the prostate
We need to know which terms, among the candidate terms, are already included in
the Neoplasm Classification.

In [None]:
import re

vocab_in = open('./K11946_Files/NEOCL.XML', "r")
term_hash = {}  # keys: terms already in the nomenclature
for line in vocab_in:
    # Only lines that carry a concept code hold a term.
    if not re.search(r'C[0-9]{7}', line):
        continue
    line_match = re.search(r'\"\> ?(.+) ?\<\/', line)
    if line_match:
        term_hash[line_match.group(1)] = ""
vocab_in.close()
candidate_file = open('./K11946_Files/neocl.lst', "r")
out_file = open("new.out", "w")
for line in candidate_file:
    line = line.rstrip("\n")
    if line == "":
        continue
    if line in term_hash:
        print(line + " already exists")
    else:
        # Terms not yet in the nomenclature go to the output file.
        print(line, file=out_file)
candidate_file.close()
out_file.close()

## Script Algorithm: Adding Terms to the Neoplasm Classification

1. Open the Neoplasm Classification file.
2. Parse through the file, collecting every term that appears on a line with a
concept code, and assigning each term as a key (with an empty value) in a
dictionary object.
3. Open the file containing the list of candidate terms to be added to the
Neoplasm Classification.
4. Parse each term from the list, checking to see if it is already contained as a key
in the dictionary object.
5. For each term, if the term does not already exist as a key in the dictionary
object, print it to an external file.
6. After the script executes, you have a new file containing terms that can be
added to the Neoplasm Classification.

## Analysis: Adding Terms to the Neoplasm Classification

The script splits the output into the set of terms already contained in the Neoplasm
Classification, displayed on the computer monitor:
prostate cancer already exists
adenocarcinoma of prostate already exists
spiradenocylindroma of the kidney already exists
matrix producing carcinoma of breast already exists
matrix producing carcinoma of the breast already exists
matrix-producing carcinoma of the breast already exists
adenocarcinoma of the prostate already exists
And an output file, containing the list of terms that are not already included in the
Neoplasm Classification:
spiradenocylindroma
pleomorphic myxoid liposarcoma
spindle cell myxoid liposarcoma
dini of breast
precancer flat epithelial atypia
early onset cancer
early-onset neoplasm
early-onset neoplasia
carcinoma of the bellini collecting duct

# Determining the Lineage of Every Neoplasm Concept

Biological classifications drive down the complexity of nomenclatures by assigning
every term to a class of objects that contain similar features, inherited from a lineage
of ancestral objects. We have seen, in the prior chapter, that knowing the lineages of
organisms can lead to treatments for newly encountered pathogens. Similarly, knowing
the lineage of neoplasms may help us find the tumors most likely to respond, as
a biological class, to molecular-targeted cancer treatments. Tumor lineage is one of
the key concepts discussed in my book, Neoplasms: Principles of
Development and Diversity (Jones & Bartlett Publishers, 2009).
The Neoplasm Classification contains about 135,000 names of neoplasms, organized
under about 6,000 concepts. A concept is the collection of synonyms for a specific
type of neoplasm. Every neoplasm term and concept can be assigned a unique
position within a simple class hierarchy, consisting of several dozen ancestral classes
(Figure 11.1).
The Neoplasm Classification is packaged as an XML (eXtensible Markup Language)
file. The terms in the nomenclature are marked up with tags that assign each term
a code number. Each term in the Neoplasm Classification
is nested under another element that names a class of neoplasms. Each named class of
neoplasms is nested under the element for its parent class, and this nesting continues
up the classification hierarchy.
XML is a markup language created for the Internet, and data delivered
in XML files permits us to search for related information located anywhere on the
Internet. In Chapter 18, we will describe XML in much more detail. For now,
we will take advantage of language-specific modules designed to parse XML, and we
will determine the full neoplasm lineage for every term contained in the Neoplasm
Classification. If you are unfamiliar with XML, you can skip this section of the chapter
and come back to it after reading Chapter 18.

In [None]:
import xml.parsers.expat
import re

parsefile = open('./K11946_Files/NEOCL.XML', "r")
filestring = parsefile.read()
parsefile.close()
lastname = ""   # lineage string, nearest ancestor first: "class>parent>...>"
code = ""       # code of the term currently being read
count = 0
text = ""       # most recent non-blank character data

def start_element(name, attrs):
    global lastname, code
    if "nci-code" in attrs:
        code = attrs["nci-code"]
    else:
        # A class element: prepend its name to the lineage.
        lastname = name + ">" + lastname

def end_element(name):
    global count, text, lastname
    if name == "name":
        count = count + 1
        print(str(count) + "|" + text + "|" + code + "|" + lastname)
        text = ""
    # Leaving a class element: drop its name from the front of the lineage.
    lastname = re.sub('^' + re.escape(name) + '>', '', lastname)

def char_data(data):
    global text
    # Ignore the whitespace between elements.
    if data.strip():
        text = data.strip()

p = xml.parsers.expat.ParserCreate()
p.StartElementHandler = start_element
p.EndElementHandler = end_element
p.CharacterDataHandler = char_data
p.Parse(filestring, True)

## Script Algorithm: Determining the Lineage of Every Neoplasm Concept

1. Call the XML parser module into your script.
2. Define subroutines that process XML information for specific events that
occur as the XML file is parsed. These events happen whenever the parser
encounters the start of an XML element, the end of an
XML element, or the character data enclosed by an element.
3. Provide the parser object with the contents of the XML file that you would like
to parse. In this case, it is the neocl.xml file.
4. When an element is encountered, the parser passes the name of the element
and any attributes within the element (in this case, the code number for the
term) to a list of variables. When the data contents of the element are encountered,
the parser passes the data to a variable. In this case, the data associated
with an element is the neoplasm term.
5. As the parser works its way down the hierarchy, it concatenates the names of
the ancestors into a string. When it finally encounters the lowest element in the
hierarchy, it concatenates the data (the name of the term) and the attribute
(the code for the term), appends the hierarchical list of elements (ancestors) to
the string, and prints it (redirect the output to a file to save it). When it backs
up through the hierarchy (when it moves through different class lineages), it
truncates the previously built string of concatenated classes to exclude
nonancestral classes.
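The event-driven steps above can be tried on a miniature document. The sketch below keeps the open class elements on a list used as a stack instead of editing a string, which is a design choice made here for clarity and not the approach of the script. The toy XML is hypothetical: its element names and codes are borrowed from the sample output, but the real file is far larger and more deeply nested.

```python
import xml.parsers.expat

# Hypothetical miniature document for demonstration only.
toy_xml = (
    "<tumor_classification><neoplasms><embryonic>"
    '<name nci-code="C3403000">teratoma</name>'
    '<name nci-code="C3752000">embryonal carcinoma</name>'
    "</embryonic></neoplasms></tumor_classification>"
)

ancestors = []  # stack of open class elements, innermost last
rows = []       # one "term|code|lineage" string per term
state = {"code": "", "text": ""}


def start_element(name, attrs):
    if name == "name":
        # A term element: remember its code and start collecting its text.
        state["code"] = attrs.get("nci-code", "")
        state["text"] = ""
    else:
        ancestors.append(name)


def char_data(data):
    # expat may deliver text in several pieces, so accumulate it.
    state["text"] += data


def end_element(name):
    if name == "name":
        lineage = ">".join(reversed(ancestors)) + ">"
        rows.append(state["text"].strip() + "|" + state["code"] + "|" + lineage)
    else:
        ancestors.pop()


parser = xml.parsers.expat.ParserCreate()
parser.StartElementHandler = start_element
parser.EndElementHandler = end_element
parser.CharacterDataHandler = char_data
parser.Parse(toy_xml, True)

for row in rows:
    print(row)
```

Because each class element is pushed on entry and popped on exit, the stack always holds exactly the ancestors of the element being read.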

## Analysis: Determining the Lineage of Every Neoplasm Concept

The neocl.xml file is over 10 MB in length. It takes several seconds, on most computers
(with 2–3 GHz CPUs) to run this script, producing an output file that exceeds
17 MB in length. Here are a few lines of the output file:

1|teratoma|C3403000|totipotent_or_multipotent_differentiating>
primitive_differentiating>primitive>embryonic>neoplasms>
tumor_classification>
2|embryonal ca|C3752000|totipotent_or_multipotent_differentiating>
primitive_differentiating>primitive>embryonic>neoplasms>
tumor_classification>
3|embryonal cancer|C3752000|totipotent_or_multipotent_differentiating>
primitive_differentiating>primitive>embryonic>neoplasms>
tumor_classification>
4|embryonal carcinoma|C3752000|totipotent_or_multipotent_differentiating>
primitive_differentiating>primitive>embryonic>neoplasms>
tumor_classification>
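Each output record can be split back into its parts for later analysis. The sketch below assumes that a record occupies a single line in the actual output (the sample above is wrapped for display) and that terms contain no "|" character; the function name is invented for illustration.

```python
def parse_lineage_line(line):
    """Split one 'count|term|code|lineage' record into its parts."""
    count, term, code, lineage = line.rstrip("\n").split("|")
    # The lineage ends with '>', so drop the empty piece left by splitting.
    classes = [name for name in lineage.split(">") if name]
    return int(count), term, code, classes


record = (
    "1|teratoma|C3403000|totipotent_or_multipotent_differentiating>"
    "primitive_differentiating>primitive>embryonic>neoplasms>"
    "tumor_classification>"
)
count, term, code, classes = parse_lineage_line(record)
print(term, code, classes[0], classes[-1])
```

The first class in the list is the immediate parent of the term, and the last is the root of the classification.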

Because we know the structure of the Neoplasm Classification file, we could have
written a parsing script without using an external XML parser, if we had so chosen.
The script would have been similar to the script that we used to find the lineage of
organisms from the Taxonomy.dat file (Chapter 10). However, because the neocl.xml
file is created as an XML file, it is better to use the readily available XML parsing
module. Doing so shortens our script and, if you do much work with XML, makes it easier
to read. Once you have learned to parse XML files, you will be able to write scripts
that collect, transform, and analyze data from multiple, different XML files, collected
from anywhere on the Internet.