Welcome to chapter nine of Methods in Medical Informatics! In this section, we will be exploring PubMed, the U.S. National Library of Medicine's public search engine for about 19 million citations from the medical literature. In this chapter, we will work through scripts that let us explore PubMed further. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# Building a Large Text Corpus of Biomedical Information

It is remarkably easy to create a large public domain text corpus for almost any medical specialty. All you need to do is enter a PubMed query and save the results to a file on your computer's hard disk. The script below extracts the titles from one such download, cancer_citations.txt.

```python
import re

# A title starts at the "TI - " field designator and may continue on the
# following lines, which PubMed download files indent with spaces.
title_pattern = re.compile(r'^TI *- (.*(?:\n +\S.*)*)', re.MULTILINE)

def records(in_text):
    """Yield one record at a time, using "PMID- " as the record separator."""
    clump = ""
    for line in in_text:
        if line.startswith("PMID- ") and clump:
            yield clump
            clump = ""
        clump = clump + line
    if clump:
        yield clump

def clean_title(record):
    """Return the cleaned title of one record, or "" if it has none."""
    title_match = title_pattern.search(record)
    if not title_match:
        return ""
    title = title_match.group(1).lower()
    title = re.sub(r"'s\b", "", title)         # possessive markers
    title = re.sub(r'\W', " ", title)          # punctuation and line breaks
    title = re.sub(r'omas\b', "oma", title)    # plural tumor names
    title = re.sub(r'tumour', "tumor", title)  # British spelling
    title = re.sub(r' +', " ", title).strip()
    # keep the title only if it contains alphabetic text
    return title if re.search(r'[a-z]', title) else ""

# Both files are opened as UTF-8 so that titles containing characters
# such as Greek letters can be written without an encoding error.
with open("cancer_citations.txt", "r", encoding="utf-8") as in_text, \
        open("cancer_gene_titles.txt", "w", encoding="utf-8") as out_text:
    for record in records(in_text):
        title = clean_title(record)
        if title:
            print(title)
            out_text.write(title + "\n")  # one title per line
```

The first titles printed (long titles are cut short here):

```
the nf1 somatic mutational landscape in sporadic human cancers
ini1 deficient tumors diagnostic features and molecular genetics
clinical actionability of molecular targets in endometrial cancer
biology and management of undifferentiated pleomorphic sarcoma myxofibrosarcoma
novel noninvasive diagnostics
restoration of p53 function leads to tumor regression in vivo
pulmonary sarcomatoid carcinoma a review
sarcomatoid carcinoma of the gallbladder clinicopathologic characteristics
methylation based classification of benign and malignant peripheral nerve sheath
genetic aberrations in small b cell lymphoma and leukemias molecular pathology
extramedullary leukemia behaving as solid cancer clinical histologic and genetic
acute lymphoblastic leukemia and lymphoma in the context of constitutional mismatch
tumor targeting by fusobacterium nucleatum a pilot study and future perspectives
recent concepts of ovarian carcinogenesis type i and type ii
```

## Script Algorithm: Building a Large Text Corpus of Biomedical Information

1. Open the PubMed download file.
2. Open an output file.
3. PubMed download files contain records that consistently begin with “PMID- ”. Parse through the PubMed download file, record by record, using “PMID- ” as the record separator.
4. Within the record, the title field is preceded by “TI - ”, and the title ends with a newline character followed by another field designator, such as the abstract field designator, “AB - ”. For example: “TI - A Wnt Survival Guide: From Flies to Human Disease.” followed by “AB - It has been two decades since investigators discovered the…”. From each record, extract the text that lies between the title field designator and the next field designator.
5. Convert the title to lowercase.
6. Clean the title line by removing nonalphanumeric characters, extra spaces, possessive markers (“’s”), and the plural forms of tumor names.
7. Write the titles to an external output file, one title per line.
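
As a minimal sketch of steps 3 and 4, the snippet below works on a short in-memory sample instead of a real download. The two sample records, the `extract_titles` helper, and the assumption that continuation lines of a title begin with spaces are illustrative and are not taken from the book.

```python
import re

# Hypothetical two-record sample in the layout described above
# (an assumption for illustration, not real PubMed output).
sample = (
    "PMID- 1\n"
    "TI  - A Wnt Survival Guide: From Flies to\n"
    "      Human Disease.\n"
    "AB  - It has been two decades since investigators discovered the\n"
    "\n"
    "PMID- 2\n"
    "TI  - Pulmonary sarcomatoid carcinoma: a review.\n"
    "AB  - Placeholder abstract text.\n"
)

# The title runs from "TI - " up to the next field designator;
# continuation lines are assumed to start with spaces.
TITLE = re.compile(r'^TI *- (.*(?:\n +\S.*)*)', re.MULTILINE)

def extract_titles(text):
    """Split the text into records on "PMID- " and return the raw titles."""
    titles = []
    for record in text.split("PMID- "):
        match = TITLE.search(record)
        if match:
            # join the continuation lines into a single line
            titles.append(re.sub(r'\s+', " ", match.group(1)).strip())
    return titles

titles = extract_titles(sample)
for title in titles:
    print(title)
```

The first title spans two lines in the sample and comes back as one line, which is the behavior step 4 asks for.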

## Analysis: Building a Large Text Corpus of Biomedical Information

The output is a public domain file consisting of lowercase reference titles, without punctuation. We will use this file in the next section.

# Creating a List of Doublets from a PubMed Corpus

Autocoding is a specialized form of machine translation. The general idea behind machine translation is that computers have the patience, stamina, and speed to parse through gigabytes of text, matching text terms with equivalent terms from an external vocabulary. A doublet is a two-word sequence formed by a word in a text and the word that follows it. We will be using doublets in later chapters for a variety of informatics projects. For all these projects, we will need an electronic list of the doublets contained in a text corpus. Let us create a doublet list from the PubMed corpus prepared in the previous section.

```python
import re

doubhash = {}
# a doublet is two lowercase alphabetic words separated by one space
doub_match = re.compile(r'^[a-z]+ [a-z]+$')

with open("cancer_gene_titles.txt", "r", encoding="utf-8") as intext:
    for line in intext:
        line_array = line.split()
        # pair each word with the word that follows it
        for i in range(len(line_array) - 1):
            doublet = line_array[i] + " " + line_array[i + 1]
            if doub_match.search(doublet):
                # a repeated doublet reuses its existing key
                doubhash[doublet] = ""

# write the unique doublets to the output file, one per line
with open("doubs.txt", "w", encoding="utf-8") as outtext:
    for k in doubhash:
        print(k)
        outtext.write(k + "\n")
```

For the titles as listed in the previous section, the first doublets printed are:

```
somatic mutational
mutational landscape
landscape in
in sporadic
sporadic human
human cancers
deficient tumors
tumors diagnostic
diagnostic features
features and
and molecular
molecular genetics
clinical actionability
actionability of
of molecular
molecular targets
targets in
in endometrial
endometrial cancer
biology and
and management
management of
of undifferentiated
undifferentiated pleomorphic
pleomorphic sarcoma
sarcoma myxofibrosarcoma
novel noninvasive
noninvasive diagnostics
restoration of
function leads
leads to
to tumor
tumor regression
regression in
in vivo
pulmonary sarcomatoid
sarcomatoid carcinoma
carcinoma a
a review
carcinoma of
of the
the gallbladder
gallbladder clinicopathologic
clinicopathologic characteristics
methylation based
based classification
classification of
of benign
benign and
and malignant
malignant peripheral
```

## Script Algorithm: Creating a List of Doublets from a PubMed Corpus

1. Open a text file. In this case, we will use cancer_gene_titles.txt, a list of 28,102 titles prepared in the previous section. Because the titles of copyrighted works are exempted from copyright restrictions, the file belongs to the public domain. A copy of the file can be downloaded at http://www.julesberman.info/book/cancer_gene_titles.txt
2. Parse through the file, line by line.
3. For each line of the file, parse through every doublet on the line. This means looking at each two-word doublet consisting of a word in the line and the word that follows it.
4. As each doublet is encountered, add the doublet to a dictionary object. The dictionary object will have doublets as keys and the empty string, “”, as the value for each doublet. Some doublets will occur more than once in the text. A replicate doublet will regenerate a preexisting key–value pair and will not increase the size of the dictionary object.
5. After the text is parsed, print out the keys of the dictionary object to an external file.
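
To make steps 3 and 4 concrete, here is a small sketch that applies them to a single title. The `doublets` helper is a hypothetical name introduced for illustration. It uses the same two-lowercase-words test as the script, so pairs containing digits, such as "of p53", are dropped.

```python
import re

# Same doublet test as the script: two lowercase alphabetic words.
DOUBLET = re.compile(r'^[a-z]+ [a-z]+$')

def doublets(line):
    """Return the doublets of one line, in order, without duplicates."""
    words = line.split()
    found = {}
    # pair each word with the word that follows it
    for first, second in zip(words, words[1:]):
        doublet = first + " " + second
        if DOUBLET.search(doublet):
            found[doublet] = ""  # a repeated doublet reuses its existing key
    return list(found)

example = doublets("restoration of p53 function leads to tumor regression in vivo")
for doublet in example:
    print(doublet)
```

Because a dictionary keeps one entry per key, repeating a doublet within the line leaves the result unchanged, which is the deduplication described in step 4.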

## Analysis: Creating a List of Doublets from a PubMed Corpus

The output file, doubs.txt, is 1,266,865 bytes in length and contains 77,257 doublets. The file is available for download at http://www.julesberman.info/book/doubs.txt

A few doublet entries from the output file are shown:

```
development of
favorable neuroblastoma
show evidence
carcinoma atypical
mediastinum a
localized hepatic
combining microarray
neoplastic metastasis
pathophysiology of
erbb receptor
illuminate intersection
by knock
antigen
candidate pro
hemangioma after
proper activation
lipoproteins and
of granular
the microscope
```

When the original text contains no identifying, misspelled, profane, or otherwise objectionable text, the resulting doublets can be treated as “safe” for inclusion in confidential text (see Chapter 15). In this case, we extracted doublets from a corpus consisting of the titles of scientific articles. These titles would not be expected to contain identifying or objectionable doublets.
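
One way a "safe" doublet list might be applied is sketched below. It is an assumption for illustration and not the method of Chapter 15: a word is kept only if it belongs to at least one adjacent word pair found in the approved list, and every other word is masked. The `scrub` function, the three-entry `approved` set, and the `*` mask are all hypothetical.

```python
def scrub(text, approved, mask="*"):
    """Mask every word that is not part of an approved doublet."""
    words = text.lower().split()
    keep = [False] * len(words)
    for i in range(len(words) - 1):
        # both words of an approved doublet are kept
        if words[i] + " " + words[i + 1] in approved:
            keep[i] = keep[i + 1] = True
    return " ".join(w if k else mask for w, k in zip(words, keep))

# Tiny stand-in for the doublet list; a real run would load doubs.txt.
approved = {"development of", "of granular", "the microscope"}
scrubbed = scrub("john smith development of granular tumor", approved)
print(scrubbed)
```

In this sketch the name at the start of the sentence is masked because neither word forms an approved doublet with its neighbor.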

# Downloading Gene Synonyms from PubMed

At the PubMed site, select "Gene" as your search database, and enter "geneid" as your query. PubMed will return a large set of geneid entries, which you can download. The records serve as a text corpus from which you can extract a gene nomenclature.

# Downloading Protein Synonyms from PubMed

Select the “Protein” database, and enter the query (Figure 9.4):

((protein AND human) AND “Homo sapiens”[porgn:__txid9606]) AND “Homo sapiens”[porgn:__txid9606]

In this example, the query returned 292,180 entries. The output file is easily parsed, and its protein information can be integrated with any other data set that contains information on virtually any protein.

Proteins, along with their ontologic relationships, are also available for download from GO, the Gene Ontology project:

http://www.geneontology.org/ontology/gene_ontology_edit.obo

This file is curated, and updates are frequent.
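
The .obo file is a plain-text format in which each term is a stanza of `tag: value` lines under a `[Term]` header. The sketch below shows one way such stanzas might be read into a dictionary. The `parse_obo_terms` helper is a hypothetical name, and the two stanzas in `sample_obo` are sample data written in the OBO layout for illustration, not a verified copy of current GO entries.

```python
# Sample data in the OBO flat-file layout (illustrative only).
sample_obo = """\
format-version: 1.2

[Term]
id: GO:0000001
name: mitochondrion inheritance
namespace: biological_process
synonym: "mitochondrial inheritance" EXACT []
is_a: GO:0048308 ! organelle inheritance

[Term]
id: GO:0048308
name: organelle inheritance
namespace: biological_process
"""

def parse_obo_terms(text):
    """Return {term id: {tag: [values]}} for every [Term] stanza."""
    terms = {}
    current = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("["):
            # a new stanza begins; only [Term] stanzas are collected
            current = {} if line == "[Term]" else None
        elif current is not None and ": " in line:
            tag, value = line.split(": ", 1)
            current.setdefault(tag, []).append(value)
            if tag == "id":
                terms[value] = current
    return terms

terms = parse_obo_terms(sample_obo)
for term_id, fields in terms.items():
    # the parent identifier precedes the " ! " comment on an is_a line
    parents = [p.split(" ! ")[0] for p in fields.get("is_a", [])]
    print(term_id, fields["name"][0], parents)
```

Values are stored as lists because tags such as `synonym` and `is_a` can appear more than once in a stanza.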