Welcome to chapter nine of Methods in Medical Informatics! In this section, we explore PubMed, the U.S. National Library of Medicine's public search engine for about 19 million citations from the medical literature. The scripts in this chapter let us build and mine text collections drawn from PubMed. Let's begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics: Fundamentals of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet.

# Building a Large Text Corpus of Biomedical Information

It is remarkably easy to create a large public domain text corpus for almost any medical specialty. All you need to do is enter a PubMed query and save the results to a file on your computer's hard disk.*

**Description adapted from pages 131-132 of "Methods in Medical Informatics"*

In [None]:
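import re

# Input: a PubMed download saved in MEDLINE format; output: one cleaned title per line
in_text = open("cancer_citations.txt", "r", encoding="utf-8")
out_text = open("cancer_gene_titles.txt", "w", encoding="utf-8")
clump = ""
for line in in_text:
    # Look for a title field ("TI  - ...") at the start of a line in the accumulated text;
    # only the first line of the title is captured
    title_match = re.search(r'^TI *-(.*)', clump, re.MULTILINE)
    if title_match:
        title = title_match.group(1)
        title = title.lower()
        title = re.sub(r'\'s', "", title)          # drop possessive markers
        title = re.sub(r'\W', " ", title)          # replace nonalphanumeric characters
        title = re.sub(r'omas', "oma", title)      # singularize tumor names
        title = re.sub(r'tumour', "tumor", title)  # normalize spelling
        title = title.strip()
        title = re.sub(r' +', " ", title)
        # Skip titles that contain no letters
        if not re.search(r'[a-z]+', title):
            clump = ""
            continue
        print(title)
        out_text.write(title + "\n")               # one title per line
        clump = ""
    else:
        clump = clump + line
in_text.close()
out_text.close()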

## Script Algorithm: Building a Large Text Corpus of Biomedical Information

Open the PubMed download file.

In [None]:
import re
in_text = open("cancer_citations.txt", "r", encoding="utf-8")

Open an output file.

In [None]:
out_text = open("cancer_gene_titles.txt", "w", encoding="utf-8")
clump = ""

PubMed download files contain records that consistently begin with “PMID- ”. Parse through the PubMed download file, record by record, using “PMID- ” as the record separator. Within the record, the title field is preceded by “TI  - ”, and the title ends with a newline character followed by another field designator, such as the abstract field designator, “AB  - ”. For example, the line “TI  - A Wnt Survival Guide: From Flies to Human Disease.” is followed by the line “AB  - It has been two decades since investigators discovered the…”.

From each record, extract the text that lies between the title field designator and the next field designator. Convert the title to lowercase. Clean the title line by removing nonalphanumeric characters, extra spaces, possessive markers (“’s”), and the plural forms of tumor names. Write the titles to an external output file.

In [None]:
for line in in_text:
    # Look for a title field ("TI  - ...") at the start of a line in the accumulated text;
    # only the first line of the title is captured
    title_match = re.search(r'^TI *-(.*)', clump, re.MULTILINE)
    if title_match:
        title = title_match.group(1)
        title = title.lower()
        title = re.sub(r'\'s', "", title)          # drop possessive markers
        title = re.sub(r'\W', " ", title)          # replace nonalphanumeric characters
        title = re.sub(r'omas', "oma", title)      # singularize tumor names
        title = re.sub(r'tumour', "tumor", title)  # normalize spelling
        title = title.strip()
        title = re.sub(r' +', " ", title)
        # Skip titles that contain no letters
        if not re.search(r'[a-z]+', title):
            clump = ""
            continue
        print(title)
        out_text.write(title + "\n")               # one title per line
        clump = ""
    else:
        clump = clump + line
in_text.close()
out_text.close()
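
The cell above reads the download line by line and keeps only the first line of each title field. As a rough sketch of the record-by-record approach described in the algorithm, the following splits the text on “PMID- ” and also collects indented continuation lines. The function names `clean_title` and `extract_titles` and the two sample records (including their PMIDs) are made up for illustration; the cleaning steps mirror the ones in the notebook cell.

```python
import re

def clean_title(title):
    """Apply the same cleaning steps as the notebook cell to one raw title."""
    title = title.lower()
    title = re.sub(r"'s", "", title)        # drop possessive markers
    title = re.sub(r"\W", " ", title)       # nonalphanumerics (and newlines) become spaces
    title = re.sub(r"omas", "oma", title)   # singularize tumor names
    title = re.sub(r"tumour", "tumor", title)
    title = re.sub(r" +", " ", title)       # collapse runs of spaces
    return title.strip()

def extract_titles(medline_text):
    """Return cleaned titles, one per record, from MEDLINE-format text."""
    titles = []
    # Each record starts with "PMID- " at the beginning of a line.
    for record in re.split(r"\n(?=PMID- )", medline_text):
        # The title starts at "TI  - " and continues over indented lines.
        match = re.search(r"^TI *- (.*(?:\n[ \t]+.*)*)", record, flags=re.MULTILINE)
        if not match:
            continue
        title = clean_title(match.group(1))
        if re.search(r"[a-z]", title):      # skip titles without any letters
            titles.append(title)
    return titles

# Hypothetical sample records, invented for this sketch.
sample = (
    "PMID- 00000001\n"
    "TI  - A Wnt survival guide: from flies to\n"
    "      human disease.\n"
    "AB  - It has been two decades since investigators discovered the...\n"
    "\n"
    "PMID- 00000002\n"
    "TI  - Hodgkin's lymphomas and other tumours.\n"
    "AB  - OBJECTIVE: a made-up abstract.\n"
)

for cleaned in extract_titles(sample):
    print(cleaned)
```

Anchoring the pattern at the start of a line keeps words such as “OBJECTIVE” in an abstract from being mistaken for a title field.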

**This section is adapted from section 9.1.1, "Script Algorithm", of page 132 from "Methods in Medical Informatics".*

## Analysis: Building a Large Text Corpus of Biomedical Information

The output is a public domain file consisting of lowercase reference titles, without punctuation. We will use this file in the next section.*

**This section is adapted from section 9.1.2, "Analysis", of page 134 in "Methods in Medical Informatics".*

# Creating a List of Doublets from a PubMed Corpus

Autocoding is a specialized form of machine translation. The general idea behind machine translation is that computers have the patience, stamina, and speed to quickly parse through gigabytes of text, matching text terms with equivalent terms from an external vocabulary. We will be using doublets (pairs of adjacent words) in later chapters, for a variety of informatics projects. For all these projects, we will need to create an electronic list of the doublets contained in a text corpus. Let us create a doublet list from the PubMed corpus prepared in the previous section.*

**Description adapted from pages 134-136 of "Methods in Medical Informatics"*

In [None]:
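import re

intext = open("cancer_gene_titles.txt", "r", encoding="utf-8", errors="replace")
outtext = open("doubs.txt", "w")
doubhash = {}
# A doublet is kept only if both words consist purely of lowercase letters
doub_match = re.compile(r'^[a-z]+ [a-z]+$')
for line in intext:
    line = line.strip()
    line_array = re.split(r'\s+', line)
    line_array.append("")  # sentinel: the last word pairs with "" and is rejected
    for i in range(len(line_array)-1):
        doublet = line_array[i] + " " + line_array[i+1]
        if doub_match.search(doublet):
            doubhash[doublet] = ""  # a repeated doublet reuses the same key
intext.close()
# Write each distinct doublet on its own line
for k in doubhash:
    print(k)
    outtext.write(k + '\n')
outtext.close()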

## Script Algorithm: Creating a List of Doublets from a PubMed Corpus

Open a text file. In this case, we will use cancer_gene_titles.txt, a list of
titles prepared in Section 9.1. Because the titles of copyrighted works
are exempted from copyright restrictions, the file belongs to the public domain.*

In [None]:
import re
intext = open("cancer_gene_titles.txt", "r", encoding="utf-8", errors="replace")
outtext = open("doubs.txt", "w")

Parse through the file, line by line. For each line, examine every doublet on the line, that is, each word paired with the word that follows it. As each doublet is encountered, add it to a dictionary object. The dictionary has doublets as keys and the empty string, “”, as the value for each key. Some doublets occur more than once in the text. A repeated doublet simply reassigns an existing key–value pair and does not increase the size of the dictionary.

In [None]:
doubhash = {}
doublet = ""
doub_match = re.compile(r'^[a-z]+ [a-z]+$')
for line in intext:
    line = line.strip()
    line_array = re.split(r'\s+',line)
    line_array.append("")
    for i in range(len(line_array)-1):
        doublet = line_array[i] + " " + line_array[i+1]
        if doub_match.search(doublet):
            doubhash[doublet]=""
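
To see what the loop produces for individual lines, here is a small sketch run on two short strings. The helper name `line_doublets` is hypothetical, the second sample string is invented, and pairing words with `zip` is a substitute for the empty-string sentinel used in the cell above; the filtering pattern is the same.

```python
import re

# Same filter as the notebook: two words made only of lowercase letters.
DOUBLET = re.compile(r'^[a-z]+ [a-z]+$')

def line_doublets(line):
    """Return the doublets of one cleaned title, in order of appearance."""
    words = line.split()
    # Pair each word with the word that follows it.
    pairs = (first + " " + second for first, second in zip(words, words[1:]))
    return [pair for pair in pairs if DOUBLET.search(pair)]

doublet_dict = {}
for title in ["somatic mutational landscape in sporadic human tumors",
              "mutational landscape of tp53"]:
    for doublet in line_doublets(title):
        doublet_dict[doublet] = ""   # repeats overwrite the same key

for key in doublet_dict:
    print(key)
```

The doublet “mutational landscape” appears in both strings but is stored once, and “of tp53” is rejected because it contains digits.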

After the text is parsed, print out the keys of the dictionary object to an external
file.

In [None]:
for k in doubhash:
    print(k)
    outtext.write(k + '\n')  # one doublet per line
outtext.close()

**This section is adapted from section 9.2.1, "Script Algorithm", of pages 136-137 from "Methods in Medical Informatics".*

## Analysis: Creating a List of Doublets from a PubMed Corpus

The output file, doubs.txt, is more than 24,000 bytes in length and contains thousands of doublets. A few doublet entries from the output file are shown:

- somatic mutational
- mutational landscape
- landscape in
- in sporadic
- sporadic human
- deficient tumors
- tumors diagnostic
- diagnostic features
- features and
- and molecular
- molecular geneticsclinical
- geneticsclinical actionability
- actionability of
- of molecular
- molecular targets

When the original text has no identifying, misspelled, profane, or otherwise objectionable text, the resulting doublets can be treated as “safe” for inclusion in confidential text (see Chapter 15). In this case, we extracted doublets from a corpus consisting of the titles of scientific articles. These titles would not be expected to contain identifying or objectionable doublets.*

**This section is adapted from section 9.2.2, "Analysis", of pages 138-139 from "Methods in Medical Informatics".*

# Downloading Gene Synonyms from PubMed

At the PubMed site (https://www.ncbi.nlm.nih.gov/gene/), select "Gene" as your search engine, and enter "geneid" as your query. PubMed will return a large set of geneid entries, which you can download. The records serve as a text corpus from which you can extract a gene nomenclature.*
 
**Description adapted from page 139 of "Methods in Medical Informatics"*

# Downloading Protein Synonyms from PubMed

Select the “Protein” database (https://www.ncbi.nlm.nih.gov/protein/), and enter the query:
- ((protein AND human) AND “Homo sapiens”[porgn:__txid9606]) AND “Homo sapiens”[porgn:__txid9606]

In this example, the query yielded 1,392,492 entries. The output file can be easily parsed, and the protein information can be integrated with any other data sets that contain information on virtually any protein.*

**Description adapted from page 140 of "Methods in Medical Informatics"*