Welcome to chapter four of Methods in Medical Informatics! In this section, we will be exploring how to view and modify image files. We will be exploring five different scripts which each illustrate  a different aspect of viewing and modifying image files. Lets begin!

> Disclaimer: The content below is adapted from the book "Methods in Medical Informatics - Fundamental of Healthcare Programming in Perl, Python, and Ruby" by Jules J. Berman. All content is for testing, education, and teaching purposes only. No content will be openly released to the internet. 

# ZIPF Distribution of a Text File

In almost every segment of life, a small number of items usually account for the bulk of the observable activities. This phenomenon is known as Zipf's law as a mathematical description. You can write a script to illustrate the Zipf distribution for any text.*

**Description adapted from pages 53-54 of "Methods in Medical Informatics"*

In [4]:
import re
import string
word_list = []
freq_list = []
format_list = []
freq = {}
in_text = open('d2020.bin', "r", encoding="utf-8")
in_text_string = in_text.read()
out_text = open("meshzipf.txt", "w")
word_list = re.findall(r'(\b[A-Za-z][a-z]{2,15}\b)', in_text_string)
in_text_string = ""
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1
for key, value in freq.items():
    value = "000000" + str(value)
    value = value[-6:]
    format_list += [value + " " + key]
format_list = reversed(sorted(format_list))
print(out_text, "\n".join(format_list))

<_io.TextIOWrapper name='meshzipf.txt' mode='w' encoding='cp1252'> 045270 the
036645 abcdef
034267 and
026575 abbcdef
017737 was
016454 see
014973 with
013647 under
010274 for
009718 that
008685 abcdefv
008624 Protein
007931 use
007519 The
007461 are
005355 from
004900 not
004744 which
004526 Receptor
004286 Cell
004124 used
004046 Syndrome
004026 specific
003870 Proteins
003837 also
003732 abbcde
003631 indexed
003416 Disease
003336 alpha
003268 Type
003139 search
003095 Agents
003043 Factor
002964 family
002962 beta
002893 coordinate
002805 Receptors
002721 Acid
002532 other
002411 abbbcdef
002378 genus
002368 may
002193 Diseases
002118 Kinase
002056 They
002035 cells
001986 confuse
001981 protein
001961 Health
001906 acid
001898 coord
001889 species
001880 abcdeef
001873 abbcdefv
001857 disease
001773 infection
001764 general
001752 Nerve
001717 but
001704 found
001679 abbcdeef
001589 available
001536 cell
001534 plant
001517 Drug
001516 has
001488 usually
001475 associated
001471 B

## Script Algorithm: Zipf Distribution of a Text File

Call the necessary packages*

In [None]:
import re
import string
word_list = []
freq_list = []
format_list = []
freq = {}

Open the necessary file to read and create a new file, meshzipf.txt, which will receive the output of the zipf distribution

In [None]:
in_text = open('d2020.bin', "r", encoding="utf-8")
in_text_string = in_text.read()
out_text = open("meshzipf.txt", "w")

Parse the string, matching against each occurrence of a latter followed by at least 2, and at most 15, lowercase letters, with the sequence bounded on either size by a word boundary. 

In [None]:
word_list = re.findall(r'(\b[A-Za-z][a-z]{2,15}\b)', in_text_string)
in_text_string = ""

Create a dictionary object that will include words (keys) and number of occurrences (values)

In [None]:
for item in word_list:
    count = freq.get(item,0)
    freq[item] = count + 1

After the dictionary object is complete, format the values in the dictionary, as a zero-padded string of uniform length. 

In [None]:
for key, value in freq.items():
    value = "000000" + str(value)
    value = value[-6:]
    format_list += [value + " " + key]

Sort the key-value pairs by values, descending. Print out sorted key-value pairs

In [None]:
format_list = reversed(sorted(format_list))
print(out_text, "\n".join(format_list))

**This section is adapted from section 4.1.1, "Script Algorithm", of page 54 from "Methods in Medical Informatics".*

## Analysis: Zipf Distribution of a Text File

The top entries from the MeSH file are:

`036645 abcdef
034267 and
026575 abbcdef
017737 was
016454 see
014973 with
013647 under
010274 for
009718 that`

For these scripts, the entire content of a file is loaded into a string variable. This variable is subsequently parsed into words, with each occurrence of the word counted. If the file is very large, the script can be modified to read the file line by line, incrementing the word/frequency tally for th words contained in each line. At the top of the Zipf list are the high-frequency words, such as “the”, “and”, and “was” that serve as connectors for lower-frequency, highly specific terms. Also included at the top of the Zipf list are frequently recurring letter sequences peculiar to the file; in this case, “abcdef” and “abbcdef”. Zipf distributions have many uses in informatics projects, including the preparation of “stopword” lists.*

**This section is adapted from section 4.1.2, "Analysis", of page 56 in "Methods in Medical Informatics".*

# Preparing a Concordance

A concordance is a special type of index, listing every location of every word in the text. Concordances can be used to support very fast proximity searches (finding the locations of words in proximity to other words), and phrase searches (finding sequences of words located in an ordered sequence somewhere in the text. Using only a concordance, it is a simple matter to computationally recreate the entire text. Preparing a concordance is quite simple.*

**Description adapted from page 57 of "Methods in Medical Informatics".*

In [10]:
import re
import string
sentence_list = []
word_list = []
word_dict = {}
format_list = []
count = 0
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TITLES.TXT', "r")
in_text_string = in_text.read()
in_text_string = in_text_string.replace("\n"," ")
in_text_string = in_text_string.replace(" +"," ")
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
for sentence in sentence_list:
    count = count + 1
    sentence = sentence.lower()
    word_list = re.findall(r'(\b[a-z]{3,15}\b)', sentence)
    for word in word_list:
        if word in word_dict:
            word_dict[word] = word_dict[word] + ',' + str(count)
        else:
            word_dict[word] = str(count)
keylist = word_dict.keys()
sorted(keylist)
for key in keylist:
    print(key, word_dict[key])

carcinoid 1
tumor 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
the 1,6,7,8,11,12,14,16,17,19,22,23,35,39,41,41,44,47,49,51,55,58,59,65,70,71,72,76,78,78,90,95,96,98
common 1
bile 1
duct 1,66
rare 1,39
complication 1
von 1,54
hippel 1
lindau 1
syndrome 1
establishment 2,13
and 2,3,6,10,10,13,13,16,16,18,18,19,20,20,21,22,27,33,35,35,36,38,40,42,44,45,46,46,47,47,49,52,52,55,56,59,63,64,64,67,68,69,70,71,71,74,77,78,79,82,83,84,85,85,89,91,95,96,97,97,98,98,99
new 2
cell 2,3,8,12,13,16,20,30,33,40,42,43,43,47,48,52,52,61,64,65,65,68,69,84,85,93,96
line 2,3
derived 2,42,49,59
from 2,13,71,89,93
human 2,5,12,13,35,37,44,56,61,62,98
colorectal 2,59,81
laterally 2
spreading 2
vivo 3,10,27,82
anti 3,34,36,48,51,58,82
effect 3,51
hybrid 3,37
vac

## Script Algorithm: Preparing a Concordance

Import the necessary packages*

In [None]:
import re
import string
sentence_list = []
word_list = []
word_dict = {}
format_list = []
count = 0

Read the entire contents of the titles.txt file into a string variable

In [None]:
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TITLES.TXT', "r")
in_text_string = in_text.read()

Split the file into sentences

In [None]:
in_text_string = in_text_string.replace("\n"," ")
in_text_string = in_text_string.replace(" +"," ")
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)

Parse each sentence into an array of words

In [None]:
for sentence in sentence_list:
    count = count + 1
    sentence = sentence.lower()
    word_list = re.findall(r'(\b[a-z]{3,15}\b)', sentence)
    for word in word_list:
        if word in word_dict:
            word_dict[word] = word_dict[word] + ',' + str(count)
        else:
            word_dict[word] = str(count)

Add the location of the word to the dictionary object that contains the encountered words and their locations

In [None]:
keylist = word_dict.keys()

Order the words alphabetically and print out each word in the dictionary object

In [None]:
sorted(keylist)
for key in keylist:
    print(key, word_dict[key])

**This section is adapted from section 4.2.1, "Script Algorithm", of page 57 from "Methods in Medical Informatics".*

## Analysis: Preparing a Concordance

The sample text consisted of 100 parsed sentences. Here are the first few lines of the output.

`carcinoid 1
tumor 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100
the 1,6,7,8,11,12,14,16,17,19,22,23,35,39,41,41,44,47,49,51,55,58,59,65,70,71,72,76,78,78,90,95,96,98
common 1
bile 1
duct 1,66
rare 1,39
complication 1`

# Extracting Phrases

All text is composed of words and phrases that represent specific concepts, that are connected together into a sequence of meaningful statements. One way to extract useful concepts is to remove common words or "stopwords". This script will demonstrate phrase extraction through stopword removal.*

**Description adapted from page 60 of "Methods in Medical Informatics".*

In [13]:
import re, string
item_list = []
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open("./K11946_Files/cancer_gene_titles.txt", "r")
count = 0
for line in in_text:
    count = count + 1
    for stopword in stop_list:
        stopword = re.sub(r'\n', '', stopword)
        line = re.sub(r' *\b' + stopword + r'\b *', '\n', line)
    item_list.extend(line.split("\n"))
item_list = sorted(set(item_list))
out_text = open('phrases.txt', "w")
for item in item_list:
    print(out_text, item)

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 25 dihydroxyvitamin d3 regulation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 25 oh 2d3
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 3 butadiene data integration opportunities
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 4 benzoquinone
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 4 dichlorobenzene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 d microfluidic beads array
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 lymphocytes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 molecular target drug discovery
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> 1 naphthol
<_io.TextIOWrapper name='phrases.txt' mode='w' encodi

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> agriculture
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> agrobacterial oncogenes expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> agrobacterium tumefaciens chromosomal dna
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ags cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ah receptor
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ah receptor agonist activity
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ahr agonist
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ahr dependent cyp1a1 expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ahr mediated
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ahr protein trafficking
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ahr transcriptional activity
<_io.TextIOWrapper name=

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> apc mutation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> apc mutations
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> apc pathogenic mutations
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> apc promoter methylation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> apc tumor suppressor gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> apcmin mice
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> apcmin mouse
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> apcmin tumors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ape1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ape1 gene promoter
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ape1 ref 1 regulates pten expression mediated
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp125

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> bcrp expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> bdii rats
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> bead microarrays
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> beaming
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> bears bile
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> beas 2b cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> beautiful micrornas
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> beckwith wiedemann
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> beckwith wiedemann syndrome
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> beckwith wiedemann syndrome multiple molecular mechanisms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> beclin 1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c elegans micrornas
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c elegans paralogs
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c elegans pharynx
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c flip
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c flip l
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c fos activator protein 1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c fos assessment
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c hominis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c iap2
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c jun
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c jun dependent ap 1 transactivation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> c jun homodimers
<_io.TextIOWrap

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> caveolin 1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> caveolin 1 expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> caveolin 1 mammary stem cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cavernous angioma
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cavernous sinus case report
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cavorting
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cb6f1 tgrash2 mice
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cbf1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cbfa 1 runx2
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cbfa2t3 znf652 corepressor complex regulates transcription
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cbfbeta reduces cbfbeta smmhc associated acute myeloid leuk

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chimeric hcmv hsv 1 oncolytic viruses
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chimeric hiv 1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chimeric pote actin genes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chimeric th1 antagonist
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chimpanzee microsatellite evolution
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chimpanzee origin adenovirus vectors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> china
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> china demand
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chinese
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chinese adults
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> chinese early onset breast cancer patients
<_io.TextIOWr

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common diseases
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common effector processing mediates cell specific responses
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common epithelial cancers
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common epithelial tumors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common event
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common fragile site stability
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common gene signature
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common genetic polymorphisms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common genetic variants
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> common genetic variation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> comm

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2a6 polymorphisms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2b6
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2b6 g15631t polymorphism
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2c19 polymorphisms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2c9
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2d6 phenotype prediction
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2d6 testing
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2e1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2f1 gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2f1 genetic polymorphism identification
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> cyp2s1 gene polymorphisms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='c

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> discoidin domain receptor function
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> discovered
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> discoveries
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> discovering high order patterns
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> discovery
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> discoveryspace
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> discs large homolog 1 regulates smooth muscle orientation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> discs large homolog 5
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> disease
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> disease biology
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> disease causing myh mutations
<_io.TextIOWrapper name='

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ectopic tbx2 expression results
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ectromelia virus induced apoptosis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ed
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> eda
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> edaradd locus
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> edd gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> edge
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> edges
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> edible vaccines current status
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> editing enzymes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> editorial angiogenesis agents
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> editorial comment
<_io.

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> er coregulator pelp1 mnar
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> er positive breast cancer
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> er status
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> er stress
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> era
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> eradicate tumors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> eralpha
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> eralpha germline pvuii marker
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> eralpha negative breast cancer
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> eralpha suppresses slug expression directly
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> erbb 1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp12

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> flt3 internal tandem duplication
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> flt3 itd
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> flt3 itd size
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> flt3 kinase inhibitors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> flt3 ligand gene transfer
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> flt3 mutations
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> fluid flow
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> fluid flow induces rankl expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> flumequine
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> fluorescence
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> fluorescence based method
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene testing
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapeutic
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapeutics
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapy
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapy approach
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapy approaches
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapy breakthrough
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapy clinical trials worldwide
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapy closer
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapy developments
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gene therapy optimization
<_io.TextIOWrapper name='phras

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> greenlandic districts
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> greglist
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gremlin enhances
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> grg1 acts
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> grhl2
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> grim 19
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> grim 19 associates
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> grl method
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> grm1 gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> gro alpha
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> groove
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> groovy vaccine
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif 1alpha inhibitor rx 0047
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif 1alpha inhibits foam cells formation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif 1alpha overexpression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif 1alpha promotes metastasis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif 1alpha rna interference
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif 1alpha stability
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif 2alpha
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif 2alpha results
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif alpha regulation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif gene expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> hif independent senescence 

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human cytomegalovirus regulates bioactive sphingolipids
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human cytosolic sulfotransferases
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human dda3
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human dead box atpase ddx3
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human dek protein
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human delta1 pyrroline 5 carboxylate synthase function
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human dendritic cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human development
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human dicer c terminus functions
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> human diffuse large b cell lymphoma
<_io.TextIOWrapper name='phras

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 20
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 22
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 23
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 28 elicits antitumor responses
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 3
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 3 induces
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 32
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 33
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 4
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 4 induced aid expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 4 regulates cox 2
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> il 4 stimulated nf kappab activity
<_io.TextIOWrapper name='phrases.txt'

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrated microfluidic biochips
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrated molecular medicine
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrated molecular profiling
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrated profiling
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrated waveguides
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrating biological pathways
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrating cell signalling pathways
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrating copy number polymorphisms
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integrating signals
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> integration
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l alanosine sdx 102
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l arginine availability regulates t lymphocyte cell cycle progression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l deficient mouse brain lysosomes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l h gray
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l isoaspartate
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l monocytogenes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l myc gene polymorphism
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l myc polymorphism
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l pk gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l pneumophila
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> l type amino acid transporter 1 expressed
<

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant glioma cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant glioma shifting
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant glioma standard
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant glioma therapy hype
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant growth
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant haematopoiesis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant head
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant hematopoiesis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant hematopoietic cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant human tissues
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> malignant lesions
<_io.TextIOWrapp

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> metastin stimulates aldosterone synthesis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> metazoa
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> metazoan genomes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> metencephalon
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> metformin
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> metformin suppresses intestinal polyp growth
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> methacrylate conversion
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> methamphetamine interactions
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> methcancerdb aberrant dna methylation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> methionine inhibits cellular growth dependent
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> m

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular pathology
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular pathways
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular pathways gli
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular pathways linking
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular pathways mediating liver metastasis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular pathways notch
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular pathways regulating pro migratory effects
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular perspective
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular pet imaging
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> molecular phenotype
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myelodysplastic syndromes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myelodysplastic syndromes guilty
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myelodysplastic syndromes molecular pathogenesis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myeloid alphav integrins
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myeloid cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myeloid dendritic cell lectins
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myeloid derived suppressor cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myeloid disease
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myeloid disorders
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> myeloid erythroid leukemia
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> my

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf y connection
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf y dependent cyclin b2 expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf ya modulates nf y transcriptional activity
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf1 associated malignancies
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf1 gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf1 haploinsufficiency augments angiogenesis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf1 molecular testing
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf1 mutations
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf1 phenotype focus
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nf2
<_io.TextIOWrapper name='phras

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nuclear signalling
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nuclear steroid receptors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nuclear structure
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nuclear survivin
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nuclear translocation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nuclear transport
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nuclei
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nucleic acid based therapeutics
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nucleic acid beacons
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nucleic acid delivery
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> nucleic acid medicines
<_io.TextIOWrapper name='phrases.txt' mode='w' en

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p52
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 aberration
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 acetylation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 actions p53 represses rhamm expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 activation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 activity
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 alterations
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 arg72pro polymorphism
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 based cancer therapies
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 binding
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> p53 binding induc

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> pge2 induced apoptotic cell death
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> pge2 induced metalloproteinase 9
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> pge2 production
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> pgj2 stimulated beta cell apoptosis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> pgk1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> pgp9 5
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> pgp9 5 methylation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> pgr 331
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ph regulation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ph sensitive gene carriers
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ph sensitive shielding
<_io.TextIOWrapper name='phrases.txt' mode='w' encod

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive marker
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive markers
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive modeling
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive oncology
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive pathology
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive proteomics
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive relevance
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive role
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive tests
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictive value
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predictor
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> predict

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quantitative real time pcr
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quantitative real time rt pcr
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quantitative real time rt pcr assay
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quantum dots
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quantum dots emerging applications
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quartets
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quercetin
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quercetin blocks airway epithelial cell chemokine expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> query
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> quest
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> question
<_io.TextIOWrapper name='phrases

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> replicon plasmid based vaccines
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> reply
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> report
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> reporter bioassays
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> reporter gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> reporter gene expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> reports
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> repp86 stability
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> repress p53 transcriptional activity
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> repress stimulus induced activation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> repress transcription
<_io.TextIOWrapper name='phrases.txt' mode='w' enco

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sam domain
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sam gs
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sample shortage
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> samples
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> saos 2 cells
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sap
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sap discovery
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sap related adapters
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sap related adaptors matters
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sapho syndrome
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> saporin
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> saporin toxin
<_io.TextIOWrapper name='phrases.txt' mode='w' e

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecule antagonist
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecule antagonists
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecule compound inhibits akt pathway
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecule inhibitor
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecule inhibitors
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecule mdm2 antagonists
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecule screens
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecules
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecules destabilize ciap1
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> small molecules targeting p53
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding=

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> su fu nuclear import
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sub classification
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sub functionalization
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> sub grouping
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> subcellular cell specific targeting
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> subcellular localization
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> subclass mapping identifying common subtypes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> subclonal phylogenetic structures
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> subcomplexes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> subcutaneous murine model
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> subcutaneous pannicu

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> tenascin expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> tenascin production
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> tendons
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> tenets
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> tensin3
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> tensin4 expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ter mice
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> terameprocol vaginal ointment
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> teratocarcinoma
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> teratogen induced activation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> teratogenesis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> term association
<_io.TextIO

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transforming viruses
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgelin
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenerational epigenetic inheritance
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenerational evolutionary adaptation
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenerational transmission
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenes
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenesis
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenic cyclin e triggers dysplasia
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenic fish
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenic livestock
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> transgenic method
<_io.TextI

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> unique features
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> unique mechanism
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> unique microrna molecular profiles
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> unique microrna signature
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> unique molecular characteristics
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> unique signaling properties
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> unite
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> united states
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> universal bead arrays
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> universal character
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> universal phenomenon
<_io.TextIOWrapper name='phrases.txt

<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> ww domain containing oxidoreductase
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox gene accelerates forestomach tumor progression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox gene transfection
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox protein expression
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox reveals
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox suppresses c jun transcriptional activity
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox tumor suppressor gene
<_io.TextIOWrapper name='phrases.txt' mode='w' encoding='cp1252'> wwox tumor suppressor

## Script Algorithm: Extracting Phrases

Call necessary packages*

In [None]:
import re, string
item_list = []

Open the stop.txt file, containing a list of common stopwords. Split into a list structure

In [None]:
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()

Open cancer_gene_titles.txt

In [None]:
in_text = open("./K11946_Files/cancer_gene_titles.txt", "r")
count = 0

Pare through the lines of the text. Substittue a newline character for every occurrence of any stopword in the sentence.

In [None]:
for line in in_text:
    count = count + 1
    for stopword in stop_list:
        stopword = re.sub(r'\n', '', stopword)
        line = re.sub(r' *\b' + stopword + r'\b *', '\n', line)
    item_list.extend(line.split("\n"))
item_list = sorted(set(item_list))
out_text = open('phrases.txt', "w")

Sort item alphabetically and print

In [None]:
for item in item_list:
    print(out_text, item)

**This section is adapted from section 4.3.1, "Script Algorithm", of page 61 from "Methods in Medical Informatics".*

## Analysis: Extracting Phrases

The output is an alphabetic file of the phrases that might appear in a book's index. We used the file consisting of titles from a PubMed search. This file, cancer_gene_titles.txt, is about 1.1 MB in length, the size of a typical book. We only required about a dozen lines of code and a few seconds of execution time to create out list of index terms.*

**This section is adapted from section 4.3.2, "Analysis", of page 63 from "Methods in Medical Informatics".*

# Preparing an Index

An index is a list of the important words or phrases contained in a book, along with the locations where each of those words and phrases can be found. This is different from concordance because the index does not contain every word found in the text, and the index contains groups of selected phrases, in addition to individual words. Software can be used to create indexes. However, remember that a useful index is more selective than simply recording the location of every word and phrase. 

In [34]:
import re
import string
item_list = []
item_dictionary = {}
place_string = ""
stopfile = open("./K11946_Files/STOP.TXT",'r')
stop_list = stopfile.readlines()
stopfile.close()
in_text = open('./K11946_Files/TEXT.TXT', "r")
in_text_string = in_text.read()
in_text_string = in_text_string.replace("\n"," ")
in_text_string = in_text_string.replace(" +"," ")
sentence_list = re.split(r'[\.\!\?] +(?=[A-Z])',in_text_string)
norm = str.maketrans('','')
badascii = str.translate(norm,norm,string.printable)
badascii_table = badascii + (256 - len(badascii))*" "
junk_table = 256*" "
table = string.maketrans(badascii_table,junk_table)
count = 0
for item in sentence_list:
    count = count + 1
    count_string = str(count)
    item = string.lower(item)
    item = re.sub(r'\'s', "", item)
    item = item.translate(table)
    for stopword in stop_list:
        stopword = string.rstrip(stopword)
        item = re.sub(r' *\b' + stopword + r'\b *', '\n', item)
        item_list = item.split("\n")
        for phrase in item_list:
            phrasematch = re.match(r'^[0-9]', phrase)
            if (phrasematch):
                continue
            if item_dictionary.has_key(phrase):
                item_dictionary[phrase] = item_dictionary[phrase] + ',' + count_string
            else:
                item_dictionary[phrase] = count_string
keylist = item_dictionary.keys()
keylist.sort()
for key in keylist:
    print(key, item_dictionary[key])

TypeError: descriptor 'translate' requires a 'str' object but received a 'dict'

## Script Algorithm: Preparing an Index

1. Create an array containing stopwords. You can use any stopword list you prefer.
In this script, we use stop.txt available at http://www.julesberman.info/
book/stop.txt
2. Open a file to be indexed. You can use any file, but in this text, we use text.
txt, available at http://www.julesberman.info/book/text.txt
3. Strip the text of any non-ASCII characters (not necessary if you are using a
plain-text file).
4. Split the text into sentences and put the consecutive sentences into an array.
5. Create a dictionary object, which will hold phrases as keys and a commaseparated
list of numbers, representing the sentences in which the phrases
appear, as the values.
6. For each sentence in the array of consecutive sentences, split the sentence
wherever a stopword appears, and put the resulting phrases into an array.
7. For each array of phrases, from each sentence, parse through the array of
phrases, assigning each phrase to a dictionary key, and concatenating the sentence
number in which the phrase occurs, to the comma-separated list of sentence
numbers that serves as the value for the key (phrase)

## Analysis: Preparing an Index

An example of the kind of output produced by the script is shown

`adjustment 7,9
adjuvant chemotherapy 83
adjuvant imrt 23`

The numbers represent the sentence numbers in which each phrase occurs. AUtomated indexing invariably produces a product that a human indexer can improve. The strength of automatic indexing is found when the texts are very long. Humans cannot index long texts. A flawed computer-generated index is usually better than no index at all

# Comparing Texts Using Similarity Scores

When you have extracted all of the phrases occurring in a text, you have created something akin to the signature of the text. We can then determine whether two different text are similar, when we compare their signatures. Similarity scores are very useful in medical science. We can use similarity scores to establish relatedness of objects (ie. DNA sequences), to find trends and outliers in population data, to provide "best-fit" search results, and to classify groups of items. This script will demonstrate calculating the similarity between two documents using Pearson correlation.

In [41]:
import re
import string
from math import sqrt
from math import pow
treasure = {}
paradise = {}
filelist = ["./K11946_Files/treasure.txt", "./K11946_Files/paradise.txt"]
stopfile = open("./K11946_Files/stop.txt",'r')
stop_list = stopfile.readlines()
stopfile.close()
phraseform = re.compile(r'^[a-z]+ [a-z ]+$')
for filename in filelist:
    in_text = open(filename, "r", encoding="utf-8")
    in_text_string = in_text.read()
    in_text.close()
    in_text_string = in_text_string.replace("\n"," ")
    for stopword in stop_list:
        stopword = stopword.rstrip()
        in_text_string = re.sub(r' *\b' + stopword + r'\b *', '\n',in_text_string)
    in_text_string = re.sub(r'[\,\:\;\(\)]','\n',in_text_string)
    in_text_string = re.sub(r'[\.\!\?] +(?=[A-Z])', '\n', in_text_string)
    in_text_string = in_text_string.lower()
    item_list = re.split(r' *\n *', in_text_string)
    for phrase in item_list:
        phrase = re.sub(r' +',' ', phrase)
        phrase = phrase.strip()
        phrasematch = phraseform.match(phrase)
        if not (phrasematch):
            continue
        if (filename == "./K11946_Files/paradise.txt"):
            if phrase in paradise:
                paradise[phrase] = paradise[phrase] + 1
            else:
                paradise[phrase] = 1
            if not (phrase in treasure):
                treasure[phrase] = 0
        if (filename == "./K11946_Files/treasure.txt"):
            if phrase in treasure:
                treasure[phrase] = treasure[phrase] + 1
            else:
                treasure[phrase] = 1
            if not (phrase in paradise):
                paradise[phrase] = 0
count = 0; sumtally1 = 0; sumtally2 = 0; sqtally1 = 0; sqtally2 = 0
prodtally12 = 0; part1 = 0; part2 = 0; part3 = 0;
keylist = paradise.keys()
for key in keylist:
    count = count + 1;
    sumtally1 = sumtally1 + paradise[key]
    sumtally2 = sumtally2 + treasure[key]
    sqtally1 = sqtally1 + pow(paradise[key],2)
    sqtally2 = sqtally2 + pow(treasure[key],2)
    prodtally12 = prodtally12 + (paradise[key] * treasure[key])
part1 = prodtally12 - (float(sumtally1 * sumtally2) / count)
part2 = sqtally1 - (float(pow(sumtally1,2)) / count)
part3 = sqtally2 - (float(pow(sumtally2,2)) / count)
similarity12 = float(part1) / float(sqrt(part2 * part3))
print("The Pearson score is", similarity12)

The Pearson score is -0.38257389584927015


## Script Algorithm: Comparing Texts Using Similarity Scores

Fill in Script Algorithm Component

## Analysis: Comparing Texts Using Similarity Scores

Pearson scores range from -1 to 1. A score of 1 occurs when a document is compared against itself. When we compute the Pearson score between two highly dissimilar texts, the yielded score is -0.38257. We expected and received a low-end Pearson score.  